nn#

(s3prl.nn)

Common models and losses in pure torch.nn.Module, with a dependency on torch only

s3prl.nn.beam_decoder

The beam search decoder from flashlight

s3prl.nn.common

Common probing models

s3prl.nn.hear

The probing model following the HEAR Benchmark

s3prl.nn.interface

Model interfaces

s3prl.nn.linear

Common linear models

s3prl.nn.pit

Permutation Invariant Training (PIT) loss

s3prl.nn.pooling

Common pooling methods

s3prl.nn.rnn

RNN models used in the SUPERB Benchmark

s3prl.nn.speaker_loss

Speaker verification loss

s3prl.nn.speaker_model

Speaker verification models

s3prl.nn.specaug

SpecAugment (specaug) modules

s3prl.nn.upstream

S3PRL Upstream Collection and some utilities

S3PRLUpstream#

class s3prl.nn.S3PRLUpstream(name: str, path_or_url: Optional[str] = None, refresh: bool = False, normalize: bool = False, extra_conf: Optional[dict] = None, randomize: bool = False)[source][source]#

Bases: Module

This is an easy interface for using all the models in S3PRL. See S3PRL Upstream Collection for the example usage and all the supported models.

Parameters:
  • name (str) – can be “apc”, “hubert”, “wav2vec2”. See available_names for all the supported names

  • path_or_url (str) – The source of the checkpoint. Can be a local path or a URL

  • refresh (bool) – (default, False) If False, download the checkpoint only when it has not been downloaded before. If True, force a re-download of the checkpoint.

  • extra_conf (dict) – (default, None) The extra arguments for each specific upstream, the available options are shown in each upstream section

  • randomize (bool) – (default, False) If True, randomize the weights of the upstream model

Note

When using S3PRLUpstream with refresh=True under multiprocessing (e.g., DDP), the checkpoint is downloaded only once, and the other processes simply re-use the newly downloaded checkpoint instead of re-downloading it in every process, which could be very time- and bandwidth-consuming.

Example:

>>> import torch
>>> from s3prl.nn import S3PRLUpstream
...
>>> model = S3PRLUpstream("hubert")
>>> model.eval()
...
>>> with torch.no_grad():
...     wavs = torch.randn(2, 16000 * 2)
...     wavs_len = torch.LongTensor([16000 * 1, 16000 * 2])
...     all_hs, all_hs_len = model(wavs, wavs_len)
...
>>> for hs, hs_len in zip(all_hs, all_hs_len):
...     assert isinstance(hs, torch.FloatTensor)
...     assert isinstance(hs_len, torch.LongTensor)
...
...     batch_size, max_seq_len, hidden_size = hs.shape
...     assert hs_len.dim() == 1
classmethod available_names(only_registered_ckpt: bool = False) → List[str][source][source]#

All the available names supported by this S3PRLUpstream

Parameters:

only_registered_ckpt (bool) – ignore entry names that require path_or_url to be given; that is, the entry names without registered checkpoint sources. These names end with _local (for a local path), _url (for a URL), or _custom (auto-determines path or URL)
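
Example (a quick sanity check; "hubert" is one of the registered names mentioned above):

>>> from s3prl.nn import S3PRLUpstream
>>> names = S3PRLUpstream.available_names()
>>> assert "hubert" in names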

property num_layers: int[source]#

The number of hidden-state layers. All upstreams have a deterministic number of layers; that is, layer drop is turned off by default.

property downsample_rates: List[int][source]#

The downsample rate (from 16000 Hz audio) of each layer. Usually all layers share the same downsample rate, but this might not hold for some advanced upstreams.

property hidden_sizes: List[int][source]#

The hidden size of each layer
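
Example (num_layers, downsample_rates, and hidden_sizes are aligned per layer, so their lengths match):

>>> from s3prl.nn import S3PRLUpstream
>>> model = S3PRLUpstream("hubert")
>>> assert model.num_layers == len(model.hidden_sizes) == len(model.downsample_rates)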

forward(wavs: FloatTensor, wavs_len: LongTensor)[source][source]#
Parameters:
  • wavs (torch.FloatTensor) – (batch_size, seqlen) or (batch_size, seqlen, 1)

  • wavs_len (torch.LongTensor) – (batch_size, )

Returns:

List[torch.FloatTensor], List[torch.LongTensor]

  1. all the layers of hidden states: List[ (batch_size, max_seq_len, hidden_size) ]

  2. the valid length for each hidden states: List[ (batch_size, ) ]

call_super_init: bool = False[source]#
dump_patches: bool = False[source]#
training: bool[source]#

Featurizer#

class s3prl.nn.Featurizer(upstream: S3PRLUpstream, layer_selections: Optional[List[int]] = None, normalize: bool = False)[source][source]#

Bases: Module

Featurizer takes the multiple layers of hidden states from an S3PRLUpstream and reduces (standardizes) them into a single hidden state, to connect with downstream NNs.

This basic Featurizer expects all the layers to have the same stride and hidden size. When the input upstream has only a single layer of hidden states, that layer is used directly. If multiple layers are present, a trainable weighted sum is applied on top of those layers.
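
Conceptually, the multi-layer reduction is a trainable, softmax-normalized weighted sum. A minimal sketch of the idea (illustrative only, not the exact implementation):

>>> import torch
>>> import torch.nn.functional as F
...
>>> num_layers = 3
>>> weights = torch.nn.Parameter(torch.zeros(num_layers))
>>> all_hs = [torch.randn(2, 100, 768) for _ in range(num_layers)]
...
>>> norm_weights = F.softmax(weights, dim=0)
>>> stacked = torch.stack(all_hs, dim=0)  # (num_layers, batch, seq_len, hidden)
>>> weighted = (norm_weights.view(-1, 1, 1, 1) * stacked).sum(dim=0)
>>> assert weighted.shape == (2, 100, 768)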

Parameters:
  • upstream (S3PRLUpstream) – the upstream to extract features from; it is used only for initialization and is not kept in this Featurizer object

  • layer_selections (List[int]) – selects a subset of hidden states from the given upstream by layer ids (0-indexed). If None (default), all the layers of hidden states are selected.

  • normalize (bool) – whether to apply layer norm on all the hidden states before the weighted sum. This can help convergence in some cases, but it is not used in SUPERB to ensure the fidelity of each upstream’s extracted representation.

Example:

>>> import torch
>>> from s3prl.nn import S3PRLUpstream, Featurizer
...
>>> model = S3PRLUpstream("hubert")
>>> model.eval()
...
>>> with torch.no_grad():
...     wavs = torch.randn(2, 16000 * 2)
...     wavs_len = torch.LongTensor([16000 * 1, 16000 * 2])
...     all_hs, all_hs_len = model(wavs, wavs_len)
...
>>> featurizer = Featurizer(model)
>>> hs, hs_len = featurizer(all_hs, all_hs_len)
...
>>> assert isinstance(hs, torch.FloatTensor)
>>> assert isinstance(hs_len, torch.LongTensor)
>>> batch_size, max_seq_len, hidden_size = hs.shape
>>> assert hs_len.dim() == 1
property output_size: int[source]#

The hidden size of the final weighted-sum output

property downsample_rate: int[source]#

The downsample rate (from 16k Hz waveform) of the final weighted-sum output

forward(all_hs: List[FloatTensor], all_lens: List[LongTensor])[source][source]#
Parameters:
  • all_hs (List[torch.FloatTensor]) – List[ (batch_size, seq_len, hidden_size) ]

  • all_lens (List[torch.LongTensor]) – List[ (batch_size, ) ]

Returns:

torch.FloatTensor, torch.LongTensor

  1. The weighted-sum result, (batch_size, seq_len, hidden_size)

  2. the valid length of the result, (batch_size, )

call_super_init: bool = False[source]#
dump_patches: bool = False[source]#
training: bool[source]#

FrameLevel#

class s3prl.nn.FrameLevel(input_size: int, output_size: int, hidden_sizes: Optional[List[int]] = None, activation_type: Optional[str] = None, activation_conf: Optional[dict] = None)[source][source]#

Bases: Module

The common frame-to-frame probing model

Parameters:
  • input_size (int) – input size

  • output_size (int) – output size

  • hidden_sizes (List[int]) – a list of hidden layers’ hidden sizes. Defaults to [256], which projects all the different input sizes to the same dimension. Set an empty list to use the vanilla single-layer linear model.

  • activation_type (str) – the activation class name in torch.nn. Set to None to disable activation, making the model purely linear. Default: None

  • activation_conf (dict) – the arguments for initializing the activation class. Default: empty dict
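
Example (a minimal usage sketch; the sizes below are illustrative):

>>> import torch
>>> from s3prl.nn import FrameLevel
...
>>> model = FrameLevel(input_size=768, output_size=48)
>>> x = torch.randn(2, 100, 768)
>>> x_len = torch.LongTensor([100, 50])
>>> ys, ys_len = model(x, x_len)
>>> assert ys.shape == (2, 100, 48)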

property input_size: int[source]#
property output_size: int[source]#
forward(x, x_len)[source][source]#
Parameters:
  • x (torch.FloatTensor) – (batch_size, seq_len, input_size)

  • x_len (torch.LongTensor) – (batch_size, )

Returns:

tuple

  1. ys (torch.FloatTensor): (batch_size, seq_len, output_size)

  2. ys_len (torch.LongTensor): (batch_size, )

call_super_init: bool = False[source]#
dump_patches: bool = False[source]#
training: bool[source]#

UtteranceLevel#

class s3prl.nn.UtteranceLevel(input_size: int, output_size: int, hidden_sizes: Optional[List[int]] = None, activation_type: Optional[str] = None, activation_conf: Optional[dict] = None, pooling_type: str = 'MeanPooling', pooling_conf: Optional[dict] = None)[source][source]#

Bases: Module

Parameters:
  • input_size (int) – input size

  • output_size (int) – output size

  • hidden_sizes (List[int]) – a list of hidden layers’ hidden sizes. Defaults to [256], which projects all the different input sizes to the same dimension. Set an empty list to use the vanilla single-layer linear model.

  • activation_type (str) – the activation class name in torch.nn. Set to None to disable activation, making the model purely linear. Default: None

  • activation_conf (dict) – the arguments for initializing the activation class. Default: empty dict

  • pooling_type (str) – the pooling class name in s3prl.nn.pooling. Default: MeanPooling

  • pooling_conf (dict) – the arguments for initializing the pooling class. Default: empty dict
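
Example (a minimal usage sketch; the sizes below are illustrative):

>>> import torch
>>> from s3prl.nn import UtteranceLevel
...
>>> model = UtteranceLevel(input_size=768, output_size=10)
>>> x = torch.randn(2, 100, 768)
>>> x_len = torch.LongTensor([100, 50])
>>> y = model(x, x_len)
>>> assert y.shape == (2, 10)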

property input_size: int[source]#
property output_size: int[source]#
forward(x, x_len)[source][source]#
Parameters:
  • x (torch.FloatTensor) – (batch_size, seq_len, input_size)

  • x_len (torch.LongTensor) – (batch_size, )

Returns:

torch.FloatTensor

(batch_size, output_size)

call_super_init: bool = False[source]#
dump_patches: bool = False[source]#
training: bool[source]#

FrameLevelLinear#

class s3prl.nn.FrameLevelLinear(input_size: int, output_size: int, hidden_size: int = 256)[source][source]#

Bases: FrameLevel

The frame-level linear probing model used in SUPERB Benchmark

call_super_init: bool = False[source]#
dump_patches: bool = False[source]#
forward(x, x_len)[source]#
Parameters:
  • x (torch.FloatTensor) – (batch_size, seq_len, input_size)

  • x_len (torch.LongTensor) – (batch_size, )

Returns:

tuple

  1. ys (torch.FloatTensor): (batch_size, seq_len, output_size)

  2. ys_len (torch.LongTensor): (batch_size, )

property input_size: int[source]#
property output_size: int[source]#
training: bool[source]#

MeanPoolingLinear#

class s3prl.nn.MeanPoolingLinear(input_size: int, output_size: int, hidden_size: int = 256)[source][source]#

Bases: UtteranceLevel

The utterance-level linear probing model used in SUPERB Benchmark

call_super_init: bool = False[source]#
dump_patches: bool = False[source]#
forward(x, x_len)[source]#
Parameters:
  • x (torch.FloatTensor) – (batch_size, seq_len, input_size)

  • x_len (torch.LongTensor) – (batch_size, )

Returns:

torch.FloatTensor

(batch_size, output_size)

property input_size: int[source]#
property output_size: int[source]#
training: bool[source]#

MeanPooling#

class s3prl.nn.MeanPooling(input_size: int)[source][source]#

Bases: Module

Computes temporal average pooling (mean pooling over time).

property input_size: int[source]#
property output_size: int[source]#
forward(xs: Tensor, xs_len: LongTensor)[source][source]#
Parameters:
  • xs (torch.Tensor) – Input tensor (#batch, frames, input_size).

  • xs_len (torch.LongTensor) – the valid length of each sample in the batch

Returns:

Output tensor (#batch, input_size)

Return type:

torch.Tensor
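
The computation is equivalent to a length-aware temporal mean; a sketch of the idea (illustrative only, not the exact implementation):

>>> import torch
>>> xs = torch.randn(2, 100, 768)
>>> xs_len = torch.LongTensor([100, 50])
>>> pooled = torch.stack([xs[i, :l].mean(dim=0) for i, l in enumerate(xs_len)])
>>> assert pooled.shape == (2, 768)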

call_super_init: bool = False[source]#
dump_patches: bool = False[source]#
training: bool[source]#

TemporalAveragePooling#

s3prl.nn.TemporalAveragePooling[source]#

alias of MeanPooling

TemporalStatisticsPooling#

class s3prl.nn.TemporalStatisticsPooling(input_size: int)[source][source]#

Bases: Module

Paper: X-vectors: Robust DNN Embeddings for Speaker Recognition Link: http://www.danielpovey.com/files/2018_icassp_xvectors.pdf

property input_size: int[source]#
property output_size: int[source]#
forward(xs, xs_len)[source][source]#

Computes temporal statistics pooling: the concatenation of the temporal mean and standard deviation.

Parameters:
  • xs (torch.Tensor) – Input tensor (#batch, frames, input_size).

  • xs_len (torch.LongTensor) – the valid length of each sample in the batch

Returns:

Output tensor (#batch, output_size)

Return type:

torch.Tensor
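
Since statistics pooling concatenates the temporal mean and standard deviation, the output size is typically twice the input size. A sketch of the idea for full-length inputs (illustrative only):

>>> import torch
>>> xs = torch.randn(2, 100, 768)
>>> pooled = torch.cat([xs.mean(dim=1), xs.std(dim=1)], dim=-1)
>>> assert pooled.shape == (2, 768 * 2)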

call_super_init: bool = False[source]#
dump_patches: bool = False[source]#
training: bool[source]#

RNNEncoder#

class s3prl.nn.RNNEncoder(input_size: int, output_size: int, module: str = 'LSTM', proj_size: int = 1024, hidden_size: List[int] = [1024], dropout: List[float] = [0.0], layer_norm: List[bool] = [False], proj: List[bool] = [True], sample_rate: List[int] = [1], sample_style: str = 'drop', bidirectional: bool = False)[source][source]#

Bases: AbsFrameModel

RNN Encoder for sequence to sequence modeling, e.g., ASR.

Parameters:
  • input_size (int) – Input size.

  • output_size (int) – Output size.

  • module (str, optional) – RNN module type. Defaults to “LSTM”.

  • hidden_size (List[int], optional) – Hidden sizes for each layer. Defaults to [1024].

  • dropout (List[float], optional) – Dropout rates for each layer. Defaults to [0.0].

  • layer_norm (List[bool], optional) – Whether to use layer norm for each layer. Defaults to [False].

  • proj (List[bool], optional) – Whether to use projection for each layer. Defaults to [True].

  • sample_rate (List[int], optional) – Downsample rates for each layer. Defaults to [1].

  • sample_style (str, optional) – Downsample style (“drop” or “concat”). Defaults to “drop”.

  • bidirectional (bool, optional) – Whether RNN layers are bidirectional. Defaults to False.
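
Example (a minimal usage sketch with the default configuration; the sizes below are illustrative):

>>> import torch
>>> from s3prl.nn import RNNEncoder
...
>>> model = RNNEncoder(input_size=768, output_size=31)
>>> x = torch.randn(2, 100, 768)
>>> x_len = torch.LongTensor([100, 50])
>>> ys, ys_len = model(x, x_len)
>>> assert ys.shape[0] == 2 and ys.shape[-1] == 31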

forward(x: Tensor, x_len: LongTensor)[source][source]#
Parameters:
  • x (torch.FloatTensor) – (batch_size, seq_len, input_size)

  • x_len (torch.LongTensor) – (batch_size, )

Returns:

  1. ys (torch.FloatTensor): (batch_size, seq_len, output_size)

  2. ys_len (torch.LongTensor): (batch_size, )

Return type:

tuple

property input_size: int[source]#
property output_size: int[source]#
call_super_init: bool = False[source]#
dump_patches: bool = False[source]#
training: bool[source]#

SuperbDiarizationModel#

class s3prl.nn.SuperbDiarizationModel(input_size: int, output_size: int, rnn_layers: int, hidden_size: int)[source][source]#

Bases: AbsFrameModel

The exact RNN model used in SUPERB Benchmark for Speaker Diarization

Parameters:
  • input_size (int) – input size

  • output_size (int) – output size

  • rnn_layers (int) – number of rnn layers

  • hidden_size (int) – the hidden size across all rnn layers

property input_size: int[source]#
property output_size: int[source]#
call_super_init: bool = False[source]#
dump_patches: bool = False[source]#
forward(xs, xs_len)[source][source]#
Parameters:
  • xs (torch.FloatTensor) – (batch_size, seq_len, input_size)

  • xs_len (torch.LongTensor) – (batch_size, )

Returns:

  1. ys (torch.FloatTensor): (batch_size, seq_len, output_size)

  2. ys_len (torch.LongTensor): (batch_size, )

Return type:

tuple

training: bool[source]#

amsoftmax#

class s3prl.nn.amsoftmax(input_size: int, output_size: int, margin: float = 0.2, scale: float = 30)[source][source]#

Bases: Module

AM-Softmax (Additive Margin Softmax) loss

Parameters:
  • input_size (int) – The input feature size

  • output_size (int) – The output feature size

  • margin (float) – Hyperparameter denoting the margin to the decision boundary

  • scale (float) – Hyperparameter that scales the cosine value
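
For reference, the AM-Softmax loss for a sample with ground-truth class y takes the standard additive-margin form, where s is the scale, m is the margin, and θ_j is the angle between the input embedding and the j-th class weight:

L = -log( exp(s·(cos θ_y − m)) / ( exp(s·(cos θ_y − m)) + Σ_{j≠y} exp(s·cos θ_j) ) )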

property input_size[source]#
property output_size[source]#
forward(x: Tensor, label: LongTensor)[source][source]#
Parameters:
  • x (torch.Tensor) – (batch_size, input_size)

  • label (torch.LongTensor) – (batch_size, )

Returns:

  1. loss (torch.float)

  2. logit (torch.Tensor): (batch_size, )

call_super_init: bool = False[source]#
dump_patches: bool = False[source]#
training: bool[source]#

softmax#

class s3prl.nn.softmax(input_size: int, output_size: int)[source][source]#

Bases: Module

The standard softmax loss in a unified interface shared by all the speaker-related softmax losses

property input_size[source]#
property output_size[source]#
forward(x: Tensor, label: LongTensor)[source][source]#
Parameters:
  • x (torch.Tensor) – (batch_size, input_size)

  • label (torch.LongTensor) – (batch_size, )

Returns:

  1. loss (torch.float)

  2. logit (torch.Tensor): (batch_size, )

call_super_init: bool = False[source]#
dump_patches: bool = False[source]#
training: bool[source]#

SuperbXvector#

class s3prl.nn.SuperbXvector(input_size: int, output_size: int = 512, hidden_size: int = 512, aggregation_size: int = 1500, dropout_p: float = 0.0, batch_norm: bool = False)[source][source]#

Bases: Module

The x-vector model used in the SUPERB Benchmark, with the exact default arguments.

Parameters:
  • input_size (int) – The input feature size, usually the output size of upstream models

  • output_size (int) – (default, 512) The size of the speaker embedding

  • hidden_size (int) – (default, 512) The major hidden size in the network

  • aggregation_size (int) – (default, 1500) The output size of the TDNN backbone right before aggregation (statistics pooling), which is usually large

  • dropout_p (float) – (default, 0.0) The dropout rate

  • batch_norm (bool) – (default, False) Use batch norm for TDNN layers
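
Example (a minimal usage sketch; the sizes below are illustrative):

>>> import torch
>>> from s3prl.nn import SuperbXvector
...
>>> model = SuperbXvector(input_size=768)
>>> x = torch.randn(2, 200, 768)
>>> x_len = torch.LongTensor([200, 100])
>>> spk_emb = model(x, x_len)
>>> assert spk_emb.shape == (2, 512)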

call_super_init: bool = False[source]#
dump_patches: bool = False[source]#
training: bool[source]#
property input_size: int[source]#
property output_size: int[source]#
forward(x, x_len)[source][source]#
Parameters:
  • x (torch.FloatTensor) – (batch_size, seq_len, input_size)

  • x_len (torch.LongTensor) – (batch_size, )

Returns:

(batch_size, output_size)

Return type:

torch.FloatTensor

XVectorBackbone#

class s3prl.nn.XVectorBackbone(input_size: int, output_size: int = 1500, dropout_p: float = 0.0, batch_norm: bool = True)[source][source]#

Bases: Module

The TDNN layers, the same as in https://danielpovey.com/files/2018_odyssey_xvector_lid.pdf.

Parameters:
  • input_size (int) – The input feature size, usually the output size of upstream models

  • output_size (int) – (default, 1500) The frame-level output size of the TDNN backbone

  • dropout_p (float) – (default, 0.0) The dropout rate

  • batch_norm (bool) – (default, True) Use batch norm for TDNN layers

property input_size: int[source]#
property output_size: int[source]#
forward(x: Tensor)[source][source]#
Parameters:

x (torch.FloatTensor) – (batch, seq_len, input_size)

Returns:

(batch, seq_len, output_size)

Return type:

torch.FloatTensor

call_super_init: bool = False[source]#
dump_patches: bool = False[source]#
training: bool[source]#

BeamDecoder#

class s3prl.nn.BeamDecoder(token: str = '', lexicon: str = '', lm: str = '', nbest: int = 1, beam: int = 5, beam_size_token: int = -1, beam_threshold: float = 25.0, lm_weight: float = 2.0, word_score: float = -1.0, unk_score: float = -inf, sil_score: float = 0.0)[source][source]#

Bases: object

Beam decoder powered by flashlight.

Parameters:
  • token (str, optional) – Path to dictionary file. Defaults to “”.

  • lexicon (str, optional) – Path to lexicon file. Defaults to “”.

  • lm (str, optional) – Path to KenLM file. Defaults to “”.

  • nbest (int, optional) – Number of best hypotheses to return. Defaults to 1.

  • beam (int, optional) – Beam size. Defaults to 5.

  • beam_size_token (int, optional) – Token beam size. Defaults to -1.

  • beam_threshold (float, optional) – Beam search log prob threshold. Defaults to 25.0.

  • lm_weight (float, optional) – Language model weight. Defaults to 2.0.

  • word_score (float, optional) – Score for word appearances in the transcription. Defaults to -1.0.

  • unk_score (float, optional) – Score for unknown word appearances in the transcription. Defaults to -math.inf.

  • sil_score (float, optional) – Score for silence appearances in the transcription. Defaults to 0.0.
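
Example (a usage sketch; requires the optional flashlight dependency, assumes batched (batch, time, num_tokens) emissions, and the token/lexicon/LM paths below are placeholders you must provide):

>>> import torch
>>> from s3prl.nn import BeamDecoder
...
>>> decoder = BeamDecoder(
...     token="tokens.txt", lexicon="lexicon.txt", lm="lm.arpa", nbest=1
... )
>>> emissions = torch.randn(1, 100, 32).log_softmax(dim=-1)
>>> hyps = decoder.decode(emissions)
>>> best_hyp = hyps[0][0]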

get_tokens(idxs: Iterable) → LongTensor[source][source]#

Normalize tokens by handling CTC blank, ASG replabels, etc.

Parameters:

idxs (Iterable) – Token ID list output by self.decoder

Returns:

Token ID list after normalization.

Return type:

torch.LongTensor

get_timesteps(token_idxs: List[int]) → List[int][source][source]#

Returns frame numbers corresponding to every non-blank token.

Parameters:

token_idxs (List[int]) – IDs of decoded tokens.

Returns:

Frame numbers corresponding to every non-blank token.

Return type:

List[int]

decode(emissions: Tensor) → List[List[dict]][source][source]#

Decode sequence.

Parameters:

emissions (torch.Tensor) – Emission probabilities (in log scale).

Returns:

Decoded hypotheses.

Return type:

List[List[dict]]