nn#

(s3prl.nn)

Common models and losses in pure torch.nn.Module, with a dependency on torch only

s3prl.nn.beam_decoder

The beam search decoder from flashlight

s3prl.nn.common

Common probing models

s3prl.nn.hear

The probing model following the HEAR Benchmark

s3prl.nn.interface

Model interfaces

s3prl.nn.linear

Common linear models

s3prl.nn.pit

Permutation Invariant Training (PIT) loss

s3prl.nn.pooling

Common pooling methods

s3prl.nn.rnn

RNN models used in the SUPERB Benchmark

s3prl.nn.speaker_loss

Speaker verification loss

s3prl.nn.speaker_model

Speaker verification models

s3prl.nn.specaug

SpecAugment (specaug) modules

s3prl.nn.upstream

S3PRL Upstream Collection and some utilities

S3PRLUpstream#

class s3prl.nn.S3PRLUpstream(name: str, path_or_url: Optional[str] = None, refresh: bool = False, normalize: bool = False, extra_conf: Optional[dict] = None, randomize: bool = False)[source][source]#

Bases: Module

This is an easy interface for using all the models in S3PRL. See S3PRL Upstream Collection for the example usage and all the supported models.

Parameters:
  • name (str) – can be “apc”, “hubert”, “wav2vec2”. See available_names for all the supported names

  • path_or_url (str) – The source of the checkpoint. Can be a local path or a URL

  • refresh (bool) – (default, False) If False, download the checkpoint only when it has not been downloaded before. If True, force a re-download of the checkpoint.

  • extra_conf (dict) – (default, None) The extra arguments for each specific upstream, the available options are shown in each upstream section

  • randomize (bool) – (default, False) If True, randomize the weights of the upstream model

Note

When using S3PRLUpstream with refresh=True under multiprocessing (e.g., DDP), the checkpoint is downloaded only once, and the other processes simply re-use the newly downloaded checkpoint instead of re-downloading it in every process, which could be very time- and bandwidth-consuming.

Example:

>>> import torch
>>> from s3prl.nn import S3PRLUpstream
...
>>> model = S3PRLUpstream("hubert")
>>> model.eval()
...
>>> with torch.no_grad():
...     wavs = torch.randn(2, 16000 * 2)
...     wavs_len = torch.LongTensor([16000 * 1, 16000 * 2])
...     all_hs, all_hs_len = model(wavs, wavs_len)
...
>>> for hs, hs_len in zip(all_hs, all_hs_len):
...     assert isinstance(hs, torch.FloatTensor)
...     assert isinstance(hs_len, torch.LongTensor)
...
...     batch_size, max_seq_len, hidden_size = hs.shape
...     assert hs_len.dim() == 1
classmethod available_names(only_registered_ckpt: bool = False) → List[str][source][source]#

All the available names supported by this S3PRLUpstream

Parameters:

only_registered_ckpt (bool) – ignore entry names that require path_or_url to be given; that is, the entry names without registered checkpoint sources. These names end with _local (for a local path), _url (for a URL), or _custom (auto-determines path or URL)
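
Example (a quick sanity check; "hubert" is one of the registered names mentioned above):

>>> from s3prl.nn import S3PRLUpstream
>>> names = S3PRLUpstream.available_names()
>>> assert "hubert" in names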

property num_layers: int[source]#

The number of hidden-state layers. All upstreams have a deterministic number of layers; that is, layer drop is turned off by default.

property downsample_rates: List[int][source]#

The downsample rate (from 16000 Hz audio) of each layer. Usually all layers share the same downsample rate, but this might not hold for some advanced upstreams.

property hidden_sizes: List[int][source]#

The hidden size of each layer
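
Example (num_layers, downsample_rates, and hidden_sizes are aligned per layer, so their lengths match):

>>> from s3prl.nn import S3PRLUpstream
>>> model = S3PRLUpstream("hubert")
>>> assert model.num_layers == len(model.hidden_sizes) == len(model.downsample_rates)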

forward(wavs: FloatTensor, wavs_len: LongTensor)[source][source]#
Parameters:
  • wavs (torch.FloatTensor) – (batch_size, seqlen) or (batch_size, seqlen, 1)

  • wavs_len (torch.LongTensor) – (batch_size, )

Returns:

List[torch.FloatTensor], List[torch.LongTensor]

  1. all the layers of hidden states: List[ (batch_size, max_seq_len, hidden_size) ]

  2. the valid length for each hidden states: List[ (batch_size, ) ]

call_super_init: bool = False[source]#
dump_patches: bool = False[source]#
training: bool[source]#

Featurizer#

class s3prl.nn.Featurizer(upstream: S3PRLUpstream, layer_selections: Optional[List[int]] = None, normalize: bool = False)[source][source]#

Bases: Module

Featurizer takes the multiple layers of hidden states from an S3PRLUpstream and reduces (standardizes) them into a single hidden state, to connect with downstream NNs.

This basic Featurizer expects all the layers to have the same stride and hidden size. When the input upstream has only a single layer of hidden states, that layer is used directly. If multiple layers are present, a trainable weighted sum is applied on top of those layers.
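
Conceptually, the multi-layer reduction is a trainable, softmax-normalized weighted sum. A minimal sketch of the idea (illustrative only, not the exact implementation):

>>> import torch
>>> import torch.nn.functional as F
...
>>> num_layers = 3
>>> weights = torch.nn.Parameter(torch.zeros(num_layers))
>>> all_hs = [torch.randn(2, 100, 768) for _ in range(num_layers)]
...
>>> norm_weights = F.softmax(weights, dim=0)
>>> stacked = torch.stack(all_hs, dim=0)  # (num_layers, batch, seq_len, hidden)
>>> weighted = (norm_weights.view(-1, 1, 1, 1) * stacked).sum(dim=0)
>>> assert weighted.shape == (2, 100, 768)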

Parameters:
  • upstream (S3PRLUpstream) – the upstream to extract features from; it is used only for initialization and is not kept in this Featurizer object

  • layer_selections (List[int]) – selects a subset of hidden states from the given upstream by layer ids (0-indexed). If None (default), all the layers of hidden states are selected.

  • normalize (bool) – whether to apply layer norm on all the hidden states before the weighted sum. This can help convergence in some cases, but it is not used in SUPERB to ensure the fidelity of each upstream’s extracted representation.

Example:

>>> import torch
>>> from s3prl.nn import S3PRLUpstream, Featurizer
...
>>> model = S3PRLUpstream("hubert")
>>> model.eval()
...
>>> with torch.no_grad():
...     wavs = torch.randn(2, 16000 * 2)
...     wavs_len = torch.LongTensor([16000 * 1, 16000 * 2])
...     all_hs, all_hs_len = model(wavs, wavs_len)
...
>>> featurizer = Featurizer(model)
>>> hs, hs_len = featurizer(all_hs, all_hs_len)
...
>>> assert isinstance(hs, torch.FloatTensor)
>>> assert isinstance(hs_len, torch.LongTensor)
>>> batch_size, max_seq_len, hidden_size = hs.shape
>>> assert hs_len.dim() == 1
property output_size: int[source]#

The hidden size of the final weighted-sum output

property downsample_rate: int[source]#

The downsample rate (from 16k Hz waveform) of the final weighted-sum output

forward(all_hs: List[FloatTensor], all_lens: List[LongTensor])[source][source]#
Parameters:
  • all_hs (List[torch.FloatTensor]) – List[ (batch_size, seq_len, hidden_size) ]

  • all_lens (List[torch.LongTensor]) – List[ (batch_size, ) ]

Returns:

torch.FloatTensor, torch.LongTensor

  1. The weighted-sum result, (batch_size, seq_len, hidden_size)

  2. the valid length of the result, (batch_size, )

call_super_init: bool = False[source]#
dump_patches: bool = False[source]#
training: bool[source]#

FrameLevel#

class s3prl.nn.FrameLevel(input_size: int, output_size: int, hidden_sizes: Optional[List[int]] = None, activation_type: Optional[str] = None, activation_conf: Optional[dict] = None)[source][source]#

Bases: Module

The common frame-to-frame probing model

Parameters:
  • input_size (int) – input size

  • output_size (int) – output size

  • hidden_sizes (List[int]) – a list of hidden layers’ hidden sizes. Defaults to [256], which projects all the different input sizes to the same dimension. Set an empty list to use the vanilla single-layer linear model.

  • activation_type (str) – the activation class name in torch.nn. Set to None to disable activation, making the model purely linear. Default: None

  • activation_conf (dict) – the arguments for initializing the activation class. Default: empty dict
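
Example (a minimal usage sketch; the sizes below are illustrative):

>>> import torch
>>> from s3prl.nn import FrameLevel
...
>>> model = FrameLevel(input_size=768, output_size=48)
>>> x = torch.randn(2, 100, 768)
>>> x_len = torch.LongTensor([100, 50])
>>> ys, ys_len = model(x, x_len)
>>> assert ys.shape == (2, 100, 48)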

property input_size: int[source]#
property output_size: int[source]#
forward(x, x_len)[source][source]#
Parameters:
  • x (torch.FloatTensor) – (batch_size, seq_len, input_size)

  • x_len (torch.LongTensor) – (batch_size, )

Returns:

tuple

  1. ys (torch.FloatTensor): (batch_size, seq_len, output_size)

  2. ys_len (torch.LongTensor): (batch_size, )

call_super_init: bool = False[source]#
dump_patches: bool = False[source]#
training: bool[source]#

UtteranceLevel#

class s3prl.nn.UtteranceLevel(input_size: int, output_size: int, hidden_sizes: Optional[List[int]] = None, activation_type: Optional[str] = None, activation_conf: Optional[dict] = None, pooling_type: str = 'MeanPooling', pooling_conf: Optional[dict] = None)[source][source]#

Bases: Module

Parameters:
  • input_size (int) – input size

  • output_size (int) – output size

  • hidden_sizes (List[int]) – a list of hidden layers’ hidden sizes. Defaults to [256], which projects all the different input sizes to the same dimension. Set an empty list to use the vanilla single-layer linear model.

  • activation_type (str) – the activation class name in torch.nn. Set to None to disable activation, making the model purely linear. Default: None

  • activation_conf (dict) – the arguments for initializing the activation class. Default: empty dict

  • pooling_type (str) – the pooling class name in s3prl.nn.pooling. Default: MeanPooling

  • pooling_conf (dict) – the arguments for initializing the pooling class. Default: empty dict
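
Example (a minimal usage sketch; the sizes below are illustrative):

>>> import torch
>>> from s3prl.nn import UtteranceLevel
...
>>> model = UtteranceLevel(input_size=768, output_size=10)
>>> x = torch.randn(2, 100, 768)
>>> x_len = torch.LongTensor([100, 50])
>>> y = model(x, x_len)
>>> assert y.shape == (2, 10)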

property input_size: int[source]#
property output_size: int[source]#
forward(x, x_len)[source][source]#
Parameters:
  • x (torch.FloatTensor) – (batch_size, seq_len, input_size)

  • x_len (torch.LongTensor) – (batch_size, )

Returns:

torch.FloatTensor

(batch_size, output_size)

call_super_init: bool = False[source]#
dump_patches: bool = False[source]#
training: bool[source]#

FrameLevelLinear#

class s3prl.nn.FrameLevelLinear(input_size: int, output_size: int, hidden_size: int = 256)[source][source]#

Bases: FrameLevel

The frame-level linear probing model used in SUPERB Benchmark

call_super_init: bool = False[source]#
dump_patches: bool = False[source]#
forward(x, x_len)[source]#
Parameters:
  • x (torch.FloatTensor) – (batch_size, seq_len, input_size)

  • x_len (torch.LongTensor) – (batch_size, )

Returns:

tuple

  1. ys (torch.FloatTensor): (batch_size, seq_len, output_size)

  2. ys_len (torch.LongTensor): (batch_size, )

property input_size: int[source]#
property output_size: int[source]#
training: bool[source]#

MeanPoolingLinear#

class s3prl.nn.MeanPoolingLinear(input_size: int, output_size: int, hidden_size: int = 256)[source][source]#

Bases: UtteranceLevel

The utterance-level linear probing model used in SUPERB Benchmark

call_super_init: bool = False[source]#
dump_patches: bool = False[source]#
forward(x, x_len)[source]#
Parameters:
  • x (torch.FloatTensor) – (batch_size, seq_len, input_size)

  • x_len (torch.LongTensor) – (batch_size, )

Returns:

torch.FloatTensor

(batch_size, output_size)

property input_size: int[source]#
property output_size: int[source]#
training: bool[source]#

MeanPooling#

class s3prl.nn.MeanPooling(input_size: int)[source][source]#

Bases: Module

Computes temporal average pooling (mean pooling over time).

property input_size: int[source]#
property output_size: int[source]#
forward(xs: Tensor, xs_len: LongTensor)[source][source]#
Parameters:
  • xs (torch.Tensor) – Input tensor (#batch, frames, input_size).

  • xs_len (torch.LongTensor) – the valid length of each sample in the batch

Returns:

Output tensor (#batch, input_size)

Return type:

torch.Tensor
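
The computation is equivalent to a length-aware temporal mean; a sketch of the idea (illustrative only, not the exact implementation):

>>> import torch
>>> xs = torch.randn(2, 100, 768)
>>> xs_len = torch.LongTensor([100, 50])
>>> pooled = torch.stack([xs[i, :l].mean(dim=0) for i, l in enumerate(xs_len)])
>>> assert pooled.shape == (2, 768)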

call_super_init: bool = False[source]#
dump_patches: bool = False[source]#
training: bool[source]#

TemporalAveragePooling#

s3prl.nn.TemporalAveragePooling[source]#

alias of MeanPooling

TemporalStatisticsPooling#

class s3prl.nn.TemporalStatisticsPooling(input_size: int)[source][source]#

Bases: Module

Paper: X-vectors: Robust DNN Embeddings for Speaker Recognition Link: http://www.danielpovey.com/files/2018_icassp_xvectors.pdf

property input_size: int[source]#
property output_size: int[source]#
forward(xs, xs_len)[source][source]#

Computes temporal statistics pooling: the concatenation of the temporal mean and standard deviation.

Parameters:
  • xs (torch.Tensor) – Input tensor (#batch, frames, input_size).

  • xs_len (torch.LongTensor) – the valid length of each sample in the batch

Returns:

Output tensor (#batch, output_size)

Return type:

torch.Tensor
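
Since statistics pooling concatenates the temporal mean and standard deviation, the output size is typically twice the input size. A sketch of the idea for full-length inputs (illustrative only):

>>> import torch
>>> xs = torch.randn(2, 100, 768)
>>> pooled = torch.cat([xs.mean(dim=1), xs.std(dim=1)], dim=-1)
>>> assert pooled.shape == (2, 768 * 2)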

call_super_init: bool = False[source]#
dump_patches: bool = False[source]#
training: bool[source]#

RNNEncoder#

class s3prl.nn.RNNEncoder(input_size: int, output_size: int, module: str = 'LSTM', proj_size: int = 1024, hidden_size: List[int] = [1024], dropout: List[float] = [0.0], layer_norm: List[bool] = [False], proj: List[bool] = [True], sample_rate: List[int] = [1], sample_style: str = 'drop', bidirectional: bool = False)[source][source]#

Bases: AbsFrameModel

RNN Encoder for sequence to sequence modeling, e.g., ASR.

Parameters:
  • input_size (int) – Input size.

  • output_size (int) – Output size.

  • module (str, optional) – RNN module type. Defaults to “LSTM”.

  • hidden_size (List[int], optional) – Hidden sizes for each layer. Defaults to [1024].

  • dropout (List[float], optional) – Dropout rates for each layer. Defaults to [0.0].

  • layer_norm (List[bool], optional) – Whether to use layer norm for each layer. Defaults to [False].

  • proj (List[bool], optional) – Whether to use projection for each layer. Defaults to [True].

  • sample_rate (List[int], optional) – Downsample rates for each layer. Defaults to [1].

  • sample_style (str, optional) – Downsample style (“drop” or “concat”). Defaults to “drop”.

  • bidirectional (bool, optional) – Whether RNN layers are bidirectional. Defaults to False.
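
Example (a minimal usage sketch with the default configuration; the sizes below are illustrative):

>>> import torch
>>> from s3prl.nn import RNNEncoder
...
>>> model = RNNEncoder(input_size=768, output_size=31)
>>> x = torch.randn(2, 100, 768)
>>> x_len = torch.LongTensor([100, 50])
>>> ys, ys_len = model(x, x_len)
>>> assert ys.shape[0] == 2 and ys.shape[-1] == 31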

forward(x: Tensor, x_len: LongTensor)[source][source]#
Parameters:
  • x (torch.FloatTensor) – (batch_size, seq_len, input_size)

  • x_len (torch.LongTensor) – (batch_size, )

Returns:

  1. ys (torch.FloatTensor): (batch_size, seq_len, output_size)

  2. ys_len (torch.LongTensor): (batch_size, )

Return type:

tuple

property input_size: int[source]#
property output_size: int[source]#
call_super_init: bool = False[source]#
dump_patches: bool = False[source]#
training: bool[source]#

SuperbDiarizationModel#

class s3prl.nn.SuperbDiarizationModel(input_size: int, output_size: int, rnn_layers: int, hidden_size: int)[source][source]#

Bases: AbsFrameModel

The exact RNN model used in SUPERB Benchmark for Speaker Diarization

Parameters:
  • input_size (int) – input size

  • output_size (int) – output size

  • rnn_layers (int) – number of rnn layers

  • hidden_size (int) – the hidden size across all rnn layers

property input_size: int[source]#
property output_size: int[source]#
call_super_init: bool = False[source]#
dump_patches: bool = False[source]#
forward(xs, xs_len)[source][source]#
Parameters:
  • xs (torch.FloatTensor) – (batch_size, seq_len, input_size)

  • xs_len (torch.LongTensor) – (batch_size, )

Returns:

  1. ys (torch.FloatTensor): (batch_size, seq_len, output_size)

  2. ys_len (torch.LongTensor): (batch_size, )

Return type:

tuple

training: bool[source]#

amsoftmax#

class s3prl.nn.amsoftmax(input_size: int, output_size: int, margin: float = 0.2, scale: float = 30)[source][source]#

Bases: Module

AM-Softmax (Additive Margin Softmax) loss

Parameters:
  • input_size (int) – The input feature size

  • output_size (int) – The output feature size

  • margin (float) – Hyperparameter denoting the margin to the decision boundary

  • scale (float) – Hyperparameter that scales the cosine value
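
For reference, the AM-Softmax loss for a sample with ground-truth class y takes the standard additive-margin form, where s is the scale, m is the margin, and θ_j is the angle between the input embedding and the j-th class weight:

L = -log( exp(s·(cos θ_y − m)) / ( exp(s·(cos θ_y − m)) + Σ_{j≠y} exp(s·cos θ_j) ) )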

property input_size[source]#
property output_size[source]#
forward(x: Tensor, label: LongTensor)[source][source]#
Parameters:
  • x (torch.Tensor) – (batch_size, input_size)

  • label (torch.LongTensor) – (batch_size, )

Returns:

  1. loss (torch.float)

  2. logit (torch.Tensor): (batch_size, )

call_super_init: bool = False[source]#
dump_patches: bool = False[source]#
training: bool[source]#

softmax#

class s3prl.nn.softmax(input_size: int, output_size: int)[source][source]#

Bases: Module

The standard softmax loss in a unified interface shared by all the speaker-related softmax losses

property input_size[source]#
property output_size[source]#
forward(x: Tensor, label: LongTensor)[source][source]#
Parameters:
  • x (torch.Tensor) – (batch_size, input_size)

  • label (torch.LongTensor) – (batch_size, )

Returns:

  1. loss (torch.float)

  2. logit (torch.Tensor): (batch_size, )

call_super_init: bool = False[source]#
dump_patches: bool = False[source]#
training: bool[source]#

SuperbXvector#

class s3prl.nn.SuperbXvector(input_size: int, output_size: int = 512, hidden_size: int = 512, aggregation_size: int = 1500, dropout_p: float = 0.0, batch_norm: bool = False)[source][source]#

Bases: Module

The x-vector model used in the SUPERB Benchmark, with the exact default arguments.

Parameters:
  • input_size (int) – The input feature size, usually the output size of upstream models

  • output_size (int) – (default, 512) The size of the speaker embedding

  • hidden_size (int) – (default, 512) The major hidden size in the network

  • aggregation_size (int) – (default, 1500) The output size of the TDNN backbone right before aggregation (statistics pooling), which is usually large

  • dropout_p (float) – (default, 0.0) The dropout rate

  • batch_norm (bool) – (default, False) Use batch norm for TDNN layers
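
Example (a minimal usage sketch; the sizes below are illustrative):

>>> import torch
>>> from s3prl.nn import SuperbXvector
...
>>> model = SuperbXvector(input_size=768)
>>> x = torch.randn(2, 200, 768)
>>> x_len = torch.LongTensor([200, 100])
>>> spk_emb = model(x, x_len)
>>> assert spk_emb.shape == (2, 512)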

call_super_init: bool = False[source]#
dump_patches: bool = False[source]#
training: bool[source]#
property input_size: int[source]#
property output_size: int[source]#
forward(x, x_len)[source][source]#
Parameters:
  • x (torch.FloatTensor) – (batch_size, seq_len, input_size)

  • x_len (torch.LongTensor) – (batch_size, )

Returns:

(batch_size, output_size)

Return type:

torch.FloatTensor

XVectorBackbone#

class s3prl.nn.XVectorBackbone(input_size: int, output_size: int = 1500, dropout_p: float = 0.0, batch_norm: bool = True)[source][source]#

Bases: Module

The TDNN layers, the same as in https://danielpovey.com/files/2018_odyssey_xvector_lid.pdf.

Parameters:
  • input_size (int) – The input feature size, usually the output size of upstream models

  • output_size (int) – (default, 1500) The frame-level output size of the TDNN backbone

  • dropout_p (float) – (default, 0.0) The dropout rate

  • batch_norm (bool) – (default, True) Use batch norm for TDNN layers

property input_size: int[source]#
property output_size: int[source]#
forward(x: Tensor)[source][source]#
Parameters:

x (torch.FloatTensor) – (batch, seq_len, input_size)

Returns:

(batch, seq_len, output_size)

Return type:

torch.FloatTensor

call_super_init: bool = False[source]#
dump_patches: bool = False[source]#
training: bool[source]#

BeamDecoder#

class s3prl.nn.BeamDecoder(token: str = '', lexicon: str = '', lm: str = '', nbest: int = 1, beam: int = 5, beam_size_token: int = -1, beam_threshold: float = 25.0, lm_weight: float = 2.0, word_score: float = -1.0, unk_score: float = -inf, sil_score: float = 0.0)[source][source]#

Bases: object

Beam decoder powered by flashlight.

Parameters:
  • token (str, optional) – Path to dictionary file. Defaults to “”.

  • lexicon (str, optional) – Path to lexicon file. Defaults to “”.

  • lm (str, optional) – Path to KenLM file. Defaults to “”.

  • nbest (int, optional) – Number of best hypotheses to return. Defaults to 1.

  • beam (int, optional) – Beam size. Defaults to 5.

  • beam_size_token (int, optional) – Token beam size. Defaults to -1.

  • beam_threshold (float, optional) – Beam search log prob threshold. Defaults to 25.0.

  • lm_weight (float, optional) – Language model weight. Defaults to 2.0.

  • word_score (float, optional) – Score for word appearances in the transcription. Defaults to -1.0.

  • unk_score (float, optional) – Score for unknown word appearances in the transcription. Defaults to -math.inf.

  • sil_score (float, optional) – Score for silence appearances in the transcription. Defaults to 0.0.
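
Example (a usage sketch; requires the optional flashlight dependency, assumes batched (batch, time, num_tokens) emissions, and the token/lexicon/LM paths below are placeholders you must provide):

>>> import torch
>>> from s3prl.nn import BeamDecoder
...
>>> decoder = BeamDecoder(
...     token="tokens.txt", lexicon="lexicon.txt", lm="lm.arpa", nbest=1
... )
>>> emissions = torch.randn(1, 100, 32).log_softmax(dim=-1)
>>> hyps = decoder.decode(emissions)
>>> best_hyp = hyps[0][0]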

get_tokens(idxs: Iterable) → LongTensor[source][source]#

Normalize tokens by handling CTC blank, ASG replabels, etc.

Parameters:

idxs (Iterable) – Token ID list output by self.decoder

Returns:

Token ID list after normalization.

Return type:

torch.LongTensor

get_timesteps(token_idxs: List[int]) → List[int][source][source]#

Returns frame numbers corresponding to every non-blank token.

Parameters:

token_idxs (List[int]) – IDs of decoded tokens.

Returns:

Frame numbers corresponding to every non-blank token.

Return type:

List[int]

decode(emissions: Tensor) → List[List[dict]][source][source]#

Decode sequence.

Parameters:

emissions (torch.Tensor) – Emission probabilities (in log scale).

Returns:

Decoded hypotheses.

Return type:

List[List[dict]]