nn#
(s3prl.nn)
Common model and loss in pure torch.nn.Module with torch dependency only
- The beam search decoder of flashlight
- Common probing models
- The probing model following Hear Benchmark
- Model interfaces
- Common linear models
- Permutation Invariant Training (PIT) loss
- Common pooling methods
- RNN models used in Superb Benchmark
- Speaker verification loss
- Speaker verification models
- Specaug modules
- S3PRL Upstream Collection and some utilities
S3PRLUpstream#
- class s3prl.nn.S3PRLUpstream(name: str, path_or_url: Optional[str] = None, refresh: bool = False, normalize: bool = False, extra_conf: Optional[dict] = None, randomize: bool = False)
Bases: Module
This is an easy interface for using all the models in S3PRL. See S3PRL Upstream Collection for the example usage and all the supported models.
- Parameters:
name (str) – can be “apc”, “hubert”, “wav2vec2”. See available_names for all the supported names
path_or_url (str) – The source of the checkpoint. Might be a local path or a URL
refresh (bool) – (default, False) If False, only download the checkpoint if it has not been downloaded before. If True, force a re-download of the checkpoint.
extra_conf (dict) – (default, None) The extra arguments for each specific upstream; the available options are shown in each upstream section
randomize (bool) – (default, False) If True, randomize the upstream model
Note
When using S3PRLUpstream with refresh=True and multiprocessing (e.g. DDP), the checkpoint will only be downloaded once, and the other processes will simply re-use the newly downloaded checkpoint instead of re-downloading it in every process, which can be very time- and bandwidth-consuming.
Example:
>>> import torch
>>> from s3prl.nn import S3PRLUpstream
...
>>> model = S3PRLUpstream("hubert")
>>> model.eval()
...
>>> with torch.no_grad():
...     wavs = torch.randn(2, 16000 * 2)
...     wavs_len = torch.LongTensor([16000 * 1, 16000 * 2])
...     all_hs, all_hs_len = model(wavs, wavs_len)
...
>>> for hs, hs_len in zip(all_hs, all_hs_len):
...     assert isinstance(hs, torch.FloatTensor)
...     assert isinstance(hs_len, torch.LongTensor)
...
...     batch_size, max_seq_len, hidden_size = hs.shape
...     assert hs_len.dim() == 1
- classmethod available_names(only_registered_ckpt: bool = False) List[str]
All the available names supported by this S3PRLUpstream
- Parameters:
only_registered_ckpt (bool) – ignore entry names that require path_or_url to be given, that is, the entry names without a registered checkpoint source. These names end with _local (for a local path), _url (for a URL) or _custom (auto-determine path or URL)
- property num_layers: int
Number of layers of hidden states. All the upstreams have a deterministic number of layers. That is, layer drop is turned off by default.
- property downsample_rates: List[int]
Downsampling rate from 16000 Hz audio of each layer. Usually, all layers have the same downsampling rate, but this might not be the case for some advanced upstreams.
- property hidden_sizes: List[int]
The hidden size of each layer
- forward(wavs: FloatTensor, wavs_len: LongTensor)
- Parameters:
wavs (torch.FloatTensor) – (batch_size, seqlen) or (batch_size, seqlen, 1)
wavs_len (torch.LongTensor) – (batch_size, )
- Returns:
List[torch.FloatTensor], List[torch.LongTensor]
all the layers of hidden states: List[ (batch_size, max_seq_len, hidden_size) ]
the valid lengths of each layer of hidden states: List[ (batch_size, ) ]
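To make the downsampling rate concrete, the arithmetic below relates a rate to the frame duration and to a rough frame count for 16 kHz audio. This is only a back-of-the-envelope sketch: the rate of 320 and the helper names are illustrative assumptions, and real upstreams may differ by a frame or so at the edges, so the values returned in all_hs_len should be trusted over this estimate.

```python
SAMPLE_RATE = 16000  # the documentation above assumes 16 kHz input audio


def frame_shift_ms(downsample_rate: int) -> float:
    """Duration in milliseconds covered by one hidden-state frame."""
    return 1000.0 * downsample_rate / SAMPLE_RATE


def approx_num_frames(num_samples: int, downsample_rate: int) -> int:
    """Rough frame count; an estimate, not the exact upstream formula."""
    return num_samples // downsample_rate


# 320 is an assumed, illustrative rate, not a value read from any upstream.
rate = 320
shift = frame_shift_ms(rate)
frames = [approx_num_frames(n, rate) for n in (16000 * 1, 16000 * 2)]
```

With this assumed rate, one second of audio maps to roughly 50 frames of 20 ms each.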
Featurizer#
- class s3prl.nn.Featurizer(upstream: S3PRLUpstream, layer_selections: Optional[List[int]] = None, normalize: bool = False)
Bases: Module
Featurizer takes the multiple layers of hidden_states from S3PRLUpstream and reduces (standardizes) them into a single hidden_states, to connect with downstream NNs.
This basic Featurizer expects all the layers to have the same stride and hidden_size. When the input upstream only has a single layer of hidden states, that layer is used directly. If multiple layers are present, a trainable weighted-sum is added on top of those layers.
- Parameters:
upstream (S3PRLUpstream) – the upstream to extract features from; this upstream is used only for initialization and will not be kept in this Featurizer object
layer_selections (List[int]) – To select a subset of hidden states from the given upstream by layer ids (0-index). If None (default), then all the layers of hidden states are selected
normalize (bool) – Whether to apply layer norm on all the hidden states before the weighted-sum. This can help convergence in some cases, but is not used in SUPERB to ensure the fidelity of each upstream’s extracted representation.
Example:
>>> import torch
>>> from s3prl.nn import S3PRLUpstream, Featurizer
...
>>> model = S3PRLUpstream("hubert")
>>> model.eval()
...
>>> with torch.no_grad():
...     wavs = torch.randn(2, 16000 * 2)
...     wavs_len = torch.LongTensor([16000 * 1, 16000 * 2])
...     all_hs, all_hs_len = model(wavs, wavs_len)
...
>>> featurizer = Featurizer(model)
>>> hs, hs_len = featurizer(all_hs, all_hs_len)
...
>>> assert isinstance(hs, torch.FloatTensor)
>>> assert isinstance(hs_len, torch.LongTensor)
>>> batch_size, max_seq_len, hidden_size = hs.shape
>>> assert hs_len.dim() == 1
- property downsample_rate: int
The downsample rate (from 16k Hz waveform) of the final weighted-sum output
- forward(all_hs: List[FloatTensor], all_lens: List[LongTensor])
- Parameters:
all_hs (List[torch.FloatTensor]) – List[ (batch_size, seq_len, hidden_size) ]
all_lens (List[torch.LongTensor]) – List[ (batch_size, ) ]
- Returns:
torch.FloatTensor, torch.LongTensor
The weighted-sum result, (batch_size, seq_len, hidden_size)
the valid length of the result, (batch_size, )
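The trainable weighted-sum described for Featurizer can be illustrated without torch. The sketch below assumes that the per-layer weights are normalized with a softmax before summing, which is a common choice but is not stated in the text above; the function names and the toy data are hypothetical, and this is not the library's implementation.

```python
import math
from typing import List


def softmax(weights: List[float]) -> List[float]:
    """Numerically stable softmax over one weight per layer."""
    peak = max(weights)
    exps = [math.exp(w - peak) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum(layers: List[List[List[float]]], weights: List[float]) -> List[List[float]]:
    """Fuse layers shaped (num_layers, seq_len, hidden_size) into (seq_len, hidden_size)."""
    norm = softmax(weights)  # assumption: weights are softmax-normalized
    seq_len, hidden = len(layers[0]), len(layers[0][0])
    return [
        [sum(w * layer[t][d] for w, layer in zip(norm, layers)) for d in range(hidden)]
        for t in range(seq_len)
    ]


# Two toy layers for a single utterance, each with 2 frames of size 2.
layers = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[3.0, 4.0], [5.0, 6.0]],
]
weights = [0.0, 0.0]  # equal raw weights give a plain average of the layers
fused = weighted_sum(layers, weights)
```

Because every layer must contribute to the same (seq_len, hidden_size) grid, a sum like this only works when all selected layers share the same stride and hidden size, as the description requires.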
FrameLevel#
- class s3prl.nn.FrameLevel(input_size: int, output_size: int, hidden_sizes: Optional[List[int]] = None, activation_type: Optional[str] = None, activation_conf: Optional[dict] = None)
Bases: Module
The common frame-to-frame probing model
- Parameters:
input_size (int) – input size
output_size (int) – output size
hidden_sizes (List[int]) – a list of the hidden layers’ hidden sizes. Defaults to [256], which projects all different input sizes to the same dimension. Set to an empty list to use the vanilla single-layer linear model
activation_type (str) – the activation class name in torch.nn. Set to None to disable activation, making the model purely linear. Default: None
activation_conf (dict) – the arguments for initializing the activation class. Default: empty dict
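The effect of hidden_sizes on the stack of linear layers can be sketched as a list of (in_features, out_features) pairs. The helper name layer_shapes and the sizes 768 and 10 are illustrative assumptions; the sketch only mirrors the documented defaults and does not reproduce the class itself.

```python
from typing import List, Optional, Tuple


def layer_shapes(
    input_size: int,
    output_size: int,
    hidden_sizes: Optional[List[int]] = None,
) -> List[Tuple[int, int]]:
    """Return the (in_features, out_features) pair of each linear layer."""
    # Documented default: a single hidden layer of size 256.
    hidden = [256] if hidden_sizes is None else list(hidden_sizes)
    sizes = [input_size, *hidden, output_size]
    return list(zip(sizes[:-1], sizes[1:]))


default_shapes = layer_shapes(768, 10)     # one hidden projection, then the output
linear_shapes = layer_shapes(768, 10, [])  # empty list: a single linear layer
```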
UtteranceLevel#
- class s3prl.nn.UtteranceLevel(input_size: int, output_size: int, hidden_sizes: Optional[List[int]] = None, activation_type: Optional[str] = None, activation_conf: Optional[dict] = None, pooling_type: str = 'MeanPooling', pooling_conf: Optional[dict] = None)
Bases: Module
- Parameters:
input_size (int) – input_size
output_size (int) – output_size
hidden_sizes (List[int]) – a list of the hidden layers’ hidden sizes. Defaults to [256], which projects all different input sizes to the same dimension. Set to an empty list to use the vanilla single-layer linear model
activation_type (str) – the activation class name in torch.nn. Set to None to disable activation, making the model purely linear. Default: None
activation_conf (dict) – the arguments for initializing the activation class. Default: empty dict
pooling_type (str) – the pooling class name in s3prl.nn.pooling. Default: MeanPooling
pooling_conf (dict) – the arguments for initializing the pooling class. Default: empty dict
FrameLevelLinear#
- class s3prl.nn.FrameLevelLinear(input_size: int, output_size: int, hidden_size: int = 256)
Bases: FrameLevel
The frame-level linear probing model used in SUPERB Benchmark
MeanPoolingLinear#
- class s3prl.nn.MeanPoolingLinear(input_size: int, output_size: int, hidden_size: int = 256)
Bases: UtteranceLevel
The utterance-level linear probing model used in SUPERB Benchmark
MeanPooling#
- class s3prl.nn.MeanPooling(input_size: int)
Bases: Module
Temporal Average Pooling module: computes the mean over the time dimension.
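Mean pooling over time on a padded batch is usually computed over the valid frames only. The sketch below shows that idea in plain Python; the function name mean_pool and the toy batch are hypothetical, and whether the library masks padding in exactly this way is an assumption.

```python
from typing import List


def mean_pool(xs: List[List[List[float]]], xs_len: List[int]) -> List[List[float]]:
    """Average each utterance over its valid frames.

    xs is (batch_size, seq_len, hidden_size); xs_len is (batch_size,).
    """
    pooled = []
    for frames, length in zip(xs, xs_len):
        valid = frames[:length]  # ignore padded frames past the valid length
        hidden = len(valid[0])
        pooled.append([sum(f[d] for f in valid) / length for d in range(hidden)])
    return pooled


xs = [
    [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]],  # last frame is padding
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],
]
xs_len = [2, 3]
pooled = mean_pool(xs, xs_len)
```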
TemporalAveragePooling#
- s3prl.nn.TemporalAveragePooling
alias of MeanPooling
TemporalStatisticsPooling#
- class s3prl.nn.TemporalStatisticsPooling(input_size: int)
Bases: Module
Paper: X-vectors: Robust DNN Embeddings for Speaker Recognition
Link: http://www.danielpovey.com/files/2018_icassp_xvectors.pdf
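In the x-vector paper, statistics pooling summarizes a sequence by concatenating the per-dimension mean and standard deviation over time, so the output is twice the input size. The sketch below illustrates this for a single utterance; it uses the population standard deviation, and whether the library uses the population or the sample estimate is an assumption not settled by the text above. The name stats_pool is hypothetical.

```python
import math
from typing import List


def stats_pool(frames: List[List[float]]) -> List[float]:
    """Concatenate per-dimension mean and standard deviation over time."""
    n = len(frames)
    hidden = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(hidden)]
    # Population standard deviation (divide by n); an assumption of this sketch.
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n)
        for d in range(hidden)
    ]
    return means + stds


frames = [[1.0, 0.0], [3.0, 0.0]]  # 2 frames, hidden size 2
pooled = stats_pool(frames)
```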
RNNEncoder#
- class s3prl.nn.RNNEncoder(input_size: int, output_size: int, module: str = 'LSTM', proj_size: int = 1024, hidden_size: List[int] = [1024], dropout: List[float] = [0.0], layer_norm: List[bool] = [False], proj: List[bool] = [True], sample_rate: List[int] = [1], sample_style: str = 'drop', bidirectional: bool = False)
Bases: AbsFrameModel
RNN Encoder for sequence to sequence modeling, e.g., ASR.
- Parameters:
input_size (int) – Input size.
output_size (int) – Output size.
module (str, optional) – RNN module type. Defaults to “LSTM”.
hidden_size (List[int], optional) – Hidden sizes for each layer. Defaults to [1024].
dropout (List[float], optional) – Dropout rates for each layer. Defaults to [0.0].
layer_norm (List[bool], optional) – Whether to use layer norm for each layer. Defaults to [False].
proj (List[bool], optional) – Whether to use projection for each layer. Defaults to [True].
sample_rate (List[int], optional) – Downsample rates for each layer. Defaults to [1].
sample_style (str, optional) – Downsample style (“drop” or “concat”). Defaults to “drop”.
bidirectional (bool, optional) – Whether RNN layers are bidirectional. Defaults to False.
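The two sample_style options can be pictured as two ways of shortening a sequence by a factor of sample_rate. The sketch below assumes that “drop” keeps every rate-th frame and that “concat” stacks each group of rate consecutive frames into one wider frame, discarding any leftover frames; these semantics and the helper name downsample are assumptions for illustration, not a description of the class internals.

```python
from typing import List


def downsample(frames: List[List[float]], rate: int, style: str = "drop") -> List[List[float]]:
    """Shorten a (seq_len, hidden_size) sequence by a factor of `rate`."""
    if style == "drop":
        # Keep every rate-th frame; hidden size is unchanged.
        return frames[::rate]
    if style == "concat":
        # Stack each group of `rate` frames; hidden size grows by `rate`.
        usable = len(frames) - len(frames) % rate
        return [
            [x for k in range(rate) for x in frames[i + k]]
            for i in range(0, usable, rate)
        ]
    raise ValueError(f"unknown sample_style: {style}")


frames = [[1.0], [2.0], [3.0], [4.0], [5.0]]
dropped = downsample(frames, 2, "drop")
concatenated = downsample(frames, 2, "concat")
```

Under these assumptions, “drop” discards information but keeps the feature size, while “concat” keeps the information at the cost of a larger input to the next layer.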
SuperbDiarizationModel#
- class s3prl.nn.SuperbDiarizationModel(input_size: int, output_size: int, rnn_layers: int, hidden_size: int)
Bases: AbsFrameModel
The exact RNN model used in SUPERB Benchmark for Speaker Diarization
- Parameters:
input_size (int) – input_size
output_size (int) – output_size
rnn_layers (int) – number of rnn layers
hidden_size (int) – the hidden size across all rnn layers
amsoftmax#
- class s3prl.nn.amsoftmax(input_size: int, output_size: int, margin: float = 0.2, scale: float = 30)
Bases: Module
AMSoftmax
- Parameters:
input_size (int) – The input feature size
output_size (int) – The output feature size
margin (float) – Hyperparameter denoting the margin to the decision boundary
scale (float) – Hyperparameter that scales the cosine value
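The roles of margin and scale can be seen in the standard additive-margin softmax formulation: the cosine similarity of the target class is reduced by the margin, all cosines are multiplied by the scale, and cross-entropy is applied. The sketch below follows that textbook formulation for a single example; the function name and the toy cosine values are hypothetical, and it is not taken from the library's code.

```python
import math
from typing import List


def am_softmax_loss(cosines: List[float], target: int, margin: float = 0.2, scale: float = 30.0) -> float:
    """Cross-entropy over scaled cosines, with the margin applied to the target class."""
    logits = [
        scale * (c - margin) if i == target else scale * c
        for i, c in enumerate(cosines)
    ]
    # Stable log-sum-exp for the normalizer.
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_norm - logits[target]


# Cosine similarity between one embedding and three class weight vectors.
cosines = [0.9, 0.1, -0.3]
loss_with_margin = am_softmax_loss(cosines, target=0)
loss_no_margin = am_softmax_loss(cosines, target=0, margin=0.0)
```

The margin makes the target class harder to satisfy, so the loss with a margin is never smaller than the loss without one for the same inputs.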
softmax#
- class s3prl.nn.softmax(input_size: int, output_size: int)
Bases: Module
The standard softmax loss in a unified interface for all speaker-related softmax losses
SuperbXvector#
- class s3prl.nn.SuperbXvector(input_size: int, output_size: int = 512, hidden_size: int = 512, aggregation_size: int = 1500, dropout_p: float = 0.0, batch_norm: bool = False)
Bases: Module
The Xvector used in the SUPERB Benchmark with the exact default arguments.
- Parameters:
input_size (int) – The input feature size, usually is the output size of upstream models
output_size (int) – (default, 512) The size of the speaker embedding
hidden_size (int) – (default, 512) The major hidden size in the network
aggregation_size (int) – (default, 1500) The output size of the x-vector, which is usually large
dropout_p (float) – (default, 0.0) The dropout rate
batch_norm (bool) – (default, False) Use batch norm for TDNN layers
XVectorBackbone#
- class s3prl.nn.XVectorBackbone(input_size: int, output_size: int = 1500, dropout_p: float = 0.0, batch_norm: bool = True)
Bases: Module
The same TDNN layers as in https://danielpovey.com/files/2018_odyssey_xvector_lid.pdf.
- Parameters:
input_size (int) – The input feature size, usually is the output size of upstream models
output_size (int) – (default, 1500) The size of the speaker embedding
dropout_p (float) – (default, 0.0) The dropout rate
batch_norm (bool) – (default, True) Use batch norm for TDNN layers
BeamDecoder#
- class s3prl.nn.BeamDecoder(token: str = '', lexicon: str = '', lm: str = '', nbest: int = 1, beam: int = 5, beam_size_token: int = -1, beam_threshold: float = 25.0, lm_weight: float = 2.0, word_score: float = -1.0, unk_score: float = -inf, sil_score: float = 0.0)
Bases: object
Beam decoder powered by flashlight.
- Parameters:
token (str, optional) – Path to dictionary file. Defaults to “”.
lexicon (str, optional) – Path to lexicon file. Defaults to “”.
lm (str, optional) – Path to KenLM file. Defaults to “”.
nbest (int, optional) – Returns nbest hypotheses. Defaults to 1.
beam (int, optional) – Beam size. Defaults to 5.
beam_size_token (int, optional) – Token beam size. Defaults to -1.
beam_threshold (float, optional) – Beam search log prob threshold. Defaults to 25.0.
lm_weight (float, optional) – language model weight. Defaults to 2.0.
word_score (float, optional) – score for words appearance in the transcription. Defaults to -1.0.
unk_score (float, optional) – score for unknown word appearance in the transcription. Defaults to -math.inf.
sil_score (float, optional) – score for silence appearance in the transcription. Defaults to 0.0.
- get_tokens(idxs: Iterable) LongTensor
Normalize tokens by handling CTC blank, ASG replabels, etc.
- Parameters:
idxs (Iterable) – Token ID list output by self.decoder
- Returns:
Token ID list after normalization.
- Return type:
torch.LongTensor
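The CTC part of the normalization mentioned for get_tokens usually means collapsing consecutive repeated tokens and then removing the blank token. The sketch below shows only that rule in plain Python; the blank id of 0 and the name normalize_ctc are assumptions, ASG replabels are not covered, and this is not the method's actual implementation.

```python
from itertools import groupby
from typing import Iterable, List


def normalize_ctc(idxs: Iterable[int], blank: int = 0) -> List[int]:
    """Collapse consecutive repeats, then drop the blank token."""
    # groupby yields one key per run of identical consecutive tokens.
    collapsed = (tok for tok, _ in groupby(idxs))
    return [tok for tok in collapsed if tok != blank]


# The blank between the two runs of 3 keeps them as separate tokens.
tokens = normalize_ctc([0, 3, 3, 0, 3, 5, 5, 0])
```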