speaker_model#
(s3prl.nn.speaker_model)
Speaker verification models
- Authors:
Haibin Wu 2022
TDNN#
- class s3prl.nn.speaker_model.TDNN(input_size: int, output_size: int, context_size: int, dilation: int, dropout_p: float = 0.0, batch_norm: bool = True)[source]#
Bases:
Module
TDNN as defined by https://www.danielpovey.com/files/2015_interspeech_multisplice.pdf.
Context size and dilation determine the frames selected (although context size is not really defined in the traditional sense).
For example:
context size 5 and dilation 1 is equivalent to [-2,-1,0,1,2]
context size 3 and dilation 2 is equivalent to [-2, 0, 2]
context size 1 and dilation 1 is equivalent to [0]
- Parameters:
input_size (int) – The input feature size
output_size (int) – The output feature size
context_size (int) – See example
dilation (int) – See example
dropout_p (float) – (default, 0.0) The dropout rate
batch_norm (bool) – (default, True) Whether to use batch norm for TDNN layers
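The offset lists in the examples above follow directly from context_size and dilation. A minimal sketch of that relationship (the helper `tdnn_offsets` is illustrative and not part of s3prl):

```python
def tdnn_offsets(context_size: int, dilation: int) -> list:
    """Frame offsets, relative to the center frame, covered by a TDNN layer."""
    half = (context_size - 1) // 2
    return [dilation * (i - half) for i in range(context_size)]

print(tdnn_offsets(5, 1))  # [-2, -1, 0, 1, 2]
print(tdnn_offsets(3, 2))  # [-2, 0, 2]
print(tdnn_offsets(1, 1))  # [0]
```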
XVectorBackbone#
- class s3prl.nn.speaker_model.XVectorBackbone(input_size: int, output_size: int = 1500, dropout_p: float = 0.0, batch_norm: bool = True)[source]#
Bases:
Module
The TDNN layers, identical to those in https://danielpovey.com/files/2018_odyssey_xvector_lid.pdf.
- Parameters:
input_size (int) – The input feature size, usually the output size of upstream models
output_size (int) – (default, 1500) The size of the speaker embedding
dropout_p (float) – (default, 0.0) The dropout rate
batch_norm (bool) – (default, True) Whether to use batch norm for TDNN layers
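Because the frame-level TDNN layers are stacked, each output frame sees a fixed window of input frames. A small sketch computing that receptive field; the per-layer (context_size, dilation) pairs below are an assumption based on the x-vector paper, not read from the s3prl source:

```python
# (context_size, dilation) per frame-level layer -- assumed from the
# x-vector recipe in the linked paper, not from the s3prl implementation.
LAYERS = [(5, 1), (3, 2), (3, 3), (1, 1), (1, 1)]

def receptive_field(layers) -> int:
    """Number of input frames contributing to one output frame."""
    field = 1
    for context_size, dilation in layers:
        field += (context_size - 1) * dilation
    return field

print(receptive_field(LAYERS))  # 15
```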
ECAPA_TDNN#
- class s3prl.nn.speaker_model.ECAPA_TDNN(input_size: int = 80, output_size: int = 1536, C: int = 1024, **kwargs)[source][source]#
Bases:
Module
ECAPA-TDNN model as in https://arxiv.org/abs/2005.07143.
Reference code: https://github.com/TaoRuijie/ECAPA-TDNN.
- Parameters:
input_size (int) – The input feature size, usually the output size of upstream models
output_size (int) – (default, 1536) The size of the speaker embedding
C (int) – (default, 1024) The channel dimension
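A distinguishing piece of ECAPA-TDNN is attentive statistics pooling: frame features are aggregated with learned attention weights into a weighted mean and standard deviation. A heavily simplified, dependency-free sketch of that aggregation step, given precomputed per-frame scores (the real model learns channel-dependent scores with a small network, which this omits):

```python
import math

def attentive_stats_pool(frames, scores):
    """Simplified attentive statistics pooling.

    frames: list of T feature vectors (lists of floats)
    scores: list of T raw attention scores
    Returns the weighted mean concatenated with the weighted std (2 * dim).
    """
    # Softmax the scores over the time axis (numerically stable form).
    m = max(scores)
    exp = [math.exp(s - m) for s in scores]
    z = sum(exp)
    w = [e / z for e in exp]

    dim = len(frames[0])
    t = len(frames)
    mean = [sum(w[i] * frames[i][d] for i in range(t)) for d in range(dim)]
    var = [sum(w[i] * (frames[i][d] - mean[d]) ** 2 for i in range(t))
           for d in range(dim)]
    std = [math.sqrt(max(v, 1e-9)) for v in var]
    return mean + std

# Uniform scores reduce to a plain mean/std over frames.
print(attentive_stats_pool([[1.0], [3.0]], [0.0, 0.0]))  # [2.0, 1.0]
```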
SpeakerEmbeddingExtractor#
- class s3prl.nn.speaker_model.SpeakerEmbeddingExtractor(input_size: int, output_size: int = 1500, backbone: str = 'XVector', pooling_type: str = 'TemporalAveragePooling')[source]#
Bases:
Module
The speaker embedding extractor module.
- Parameters:
input_size (int) – The input feature size, usually the output size of upstream models
output_size (int) – (default, 1500) The size of the speaker embedding
backbone (str) – (default, XVector) Which speaker model backbone to use
pooling_type (str) – (default, TemporalAveragePooling) Which pooling method to use
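The default pooling, temporal average pooling, collapses a variable-length sequence of frame features into one fixed-size vector by averaging over time. A minimal sketch of that behavior (illustrative only, not the s3prl implementation), with an optional valid length to ignore padded frames:

```python
def temporal_average_pooling(frames, length=None):
    """Mean over the time axis of a (T, D) sequence.

    frames: list of T feature vectors (lists of floats)
    length: if given, only the first `length` frames are averaged,
            so padding frames do not dilute the mean.
    """
    valid = frames if length is None else frames[:length]
    dim = len(valid[0])
    t = len(valid)
    return [sum(f[d] for f in valid) / t for d in range(dim)]

print(temporal_average_pooling([[1.0, 2.0], [3.0, 4.0]]))  # [2.0, 3.0]
```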
SuperbXvector#
- class s3prl.nn.speaker_model.SuperbXvector(input_size: int, output_size: int = 512, hidden_size: int = 512, aggregation_size: int = 1500, dropout_p: float = 0.0, batch_norm: bool = False)[source]#
Bases:
Module
The x-vector model used in the SUPERB Benchmark, with the exact default arguments.
- Parameters:
input_size (int) – The input feature size, usually the output size of upstream models
output_size (int) – (default, 512) The size of the speaker embedding
hidden_size (int) – (default, 512) The major hidden size in the network
aggregation_size (int) – (default, 1500) The output size of the x-vector, which is usually large
dropout_p (float) – (default, 0.0) The dropout rate
batch_norm (bool) – (default, False) Use batch norm for TDNN layers
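Once a model in this module produces fixed-size speaker embeddings, verification typically reduces to comparing two embeddings with a similarity score. A common, simple choice is cosine similarity; a dependency-free sketch (the threshold and `cosine_score` helper are illustrative, not part of s3prl):

```python
import math

def cosine_score(a, b):
    """Cosine similarity between two speaker embeddings.

    Higher values mean the two utterances are more likely from the
    same speaker; a decision threshold is tuned on held-out trials.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_score([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine_score([1.0, 0.0], [0.0, 1.0]))  # 0.0
```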