speaker_model#

(s3prl.nn.speaker_model)

Speaker verification models

Authors:
  • Haibin Wu 2022

TDNN#

class s3prl.nn.speaker_model.TDNN(input_size: int, output_size: int, context_size: int, dilation: int, dropout_p: float = 0.0, batch_norm: bool = True)[source]#

Bases: Module

TDNN as defined by https://www.danielpovey.com/files/2015_interspeech_multisplice.pdf.

Context size and dilation determine the frames selected (although context size is not really defined in the traditional sense).

For example:
  • context size 5 and dilation 1 is equivalent to [-2, -1, 0, 1, 2]

  • context size 3 and dilation 2 is equivalent to [-2, 0, 2]

  • context size 1 and dilation 1 is equivalent to [0]
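
The mapping from context size and dilation to frame offsets can be sketched as follows; the context_offsets helper below is hypothetical, written only to illustrate the examples above, and is not part of s3prl:

    def context_offsets(context_size: int, dilation: int) -> list:
        # Offsets of the frames covered by a symmetric context window
        half = (context_size - 1) // 2
        return [i * dilation for i in range(-half, half + 1)]

    assert context_offsets(5, 1) == [-2, -1, 0, 1, 2]
    assert context_offsets(3, 2) == [-2, 0, 2]
    assert context_offsets(1, 1) == [0]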

Parameters:
  • input_size (int) – The input feature size

  • output_size (int) – The output feature size

  • context_size (int) – See example

  • dilation (int) – See example

  • dropout_p (float) – (default, 0.0) The dropout rate

  • batch_norm (bool) – (default, True) Whether to use batch norm for the TDNN layer

property input_size: int[source]#
property output_size: int[source]#
forward(x: Tensor)[source]#
Parameters:

x (torch.FloatTensor) – (batch, seq_len, input_size)

Returns:

(batch, seq_len, output_size)

Return type:

torch.FloatTensor
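
A minimal usage sketch (the sizes are arbitrary assumptions; depending on how the implementation handles the context window, the output sequence length may be shorter than the input by (context_size - 1) * dilation frames):

    import torch
    from s3prl.nn.speaker_model import TDNN

    # context_size=5 with dilation=1 covers the frame offsets [-2, -1, 0, 1, 2]
    tdnn = TDNN(input_size=256, output_size=512, context_size=5, dilation=1)

    x = torch.randn(4, 100, 256)  # (batch, seq_len, input_size)
    y = tdnn(x)                   # (batch, seq_len', output_size)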

XVectorBackbone#

class s3prl.nn.speaker_model.XVectorBackbone(input_size: int, output_size: int = 1500, dropout_p: float = 0.0, batch_norm: bool = True)[source]#

Bases: Module

The TDNN layers are the same as in https://danielpovey.com/files/2018_odyssey_xvector_lid.pdf.

Parameters:
  • input_size (int) – The input feature size, usually the output size of upstream models

  • output_size (int) – (default, 1500) The output feature size

  • dropout_p (float) – (default, 0.0) The dropout rate

  • batch_norm (bool) – (default, True) Whether to use batch norm for the TDNN layers

property input_size: int[source]#
property output_size: int[source]#
forward(x: Tensor)[source]#
Parameters:

x (torch.FloatTensor) – (batch, seq_len, input_size)

Returns:

(batch, seq_len, output_size)

Return type:

torch.FloatTensor
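
A minimal usage sketch; input_size=256 is an arbitrary stand-in for an upstream model's output size:

    import torch
    from s3prl.nn.speaker_model import XVectorBackbone

    backbone = XVectorBackbone(input_size=256)  # output_size defaults to 1500

    x = torch.randn(4, 100, 256)  # (batch, seq_len, input_size)
    h = backbone(x)               # (batch, seq_len, 1500) frame-level features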

ECAPA_TDNN#

class s3prl.nn.speaker_model.ECAPA_TDNN(input_size: int = 80, output_size: int = 1536, C: int = 1024, **kwargs)[source]#

Bases: Module

ECAPA-TDNN model as in https://arxiv.org/abs/2005.07143.

Reference code: https://github.com/TaoRuijie/ECAPA-TDNN.

Parameters:
  • input_size (int) – The input feature size, usually the output size of upstream models

  • output_size (int) – (default, 1536) The output feature size

  • C (int) – (default, 1024) The channel dimension

property input_size[source]#
property output_size[source]#
forward(x: FloatTensor)[source]#
Parameters:

x (torch.FloatTensor) – size (batch, seq_len, input_size)

Returns:

size (batch, seq_len, output_size)

Return type:

torch.FloatTensor
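
A minimal usage sketch with the defaults from the signature above; input_size=80 would correspond to, e.g., an 80-dimensional FBank front-end:

    import torch
    from s3prl.nn.speaker_model import ECAPA_TDNN

    model = ECAPA_TDNN(input_size=80, output_size=1536, C=1024)

    x = torch.randn(4, 200, 80)  # (batch, seq_len, input_size)
    h = model(x)                 # (batch, seq_len, output_size)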

SpeakerEmbeddingExtractor#

class s3prl.nn.speaker_model.SpeakerEmbeddingExtractor(input_size: int, output_size: int = 1500, backbone: str = 'XVector', pooling_type: str = 'TemporalAveragePooling')[source]#

Bases: Module

The speaker embedding extractor module.

Parameters:
  • input_size (int) – The input feature size, usually the output size of upstream models

  • output_size (int) – (default, 1500) The size of the speaker embedding

  • backbone (str) – (default, XVector) Which speaker model to use as the backbone

  • pooling_type (str) – (default, TemporalAveragePooling) Which pooling method to use

property input_size: int[source]#
property output_size: int[source]#
forward(x: Tensor, xlen: Optional[LongTensor] = None)[source]#
Parameters:
  • x (torch.Tensor) – size (batch, seq_len, input_size)

  • xlen (torch.LongTensor) – size (batch, )

Returns:

size (batch, output_size)

Return type:

torch.Tensor
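
A minimal usage sketch; the input size and utterance lengths are arbitrary assumptions:

    import torch
    from s3prl.nn.speaker_model import SpeakerEmbeddingExtractor

    extractor = SpeakerEmbeddingExtractor(
        input_size=256,  # usually the upstream model's output size
        output_size=1500,
        backbone="XVector",
        pooling_type="TemporalAveragePooling",
    )

    x = torch.randn(4, 100, 256)            # (batch, seq_len, input_size)
    xlen = torch.tensor([100, 90, 80, 70])  # valid length of each utterance
    emb = extractor(x, xlen)                # (batch, output_size)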

SuperbXvector#

class s3prl.nn.speaker_model.SuperbXvector(input_size: int, output_size: int = 512, hidden_size: int = 512, aggregation_size: int = 1500, dropout_p: float = 0.0, batch_norm: bool = False)[source]#

Bases: Module

The X-vector used in the SUPERB benchmark, with the exact default arguments.

Parameters:
  • input_size (int) – The input feature size, usually the output size of upstream models

  • output_size (int) – (default, 512) The size of the speaker embedding

  • hidden_size (int) – (default, 512) The major hidden size in the network

  • aggregation_size (int) – (default, 1500) The feature size right before pooling, which is usually large

  • dropout_p (float) – (default, 0.0) The dropout rate

  • batch_norm (bool) – (default, False) Use batch norm for TDNN layers

property input_size: int[source]#
property output_size: int[source]#
forward(x, x_len)[source]#
Parameters:
  • x (torch.FloatTensor) – (batch_size, seq_len, input_size)

  • x_len (torch.LongTensor) – (batch_size, )

Returns:

(batch_size, output_size)

Return type:

torch.FloatTensor
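
A minimal usage sketch with the exact SUPERB defaults; input_size=256 is an arbitrary stand-in for an upstream model's output size:

    import torch
    from s3prl.nn.speaker_model import SuperbXvector

    model = SuperbXvector(input_size=256)

    x = torch.randn(4, 100, 256)             # (batch_size, seq_len, input_size)
    x_len = torch.tensor([100, 95, 88, 60])  # (batch_size, )
    emb = model(x, x_len)                    # (batch_size, 512) speaker embedding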