speaker_model#

(s3prl.nn.speaker_model)

Speaker verification models

Authors:
  • Haibin Wu 2022

TDNN#

class s3prl.nn.speaker_model.TDNN(input_size: int, output_size: int, context_size: int, dilation: int, dropout_p: float = 0.0, batch_norm: bool = True)[source]#

Bases: Module

TDNN as defined by https://www.danielpovey.com/files/2015_interspeech_multisplice.pdf.

Context size and dilation determine the frames selected (although context size is not really defined in the traditional sense).

For example:
  • context size 5 and dilation 1 is equivalent to [-2, -1, 0, 1, 2]

  • context size 3 and dilation 2 is equivalent to [-2, 0, 2]

  • context size 1 and dilation 1 is equivalent to [0]
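
The mapping from context size and dilation to frame offsets can be sketched as follows; the context_offsets helper below is hypothetical, written only to illustrate the examples above, and is not part of s3prl:

    def context_offsets(context_size: int, dilation: int) -> list:
        # Offsets of the frames covered by a symmetric context window
        half = (context_size - 1) // 2
        return [i * dilation for i in range(-half, half + 1)]

    assert context_offsets(5, 1) == [-2, -1, 0, 1, 2]
    assert context_offsets(3, 2) == [-2, 0, 2]
    assert context_offsets(1, 1) == [0]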

Parameters:
  • input_size (int) – The input feature size

  • output_size (int) – The output feature size

  • context_size (int) – See example

  • dilation (int) – See example

  • dropout_p (float) – (default, 0.0) The dropout rate

  • batch_norm (bool) – (default, True) Whether to use batch norm for the TDNN layer

property input_size: int[source]#
property output_size: int[source]#
forward(x: Tensor)[source]#
Parameters:

x (torch.FloatTensor) – (batch, seq_len, input_size)

Returns:

(batch, seq_len, output_size)

Return type:

torch.FloatTensor
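
A minimal usage sketch (the sizes are arbitrary assumptions; depending on how the implementation handles the context window, the output sequence length may be shorter than the input by (context_size - 1) * dilation frames):

    import torch
    from s3prl.nn.speaker_model import TDNN

    # context_size=5 with dilation=1 covers the frame offsets [-2, -1, 0, 1, 2]
    tdnn = TDNN(input_size=256, output_size=512, context_size=5, dilation=1)

    x = torch.randn(4, 100, 256)  # (batch, seq_len, input_size)
    y = tdnn(x)                   # (batch, seq_len', output_size)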

XVectorBackbone#

class s3prl.nn.speaker_model.XVectorBackbone(input_size: int, output_size: int = 1500, dropout_p: float = 0.0, batch_norm: bool = True)[source]#

Bases: Module

The TDNN layers are the same as in https://danielpovey.com/files/2018_odyssey_xvector_lid.pdf.

Parameters:
  • input_size (int) – The input feature size, usually the output size of upstream models

  • output_size (int) – (default, 1500) The output feature size

  • dropout_p (float) – (default, 0.0) The dropout rate

  • batch_norm (bool) – (default, True) Whether to use batch norm for the TDNN layers

property input_size: int[source]#
property output_size: int[source]#
forward(x: Tensor)[source]#
Parameters:

x (torch.FloatTensor) – (batch, seq_len, input_size)

Returns:

(batch, seq_len, output_size)

Return type:

torch.FloatTensor
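
A minimal usage sketch; input_size=256 is an arbitrary stand-in for an upstream model's output size:

    import torch
    from s3prl.nn.speaker_model import XVectorBackbone

    backbone = XVectorBackbone(input_size=256)  # output_size defaults to 1500

    x = torch.randn(4, 100, 256)  # (batch, seq_len, input_size)
    h = backbone(x)               # (batch, seq_len, 1500) frame-level features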

ECAPA_TDNN#

class s3prl.nn.speaker_model.ECAPA_TDNN(input_size: int = 80, output_size: int = 1536, C: int = 1024, **kwargs)[source]#

Bases: Module

ECAPA-TDNN model as in https://arxiv.org/abs/2005.07143.

Reference code: https://github.com/TaoRuijie/ECAPA-TDNN.

Parameters:
  • input_size (int) – The input feature size, usually the output size of upstream models

  • output_size (int) – (default, 1536) The output feature size

  • C (int) – (default, 1024) The channel dimension

property input_size[source]#
property output_size[source]#
forward(x: FloatTensor)[source]#
Parameters:

x (torch.FloatTensor) – size (batch, seq_len, input_size)

Returns:

size (batch, seq_len, output_size)

Return type:

torch.FloatTensor
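
A minimal usage sketch with the defaults from the signature above; input_size=80 would correspond to, e.g., an 80-dimensional FBank front-end:

    import torch
    from s3prl.nn.speaker_model import ECAPA_TDNN

    model = ECAPA_TDNN(input_size=80, output_size=1536, C=1024)

    x = torch.randn(4, 200, 80)  # (batch, seq_len, input_size)
    h = model(x)                 # (batch, seq_len, output_size)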

SpeakerEmbeddingExtractor#

class s3prl.nn.speaker_model.SpeakerEmbeddingExtractor(input_size: int, output_size: int = 1500, backbone: str = 'XVector', pooling_type: str = 'TemporalAveragePooling')[source]#

Bases: Module

The speaker embedding extractor module.

Parameters:
  • input_size (int) – The input feature size, usually the output size of upstream models

  • output_size (int) – (default, 1500) The size of the speaker embedding

  • backbone (str) – (default, XVector) Which speaker model to use as the backbone

  • pooling_type (str) – (default, TemporalAveragePooling) Which pooling method to use

property input_size: int[source]#
property output_size: int[source]#
forward(x: Tensor, xlen: Optional[LongTensor] = None)[source]#
Parameters:
  • x (torch.Tensor) – size (batch, seq_len, input_size)

  • xlen (torch.LongTensor) – size (batch, )

Returns:

size (batch, output_size)

Return type:

torch.Tensor
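
A minimal usage sketch; the input size and utterance lengths are arbitrary assumptions:

    import torch
    from s3prl.nn.speaker_model import SpeakerEmbeddingExtractor

    extractor = SpeakerEmbeddingExtractor(
        input_size=256,  # usually the upstream model's output size
        output_size=1500,
        backbone="XVector",
        pooling_type="TemporalAveragePooling",
    )

    x = torch.randn(4, 100, 256)            # (batch, seq_len, input_size)
    xlen = torch.tensor([100, 90, 80, 70])  # valid length of each utterance
    emb = extractor(x, xlen)                # (batch, output_size)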

SuperbXvector#

class s3prl.nn.speaker_model.SuperbXvector(input_size: int, output_size: int = 512, hidden_size: int = 512, aggregation_size: int = 1500, dropout_p: float = 0.0, batch_norm: bool = False)[source]#

Bases: Module

The X-vector used in the SUPERB benchmark, with the exact default arguments.

Parameters:
  • input_size (int) – The input feature size, usually the output size of upstream models

  • output_size (int) – (default, 512) The size of the speaker embedding

  • hidden_size (int) – (default, 512) The major hidden size in the network

  • aggregation_size (int) – (default, 1500) The feature size right before pooling, which is usually large

  • dropout_p (float) – (default, 0.0) The dropout rate

  • batch_norm (bool) – (default, False) Use batch norm for TDNN layers

property input_size: int[source]#
property output_size: int[source]#
forward(x, x_len)[source]#
Parameters:
  • x (torch.FloatTensor) – (batch_size, seq_len, input_size)

  • x_len (torch.LongTensor) – (batch_size, )

Returns:

(batch_size, output_size)

Return type:

torch.FloatTensor
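
A minimal usage sketch with the exact SUPERB defaults; input_size=256 is an arbitrary stand-in for an upstream model's output size:

    import torch
    from s3prl.nn.speaker_model import SuperbXvector

    model = SuperbXvector(input_size=256)

    x = torch.randn(4, 100, 256)             # (batch_size, seq_len, input_size)
    x_len = torch.tensor([100, 95, 88, 60])  # (batch_size, )
    emb = model(x, x_len)                    # (batch_size, 512) speaker embedding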