superb_asr#
(s3prl.problem.asr.superb_asr)
The setting of Superb ASR
- Authors
Heng-Jui Chang 2022
Leo 2022
SuperbASR#
- class s3prl.problem.asr.superb_asr.SuperbASR[source]#
Bases:
ASR
- default_config() dict [source]#
The default arguments for run in yaml. Note that for a field with inner values, like build_model, the outer field name corresponds to a method name, so you can find the method build_model. The values inside that field are passed directly into the method, so changing these inner values directly affects the behavior of the corresponding method. See each method's documentation for all the supported arguments and their meanings.
The methods affected by the following config are:
prepare_data
prepare_tokenizer_data
build_tokenizer
build_dataset
build_batch_sampler
build_upstream
build_featurizer
build_downstream
build_model
build_task
build_optimizer
build_scheduler
save_model
save_task
train
start: 0
stop: null
target_dir: ???
cache_dir: null
remove_all_cache: false
prepare_data:
  dataset_root: ???
  train_sets:
    - train-clean-100
  valid_sets:
    - dev-clean
  test_sets:
    - test-clean
prepare_tokenizer_data: {}
build_tokenizer:
  vocab_type: character
build_dataset: {}
build_batch_sampler:
  train:
    batch_size: 32
    max_length: 2000
    shuffle: true
  valid:
    batch_size: 1
  test:
    batch_size: 1
build_upstream:
  name: ???
build_featurizer:
  layer_selections: null
  normalize: false
build_downstream:
  model_conf:
    module: LSTM
    proj_size: 1024
    hidden_size:
      - 1024
      - 1024
    dropout:
      - 0.2
      - 0.2
    layer_norm:
      - false
      - false
    proj:
      - false
      - false
    sample_rate:
      - 1
      - 1
    sample_style: concat
    bidirectional: true
  specaug_conf:
    freq_mask_width_range: !!python/tuple
      - 0
      - 50
    num_freq_mask: 4
    time_mask_width_range: !!python/tuple
      - 0
      - 40
    num_time_mask: 2
build_model:
  upstream_trainable: false
build_task:
  log_metrics:
    - cer
    - wer
build_optimizer:
  name: Adam
  conf:
    lr: 0.0001
build_scheduler:
  name: ExponentialLR
  gamma: 0.9
save_model:
  extra_conf:
    build_downstream_conf: ${build_downstream}
save_task: {}
train:
  total_steps: 200000
  log_step: 100
  eval_step: 2000
  save_step: 500
  gradient_clipping: 1.0
  gradient_accumulate: 1
  valid_metric: wer
  valid_higher_better: false
  auto_resume: true
  resume_ckpt_dir: null
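For illustration, the fields marked ??? must be filled in before running. The sketch below assumes the Python-level usage of run described later on this page; the paths and the upstream name are placeholders, not required values.

```python
from s3prl.problem.asr.superb_asr import SuperbASR

# A minimal sketch (not the only entry point): take the default config,
# fill in the required fields, and launch the recipe.
# All paths and the upstream name below are placeholders.
problem = SuperbASR()
config = problem.default_config()
config["target_dir"] = "exp/superb_asr"                       # where results are stored
config["prepare_data"]["dataset_root"] = "/data/LibriSpeech"  # your LibriSpeech root
config["build_upstream"]["name"] = "fbank"                    # any registered upstream name
problem.run(**config)
```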
- prepare_data(prepare_data: dict, target_dir: str, cache_dir: str, get_path_only: bool = False)[source]#
Prepare the task-specific data metadata (path, labels…). By default, calls prepare_librispeech with **prepare_data.
- Parameters:
prepare_data (dict) – same as in default_config, supports the arguments of prepare_librispeech
target_dir (str) – Parse your corpus and save the csv files into this directory
cache_dir (str) – If the parsing or preprocessing takes a long time, you can save the temporary files into this directory. This directory is expected to be shared across different training sessions (different hyperparameters and target_dir)
get_path_only (bool) – Directly return the file paths whether or not they exist
- Returns:
tuple
train_path (str)
valid_path (str)
test_paths (List[str])
Each path (str) should be a csv file containing the following columns:
id: (str) - the unique id for this data point
wav_path: (str) - the absolute path of the waveform file
transcription: (str) - a text string
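As an illustration, such a csv could be written as in the sketch below; the id, wav_path, and transcription values are made up and only show the expected schema.

```python
import pandas as pd

# A hypothetical train.csv following the columns above (values are illustrative).
df = pd.DataFrame(
    {
        "id": ["example-0001"],
        "wav_path": ["/data/LibriSpeech/train-clean-100/example-0001.flac"],
        "transcription": ["HELLO WORLD"],
    }
)
df.to_csv("train.csv", index=False)
```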
- prepare_tokenizer_data(prepare_tokenizer_data: dict, target_dir: str, cache_dir: str, train_csv: str, valid_csv: str, test_csvs: List[str], get_path_only: bool = False)[source]#
Prepare the text file used for training the tokenizer. By default, only the transcriptions in the train_csv returned from prepare_data are used. The default prepare_tokenizer_data prepares the text used to train a character-based tokenizer.
- Parameters:
prepare_tokenizer_data (dict) – same as in default_config, no supported argument for now
target_dir (str) – Save the text file into this directory
cache_dir (str) – If the parsing or preprocessing takes a long time, you can save the temporary files into this directory. This directory is expected to be shared across different training sessions (different hyperparameters and target_dir)
train_csv (str) – The train data given by prepare_data
get_path_only (bool) – Directly return the filepath whether or not it exists
- Returns:
str
The text file path. The text file should be in the format:
This is the first line
This is the second line
These are all text used for training tokenizer
- build_tokenizer(build_tokenizer: dict, target_dir: str, cache_dir: str, tokenizer_data_path: str, get_path_only: bool = False)[source]#
Build the tokenizer from the data prepared by prepare_tokenizer_data. By default, calls prepare_common_tokenizer with **build_tokenizer.
- Parameters:
build_tokenizer (dict) – same as in default_config, arguments for prepare_common_tokenizer
target_dir (str) – Current experiment directory
cache_dir (str) – If the parsing or preprocessing takes a long time, you can save the temporary files into this directory. This directory is expected to be shared across different training sessions (different hyperparameters and target_dir)
tokenizer_data_path (str) – The text file from prepare_tokenizer_data
get_path_only (bool) – Directly return the filepath whether or not it exists
- Returns:
str
filepath of the pickled
s3prl.dataio.encoder.tokenizer.Tokenizer
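For example, the returned tokenizer file can be loaded with pickle, as in the sketch below; the path is hypothetical and the encode/decode calls assume the standard Tokenizer interface.

```python
import pickle

# A sketch: load the pickled tokenizer produced by build_tokenizer.
# "exp/superb_asr/tokenizer.pkl" is a hypothetical path.
with open("exp/superb_asr/tokenizer.pkl", "rb") as f:
    tokenizer = pickle.load(f)

ids = tokenizer.encode("HELLO WORLD")  # text -> class ids (assumed interface)
text = tokenizer.decode(ids)           # class ids -> text
```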
- build_dataset(build_dataset: dict, target_dir: str, cache_dir: str, mode: str, data_csv: str, tokenizer_path: str)[source]#
Build the dataset for train/valid/test.
- Parameters:
build_dataset (dict) – same as in default_config, not used
target_dir (str) – Current experiment directory
cache_dir (str) – If the preprocessing takes a long time, you can save the temporary files into this directory. This directory is expected to be shared across different training sessions (different hyperparameters and target_dir)
mode (str) – train/valid/test
data_csv (str) – The metadata csv file for the specific mode
tokenizer_path (str) – The pickled tokenizer path for encoding the transcriptions
- Returns:
torch Dataset
For all of the train/valid/test modes, the dataset should return each item as a dictionary containing the following keys:
x: (torch.FloatTensor) - the waveform in (seq_len, 1)
x_len: (int) - the waveform length seq_len
class_ids: (torch.LongTensor) - the encoded class ids of a transcription (sentence)
labels: (str) - the text transcription
unique_name: (str) - the unique id for this datapoint
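Concretely, a single item could look like the sketch below; all tensors and values are made up and only illustrate the expected keys and shapes.

```python
import torch

# A hypothetical dataset item following the keys above (values are illustrative).
item = {
    "x": torch.randn(16000, 1),        # 1 second of 16 kHz audio, shape (seq_len, 1)
    "x_len": 16000,                    # the waveform length seq_len
    "class_ids": torch.LongTensor([8, 5, 12, 12, 15]),  # encoded transcription
    "labels": "HELLO",                 # the raw text transcription
    "unique_name": "example-0001",     # the unique id of this datapoint
}
```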
- build_batch_sampler(build_batch_sampler: dict, target_dir: str, cache_dir: str, mode: str, data_csv: str, dataset: Dataset)[source]#
Return the batch sampler for the torch DataLoader.
- Parameters:
build_batch_sampler (dict) – same as in default_config, with the following keys:
train: (dict) - arguments for SortedBucketingSampler
valid: (dict) - arguments for FixedBatchSizeBatchSampler
test: (dict) - arguments for FixedBatchSizeBatchSampler
target_dir (str) – Current experiment directory
cache_dir (str) – If the preprocessing takes a long time, save the temporary files into this directory. This directory is expected to be shared across different training sessions (different hyperparameters and target_dir)
mode (str) – train/valid/test
data_csv (str) – the mode-specific csv from prepare_data
dataset – the dataset from build_dataset
- Returns:
batch sampler for torch DataLoader
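The returned sampler is typically wired into a torch DataLoader together with the dataset from build_dataset and the collate_fn from build_collate_fn, roughly as in the sketch below; the three objects are assumed to come from those methods.

```python
from torch.utils.data import DataLoader

# A sketch of the wiring; dataset, batch_sampler, and collate_fn are assumed
# to come from build_dataset, build_batch_sampler, and build_collate_fn.
loader = DataLoader(
    dataset,
    batch_sampler=batch_sampler,
    collate_fn=collate_fn,
    num_workers=6,
)
```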
- build_downstream(build_downstream: dict, downstream_input_size: int, downstream_output_size: int, downstream_input_stride: int)[source]#
Return the task-specific downstream model. By default, builds the RNNEncoder model wrapped with ModelWithSpecaug.
- Parameters:
build_downstream (dict) – same as in default_config, has two keys: model_conf holds the arguments for RNNEncoder; specaug_conf holds the arguments for ModelWithSpecaug
downstream_input_size (int) – the required input size of the model
downstream_output_size (int) – the required output size of the model
downstream_input_stride (int) – the input feature's stride (from 16 kHz)
- Returns:
- build_collate_fn(build_collate_fn: dict, mode: str)[source]#
By default, returns s3prl.dataset.base.default_collate_fn.
- Parameters:
build_collate_fn (dict) – same as in default_config, no argument supported for now
mode (str) – train, valid, or test
- Returns:
callable
the collate_fn for the torch DataLoader in train/valid/test mode
- build_featurizer(build_featurizer: dict, upstream)[source]#
By default, builds the featurizer with s3prl.nn.Featurizer.
- Parameters:
build_featurizer (dict) – same as in default_config, arguments for s3prl.nn.Featurizer
upstream (AbsUpstream) – the upstream model built by build_upstream
- Returns:
s3prl.nn.interface.AbsFeaturizer
Return the featurizer model. The featurizer reduces the multiple hidden states returned by the upstream model (built by build_upstream) into a single hidden state, so it can easily be fed into the downstream model
- build_model(build_model: dict, model_output_size: int, build_upstream: dict, build_featurizer: dict, build_downstream: dict)[source]#
By default, builds the model with s3prl.nn.upstream.UpstreamDownstreamModel.
- Parameters:
build_model (dict) – same as in default_config, arguments for s3prl.nn.upstream.UpstreamDownstreamModel
model_output_size (int) – the required output hidden size of the model
build_upstream (dict) – same as in default_config, refer to build_upstream
build_featurizer (dict) – same as in default_config, refer to build_featurizer
build_downstream (dict) – same as in default_config, refer to build_downstream
- Returns:
torch.nn.Module
Return the entire model for the task, which takes the items from the DataLoader directly as input. Usually, the components are built by build_upstream, build_featurizer, and build_downstream, and are concatenated together to form the final model. The upstream extracts multiple hidden states, the featurizer reduces them into a single hidden state, and the downstream takes that hidden state as the feature for the downstream-specific model.
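The sketch below illustrates the upstream-to-featurizer part of this composition with the public s3prl building blocks; the upstream name "fbank" is only an example and the downstream part is omitted.

```python
import torch
from s3prl.nn import S3PRLUpstream, Featurizer

# A rough sketch of the upstream -> featurizer stage of the composition.
upstream = S3PRLUpstream("fbank")      # example upstream name
featurizer = Featurizer(upstream)

wavs = torch.randn(2, 16000)           # (batch_size, num_samples) at 16 kHz
wavs_len = torch.LongTensor([16000, 16000])

all_hs, all_hs_len = upstream(wavs, wavs_len)  # multiple hidden states
hs, hs_len = featurizer(all_hs, all_hs_len)    # reduced to a single hidden state
print(hs.shape)                                # (batch_size, num_frames, featurizer.output_size)
```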
- build_optimizer(build_optimizer: dict, parameters)[source]#
- Parameters:
build_optimizer (dict) – same as in default_config, with the following keys:
name: (str) - the optimizer class name in torch.optim
conf: (dict) - the arguments for initializing the optimizer class, e.g. {"lr": 1.0e-4}
parameters (iterable) – the standard params accepted by torch.optim.Optimizer.
- Returns:
torch.optim.Optimizer
An optimizer following standard torch usage
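With the default config (name: Adam, conf: {lr: 1.0e-4}), the name/conf convention amounts to roughly the following sketch; "model" is assumed to be the module built by build_model.

```python
import torch

# A sketch of the name/conf resolution; "model" is an assumed torch.nn.Module.
optimizer_cls = getattr(torch.optim, "Adam")              # from "name"
optimizer = optimizer_cls(model.parameters(), lr=1.0e-4)  # from "conf"
```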
- build_scheduler(build_scheduler: dict, optimizer)[source]#
- Parameters:
build_scheduler (dict) – same as in default_config, with the following keys:
name: (str) - the scheduler class name in torch.optim.lr_scheduler
conf: (dict) - the arguments for initializing the scheduler class, e.g. {"gamma": 0.01} for torch.optim.lr_scheduler.StepLR
optimizer – the standard torch optimizer accepted by the schedulers in torch.optim.lr_scheduler.
- Returns:
torch scheduler
A scheduler following standard torch usage
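Similarly, the scheduler's name/conf convention resolves roughly as in the sketch below; "optimizer" is assumed to come from build_optimizer.

```python
import torch

# A sketch of the scheduler resolution; "optimizer" is assumed to exist already.
scheduler_cls = getattr(torch.optim.lr_scheduler, "ExponentialLR")  # from "name"
scheduler = scheduler_cls(optimizer, gamma=0.9)                     # its arguments
```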
- build_upstream(build_upstream: dict)[source]#
By default, builds the upstream with s3prl.nn.upstream.S3PRLUpstream.
- Parameters:
build_upstream (dict) – same as in default_config, arguments for s3prl.nn.upstream.S3PRLUpstream
- Returns:
s3prl.nn.interface.AbsUpstream
Return an upstream model, whose forward takes the waveform input and returns multiple hidden states as features.
- evaluate(evaluate: dict, mode: str, task, dataset, batch_sampler, collate_fn, eval_batch: int, dump_dir: str, device: str, num_workers: int)[source]#
The evaluation routine used by train (during the validation phase) and run (during the testing phase).
- Parameters:
evaluate (dict) – same as in default_config, no argument supported for now
**others – only meaningful when you want to override this method, which is not the common case. Hence we skip the documentation for now.
- classmethod get_class_from_name(name: str)[source]#
- Parameters:
name (str) – the __name__ of the problem class
- Returns:
Problem
- load_model(model_ckpt_dir: str)[source]#
Return the saved model.
- Parameters:
model_ckpt_dir (str) – Restore the model with build_model and the checkpoint saved in this directory.
- Returns:
torch.nn.Module
- load_model_and_task(ckpts_dir: str, task_overrides: Optional[dict] = None)[source]#
This is a helper method that combines load_model and load_task to directly load both the model and the task. This method assumes the model is saved under ckpts_dir / 'model' and the task is saved under ckpts_dir / 'task'
- Returns:
tuple
model (torch.nn.Module)
task (s3prl.task.Task)
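A typical use is loading a finished experiment for inference, as in the sketch below; the checkpoint directory is hypothetical.

```python
from s3prl.problem.asr.superb_asr import SuperbASR

# A sketch: load a trained model and task from a checkpoint directory that
# contains the 'model' and 'task' subdirectories mentioned above.
# "exp/superb_asr/valid_best" is a hypothetical path.
problem = SuperbASR()
model, task = problem.load_model_and_task("exp/superb_asr/valid_best")
model.eval()
```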
- load_task(task_ckpt_dir: str, model: Module, task_overrides: Optional[dict] = None)[source]#
Return the saved task.
- Parameters:
task_ckpt_dir (str) – Restore the task with build_task and the checkpoint saved in this directory.
model (torch.nn.Module) – the model for the task; the model is saved separately and is required by build_task.
task_overrides (dict) – overrides the saved initialization arguments, so you can change the loaded task's behavior, e.g. the decoding hyperparameters.
- Returns:
- run(target_dir: str, cache_dir: str, remove_all_cache: bool = False, start: int = 0, stop: Optional[int] = None, num_workers: int = 6, eval_batch: int = -1, device: str = 'cuda', world_size: int = 1, rank: int = 0, test_ckpt_dir: Optional[str] = None, prepare_data: Optional[dict] = None, prepare_tokenizer_data: Optional[dict] = None, build_tokenizer: Optional[dict] = None, build_dataset: Optional[dict] = None, build_batch_sampler: Optional[dict] = None, build_collate_fn: Optional[dict] = None, build_upstream: Optional[dict] = None, build_featurizer: Optional[dict] = None, build_downstream: Optional[dict] = None, build_model: Optional[dict] = None, build_task: Optional[dict] = None, build_optimizer: Optional[dict] = None, build_scheduler: Optional[dict] = None, save_model: Optional[dict] = None, save_task: Optional[dict] = None, train: Optional[dict] = None, evaluate: Optional[dict] = None)[source]#
stage 0: Parse the corpus and save the metadata file for ASR (waveform path, label…)
stage 1: Prepare the metadata file for training the tokenizer
stage 2: Train the tokenizer
stage 3: Train the ASR model
stage 4: Evaluate the model on multiple test sets; multiple checkpoints will be evaluated for each test set (see test_ckpt_steps)
- Parameters:
target_dir (str) – The directory that stores the script result.
cache_dir (str) – The directory that caches the processed data. Default: /home/user/.cache/s3prl/data
remove_all_cache (bool) – Whether to remove all the cache stored under cache_dir. Default: False
start (int) – The starting stage of the problem script. Default: 0
stop (int) – The stopping stage of the problem script; set None to reach the final stage. Default: None
num_workers (int) – num_workers for all the torch DataLoaders
eval_batch (int) – During evaluation (valid or test), limit the number of batches. This is helpful for fast development to check that nothing crashes. If -1, disable this feature and evaluate the entire epoch. Default: -1
device (str) – The device type for all torch-related operations: "cpu" or "cuda". Default: "cuda"
world_size (int) – How many processes are running this script simultaneously (in parallel). Usually this is just 1; however, if you are running distributed training, this should be > 1. Default: 1
rank (int) – Used when doing distributed training, where world_size > 1. Take world_size == 8 for example: this means 8 processes (8 GPUs) are running in parallel, and the script needs to know which of the 8 processes it is. In this case, rank can range from 0 to 7. All 8 processes have the same world_size but a different rank (process id).
test_ckpt_dir (str) – Specify the checkpoint path for testing. If not given, use the checkpoints specified by test_ckpts_steps.
**others – The other arguments, like prepare_data and build_model, are method-specific arguments for methods like prepare_data and build_model, and are not used in the core run logic. See the specific method documentation for their supported arguments and meanings
- save_model(save_model: dict, model_ckpt_dir: str, build_model_all_args: dict, model: Module)[source]#
Save the model state_dict and the model initialization arguments into the given directory. If you override this method, you most likely also need to override load_model.
- Parameters:
save_model (dict) – same as in default_config, so the user can save additional settings, like the configuration of the dataset, by duplicating the dataset hyperparameters inside the save_model field. You can rely on the omegaconf package to simplify the duplication.
model_ckpt_dir (str) – save the model into this directory.
build_model_all_args (dict) – all the arguments of build_model. By saving this dictionary, you can easily reconstruct the same model by calling build_model with the saved dictionary.
model (torch.nn.Module) – the model to be saved.
- Returns:
None
- save_task(save_task: dict, task_ckpt_dir: str, build_task_all_args_except_model: dict, task: Task)[source]#
Save the task's state, task.get_state(), and the initialization arguments into the given directory. If you override this method, you most likely also need to override load_task.
- Parameters:
save_task (dict) – same as in default_config, so the user can save additional settings, like the configuration of the dataset, by duplicating the dataset hyperparameters inside the save_task field. You can rely on the omegaconf package to simplify the duplication.
task_ckpt_dir (str) – save the task into this directory.
build_task_all_args_except_model (dict) – all the arguments of build_task except the model argument, since the model should be separately saved by save_model. By saving this dictionary, you can easily reconstruct the same task by calling build_task with the saved dictionary.
task (Task) – the task to be saved.
- Returns:
None
- train(train: dict, train_dir: str, build_model_all_args: dict, build_task_all_args_except_model: dict, save_model: dict, save_task: dict, build_optimizer: dict, build_scheduler: dict, evaluate: dict, train_dataset, train_batch_sampler, train_collate_fn, valid_dataset, valid_batch_sampler, valid_collate_fn, num_workers: int, world_size: int, rank: int, eval_batch: int, device: str, global_config: Optional[dict] = None)[source]#
- Parameters:
train (dict) – same as in default_config, with the following keys:
total_steps: (int) - the total optimization steps
log_step: (int) - logging frequency; log every log_step steps
eval_step: (int) - evaluation frequency; evaluate every eval_step steps. Note that you can control how many batches to evaluate to speed up development with the eval_batch argument of run
save_step: (int) - save a checkpoint every save_step steps
gradient_clipping: (float) - clip the gradient; important for RNNs
gradient_accumulate: (int) - accumulate the gradient of multiple steps before updating the network parameters, to simulate large-batch optimization
valid_metric: (str) - the metric used to select the best validation checkpoint. Different Tasks support different valid_metrics; see build_task for the supported metrics
valid_higher_better: (bool) - some metrics are higher-better while others are lower-better; this affects how the best validation checkpoint is saved
auto_resume: (bool) - if the last checkpoint already exists in target_dir (see run), whether to resume from it or delete it and start a new training session
resume_ckpt_dir: (str) - directly specify a checkpoint path to resume from, which does not need to be inside target_dir (see run)
seed: (int) - fix the random seed before training starts
keep_num_ckpts: (int) - to avoid saving too many checkpoints, keep only the keep_num_ckpts latest checkpoints and delete older ones
use_scheduler: (bool) - whether to use the scheduler
**others – only meaningful when you want to override this train method, which is not the common case. Hence we skip the documentation for now.
prepare_librispeech#
- s3prl.problem.asr.superb_asr.prepare_librispeech(target_dir, cache_dir, dataset_root, train_sets: List[str], valid_sets: List[str], test_sets: List[str], n_jobs: int = 6, get_path_only: bool = False)[source]#
Prepare LibriSpeech for ASR following the SuperbASR.prepare_data format. See LibriSpeech for the usage of the arguments
prepare_common_tokenizer#
- s3prl.problem.asr.superb_asr.prepare_common_tokenizer(target_dir, cache_dir, tokenizer_data_path, get_path_only=False, tokenizer_name: Optional[str] = None, vocab_file: Optional[str] = None, vocab_type: str = 'character', vocab_args: Optional[dict] = None, slots_file: Optional[str] = None)[source]#
Build the tokenizer following the SuperbASR.build_tokenizer format
- Parameters:
tokenizer_name (str) – Save the tokenizer file with this filename
vocab_file (str) – When the tokenizer was already prepared and you just want to load and return the tokenizer here. Path or URL
vocab_type (str) – character / phoneme / word / subword
vocab_args (dict) –
when vocab_type is character / phoneme / word, supports the arguments of the corresponding vocabulary generation function
when vocab_type is subword, supports the arguments of the subword model training function
slots_file (str) – If provided, the pre-defined slots will be used to encode the special tokens. Path or URL
- Returns:
str
tokenizer path
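For example, building a character tokenizer directly from a prepared text file could look like the sketch below; all paths are hypothetical.

```python
from s3prl.problem.asr.superb_asr import prepare_common_tokenizer

# A sketch following the signature above; all paths are hypothetical.
tokenizer_path = prepare_common_tokenizer(
    target_dir="exp/superb_asr",
    cache_dir="exp/cache",
    tokenizer_data_path="exp/superb_asr/tokenizer_data.txt",
    vocab_type="character",
)
print(tokenizer_path)  # path to the pickled Tokenizer
```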