superb_asr#
(s3prl.problem.asr.superb_asr)
The setting of Superb ASR
- Authors
Heng-Jui Chang 2022
Leo 2022
SuperbASR#
- class s3prl.problem.asr.superb_asr.SuperbASR[source]#
Bases:
ASR
- default_config() dict [source]#
The default arguments for run in yaml. Note that for a field with inner values, like build_model, the outer field name corresponds to a method name, so you can find the method build_model. The values inside that field are passed directly into the method, so changing these inner values directly affects the behavior of the corresponding method. See each method's documentation for all the supported arguments and their meanings.
The methods affected by the following config are:
prepare_data
prepare_tokenizer_data
build_tokenizer
build_dataset
build_batch_sampler
build_upstream
build_featurizer
build_downstream
build_model
build_task
build_optimizer
build_scheduler
save_model
save_task
train
start: 0
stop: null
target_dir: ???
cache_dir: null
remove_all_cache: false
prepare_data:
  dataset_root: ???
  train_sets:
    - train-clean-100
  valid_sets:
    - dev-clean
  test_sets:
    - test-clean
prepare_tokenizer_data: {}
build_tokenizer:
  vocab_type: character
build_dataset: {}
build_batch_sampler:
  train:
    batch_size: 32
    max_length: 2000
    shuffle: true
  valid:
    batch_size: 1
  test:
    batch_size: 1
build_upstream:
  name: ???
build_featurizer:
  layer_selections: null
  normalize: false
build_downstream:
  model_conf:
    module: LSTM
    proj_size: 1024
    hidden_size:
      - 1024
      - 1024
    dropout:
      - 0.2
      - 0.2
    layer_norm:
      - false
      - false
    proj:
      - false
      - false
    sample_rate:
      - 1
      - 1
    sample_style: concat
    bidirectional: true
  specaug_conf:
    freq_mask_width_range: !!python/tuple
      - 0
      - 50
    num_freq_mask: 4
    time_mask_width_range: !!python/tuple
      - 0
      - 40
    num_time_mask: 2
build_model:
  upstream_trainable: false
build_task:
  log_metrics:
    - cer
    - wer
build_optimizer:
  name: Adam
  conf:
    lr: 0.0001
build_scheduler:
  name: ExponentialLR
  gamma: 0.9
save_model:
  extra_conf:
    build_downstream_conf: ${build_downstream}
save_task: {}
train:
  total_steps: 200000
  log_step: 100
  eval_step: 2000
  save_step: 500
  gradient_clipping: 1.0
  gradient_accumulate: 1
  valid_metric: wer
  valid_higher_better: false
  auto_resume: true
  resume_ckpt_dir: null
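For illustration, the fields marked ??? must be filled in before running. The sketch below assumes the Python-level usage of run described later on this page; the paths and the upstream name are placeholders, not required values.

```python
from s3prl.problem.asr.superb_asr import SuperbASR

# A minimal sketch (not the only entry point): take the default config,
# fill in the required fields, and launch the recipe.
# All paths and the upstream name below are placeholders.
problem = SuperbASR()
config = problem.default_config()
config["target_dir"] = "exp/superb_asr"                       # where results are stored
config["prepare_data"]["dataset_root"] = "/data/LibriSpeech"  # your LibriSpeech root
config["build_upstream"]["name"] = "fbank"                    # any registered upstream name
problem.run(**config)
```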
- prepare_data(prepare_data: dict, target_dir: str, cache_dir: str, get_path_only: bool = False)[source]#
Prepare the task-specific data metadata (path, labels…). By default, calls prepare_librispeech with **prepare_data.
- Parameters:
prepare_data (dict) – same as in default_config, supports the arguments of prepare_librispeech
target_dir (str) – Parse your corpus and save the csv files into this directory
cache_dir (str) – If the parsing or preprocessing takes a long time, you can save the temporary files into this directory. This directory is expected to be shared across different training sessions (different hyperparameters and target_dir)
get_path_only (bool) – Directly return the file paths whether or not they exist
- Returns:
tuple
train_path (str)
valid_path (str)
test_paths (List[str])
Each path (str) should be a csv file containing the following columns:
id: (str) - the unique id for this data point
wav_path: (str) - the absolute path of the waveform file
transcription: (str) - a text string
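As an illustration, such a csv could be written as in the sketch below; the id, wav_path, and transcription values are made up and only show the expected schema.

```python
import pandas as pd

# A hypothetical train.csv following the columns above (values are illustrative).
df = pd.DataFrame(
    {
        "id": ["example-0001"],
        "wav_path": ["/data/LibriSpeech/train-clean-100/example-0001.flac"],
        "transcription": ["HELLO WORLD"],
    }
)
df.to_csv("train.csv", index=False)
```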
- prepare_tokenizer_data(prepare_tokenizer_data: dict, target_dir: str, cache_dir: str, train_csv: str, valid_csv: str, test_csvs: List[str], get_path_only: bool = False)[source]#
Prepare the text file used for training the tokenizer. By default, only the transcriptions in the train_csv returned from prepare_data are used. The default prepare_tokenizer_data prepares the text used to train a character-based tokenizer.
- Parameters:
prepare_tokenizer_data (dict) – same as in default_config, no supported argument for now
target_dir (str) – Save the text file into this directory
cache_dir (str) – If the parsing or preprocessing takes a long time, you can save the temporary files into this directory. This directory is expected to be shared across different training sessions (different hyperparameters and target_dir)
train_csv (str) – The train data given by prepare_data
get_path_only (bool) – Directly return the filepath whether or not it exists
- Returns:
str
The text file path. The text file should be in the format:
This is the first line
This is the second line
These are all text used for training tokenizer
- build_tokenizer(build_tokenizer: dict, target_dir: str, cache_dir: str, tokenizer_data_path: str, get_path_only: bool = False)[source]#
Build the tokenizer from the data prepared by prepare_tokenizer_data. By default, calls prepare_common_tokenizer with **build_tokenizer.
- Parameters:
build_tokenizer (dict) – same as in default_config, arguments for prepare_common_tokenizer
target_dir (str) – Current experiment directory
cache_dir (str) – If the parsing or preprocessing takes a long time, you can save the temporary files into this directory. This directory is expected to be shared across different training sessions (different hyperparameters and target_dir)
tokenizer_data_path (str) – The text file from prepare_tokenizer_data
get_path_only (bool) – Directly return the filepath whether or not it exists
- Returns:
str
filepath of the pickled
s3prl.dataio.encoder.tokenizer.Tokenizer
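For example, the returned tokenizer file can be loaded with pickle, as in the sketch below; the path is hypothetical and the encode/decode calls assume the standard Tokenizer interface.

```python
import pickle

# A sketch: load the pickled tokenizer produced by build_tokenizer.
# "exp/superb_asr/tokenizer.pkl" is a hypothetical path.
with open("exp/superb_asr/tokenizer.pkl", "rb") as f:
    tokenizer = pickle.load(f)

ids = tokenizer.encode("HELLO WORLD")  # text -> class ids (assumed interface)
text = tokenizer.decode(ids)           # class ids -> text
```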
- build_dataset(build_dataset: dict, target_dir: str, cache_dir: str, mode: str, data_csv: str, tokenizer_path: str)[source]#
Build the dataset for train/valid/test.
- Parameters:
build_dataset (dict) – same as in default_config, not used
target_dir (str) – Current experiment directory
cache_dir (str) – If the preprocessing takes a long time, you can save the temporary files into this directory. This directory is expected to be shared across different training sessions (different hyperparameters and target_dir)
mode (str) – train/valid/test
data_csv (str) – The metadata csv file for the specific mode
tokenizer_path (str) – The pickled tokenizer path for encoding the transcriptions
- Returns:
torch Dataset
For all of the train/valid/test modes, the dataset should return each item as a dictionary containing the following keys:
x: (torch.FloatTensor) - the waveform in (seq_len, 1)
x_len: (int) - the waveform length seq_len
class_ids: (torch.LongTensor) - the encoded class ids of a transcription (sentence)
labels: (str) - the text transcription
unique_name: (str) - the unique id for this datapoint
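Concretely, a single item could look like the sketch below; all tensors and values are made up and only illustrate the expected keys and shapes.

```python
import torch

# A hypothetical dataset item following the keys above (values are illustrative).
item = {
    "x": torch.randn(16000, 1),        # 1 second of 16 kHz audio, shape (seq_len, 1)
    "x_len": 16000,                    # the waveform length seq_len
    "class_ids": torch.LongTensor([8, 5, 12, 12, 15]),  # encoded transcription
    "labels": "HELLO",                 # the raw text transcription
    "unique_name": "example-0001",     # the unique id of this datapoint
}
```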
- build_batch_sampler(build_batch_sampler: dict, target_dir: str, cache_dir: str, mode: str, data_csv: str, dataset: Dataset)[source]#
Return the batch sampler for the torch DataLoader.
- Parameters:
build_batch_sampler (dict) – same as in default_config, with the following keys:
train: (dict) - arguments for SortedBucketingSampler
valid: (dict) - arguments for FixedBatchSizeBatchSampler
test: (dict) - arguments for FixedBatchSizeBatchSampler
target_dir (str) – Current experiment directory
cache_dir (str) – If the preprocessing takes a long time, save the temporary files into this directory. This directory is expected to be shared across different training sessions (different hyperparameters and target_dir)
mode (str) – train/valid/test
data_csv (str) – the mode-specific csv from prepare_data
dataset – the dataset from build_dataset
- Returns:
batch sampler for torch DataLoader
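The returned sampler is typically wired into a torch DataLoader together with the dataset from build_dataset and the collate_fn from build_collate_fn, roughly as in the sketch below; the three objects are assumed to come from those methods.

```python
from torch.utils.data import DataLoader

# A sketch of the wiring; dataset, batch_sampler, and collate_fn are assumed
# to come from build_dataset, build_batch_sampler, and build_collate_fn.
loader = DataLoader(
    dataset,
    batch_sampler=batch_sampler,
    collate_fn=collate_fn,
    num_workers=6,
)
```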
- build_downstream(build_downstream: dict, downstream_input_size: int, downstream_output_size: int, downstream_input_stride: int)[source]#
Return the task-specific downstream model. By default, builds the RNNEncoder model wrapped with ModelWithSpecaug.
- Parameters:
build_downstream (dict) – same as in default_config, has two keys: model_conf holds the arguments for RNNEncoder; specaug_conf holds the arguments for ModelWithSpecaug
downstream_input_size (int) – the required input size of the model
downstream_output_size (int) – the required output size of the model
downstream_input_stride (int) – the input feature's stride (from 16 kHz)
- Returns:
- build_collate_fn(build_collate_fn: dict, mode: str)[source]#
By default, returns s3prl.dataset.base.default_collate_fn.
- Parameters:
build_collate_fn (dict) – same as in default_config, no argument supported for now
mode (str) – train, valid, or test
- Returns:
callable
the collate_fn for the torch DataLoader in train/valid/test mode
- build_featurizer(build_featurizer: dict, upstream)[source]#
By default, builds the featurizer with s3prl.nn.Featurizer.
- Parameters:
build_featurizer (dict) – same as in default_config, arguments for s3prl.nn.Featurizer
upstream (AbsUpstream) – the upstream model built by build_upstream
- Returns:
s3prl.nn.interface.AbsFeaturizer
Return the featurizer model. The featurizer reduces the multiple hidden states returned by the upstream model (built by build_upstream) into a single hidden state, so it can easily be fed into the downstream model
- build_model(build_model: dict, model_output_size: int, build_upstream: dict, build_featurizer: dict, build_downstream: dict)[source]#
By default, builds the model with s3prl.nn.upstream.UpstreamDownstreamModel.
- Parameters:
build_model (dict) – same as in default_config, arguments for s3prl.nn.upstream.UpstreamDownstreamModel
model_output_size (int) – the required output hidden size of the model
build_upstream (dict) – same as in default_config, refer to build_upstream
build_featurizer (dict) – same as in default_config, refer to build_featurizer
build_downstream (dict) – same as in default_config, refer to build_downstream
- Returns:
torch.nn.Module
Return the entire model for the task, which takes the items from the DataLoader directly as input. Usually, the components are built by build_upstream, build_featurizer, and build_downstream, and are concatenated together to form the final model. The upstream extracts multiple hidden states, the featurizer reduces them into a single hidden state, and the downstream takes that hidden state as the feature for the downstream-specific model.
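The sketch below illustrates the upstream-to-featurizer part of this composition with the public s3prl building blocks; the upstream name "fbank" is only an example and the downstream part is omitted.

```python
import torch
from s3prl.nn import S3PRLUpstream, Featurizer

# A rough sketch of the upstream -> featurizer stage of the composition.
upstream = S3PRLUpstream("fbank")      # example upstream name
featurizer = Featurizer(upstream)

wavs = torch.randn(2, 16000)           # (batch_size, num_samples) at 16 kHz
wavs_len = torch.LongTensor([16000, 16000])

all_hs, all_hs_len = upstream(wavs, wavs_len)  # multiple hidden states
hs, hs_len = featurizer(all_hs, all_hs_len)    # reduced to a single hidden state
print(hs.shape)                                # (batch_size, num_frames, featurizer.output_size)
```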
- build_optimizer(build_optimizer: dict, parameters)[source]#
- Parameters:
build_optimizer (dict) – same as in default_config, with the following keys:
name: (str) - the optimizer class name in torch.optim
conf: (dict) - the arguments for initializing the optimizer class, e.g. {"lr": 1.0e-4}
parameters (iterable) – the standard params accepted by torch.optim.Optimizer.
- Returns:
torch.optim.Optimizer
An optimizer following standard torch usage
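With the default config (name: Adam, conf: {lr: 1.0e-4}), the name/conf convention amounts to roughly the following sketch; "model" is assumed to be the module built by build_model.

```python
import torch

# A sketch of the name/conf resolution; "model" is an assumed torch.nn.Module.
optimizer_cls = getattr(torch.optim, "Adam")              # from "name"
optimizer = optimizer_cls(model.parameters(), lr=1.0e-4)  # from "conf"
```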
- build_scheduler(build_scheduler: dict, optimizer)[source]#
- Parameters:
build_scheduler (dict) – same as in default_config, with the following keys:
name: (str) - the scheduler class name in torch.optim.lr_scheduler
conf: (dict) - the arguments for initializing the scheduler class, e.g. {"gamma": 0.01} for torch.optim.lr_scheduler.StepLR
optimizer – the standard torch optimizer accepted by the schedulers in torch.optim.lr_scheduler.
- Returns:
torch scheduler
A scheduler following standard torch usage
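Similarly, the scheduler's name/conf convention resolves roughly as in the sketch below; "optimizer" is assumed to come from build_optimizer.

```python
import torch

# A sketch of the scheduler resolution; "optimizer" is assumed to exist already.
scheduler_cls = getattr(torch.optim.lr_scheduler, "ExponentialLR")  # from "name"
scheduler = scheduler_cls(optimizer, gamma=0.9)                     # its arguments
```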
- build_upstream(build_upstream: dict)[source]#
By default, builds the upstream with s3prl.nn.upstream.S3PRLUpstream.
- Parameters:
build_upstream (dict) – same as in default_config, arguments for s3prl.nn.upstream.S3PRLUpstream
- Returns:
s3prl.nn.interface.AbsUpstream
Return an upstream model, whose forward takes the waveform input and returns multiple hidden states as features.
- evaluate(evaluate: dict, mode: str, task, dataset, batch_sampler, collate_fn, eval_batch: int, dump_dir: str, device: str, num_workers: int)[source]#
The evaluation routine used by train (during the validation phase) and run (during the testing phase).
- Parameters:
evaluate (dict) – same as in default_config, no argument supported for now
**others – only meaningful when you want to override this method, which is not the common case. Hence we skip the documentation for now.
- classmethod get_class_from_name(name: str)[source]#
- Parameters:
name (str) – the __name__ of the problem class
- Returns:
Problem
- load_model(model_ckpt_dir: str)[source]#
Return the saved model.
- Parameters:
model_ckpt_dir (str) – Restore the model with build_model and the checkpoint saved in this directory.
- Returns:
torch.nn.Module
- load_model_and_task(ckpts_dir: str, task_overrides: Optional[dict] = None)[source]#
This is a helper method that combines load_model and load_task to directly load both the model and the task. This method assumes the model is saved under ckpts_dir / 'model' and the task is saved under ckpts_dir / 'task'
- Returns:
tuple
model (torch.nn.Module)
task (s3prl.task.Task)
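A typical use is loading a finished experiment for inference, as in the sketch below; the checkpoint directory is hypothetical.

```python
from s3prl.problem.asr.superb_asr import SuperbASR

# A sketch: load a trained model and task from a checkpoint directory that
# contains the 'model' and 'task' subdirectories mentioned above.
# "exp/superb_asr/valid_best" is a hypothetical path.
problem = SuperbASR()
model, task = problem.load_model_and_task("exp/superb_asr/valid_best")
model.eval()
```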
- load_task(task_ckpt_dir: str, model: Module, task_overrides: Optional[dict] = None)[source]#
Return the saved task.
- Parameters:
task_ckpt_dir (str) – Restore the task with build_task and the checkpoint saved in this directory.
model (torch.nn.Module) – the model for the task; the model is saved separately and is required by build_task.
task_overrides (dict) – overrides the saved initialization arguments, so you can change the loaded task's behavior, e.g. the decoding hyperparameters.
- Returns:
- run(target_dir: str, cache_dir: str, remove_all_cache: bool = False, start: int = 0, stop: Optional[int] = None, num_workers: int = 6, eval_batch: int = -1, device: str = 'cuda', world_size: int = 1, rank: int = 0, test_ckpt_dir: Optional[str] = None, prepare_data: Optional[dict] = None, prepare_tokenizer_data: Optional[dict] = None, build_tokenizer: Optional[dict] = None, build_dataset: Optional[dict] = None, build_batch_sampler: Optional[dict] = None, build_collate_fn: Optional[dict] = None, build_upstream: Optional[dict] = None, build_featurizer: Optional[dict] = None, build_downstream: Optional[dict] = None, build_model: Optional[dict] = None, build_task: Optional[dict] = None, build_optimizer: Optional[dict] = None, build_scheduler: Optional[dict] = None, save_model: Optional[dict] = None, save_task: Optional[dict] = None, train: Optional[dict] = None, evaluate: Optional[dict] = None)[source]#
stage 0: Parse the corpus and save the metadata file for ASR (waveform path, label…)
stage 1: Prepare the metadata file for training the tokenizer
stage 2: Train the tokenizer
stage 3: Train the ASR model
stage 4: Evaluate the model on multiple test sets; multiple checkpoints will be evaluated for each test set (see test_ckpt_steps)
- Parameters:
target_dir (str) – The directory that stores the script result.
cache_dir (str) – The directory that caches the processed data. Default: /home/user/.cache/s3prl/data
remove_all_cache (bool) – Whether to remove all the cache stored under cache_dir. Default: False
start (int) – The starting stage of the problem script. Default: 0
stop (int) – The stopping stage of the problem script; set None to reach the final stage. Default: None
num_workers (int) – num_workers for all the torch DataLoaders
eval_batch (int) – During evaluation (valid or test), limit the number of batches. This is helpful for fast development to check that nothing crashes. If -1, disable this feature and evaluate the entire epoch. Default: -1
device (str) – The device type for all torch-related operations: "cpu" or "cuda". Default: "cuda"
world_size (int) – How many processes are running this script simultaneously (in parallel). Usually this is just 1; however, if you are running distributed training, this should be > 1. Default: 1
rank (int) – Used when doing distributed training, where world_size > 1. Take world_size == 8 for example: this means 8 processes (8 GPUs) are running in parallel, and the script needs to know which of the 8 processes it is. In this case, rank can range from 0 to 7. All 8 processes have the same world_size but a different rank (process id).
test_ckpt_dir (str) – Specify the checkpoint path for testing. If not given, use the checkpoints specified by test_ckpts_steps.
**others – The other arguments, like prepare_data and build_model, are method-specific arguments for methods like prepare_data and build_model, and are not used in the core run logic. See the specific method documentation for their supported arguments and meanings
- save_model(save_model: dict, model_ckpt_dir: str, build_model_all_args: dict, model: Module)[source]#
Save the model state_dict and the model initialization arguments into the given directory. If you override this method, you most likely also need to override load_model.
- Parameters:
save_model (dict) – same as in default_config, so the user can save additional settings, like the configuration of the dataset, by duplicating the dataset hyperparameters inside the save_model field. You can rely on the omegaconf package to simplify the duplication.
model_ckpt_dir (str) – save the model into this directory.
build_model_all_args (dict) – all the arguments of build_model. By saving this dictionary, you can easily reconstruct the same model by calling build_model with the saved dictionary.
model (torch.nn.Module) – the model to be saved.
- Returns:
None
- save_task(save_task: dict, task_ckpt_dir: str, build_task_all_args_except_model: dict, task: Task)[source]#
Save the task's state, task.get_state(), and the initialization arguments into the given directory. If you override this method, you most likely also need to override load_task.
- Parameters:
save_task (dict) – same as in default_config, so the user can save additional settings, like the configuration of the dataset, by duplicating the dataset hyperparameters inside the save_task field. You can rely on the omegaconf package to simplify the duplication.
task_ckpt_dir (str) – save the task into this directory.
build_task_all_args_except_model (dict) – all the arguments of build_task except the model argument, since the model should be separately saved by save_model. By saving this dictionary, you can easily reconstruct the same task by calling build_task with the saved dictionary.
task (Task) – the task to be saved.
- Returns:
None
- train(train: dict, train_dir: str, build_model_all_args: dict, build_task_all_args_except_model: dict, save_model: dict, save_task: dict, build_optimizer: dict, build_scheduler: dict, evaluate: dict, train_dataset, train_batch_sampler, train_collate_fn, valid_dataset, valid_batch_sampler, valid_collate_fn, num_workers: int, world_size: int, rank: int, eval_batch: int, device: str, global_config: Optional[dict] = None)[source]#
- Parameters:
train (dict) – same as in default_config, with the following keys:
total_steps: (int) - the total optimization steps
log_step: (int) - logging frequency; log every log_step steps
eval_step: (int) - evaluation frequency; evaluate every eval_step steps. Note that you can control how many batches to evaluate to speed up development with the eval_batch argument of run
save_step: (int) - save a checkpoint every save_step steps
gradient_clipping: (float) - clip the gradient; important for RNNs
gradient_accumulate: (int) - accumulate the gradient of multiple steps before updating the network parameters, to simulate large-batch optimization
valid_metric: (str) - the metric used to select the best validation checkpoint. Different Tasks support different valid_metrics; see build_task for the supported metrics
valid_higher_better: (bool) - some metrics are higher-better while others are lower-better; this affects how the best validation checkpoint is saved
auto_resume: (bool) - if the last checkpoint already exists in target_dir (see run), whether to resume from it or delete it and start a new training session
resume_ckpt_dir: (str) - directly specify a checkpoint path to resume from, which does not need to be inside target_dir (see run)
seed: (int) - fix the random seed before training starts
keep_num_ckpts: (int) - to avoid saving too many checkpoints, keep only the keep_num_ckpts latest checkpoints and delete older ones
use_scheduler: (bool) - whether to use the scheduler
**others – only meaningful when you want to override this train method, which is not the common case. Hence we skip the documentation for now.
prepare_librispeech#
- s3prl.problem.asr.superb_asr.prepare_librispeech(target_dir, cache_dir, dataset_root, train_sets: List[str], valid_sets: List[str], test_sets: List[str], n_jobs: int = 6, get_path_only: bool = False)[source]#
Prepare LibriSpeech for ASR following the SuperbASR.prepare_data format. See LibriSpeech for the usage of the arguments
prepare_common_tokenizer#
- s3prl.problem.asr.superb_asr.prepare_common_tokenizer(target_dir, cache_dir, tokenizer_data_path, get_path_only=False, tokenizer_name: Optional[str] = None, vocab_file: Optional[str] = None, vocab_type: str = 'character', vocab_args: Optional[dict] = None, slots_file: Optional[str] = None)[source]#
Build the tokenizer following the SuperbASR.build_tokenizer format
- Parameters:
tokenizer_name (str) – Save the tokenizer file with this filename
vocab_file (str) – When the tokenizer was already prepared and you just want to load and return the tokenizer here. Path or URL
vocab_type (str) – character / phoneme / word / subword
vocab_args (dict) –
when vocab_type is character / phoneme / word, supports the arguments of the corresponding vocabulary generation function
when vocab_type is subword, supports the arguments of the subword model training function
slots_file (str) – If provided, the pre-defined slots will be used to encode the special tokens. Path or URL
- Returns:
str
tokenizer path
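For example, building a character tokenizer directly from a prepared text file could look like the sketch below; all paths are hypothetical.

```python
from s3prl.problem.asr.superb_asr import prepare_common_tokenizer

# A sketch following the signature above; all paths are hypothetical.
tokenizer_path = prepare_common_tokenizer(
    target_dir="exp/superb_asr",
    cache_dir="exp/cache",
    tokenizer_data_path="exp/superb_asr/tokenizer_data.txt",
    vocab_type="character",
)
print(tokenizer_path)  # path to the pickled Tokenizer
```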