librispeech

(s3prl.dataio.corpus.librispeech)

Parse the LibriSpeech corpus

Authors:
  • Heng-Jui Chang 2022

LibriSpeech

class s3prl.dataio.corpus.librispeech.LibriSpeech(dataset_root: str, n_jobs: int = 4, train_split: List[str] = ['train-clean-100'], valid_split: List[str] = ['dev-clean'], test_split: List[str] = ['test-clean'])

Bases: Corpus

LibriSpeech corpus. Link: https://www.openslr.org/12

Parameters:
  • dataset_root (str) – Path to LibriSpeech corpus directory.

  • n_jobs (int, optional) – Number of parallel jobs. Defaults to 4.

  • train_split (List[str], optional) – Training splits. Defaults to ["train-clean-100"].

  • valid_split (List[str], optional) – Validation splits. Defaults to ["dev-clean"].

  • test_split (List[str], optional) – Testing splits. Defaults to ["test-clean"].
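
As a rough sketch of how the constructor arguments above fit together, the snippet below checks split names against the standard LibriSpeech subset names before building the corpus. `KNOWN_SPLITS`, `check_splits`, and `load_librispeech` are hypothetical helpers introduced for illustration, not part of s3prl, and the dataset path is a placeholder. Only the `LibriSpeech` constructor arguments come from the signature above.

```python
from typing import List

# Standard LibriSpeech subset names as published on OpenSLR (https://www.openslr.org/12).
KNOWN_SPLITS = {
    "train-clean-100", "train-clean-360", "train-other-500",
    "dev-clean", "dev-other", "test-clean", "test-other",
}


def check_splits(splits: List[str]) -> List[str]:
    """Hypothetical helper: reject names that are not LibriSpeech subsets."""
    unknown = [s for s in splits if s not in KNOWN_SPLITS]
    if unknown:
        raise ValueError(f"Unknown LibriSpeech splits: {unknown}")
    return list(splits)


def load_librispeech(dataset_root: str, n_jobs: int = 4):
    """Hypothetical wrapper: build the corpus with explicit split lists."""
    # Imported lazily so the helpers above work without s3prl installed.
    from s3prl.dataio.corpus.librispeech import LibriSpeech

    return LibriSpeech(
        dataset_root,
        n_jobs=n_jobs,
        train_split=check_splits(["train-clean-100"]),
        valid_split=check_splits(["dev-clean"]),
        test_split=check_splits(["test-clean"]),
    )
```

Calling `load_librispeech("/path/to/LibriSpeech")` needs s3prl installed and the corpus already extracted at that (placeholder) path. Validating the names first surfaces typos such as `train-clean` as a clear error before any files are read.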

get_corpus_splits(splits: List[str])

property all_data

Returns all data points as a dict in the following format:

data_id1:
    wav_path: (str) The waveform path
    transcription: (str) The transcription
    speaker: (str) The speaker name
    gender: (str) The speaker's gender
    corpus_split: (str) The split of corpus this sample belongs to

data_id2:
    ...
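
To make the format above concrete, here is a small sketch that groups data IDs by their `corpus_split` field. The `example_data` dict is hand-written sample data standing in for what `all_data` returns: its IDs, paths, and field values are made up for illustration. `group_ids_by_split` is a hypothetical helper, not part of s3prl.

```python
from collections import defaultdict
from typing import Dict, List

# Hand-written stand-in for the dict returned by `all_data`.
# Every ID, path, and value here is illustrative, not real corpus data.
example_data = {
    "utt-0001": {
        "wav_path": "/data/LibriSpeech/train-clean-100/utt-0001.flac",
        "transcription": "HELLO WORLD",
        "speaker": "spk-a",
        "gender": "F",
        "corpus_split": "train-clean-100",
    },
    "utt-0002": {
        "wav_path": "/data/LibriSpeech/train-clean-100/utt-0002.flac",
        "transcription": "GOOD MORNING",
        "speaker": "spk-b",
        "gender": "M",
        "corpus_split": "train-clean-100",
    },
    "utt-0003": {
        "wav_path": "/data/LibriSpeech/dev-clean/utt-0003.flac",
        "transcription": "SEE YOU SOON",
        "speaker": "spk-a",
        "gender": "F",
        "corpus_split": "dev-clean",
    },
}


def group_ids_by_split(data: Dict[str, dict]) -> Dict[str, List[str]]:
    """Hypothetical helper: map each corpus split to its sorted data IDs."""
    grouped = defaultdict(list)
    for data_id, fields in data.items():
        grouped[fields["corpus_split"]].append(data_id)
    # Sort the IDs so the result does not depend on insertion order.
    return {split: sorted(ids) for split, ids in grouped.items()}


splits_to_ids = group_ids_by_split(example_data)
```

With a real corpus object, the same helper could be applied to `corpus.all_data`, assuming the property returns the structure documented above.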

property data_split_ids

classmethod download_dataset(target_dir: str, splits: List[str] = ['train-clean-100', 'dev-clean', 'test-clean']) → None

property data_split

static dataframe_to_datapoints(df: DataFrame, unique_name_fn: callable)