tokenizer#
(s3prl.dataio.encoder.tokenizer)
Load tokenizers to encode and decode text.
Modified from tensorflow_datasets.features.text.*. Reference: https://www.tensorflow.org/datasets/api_docs/python/tfds/features/text_lib
- Authors:
Heng-Jui Chang 2022
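A minimal sketch of the intended encode/decode round trip; the encode and decode method names and the character-mode vocabulary below are assumptions based on the tensorflow_datasets-style text encoders this module is modified from.

from s3prl.dataio.encoder.tokenizer import load_tokenizer

# Build a character-level tokenizer from an in-memory vocabulary.
# The vocabulary content here is illustrative only.
tokenizer = load_tokenizer("character", vocab_list=list("abcdefghijklmnopqrstuvwxyz "))

# encode() and decode() are assumed to map text to token ids and back,
# following the tensorflow_datasets text-encoder interface.
ids = tokenizer.encode("hello world")
text = tokenizer.decode(ids)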
CharacterTokenizer#
CharacterSlotTokenizer#
SubwordTokenizer#
SubwordSlotTokenizer#
WordTokenizer#
PhonemeTokenizer#
load_tokenizer#
- s3prl.dataio.encoder.tokenizer.load_tokenizer(mode: str, vocab_file: Optional[str] = None, vocab_list: Optional[List[str]] = None, slots_file: Optional[str] = None) → Tokenizer [source]#
Load a text tokenizer.
- Parameters:
mode (str) – Mode (“character”, “character-slot”, “subword”, “subword-slot”, “word”, “bert-…”)
vocab_file (str, optional) – Path to vocabularies. Defaults to None.
vocab_list (List[str], optional) – List of vocabularies. Defaults to None.
slots_file (str, optional) – Path to slots. Defaults to None.
- Raises:
NotImplementedError – If mode is not implemented.
- Returns:
Text tokenizer.
- Return type:
Tokenizer
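A sketch of how the mode argument selects the tokenizer type; the file paths and the subword vocabulary format below are hypothetical.

from s3prl.dataio.encoder.tokenizer import load_tokenizer

# Character-level tokenizer from an explicit vocabulary list.
char_tok = load_tokenizer("character", vocab_list=["a", "b", "c", " "])

# Subword tokenizer from a vocabulary file on disk; the exact file format
# expected by SubwordTokenizer is an assumption here.
subword_tok = load_tokenizer("subword", vocab_file="path/to/subword.model")

# Slot variants additionally take a slots_file (hypothetical path).
slot_tok = load_tokenizer(
    "character-slot",
    vocab_list=["a", "b", "c"],
    slots_file="path/to/slots.txt",
)

# Any other mode raises NotImplementedError.
try:
    load_tokenizer("unknown-mode")
except NotImplementedError:
    print("mode not implemented")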
default_phoneme_tokenizer#
- s3prl.dataio.encoder.tokenizer.default_phoneme_tokenizer() → PhonemeTokenizer [source]#
Returns a default LibriSpeech phoneme tokenizer.
- Returns:
A phoneme tokenizer whose vocabulary contains 71 phonemes.
- Return type:
PhonemeTokenizer
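A short usage sketch; the encode call and the space-separated phoneme input format are assumptions based on the shared tokenizer interface.

from s3prl.dataio.encoder.tokenizer import default_phoneme_tokenizer

# The 71-phoneme LibriSpeech vocabulary ships with the library,
# so no vocab file is needed.
tokenizer = default_phoneme_tokenizer()

# Encoding a space-separated phoneme string (input format assumed).
ids = tokenizer.encode("HH AH0 L OW1")
print(len(ids))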