encoder#
(s3prl.dataio.encoder)
Encode raw data into numeric format, and decode it back.
- CategoryEncoder, CategoryEncoders: simple categorical encoder
- G2P: basic grapheme-to-phoneme
- Tokenizer and its subclasses: load a tokenizer to encode & decode
- generate_basic_vocab, generate_subword_vocab, generate_vocab: create a vocabulary (train a tokenizer)
CategoryEncoder#
CategoryEncoders#
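A categorical encoder maps each distinct label to an integer id and back. The class below is a minimal illustrative sketch of that idea, not the s3prl implementation; the class and method names are hypothetical.

```python
from typing import List


class SimpleCategoryEncoder:
    """Sketch of a categorical encoder: map each unique label to an
    integer id and back (illustrative, not the s3prl implementation)."""

    def __init__(self, labels: List[str]):
        # Deterministic ordering so ids are stable across runs.
        self.idx2label = sorted(set(labels))
        self.label2idx = {label: i for i, label in enumerate(self.idx2label)}

    def encode(self, label: str) -> int:
        return self.label2idx[label]

    def decode(self, index: int) -> str:
        return self.idx2label[index]
```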
G2P#
- class s3prl.dataio.encoder.G2P(file_list: Optional[List[str]] = None, allow_unk: bool = False)[source][source]#
Bases:
object
Grapheme-to-phoneme converter.
- Parameters:
file_list (List[str], optional) – List of lexicon files. Defaults to None.
allow_unk (bool) – If False, raise an error when a word cannot be recognized by this basic G2P. Defaults to False.
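At its core, a basic G2P is a lexicon lookup: each word in the input text is replaced by its phoneme sequence, and the allow_unk flag controls what happens for out-of-lexicon words. The sketch below illustrates that behavior with a hypothetical class and an in-memory lexicon; it is not the s3prl implementation.

```python
from typing import Dict, List


class SimpleG2P:
    """Minimal grapheme-to-phoneme sketch: a plain lexicon lookup that
    mirrors the documented allow_unk behavior (illustrative only)."""

    def __init__(self, lexicon: Dict[str, List[str]], allow_unk: bool = False):
        self.lexicon = lexicon
        self.allow_unk = allow_unk

    def __call__(self, text: str) -> List[str]:
        phonemes: List[str] = []
        for word in text.split():
            if word in self.lexicon:
                phonemes.extend(self.lexicon[word])
            elif self.allow_unk:
                # Unknown words become a single <unk> token.
                phonemes.append("<unk>")
            else:
                raise ValueError(f"word not in lexicon: {word}")
        return phonemes
```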
Tokenizer#
BertTokenizer#
WordTokenizer#
CharacterTokenizer#
CharacterSlotTokenizer#
SubwordTokenizer#
SubwordSlotTokenizer#
generate_basic_vocab#
- s3prl.dataio.encoder.generate_basic_vocab(mode: str, text_list: List[str], vocab_size: int = -1, coverage: float = 1.0, sort_vocab: bool = True) List[str] [source][source]#
Generates a basic vocabulary, either character-based or word-based.
- Parameters:
mode (str) – Vocabulary type ("character" or "word").
text_list (List[str]) – List of text data.
vocab_size (int, optional) – Vocabulary size. If not specified (-1), the size is coverage * (actual vocabulary size). Defaults to -1.
coverage (float, optional) – Vocabulary coverage. Defaults to 1.0.
sort_vocab (bool, optional) – Sort the vocabulary alphabetically. Defaults to True.
- Returns:
A list of vocabulary units.
- Return type:
List[str]
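The parameters above can be illustrated with a small self-contained sketch: count character or word units, apply the documented vocab_size fallback (coverage times the actual vocabulary size), and optionally sort. The function name is hypothetical and this is not the s3prl implementation.

```python
from collections import Counter
from typing import List


def basic_vocab(mode: str, text_list: List[str], vocab_size: int = -1,
                coverage: float = 1.0, sort_vocab: bool = True) -> List[str]:
    """Sketch of a basic vocabulary generator (illustrative only)."""
    counter: Counter = Counter()
    for line in text_list:
        # "word" mode splits on whitespace; "character" mode counts
        # every non-space character.
        counter.update(line.split() if mode == "word" else line.replace(" ", ""))
    # Most frequent units first, so truncation keeps common units.
    units = [unit for unit, _ in counter.most_common()]
    if vocab_size < 0:
        # Documented fallback: vocab_size = coverage * actual vocab size.
        vocab_size = int(coverage * len(units))
    vocab = units[:vocab_size]
    return sorted(vocab) if sort_vocab else vocab
```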
generate_subword_vocab#
- s3prl.dataio.encoder.generate_subword_vocab(text_list: Optional[List[str]] = None, text_file: Optional[str] = None, output_file: Optional[str] = None, vocab_size: int = 1000, character_coverage: float = 1.0) str [source][source]#
Generates subword vocabularies based on sentencepiece.
- Parameters:
text_list (List[str], optional) – List of text data. Defaults to None.
text_file (str, optional) – Path to text data. Defaults to None.
output_file (str, optional) – Path to save the trained subword vocabularies. Defaults to None.
vocab_size (int, optional) – Vocabulary size. Defaults to 1000.
character_coverage (float, optional) – Coverage of characters in text data. Defaults to 1.0.
- Raises:
ImportError – If sentencepiece is not installed.
- Returns:
Path to ${output_file}.model.
- Return type:
str
generate_vocab#
- s3prl.dataio.encoder.generate_vocab(mode: str, text_list: Optional[List[str]] = None, text_file: Optional[str] = None, read_lines: int = 10000000, **vocab_args) Union[List[str], str] [source][source]#
Generates vocabularies given text data.
- Parameters:
mode (str) – Vocabulary type ("character", "word", or "subword").
text_list (List[str], optional) – List of text data. Defaults to None.
text_file (str, optional) – Path to text data. Defaults to None.
read_lines (int, optional) – Maximum lines to read from text_file. Defaults to 10000000.
vocab_args – Additional arguments forwarded to the vocabulary generator: if mode != "subword", arguments for generate_basic_vocab; if mode == "subword", arguments for generate_subword_vocab.
- Returns:
A list of vocabulary units, or the path to the trained .vocab file.
- Return type:
Union[List[str], str]
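The dual return type reflects a dispatch on mode: basic modes return the vocabulary list directly, while subword mode trains a model and returns a file path. The sketch below mirrors that dispatch with a hypothetical function; it is illustrative only, not the s3prl implementation (the subword branch stands in for real sentencepiece training).

```python
from typing import List, Union


def generate_vocab_sketch(mode: str, text_list: List[str]) -> Union[List[str], str]:
    """Mirror the documented return types: a vocabulary list for basic
    modes, a file path for subword mode (illustrative only)."""
    if mode == "subword":
        # The real function trains a sentencepiece model here and
        # returns the path to the trained vocabulary file.
        return "output.vocab"
    # Basic modes ("character" or "word") return the unit list directly.
    units = set()
    for line in text_list:
        units.update(line.split() if mode == "word" else line.replace(" ", ""))
    return sorted(units)
```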