vocabulary
(s3prl.dataio.encoder.vocabulary)
Create vocabulary (train tokenizer)
- Authors:
Heng-Jui Chang 2022
generate_basic_vocab
- s3prl.dataio.encoder.vocabulary.generate_basic_vocab(mode: str, text_list: List[str], vocab_size: int = -1, coverage: float = 1.0, sort_vocab: bool = True) → List[str]
Generates basic vocabularies, including character and word-based vocabularies.
- Parameters:
mode (str) – Vocabulary type (character or word).
text_list (List[str]) – List of text data.
vocab_size (int, optional) – Vocabulary size. If not specified, vocab_size is set to coverage * actual vocabulary size. Defaults to -1.
coverage (float, optional) – Vocabulary coverage. Defaults to 1.0.
sort_vocab (bool, optional) – Sort vocabularies alphabetically. Defaults to True.
- Returns:
A list of vocabularies.
- Return type:
List[str]
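To make the vocab_size, coverage, and sort_vocab parameters concrete, the following is a minimal sketch of how a character- or word-level vocabulary could be built by frequency counting. It is not the s3prl implementation: the helper name basic_vocab_sketch, the use of collections.Counter, whitespace splitting in word mode, and the reading of coverage as the fraction of distinct tokens kept are all assumptions made for illustration.

```python
from collections import Counter
from typing import List


def basic_vocab_sketch(
    mode: str,
    text_list: List[str],
    vocab_size: int = -1,
    coverage: float = 1.0,
    sort_vocab: bool = True,
) -> List[str]:
    """Hypothetical stand-in, not the s3prl function."""
    if mode not in ("character", "word"):
        raise ValueError(f"unsupported mode: {mode}")

    counter = Counter()
    for line in text_list:
        # Assumption: characters are taken as-is, words are split on whitespace.
        tokens = list(line) if mode == "character" else line.split()
        counter.update(tokens)

    if vocab_size < 0:
        # Follows the documented rule: coverage * actual vocabulary size.
        vocab_size = int(len(counter) * coverage)

    # Keep the most frequent tokens, then optionally sort them alphabetically.
    vocab = [token for token, _ in counter.most_common(vocab_size)]
    return sorted(vocab) if sort_vocab else vocab


texts = ["a b a", "b c a"]
print(basic_vocab_sketch("word", texts))                # ['a', 'b', 'c']
print(basic_vocab_sketch("word", texts, vocab_size=2))  # ['a', 'b']
```

In this sketch an explicit vocab_size takes precedence and coverage is only consulted when vocab_size is negative, which matches the parameter description above.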
generate_subword_vocab
- s3prl.dataio.encoder.vocabulary.generate_subword_vocab(text_list: Optional[List[str]] = None, text_file: Optional[str] = None, output_file: Optional[str] = None, vocab_size: int = 1000, character_coverage: float = 1.0) → str
Generates subword vocabularies based on sentencepiece.
- Parameters:
text_list (List[str], optional) – List of text data. Defaults to None.
text_file (str, optional) – Path to text data. Defaults to None.
output_file (str, optional) – Path to save trained subword vocabularies. Defaults to None.
vocab_size (int, optional) – Vocabulary size. Defaults to 1000.
character_coverage (float, optional) – Coverage of characters in text data. Defaults to 1.0.
- Raises:
ImportError – If sentencepiece is not installed.
- Returns:
Path to ${output_file}.model.
- Return type:
str
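The snippet below sketches one way such a subword vocabulary could be trained by calling sentencepiece directly. It is an illustration under stated assumptions, not the s3prl source: the helper names write_corpus and train_subword_sketch are hypothetical, and the sketch always writes text_list to a temporary file before training. SentencePieceTrainer.train and its input, model_prefix, vocab_size, and character_coverage options come from the sentencepiece Python package.

```python
import os
import tempfile
from typing import List


def write_corpus(text_list: List[str], path: str) -> str:
    """Write one sentence per line so a trainer can read it from disk."""
    with open(path, "w", encoding="utf-8") as f:
        for line in text_list:
            f.write(line.strip() + "\n")
    return path


def train_subword_sketch(
    text_list: List[str],
    output_file: str,
    vocab_size: int = 1000,
    character_coverage: float = 1.0,
) -> str:
    """Hypothetical helper: train a sentencepiece model and return its path."""
    # Imported lazily so a missing package surfaces as ImportError on use.
    import sentencepiece as spm

    with tempfile.TemporaryDirectory() as tmp_dir:
        corpus = write_corpus(text_list, os.path.join(tmp_dir, "corpus.txt"))
        spm.SentencePieceTrainer.train(
            input=corpus,
            model_prefix=output_file,
            vocab_size=vocab_size,
            character_coverage=character_coverage,
        )
    # sentencepiece writes <model_prefix>.model and <model_prefix>.vocab.
    return output_file + ".model"
```

The requested vocab_size has to be attainable from the training text; sentencepiece rejects a size that is too large for a small corpus, so a toy text_list needs a much smaller value than the default.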
generate_vocab
- s3prl.dataio.encoder.vocabulary.generate_vocab(mode: str, text_list: Optional[List[str]] = None, text_file: Optional[str] = None, read_lines: int = 10000000, **vocab_args) → Union[List[str], str]
Generates vocabularies given text data.
- Parameters:
mode (str) – Vocabulary type (character, word, or subword).
text_list (List[str], optional) – List of text data. Defaults to None.
text_file (str, optional) – Path to text data. Defaults to None.
read_lines (int, optional) – Maximum lines to read from text_file. Defaults to 10000000.
vocab_args – If mode != subword, arguments for generate_basic_vocab; if mode == subword, arguments for generate_subword_vocab.
- Returns:
A list of vocabularies or a path to a .vocab file.
- Return type:
Union[List[str], str]
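The routing of vocab_args and the read_lines cap can be sketched as follows. This is an assumed structure for illustration only: read_text_sketch and generate_vocab_sketch are hypothetical names, and the two generator callables are passed in as parameters (basic_fn, subword_fn) so the example runs without s3prl installed.

```python
from itertools import islice
from typing import Callable, List, Optional


def read_text_sketch(text_file: str, read_lines: int = 10000000) -> List[str]:
    """Read at most `read_lines` lines from a text file."""
    with open(text_file, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in islice(f, read_lines)]


def generate_vocab_sketch(
    mode: str,
    basic_fn: Callable,
    subword_fn: Callable,
    text_list: Optional[List[str]] = None,
    text_file: Optional[str] = None,
    read_lines: int = 10000000,
    **vocab_args,
):
    """Hypothetical dispatcher mirroring the documented vocab_args rule."""
    if text_list is None and text_file is None:
        raise ValueError("provide text_list or text_file")
    if text_list is None:
        text_list = read_text_sketch(text_file, read_lines)

    if mode == "subword":
        # vocab_args are forwarded to the subword generator.
        return subword_fn(text_list=text_list, **vocab_args)
    # Any other mode: vocab_args are forwarded to the basic generator.
    return basic_fn(mode, text_list, **vocab_args)
```

With this shape, a keyword such as coverage only makes sense for character or word mode, while character_coverage only makes sense for subword mode, since each set of keywords is passed through unchanged to the corresponding generator.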