vocabulary

(s3prl.dataio.encoder.vocabulary)

Create vocabulary (train tokenizer)

Authors:
  • Heng-Jui Chang 2022

generate_basic_vocab

s3prl.dataio.encoder.vocabulary.generate_basic_vocab(mode: str, text_list: List[str], vocab_size: int = -1, coverage: float = 1.0, sort_vocab: bool = True) → List[str]

Generates basic vocabularies, including character and word-based vocabularies.

Parameters:
  • mode (str) – Vocabulary type (character or word).

  • text_list (List[str]) – List of text data.

  • vocab_size (int, optional) – Vocabulary size. If left at the default, vocab_size is set to coverage * actual vocabulary size. Defaults to -1.

  • coverage (float, optional) – Vocabulary coverage. Defaults to 1.0.

  • sort_vocab (bool, optional) – Sort vocabularies alphabetically. Defaults to True.

Returns:

A list of vocabularies.

Return type:

List[str]
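The parameter descriptions above can be illustrated with a minimal sketch. This is not the s3prl implementation: the function name sketch_basic_vocab is hypothetical, and the choices to count tokens by frequency, split words on whitespace, and break ties alphabetically are assumptions made only to show how vocab_size, coverage, and sort_vocab might interact.

```python
from collections import Counter
from typing import List


def sketch_basic_vocab(
    mode: str,
    text_list: List[str],
    vocab_size: int = -1,
    coverage: float = 1.0,
    sort_vocab: bool = True,
) -> List[str]:
    """Hypothetical re-implementation for illustration; not s3prl's code."""
    if mode not in ("character", "word"):
        raise ValueError(f"unsupported mode: {mode}")

    counter = Counter()
    for text in text_list:
        # Assumption: characters include spaces; words are whitespace-separated.
        tokens = list(text) if mode == "character" else text.split()
        counter.update(tokens)

    if vocab_size < 0:
        # Documented fallback: coverage * actual vocabulary size.
        vocab_size = int(len(counter) * coverage)

    # Keep the most frequent tokens; ties are broken alphabetically (assumption).
    ranked = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
    vocab = [token for token, _ in ranked[:vocab_size]]
    return sorted(vocab) if sort_vocab else vocab
```

Under these assumptions, lowering coverage drops the rarest tokens first, while an explicit vocab_size overrides coverage entirely.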

generate_subword_vocab

s3prl.dataio.encoder.vocabulary.generate_subword_vocab(text_list: Optional[List[str]] = None, text_file: Optional[str] = None, output_file: Optional[str] = None, vocab_size: int = 1000, character_coverage: float = 1.0) → str

Generates subword vocabularies based on sentencepiece.

Parameters:
  • text_list (List[str], optional) – List of text data. Defaults to None.

  • text_file (str, optional) – Path to text data. Defaults to None.

  • output_file (str, optional) – Path to save trained subword vocabularies. Defaults to None.

  • vocab_size (int, optional) – Vocabulary size. Defaults to 1000.

  • character_coverage (float, optional) – Coverage of characters in text data. Defaults to 1.0.

Raises:

ImportError – If sentencepiece is not installed.

Returns:

Path to ${output_file}.model.

Return type:

str
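The documented ImportError and return value can be sketched as follows. Both helpers, require_module and model_path, are hypothetical names introduced for illustration; how s3prl actually guards the sentencepiece import or builds the output path is not shown in this reference.

```python
import importlib
from types import ModuleType


def require_module(name: str) -> ModuleType:
    """Import an optional dependency or fail with an actionable ImportError."""
    try:
        return importlib.import_module(name)
    except ImportError as err:
        raise ImportError(
            f"'{name}' is required for this feature; try `pip install {name}`."
        ) from err


def model_path(output_file: str) -> str:
    """Mirror the documented return value: the output prefix plus '.model'."""
    return f"{output_file}.model"
```

A caller would typically invoke require_module("sentencepiece") before training, so that a missing dependency surfaces with a clear message rather than partway through the run.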

generate_vocab

s3prl.dataio.encoder.vocabulary.generate_vocab(mode: str, text_list: Optional[List[str]] = None, text_file: Optional[str] = None, read_lines: int = 10000000, **vocab_args) → Union[List[str], str]

Generates vocabularies given text data.

Parameters:
  • mode (str) – Vocabulary type (character, word, or subword).

  • text_list (List[str], optional) – List of text data. Defaults to None.

  • text_file (str, optional) – Path to text data. Defaults to None.

  • read_lines (int, optional) – Maximum lines to read from text_file. Defaults to 10000000.

  • vocab_args – Additional keyword arguments, passed to generate_basic_vocab if mode != subword, or to generate_subword_vocab if mode == subword.

Returns:

A list of vocabularies or a path to .vocab file.

Return type:

Union[List[str], str]
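The dispatch described by mode and vocab_args can be sketched as below. This is an illustration under stated assumptions, not s3prl's code: sketch_generate_vocab and its basic_fn and subword_fn parameters are hypothetical stand-ins for the two generators, and the whitespace stripping of lines and the ValueError for missing input are assumptions.

```python
from itertools import islice
from typing import Callable, List, Optional, Union


def sketch_generate_vocab(
    mode: str,
    basic_fn: Callable[..., List[str]],
    subword_fn: Callable[..., str],
    text_list: Optional[List[str]] = None,
    text_file: Optional[str] = None,
    read_lines: int = 10000000,
    **vocab_args,
) -> Union[List[str], str]:
    """Hypothetical dispatcher for illustration; not s3prl's code."""
    if text_list is None and text_file is None:
        # Assumption: at least one text source must be given.
        raise ValueError("provide text_list or text_file")

    if mode == "subword":
        # Subword training receives the inputs as-is and returns a path.
        return subword_fn(text_list=text_list, text_file=text_file, **vocab_args)

    if text_list is None:
        # Read at most read_lines lines from the file.
        with open(text_file, encoding="utf-8") as f:
            text_list = [line.strip() for line in islice(f, read_lines)]
    return basic_fn(mode, text_list, **vocab_args)
```

Passing the generators as arguments keeps the sketch self-contained; in practice the two documented functions would fill those roles.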