aac_metrics.utils.tokenization module
- preprocess_mono_sents(
    sentences: list[str],
    cache_path: str | Path | None = None,
    java_path: str | Path | None = None,
    tmp_path: str | Path | None = None,
    punctuations: Iterable[str] = ("''", "'", '``', '`', '-LRB-', '-RRB-', '-LCB-', '-RCB-', '.', '?', '!', ',', ':', '-', '--', '...', ';'),
    normalize_apostrophe: bool = False,
    verbose: int = 0,
  )
Tokenize sentences using the PTB Tokenizer, then merge the tokens of each sentence with spaces.
Warning
The PTB tokenizer is a Java program that takes a list[str] as input, so calling this function repeatedly on each element of a list[list[str]] is slow.
To process multiple sentences (list[list[str]]), use preprocess_mult_sents() instead.
- Parameters:
- sentences: list[str]
The list of sentences to process.
- cache_path: str | Path | None = None
The path to the external code directory. Defaults to the value returned by get_default_cache_path().
- java_path: str | Path | None = None
The path to the Java executable. Defaults to the value returned by get_default_java_path().
- tmp_path: str | Path | None = None
The path to a temporary directory. Defaults to the value returned by get_default_tmp_path().
- punctuations: Iterable[str] = ("''", "'", '``', '`', '-LRB-', '-RRB-', '-LCB-', '-RCB-', '.', '?', '!', ',', ':', '-', '--', '...', ';')
The set of punctuation tokens to remove. Defaults to PTB_PUNCTUATIONS.
- normalize_apostrophe: bool = False
If True, add apostrophes for the French language. Defaults to False.
- verbose: int = 0
The verbosity level. Defaults to 0.
- Returns:
The sentences processed by the tokenizer.
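The post-processing described above (dropping punctuation tokens, then merging the remaining tokens with spaces) can be sketched in plain Python. This is only an illustration under stated assumptions: merge_tokens is a hypothetical helper, not part of aac_metrics, and its input is assumed to be already tokenized, whereas the real function delegates tokenization to the Java PTB Tokenizer.

```python
from typing import Iterable

# Punctuation tokens copied from the default value of `punctuations` above.
PTB_PUNCTUATIONS = (
    "''", "'", "``", "`", "-LRB-", "-RRB-", "-LCB-", "-RCB-",
    ".", "?", "!", ",", ":", "-", "--", "...", ";",
)


def merge_tokens(
    tokens: Iterable[str],
    punctuations: Iterable[str] = PTB_PUNCTUATIONS,
) -> str:
    """Hypothetical helper: drop punctuation tokens, then join the rest with spaces."""
    to_remove = set(punctuations)
    return " ".join(token for token in tokens if token not in to_remove)


# Tokens written by hand in a PTB-like style (assumed, not actual tokenizer output).
tokens = ["a", "dog", "barks", ",", "then", "stops", "."]
merged = merge_tokens(tokens)
print(merged)
```

Passing an empty iterable as punctuations keeps every token, which is one way to see what the filtering step removes.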
- preprocess_mult_sents(
    mult_sentences: list[list[str]],
    cache_path: str | Path | None = None,
    java_path: str | Path | None = None,
    tmp_path: str | Path | None = None,
    punctuations: Iterable[str] = ("''", "'", '``', '`', '-LRB-', '-RRB-', '-LCB-', '-RCB-', '.', '?', '!', ',', ':', '-', '--', '...', ';'),
    normalize_apostrophe: bool = False,
    verbose: int = 0,
  )
Tokenize multiple lists of sentences with a single call to the PTB Tokenizer, then merge the tokens of each sentence with spaces.
- Parameters:
- mult_sentences: list[list[str]]
The list of lists of sentences to process.
- cache_path: str | Path | None = None
The path to the external code directory. Defaults to the value returned by get_default_cache_path().
- java_path: str | Path | None = None
The path to the Java executable. Defaults to the value returned by get_default_java_path().
- tmp_path: str | Path | None = None
The path to a temporary directory. Defaults to the value returned by get_default_tmp_path().
- normalize_apostrophe: bool = False
If True, add apostrophes for the French language. Defaults to False.
- verbose: int = 0
The verbosity level. Defaults to 0.
- Returns:
The multiple sentences processed by the tokenizer.
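One way to process a list[list[str]] with a single tokenizer call is to flatten it, process the flat list once, and then restore the nesting. The sketch below illustrates that pattern only; it is not taken from the aac_metrics source. process_in_one_call and fake_batch are hypothetical names, and fake_batch stands in for the slow Java call.

```python
from typing import Callable


def process_in_one_call(
    mult_sentences: list[list[str]],
    process_batch: Callable[[list[str]], list[str]],
) -> list[list[str]]:
    """Flatten nested sentences, process them in one call, then restore the nesting."""
    # Remember how many sentences each inner list holds.
    sizes = [len(sentences) for sentences in mult_sentences]
    flat = [sentence for sentences in mult_sentences for sentence in sentences]

    # Single call to the (expensive) batch processor.
    flat_processed = process_batch(flat)
    if len(flat_processed) != len(flat):
        raise ValueError("process_batch must return one output per input sentence")

    # Slice the flat result back into groups of the original sizes.
    result: list[list[str]] = []
    start = 0
    for size in sizes:
        result.append(flat_processed[start:start + size])
        start += size
    return result


calls: list[int] = []


def fake_batch(sentences: list[str]) -> list[str]:
    # Stand-in for the tokenizer (assumption): lowercases and records each call.
    calls.append(len(sentences))
    return [sentence.lower() for sentence in sentences]


refs = [["A dog barks", "Dog barking"], ["Rain falls"]]
out = process_in_one_call(refs, fake_batch)
print(out, calls)
```

Recording the call sizes makes the point of the pattern visible: three sentences spread over two inner lists still cost one batch call.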
- ptb_tokenize_batch(
    sentences: Iterable[str],
    audio_ids: Iterable[Hashable] | None = None,
    cache_path: str | Path | None = None,
    java_path: str | Path | None = None,
    tmp_path: str | Path | None = None,
    punctuations: Iterable[str] = ("''", "'", '``', '`', '-LRB-', '-RRB-', '-LCB-', '-RCB-', '.', '?', '!', ',', ':', '-', '--', '...', ';'),
    normalize_apostrophe: bool = False,
    verbose: int = 0,
  )
Tokenize sentences with the PTB Tokenizer. Because each call is slow, call this function once with all the sentences of a subset rather than once per sentence.
- Parameters:
- sentences: Iterable[str]
The sentences to tokenize.
- audio_ids: Iterable[Hashable] | None = None
Optional audio names passed to the PTB Tokenizer program. If None, the audio index is used as the name. Defaults to None.
- cache_path: str | Path | None = None
The path to the external directory containing the JAR program. Defaults to the value returned by get_default_cache_path().
- java_path: str | Path | None = None
The path to the Java executable. Defaults to the value returned by get_default_java_path().
- tmp_path: str | Path | None = None
The path to a temporary directory. Defaults to the value returned by get_default_tmp_path().
- normalize_apostrophe: bool = False
If True, add apostrophes for the French language. Defaults to False.
- verbose: int = 0
The verbosity level. Defaults to 0.
- Returns:
The tokenized sentences as list[list[str]].
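The fallback described for audio_ids (use the index as the name when None is given) can be sketched as follows. resolve_audio_ids is a hypothetical helper written for illustration, and the length check is an assumption about reasonable behaviour rather than something this page documents.

```python
from typing import Hashable, Iterable, Optional


def resolve_audio_ids(
    sentences: Iterable[str],
    audio_ids: Optional[Iterable[Hashable]] = None,
) -> list[Hashable]:
    """Hypothetical helper: return one identifier per sentence."""
    sentences = list(sentences)
    if audio_ids is None:
        # No names given: fall back to the position of each sentence.
        return list(range(len(sentences)))

    ids = list(audio_ids)
    # Assumed sanity check: one identifier per sentence.
    if len(ids) != len(sentences):
        raise ValueError(
            f"Got {len(ids)} audio ids for {len(sentences)} sentences."
        )
    return ids


default_ids = resolve_audio_ids(["a dog barks", "rain falls"])
named_ids = resolve_audio_ids(["a dog barks", "rain falls"], ["dog.wav", "rain.wav"])
print(default_ids, named_ids)
```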