aac_metrics.utils.tokenization module

preprocess_mono_sents(sentences: list[str], cache_path: str | Path | None = None, java_path: str | Path | None = None, tmp_path: str | Path | None = None, punctuations: Iterable[str] = ("''", "'", '``', '`', '-LRB-', '-RRB-', '-LCB-', '-RCB-', '.', '?', '!', ',', ':', '-', '--', '...', ';'), normalize_apostrophe: bool = False, verbose: int = 0) → list[str]

Tokenize sentences with the PTB Tokenizer, then merge each sentence's tokens back into a single space-separated string.

Warning

The PTB tokenizer is a Java program that takes a list[str] as input, so calling this function repeatedly, once per inner list of a list[list[str]], is slow.

To process multiple lists of sentences (list[list[str]]), use preprocess_mult_sents() instead.

Parameters:

sentences: list[str]: The list of sentences to process.
cache_path: str | Path | None = None: The path to the external code directory. Defaults to the value returned by get_default_cache_path().
java_path: str | Path | None = None: The path to the java executable. Defaults to the value returned by get_default_java_path().
tmp_path: str | Path | None = None: Temporary directory path. Defaults to the value returned by get_default_tmp_path().
punctuations: Iterable[str] = ("''", "'", '``', '`', '-LRB-', '-RRB-', '-LCB-', '-RCB-', '.', '?', '!', ',', ':', '-', '--', '...', ';'): Punctuation tokens to remove. Defaults to PTB_PUNCTUATIONS.
normalize_apostrophe: bool = False: If True, add apostrophes for French language. Defaults to False.
verbose: int = 0: The verbose level. Defaults to 0.

Returns:

The sentences processed by the tokenizer.
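
The post-tokenization step described above (remove punctuation tokens, then merge by space) can be sketched as follows. This is a minimal illustration, not the library's implementation: drop_punct_and_join is a hypothetical helper, the PUNCTUATIONS tuple is copied from the signature above, and the token list is written by hand in place of real PTB Tokenizer output, which requires Java.

```python
from typing import Iterable

# Default punctuation tokens, copied from the signature above.
PUNCTUATIONS = (
    "''", "'", "``", "`", "-LRB-", "-RRB-", "-LCB-", "-RCB-",
    ".", "?", "!", ",", ":", "-", "--", "...", ";",
)


def drop_punct_and_join(
    tokens: Iterable[str],
    punctuations: Iterable[str] = PUNCTUATIONS,
) -> str:
    """Hypothetical helper: drop punctuation tokens, then join the rest with spaces."""
    excluded = set(punctuations)
    return " ".join(tok for tok in tokens if tok not in excluded)


# Hand-written tokens standing in for PTB Tokenizer output (assumption).
tokens = ["a", "dog", "barks", ",", "then", "stops", "."]
cleaned = drop_punct_and_join(tokens)
print(cleaned)
```

Passing a different iterable as punctuations changes which tokens are dropped; an empty tuple keeps every token.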

preprocess_mult_sents(mult_sentences: list[list[str]], cache_path: str | Path | None = None, java_path: str | Path | None = None, tmp_path: str | Path | None = None, punctuations: Iterable[str] = ("''", "'", '``', '`', '-LRB-', '-RRB-', '-LCB-', '-RCB-', '.', '?', '!', ',', ':', '-', '--', '...', ';'), normalize_apostrophe: bool = False, verbose: int = 0) → list[list[str]]

Tokenize multiple lists of sentences with a single call to the PTB Tokenizer, then merge each sentence's tokens back into a space-separated string.

Parameters:

mult_sentences: list[list[str]]: The list of lists of sentences to process.
cache_path: str | Path | None = None: The path to the external code directory. Defaults to the value returned by get_default_cache_path().
java_path: str | Path | None = None: The path to the java executable. Defaults to the value returned by get_default_java_path().
tmp_path: str | Path | None = None: Temporary directory path. Defaults to the value returned by get_default_tmp_path().
punctuations: Iterable[str] = ("''", "'", '``', '`', '-LRB-', '-RRB-', '-LCB-', '-RCB-', '.', '?', '!', ',', ':', '-', '--', '...', ';'): Punctuation tokens to remove. Defaults to PTB_PUNCTUATIONS.
normalize_apostrophe: bool = False: If True, add apostrophes for French language. Defaults to False.
verbose: int = 0: The verbose level. Defaults to 0.

Returns:

The multiple sentences processed by the tokenizer.
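
One way to process a list[list[str]] with only one tokenizer call is to flatten it, process the flat list once, and restore the original grouping. The sketch below illustrates that pattern under stated assumptions: flatten, unflatten and process_in_one_call are hypothetical helpers, fake_batch_tokenizer is a stand-in for the Java program, and whether the library follows exactly this scheme internally is not confirmed here.

```python
from typing import Callable


def flatten(mult_sentences: list[list[str]]) -> tuple[list[str], list[int]]:
    """Concatenate all inner lists and remember the size of each one."""
    flat = [sent for group in mult_sentences for sent in group]
    sizes = [len(group) for group in mult_sentences]
    return flat, sizes


def unflatten(flat: list[str], sizes: list[int]) -> list[list[str]]:
    """Rebuild the nested structure from the flat list and the saved sizes."""
    groups = []
    start = 0
    for size in sizes:
        groups.append(flat[start:start + size])
        start += size
    return groups


def process_in_one_call(
    mult_sentences: list[list[str]],
    batch_fn: Callable[[list[str]], list[str]],
) -> list[list[str]]:
    """Apply batch_fn once to every sentence, keeping the original grouping."""
    flat, sizes = flatten(mult_sentences)
    processed = batch_fn(flat)  # the only call to the (slow) batch function
    return unflatten(processed, sizes)


# Stand-in for the tokenizer (assumption): lowercases and records each call.
calls = []


def fake_batch_tokenizer(sentences: list[str]) -> list[str]:
    calls.append(len(sentences))
    return [sent.lower() for sent in sentences]


refs = [["A dog barks", "Dog barking"], ["Rain falls"]]
result = process_in_one_call(refs, fake_batch_tokenizer)
print(result)
print(calls)
```

The calls list records a single invocation covering all three sentences, which is the property that makes this approach faster than looping over the inner lists.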

ptb_tokenize_batch(sentences: Iterable[str], audio_ids: Iterable[Hashable] | None = None, cache_path: str | Path | None = None, java_path: str | Path | None = None, tmp_path: str | Path | None = None, punctuations: Iterable[str] = ("''", "'", '``', '`', '-LRB-', '-RRB-', '-LCB-', '-RCB-', '.', '?', '!', ',', ':', '-', '--', '...', ';'), normalize_apostrophe: bool = False, verbose: int = 0) → list[list[str]]

Tokenize sentences with the PTB Tokenizer. Because each call is slow, pass all the sentences of a subset in a single call rather than calling this function once per sentence.

Parameters:

sentences: Iterable[str]: The sentences to tokenize.
audio_ids: Iterable[Hashable] | None = None: Optional audio names passed to the PTB Tokenizer program. If None, the index of each audio is used as its name. Defaults to None.
cache_path: str | Path | None = None: The path to the external directory containing the JAR program. Defaults to the value returned by get_default_cache_path().
java_path: str | Path | None = None: The path to the java executable. Defaults to the value returned by get_default_java_path().
tmp_path: str | Path | None = None: The path to a temporary directory. Defaults to the value returned by get_default_tmp_path().
punctuations: Iterable[str] = ("''", "'", '``', '`', '-LRB-', '-RRB-', '-LCB-', '-RCB-', '.', '?', '!', ',', ':', '-', '--', '...', ';'): Punctuation tokens to remove. Defaults to PTB_PUNCTUATIONS.
normalize_apostrophe: bool = False: If True, add apostrophes for French language. Defaults to False.
verbose: int = 0: The verbose level. Defaults to 0.

Returns:

The sentences tokenized as list[list[str]].
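
The audio_ids fallback described above (use the index as the name when None is given) can be illustrated with a small sketch. resolve_audio_ids is a hypothetical helper, not part of the library, and the length check is an added assumption rather than documented behaviour.

```python
from typing import Hashable, Iterable, Optional


def resolve_audio_ids(
    sentences: Iterable[str],
    audio_ids: Optional[Iterable[Hashable]] = None,
) -> list[Hashable]:
    """Hypothetical helper: return one name per sentence, using indexes if none are given."""
    sentences = list(sentences)
    if audio_ids is None:
        # No names provided: fall back to the position of each sentence.
        return list(range(len(sentences)))
    ids = list(audio_ids)
    if len(ids) != len(sentences):
        # Assumption: names and sentences must pair up one to one.
        raise ValueError("audio_ids and sentences must have the same length")
    return ids


default_ids = resolve_audio_ids(["a dog barks", "rain falls"])
named_ids = resolve_audio_ids(["a dog barks", "rain falls"], ["clip_a.wav", "clip_b.wav"])
print(default_ids)
print(named_ids)
```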