aac_metrics.utils.tokenization module

preprocess_mono_sents(sentences: list[str], cache_path: str | Path | None = None, java_path: str | Path | None = None, tmp_path: str | Path | None = None, punctuations: Iterable[str] = ("''", "'", '``', '`', '-LRB-', '-RRB-', '-LCB-', '-RCB-', '.', '?', '!', ',', ':', '-', '--', '...', ';'), normalize_apostrophe: bool = False, verbose: int = 0) → list[str]

Tokenize sentences using the PTB Tokenizer, then merge the tokens of each sentence with spaces.

Warning

The PTB Tokenizer is a Java program that takes a list[str] as input, so calling this function once for each inner list of a list[list[str]] is slow.

To process multiple lists of sentences (list[list[str]]), use preprocess_mult_sents() instead.

Parameters:

sentences – The list of sentences to process.
cache_path – The path to the external code directory. Defaults to the value returned by get_default_cache_path().
java_path – The path to the Java executable. Defaults to the value returned by get_default_java_path().
tmp_path – The path to a temporary directory. Defaults to the value returned by get_default_tmp_path().
punctuations – The punctuation tokens to remove. Defaults to PTB_PUNCTUATIONS.
normalize_apostrophe – If True, add apostrophes for French language. Defaults to False.
verbose – The verbosity level. Defaults to 0.

Returns:

The sentences processed by the tokenizer.
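
The punctuation removal and space-merging described above can be illustrated without running the Java program. The following is a minimal sketch, not the library's implementation: `drop_punctuations_and_merge` is a hypothetical helper, and the token lists are hand-written stand-ins rather than real PTB Tokenizer output. Only the default punctuation tuple is copied from the signature.

```python
from __future__ import annotations

from typing import Iterable

# Default punctuation tokens, copied from the function signature above.
PTB_PUNCTUATIONS = (
    "''", "'", "``", "`", "-LRB-", "-RRB-", "-LCB-", "-RCB-",
    ".", "?", "!", ",", ":", "-", "--", "...", ";",
)


def drop_punctuations_and_merge(
    tokenized: list[list[str]],
    punctuations: Iterable[str] = PTB_PUNCTUATIONS,
) -> list[str]:
    """Hypothetical helper: remove punctuation tokens, then join by space."""
    to_remove = set(punctuations)
    return [
        " ".join(token for token in tokens if token not in to_remove)
        for tokens in tokenized
    ]


# Hand-written token lists standing in for tokenizer output (assumption).
tokenized = [
    ["a", "dog", "barks", ",", "then", "stops", "."],
    ["-LRB-", "rain", "-RRB-", "falls", "!"],
]
merged = drop_punctuations_and_merge(tokenized)
print(merged)  # ['a dog barks then stops', 'rain falls']
```

Passing an empty iterable as `punctuations` in this sketch keeps every token, which mirrors how the `punctuations` parameter controls what is removed.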

preprocess_mult_sents(mult_sentences: list[list[str]], cache_path: str | Path | None = None, java_path: str | Path | None = None, tmp_path: str | Path | None = None, punctuations: Iterable[str] = ("''", "'", '``', '`', '-LRB-', '-RRB-', '-LCB-', '-RCB-', '.', '?', '!', ',', ':', '-', '--', '...', ';'), normalize_apostrophe: bool = False, verbose: int = 0) → list[list[str]]

Tokenize multiple lists of sentences with a single call to the PTB Tokenizer, then merge the tokens of each sentence with spaces.

Parameters:

mult_sentences – The list of lists of sentences to process.
cache_path – The path to the external code directory. Defaults to the value returned by get_default_cache_path().
java_path – The path to the Java executable. Defaults to the value returned by get_default_java_path().
tmp_path – The path to a temporary directory. Defaults to the value returned by get_default_tmp_path().
punctuations – The punctuation tokens to remove. Defaults to PTB_PUNCTUATIONS.
normalize_apostrophe – If True, add apostrophes for French language. Defaults to False.
verbose – The verbosity level. Defaults to 0.

Returns:

The multiple sentences processed by the tokenizer.
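
One plausible way to process a list[list[str]] with only one tokenizer call is to flatten the input, process it once, and restore the original grouping. The sketch below shows that pattern under stated assumptions: `process_mult_in_one_call` and `fake_tokenizer` are hypothetical names, the fake tokenizer merely lowercases its input, and nothing here is taken from the library's source.

```python
from __future__ import annotations

from typing import Callable


def process_mult_in_one_call(
    mult_sentences: list[list[str]],
    process_flat: Callable[[list[str]], list[str]],
) -> list[list[str]]:
    """Hypothetical helper: flatten, process once, then restore the grouping."""
    # Remember the size of each group so the structure can be rebuilt.
    sizes = [len(sentences) for sentences in mult_sentences]
    flat = [sent for sentences in mult_sentences for sent in sentences]

    processed = process_flat(flat)  # the only call to the (slow) processor

    result = []
    start = 0
    for size in sizes:
        result.append(processed[start:start + size])
        start += size
    return result


calls = []  # records the batch size of each call to the stand-in tokenizer


def fake_tokenizer(sentences: list[str]) -> list[str]:
    """Stand-in for the Java tokenizer (assumption): lowercases each sentence."""
    calls.append(len(sentences))
    return [sentence.lower() for sentence in sentences]


mult = [["A dog barks", "Dog barking"], ["Rain falls"]]
result = process_mult_in_one_call(mult, fake_tokenizer)
print(result)  # [['a dog barks', 'dog barking'], ['rain falls']]
print(calls)   # [3]
```

The `calls` list shows that the stand-in processor ran once over all three sentences, which is the cost profile the warning on preprocess_mono_sents() recommends.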

ptb_tokenize_batch(sentences: Iterable[str], audio_ids: Iterable[Hashable] | None = None, cache_path: str | Path | None = None, java_path: str | Path | None = None, tmp_path: str | Path | None = None, punctuations: Iterable[str] = ("''", "'", '``', '`', '-LRB-', '-RRB-', '-LCB-', '-RCB-', '.', '?', '!', ',', ':', '-', '--', '...', ';'), normalize_apostrophe: bool = False, verbose: int = 0) → list[list[str]]

Use the PTB Tokenizer to process sentences. Because each call is slow, pass all the sentences of a subset in a single call.

Parameters:

sentences – The sentences to tokenize.
audio_ids – The optional audio names for the PTB Tokenizer program. If None, the audio index is used as the name. Defaults to None.
cache_path – The path to the external directory containing the JAR program. Defaults to the value returned by get_default_cache_path().
java_path – The path to the Java executable. Defaults to the value returned by get_default_java_path().
tmp_path – The path to a temporary directory. Defaults to the value returned by get_default_tmp_path().
punctuations – The punctuation tokens to remove. Defaults to PTB_PUNCTUATIONS.
normalize_apostrophe – If True, add apostrophes for French language. Defaults to False.
verbose – The verbosity level. Defaults to 0.

Returns:

The sentences tokenized as list[list[str]].
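
The fallback described for audio_ids (using the index when no names are given) can be sketched as follows. This is an illustration only: `pair_sentences_with_ids` is a hypothetical helper and does not reflect how the library passes names to the Java program.

```python
from __future__ import annotations

from typing import Hashable, Iterable


def pair_sentences_with_ids(
    sentences: Iterable[str],
    audio_ids: Iterable[Hashable] | None = None,
) -> list[tuple[Hashable, str]]:
    """Hypothetical helper: attach a name to each sentence.

    When audio_ids is None, the position of the sentence is used as its name.
    """
    sentences = list(sentences)
    if audio_ids is None:
        ids = list(range(len(sentences)))
    else:
        ids = list(audio_ids)
    return list(zip(ids, sentences))


default_pairs = pair_sentences_with_ids(["a dog barks", "rain falls"])
named_pairs = pair_sentences_with_ids(
    ["a dog barks", "rain falls"],
    audio_ids=["a.wav", "b.wav"],
)
print(default_pairs)  # [(0, 'a dog barks'), (1, 'rain falls')]
print(named_pairs)    # [('a.wav', 'a dog barks'), ('b.wav', 'rain falls')]
```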