aac_metrics.utils.tokenization module

preprocess_mono_sents(
sentences: list[str],
cache_path: str | Path | None = None,
java_path: str | Path | None = None,
tmp_path: str | Path | None = None,
punctuations: Iterable[str] = ("''", "'", '``', '`', '-LRB-', '-RRB-', '-LCB-', '-RCB-', '.', '?', '!', ',', ':', '-', '--', '...', ';'),
normalize_apostrophe: bool = False,
verbose: int = 0,
) → list[str]

Tokenize sentences using the PTB Tokenizer, then join the tokens of each sentence with spaces.

Warning

The PTB Tokenizer is a Java program that takes a list[str] as input, so calling this function once per inner list of a list[list[str]] is slow.

To process multiple lists of sentences (list[list[str]]), use preprocess_mult_sents() instead.

Parameters:
  • sentences – The list of sentences to process.

  • cache_path – The path to the external code directory. Defaults to the value returned by get_default_cache_path().

  • java_path – The path to the Java executable. Defaults to the value returned by get_default_java_path().

  • tmp_path – The path to a temporary directory. Defaults to the value returned by get_default_tmp_path().

  • punctuations – The set of punctuation tokens to remove. Defaults to PTB_PUNCTUATIONS.

  • normalize_apostrophe – If True, add apostrophes for the French language. Defaults to False.

  • verbose – The verbosity level. Defaults to 0.

Returns:

The sentences processed by the tokenizer.
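
The punctuation removal and space-joining described above can be sketched without the Java tokenizer. This is a minimal illustration, not the library's implementation: `merge_tokens` is a hypothetical helper, and the token list is hand-written to stand in for real PTB Tokenizer output. Only the default punctuation tuple is copied from the signature.

```python
from typing import Iterable

# Default punctuation tokens, copied from the signature above.
PTB_PUNCTUATIONS = (
    "''", "'", "``", "`", "-LRB-", "-RRB-", "-LCB-", "-RCB-",
    ".", "?", "!", ",", ":", "-", "--", "...", ";",
)


def merge_tokens(
    tokens: list[str],
    punctuations: Iterable[str] = PTB_PUNCTUATIONS,
) -> str:
    """Hypothetical helper: drop punctuation tokens, then join with spaces."""
    to_remove = set(punctuations)
    return " ".join(tok for tok in tokens if tok not in to_remove)


# Hand-written tokens standing in for real tokenizer output (assumption).
tokens = ["a", "dog", "barks", ",", "then", "stops", "."]
merged = merge_tokens(tokens)
print(merged)
```

Passing an empty `punctuations` iterable to this helper keeps every token, which mirrors how the `punctuations` parameter is described as the set of tokens to remove.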

preprocess_mult_sents(
mult_sentences: list[list[str]],
cache_path: str | Path | None = None,
java_path: str | Path | None = None,
tmp_path: str | Path | None = None,
punctuations: Iterable[str] = ("''", "'", '``', '`', '-LRB-', '-RRB-', '-LCB-', '-RCB-', '.', '?', '!', ',', ':', '-', '--', '...', ';'),
normalize_apostrophe: bool = False,
verbose: int = 0,
) → list[list[str]]

Tokenize multiple lists of sentences with a single call to the PTB Tokenizer, then join the tokens of each sentence with spaces.

Parameters:
  • mult_sentences – The list of lists of sentences to process.

  • cache_path – The path to the external code directory. Defaults to the value returned by get_default_cache_path().

  • java_path – The path to the Java executable. Defaults to the value returned by get_default_java_path().

  • tmp_path – The path to a temporary directory. Defaults to the value returned by get_default_tmp_path().

  • punctuations – The set of punctuation tokens to remove. Defaults to PTB_PUNCTUATIONS.

  • normalize_apostrophe – If True, add apostrophes for the French language. Defaults to False.

  • verbose – The verbosity level. Defaults to 0.

Returns:

The multiple sentences processed by the tokenizer.
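
One common way to process a list[list[str]] with a single call to a slow batch function is to flatten the input, process it once, and regroup the results. The sketch below shows that pattern under stated assumptions: `apply_in_one_call` and `fake_tokenizer` are hypothetical names, and the stand-in tokenizer only lowercases its input. It is not taken from the library's source.

```python
from typing import Callable


def apply_in_one_call(
    mult_sentences: list[list[str]],
    process_batch: Callable[[list[str]], list[str]],
) -> list[list[str]]:
    """Flatten, call the batch function once, then restore the grouping."""
    flat = [sent for sents in mult_sentences for sent in sents]
    processed = process_batch(flat)
    if len(processed) != len(flat):
        raise ValueError("batch function must return one output per input")

    result = []
    start = 0
    for sents in mult_sentences:
        stop = start + len(sents)
        result.append(processed[start:stop])
        start = stop
    return result


calls = []  # records the size of each batch, to show there is only one call


def fake_tokenizer(batch: list[str]) -> list[str]:
    """Stand-in for the Java tokenizer (assumption): lowercases each sentence."""
    calls.append(len(batch))
    return [sent.lower() for sent in batch]


refs = [["A dog barks", "Dog barking"], ["Rain falls"]]
out = apply_in_one_call(refs, fake_tokenizer)
print(out)
print(calls)
```

The cost of starting an external program is paid once for the whole input instead of once per inner list, which is the reason the warning on preprocess_mono_sents() recommends this function.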

ptb_tokenize_batch(
sentences: Iterable[str],
audio_ids: Iterable[Hashable] | None = None,
cache_path: str | Path | None = None,
java_path: str | Path | None = None,
tmp_path: str | Path | None = None,
punctuations: Iterable[str] = ("''", "'", '``', '`', '-LRB-', '-RRB-', '-LCB-', '-RCB-', '.', '?', '!', ',', ':', '-', '--', '...', ';'),
normalize_apostrophe: bool = False,
verbose: int = 0,
) → list[list[str]]

Use the PTB Tokenizer to process sentences. Because each call is slow, call this function once with all the sentences of a subset rather than repeatedly on smaller batches.

Parameters:
  • sentences – The sentences to tokenize.

  • audio_ids – Optional audio names for the PTB Tokenizer program. If None, the audio index is used as the name. Defaults to None.

  • cache_path – The path to the external directory containing the JAR program. Defaults to the value returned by get_default_cache_path().

  • java_path – The path to the Java executable. Defaults to the value returned by get_default_java_path().

  • tmp_path – The path to a temporary directory. Defaults to the value returned by get_default_tmp_path().

  • punctuations – The set of punctuation tokens to remove. Defaults to PTB_PUNCTUATIONS.

  • normalize_apostrophe – If True, add apostrophes for the French language. Defaults to False.

  • verbose – The verbosity level. Defaults to 0.

Returns:

The sentences tokenized as list[list[str]].
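
The `audio_ids` fallback described above (use the audio index as the name when None is given) can be sketched as follows. `resolve_audio_ids` is a hypothetical helper written for illustration. The length check is an added assumption, not documented behavior of ptb_tokenize_batch().

```python
from typing import Hashable, Iterable, Optional


def resolve_audio_ids(
    sentences: Iterable[str],
    audio_ids: Optional[Iterable[Hashable]] = None,
) -> list[Hashable]:
    """Hypothetical helper: return one identifier per sentence."""
    sentences = list(sentences)
    if audio_ids is None:
        # Fall back to the position of each sentence as its name.
        return list(range(len(sentences)))

    ids = list(audio_ids)
    # Assumption: a mismatch between ids and sentences is treated as an error.
    if len(ids) != len(sentences):
        raise ValueError(
            f"expected {len(sentences)} audio ids, got {len(ids)}"
        )
    return ids


default_ids = resolve_audio_ids(["a dog barks", "rain falls"])
named_ids = resolve_audio_ids(["a dog barks"], ["clip_001.wav"])
print(default_ids)
print(named_ids)
```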