aac_metrics.utils.tokenization module
- preprocess_mono_sents(
- sentences: list[str],
- cache_path: str | Path | None = None,
- java_path: str | Path | None = None,
- tmp_path: str | Path | None = None,
- punctuations: Iterable[str] = ("''", "'", '``', '`', '-LRB-', '-RRB-', '-LCB-', '-RCB-', '.', '?', '!', ',', ':', '-', '--', '...', ';'),
- normalize_apostrophe: bool = False,
- verbose: int = 0,
Tokenize sentences using PTB Tokenizer then merge them by space.
Warning
The PTB tokenizer is a Java program that takes a list[str] as input, so calling this function once per item of a list[list[str]] is slow.
To process multiple sentences per item (list[list[str]]), use preprocess_mult_sents() instead.
- Parameters:
sentences – The list of sentences to process.
cache_path – The path to the external code directory. Defaults to the value returned by get_default_cache_path().
java_path – The path to the Java executable. Defaults to the value returned by get_default_java_path().
tmp_path – Temporary directory path. Defaults to the value returned by get_default_tmp_path().
punctuations – Set of punctuations to remove. Defaults to PTB_PUNCTUATIONS.
normalize_apostrophe – If True, add apostrophes for French language. Defaults to False.
verbose – The verbose level. Defaults to 0.
- Returns:
The sentences processed by the tokenizer.
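The "drop punctuation, then merge by space" step described above can be sketched in pure Python. This is a minimal illustration and not the library's implementation: `merge_tokens` is a hypothetical helper, the token list is assumed to come from a tokenizer that has already run, and the punctuation tuple is copied from the signature above.

```python
from typing import Iterable

# Default punctuation tokens, copied from the function signature above.
PUNCTUATIONS = (
    "''", "'", "``", "`", "-LRB-", "-RRB-", "-LCB-", "-RCB-",
    ".", "?", "!", ",", ":", "-", "--", "...", ";",
)


def merge_tokens(
    tokens: Iterable[str],
    punctuations: Iterable[str] = PUNCTUATIONS,
) -> str:
    """Hypothetical helper: drop punctuation tokens, then join the rest with spaces."""
    banned = set(punctuations)
    return " ".join(tok for tok in tokens if tok not in banned)


# Tokens as a tokenizer might emit them (assumed input, for illustration only).
tokens = ["a", "dog", "barks", ",", "then", "stops", "."]
print(merge_tokens(tokens))  # a dog barks then stops
```

Passing an empty `punctuations` iterable to this sketch keeps every token, which mirrors how the `punctuations` parameter is described as the set of tokens to remove.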
- preprocess_mult_sents(
- mult_sentences: list[list[str]],
- cache_path: str | Path | None = None,
- java_path: str | Path | None = None,
- tmp_path: str | Path | None = None,
- punctuations: Iterable[str] = ("''", "'", '``', '`', '-LRB-', '-RRB-', '-LCB-', '-RCB-', '.', '?', '!', ',', ':', '-', '--', '...', ';'),
- normalize_apostrophe: bool = False,
- verbose: int = 0,
Tokenize multiple sentences using PTB Tokenizer with only one call then merge them by space.
- Parameters:
mult_sentences – The list of lists of sentences to process.
cache_path – The path to the external code directory. Defaults to the value returned by get_default_cache_path().
java_path – The path to the Java executable. Defaults to the value returned by get_default_java_path().
tmp_path – Temporary directory path. Defaults to the value returned by get_default_tmp_path().
normalize_apostrophe – If True, add apostrophes for French language. Defaults to False.
verbose – The verbose level. Defaults to 0.
- Returns:
The multiple sentences processed by the tokenizer.
- ptb_tokenize_batch(
- sentences: Iterable[str],
- audio_ids: Iterable[Hashable] | None = None,
- cache_path: str | Path | None = None,
- java_path: str | Path | None = None,
- tmp_path: str | Path | None = None,
- punctuations: Iterable[str] = ("''", "'", '``', '`', '-LRB-', '-RRB-', '-LCB-', '-RCB-', '.', '?', '!', ',', ':', '-', '--', '...', ';'),
- normalize_apostrophe: bool = False,
- verbose: int = 0,
Use PTB Tokenizer to process sentences. Because each call is slow, pass all the sentences of a subset in a single call rather than calling this function once per sentence.
- Parameters:
sentences – The sentences to tokenize.
audio_ids – The optional audio names for the PTB Tokenizer program. If None, the audio index is used as the name. Defaults to None.
cache_path – The path to the external directory containing the JAR program. Defaults to the value returned by get_default_cache_path().
java_path – The path to the Java executable. Defaults to the value returned by get_default_java_path().
tmp_path – The path to a temporary directory. Defaults to the value returned by get_default_tmp_path().
normalize_apostrophe – If True, add apostrophes for French language. Defaults to False.
verbose – The verbose level. Defaults to 0.
- Returns:
The sentences tokenized as list[list[str]].
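The `audio_ids` default described above (the index is used as the name when `audio_ids` is None) can be illustrated with a small sketch. `resolve_audio_ids` is a hypothetical helper written for this example, and the length check is an assumption about sensible behavior, not something the documentation states.

```python
from typing import Hashable, Iterable, Optional


def resolve_audio_ids(
    sentences: Iterable[str],
    audio_ids: Optional[Iterable[Hashable]] = None,
) -> list[Hashable]:
    """Hypothetical helper: return one identifier per sentence."""
    sentences = list(sentences)
    if audio_ids is None:
        # No names given: fall back to the position of each sentence.
        return list(range(len(sentences)))

    ids = list(audio_ids)
    if len(ids) != len(sentences):
        # Assumption: mismatched lengths are treated as an error in this sketch.
        raise ValueError(
            f"Expected {len(sentences)} audio ids, got {len(ids)}."
        )
    return ids


captions = ["a dog barks", "rain falls"]
print(resolve_audio_ids(captions))                        # [0, 1]
print(resolve_audio_ids(captions, ["dog.wav", "rain.wav"]))  # ['dog.wav', 'rain.wav']
```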