aac_metrics.functional package

bert_score_mrefs( candidates: list[str], mult_references: list[list[str]], return_all_scores: bool = True, *, model: str | Module = 'roberta-large', tokenizer: Callable | None = None, device: str | device | None = 'cuda_if_available', batch_size: int | None = 32, num_threads: int = 0, max_length: int = 64, reset_state: bool = True, idf: bool = False, reduction: 'mean' | 'max' | 'min' | Callable[[...], Tensor] = 'max', filter_nan: bool = True, verbose: int = 0, ) → tuple[BERTScoreMRefsScores, BERTScoreMRefsScores] | Tensor[source]¶

BERTScore metric which supports multiple references.

The implementation is based on the bert_score implementation of torchmetrics.

Paper: https://arxiv.org/pdf/1904.09675.pdf

Parameters:¶

candidates: list[str]¶: The list of sentences to evaluate.
mult_references: list[list[str]]¶: The list of list of sentences used as target.
return_all_scores: bool = True¶: If True, returns a tuple containing the globals and locals scores. Otherwise returns a scalar tensor containing the main global score. defaults to True.
model: str | Module = 'roberta-large'¶: The model name or the instantiated model to use to compute token embeddings. defaults to “roberta-large”.
tokenizer: Callable | None = None¶: The fast tokenizer used to split sentences into words. If None, use the tokenizer corresponding to the model argument. defaults to None.
device: str | device | None = 'cuda_if_available'¶: The PyTorch device used to run the BERT model. defaults to “cuda_if_available”.
batch_size: int | None = 32¶: The batch size used in the model forward.
num_threads: int = 0¶: A number of threads to use for a dataloader. defaults to 0.
max_length: int = 64¶: Max length when encoding sentences to tensor ids. defaults to 64.
idf: bool = False¶: Whether to use inverse document frequency (IDF) to weight the BERTScores. defaults to False.
reduction: 'mean' | 'max' | 'min' | Callable[[...], Tensor] = 'max'¶: The reduction function to apply between multiple references for each audio. defaults to “max”.
filter_nan: bool = True¶: If True, replace NaN scores by 0.0. defaults to True.
verbose: int = 0¶: The verbose level. defaults to 0.

Returns:¶

A tuple of globals and locals scores or a scalar tensor with the main global score.
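The reduction parameter decides how the scores obtained against each reference are combined into one score per candidate. The following is a minimal pure-Python sketch of that idea, using floats instead of tensors; `reduce_over_references` is a hypothetical helper and not part of aac-metrics.

```python
from statistics import mean
from typing import Callable, Sequence, Union

# Either one of the documented names or a user-supplied callable.
Reduction = Union[str, Callable[[Sequence[float]], float]]


def reduce_over_references(scores_per_ref: Sequence[float], reduction: Reduction = "max") -> float:
    """Combine the scores of one candidate against each of its references."""
    named = {"mean": mean, "max": max, "min": min}
    fn = named[reduction] if isinstance(reduction, str) else reduction
    return float(fn(scores_per_ref))


# Made-up F1 scores of one candidate against its three references.
f1_per_ref = [0.5, 0.75, 0.25]
best = reduce_over_references(f1_per_ref, "max")      # keeps the best-matching reference
average = reduce_over_references(f1_per_ref, "mean")  # averages over all references
```

With "max", a candidate is judged by its closest reference, which is the documented default; "mean" rewards candidates that agree with all references.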

bleu(candidates: list[str], mult_references: list[list[str]], return_all_scores: bool = True, *, n: int = 4, option: ~typing.Literal['shortest', 'average', 'closest'] = 'closest', verbose: int = 0, tokenizer: ~typing.Callable[[str], list[str]] = <method 'split' of 'str' objects>, return_1_to_n: bool = False) → tuple[dict[str, Tensor], dict[str, Tensor]] | Tensor[source]¶

BiLingual Evaluation Understudy function.

Paper: https://www.aclweb.org/anthology/P02-1040.pdf

Note: this version of the BLEU metric applies a penalty formula that depends on the size of all candidates and the length of the references, which means that the average score of the candidates is not equal to the corpus score.

Parameters:¶

candidates: The list of sentences to evaluate.
mult_references: The list of list of sentences used as target.
return_all_scores: If True, returns a tuple containing the globals and locals scores. Otherwise returns a scalar tensor containing the main global score. defaults to True.
n: Maximal number of n-grams taken into account. defaults to 4.
option: Corpus reference length mode. Can be “shortest”, “average” or “closest”. defaults to “closest”.
verbose: The verbose level. defaults to 0.
tokenizer: The fast tokenizer used to split sentences into words. defaults to str.split.
return_1_to_n: If True, returns the n-gram results for every order from 1 to n. Otherwise returns only the n-gram scores. defaults to False.

Returns:¶

A tuple of globals and locals scores or a scalar tensor with the main global score.
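The option parameter and the note about the corpus-level penalty can be illustrated with the standard BLEU brevity penalty. This is a simplified sketch working on token counts only; `reference_length` and `brevity_penalty` are hypothetical helpers, and the tie-breaking rule for "closest" is an assumption rather than a statement about the library.

```python
import math
from typing import Sequence


def reference_length(cand_len: int, ref_lens: Sequence[int], option: str = "closest") -> float:
    """Pick the effective reference length for one candidate."""
    if option == "shortest":
        return float(min(ref_lens))
    if option == "average":
        return sum(ref_lens) / len(ref_lens)
    if option == "closest":
        # Assumption: ties are broken in favour of the shorter reference.
        return float(min(ref_lens, key=lambda r: (abs(r - cand_len), r)))
    raise ValueError(f"Unknown option: {option!r}")


def brevity_penalty(total_cand_len: int, total_ref_len: float) -> float:
    """Standard BLEU brevity penalty computed from corpus-level lengths."""
    if total_cand_len == 0:
        return 0.0
    if total_cand_len >= total_ref_len:
        return 1.0
    return math.exp(1.0 - total_ref_len / total_cand_len)


# Made-up token counts for two candidates and their references.
cand_lens = [4, 9]
mult_ref_lens = [[5, 7], [8, 10]]

for option in ("shortest", "average", "closest"):
    total_ref = sum(reference_length(c, refs, option) for c, refs in zip(cand_lens, mult_ref_lens))
    print(option, total_ref, round(brevity_penalty(sum(cand_lens), total_ref), 4))
```

Because the penalty is computed from lengths summed over the whole corpus, averaging per-sentence scores does not generally reproduce the corpus score.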

bleu_1(candidates: list[str], mult_references: list[list[str]], return_all_scores: bool = True, *, option: ~typing.Literal['shortest', 'average', 'closest'] = 'closest', verbose: int = 0, tokenizer: ~typing.Callable[[str], list[str]] = <method 'split' of 'str' objects>, return_1_to_n: bool = False) → tuple[dict[str, Tensor], dict[str, Tensor]] | Tensor[source]¶

bleu_2(candidates: list[str], mult_references: list[list[str]], return_all_scores: bool = True, *, option: ~typing.Literal['shortest', 'average', 'closest'] = 'closest', verbose: int = 0, tokenizer: ~typing.Callable[[str], list[str]] = <method 'split' of 'str' objects>, return_1_to_n: bool = False) → tuple[dict[str, Tensor], dict[str, Tensor]] | Tensor[source]¶

bleu_3(candidates: list[str], mult_references: list[list[str]], return_all_scores: bool = True, *, option: ~typing.Literal['shortest', 'average', 'closest'] = 'closest', verbose: int = 0, tokenizer: ~typing.Callable[[str], list[str]] = <method 'split' of 'str' objects>, return_1_to_n: bool = False) → tuple[dict[str, Tensor], dict[str, Tensor]] | Tensor[source]¶

bleu_4(candidates: list[str], mult_references: list[list[str]], return_all_scores: bool = True, *, option: ~typing.Literal['shortest', 'average', 'closest'] = 'closest', verbose: int = 0, tokenizer: ~typing.Callable[[str], list[str]] = <method 'split' of 'str' objects>, return_1_to_n: bool = False) → tuple[dict[str, Tensor], dict[str, Tensor]] | Tensor[source]¶

cider_d(candidates: list[str], mult_references: list[list[str]], return_all_scores: bool = True, *, n: int = 4, sigma: float = 6.0, tokenizer: ~typing.Callable[[str], list[str]] = <method 'split' of 'str' objects>, return_tfidf: bool = False, scale: float = 10.0) → tuple[CIDErDScores, CIDErDScores] | Tensor[source]¶

Consensus-based Image Description Evaluation function.

Paper: https://arxiv.org/pdf/1411.5726.pdf

Warning

This metric requires at least 2 candidates with 2 sets of references; otherwise it raises a ValueError.

Parameters:¶

candidates: The list of sentences to evaluate.
mult_references: The list of list of sentences used as target.
return_all_scores: If True, returns a tuple containing the globals and locals scores. Otherwise returns a scalar tensor containing the main global score. defaults to True.
n: Maximal number of n-grams taken into account. defaults to 4.
sigma: Standard deviation parameter used for gaussian penalty. defaults to 6.0.
tokenizer: The fast tokenizer used to split sentences into words. defaults to str.split.
return_tfidf: If True, returns the list of dictionaries containing the tf-idf scores of n-grams in the sents_score output. defaults to False.
scale: CIDEr-D score factor. defaults to 10.0.

Returns:¶

A tuple of globals and locals scores or a scalar tensor with the main global score.
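The sigma and scale parameters can be read as follows: a Gaussian penalty shrinks the n-gram similarity when candidate and reference lengths differ, and the result is multiplied by a constant factor. The sketch below shows only these two steps under that reading; `gaussian_length_penalty` and `scaled_similarity` are hypothetical names, and the tf-idf cosine similarity itself is taken as given.

```python
import math


def gaussian_length_penalty(cand_len: int, ref_len: int, sigma: float = 6.0) -> float:
    """Gaussian penalty on the length difference; equals 1.0 when lengths match."""
    delta = cand_len - ref_len
    return math.exp(-(delta ** 2) / (2.0 * sigma ** 2))


def scaled_similarity(cosine_sim: float, cand_len: int, ref_len: int,
                      sigma: float = 6.0, scale: float = 10.0) -> float:
    """Apply the length penalty, then the score factor (assumed order of operations)."""
    return scale * cosine_sim * gaussian_length_penalty(cand_len, ref_len, sigma)


same_length = gaussian_length_penalty(8, 8)    # no penalty
six_longer = gaussian_length_penalty(14, 8)    # one sigma away from the reference length
```

A larger sigma makes the metric more tolerant of length mismatches.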

clap_sim( candidates: list[str], mult_references: list[list[str]] | None = None, audio_paths: list[str] | None = None, return_all_scores: bool = True, *, clap_method: 'audio' | 'text' = 'text', clap_model: str | CLAPWrapper = 'MS-CLAP-2023', device: str | device | None = 'cuda_if_available', batch_size: int | None = 32, reset_state: bool = True, seed: int | None = 42, verbose: int = 0, ) → tuple[CLAPScores, CLAPScores] | Tensor[source]¶

Cosine-similarity of the Contrastive Language-Audio Pretraining (CLAP) embeddings.

The implementation is based on the msclap pypi package.

Paper: https://arxiv.org/pdf/2411.00321
msclap package: https://pypi.org/project/msclap/

Parameters:¶

candidates: list[str]¶: The list of sentences to evaluate.
mult_references: list[list[str]] | None = None¶: The list of list of sentences used as target when method is “text”. defaults to None.
audio_paths: list[str] | None = None¶: Audio filepaths required when method is “audio”. defaults to None.
return_all_scores: bool = True¶: If True, returns a tuple containing the globals and locals scores. Otherwise returns a scalar tensor containing the main global score. defaults to True.
clap_method: 'audio' | 'text' = 'text'¶: The method used to encode the sentences. Can be “text” or “audio”. defaults to “text”.
clap_model: str | CLAPWrapper = 'MS-CLAP-2023'¶: The CLAP model used to extract sentence embeddings for cosine-similarity. defaults to “MS-CLAP-2023”.
device: str | device | None = 'cuda_if_available'¶: The PyTorch device used to run CLAP models. If “cuda_if_available”, it will use cuda if available. defaults to “cuda_if_available”.
batch_size: int | None = 32¶: The batch size of the CLAP models. defaults to 32.
reset_state: bool = True¶: If True, reset the state of the PyTorch global generator after the initialization of the pre-trained models. defaults to True.
seed: int | None = 42¶: Optional seed to make CLAP-sim scores deterministic when using clap_method=”audio” on large audio files. defaults to 42.
verbose: int = 0¶: The verbose level. defaults to 0.

Returns:¶

A tuple of globals and locals scores or a scalar tensor with the main global score.
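The score is a cosine similarity between embedding vectors. As a reminder of what that computes, here is a small pure-Python sketch on plain lists; the vectors are made up, and the real embeddings come from the CLAP model.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up 2-dimensional "embeddings" of a candidate and a reference (or an audio clip).
candidate_emb = [3.0, 4.0]
target_emb = [4.0, 3.0]
similarity = cosine_similarity(candidate_emb, target_emb)
```

The value depends only on the direction of the vectors, not on their norm.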

Evaluate candidates with multiple references with the DCASE2023 Audio Captioning metrics.

Parameters:¶

candidates: list[str]¶: The list of sentences to evaluate.
mult_references: list[list[str]]¶: The list of list of sentences used as target.
preprocess: bool | Callable[[list[str]], list[str]] = True¶: If True, the candidates and references will be passed as input to the PTB stanford tokenizer before computing metrics. defaults to True.
cache_path: str | Path | None = None¶: The path to the external code directory. defaults to the value returned by get_default_cache_path().
java_path: str | Path | None = None¶: The path to the java executable. defaults to the value returned by get_default_java_path().
tmp_path: str | Path | None = None¶: Temporary directory path. defaults to the value returned by get_default_tmp_path().
device: str | device | None = 'cuda_if_available'¶: The PyTorch device used to run FENSE and SPIDErFL models. If None, it will try to use cuda if available. defaults to “cuda_if_available”.
verbose: int = 0¶: The verbose level. defaults to 0.

Returns:¶

A tuple containing the corpus and sentence scores.

Evaluate candidates with multiple references with the DCASE2024 Audio Captioning metrics.

Parameters:¶

candidates: list[str]¶: The list of sentences to evaluate.
mult_references: list[list[str]]¶: The list of list of sentences used as target.
preprocess: bool | Callable[[list[str]], list[str]] = True¶: If True, the candidates and references will be passed as input to the PTB stanford tokenizer before computing metrics. defaults to True.
cache_path: str | Path | None = None¶: The path to the external code directory. defaults to the value returned by get_default_cache_path().
java_path: str | Path | None = None¶: The path to the java executable. defaults to the value returned by get_default_java_path().
tmp_path: str | Path | None = None¶: Temporary directory path. defaults to the value returned by get_default_tmp_path().
device: str | device | None = 'cuda_if_available'¶: The PyTorch device used to run FENSE and SPIDErFL models. If None, it will try to use cuda if available. defaults to “cuda_if_available”.
verbose: int = 0¶: The verbose level. defaults to 0.

Returns:¶

A tuple containing the corpus and sentence scores.

Evaluate candidates with multiple references with custom metrics.

Parameters:¶

candidates: list[str]¶: The list of sentences to evaluate.
mult_references: list[list[str]]¶: The list of list of sentences used as target.
preprocess: bool | Callable[[list[str]], list[str]] = True¶: If True, the candidates and references will be passed as input to the PTB stanford tokenizer before computing metrics. defaults to True.
metrics: str | Iterable[str] | Iterable[Callable[[list, list], tuple]] = 'default'¶: The name of the metric list or the explicit list of metrics to compute. defaults to “default”.
cache_path: str | Path | None = None¶: The path to the external code directory. defaults to the value returned by get_default_cache_path().
java_path: str | Path | None = None¶: The path to the java executable. defaults to the value returned by get_default_java_path().
tmp_path: str | Path | None = None¶: Temporary directory path. defaults to the value returned by get_default_tmp_path().
device: str | device | None = 'cuda_if_available'¶: The PyTorch device used to run FENSE and SPIDErFL models. If None, it will try to use cuda if available. defaults to “cuda_if_available”.
verbose: int = 0¶: The verbose level. defaults to 0.

Returns:¶

A tuple containing the corpus and sentence scores.
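The metrics parameter accepts callables of type Callable[[list, list], tuple]. The sketch below shows what such a callable could look like; `exact_match` is a hypothetical metric invented for illustration, and it returns plain floats for simplicity, whereas the built-in functions documented on this page return tensors. Whether the evaluation function accepts float outputs is not verified here.

```python
def exact_match(candidates: list[str], mult_references: list[list[str]]) -> tuple[dict, dict]:
    """Hypothetical metric: 1.0 when a candidate equals one of its references, else 0.0."""
    sents_scores = [
        float(cand in refs) for cand, refs in zip(candidates, mult_references)
    ]
    corpus_score = sum(sents_scores) / len(sents_scores) if sents_scores else 0.0
    # Same layout as the documented return value: (corpus scores, sentence scores).
    return {"exact_match": corpus_score}, {"exact_match": sents_scores}


candidates = ["a dog barks", "rain"]
mult_references = [["a dog barks", "dog barking"], ["rain falls"]]
corpus_scores, sents_scores = exact_match(candidates, mult_references)
```

The callable takes the candidates and the references in the same order as the other functions of this package and returns a pair of score containers.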

fense( candidates: list[str], mult_references: list[list[str]], return_all_scores: bool = True, *, sbert_model: str | SentenceTransformer = 'paraphrase-TinyBERT-L6-v2', echecker: str | BERTFlatClassifier = 'echecker_clotho_audiocaps_base', echecker_tokenizer: AutoTokenizer | None = None, error_threshold: float = 0.9, device: str | device | None = 'cuda_if_available', batch_size: int | None = 32, reset_state: bool = True, return_probs: bool = False, penalty: float = 0.9, verbose: int = 0, ) → tuple[FENSEScores, FENSEScores] | Tensor[source]¶

Fluency ENhanced Sentence-bert Evaluation (FENSE)

Paper: https://arxiv.org/abs/2110.04684
Original implementation: https://github.com/blmoistawinde/fense

Parameters:¶

candidates: list[str]¶: The list of sentences to evaluate.
mult_references: list[list[str]]¶: The list of list of sentences used as target.
return_all_scores: bool = True¶: If True, returns a tuple containing the globals and locals scores. Otherwise returns a scalar tensor containing the main global score. defaults to True.
sbert_model: str | SentenceTransformer = 'paraphrase-TinyBERT-L6-v2'¶: The sentence BERT model used to extract sentence embeddings for cosine-similarity. defaults to “paraphrase-TinyBERT-L6-v2”.
echecker: str | BERTFlatClassifier = 'echecker_clotho_audiocaps_base'¶: The echecker model used to detect fluency errors. Can be “echecker_clotho_audiocaps_base”, “echecker_clotho_audiocaps_tiny”, “none” or None. defaults to “echecker_clotho_audiocaps_base”.
echecker_tokenizer: AutoTokenizer | None = None¶: The tokenizer of the echecker model. If None and echecker is not None, this value will be inferred with echecker.model_type. defaults to None.
error_threshold: float = 0.9¶: The threshold used to detect fluency errors for echecker model. defaults to 0.9.
penalty: float = 0.9¶: The penalty coefficient applied. A higher value lowers the cos-sim scores more when an error is detected. defaults to 0.9.
device: str | device | None = 'cuda_if_available'¶: The PyTorch device used to run pre-trained models. If “cuda_if_available”, it will use cuda if available. defaults to “cuda_if_available”.
batch_size: int | None = 32¶: The batch size of the sBERT and echecker models. defaults to 32.
reset_state: bool = True¶: If True, reset the state of the PyTorch global generator after the initialization of the pre-trained models. defaults to True.
return_probs: bool = False¶: If True, return each individual error probability given by the fluency detector model. defaults to False.
verbose: int = 0¶: The verbose level. defaults to 0.

Returns:¶

A tuple of globals and locals scores or a scalar tensor with the main global score.
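The interaction between error_threshold and penalty can be sketched for a single candidate as follows. This follows the formulation of the original FENSE implementation as far as it is understood here; `fense_like_score` is a hypothetical helper, and the similarity and error probability are made-up inputs.

```python
def fense_like_score(sbert_sim: float, error_prob: float,
                     error_threshold: float = 0.9, penalty: float = 0.9) -> float:
    """Scale down the SBERT similarity when the error detector fires."""
    has_error = error_prob > error_threshold
    return sbert_sim * (1.0 - penalty * float(has_error))


fluent = fense_like_score(sbert_sim=0.8, error_prob=0.5)      # below threshold: unchanged
disfluent = fense_like_score(sbert_sim=0.8, error_prob=0.95)  # above threshold: penalized
```

With the default penalty of 0.9, a candidate flagged as disfluent keeps only one tenth of its similarity score.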

fer( candidates: list[str], return_all_scores: bool = True, *, echecker: str | BERTFlatClassifier = 'echecker_clotho_audiocaps_base', echecker_tokenizer: AutoTokenizer | None = None, error_threshold: float = 0.9, device: str | device | None = 'cuda_if_available', batch_size: int | None = 32, reset_state: bool = True, return_probs: bool = False, verbose: int = 0, ) → tuple[FERScores, FERScores] | Tensor[source]¶

Return Fluency Error Rate (FER) detected by a pre-trained BERT model.

Paper: https://arxiv.org/abs/2110.04684
Original implementation: https://github.com/blmoistawinde/fense

Parameters:¶

candidates: list[str]¶: The list of sentences to evaluate.
return_all_scores: bool = True¶: If True, returns a tuple containing the globals and locals scores. Otherwise returns a scalar tensor containing the main global score. defaults to True.
echecker: str | BERTFlatClassifier = 'echecker_clotho_audiocaps_base'¶: The echecker model used to detect fluency errors. Can be “echecker_clotho_audiocaps_base”, “echecker_clotho_audiocaps_tiny”, “none” or None. defaults to “echecker_clotho_audiocaps_base”.
echecker_tokenizer: AutoTokenizer | None = None¶: The tokenizer of the echecker model. If None and echecker is not None, this value will be inferred with echecker.model_type. defaults to None.
error_threshold: float = 0.9¶: The threshold used to detect fluency errors for echecker model. defaults to 0.9.
device: str | device | None = 'cuda_if_available'¶: The PyTorch device used to run pre-trained models. If “cuda_if_available”, it will use cuda if available. defaults to “cuda_if_available”.
batch_size: int | None = 32¶: The batch size of the echecker models. defaults to 32.
reset_state: bool = True¶: If True, reset the state of the PyTorch global generator after the initialization of the pre-trained models. defaults to True.
return_probs: bool = False¶: If True, return each individual error probability given by the fluency detector model. defaults to False.
verbose: int = 0¶: The verbose level. defaults to 0.

Returns:¶

A tuple of globals and locals scores or a scalar tensor with the main global score.
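As a sketch of how a fluency error rate follows from per-sentence error probabilities, the snippet below thresholds made-up probabilities and averages the resulting flags. `fluency_error_rate` is a hypothetical helper, and the strict comparison with the threshold is an assumption.

```python
from typing import Sequence


def fluency_error_rate(error_probs: Sequence[float],
                       error_threshold: float = 0.9) -> tuple[float, list[float]]:
    """Return (corpus-level error rate, per-sentence error flags)."""
    flags = [float(p > error_threshold) for p in error_probs]
    rate = sum(flags) / len(flags) if flags else 0.0
    return rate, flags


# Made-up error probabilities, standing in for the echecker model output.
probs = [0.10, 0.95, 0.99, 0.50]
rate, flags = fluency_error_rate(probs)
```

Raising error_threshold makes the detector more conservative, so fewer candidates are counted as erroneous.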

mace( candidates: list[str], mult_references: list[list[str]] | None = None, audio_paths: list[str] | None = None, return_all_scores: bool = True, *, mace_method: 'text' | 'audio' | 'combined' = 'text', penalty: float = 0.3, clap_model: str | CLAPWrapper = 'MS-CLAP-2023', seed: int | None = 42, echecker: str | BERTFlatClassifier = 'echecker_clotho_audiocaps_base', echecker_tokenizer: AutoTokenizer | None = None, error_threshold: float = 0.97, device: str | device | None = 'cuda_if_available', batch_size: int | None = 32, reset_state: bool = True, return_probs: bool = False, verbose: int = 0, ) → Tensor | tuple[MACEScores, MACEScores][source]¶

Multimodal Audio-Caption Evaluation (MACE) function.

MACE is a metric designed for evaluating automated audio captioning (AAC) systems. Unlike metrics that compare machine-generated captions solely to human references, MACE uses both audio and text to improve evaluation. By integrating both audio and text, it produces assessments that align better with human judgments.

The implementation is based on the original MACE implementation (the original authors have agreed to the inclusion of their code in aac-metrics under the MIT license).

Paper: https://arxiv.org/pdf/2411.00321
Original author: Satvik Dixit
Original implementation: https://github.com/satvik-dixit/mace/tree/main

Parameters:¶

candidates: list[str]¶: The list of sentences to evaluate.
mult_references: list[list[str]] | None = None¶: The list of list of sentences used as target when method is “text” or “combined”. defaults to None.
audio_paths: list[str] | None = None¶: Audio filepaths required when method is “audio” or “combined”. defaults to None.
return_all_scores: bool = True¶: If True, returns a tuple containing the globals and locals scores. Otherwise returns a scalar tensor containing the main global score. defaults to True.
mace_method: 'text' | 'audio' | 'combined' = 'text'¶: The method used to encode the sentences. Can be “text”, “audio” or “combined”. defaults to “text”.
penalty: float = 0.3¶: The penalty coefficient applied. A higher value lowers the cos-sim scores more when an error is detected. defaults to 0.3.
clap_model: str | CLAPWrapper = 'MS-CLAP-2023'¶: The CLAP model used to extract CLAP embeddings for cosine-similarity. defaults to “MS-CLAP-2023”.
seed: int | None = 42¶: Optional seed to make CLAP-sim scores deterministic when using mace_method=”audio” or “combined” on large audio files. defaults to 42.
echecker: str | BERTFlatClassifier = 'echecker_clotho_audiocaps_base'¶: The echecker model used to detect fluency errors. Can be “echecker_clotho_audiocaps_base”, “echecker_clotho_audiocaps_tiny”, “none” or None. defaults to “echecker_clotho_audiocaps_base”.
echecker_tokenizer: AutoTokenizer | None = None¶: The tokenizer of the echecker model. If None and echecker is not None, this value will be inferred with echecker.model_type. defaults to None.
error_threshold: float = 0.97¶: The threshold used to detect fluency errors for echecker model. defaults to 0.97.
device: str | device | None = 'cuda_if_available'¶: The PyTorch device used to run pre-trained models. If “cuda_if_available”, it will use cuda if available. defaults to “cuda_if_available”.
batch_size: int | None = 32¶: The batch size of the CLAP and echecker models. defaults to 32.
reset_state: bool = True¶: If True, reset the state of the PyTorch global generator after the initialization of the pre-trained models. defaults to True.
return_probs: bool = False¶: If True, return each individual error probability given by the fluency detector model. defaults to False.
verbose: int = 0¶: The verbose level. defaults to 0.

Returns:¶

A tuple of globals and locals scores or a scalar tensor with the main global score.
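The parameter descriptions imply which optional inputs each mace_method needs: references for “text”, audio files for “audio”, and both for “combined”. The sketch below encodes that rule as a pre-flight check; `missing_mace_inputs` is a hypothetical helper and does not reproduce the library's own validation or error messages.

```python
from typing import Optional


def missing_mace_inputs(mace_method: str,
                        mult_references: Optional[list] = None,
                        audio_paths: Optional[list] = None) -> list[str]:
    """List the argument names that are required by the method but were not given."""
    if mace_method not in ("text", "audio", "combined"):
        raise ValueError(f"Unknown mace_method: {mace_method!r}")
    missing = []
    if mace_method in ("text", "combined") and mult_references is None:
        missing.append("mult_references")
    if mace_method in ("audio", "combined") and audio_paths is None:
        missing.append("audio_paths")
    return missing


# Made-up inputs: references are given, audio files are not.
refs = [["a dog barks", "dog barking"]]
text_missing = missing_mace_inputs("text", mult_references=refs)
combined_missing = missing_mace_inputs("combined", mult_references=refs)
```

Running such a check before loading the CLAP and echecker models avoids paying the model initialization cost for a call that cannot succeed.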

Metric for Evaluation of Translation with Explicit ORdering function.

Paper: https://dl.acm.org/doi/pdf/10.5555/1626355.1626389
Documentation: https://www.cs.cmu.edu/~alavie/METEOR/README.html
Original implementation: https://github.com/tylin/coco-caption

Parameters:¶

candidates: list[str]¶: The list of sentences to evaluate.
mult_references: list[list[str]]¶: The list of list of sentences used as target.
return_all_scores: bool = True¶: If True, returns a tuple containing the globals and locals scores. Otherwise returns a scalar tensor containing the main global score. defaults to True.
cache_path: str | Path | None = None¶: The path to the external code directory. defaults to the value returned by get_default_cache_path().
java_path: str | Path | None = None¶: The path to the java executable. defaults to the value returned by get_default_java_path().
java_max_memory: str = '2G'¶: The maximal java memory used. defaults to “2G”.
language: 'en' | 'cz' | 'de' | 'es' | 'fr' = 'en'¶: The language used for stem, synonym and paraphrase matching. Can be one of (“en”, “cz”, “de”, “es”, “fr”). defaults to “en”.
use_shell: bool | None = None¶: Optional argument to force use os-specific shell for the java subprogram. If None, it will use shell only on Windows OS. defaults to None.
params: Iterable[float] | None = None¶: List of 4 parameters (alpha, beta, gamma, delta) used in the METEOR metric. If None, it will use the defaults of the java program, which are (0.85, 0.2, 0.6, 0.75). defaults to None.
weights: Iterable[float] | None = None¶: List of 4 parameters (w1, w2, w3, w4) used in the METEOR metric. If None, it will use the defaults of the java program, which are (1.0, 1.0, 0.6, 0.8). defaults to None.
verbose: int = 0¶: The verbose level. defaults to 0.

Returns:¶

A tuple of globals and locals scores or a scalar tensor with the main global score.
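To give an intuition for the params argument, the sketch below shows how alpha, beta and gamma enter the METEOR score as described in the METEOR literature: alpha balances precision and recall, while beta and gamma shape the fragmentation penalty. It is a simplification that ignores delta and the matcher weights, takes the alignment statistics as given, and is not the java program's code; `meteor_like_score` is a hypothetical name.

```python
def meteor_like_score(precision: float, recall: float, chunks: int, matches: int,
                      alpha: float = 0.85, beta: float = 0.2, gamma: float = 0.6) -> float:
    """Weighted harmonic mean of precision and recall, reduced by a fragmentation penalty."""
    if matches == 0 or precision == 0.0 or recall == 0.0:
        return 0.0
    f_mean = precision * recall / (alpha * precision + (1.0 - alpha) * recall)
    # Fewer, longer contiguous chunks of matched words mean a smaller penalty.
    fragmentation_penalty = gamma * (chunks / matches) ** beta
    return (1.0 - fragmentation_penalty) * f_mean


# Made-up alignment statistics: 4 matched words in 1 chunk versus 4 separate chunks.
contiguous = meteor_like_score(precision=1.0, recall=1.0, chunks=1, matches=4)
scattered = meteor_like_score(precision=1.0, recall=1.0, chunks=4, matches=4)
```

With the same matched words, a candidate whose matches appear in the reference order (one chunk) scores higher than one whose matches are scattered.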

rouge_l(candidates: list[str], mult_references: list[list[str]], return_all_scores: bool = True, *, beta: float = 1.2, tokenizer: ~typing.Callable[[str], list[str]] = <method 'split' of 'str' objects>) → tuple[ROUGELScores, ROUGELScores] | Tensor[source]¶

Recall-Oriented Understudy for Gisting Evaluation function.

Paper: https://aclanthology.org/W04-1013.pdf
Original Author: Ramakrishna Vedantam <vrama91@vt.edu>
Original implementation: https://github.com/tylin/coco-caption

Parameters:¶

candidates: The list of sentences to evaluate.
mult_references: The list of list of sentences used as target.
return_all_scores: If True, returns a tuple containing the globals and locals scores. Otherwise returns a scalar tensor containing the main global score. defaults to True.
beta: Determines the weight of recall in the combined f-score. defaults to 1.2.
tokenizer: The fast tokenizer used to split sentences into words. defaults to str.split.

Returns:¶

A tuple of globals and locals scores or a scalar tensor with the main global score.
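The role of beta can be seen in a small sentence-level sketch based on the longest common subsequence (LCS), following the formulation used in the coco-caption code as far as it is understood here. `lcs_length` and `rouge_l_sentence` are hypothetical helpers written for illustration.

```python
from typing import Callable


def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, with a rolling dynamic-programming row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_sentence(candidate: str, references: list[str], beta: float = 1.2,
                     tokenizer: Callable[[str], list[str]] = str.split) -> float:
    """F-score from the best LCS precision and recall over all references."""
    cand = tokenizer(candidate)
    if not cand:
        return 0.0
    precisions, recalls = [], []
    for ref in references:
        ref_tokens = tokenizer(ref)
        lcs = lcs_length(cand, ref_tokens)
        precisions.append(lcs / len(cand))
        recalls.append(lcs / len(ref_tokens) if ref_tokens else 0.0)
    p, r = max(precisions), max(recalls)
    if p == 0.0 or r == 0.0:
        return 0.0
    return (1.0 + beta ** 2) * p * r / (r + beta ** 2 * p)


score = rouge_l_sentence("a dog barks loudly", ["a dog barks"])
```

A beta above 1 weights recall more than precision, so missing reference words costs more than adding extra words.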

sbert_sim( candidates: list[str], mult_references: list[list[str]], return_all_scores: bool = True, *, sbert_model: str | SentenceTransformer = 'paraphrase-TinyBERT-L6-v2', device: str | device | None = 'cuda_if_available', batch_size: int | None = 32, reset_state: bool = True, verbose: int = 0, ) → tuple[SBERTSimScores, SBERTSimScores] | Tensor[source]¶

Cosine-similarity of the Sentence-BERT embeddings.

Paper: https://arxiv.org/abs/1908.10084
Original implementation: https://github.com/blmoistawinde/fense

Parameters:¶

candidates: list[str]¶: The list of sentences to evaluate.
mult_references: list[list[str]]¶: The list of list of sentences used as target.
return_all_scores: bool = True¶: If True, returns a tuple containing the globals and locals scores. Otherwise returns a scalar tensor containing the main global score. defaults to True.
sbert_model: str | SentenceTransformer = 'paraphrase-TinyBERT-L6-v2'¶: The sentence BERT model used to extract sentence embeddings for cosine-similarity. defaults to “paraphrase-TinyBERT-L6-v2”.
device: str | device | None = 'cuda_if_available'¶: The PyTorch device used to run pre-trained models. If “cuda_if_available”, it will use cuda if available. defaults to “cuda_if_available”.
batch_size: int | None = 32¶: The batch size of the sBERT models. defaults to 32.
reset_state: bool = True¶: If True, reset the state of the PyTorch global generator after the initialization of the pre-trained models. defaults to True.
verbose: int = 0¶: The verbose level. defaults to 0.

Returns:¶

A tuple of globals and locals scores or a scalar tensor with the main global score.

Semantic Propositional Image Caption Evaluation function.

Paper: https://arxiv.org/pdf/1607.08822.pdf

Parameters:¶

candidates: list[str]¶: The list of sentences to evaluate.
mult_references: list[list[str]]¶: The list of list of sentences used as target.
return_all_scores: bool = True¶: If True, returns a tuple containing the globals and locals scores. Otherwise returns a scalar tensor containing the main global score. defaults to True.
cache_path: str | Path | None = None¶: The path to the external code directory. defaults to the value returned by get_default_cache_path().
java_path: str | Path | None = None¶: The path to the java executable. defaults to the value returned by get_default_java_path().
tmp_path: str | Path | None = None¶: Temporary directory path. defaults to the value returned by get_default_tmp_path().
n_threads: int | None = None¶: Number of threads used to compute SPICE. None value will use the default value of the java program. defaults to None.
java_max_memory: str = '8G'¶: The maximal java memory used. defaults to “8G”.
timeout: None | int | Iterable[int] = None¶: The number of seconds before killing the java subprogram. If a list is given, it will restart the program if the i-th timeout is reached. If None, no timeout will be used. defaults to None.
separate_cache_dir: bool = True¶: If True, the SPICE cache files will be stored in a new temporary directory. This removes potential freezes when multiple instances of SPICE are running in the same cache dir. defaults to True.
use_shell: bool | None = None¶: Optional argument to force use os-specific shell for the java subprogram. If None, it will use shell only on Windows OS. defaults to None.
verbose: int = 0¶: The verbose level. defaults to 0.

Returns:¶

A tuple of globals and locals scores or a scalar tensor with the main global score.
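The timeout parameter accepts None, a single value, or a list of values where each entry applies to one attempt. One way to read that behaviour is the retry loop below, built on the standard subprocess module. `normalize_timeouts` and `run_with_timeouts` are hypothetical helpers, the command is a stand-in for the java subprogram, and nothing here is taken from the library's source.

```python
import subprocess
import sys
from typing import Iterable, Optional, Union


def normalize_timeouts(timeout: Union[None, float, Iterable[float]]) -> list[Optional[float]]:
    """Turn None, a number, or an iterable of numbers into a list of per-attempt timeouts."""
    if timeout is None:
        return [None]  # a single attempt without any time limit
    if isinstance(timeout, (int, float)):
        return [timeout]
    return list(timeout)


def run_with_timeouts(cmd: list[str],
                      timeout: Union[None, float, Iterable[float]] = None
                      ) -> subprocess.CompletedProcess:
    """Run cmd, restarting it with the next timeout whenever the current one is reached."""
    timeouts = normalize_timeouts(timeout)
    for limit in timeouts:
        try:
            return subprocess.run(cmd, timeout=limit, capture_output=True, text=True, check=True)
        except subprocess.TimeoutExpired:
            continue  # subprocess.run has already killed the child; try the next timeout
    raise TimeoutError(f"Command did not finish within any of the timeouts {timeouts}.")


# Stand-in command: a short Python one-liner instead of the java program.
result = run_with_timeouts([sys.executable, "-c", "print('ok')"], timeout=[30, 60])
```

Increasing timeouts across attempts gives a slow first run (for example while caches are being built) a second chance without waiting indefinitely.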

spider(candidates: list[str], mult_references: list[list[str]], return_all_scores: bool = True, *, n: int = 4, sigma: float = 6.0, tokenizer: ~typing.Callable[[str], list[str]] = <method 'split' of 'str' objects>, return_tfidf: bool = False, cache_path: str | ~pathlib.Path | None = None, java_path: str | ~pathlib.Path | None = None, tmp_path: str | ~pathlib.Path | None = None, n_threads: int | None = None, java_max_memory: str = '8G', timeout: None | int | ~typing.Iterable[int] = None, verbose: int = 0) → tuple[SPIDErScores, SPIDErScores] | Tensor[source]¶

SPIDEr function.

Paper: https://arxiv.org/pdf/1612.00370.pdf

Warning

This metric requires at least 2 candidates with 2 sets of references; otherwise it raises a ValueError.

Parameters:¶

candidates: The list of sentences to evaluate.
mult_references: The list of list of sentences used as target.
return_all_scores: If True, returns a tuple containing the globals and locals scores. Otherwise returns a scalar tensor containing the main global score. defaults to True.
n: Maximal number of n-grams taken into account. defaults to 4.
sigma: Standard deviation parameter used for gaussian penalty. defaults to 6.0.
tokenizer: The fast tokenizer used to split sentences into words. defaults to str.split.
return_tfidf: If True, returns the list of dictionaries containing the tf-idf scores of n-grams in the sents_score output. defaults to False.
cache_path: The path to the external code directory. defaults to the value returned by get_default_cache_path().
java_path: The path to the java executable. defaults to the value returned by get_default_java_path().
tmp_path: Temporary directory path. defaults to the value returned by get_default_tmp_path().
n_threads: Number of threads used to compute SPICE. None value will use the default value of the java program. defaults to None.
java_max_memory: The maximal java memory used. defaults to “8G”.
timeout: The number of seconds before killing the java subprogram. If a list is given, it will restart the program if the i-th timeout is reached. If None, no timeout will be used. defaults to None.
verbose: The verbose level. defaults to 0.

Returns:¶

A tuple of globals and locals scores or a scalar tensor with the main global score.
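SPIDEr is defined in the referenced paper as the average of the CIDEr(-D) and SPICE scores. A minimal sketch with made-up per-sentence values follows; `spider_from_parts` is a hypothetical helper, and the two input lists stand in for the outputs of the underlying metrics.

```python
from typing import Sequence


def spider_from_parts(cider_d: Sequence[float], spice: Sequence[float]) -> list[float]:
    """Per-sentence SPIDEr as the arithmetic mean of CIDEr-D and SPICE."""
    return [(c + s) / 2.0 for c, s in zip(cider_d, spice)]


# Made-up per-sentence scores for two candidates.
cider_d_sents = [1.0, 0.4]
spice_sents = [0.2, 0.0]
spider_sents = spider_from_parts(cider_d_sents, spice_sents)
```

Since CIDEr-D values are scaled (by 10 by default) while SPICE lies between 0 and 1, the CIDEr-D term usually dominates the average.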

spider_fl(candidates: list[str], mult_references: list[list[str]], return_all_scores: bool = True, *, n: int = 4, sigma: float = 6.0, tokenizer: ~typing.Callable[[str], list[str]] = <method 'split' of 'str' objects>, return_tfidf: bool = False, cache_path: str | ~pathlib.Path | None = None, java_path: str | ~pathlib.Path | None = None, tmp_path: str | ~pathlib.Path | None = None, n_threads: int | None = None, java_max_memory: str = '8G', timeout: None | int | ~typing.Iterable[int] = None, echecker: str | ~aac_metrics.functional.fer.BERTFlatClassifier = 'echecker_clotho_audiocaps_base', echecker_tokenizer: ~transformers.models.auto.tokenization_auto.AutoTokenizer | None = None, error_threshold: float = 0.9, device: str | ~torch.device | None = 'cuda_if_available', batch_size: int | None = 32, reset_state: bool = True, return_probs: bool = True, penalty: float = 0.9, verbose: int = 0) → tuple[SPIDErFLScores, SPIDErFLScores] | Tensor[source]¶

Combination of SPIDEr with a fluency error detector.

Original implementation: https://github.com/felixgontier/dcase-2023-baseline/blob/main/metrics.py#L48.

Warning

This metric requires at least 2 candidates with 2 sets of references; otherwise it raises a ValueError.

Parameters:¶

candidates: The list of sentences to evaluate.
mult_references: The list of list of sentences used as target.
return_all_scores: If True, returns a tuple containing the globals and locals scores. Otherwise returns a scalar tensor containing the main global score. defaults to True.
n: Maximal number of n-grams taken into account. defaults to 4.
sigma: Standard deviation parameter used for gaussian penalty. defaults to 6.0.
tokenizer: The fast tokenizer used to split sentences into words. defaults to str.split.
return_tfidf: If True, returns the list of dictionaries containing the tf-idf scores of n-grams in the sents_score output. defaults to False.
cache_path: The path to the external code directory. defaults to the value returned by get_default_cache_path().
java_path: The path to the java executable. defaults to the value returned by get_default_java_path().
tmp_path: Temporary directory path. defaults to the value returned by get_default_tmp_path().
n_threads: Number of threads used to compute SPICE. None value will use the default value of the java program. defaults to None.
java_max_memory: The maximal java memory used. defaults to “8G”.
timeout: The number of seconds before killing the java subprogram. If a list is given, it will restart the program if the i-th timeout is reached. If None, no timeout will be used. defaults to None.
echecker: The echecker model used to detect fluency errors. Can be “echecker_clotho_audiocaps_base”, “echecker_clotho_audiocaps_tiny”, “none” or None. defaults to “echecker_clotho_audiocaps_base”.
echecker_tokenizer: The tokenizer of the echecker model. If None and echecker is not None, this value will be inferred with echecker.model_type. defaults to None.
error_threshold: The threshold used to detect fluency errors for echecker model. defaults to 0.9.
device: The PyTorch device used to run pre-trained models. If “cuda_if_available”, it will use cuda if available. defaults to “cuda_if_available”.
batch_size: The batch size of the sBERT and echecker models. defaults to 32.
reset_state: If True, reset the state of the PyTorch global generator after the initialization of the pre-trained models. defaults to True.
return_probs: If True, return each individual error probability given by the fluency detector model. defaults to True.
penalty: The penalty coefficient applied. A higher value lowers the score more strongly when a fluency error is detected. defaults to 0.9.
verbose: The verbose level. defaults to 0.

Returns:¶

A tuple of globals and locals scores or a scalar tensor with the main global score.
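
The interaction between error_threshold and penalty can be illustrated with a minimal sketch. It assumes the penalty is applied multiplicatively, i.e. a sentence score is scaled by (1 - penalty) when the detector's error probability exceeds error_threshold, which matches the general idea of the original implementation linked above but is not taken from the library's code. The helper name apply_fluency_penalty and the input values are hypothetical.

```python
def apply_fluency_penalty(
    spider_scores: list[float],
    error_probs: list[float],
    error_threshold: float = 0.9,
    penalty: float = 0.9,
) -> list[float]:
    """Hypothetical helper: scale each sentence score when a fluency error is detected.

    Assumption: an error is "detected" when its probability is strictly above
    error_threshold, and the score is then multiplied by (1 - penalty).
    """
    penalized = []
    for score, prob in zip(spider_scores, error_probs):
        has_error = prob > error_threshold
        factor = (1.0 - penalty) if has_error else 1.0
        penalized.append(score * factor)
    return penalized


# Made-up per-sentence SPIDEr scores and error probabilities.
spider_scores = [0.5, 0.4]
error_probs = [0.95, 0.10]
penalized_scores = apply_fluency_penalty(spider_scores, error_probs)
```

With the default values, only the first sentence is penalized because its error probability is above 0.9; the second score is left unchanged.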

spider_max(mult_candidates: list[list[str]], mult_references: list[list[str]], return_all_scores: bool = True, *, return_all_cands_scores: bool = False, n: int = 4, sigma: float = 6.0, tokenizer: Callable[[str], list[str]] = str.split, return_tfidf: bool = False, cache_path: str | Path | None = None, java_path: str | Path | None = None, tmp_path: str | Path | None = None, n_threads: int | None = None, java_max_memory: str = '8G', timeout: None | int | Iterable[int] = None, verbose: int = 0) → tuple[SPIDErMaxScores, SPIDErMaxScores] | Tensor[source]¶

SPIDEr-max function.

Compute the maximal SPIDEr score across multiple candidates.

Paper: https://dcase.community/documents/workshop2022/proceedings/DCASE2022Workshop_Labbe_46.pdf

Warning

This metric requires at least 2 candidates with 2 sets of references, otherwise it raises a ValueError.

Parameters:¶

mult_candidates: The list of list of sentences to evaluate.
mult_references: The list of list of sentences used as target.
return_all_scores: If True, returns a tuple containing the globals and locals scores. Otherwise returns a scalar tensor containing the main global score. defaults to True.
return_all_cands_scores: If True, returns the scores of all candidates in the sents_scores output as a tensor of shape (n_audio, n_cands_per_audio). defaults to False.
n: Maximal n-gram size taken into account. defaults to 4.
sigma: Standard deviation parameter used for gaussian penalty. defaults to 6.0.
tokenizer: The fast tokenizer used to split sentences into words. defaults to str.split.
return_tfidf: If True, returns the list of dictionaries containing the tf-idf scores of n-grams in the sents_score output. defaults to False.
cache_path: The path to the external code directory. defaults to the value returned by get_default_cache_path().
java_path: The path to the java executable. defaults to the value returned by get_default_java_path().
tmp_path: Temporary directory path. defaults to the value returned by get_default_tmp_path().
n_threads: Number of threads used to compute SPICE. None value will use the default value of the java program. defaults to None.
java_max_memory: The maximal java memory used. defaults to “8G”.
timeout: The number of seconds before killing the java subprogram. If a list is given, it will restart the program if the i-th timeout is reached. If None, no timeout will be used. defaults to None.
verbose: The verbose level. defaults to 0.
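
The timeout parameter accepts None, a single integer, or a sequence of integers. The following is a minimal sketch of one way such a retry scheme could work, assuming each entry of the sequence is the time limit of one attempt and the subprogram is restarted after each expired limit. The function run_with_timeouts is hypothetical, uses only the Python standard library, and is not the library's implementation; a trivial Python subprocess stands in for the java program.

```python
import subprocess
import sys
from typing import Iterable, Union


def run_with_timeouts(
    cmd: list[str],
    timeout: Union[None, int, Iterable[int]] = None,
) -> subprocess.CompletedProcess:
    """Hypothetical helper: run cmd, restarting it each time a timeout expires."""
    # Normalize the three accepted forms into a list of per-attempt limits.
    if timeout is None or isinstance(timeout, int):
        timeouts = [timeout]
    else:
        timeouts = list(timeout)

    for seconds in timeouts:
        try:
            # subprocess.run kills the child process when the timeout expires.
            return subprocess.run(
                cmd, timeout=seconds, capture_output=True, text=True, check=True
            )
        except subprocess.TimeoutExpired:
            continue  # Move on to the next (i-th) timeout value.
    raise TimeoutError(f"Command did not finish within timeouts {timeouts}.")


# Stand-in command; the real metric would launch the java subprogram instead.
result = run_with_timeouts([sys.executable, "-c", "print('ok')"], timeout=[30, 60])
```

Passing increasing values such as [30, 60] gives a slow run a second, longer attempt before the call fails.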

Returns:¶

A tuple of globals and locals scores or a scalar tensor with the main global score.
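
As a rough illustration of the reduction described above, the sketch below starts from already computed per-candidate SPIDEr scores and keeps the best candidate for each audio. It assumes the global score is the mean of these per-audio maxima; the function spider_max_from_scores and the score values are hypothetical, and the actual SPIDEr computation (CIDEr-D and SPICE) is not reproduced here.

```python
def spider_max_from_scores(
    all_cands_scores: list[list[float]],
) -> tuple[float, list[float]]:
    """Hypothetical helper: reduce a (n_audio, n_cands_per_audio) score table.

    Returns the assumed global score (mean over audios) and the local scores
    (maximum over the candidates of each audio).
    """
    local_scores = [max(cand_scores) for cand_scores in all_cands_scores]
    global_score = sum(local_scores) / len(local_scores)
    return global_score, local_scores


# Made-up SPIDEr scores for 2 audios with 3 candidates each.
all_cands_scores = [
    [0.25, 0.75, 0.5],
    [0.5, 0.25, 0.0],
]
global_score, local_scores = spider_max_from_scores(all_cands_scores)
```

The nested list has the same layout as the (n_audio, n_cands_per_audio) tensor mentioned for return_all_cands_scores.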

vocab(candidates: list[str], mult_references: list[list[str]] | None, return_all_scores: bool = True, *, seed: None | int | Generator = 1234, tokenizer: Callable[[str], list[str]] = str.split, dtype: dtype = torch.float64, pop_strategy: Literal['max', 'min'] | int = 'max', verbose: int = 0) → tuple[VocabScores, VocabScores] | Tensor[source]¶

Compute vocabulary statistics.

Returns the candidate corpus vocabulary length, the references vocabulary length, the average vocabulary length for single references, and the vocabulary ratios between candidates and references.

Parameters:¶

candidates: The list of sentences to evaluate.
mult_references: The list of list of sentences used as target. Can also be None.
return_all_scores: If True, returns a tuple containing the globals and locals scores. Otherwise returns a scalar tensor containing the main global score. defaults to True.
seed: Random seed used to compute average vocabulary length for multiple references. defaults to 1234.
tokenizer: The function used to split a sentence into tokens. defaults to str.split.
dtype: Torch floating point dtype for numerical precision. defaults to torch.float64.
pop_strategy: Strategy to compute average reference vocab. defaults to “max”.
verbose: The verbose level. defaults to 0.

Returns:¶

A tuple of globals and locals scores or a scalar tensor with the main global score.
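
To make the vocabulary statistics concrete, here is a minimal sketch that counts unique tokens with the default str.split tokenizer and computes the ratio between the candidate and reference vocabularies. The helper vocab_size and the example sentences are hypothetical. The seeded averaging over single references controlled by seed and pop_strategy is not reproduced, since its exact procedure is not described here.

```python
from typing import Callable


def vocab_size(
    sentences: list[str],
    tokenizer: Callable[[str], list[str]] = str.split,
) -> int:
    """Hypothetical helper: number of unique tokens in a list of sentences."""
    tokens: set[str] = set()
    for sentence in sentences:
        tokens.update(tokenizer(sentence))
    return len(tokens)


# Made-up captions: one candidate and two references per audio.
candidates = ["a dog barks", "a man speaks"]
mult_references = [
    ["a dog is barking", "dog barks loudly"],
    ["a man is talking", "someone speaks"],
]

cands_vocab = vocab_size(candidates)
# Flatten all references into a single corpus before counting.
flat_references = [ref for refs in mult_references for ref in refs]
refs_vocab = vocab_size(flat_references)
vocab_ratio = cands_vocab / refs_vocab
```

In this toy example the candidates use 5 distinct words and the references 10, so the ratio is 0.5.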
