aac_metrics.functional package

bert_score_mrefs(
candidates: list[str],
mult_references: list[list[str]],
return_all_scores: bool = True,
*,
model: str | Module = 'roberta-large',
tokenizer: Callable | None = None,
device: str | device | None = 'cuda_if_available',
batch_size: int | None = 32,
num_threads: int = 0,
max_length: int = 64,
reset_state: bool = True,
idf: bool = False,
reduction: 'mean' | 'max' | 'min' | Callable[[...], Tensor] = 'max',
filter_nan: bool = True,
verbose: int = 0,
) → tuple[BERTScoreMRefsScores, BERTScoreMRefsScores] | Tensor

BERTScore metric which supports multiple references.

The implementation is based on the bert_score implementation of torchmetrics.

Parameters:
candidates: list[str]

The list of sentences to evaluate.

mult_references: list[list[str]]

The list of list of sentences used as target.

return_all_scores: bool = True

If True, returns a tuple containing the globals and locals scores. Otherwise returns a scalar tensor containing the main global score. defaults to True.

model: str | Module = 'roberta-large'

The model name or the instantiated model to use to compute token embeddings. defaults to “roberta-large”.

tokenizer: Callable | None = None

The fast tokenizer used to split sentences into words. If None, use the tokenizer corresponding to the model argument. defaults to None.

device: str | device | None = 'cuda_if_available'

The PyTorch device used to run the BERT model. defaults to “cuda_if_available”.

batch_size: int | None = 32

The batch size used in the model forward pass. defaults to 32.

num_threads: int = 0

The number of threads to use for the dataloader. defaults to 0.

max_length: int = 64

Max length when encoding sentences to tensor ids. defaults to 64.

idf: bool = False

Whether to use inverse document frequency (IDF) to weight the BERTScores. defaults to False.

reduction: 'mean' | 'max' | 'min' | Callable[[...], Tensor] = 'max'

The reduction function to apply between multiple references for each audio. defaults to “max”.

filter_nan: bool = True

If True, replace NaN scores by 0.0. defaults to True.

verbose: int = 0

The verbose level. defaults to 0.

Returns:

A tuple of globals and locals scores or a scalar tensor with the main global score.
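The reduction and filter_nan parameters can be illustrated with a minimal sketch on plain Python floats. The helper reduce_mrefs_scores below is hypothetical and is not part of aac-metrics; it only mimics the documented behaviour (one score per reference, reduced to one score per candidate), and applying the NaN filter before the reduction is an assumption.

```python
import math
from statistics import mean
from typing import Callable, Union

Reduction = Union[str, Callable[[list[float]], float]]


def reduce_mrefs_scores(
    scores_per_ref: list[list[float]],
    reduction: Reduction = "max",
    filter_nan: bool = True,
) -> list[float]:
    """Reduce the scores obtained against each reference to one score per candidate."""
    reducers = {"mean": mean, "max": max, "min": min}
    reduce_fn = reducers[reduction] if isinstance(reduction, str) else reduction

    reduced = []
    for ref_scores in scores_per_ref:
        if filter_nan:
            # Assumption: NaN scores are replaced by 0.0 before the reduction.
            ref_scores = [0.0 if math.isnan(s) else s for s in ref_scores]
        reduced.append(reduce_fn(ref_scores))
    return reduced


# Two candidates, scored against 3 and 2 references respectively.
scores = [[0.25, 0.75, 0.5], [float("nan"), 0.5]]
best = reduce_mrefs_scores(scores, reduction="max")
average = reduce_mrefs_scores(scores, reduction="mean")
```

With reduction “max”, each candidate keeps the score of its best-matching reference, whereas “mean” rewards candidates that agree with all of their references.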

bleu(candidates: list[str], mult_references: list[list[str]], return_all_scores: bool = True, *, n: int = 4, option: Literal['shortest', 'average', 'closest'] = 'closest', verbose: int = 0, tokenizer: Callable[[str], list[str]] = str.split, return_1_to_n: bool = False) → tuple[dict[str, Tensor], dict[str, Tensor]] | Tensor

BiLingual Evaluation Understudy function.

Note: this version of the BLEU metric applies a penalty formula that depends on the size of all candidates and the length of the references, which means that the average score of the candidates is not equal to the corpus score.

Parameters:
candidates

The list of sentences to evaluate.

mult_references

The list of list of sentences used as target.

return_all_scores

If True, returns a tuple containing the globals and locals scores. Otherwise returns a scalar tensor containing the main global score. defaults to True.

n

Maximal number of n-grams taken into account. defaults to 4.

option

Corpus reference length mode. Can be “shortest”, “average” or “closest”. defaults to “closest”.

verbose

The verbose level. defaults to 0.

tokenizer

The fast tokenizer used to split sentences into words. defaults to str.split.

return_1_to_n

If True, returns the n-gram results from 1 to n. Otherwise returns the n-gram scores. defaults to False.

Returns:

A tuple of globals and locals scores or a scalar tensor with the main global score.
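The note about the corpus score differing from the average sentence score can be made concrete with a small sketch of the brevity penalty. The functions below are hypothetical helpers, not the aac-metrics implementation: they assume the standard BLEU brevity penalty and the usual meaning of the “shortest”, “average” and “closest” reference length modes, with ties in “closest” going to the shorter reference.

```python
import math


def effective_ref_length(cand_len: int, ref_lens: list[int], option: str = "closest") -> float:
    """Pick the reference length that a candidate is compared against."""
    if option == "shortest":
        return float(min(ref_lens))
    if option == "average":
        return sum(ref_lens) / len(ref_lens)
    if option == "closest":
        # Closest length to the candidate; ties go to the shorter reference.
        return float(min(ref_lens, key=lambda ref_len: (abs(ref_len - cand_len), ref_len)))
    raise ValueError(f"Invalid option {option!r}.")


def brevity_penalty(cand_len: float, ref_len: float) -> float:
    """Standard BLEU brevity penalty: 1 if the candidate is long enough, else exp(1 - r/c)."""
    if cand_len >= ref_len:
        return 1.0
    return math.exp(1.0 - ref_len / cand_len)


cand_lens = [4, 9]
mult_ref_lens = [[5, 6], [8, 12]]

# Corpus-level penalty: lengths are summed over all candidates first.
total_ref_len = sum(effective_ref_length(c, refs) for c, refs in zip(cand_lens, mult_ref_lens))
corpus_bp = brevity_penalty(sum(cand_lens), total_ref_len)

# Sentence-level penalties, averaged afterwards.
sent_bps = [brevity_penalty(c, effective_ref_length(c, refs)) for c, refs in zip(cand_lens, mult_ref_lens)]
mean_sent_bp = sum(sent_bps) / len(sent_bps)
```

In this toy case the long second candidate compensates for the short first one at corpus level, so corpus_bp is 1.0 while mean_sent_bp is lower.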

bleu_1(candidates: list[str], mult_references: list[list[str]], return_all_scores: bool = True, *, option: Literal['shortest', 'average', 'closest'] = 'closest', verbose: int = 0, tokenizer: Callable[[str], list[str]] = str.split, return_1_to_n: bool = False) → tuple[dict[str, Tensor], dict[str, Tensor]] | Tensor

bleu_2(candidates: list[str], mult_references: list[list[str]], return_all_scores: bool = True, *, option: Literal['shortest', 'average', 'closest'] = 'closest', verbose: int = 0, tokenizer: Callable[[str], list[str]] = str.split, return_1_to_n: bool = False) → tuple[dict[str, Tensor], dict[str, Tensor]] | Tensor

bleu_3(candidates: list[str], mult_references: list[list[str]], return_all_scores: bool = True, *, option: Literal['shortest', 'average', 'closest'] = 'closest', verbose: int = 0, tokenizer: Callable[[str], list[str]] = str.split, return_1_to_n: bool = False) → tuple[dict[str, Tensor], dict[str, Tensor]] | Tensor

bleu_4(candidates: list[str], mult_references: list[list[str]], return_all_scores: bool = True, *, option: Literal['shortest', 'average', 'closest'] = 'closest', verbose: int = 0, tokenizer: Callable[[str], list[str]] = str.split, return_1_to_n: bool = False) → tuple[dict[str, Tensor], dict[str, Tensor]] | Tensor

cider_d(candidates: list[str], mult_references: list[list[str]], return_all_scores: bool = True, *, n: int = 4, sigma: float = 6.0, tokenizer: Callable[[str], list[str]] = str.split, return_tfidf: bool = False, scale: float = 10.0) → tuple[CIDErDScores, CIDErDScores] | Tensor

Consensus-based Image Description Evaluation function.

Warning

This metric requires at least 2 candidates with 2 sets of references, otherwise it raises a ValueError.

Parameters:
candidates

The list of sentences to evaluate.

mult_references

The list of list of sentences used as target.

return_all_scores

If True, returns a tuple containing the globals and locals scores. Otherwise returns a scalar tensor containing the main global score. defaults to True.

n

Maximal number of n-grams taken into account. defaults to 4.

sigma

Standard deviation parameter used for gaussian penalty. defaults to 6.0.

tokenizer

The fast tokenizer used to split sentences into words. defaults to str.split.

return_tfidf

If True, returns the list of dictionaries containing the tf-idf scores of n-grams in the sents_score output. defaults to False.

scale

CIDEr-D score factor. defaults to 10.0.

Returns:

A tuple of globals and locals scores or a scalar tensor with the main global score.
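The roles of sigma and scale can be sketched as follows. This is an illustration under the usual CIDEr-D formulation, not the code of aac-metrics: the similarity argument is a placeholder for the TF-IDF n-gram cosine similarity, which is not computed here, and both function names are hypothetical.

```python
import math


def gaussian_length_penalty(cand_len: int, ref_len: int, sigma: float = 6.0) -> float:
    """Gaussian penalty on the length difference between a candidate and a reference."""
    delta = cand_len - ref_len
    return math.exp(-(delta**2) / (2.0 * sigma**2))


def scaled_score(
    similarity: float,
    cand_len: int,
    ref_len: int,
    sigma: float = 6.0,
    scale: float = 10.0,
) -> float:
    """Apply the length penalty and the scale factor to a raw similarity (placeholder value)."""
    return scale * similarity * gaussian_length_penalty(cand_len, ref_len, sigma)


same_length = scaled_score(0.5, cand_len=8, ref_len=8)
shorter = scaled_score(0.5, cand_len=2, ref_len=8)
```

A smaller sigma makes the penalty decay faster as the candidate length moves away from the reference length.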

clap_sim(
candidates: list[str],
mult_references: list[list[str]] | None = None,
audio_paths: list[str] | None = None,
return_all_scores: bool = True,
*,
clap_method: 'audio' | 'text' = 'text',
clap_model: str | CLAPWrapper = 'MS-CLAP-2023',
device: str | device | None = 'cuda_if_available',
batch_size: int | None = 32,
reset_state: bool = True,
seed: int | None = 42,
verbose: int = 0,
) → tuple[CLAPScores, CLAPScores] | Tensor

Cosine-similarity of the Contrastive Language-Audio Pretraining (CLAP) embeddings.

The implementation is based on the msclap pypi package.

Parameters:
candidates: list[str]

The list of sentences to evaluate.

mult_references: list[list[str]] | None = None

The list of list of sentences used as target when clap_method is “text”. defaults to None.

audio_paths: list[str] | None = None

Audio filepaths required when clap_method is “audio”. defaults to None.

return_all_scores: bool = True

If True, returns a tuple containing the globals and locals scores. Otherwise returns a scalar tensor containing the main global score. defaults to True.

clap_method: 'audio' | 'text' = 'text'

The method used to encode the sentences. Can be “text” or “audio”. defaults to “text”.

clap_model: str | CLAPWrapper = 'MS-CLAP-2023'

The CLAP model used to extract sentence embeddings for cosine-similarity. defaults to “MS-CLAP-2023”.

device: str | device | None = 'cuda_if_available'

The PyTorch device used to run the CLAP model. If “cuda_if_available”, it will use cuda if available. defaults to “cuda_if_available”.

batch_size: int | None = 32

The batch size of the CLAP models. defaults to 32.

reset_state: bool = True

If True, reset the state of the PyTorch global generator after the initialization of the pre-trained models. defaults to True.

seed: int | None = 42

Optional seed to make CLAP-sim scores deterministic when using clap_method=“audio” on large audio files. defaults to 42.

verbose: int = 0

The verbose level. defaults to 0.

Returns:

A tuple of globals and locals scores or a scalar tensor with the main global score.

dcase2023_evaluate(
candidates: list[str],
mult_references: list[list[str]],
preprocess: bool | Callable[[list[str]], list[str]] = True,
cache_path: str | Path | None = None,
java_path: str | Path | None = None,
tmp_path: str | Path | None = None,
device: str | device | None = 'cuda_if_available',
verbose: int = 0,
) → tuple[dict[str, Tensor], dict[str, Tensor]]

Evaluate candidates with multiple references with the DCASE2023 Audio Captioning metrics.

Parameters:
candidates: list[str]

The list of sentences to evaluate.

mult_references: list[list[str]]

The list of list of sentences used as target.

preprocess: bool | Callable[[list[str]], list[str]] = True

If True, the candidates and references will be passed as input to the PTB stanford tokenizer before computing metrics. defaults to True.

cache_path: str | Path | None = None

The path to the external code directory. defaults to the value returned by get_default_cache_path().

java_path: str | Path | None = None

The path to the java executable. defaults to the value returned by get_default_java_path().

tmp_path: str | Path | None = None

Temporary directory path. defaults to the value returned by get_default_tmp_path().

device: str | device | None = 'cuda_if_available'

The PyTorch device used to run FENSE and SPIDErFL models. If None, it will try to use cuda if available. defaults to “cuda_if_available”.

verbose: int = 0

The verbose level. defaults to 0.

Returns:

A tuple containing the corpus and sentence scores.

dcase2024_evaluate(
candidates: list[str],
mult_references: list[list[str]],
preprocess: bool | Callable[[list[str]], list[str]] = True,
cache_path: str | Path | None = None,
java_path: str | Path | None = None,
tmp_path: str | Path | None = None,
device: str | device | None = 'cuda_if_available',
verbose: int = 0,
) → tuple[dict[str, Tensor], dict[str, Tensor]]

Evaluate candidates with multiple references with the DCASE2024 Audio Captioning metrics.

Parameters:
candidates: list[str]

The list of sentences to evaluate.

mult_references: list[list[str]]

The list of list of sentences used as target.

preprocess: bool | Callable[[list[str]], list[str]] = True

If True, the candidates and references will be passed as input to the PTB stanford tokenizer before computing metrics. defaults to True.

cache_path: str | Path | None = None

The path to the external code directory. defaults to the value returned by get_default_cache_path().

java_path: str | Path | None = None

The path to the java executable. defaults to the value returned by get_default_java_path().

tmp_path: str | Path | None = None

Temporary directory path. defaults to the value returned by get_default_tmp_path().

device: str | device | None = 'cuda_if_available'

The PyTorch device used to run FENSE and SPIDErFL models. If None, it will try to use cuda if available. defaults to “cuda_if_available”.

verbose: int = 0

The verbose level. defaults to 0.

Returns:

A tuple containing the corpus and sentence scores.

evaluate(
candidates: list[str],
mult_references: list[list[str]],
preprocess: bool | Callable[[list[str]], list[str]] = True,
metrics: str | Iterable[str] | Iterable[Callable[[list, list], tuple]] = 'default',
cache_path: str | Path | None = None,
java_path: str | Path | None = None,
tmp_path: str | Path | None = None,
device: str | device | None = 'cuda_if_available',
verbose: int = 0,
) → tuple[dict[str, Tensor], dict[str, Tensor]]

Evaluate candidates with multiple references with custom metrics.

Parameters:
candidates: list[str]

The list of sentences to evaluate.

mult_references: list[list[str]]

The list of list of sentences used as target.

preprocess: bool | Callable[[list[str]], list[str]] = True

If True, the candidates and references will be passed as input to the PTB stanford tokenizer before computing metrics. defaults to True.

metrics: str | Iterable[str] | Iterable[Callable[[list, list], tuple]] = 'default'

The name of the metric list or the explicit list of metrics to compute. defaults to “default”.

cache_path: str | Path | None = None

The path to the external code directory. defaults to the value returned by get_default_cache_path().

java_path: str | Path | None = None

The path to the java executable. defaults to the value returned by get_default_java_path().

tmp_path: str | Path | None = None

Temporary directory path. defaults to the value returned by get_default_tmp_path().

device: str | device | None = 'cuda_if_available'

The PyTorch device used to run FENSE and SPIDErFL models. If None, it will try to use cuda if available. defaults to “cuda_if_available”.

verbose: int = 0

The verbose level. defaults to 0.

Returns:

A tuple containing the corpus and sentence scores.
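The metrics parameter accepts callables of type Callable[[list, list], tuple]. A minimal sketch of a function with that shape is given below: it takes the candidates and the references, and returns a pair of dictionaries holding the corpus-level and sentence-level scores. The metric n_words_metric is hypothetical, and plain floats stand in for the tensors returned by the built-in metrics; whether evaluate accepts plain floats has not been verified here.

```python
def n_words_metric(
    candidates: list[str],
    mult_references: list[list[str]],
) -> tuple[dict[str, float], dict[str, list[float]]]:
    """Toy metric: number of words per candidate (the references are ignored)."""
    if len(candidates) != len(mult_references):
        raise ValueError("Expected one list of references per candidate.")

    sents_n_words = [float(len(cand.split())) for cand in candidates]
    corpus_n_words = sum(sents_n_words) / len(sents_n_words)

    # First dict: one value per metric name; second dict: one value per candidate.
    corpus_scores = {"n_words": corpus_n_words}
    sents_scores = {"n_words": sents_n_words}
    return corpus_scores, sents_scores


candidates = ["a man speaks", "birds are singing in the rain"]
mult_references = [["a man is speaking"], ["birds sing", "rain falls while birds sing"]]
corpus_scores, sents_scores = n_words_metric(candidates, mult_references)
```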

fense(
candidates: list[str],
mult_references: list[list[str]],
return_all_scores: bool = True,
*,
sbert_model: str | SentenceTransformer = 'paraphrase-TinyBERT-L6-v2',
echecker: str | BERTFlatClassifier = 'echecker_clotho_audiocaps_base',
echecker_tokenizer: AutoTokenizer | None = None,
error_threshold: float = 0.9,
device: str | device | None = 'cuda_if_available',
batch_size: int | None = 32,
reset_state: bool = True,
return_probs: bool = False,
penalty: float = 0.9,
verbose: int = 0,
) → tuple[FENSEScores, FENSEScores] | Tensor

Fluency ENhanced Sentence-bert Evaluation (FENSE) function.

Parameters:
candidates: list[str]

The list of sentences to evaluate.

mult_references: list[list[str]]

The list of list of sentences used as target.

return_all_scores: bool = True

If True, returns a tuple containing the globals and locals scores. Otherwise returns a scalar tensor containing the main global score. defaults to True.

sbert_model: str | SentenceTransformer = 'paraphrase-TinyBERT-L6-v2'

The sentence BERT model used to extract sentence embeddings for cosine-similarity. defaults to “paraphrase-TinyBERT-L6-v2”.

echecker: str | BERTFlatClassifier = 'echecker_clotho_audiocaps_base'

The echecker model used to detect fluency errors. Can be “echecker_clotho_audiocaps_base”, “echecker_clotho_audiocaps_tiny”, “none” or None. defaults to “echecker_clotho_audiocaps_base”.

echecker_tokenizer: AutoTokenizer | None = None

The tokenizer of the echecker model. If None and echecker is not None, this value will be inferred with echecker.model_type. defaults to None.

error_threshold: float = 0.9

The threshold used to detect fluency errors for echecker model. defaults to 0.9.

penalty: float = 0.9

The penalty coefficient applied. A higher value lowers the cos-sim scores more when an error is detected. defaults to 0.9.

device: str | device | None = 'cuda_if_available'

The PyTorch device used to run pre-trained models. If “cuda_if_available”, it will use cuda if available. defaults to “cuda_if_available”.

batch_size: int | None = 32

The batch size of the sBERT and echecker models. defaults to 32.

reset_state: bool = True

If True, reset the state of the PyTorch global generator after the initialization of the pre-trained models. defaults to True.

return_probs: bool = False

If True, return each individual error probability given by the fluency detector model. defaults to False.

verbose: int = 0

The verbose level. defaults to 0.

Returns:

A tuple of globals and locals scores or a scalar tensor with the main global score.
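The interaction between error_threshold and penalty can be sketched for a single sentence. The function below is a hypothetical illustration, not the aac-metrics implementation: it assumes that a sentence is flagged when its error probability is strictly above the threshold, and that the similarity is then multiplied by (1 - penalty).

```python
def penalized_score(
    sim_score: float,
    error_prob: float,
    error_threshold: float = 0.9,
    penalty: float = 0.9,
) -> float:
    """Lower a similarity score when the fluency-error probability exceeds the threshold."""
    # Assumption: strict comparison against the threshold.
    has_error = error_prob > error_threshold
    return sim_score * (1.0 - penalty) if has_error else sim_score


fluent = penalized_score(0.8, error_prob=0.2)
disfluent = penalized_score(0.8, error_prob=0.95)
```

With the default penalty of 0.9, a flagged sentence keeps about a tenth of its cosine-similarity score, while a sentence below the threshold is left unchanged.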

fer(
candidates: list[str],
return_all_scores: bool = True,
*,
echecker: str | BERTFlatClassifier = 'echecker_clotho_audiocaps_base',
echecker_tokenizer: AutoTokenizer | None = None,
error_threshold: float = 0.9,
device: str | device | None = 'cuda_if_available',
batch_size: int | None = 32,
reset_state: bool = True,
return_probs: bool = False,
verbose: int = 0,
) → tuple[FERScores, FERScores] | Tensor

Return Fluency Error Rate (FER) detected by a pre-trained BERT model.

Parameters:
candidates: list[str]

The list of sentences to evaluate.

return_all_scores: bool = True

If True, returns a tuple containing the globals and locals scores. Otherwise returns a scalar tensor containing the main global score. defaults to True.

echecker: str | BERTFlatClassifier = 'echecker_clotho_audiocaps_base'

The echecker model used to detect fluency errors. Can be “echecker_clotho_audiocaps_base”, “echecker_clotho_audiocaps_tiny”, “none” or None. defaults to “echecker_clotho_audiocaps_base”.

echecker_tokenizer: AutoTokenizer | None = None

The tokenizer of the echecker model. If None and echecker is not None, this value will be inferred with echecker.model_type. defaults to None.

error_threshold: float = 0.9

The threshold used to detect fluency errors for echecker model. defaults to 0.9.

device: str | device | None = 'cuda_if_available'

The PyTorch device used to run pre-trained models. If “cuda_if_available”, it will use cuda if available. defaults to “cuda_if_available”.

batch_size: int | None = 32

The batch size of the echecker models. defaults to 32.

reset_state: bool = True

If True, reset the state of the PyTorch global generator after the initialization of the pre-trained models. defaults to True.

return_probs: bool = False

If True, return each individual error probability given by the fluency detector model. defaults to False.

verbose: int = 0

The verbose level. defaults to 0.

Returns:

A tuple of globals and locals scores or a scalar tensor with the main global score.

mace(
candidates: list[str],
mult_references: list[list[str]] | None = None,
audio_paths: list[str] | None = None,
return_all_scores: bool = True,
*,
mace_method: 'text' | 'audio' | 'combined' = 'text',
penalty: float = 0.3,
clap_model: str | CLAPWrapper = 'MS-CLAP-2023',
seed: int | None = 42,
echecker: str | BERTFlatClassifier = 'echecker_clotho_audiocaps_base',
echecker_tokenizer: AutoTokenizer | None = None,
error_threshold: float = 0.97,
device: str | device | None = 'cuda_if_available',
batch_size: int | None = 32,
reset_state: bool = True,
return_probs: bool = False,
verbose: int = 0,
) → Tensor | tuple[MACEScores, MACEScores]

Multimodal Audio-Caption Evaluation (MACE) function.

MACE is a metric designed for evaluating automated audio captioning (AAC) systems. Unlike metrics that compare machine-generated captions solely to human references, MACE uses both audio and text to improve evaluation. By integrating both audio and text, it produces assessments that align better with human judgments.

The implementation is based on the original MACE implementation (the original authors have agreed to the inclusion of their code in aac-metrics under the MIT license).

Parameters:
candidates: list[str]

The list of sentences to evaluate.

mult_references: list[list[str]] | None = None

The list of list of sentences used as target when mace_method is “text” or “combined”. defaults to None.

audio_paths: list[str] | None = None

Audio filepaths required when mace_method is “audio” or “combined”. defaults to None.

return_all_scores: bool = True

If True, returns a tuple containing the globals and locals scores. Otherwise returns a scalar tensor containing the main global score. defaults to True.

mace_method: 'text' | 'audio' | 'combined' = 'text'

The method used to encode the sentences. Can be “text”, “audio” or “combined”. defaults to “text”.

penalty: float = 0.3

The penalty coefficient applied. A higher value lowers the cos-sim scores more when an error is detected. defaults to 0.3.

clap_model: str | CLAPWrapper = 'MS-CLAP-2023'

The CLAP model used to extract CLAP embeddings for cosine-similarity. defaults to “MS-CLAP-2023”.

seed: int | None = 42

Optional seed to make CLAP-sim scores deterministic when using mace_method=“audio” or “combined” on large audio files. defaults to 42.

echecker: str | BERTFlatClassifier = 'echecker_clotho_audiocaps_base'

The echecker model used to detect fluency errors. Can be “echecker_clotho_audiocaps_base”, “echecker_clotho_audiocaps_tiny”, “none” or None. defaults to “echecker_clotho_audiocaps_base”.

echecker_tokenizer: AutoTokenizer | None = None

The tokenizer of the echecker model. If None and echecker is not None, this value will be inferred with echecker.model_type. defaults to None.

error_threshold: float = 0.97

The threshold used to detect fluency errors for echecker model. defaults to 0.97.

device: str | device | None = 'cuda_if_available'

The PyTorch device used to run pre-trained models. If “cuda_if_available”, it will use cuda if available. defaults to “cuda_if_available”.

batch_size: int | None = 32

The batch size of the CLAP and echecker models. defaults to 32.

reset_state: bool = True

If True, reset the state of the PyTorch global generator after the initialization of the pre-trained models. defaults to True.

return_probs: bool = False

If True, return each individual error probability given by the fluency detector model. defaults to False.

verbose: int = 0

The verbose level. defaults to 0.

Returns:

A tuple of globals and locals scores or a scalar tensor with the main global score.

meteor(
candidates: list[str],
mult_references: list[list[str]],
return_all_scores: bool = True,
*,
cache_path: str | Path | None = None,
java_path: str | Path | None = None,
java_max_memory: str = '2G',
language: 'en' | 'cz' | 'de' | 'es' | 'fr' = 'en',
use_shell: bool | None = None,
params: Iterable[float] | None = None,
weights: Iterable[float] | None = None,
verbose: int = 0,
) → tuple[METEORScores, METEORScores] | Tensor

Metric for Evaluation of Translation with Explicit ORdering function.

Parameters:
candidates: list[str]

The list of sentences to evaluate.

mult_references: list[list[str]]

The list of list of sentences used as target.

return_all_scores: bool = True

If True, returns a tuple containing the globals and locals scores. Otherwise returns a scalar tensor containing the main global score. defaults to True.

cache_path: str | Path | None = None

The path to the external code directory. defaults to the value returned by get_default_cache_path().

java_path: str | Path | None = None

The path to the java executable. defaults to the value returned by get_default_java_path().

java_max_memory: str = '2G'

The maximal java memory used. defaults to “2G”.

language: 'en' | 'cz' | 'de' | 'es' | 'fr' = 'en'

The language used for stem, synonym and paraphrase matching. Can be one of (“en”, “cz”, “de”, “es”, “fr”). defaults to “en”.

use_shell: bool | None = None

Optional argument to force the use of an OS-specific shell for the java subprogram. If None, it will use a shell only on Windows OS. defaults to None.

params: Iterable[float] | None = None

List of 4 parameters (alpha, beta, gamma, delta) used in the METEOR metric. If None, it will use the defaults of the java program, which are (0.85, 0.2, 0.6, 0.75). defaults to None.

weights: Iterable[float] | None = None

List of 4 parameters (w1, w2, w3, w4) used in the METEOR metric. If None, it will use the defaults of the java program, which are (1.0, 1.0, 0.6, 0.8). defaults to None.

verbose: int = 0

The verbose level. defaults to 0.

Returns:

A tuple of globals and locals scores or a scalar tensor with the main global score.

rouge_l(candidates: list[str], mult_references: list[list[str]], return_all_scores: bool = True, *, beta: float = 1.2, tokenizer: Callable[[str], list[str]] = str.split) → tuple[ROUGELScores, ROUGELScores] | Tensor

Recall-Oriented Understudy for Gisting Evaluation function.

Parameters:
candidates

The list of sentences to evaluate.

mult_references

The list of list of sentences used as target.

return_all_scores

If True, returns a tuple containing the globals and locals scores. Otherwise returns a scalar tensor containing the main global score. defaults to True.

beta

Determines the weight of recall in the combined f-score. defaults to 1.2.

tokenizer

The fast tokenizer used to split sentences into words. defaults to str.split.

Returns:

A tuple of globals and locals scores or a scalar tensor with the main global score.
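The effect of beta can be illustrated with a short sketch of a ROUGE-L style F-score based on the longest common subsequence (LCS). The functions below are hypothetical and are not the aac-metrics code; keeping the best precision and the best recall over the references is an assumption borrowed from common captioning implementations.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if tok_a == tok_b else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_sketch(candidate: str, references: list[str], beta: float = 1.2, tokenizer=str.split) -> float:
    """F-score combining LCS-based precision and recall; beta weights the recall."""
    cand = tokenizer(candidate)
    if not cand:
        return 0.0
    precisions, recalls = [], []
    for reference in references:
        ref = tokenizer(reference)
        lcs = lcs_length(cand, ref)
        precisions.append(lcs / len(cand))
        recalls.append(lcs / len(ref) if ref else 0.0)
    # Assumption: keep the best precision and the best recall over the references.
    p, r = max(precisions), max(recalls)
    if p == 0.0 or r == 0.0:
        return 0.0
    return (1 + beta**2) * p * r / (r + beta**2 * p)


score = rouge_l_sketch("a man is speaking", ["a man speaks", "a man is talking loudly"])
```

A beta above 1 gives more weight to recall than to precision in the combined score.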

sbert_sim(
candidates: list[str],
mult_references: list[list[str]],
return_all_scores: bool = True,
*,
sbert_model: str | SentenceTransformer = 'paraphrase-TinyBERT-L6-v2',
device: str | device | None = 'cuda_if_available',
batch_size: int | None = 32,
reset_state: bool = True,
verbose: int = 0,
) → tuple[SBERTSimScores, SBERTSimScores] | Tensor

Cosine-similarity of the Sentence-BERT embeddings.

Parameters:
candidates: list[str]

The list of sentences to evaluate.

mult_references: list[list[str]]

The list of list of sentences used as target.

return_all_scores: bool = True

If True, returns a tuple containing the globals and locals scores. Otherwise returns a scalar tensor containing the main global score. defaults to True.

sbert_model: str | SentenceTransformer = 'paraphrase-TinyBERT-L6-v2'

The sentence BERT model used to extract sentence embeddings for cosine-similarity. defaults to “paraphrase-TinyBERT-L6-v2”.

device: str | device | None = 'cuda_if_available'

The PyTorch device used to run pre-trained models. If “cuda_if_available”, it will use cuda if available. defaults to “cuda_if_available”.

batch_size: int | None = 32

The batch size of the sBERT models. defaults to 32.

reset_state: bool = True

If True, reset the state of the PyTorch global generator after the initialization of the pre-trained models. defaults to True.

verbose: int = 0

The verbose level. defaults to 0.

Returns:

A tuple of globals and locals scores or a scalar tensor with the main global score.

spice(
candidates: list[str],
mult_references: list[list[str]],
return_all_scores: bool = True,
*,
cache_path: str | Path | None = None,
java_path: str | Path | None = None,
tmp_path: str | Path | None = None,
n_threads: int | None = None,
java_max_memory: str = '8G',
timeout: None | int | Iterable[int] = None,
separate_cache_dir: bool = True,
use_shell: bool | None = None,
verbose: int = 0,
) → tuple[SPICEScores, SPICEScores] | Tensor

Semantic Propositional Image Caption Evaluation function.

Parameters:
candidates: list[str]

The list of sentences to evaluate.

mult_references: list[list[str]]

The list of list of sentences used as target.

return_all_scores: bool = True

If True, returns a tuple containing the globals and locals scores. Otherwise returns a scalar tensor containing the main global score. defaults to True.

cache_path: str | Path | None = None

The path to the external code directory. defaults to the value returned by get_default_cache_path().

java_path: str | Path | None = None

The path to the java executable. defaults to the value returned by get_default_java_path().

tmp_path: str | Path | None = None

Temporary directory path. defaults to the value returned by get_default_tmp_path().

n_threads: int | None = None

Number of threads used to compute SPICE. None value will use the default value of the java program. defaults to None.

java_max_memory: str = '8G'

The maximal java memory used. defaults to “8G”.

timeout: None | int | Iterable[int] = None

The number of seconds before killing the java subprogram. If a list is given, it will restart the program if the i-th timeout is reached. If None, no timeout will be used. defaults to None.

separate_cache_dir: bool = True

If True, the SPICE cache files will be stored in a new temporary directory. This removes potential freezes when multiple instances of SPICE are running in the same cache dir. defaults to True.

use_shell: bool | None = None

Optional argument to force the use of an OS-specific shell for the java subprogram. If None, it will use a shell only on Windows OS. defaults to None.

verbose: int = 0

The verbose level. defaults to 0.

Returns:

A tuple of globals and locals scores or a scalar tensor with the main global score.

spider(candidates: list[str], mult_references: list[list[str]], return_all_scores: bool = True, *, n: int = 4, sigma: float = 6.0, tokenizer: Callable[[str], list[str]] = str.split, return_tfidf: bool = False, cache_path: str | Path | None = None, java_path: str | Path | None = None, tmp_path: str | Path | None = None, n_threads: int | None = None, java_max_memory: str = '8G', timeout: None | int | Iterable[int] = None, verbose: int = 0) → tuple[SPIDErScores, SPIDErScores] | Tensor

SPIDEr function, which averages the CIDEr-D and SPICE scores.

Warning

This metric requires at least 2 candidates with 2 sets of references, otherwise it raises a ValueError.

Parameters:
candidates

The list of sentences to evaluate.

mult_references

The list of list of sentences used as target.

return_all_scores

If True, returns a tuple containing the globals and locals scores. Otherwise returns a scalar tensor containing the main global score. defaults to True.

n

Maximal number of n-grams taken into account. defaults to 4.

sigma

Standard deviation parameter used for gaussian penalty. defaults to 6.0.

tokenizer

The fast tokenizer used to split sentences into words. defaults to str.split.

return_tfidf

If True, returns the list of dictionaries containing the tf-idf scores of n-grams in the sents_score output. defaults to False.

cache_path

The path to the external code directory. defaults to the value returned by get_default_cache_path().

java_path

The path to the java executable. defaults to the value returned by get_default_java_path().

tmp_path

Temporary directory path. defaults to the value returned by get_default_tmp_path().

n_threads

Number of threads used to compute SPICE. None value will use the default value of the java program. defaults to None.

java_max_memory

The maximal java memory used. defaults to “8G”.

timeout

The number of seconds before killing the java subprogram. If a list is given, it will restart the program if the i-th timeout is reached. If None, no timeout will be used. defaults to None.

verbose

The verbose level. defaults to 0.

Returns:

A tuple of globals and locals scores or a scalar tensor with the main global score.
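SPIDEr is commonly defined as the mean of the CIDEr-D and SPICE scores. The sketch below shows that combination on precomputed per-sentence values; spider_sketch is a hypothetical helper operating on plain floats, and the input values are made up for illustration rather than produced by the real metrics.

```python
def spider_sketch(
    cider_d_scores: list[float],
    spice_scores: list[float],
) -> tuple[float, list[float]]:
    """Average CIDEr-D and SPICE per sentence, then average over the corpus."""
    if len(cider_d_scores) != len(spice_scores):
        raise ValueError("Both metrics must score the same number of sentences.")
    sents = [(c + s) / 2.0 for c, s in zip(cider_d_scores, spice_scores)]
    corpus = sum(sents) / len(sents)
    return corpus, sents


# Made-up per-sentence scores for two candidates.
corpus_spider, sents_spider = spider_sketch([1.0, 0.5], [0.25, 0.25])
```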

spider_fl(candidates: list[str], mult_references: list[list[str]], return_all_scores: bool = True, *, n: int = 4, sigma: float = 6.0, tokenizer: Callable[[str], list[str]] = str.split, return_tfidf: bool = False, cache_path: str | Path | None = None, java_path: str | Path | None = None, tmp_path: str | Path | None = None, n_threads: int | None = None, java_max_memory: str = '8G', timeout: None | int | Iterable[int] = None, echecker: str | BERTFlatClassifier = 'echecker_clotho_audiocaps_base', echecker_tokenizer: AutoTokenizer | None = None, error_threshold: float = 0.9, device: str | device | None = 'cuda_if_available', batch_size: int | None = 32, reset_state: bool = True, return_probs: bool = True, penalty: float = 0.9, verbose: int = 0) → tuple[SPIDErFLScores, SPIDErFLScores] | Tensor

Combination of SPIDEr with a Fluency Error detector.

Warning

This metric requires at least 2 candidates with 2 sets of references, otherwise it raises a ValueError.

Parameters:
candidates

The list of sentences to evaluate.

mult_references

The list of list of sentences used as target.

return_all_scores

If True, returns a tuple containing the globals and locals scores. Otherwise returns a scalar tensor containing the main global score. defaults to True.

n

Maximal number of n-grams taken into account. defaults to 4.

sigma

Standard deviation parameter used for gaussian penalty. defaults to 6.0.

tokenizer

The function used to split sentences into words. defaults to str.split.

return_tfidf

If True, returns the list of dictionaries containing the tf-idf scores of n-grams in the sents_score output. defaults to False.

cache_path

The path to the external code directory. defaults to the value returned by get_default_cache_path().

java_path

The path to the java executable. defaults to the value returned by get_default_java_path().

tmp_path

Temporary directory path. defaults to the value returned by get_default_tmp_path().

n_threads

Number of threads used to compute SPICE. None value will use the default value of the java program. defaults to None.

java_max_memory

The maximal java memory used. defaults to “8G”.

timeout

The number of seconds before killing the java subprogram. If a list is given, it will restart the program if the i-th timeout is reached. If None, no timeout will be used. defaults to None.

echecker

The echecker model used to detect fluency errors. Can be “echecker_clotho_audiocaps_base”, “echecker_clotho_audiocaps_tiny”, “none” or None. defaults to “echecker_clotho_audiocaps_base”.

echecker_tokenizer

The tokenizer of the echecker model. If None and echecker is not None, this value will be inferred with echecker.model_type. defaults to None.

error_threshold

The threshold used to detect fluency errors with the echecker model. defaults to 0.9.

device

The PyTorch device used to run pre-trained models. If “cuda_if_available”, it will use cuda if available. defaults to “cuda_if_available”.

batch_size

The batch size of the echecker model. defaults to 32.

reset_state

If True, reset the state of the PyTorch global generator after the initialization of the pre-trained models. defaults to True.

return_probs

If True, return each individual error probability given by the fluency detector model. defaults to True.

penalty

The penalty coefficient applied when a fluency error is detected. A higher value lowers the score more. defaults to 0.9.

verbose

The verbose level. defaults to 0.

Returns:

A tuple of globals and locals scores or a scalar tensor with the main global score.
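
To make the interaction between error_threshold and penalty concrete, here is a minimal sketch of one way a fluency penalty can be applied to per-sentence SPIDEr scores. It assumes the multiplicative rule score * (1 - penalty) whenever the error probability exceeds the threshold; this documentation does not state the exact formula, so treat the rule and the helper name apply_fluency_penalty as assumptions, not the library's implementation.

```python
from typing import List, Sequence


def apply_fluency_penalty(
    spider_scores: Sequence[float],
    error_probs: Sequence[float],
    error_threshold: float = 0.9,
    penalty: float = 0.9,
) -> List[float]:
    """Hypothetical helper: scale down scores of sentences flagged as non-fluent."""
    penalized = []
    for score, prob in zip(spider_scores, error_probs):
        # Assumption: a sentence is flagged when its error probability
        # is strictly greater than the threshold.
        if prob > error_threshold:
            score = score * (1.0 - penalty)
        penalized.append(score)
    return penalized


if __name__ == "__main__":
    scores = apply_fluency_penalty([1.0, 0.5], [0.95, 0.10])
    # The first sentence is flagged and keeps about 10% of its score,
    # the second one is left unchanged.
    print(scores)
    print(sum(scores) / len(scores))  # corpus-level average of the sketch
```

Under this assumption, the default penalty of 0.9 leaves a flagged sentence with a tenth of its original score.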

spider_max(mult_candidates: list[list[str]], mult_references: list[list[str]], return_all_scores: bool = True, *, return_all_cands_scores: bool = False, n: int = 4, sigma: float = 6.0, tokenizer: Callable[[str], list[str]] = str.split, return_tfidf: bool = False, cache_path: str | Path | None = None, java_path: str | Path | None = None, tmp_path: str | Path | None = None, n_threads: int | None = None, java_max_memory: str = '8G', timeout: None | int | Iterable[int] = None, verbose: int = 0) -> tuple[SPIDErMaxScores, SPIDErMaxScores] | Tensor[source]

SPIDEr-max function.

Compute the maximal SPIDEr score across multiple candidates.

Warning

This metric requires at least 2 candidates with 2 sets of references; otherwise it raises a ValueError.

Parameters:
mult_candidates

The list of lists of sentences to evaluate.

mult_references

The list of lists of sentences used as targets.

return_all_scores

If True, returns a tuple containing the globals and locals scores. Otherwise returns a scalar tensor containing the main global score. defaults to True.

return_all_cands_scores

If True, returns the scores of all candidates in the sents_scores outputs as a tensor of shape (n_audio, n_cands_per_audio). defaults to False.

n

Maximal n-gram size taken into account. defaults to 4.

sigma

Standard deviation parameter used for the Gaussian penalty. defaults to 6.0.

tokenizer

The function used to split sentences into words. defaults to str.split.

return_tfidf

If True, returns the list of dictionaries containing the tf-idf scores of n-grams in the sents_score output. defaults to False.

cache_path

The path to the external code directory. defaults to the value returned by get_default_cache_path().

java_path

The path to the java executable. defaults to the value returned by get_default_java_path().

tmp_path

Temporary directory path. defaults to the value returned by get_default_tmp_path().

java_max_memory

The maximal java memory used. defaults to “8G”.

n_threads

Number of threads used to compute SPICE. None value will use the default value of the java program. defaults to None.

timeout

The number of seconds before killing the java subprogram. If a list is given, it will restart the program if the i-th timeout is reached. If None, no timeout will be used. defaults to None.

verbose

The verbose level. defaults to 0.

Returns:

A tuple of globals and locals scores or a scalar tensor with the main global score.
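
The idea of taking the maximal score across candidates can be sketched without running SPIDEr itself. The example below assumes that a per-candidate score is already available for every audio item, arranged as (n_audio, n_cands_per_audio), and that the corpus-level value is the mean of the per-audio maxima. The helper name spider_max_from_scores and the averaging step are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def spider_max_from_scores(
    cand_scores: Sequence[Sequence[float]],
) -> Tuple[float, List[float]]:
    """Hypothetical helper: reduce per-candidate scores to a max-based score."""
    if not cand_scores:
        raise ValueError("cand_scores must contain at least one audio item")
    # Keep the best candidate score for each audio item.
    per_audio = [max(scores) for scores in cand_scores]
    # Assumption: the corpus-level score is the average of those maxima.
    corpus = sum(per_audio) / len(per_audio)
    return corpus, per_audio


if __name__ == "__main__":
    # Two audio items, two candidates each (made-up scores).
    corpus, per_audio = spider_max_from_scores([[0.2, 0.5], [0.9, 0.1]])
    print(per_audio)
    print(corpus)
```

A score computed this way is never lower than the score of any single fixed candidate per audio, which is why it is read as an upper bound over the candidate set.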

vocab(candidates: list[str], mult_references: list[list[str]] | None, return_all_scores: bool = True, *, seed: None | int | Generator = 1234, tokenizer: Callable[[str], list[str]] = str.split, dtype: dtype = torch.float64, pop_strategy: Literal['max', 'min'] | int = 'max', verbose: int = 0) -> tuple[VocabScores, VocabScores] | Tensor[source]

Compute vocabulary statistics.

Returns the candidate corpus vocabulary length, the references vocabulary length, the average vocabulary length for single references, and the vocabulary ratios between candidates and references.

Parameters:
candidates

The list of sentences to evaluate.

mult_references

The list of lists of sentences used as targets. Can also be None.

return_all_scores

If True, returns a tuple containing the globals and locals scores. Otherwise returns a scalar tensor containing the main global score. defaults to True.

seed

Random seed used to compute average vocabulary length for multiple references. defaults to 1234.

tokenizer

The function used to split a sentence into tokens. defaults to str.split.

dtype

Torch floating point dtype for numerical precision. defaults to torch.float64.

pop_strategy

Strategy to compute average reference vocab. defaults to “max”.

verbose

The verbose level. defaults to 0.

Returns:

A tuple of globals and locals scores or a scalar tensor with the main global score.
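
As an illustration of the statistics described above, the following sketch counts distinct tokens in the candidates and in the pooled references, then takes their ratio. It is a simplified stand-in: the function name vocab_stats and the dictionary keys are hypothetical, and the seeded average over single references (controlled by seed and pop_strategy) is deliberately left out.

```python
from typing import Callable, Dict, List, Optional


def vocab_stats(
    candidates: List[str],
    mult_references: Optional[List[List[str]]] = None,
    tokenizer: Callable[[str], List[str]] = str.split,
) -> Dict[str, float]:
    """Hypothetical helper: vocabulary sizes and candidate/reference ratio."""
    # Distinct tokens produced over all candidate sentences.
    cands_vocab = {tok for sent in candidates for tok in tokenizer(sent)}
    stats: Dict[str, float] = {"vocab_cands": float(len(cands_vocab))}

    if mult_references is not None:
        # Distinct tokens over every reference of every audio item.
        refs_vocab = {
            tok for refs in mult_references for ref in refs for tok in tokenizer(ref)
        }
        stats["vocab_refs"] = float(len(refs_vocab))
        if refs_vocab:
            stats["vocab_ratio"] = len(cands_vocab) / len(refs_vocab)
    return stats


if __name__ == "__main__":
    cands = ["a dog barks", "a bird sings"]
    refs = [
        ["a dog is barking", "dog barks loudly"],
        ["birds sing", "a bird is singing"],
    ]
    print(vocab_stats(cands, refs))
```

Swapping str.split for a tokenizer that lowercases or strips punctuation changes the counts, so the same tokenizer should be used when comparing systems.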

Submodules