aac_metrics.functional.mace module

class MACEScores

Bases: dict

clap_sim: Tensor

fer: Tensor

mace: Tensor

mace(candidates: list[str], mult_references: list[list[str]] | None = None, audio_paths: list[str] | None = None, return_all_scores: bool = True, *, mace_method: Literal['text', 'audio', 'combined'] = 'text', penalty: float = 0.3, clap_model: str | CLAPWrapper = 'MS-CLAP-2023', seed: int | None = 42, echecker: str | BERTFlatClassifier = 'echecker_clotho_audiocaps_base', echecker_tokenizer: AutoTokenizer | None = None, error_threshold: float = 0.97, device: str | device | None = 'cuda_if_available', batch_size: int | None = 32, reset_state: bool = True, return_probs: bool = False, verbose: int = 0) → Tensor | tuple[MACEScores, MACEScores]

Multimodal Audio-Caption Evaluation (MACE) function.

MACE is a metric for evaluating automated audio captioning (AAC) systems. Unlike metrics that compare machine-generated captions only to human references, MACE can use both the audio and the reference text, which produces assessments that align better with human judgments.

The implementation is based on the original MACE implementation (the original author has agreed to the inclusion of their code in aac-metrics under the MIT license).

Paper: https://arxiv.org/pdf/2411.00321
Original author: Satvik Dixit
Original implementation: https://github.com/satvik-dixit/mace/tree/main

Parameters:

candidates – The list of sentences to evaluate.
mult_references – The list of lists of reference sentences used as targets when mace_method is “text” or “combined”. defaults to None.
audio_paths – The audio file paths, required when mace_method is “audio” or “combined”. defaults to None.
return_all_scores – If True, returns a tuple containing the global and local scores. Otherwise returns a scalar tensor containing the main global score. defaults to True.
mace_method – The method used to encode the sentences. Can be “text”, “audio” or “combined”. defaults to “text”.
penalty – The penalty coefficient applied when a fluency error is detected. A higher value lowers the cosine-similarity score more. defaults to 0.3.
clap_model – The CLAP model used to extract CLAP embeddings for cosine-similarity. defaults to “MS-CLAP-2023”.
seed – Optional seed to make CLAP-sim scores deterministic when using mace_method=“audio” or “combined” on large audio files. defaults to 42.
echecker – The echecker model used to detect fluency errors. Can be “echecker_clotho_audiocaps_base”, “echecker_clotho_audiocaps_tiny”, “none” or None. defaults to “echecker_clotho_audiocaps_base”.
echecker_tokenizer – The tokenizer of the echecker model. If None and echecker is not None, this value will be inferred from echecker.model_type. defaults to None.
error_threshold – The threshold used by the echecker model to detect fluency errors. defaults to 0.97.
device – The PyTorch device used to run pre-trained models. If “cuda_if_available”, it will use cuda if available. defaults to “cuda_if_available”.
batch_size – The batch size of the CLAP and echecker models. defaults to 32.
reset_state – If True, reset the state of the PyTorch global generator after the initialization of the pre-trained models. defaults to True.
return_probs – If True, return each individual error probability given by the fluency detector model. defaults to False.
verbose – The verbose level. defaults to 0.
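
The dependency between mace_method and the optional inputs described above can be sketched as a small pre-flight check. This is only an illustration: check_mace_inputs is a hypothetical helper, not part of aac-metrics, and the length checks assume one reference list and one audio path per candidate.

```python
from typing import Optional


def check_mace_inputs(
    candidates: list[str],
    mult_references: Optional[list[list[str]]] = None,
    audio_paths: Optional[list[str]] = None,
    mace_method: str = "text",
) -> None:
    """Hypothetical helper: raise ValueError if the inputs do not match mace_method."""
    if mace_method not in ("text", "audio", "combined"):
        raise ValueError(f"Invalid mace_method: {mace_method!r}")

    # "text" and "combined" compare candidates against reference captions.
    if mace_method in ("text", "combined"):
        if mult_references is None:
            raise ValueError(f"mult_references is required when mace_method={mace_method!r}")
        # Assumption: one list of references per candidate.
        if len(mult_references) != len(candidates):
            raise ValueError("mult_references and candidates must have the same length")

    # "audio" and "combined" compare candidates against the audio files.
    if mace_method in ("audio", "combined"):
        if audio_paths is None:
            raise ValueError(f"audio_paths is required when mace_method={mace_method!r}")
        # Assumption: one audio file per candidate.
        if len(audio_paths) != len(candidates):
            raise ValueError("audio_paths and candidates must have the same length")
```

Running such a check before calling mace would surface a missing argument before any pre-trained model is loaded.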

Returns:

A tuple of global and local scores, or a scalar tensor containing the main global score.
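
To illustrate how the three MACEScores fields (mace, clap_sim, fer) might relate to penalty and error_threshold, here is a minimal sketch in plain Python. It rests on assumptions that this page does not confirm: a caption is flagged when its error probability exceeds error_threshold, the flagged similarity is scaled by (1 - penalty), and each global score is the mean of the local scores. sketch_mace_scores is a hypothetical name, and lists of floats stand in for tensors.

```python
from statistics import mean


def sketch_mace_scores(
    clap_sims: list[float],
    error_probs: list[float],
    penalty: float = 0.3,
    error_threshold: float = 0.97,
) -> tuple[dict[str, float], dict[str, list[float]]]:
    """Hypothetical sketch of penalized scores; not the library implementation."""
    # Assumption: a fluency error is flagged when the probability exceeds the threshold.
    fers = [1.0 if prob > error_threshold else 0.0 for prob in error_probs]

    # Assumption: flagged captions keep only (1 - penalty) of their similarity.
    maces = [sim * (1.0 - penalty * fer) for sim, fer in zip(clap_sims, fers)]

    local_scores = {"mace": maces, "clap_sim": list(clap_sims), "fer": fers}

    # Assumption: each global score is the mean of the per-caption scores.
    global_scores = {name: mean(values) for name, values in local_scores.items()}
    return global_scores, local_scores


global_scores, local_scores = sketch_mace_scores(
    clap_sims=[0.8, 0.5],
    error_probs=[0.99, 0.10],
)
print(global_scores["mace"], local_scores["fer"])
```

Under these assumptions, only the first caption is penalized, so its mace value falls below its clap_sim value while the second caption is unchanged.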