aac_metrics.functional.clap_sim module

class CLAPScores

Bases: dict

clap_sim: Tensor

clap_sim(candidates: list[str], mult_references: list[list[str]] | None = None, audio_paths: list[str] | None = None, return_all_scores: bool = True, *, clap_method: Literal['audio', 'text'] = 'text', clap_model: str | CLAPWrapper = 'MS-CLAP-2023', device: str | device | None = 'cuda_if_available', batch_size: int | None = 32, reset_state: bool = True, seed: int | None = 42, verbose: int = 0) → Tensor | tuple[CLAPScores, CLAPScores]

Cosine similarity between the Contrastive Language-Audio Pretraining (CLAP) embeddings of the candidates and those of their text references or audio files.

The implementation is based on the msclap PyPI package.

Paper: https://arxiv.org/pdf/2411.00321
msclap package: https://pypi.org/project/msclap/
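
To illustrate the quantity being computed, here is a minimal pure-Python sketch of cosine similarity between two embedding vectors. The vectors are made-up toy values, not real CLAP embeddings, and the helper name `cosine_similarity` is hypothetical; it is not how the library itself computes the score.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (hypothetical helper)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy stand-ins for a candidate embedding and a reference embedding.
candidate_emb = [1.0, 0.0, 1.0]
reference_emb = [1.0, 1.0, 0.0]
score = cosine_similarity(candidate_emb, reference_emb)
```

The result lies in [-1, 1]: vectors pointing in the same direction score 1, orthogonal vectors score 0.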

Parameters:

candidates – The list of sentences to evaluate.
mult_references – The list of lists of reference sentences used as targets when clap_method is “text”. Defaults to None.
audio_paths – The audio file paths, required when clap_method is “audio”. Defaults to None.
return_all_scores – If True, returns a tuple containing the global and local scores. Otherwise, returns a scalar tensor containing the main global score. Defaults to True.
clap_method – The kind of target the candidates are compared against. Can be “text” (uses mult_references) or “audio” (uses audio_paths). Defaults to “text”.
clap_model – The CLAP model name or CLAPWrapper instance used to extract the embeddings for cosine similarity. Defaults to “MS-CLAP-2023”.
device – The PyTorch device used to run the CLAP models. If “cuda_if_available”, CUDA is used when available. Defaults to “cuda_if_available”.
batch_size – The batch size used by the CLAP models. Defaults to 32.
reset_state – If True, resets the state of the PyTorch global generator after the pre-trained models are initialized. Defaults to True.
seed – Optional seed to make CLAP-sim scores deterministic when using clap_method="audio" on large audio files. Defaults to 42.
verbose – The verbosity level. Defaults to 0.
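
The pairing between clap_method and its target argument can be checked before any model is loaded. The sketch below is an illustration only: `check_clap_inputs` and `score_text_captions` are hypothetical helpers, not part of aac_metrics, and the length check assumes one reference list or one audio path per candidate, which the parameter descriptions above do not state explicitly.

```python
from typing import Optional


def check_clap_inputs(
    candidates: list[str],
    mult_references: Optional[list[list[str]]] = None,
    audio_paths: Optional[list[str]] = None,
    clap_method: str = "text",
) -> None:
    """Hypothetical pre-check of the arguments passed to clap_sim."""
    if clap_method == "text":
        target, name = mult_references, "mult_references"
    elif clap_method == "audio":
        target, name = audio_paths, "audio_paths"
    else:
        raise ValueError(f"clap_method must be 'text' or 'audio', got {clap_method!r}")
    if target is None:
        raise ValueError(f"{name} is required when clap_method={clap_method!r}")
    # Assumption: targets are aligned one-to-one with the candidates.
    if len(target) != len(candidates):
        raise ValueError(f"{name} has {len(target)} items for {len(candidates)} candidates")


def score_text_captions(candidates: list[str], mult_references: list[list[str]]):
    """Validate the inputs, then call clap_sim with clap_method='text'."""
    # Imported lazily so the pre-check can be used without aac_metrics installed.
    from aac_metrics.functional.clap_sim import clap_sim

    check_clap_inputs(candidates, mult_references, clap_method="text")
    return clap_sim(candidates, mult_references, clap_method="text")
```

Calling `score_text_captions` requires aac_metrics and the msclap package to be installed; the pre-check on its own has no dependencies.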

Returns:

A tuple of global and local scores if return_all_scores is True, otherwise a scalar tensor with the main global score.
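
Because the return type depends on return_all_scores, calling code may want to handle both forms uniformly. The following is a minimal sketch: `main_clap_score` is a hypothetical helper, the "clap_sim" key is taken from the CLAPScores field listed above, and plain floats and lists stand in for tensors so the example runs without PyTorch.

```python
from typing import Any


def main_clap_score(result: Any) -> Any:
    """Return the main global score from either return form (hypothetical helper)."""
    if isinstance(result, tuple):
        # return_all_scores=True: (global scores, local scores), both dict-like.
        global_scores, _local_scores = result
        return global_scores["clap_sim"]
    # return_all_scores=False: the scalar score is returned directly.
    return result


# Stand-in values shaped like the two documented return forms.
all_scores = ({"clap_sim": 0.5}, {"clap_sim": [0.4, 0.6]})
only_main = 0.5
```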