# SPIDEr-max

SPIDEr-max is a metric based on SPIDEr that takes multiple candidates for the same audio into account. It computes the maximum of the SPIDEr scores over the candidates, to compensate for SPIDEr's high sensitivity to the frequency of the words generated by the model.
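The core idea can be sketched in a few lines of plain Python: given the SPIDEr score of each beam search candidate for one audio (here the scores from the "rain.wav" example below), SPIDEr-max simply keeps the best one. This is only an illustration of the aggregation step; computing the per-candidate SPIDEr scores themselves requires the full metric.

```python
# SPIDEr scores of 5 beam search candidates for one audio
# (values taken from the "rain.wav" example below).
candidate_scores = [0.562, 0.930, 0.594, 0.335, 0.594]

# SPIDEr-max keeps only the best candidate score per audio.
spider_max_score = max(candidate_scores)
print(spider_max_score)  # 0.93
```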

## Why?

The SPIDEr metric used in audio captioning is highly sensitive to the frequency of the words used. Here are two examples with the five candidates generated by the beam search algorithm, their corresponding SPIDEr scores, and the associated references:

Candidates (predictions) captions:

| Beam search candidates | SPIDEr |
|:---|:---:|
| heavy rain is falling on a roof | 0.562 |
| heavy rain is falling on a tin roof | 0.930 |
| a heavy rain is falling on a roof | 0.594 |
| a heavy rain is falling on the ground | 0.335 |
| a heavy rain is falling on the roof | 0.594 |

References (ground truth) captions:

| References |
|:---|
| heavy rain falls loudly onto a structure with a thin roof |
| heavy rainfall falling onto a thin structure with a thin roof |
| it is raining hard and the rain hits a tin roof |
| rain that is pouring down very hard outside |
| the hard rain is noisy as it hits a tin roof |

(Audio file named "rain.wav" from the Clotho development-testing subset)

Candidates (predictions) captions:

| Beam search candidates | SPIDEr |
|:---|:---:|
| a woman speaks and a sheep bleats | 0.190 |
| a woman speaks and a goat bleats | 1.259 |
| a man speaks and a sheep bleats | 0.344 |
| an adult male speaks and a sheep bleats | 0.231 |
| an adult male is speaking and a sheep bleats | 0.189 |

References (ground truth) captions:

| References |
|:---|
| a man speaking and laughing followed by a goat bleat |
| a man is speaking in high tone while a goat is bleating one time |
| a man speaks followed by a goat bleat |
| a person speaks and a goat bleats |
| a man is talking and snickering followed by a goat bleating |

(Audio file id "jid4t-FzUn0" from the AudioCaps testing subset)

Even with very similar candidates, the SPIDEr scores vary drastically. To address this issue, we proposed the SPIDEr-max metric, which takes the maximum SPIDEr value over several candidates for the same audio. SPIDEr-max demonstrates that SPIDEr can exceed state-of-the-art scores on AudioCaps and Clotho, and even human scores on AudioCaps.

## How?

Its usage is very similar to other captioning metrics, the main difference being that it takes a list of multiple candidates per audio as input.

```python
from aac_metrics.functional import spider_max
from aac_metrics.utils.tokenization import preprocess_mult_sents

mult_candidates: list[list[str]] = [["a man is speaking", "maybe someone speaking"]]
mult_references: list[list[str]] = [["a man speaks.", "someone speaks.", "a man is speaking while a bird is chirping in the background"]]

mult_candidates = preprocess_mult_sents(mult_candidates)
mult_references = preprocess_mult_sents(mult_references)

corpus_scores, sents_scores = spider_max(mult_candidates, mult_references)
print(corpus_scores)
# {"spider_max": tensor(0.1), ...}
print(sents_scores)
# {"spider_max": tensor([0.9, ...]), ...}
```
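To make the `list[list[str]]` input shape concrete, here is a pure-Python sketch of the aggregation over a batch: one inner list of candidates per audio, a per-audio maximum, and a corpus score averaging those maxima. The `score_candidate` function below is a hypothetical toy stand-in for SPIDEr, and averaging the maxima is a simplifying assumption for illustration, not the library's exact aggregation.

```python
def score_candidate(candidate: str, references: list[str]) -> float:
    """Hypothetical toy score: fraction of reference words found in the candidate."""
    cand_words = set(candidate.split())
    ref_words = {word for ref in references for word in ref.split()}
    return len(cand_words & ref_words) / len(ref_words)

# One inner list of candidates (and of references) per audio.
mult_candidates = [["a man is speaking", "maybe someone speaking"]]
mult_references = [["a man speaks", "someone speaks"]]

# Per-audio score: keep the best-scoring candidate (the "max" in SPIDEr-max).
sents_scores = [
    max(score_candidate(cand, refs) for cand in cands)
    for cands, refs in zip(mult_candidates, mult_references)
]

# Corpus-level score: average the per-audio maxima (simplifying assumption).
corpus_score = sum(sents_scores) / len(sents_scores)
print(sents_scores, corpus_score)  # [0.5] 0.5
```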