Score Aggregation for Multilingual benchmark (and in general) #752

KennethEnevoldsen · 2024-05-17T10:14:10Z

KennethEnevoldsen
May 17, 2024
Maintainer

This discussion is to figure out how scores should be aggregated and how benchmarks should be constructed. As an example, I am thinking of an overall multilingual benchmark, but the approach should be generalizable. For this, I propose the following:

1. Run the model on all tasks (ignoring superseded tasks).
2. Once we have the scores, find clusters of correlated datasets.
3. Select a representative task from each cluster.
4. Aggregate using either a naive mean across tasks (since highly correlated tasks are now removed) or a hierarchical mean (aggregate within language families and then across language families).

Selecting a representative task

Selecting a representative task is quite hard. However, I believe we can greatly simplify it using some pragmatic assumptions:

Prefer datasets which have permissive licenses (accessibility argument)
Prefer datasets with a scientific publication attached (quality argument)
Prefer human-annotated datasets over e.g. datasets with derived annotations (quality argument)
Prefer datasets with smaller sample size (accessibility argument)
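One way to apply these preferences is as a lexicographic sort key. This is only a sketch: `TaskMeta` and its fields are hypothetical names (not an existing mteb class), and treating the four preferences as a strict priority order is an assumption the list itself does not state:

```python
from dataclasses import dataclass


@dataclass
class TaskMeta:
    """Hypothetical metadata record for one task in a cluster."""
    name: str
    permissive_license: bool
    has_publication: bool
    human_annotated: bool
    n_samples: int


def pick_representative(cluster):
    """Return the task that best satisfies the preferences, in priority order."""
    return min(
        cluster,
        key=lambda t: (
            not t.permissive_license,  # False sorts first, so permissive wins
            not t.has_publication,
            not t.human_annotated,
            t.n_samples,               # smaller datasets preferred last
        ),
    )


cluster = [
    TaskMeta("big_derived", True, True, False, 50_000),
    TaskMeta("small_human", True, True, True, 2_000),
    TaskMeta("restricted", False, True, True, 500),
]
print(pick_representative(cluster).name)  # small_human
```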

This is not intended to be the final version of the aggregation, but I believe it is doable within our time frame.

x-tabdeveloping · 2024-05-17T11:06:14Z

x-tabdeveloping
May 17, 2024
Collaborator

One point I always bring up, and which I think is well suited here, is that we have to be cautious about how we weight languages.
In my view we should:

Weight some languages higher than others. Clearly a language with 1.35 billion native speakers (Chinese) should weigh a lot more than a language with 1.2 million speakers (Estonian) if we want to maximize the benchmark's overall utility for someone who "just needs a multilingual embedding model" and does not know which languages it will have to encode in their production pipeline.
However, weighting the languages linearly by the number of speakers would be out of proportion and unfair. Every language should account for at least one unit.

I was thinking we could take inspiration from voting systems, since that is also an area where one has to balance, e.g., the number of seats in parliament so that constituencies get adequate representation while the number of voters in each constituency is still accounted for.
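As a rough illustration of a sublinear weighting with a floor of one unit, the sketch below uses a square-root rule, similar in spirit to the Penrose square-root method from voting theory. The choice of the square root and the function name `language_weights` are assumptions for illustration; the speaker counts are the ones quoted above:

```python
import math


def language_weights(speakers, floor=1.0):
    """Square-root weights, scaled so the smallest language gets `floor` units."""
    smallest = min(speakers.values())
    return {lang: floor * math.sqrt(n / smallest) for lang, n in speakers.items()}


weights = language_weights({"Chinese": 1.35e9, "Estonian": 1.2e6})
# Chinese gets roughly 33.5 units and Estonian exactly 1,
# instead of the 1125:1 ratio a linear weighting would give.
print(weights)
```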

1 reply

KennethEnevoldsen May 17, 2024
Maintainer Author

mT5 uses a language weighting scheme, which we could probably borrow here and apply to the number of speakers.
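For reference, mT5 samples training languages with probability proportional to |L|^α, where |L| is the number of examples in a language and α = 0.3. The sketch below transfers that idea to speaker counts, which is an adaptation and not what mT5 itself does; `power_weights` and `weighted_score` are hypothetical helper names:

```python
def power_weights(sizes, alpha=0.3):
    """Weights proportional to size ** alpha, normalized to sum to 1."""
    raw = {lang: n ** alpha for lang, n in sizes.items()}
    total = sum(raw.values())
    return {lang: w / total for lang, w in raw.items()}


def weighted_score(lang_scores, weights):
    """Weighted mean of per-language scores; weights must sum to 1."""
    return sum(lang_scores[lang] * weights[lang] for lang in lang_scores)


speakers = {"Chinese": 1.35e9, "Estonian": 1.2e6}
lang_scores = {"Chinese": 0.8, "Estonian": 0.6}  # made-up scores

print(weighted_score(lang_scores, power_weights(speakers)))             # alpha = 0.3
print(weighted_score(lang_scores, power_weights(speakers, alpha=0.0)))  # even weighting
```

Setting `alpha=0.0` gives every language the same weight and `alpha=1.0` gives a linear weighting, so a single parameter spans the options under discussion. With α = 0.3, Chinese weighs roughly 8 times as much as Estonian in this example.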

However, I do believe that we might also want an even weighting. What we are interested in here is language variance: two similar languages add less variance, while a completely unique language would get a higher score, as it contains more information.

I believe Siva made a great point:

I have some ideas but did not get a chance to post. It would be useful to think about desiderata and then work backwards.

KennethEnevoldsen · 2024-05-17T12:02:43Z

KennethEnevoldsen
May 17, 2024
Maintainer Author

Vague suggestions for desiderata (in need of formalization)

score := aggregated performance

1. No single language (domain, task type, etc.) or task should overly influence the score.
2. Practitioners should be able to use the score to select models for their specific use case.
3. (Related to 2) The score should reflect a preference ranking for practitioners / reflect language and task usage.
4. Model developers should be able to use the score to guide model development.

0 replies

sivareddyg · 2024-05-17T12:17:26Z

sivareddyg
May 17, 2024
Collaborator

Thanks for starting this discussion. The desired criteria for metrics are as follows:

We should be able to report language-level scores/ranks, plus an overall score/rank across all languages.
My main thought is around how we should aggregate across tasks, even within a language. As we know, different tasks have different score ranges. For example, summarization scores are often in the 30s and show little variance across models, whereas retrieval scores are high and show high variance. Just taking an average across such disparate tasks seems incorrect.
If the goal is to rank the best model at the top, the ranking doesn't have to be based on the average score. I looked into various leaderboards, and HELM does things differently: instead of averaging, they rank. Think of it as follows: each task/dataset is like a person who votes. Since scores are comparable within a dataset, you can take a vote on which model wins for a given task/dataset. The more votes a model wins, the higher its rank. This seems like the right thing to do. See my Twitter discussion. This paper is a good source (we can choose a much simpler formulation than that).
But the problem with ranking is that it doesn't tell us how good a model is or how large the gap to the ceiling (the best performance one can get) is. So ideally we also want an aggregate metric. This aggregate should correlate well with the ranking. Aggregate scores are important if one is writing a paper and not solely aiming to be on top of the leaderboard.
Language weighting: I am not sure we should give a language a high weight just because it has a larger population. Instead, we could come up with classes based on, e.g., population size or language family, and report scores along these classes.
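To make the voting idea concrete, here is a small sketch of a mean win rate: for each task, a model's win rate is the fraction of other models it strictly beats, and these are averaged over tasks. This is in the spirit of the HELM-style ranking mentioned above, not a copy of HELM's code; the model names and scores are made up, ties count as no win, and every model is assumed to have a score on every task:

```python
def mean_win_rate(scores):
    """scores[task][model] -> score. Returns model -> mean win rate over tasks."""
    models = sorted({m for task in scores.values() for m in task})
    rates = {m: [] for m in models}
    for task_scores in scores.values():
        for m in models:
            rivals = [r for r in models if r != m]
            wins = sum(task_scores[m] > task_scores[r] for r in rivals)
            rates[m].append(wins / len(rivals))
    return {m: sum(v) / len(v) for m, v in rates.items()}


def plain_mean(scores):
    """Naive mean of raw scores per model, for comparison."""
    models = sorted({m for task in scores.values() for m in task})
    return {m: sum(t[m] for t in scores.values()) / len(scores) for m in models}


scores = {
    "summarization": {"A": 31, "B": 30, "C": 32},  # narrow range
    "retrieval": {"A": 60, "B": 85, "C": 70},      # wide range
}
print(plain_mean(scores))     # {'A': 45.5, 'B': 57.5, 'C': 51.0}
print(mean_win_rate(scores))  # {'A': 0.25, 'B': 0.5, 'C': 0.75}
```

In this toy example the plain mean puts B first because the wide-range retrieval task dominates, while the win rate puts C first because each task casts an equally weighted vote.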

@vaibhavad and I discussed some ideas for how we might be able to come up with an aggregate score for each language (based on a few concrete systems).

More to add later.

2 replies

KennethEnevoldsen May 17, 2024
Maintainer Author

Seems to be the main point here (very similar to 1. above). I would probably rephrase it as: the aggregated score should not be disproportionately influenced by a single task (e.g. one task that varies from 10-50 due to noise vs. another that varies from 74-78). I completely agree that the mean is a poor estimator for what we know the distribution to be (in fact, we don't even believe it to be one distribution).
Win rate/ranking has the problem that (as you note in the discussion) 10 models can be almost equal, but ranking will make them look drastically different. I believe this is especially problematic when we have other scores of interest, such as speed and embedding size.
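One common way to keep a wide-range task from dominating, while still preserving score gaps (unlike ranks), is to standardize scores per task before averaging. This is a swapped-in illustration and not an approach proposed in this thread; the two tasks reuse the 10-50 and 74-78 ranges from the example above, and the model scores are made up:

```python
from statistics import mean, pstdev


def zscore_mean(scores):
    """scores[task][model] -> score. Standardize per task, then average per model."""
    models = list(next(iter(scores.values())))
    per_model = {m: [] for m in models}
    for task_scores in scores.values():
        values = list(task_scores.values())
        mu, sigma = mean(values), pstdev(values)
        for m in models:
            # A task where all models tie carries no information, so use 0.
            per_model[m].append((task_scores[m] - mu) / sigma if sigma else 0.0)
    return {m: mean(v) for m, v in per_model.items()}


scores = {
    "noisy_task": {"A": 10, "B": 50, "C": 30},   # varies from 10-50
    "stable_task": {"A": 78, "B": 74, "C": 76},  # varies from 74-78
}
raw = {m: mean(t[m] for t in scores.values()) for m in ["A", "B", "C"]}
print(raw)                  # raw mean is driven almost entirely by noisy_task
print(zscore_mean(scores))  # both tasks now contribute equally
```

Here the raw mean ranks B far ahead of A, while after standardization A and B come out even, since each wins one task by the same relative margin. Note that z-scores depend on the set of models included, so this is also a relative aggregation.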

There generally seems to be a conflict between relative aggregation (such as rank) and agnostic aggregation (such as the mean). We could use an Elo rating like LMSYS.
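A minimal sketch of what an Elo-style aggregation could look like, treating every pair of models on every task as one match and applying the standard Elo update. The K-factor of 32, the starting rating of 1000, and the match order are assumptions; sequential Elo depends on the order in which matches are processed, so a real implementation would likely average over shuffles or fit the ratings jointly:

```python
import itertools


def elo_ratings(scores, k=32.0, base=1000.0):
    """scores[task][model] -> score. Returns model -> Elo rating."""
    models = sorted({m for task in scores.values() for m in task})
    rating = {m: base for m in models}
    for task in sorted(scores):  # fixed order for reproducibility
        for a, b in itertools.combinations(models, 2):
            sa, sb = scores[task][a], scores[task][b]
            outcome = 1.0 if sa > sb else 0.0 if sa < sb else 0.5
            # Expected result for model a under the logistic Elo model.
            expected = 1.0 / (1.0 + 10 ** ((rating[b] - rating[a]) / 400.0))
            rating[a] += k * (outcome - expected)
            rating[b] -= k * (outcome - expected)
    return rating


scores = {
    "task_1": {"A": 3.0, "B": 2.0, "C": 1.0},
    "task_2": {"A": 0.9, "B": 0.5, "C": 0.7},
}
print(elo_ratings(scores))  # A wins every match and ends with the highest rating
```

Like the win rate, this only uses which model wins each comparison, so it shares the property that near-equal models can still end up visibly separated.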

Will read the paper as well.

Just to note here that the newly suggested interface allows the user to select the aggregation metric (so we only need to select a default metric for the benchmarks).

KennethEnevoldsen May 21, 2024
Maintainer Author

@sivareddyg I had a look at the score aggregation proposed in the paper. I believe this approach is very reasonable. I am, however, unsure how it is influenced by e.g. two highly correlated tasks, but we can do some experimentation on this.
