Score Aggregation for Multilingual benchmark (and in general) #752
-
One point I always bring up, and I think it is well suited here, is that we have to be careful about how we weight languages.
I was thinking we could take inspiration from voting systems, since they face a similar problem: the number of seats in a parliament has to be balanced so that every constituency gets adequate representation, while the number of voters in each constituency is also accounted for.
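To make the voting analogy concrete, here is a minimal sketch of one idea from voting theory, square-root weighting (the Penrose method), applied to languages. It is only an illustration, not a proposal for the final scheme: the function names, the `exponent` parameter, and the speaker counts and scores are all made up for the example.

```python
import math


def language_weights(speakers, exponent=0.5):
    """Weight each language by speakers ** exponent, normalized to sum to 1.

    exponent=1.0 is purely proportional to the number of speakers,
    exponent=0.0 gives every language the same weight, and
    exponent=0.5 is the square-root compromise between the two.
    """
    raw = {lang: n ** exponent for lang, n in speakers.items()}
    total = sum(raw.values())
    return {lang: w / total for lang, w in raw.items()}


def aggregate(scores, weights):
    """Weighted mean of per-language scores."""
    return sum(weights[lang] * scores[lang] for lang in scores)


# Hypothetical numbers, for illustration only (not real statistics or results).
speakers = {"eng": 1500, "swa": 80, "dan": 6}
scores = {"eng": 0.80, "swa": 0.50, "dan": 0.70}

for exponent in (0.0, 0.5, 1.0):
    weights = language_weights(speakers, exponent)
    print(exponent, round(aggregate(scores, weights), 3))
```

With `exponent=0.0` small languages count as much as large ones, with `exponent=1.0` the largest language dominates, and the square root sits in between, which is the trade-off the parliament analogy is about.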
-
Vague suggestions for desiderata (in need of formalization), with score := aggregated performance
-
Thanks for starting this discussion. The desired criteria for metrics are as follows:
@vaibhavad and I discussed some ideas for how we might come up with an aggregate score for each language (based on a few concrete systems). More to add later.
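As a rough illustration of what a per-language aggregate could look like, the sketch below averages task scores within each language first and then averages across languages, so a language with many tasks does not dominate the overall number. This is an assumption about one possible scheme, not what was actually discussed; the function names, the `(task, language, score)` layout, and the numbers are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def per_language_scores(results):
    """Average the scores of all tasks within each language.

    `results` is an iterable of (task, language, score) tuples.
    """
    by_lang = defaultdict(list)
    for _task, lang, score in results:
        by_lang[lang].append(score)
    return {lang: mean(vals) for lang, vals in by_lang.items()}


def overall_score(results):
    """Unweighted mean over languages of the per-language averages."""
    return mean(list(per_language_scores(results).values()))


# Hypothetical results, for illustration only.
results = [
    ("task_a", "eng", 0.9),
    ("task_b", "eng", 0.7),
    ("task_a", "dan", 0.5),
]

print(per_language_scores(results))
print(overall_score(results))
```

Averaging all three rows directly would count English twice; the two-step average treats each language as one unit.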
-
This discussion is to figure out how scores should be aggregated and how benchmarks should be constructed. As the example, I am thinking of an overall multilingual benchmark, but the approach should be generalizable. For this, I propose the following:
Selecting a representative task
Selecting a representative task is quite hard; however, I believe we can greatly simplify it using some pragmatic assumptions:
This is not intended to be the final version of the aggregation, but I believe it is doable within our time frame.