BREAKING: v2.0.0 #1433

KennethEnevoldsen · 2024-11-11T09:24:01Z

This is a work-in-progress branch which will be the release of MTEB v2.0.0!

Features:

Added evaluation of image embeddings (MIEB, not merged in yet)
Improved handling of seeds (can be improved further by Avoid using global seeds #942)
Major updates to the leaderboard
Evaluators ambiguity: class/module #1124
New benchmark interface #1272
Remove encode_corpus and encode_queries and implement a "document" class #1284
Consolidate Retrieval/Reranking/Instruction Variants #1359
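
For context on the seed item above, here is a minimal sketch of the general idea behind avoiding global seeds: each component gets its own generator, so the process-wide random state is never touched. This is not MTEB's implementation; `make_task_rng` and `sample_indices` are hypothetical names used only for illustration, and only the Python standard library is assumed.

```python
import random


def make_task_rng(seed: int) -> random.Random:
    """Create an independent generator (hypothetical helper, not an MTEB API).

    The module-level `random` state is left untouched.
    """
    return random.Random(seed)


def sample_indices(n_items: int, n_samples: int, rng: random.Random) -> list[int]:
    """Draw distinct indices using the caller-supplied generator only."""
    return rng.sample(range(n_items), n_samples)


# Two generators built from the same seed give the same draw,
# regardless of what other code does with the global random state.
first = sample_indices(100, 5, make_task_rng(42))
second = sample_indices(100, 5, make_task_rng(42))
```

Passing the generator explicitly keeps results reproducible per task even when tasks run in a different order or alongside code that reseeds the global state.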

@x-tabdeveloping, @orionw, @isaac-chung, @Samoed, @gowitheflow-1998 etc., please open PRs against this branch when relevant (MIEB still lives in its own branch, but will try to merge it in here)

* update * merged retrieval; working * update tasks; working multilingual * everything working except instructions * working instructions; just need cleanup * add metadata for all but MindSmall * faster evaluation; mindsmall can compute in reasonable time * fix bad merge of docs * lint * fix test * qa * updated mindsmall * lint * fix debug * Update mteb/abstasks/dataloaders.py Co-authored-by: Roman Solomatin <[email protected]> * lint --------- Co-authored-by: Roman Solomatin <[email protected]>

…into v2.0.0

* fix: Count unique texts, data leaks in calculate metrics (#1438) * add more stat * add more stat * update statistics * fix: update task metadata to allow for null (#1448) * Update tasks table * 1.19.5 Automatically generated by python-semantic-release * base * sync with main --------- Co-authored-by: Kenneth Enevoldsen <[email protected]> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: github-actions <[email protected]>

* enable codecarbon by default * lint * update flag * add allow_multiple_runs param * make lint * add warning * lint * negate the flag --------- Co-authored-by: Isaac Chung <[email protected]>

* run tasks * remove test script * lint * remove cache * fix sickbrsts * fix tests * add datasets

* fix test * skip mock * add message to assert * fix test * lint * fix tests * upd tests * update descriptive stats files * add stat to speed

* multilingual loader * lint

* add citations * fix typo

* add code for comupting number of qrels * add stats fever hotpotqa msmarco topiocqa * miracl mrtidy * multilongdoc miracl reranking * add multi eurlex * fix tests for descriptive stats * fix tests --------- Co-authored-by: Roman Solomatin <[email protected]>

* add code for comupting number of qrels * BibleNLPBitextMining descriptive stats added * SwissJudgementClassification descriptive stats added * VoyageMMarcoReranking descriptive stats added * WebLINXCandidatesReranking descriptive stats added * MultiEURLEXMultilabelClassification descriptive stats added * MIRACLReranking descriptive stats added * MindSmallReranking descriptive stats added * updated test_TaskMetadata * fix test --------- Co-authored-by: Imene Kerboua <[email protected]> Co-authored-by: Imene Kerboua <[email protected]> Co-authored-by: Roman Solomatin <[email protected]>

* fix bright loader * lint * fix comment

* fix: Count unique texts, data leaks in calculate metrics (#1438) * add more stat * add more stat * update statistics * fix: update task metadata to allow for null (#1448) * Update tasks table * 1.19.5 Automatically generated by python-semantic-release * Fix: Made data parsing in the leaderboard figure more robust (#1450) Bugfixes with data parsing in main figure * Fixed task loading (#1451) * Fixed task result loading from disk * Fixed task result loading from disk * fix: publish (#1452) * 1.19.6 Automatically generated by python-semantic-release * fix: Fix load external results with `None` mteb_version (#1453) * fix * lint * 1.19.7 Automatically generated by python-semantic-release * WIP: Polishing up leaderboard UI (#1461) * fix: Removed column wrapping on the table, so that it remains readable * Added disclaimer to figure * fix: Added links to task info table, switched out license with metric * fix: loading pre 1.11.0 (#1460) * small fix * fix: fix * 1.19.8 Automatically generated by python-semantic-release * fix: swap touche2020 to maintain compatibility (#1469) swap touche2020 for parity * 1.19.9 Automatically generated by python-semantic-release * docs: Add sum per language for task counts (#1468) * add sum per lang * add sort by sum option * make lint * fix: pinned datasets to <3.0.0 (#1470) * 1.19.10 Automatically generated by python-semantic-release * feat: add CUREv1 retrieval dataset (#1459) * feat: add CUREv1 dataset --------- Co-authored-by: nadshe <[email protected]> Co-authored-by: olivierr42 <[email protected]> Co-authored-by: Daniel Buades Marcos <[email protected]> * feat: add missing domains to medical tasks * feat: modify benchmark tasks * chore: benchmark naming --------- Co-authored-by: nadshe <[email protected]> Co-authored-by: olivierr42 <[email protected]> * Update tasks table * 1.20.0 Automatically generated by python-semantic-release * fix: check if `model` attr of model exists (#1499) * check if model attr of model exists * lint * Fix 
retrieval evaluator * 1.20.1 Automatically generated by python-semantic-release * add cure statistics --------- Co-authored-by: Kenneth Enevoldsen <[email protected]> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: github-actions <[email protected]> Co-authored-by: Márton Kardos <[email protected]> Co-authored-by: Isaac Chung <[email protected]> Co-authored-by: Napuh <[email protected]> Co-authored-by: Daniel Buades Marcos <[email protected]> Co-authored-by: nadshe <[email protected]> Co-authored-by: olivierr42 <[email protected]>

* fix bright loader * lint * fix comment * fix stats * fix retrieval stats * update stats * add rest of the stat * move bach code * fix docs * lint

* fix FilipinoHateSpeechClassification * update tests

* init * find all wierd repos * move to mteb WikipediaRetrievalMultilingual * add base upload utils * retrieval, classification, bitextmining * test retrieval * test retrieval * test task uploaded * update tasks * working version * remove comments * lint * move upload * fix tests * fix test * move upload to task * Update mteb/tasks/Retrieval/multilingual/WikipediaRetrievalMultilingual.py Co-authored-by: Kenneth Enevoldsen <[email protected]> * fix: hatespeech filipino (#1522) * fix FilipinoHateSpeechClassification * update tests * lint --------- Co-authored-by: Kenneth Enevoldsen <[email protected]>

* fix: Count unique texts, data leaks in calculate metrics (#1438) * add more stat * add more stat * update statistics * fix: update task metadata to allow for null (#1448) * Update tasks table * 1.19.5 Automatically generated by python-semantic-release * Fix: Made data parsing in the leaderboard figure more robust (#1450) Bugfixes with data parsing in main figure * Fixed task loading (#1451) * Fixed task result loading from disk * Fixed task result loading from disk * fix: publish (#1452) * 1.19.6 Automatically generated by python-semantic-release * fix: Fix load external results with `None` mteb_version (#1453) * fix * lint * 1.19.7 Automatically generated by python-semantic-release * WIP: Polishing up leaderboard UI (#1461) * fix: Removed column wrapping on the table, so that it remains readable * Added disclaimer to figure * fix: Added links to task info table, switched out license with metric * fix: loading pre 1.11.0 (#1460) * small fix * fix: fix * 1.19.8 Automatically generated by python-semantic-release * fix: swap touche2020 to maintain compatibility (#1469) swap touche2020 for parity * 1.19.9 Automatically generated by python-semantic-release * docs: Add sum per language for task counts (#1468) * add sum per lang * add sort by sum option * make lint * fix: pinned datasets to <3.0.0 (#1470) * 1.19.10 Automatically generated by python-semantic-release * feat: add CUREv1 retrieval dataset (#1459) * feat: add CUREv1 dataset --------- Co-authored-by: nadshe <[email protected]> Co-authored-by: olivierr42 <[email protected]> Co-authored-by: Daniel Buades Marcos <[email protected]> * feat: add missing domains to medical tasks * feat: modify benchmark tasks * chore: benchmark naming --------- Co-authored-by: nadshe <[email protected]> Co-authored-by: olivierr42 <[email protected]> * Update tasks table * 1.20.0 Automatically generated by python-semantic-release * fix: check if `model` attr of model exists (#1499) * check if model attr of model exists * lint * Fix 
retrieval evaluator * 1.20.1 Automatically generated by python-semantic-release * fix: Leaderboard demo data loading (#1507) * Made get_scores error tolerant * Added join_revisions, made get_scores failsafe * Fetching metadata fixed fr HF models * Added failsafe metadata fetching to leaderboard code * Added revision joining to leaderboard app * fix * Only show models that have metadata, when filter_models is called * Ran linting * 1.20.2 Automatically generated by python-semantic-release * fix: leaderboard only shows models that have ModelMeta (#1508) Filtering for models that have metadata * 1.20.3 Automatically generated by python-semantic-release * fix: align readme with current mteb (#1493) * align readme with current mteb * align with mieb branch * fix test * 1.20.4 Automatically generated by python-semantic-release * docs: Add lang family mapping and map to task table (#1486) * add lang family mapping and map to task table * make lint * add back some unclassified lang codes * Update tasks table * fix: Ensure that models match the names on embedding-benchmarks/results (#1519) * 1.20.5 Automatically generated by python-semantic-release * fix: Adding missing metadata on models and mathcing names up with the results repo (#1528) * Added Voyage 3 models * Added correct metadata to Cohere models and matched names with the results repo * 1.20.6 Automatically generated by python-semantic-release * feat: Evaluate missing splits (#1525) * fix: evaluate missing splits (#1268) * implement partial evaluation for missing splits * lint * requested changes done from scratch * test for missing split evaluation added * uncomment test * lint * avoid circular import * use TaskResult * skip tests for now --------- Co-authored-by: Isaac Chung <[email protected]> * got test_all_splits_evaluated passing * tests passing * address review comments * make lint * handle None cases for kg_co2_emissions * use new results info --------- Co-authored-by: Thivyanth <[email protected]> * 1.21.0 
Automatically generated by python-semantic-release * fix: Correct typos superseeded -> superseded (#1532) fix typo -> superseded * 1.21.1 Automatically generated by python-semantic-release * fix: Task load data error for SICK-BR-STS and XStance (#1534) * fix task load data for two tasks * correct dataset keys * 1.21.2 Automatically generated by python-semantic-release * fix: Proprietary models now get correctly shown in leaderboard (#1530) * Fixed showing proprietary models in leaderboard * Added links to all OpenAI models * Fixed table formatting issues * Bumped Gradio version * 1.21.3 Automatically generated by python-semantic-release * docs: Add Model Meta parameters and metadata (#1536) * add multi_qa_MiniLM_L6_cos_v1 model meta * add all_mpnet_base_v2 * add parameters to model meta * make lint * add extra params to meta * fix: add more model meta (jina, e5) (#1537) * add e5 model meta * address review comments * 1.21.4 Automatically generated by python-semantic-release * Add cohere models (#1538) * fix: bug cohere names * format * fix: add nomic models (#1543) #1515 * fix: Added all-minilm-l12-v2 (#1542) #1515 * fix: Added arctic models (#1541) #1515 * fix: add sentence trimming to OpenAIWrapper (#1526) * fix: add sentence trimming to OpenAIWrapper * fix: import tiktoken library inside encode function * fix: check tokenizer library installed and update ModelMeta to pass tokenizer_name * fix: pass tokenizer_name, max_tokens to loader * fix: make tokenizer_name None for default * fix: delete changes for ModelMeta * fix: fix revision to 2 for OpenAI models * fix: add docstring for OpenAIWrapper * fix: lint * feat: add openai optional dependency set * fix: add sleep for too many requests * fix: add lint * fix: delete evaluate file * 1.21.5 Automatically generated by python-semantic-release * fix: Fixed metadata errors (#1547) * 1.21.6 Automatically generated by python-semantic-release * fix: remove curev1 from multlingual (#1552) Seems like it was added here: 
1cc6c9e * 1.21.7 Automatically generated by python-semantic-release * fix: Add Model2vec (#1546) * Added Model2Vec wrapper * Added Model2vec models * Added model2vec models to registry * Added model2vec as a dependency * Ran linting * Update mteb/models/model2vec_models.py Co-authored-by: Kenneth Enevoldsen <[email protected]> * Update mteb/models/model2vec_models.py Co-authored-by: Kenneth Enevoldsen <[email protected]> * Added adapted_from and superseeded_by to model2vec models. * Added missing import * Moved pyproject.toml to optional dependencies * Fixed typos * Added import error and changed model to model_name * Added Numpy to frameworks * Added Numpy to frameworks * Corrected false info on model2vec models * Replaced np.inf with maxint * Update mteb/models/model2vec_models.py Co-authored-by: Isaac Chung <[email protected]> * Added option to have infinite max tokens, added it to Model2vec --------- Co-authored-by: Kenneth Enevoldsen <[email protected]> Co-authored-by: Isaac Chung <[email protected]> * Made result loading more permissive, changed eval splits for HotPotQA and DBPedia (#1554) * Removed train and dev from eval splits on HotpotQA * Removed dev from eval splits on DBPedia * Made task_results validation more permissive * Readded exception in get_score * Ran linting * 1.21.8 Automatically generated by python-semantic-release * docs: Correction of SICK-R metadata (#1558) * Correction of SICK-R metadata * Correction of SICK-R metadata --------- Co-authored-by: rposwiata <[email protected]> * feat(google_models): fix issues and add support for `text-embedding-005` and `text-multilingual-embedding-002` (#1562) * fix: google_models batching and prompt * feat: add text-embedding-005 and text-multilingual-embedding-002 * chore: `make lint` errors * fix: address PR comments * 1.22.0 Automatically generated by python-semantic-release * fix(bm25s): search implementation (#1566) fix: bm25s implementation * 1.22.1 Automatically generated by 
python-semantic-release * docs: Fix dependency library name for bm25s (#1568) * fix: bm25s implementation * correct library name --------- Co-authored-by: Daniel Buades Marcos <[email protected]> * fix: Add training dataset to model meta (#1561) * fix: Add training dataset to model meta Adresses #1556 * Added docs * format * feat: (cohere_models) cohere_task_type issue, batch requests and tqdm for visualization (#1564) * feat: batch requests to cohere models * fix: use correct task_type * feat: use tqdm with openai * fix: explicitely set `show_progress_bar` to False * fix(publichealth-qa): ignore rows with `None` values in `question` or `answer` (#1565) * 1.23.0 Automatically generated by python-semantic-release * fix wongnai * update inits * fix tests * lint * update imports * fix tests * lint --------- Co-authored-by: Kenneth Enevoldsen <[email protected]> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: github-actions <[email protected]> Co-authored-by: Márton Kardos <[email protected]> Co-authored-by: Isaac Chung <[email protected]> Co-authored-by: Napuh <[email protected]> Co-authored-by: Daniel Buades Marcos <[email protected]> Co-authored-by: nadshe <[email protected]> Co-authored-by: olivierr42 <[email protected]> Co-authored-by: Thivyanth <[email protected]> Co-authored-by: Youngjoon Jang <[email protected]> Co-authored-by: Rafał Poświata <[email protected]>

# Conflicts:
# docs/tasks.md
# mteb/abstasks/AbsTaskClassification.py
# mteb/abstasks/AbsTaskClusteringFast.py
# mteb/abstasks/AbsTaskInstructionRetrieval.py
# mteb/abstasks/AbsTaskMultilabelClassification.py
# mteb/abstasks/AbsTaskPairClassification.py
# mteb/abstasks/AbsTaskReranking.py
# mteb/abstasks/AbsTaskRetrieval.py
# mteb/abstasks/AbsTaskSTS.py
# mteb/descriptive_stats/InstructionRetrieval/Core17InstructionRetrieval.json
# mteb/descriptive_stats/MultilabelClassification/MultiEURLEXMultilabelClassification.json
# mteb/descriptive_stats/Reranking/AskUbuntuDupQuestions.json
# mteb/descriptive_stats/Reranking/ESCIReranking.json
# mteb/descriptive_stats/Reranking/WikipediaRerankingMultilingual.json
# mteb/descriptive_stats/Retrieval/AppsRetrieval.json
# mteb/descriptive_stats/Retrieval/BelebeleRetrieval.json
# mteb/descriptive_stats/Retrieval/COIRCodeSearchNetRetrieval.json
# mteb/descriptive_stats/Retrieval/CodeEditSearchRetrieval.json
# mteb/descriptive_stats/Retrieval/CodeFeedbackMT.json
# mteb/descriptive_stats/Retrieval/CodeFeedbackST.json
# mteb/descriptive_stats/Retrieval/CodeSearchNetCCRetrieval.json
# mteb/descriptive_stats/Retrieval/CodeSearchNetRetrieval.json
# mteb/descriptive_stats/Retrieval/CodeTransOceanContest.json
# mteb/descriptive_stats/Retrieval/CodeTransOceanDL.json
# mteb/descriptive_stats/Retrieval/CosQA.json
# mteb/descriptive_stats/Retrieval/JaqketRetrieval.json
# mteb/descriptive_stats/Retrieval/NFCorpus.json
# mteb/descriptive_stats/Retrieval/StackOverflowQA.json
# mteb/descriptive_stats/Retrieval/SyntheticText2SQL.json
# mteb/descriptive_stats/Retrieval/Touche2020.json
# mteb/descriptive_stats/Retrieval/Touche2020Retrieval.v3.json
# mteb/descriptive_stats/Retrieval/mFollowIRCrossLingualInstructionRetrieval.json
# mteb/descriptive_stats/Retrieval/mFollowIRInstructionRetrieval.json
# mteb/evaluation/MTEB.py
# mteb/evaluation/evaluators/RetrievalEvaluator.py
# mteb/leaderboard/app.py
# mteb/leaderboard/figures.py
# mteb/leaderboard/table.py
# mteb/model_meta.py
# mteb/models/arctic_models.py
# mteb/models/e5_models.py
# mteb/models/nomic_models.py
# mteb/models/overview.py
# mteb/models/sentence_transformers_models.py
# mteb/tasks/Reranking/zho/CMTEBReranking.py
# mteb/tasks/Retrieval/__init__.py
# mteb/tasks/STS/por/SickBrSTS.py
# pyproject.toml
# tests/test_benchmark/mock_tasks.py

fix: Ensure seed is based on RNG State (#1193)

e2520df

KennethEnevoldsen added this to the v2.0.0 milestone Nov 11, 2024

isaac-chung marked this pull request as draft November 11, 2024 09:27

KennethEnevoldsen mentioned this pull request Nov 13, 2024

Consolidate Retrieval/Reranking/Instruction Variants #1359

Merged

1 task

orionw and others added 5 commits November 13, 2024 11:30

fix: Unsure TaskResults can handle runtime and version being unspecified

2a8a370

Merge branch 'v2.0.0' of https://github.com/embeddings-benchmark/mteb …

dea2b77

…into v2.0.0

fix: remove NaN handling for retrieval

23d6cb2

Merge branch 'main' into v2.0.0

8868cd4

Samoed mentioned this pull request Nov 14, 2024

fix: Count unique texts, data leaks in calculate metrics #1438

Merged

2 tasks

Samoed and others added 16 commits November 14, 2024 21:26

feat: enable codecarbon by default (#1428)

70a3ff2

* enable codecarbon by default * lint * update flag * add allow_multiple_runs param * make lint * add warning * lint * negate the flag --------- Co-authored-by: Isaac Chung <[email protected]>

Add decriptive stat almost to all datasets (#1466)

0e9b6fd

* run tasks * remove test script * lint * remove cache * fix sickbrsts * fix tests * add datasets

fix: Fix test for empty descriptive tasks (#1413)

0a5bedb

* fix test * skip mock * add message to assert * fix test * lint * fix tests * upd tests * update descriptive stats files * add stat to speed

fix: pin datasets version <3.0.0 (#1471)

6da2a1a

feat: Multilingual retrieval loader (#1473)

a27de33

* multilingual loader * lint

fix: add citations to ModelMeta (#1477)

0df0210

* add citations * fix typo

fix: Fix BrightRetrieval calculate stats (#1484)

99247b2

* fix bright loader * lint * fix comment

Fix: retrieval stats (#1496)

6383950

* fix bright loader * lint * fix comment * fix stats * fix retrieval stats * update stats * add rest of the stat * move bach code * fix docs * lint

fix: hatespeech filipino (#1522)

d54fb75

* fix FilipinoHateSpeechClassification * update tests

KennethEnevoldsen commented Nov 11, 2024 •

edited by orionw
