docs: Add MTEB(code) dataset #1237
Looks good. For context, @john-b-yang added new tables in the paper, hence the points!
Can you make sure the tests pass (the linting, etc.)?
No problem, just noticed that the
This looks great, @john-b-yang. Will you also add a short description to H1 about the creation? It does not need to be very long.
Ah yes, no problem @KennethEnevoldsen! Will do this today!
* fix: OpenAI BadRequestError by limiting input dimensions to 2048 elem… (#1203) * fix: OpenAI BadRequestError by limiting input dimensions to 2048 elements (#1201) Fix OpenAI BadRequestError by limiting input dimensions to 2048 elements - Ensure the 'sentences' list passed to OpenAI API does not exceed 2048 elements - Reference: OpenAI's Embedding API documentation on input limits Co-authored-by: Ali Shiraee <[email protected]> * fix ruff formatting * Added minor test fixes to ensure reproducility across systems * Ensure that tmp.json is not created within repo when running tests * format * fixes path issues * Rerun CI --------- Co-authored-by: HSILA <[email protected]> Co-authored-by: Ali Shiraee <[email protected]> * fix: Ensure STS pearson and spearman does not use the p-value only the correlation (#1207) Fixes #1206 * 1.14.16 Automatically generated by python-semantic-release * fix: Normalize licenses including casing, uses of "-" etc. * fix: Normalize licenses including casing, uses of "-" etc. (#1210) * fix: Normalize licenses including casing, uses of "-" etc. * fix tests * 1.14.17 Automatically generated by python-semantic-release * fix: Normalize benchmarks no only include task objects and added getter for benchmarks (#1208) * Normalize benchmarks to only include tasks - Force benchmarks to only include tasks. This fixes a few bugs where benchmarks can reference a task which is not implemented - implements `mteb.get_benchmark`, which makes it easier to fetch benchmarks - Added tests + updated docs A few outstanding issues: I would like `mteb.MTEB(benchmark)` to always reproduce the benchmark. Currently this is not possible as MTEB(eng) required the split to be specified. A solution it to allow "eval_splits) to be specified when initializing a task and then pass it on to the `load_data()`. 
This way we can write the following: `mteb.get_tasks(tasks=[...], eval_splits=["test"], ...)` I would also love the aggregation to be a part of the benchmark (such that it is clear how it should be aggregated). This is especially relevant for MTEB(eng) as it average the CQAD datasets before creating the global average. This way we can also create a result object for the benchmark itself. A complimenting solution for this is to allow nested benchmarks. * fix error in tests * format * Added corrections based on review * added example and formatted * 1.14.18 Automatically generated by python-semantic-release * docs: Fix broken links in docs (#1212) * Added fixes for broken links in adding_a_dataset and adding_a_model docs. * Updated link name * Mismatch of the category of AmazonPolarityClassification (#1220) Fixes #1219 * Update tasks table * fix: Ensure that results are returned even when hitting cache (#1215) Fixes #1122 * 1.14.19 Automatically generated by python-semantic-release * fix: Allow benchmark to specify eval_splits (#1217) * fix: Allow benchmark to specify eval_splits This PR allow for benchmarks to specify specific eval. splits. This allow us to fully specify a benchmark within the benchmark object. 
To do this it add the following: - added eval_splits to the Abstask object, which default to metadata.eval_splits - use the task.eval_splits unless overwritten in mteb.MTEB.run - added eval_splits arg to mteb.get_tasks, which filter the tasks based on splits - updated documentation - renamed the "Advanced Usage" to "Usage Documentation" to make it more accicible - added tests where relevant * Added correction based on feedback * 1.14.20 Automatically generated by python-semantic-release * Update points table * Update points table * docs: clarify adding a model (#1222) * fix: Add RepLLaMA style models (#1223) * init commit * working and reproducing * lint * update hashes * warning * add pyproject * Update points table * 1.14.21 Automatically generated by python-semantic-release * docs: Update points (#1228) * Fix case * Fix casing * Fix case * Fix case * Create 971.jsonl * Update contrib * Add contributors * Update points table * docs: Add MTEB(code) dataset (#1237) * docs: Add MTEB(code) dataset * Fix linting * Update points table * Update of my affiliation (#1242) Update points.md * Add contributor (#1243) * fix: @mrshu's name in `points.md` (#1246) * Use the diacritic character to be inline with Slovak spelling. 
Signed-off-by: mr.Shu <[email protected]> * docs: Create benchmarks overview table (#1245) * fix get_benchmarks method * add create benchmark script * make lint * 1.14.22 Automatically generated by python-semantic-release * docs: Update affiliation (#1247) Update points.md * Added author-information * Add final author list * Update points table * docs: Added coordination point for Jimmy Lee (#1253) docs: Added coordination point for Jimmy lee for his work on the coordination of Crystina and Nandan * Update points table * fix: Add multilingual Benchmark (#1252) * fix: Add multilingual bench * Update mteb/benchmarks/benchmarks.py Co-authored-by: Niklas Muennighoff <[email protected]> * format --------- Co-authored-by: Niklas Muennighoff <[email protected]> * 1.14.23 Automatically generated by python-semantic-release * docs: Small point changes & more contributors (#1254) * Update points.md * Fix format * Fix attribution * Update points table * fix: Downsample large retrieval datasets (#1236) * most tasks * lint * fix other issues * refactor * lint and docs * add polish * keep case sensitive mteb paths * add potential points * fix points * fix test about metadata * update tasks and stats * lint * Update points table * Update tasks table * 1.14.24 Automatically generated by python-semantic-release * fix: Get meta from CrossEncoder (#1255) * remove indent after return * handle cross encoders for model meta * make lint * update filename since we now have model name * 1.14.25 Automatically generated by python-semantic-release * fix: Add listing all available benchmarks CLI option (#1256) * add benchmarks.md in README * add cli option * add benchmark cli test case * correct typo * 1.14.26 Automatically generated by python-semantic-release * docs: Update affiliation (#1248) * Update points.md * Update points.md --------- Co-authored-by: Kenneth Enevoldsen <[email protected]> * docs: Update mteb(eng) calculation (#1258) * Update mteb(eng) calculation * Fixed citations * 
Update MTEB(eng) + MTEB(multilingual) * feat: leverage SentenceTransformers' query/passage specific prompts (#1221) * feat: leverage SentenceTransformer models' query/passage specific prompts * refactor: remove E5Wrapper fix: wrong e5 revisions * fix: default prompt_type to None * fix: e4ce987 revision no longer exists for multilingual-e5-small on the Hub * fix: keep `prompt_name` in kwargs when model doesn't have a `prompts` attr * feat: use Enum for `prompt_type` * docs: specify how to use prompts with Sentence Transformers * feat: readd arctic models due to metadata * 1.15.0 Automatically generated by python-semantic-release * fix: Add Touche2020v3 and JMTEB (#1262) * add datasets * fix metrics * add Touche2020v3 * fix metadata * Apply suggestions from code review Co-authored-by: Kenneth Enevoldsen <[email protected]> * upd name and supress * add benchmark class --------- Co-authored-by: Kenneth Enevoldsen <[email protected]> * Update tasks table * 1.15.1 Automatically generated by python-semantic-release * fix: Select benchmarks CLI option (#1261) * add test case for a list of Benchmarks * add selecting benchmarks CLI option * typos * use a separate attribute for benchmarks * try fixing tests * should accept string as well * revert filename change * use Benchmark and avoid circular import * fix: derive `results_directory` path from `results_repo` name (#1275) fix: don't hardcode repo name when downloading results * 1.15.2 Automatically generated by python-semantic-release * fix: sorting benchmark tasks by MTEB, then alphabetical (#1271) * sorted * fixed formatting * efficiency changes * fix test * make lint --------- Co-authored-by: Isaac Chung <[email protected]> * 1.15.3 Automatically generated by python-semantic-release * ci: Removed 3.8 dependency (#1281) Changes include: - remove 3.8 from tests (added 3.11 and 3.12) - changed other CI to 3.9 - updated lint rules to use 3.8 * Update points table * fix: Allow Numpy >=2.0 (#1264) Allow Numpy >=2.0 * 1.15.4 
Automatically generated by python-semantic-release * docs: points for paper writing (#1286) * Create 1004.jsonl * Create 1006.jsonl * Update docs/mmteb/points/1004.jsonl * Update docs/mmteb/points/1006.jsonl --------- Co-authored-by: Kenneth Enevoldsen <[email protected]> * Update points table * Update points table * Update points table * docs: Fix a link in the README (#1289) * Fix a link in the README And fix some typos. * Update README.md * Update points table * fix: Update benchmarks (#1288) * make benchmark var name uppercase * update touche to v3 * add MIRACLRetrievalHardNegatives to multilingual * add mteb(indic) * add eu benchmark * 1.15.5 Automatically generated by python-semantic-release * fix: Allow numpy<2.0.0 (#1291) * 1.15.6 Automatically generated by python-semantic-release * fix: Add metadata dict to QBQTC in C-MTEB (#1292) * fix QBQTC in C-MTEB * make lint --------- Co-authored-by: Isaac Chung <[email protected]> * 1.15.7 Automatically generated by python-semantic-release * fix: Remove non-existent eval split of CMNLI (#1294) fix eval_splits of CMNLI * 1.15.8 Automatically generated by python-semantic-release * Leaderboard (#1235) * Add leaderboard dev * Renamed MTEBResults to TaskResult * Moved model and model meta loading utilities into overview.py * Added get_model_metas to retrieve filtered metadata for models * Restructured results object and made it into a class instead of a dict * Added utilities for filtering models on BenchmarkResults objects * Added to_table utility function to BenchmarkResults * Added serialization utilities to BenchmarkResults * Attempted fixing tests * Added get_model_metas to __init__ * Added get_benchmarks to __init__ and made it return all benchmarks by default * Added get_benchmarks to __init__ * Made tasks hashable * Added task filtering based on task objects on BenchmarkResults * Added BenchmarkResults to __init__ * Added additional arguments to get_scores on two classes * Made get_scores smarter on 
BenchmarkResult * Added basic multilingual benchmark * Modified benchmark to be able to easily access results * Added useful properties and filtering functions to BenchmarkResults * Added minimal functioning example * Added smarter table, task-list updating and tried fixing dropdown scrolling * Made restrict_results into a private function Co-authored-by: Kenneth Enevoldsen <[email protected]> * Removed old leaderboard scripts * Hardcoded max and min model size * Removed redundant utils file * Ran linting * added leaderboard dependencies as optional * Fixed union type error on Python 3.9 * Removed references to Dict in task aggregation * Fixed name errors in _restrict_task_results * Fixed _restrict_task_results * Made hf_subsets={'default'} when the task is monolingual in _restric_task_results * Task dropdown now gets filtered based on the other criteria * Ran linting again * Introduced hotfix for reranking test * Added BenchmarkResults to __all__ in __init__ * Fixed validate_and_filter_scores method, and replaced _restric_task_results with it --------- Co-authored-by: Kenneth Enevoldsen <[email protected]> * feat: Use prompts instead of encode_corpus and encode_queries (#1278) * add prompt per task type * fix prompt * upd test * lint * fix test * fix DeprecatedSummarizationEvaluator * fix prompts * add test * lint * logger info * use task type only in model_encode * lint * update interface * add prompt types to docs * fix test * mock tasks * mock task registry * remove last task_type * fix tests * lint * fix test * fix * use wrapper and new prompts * fix tests * lint * fix test * remove conftest * validate task to prompt_name * override model prompts * task to prompt name optional * fix tests * fix models * remove task_to_prompt_name * remove from mteb __init__ * update docs * load existing model prompts if model_prompts is None * fix * lint * change wrapper loader * add wrapper class * lint * add wrapper file * update logging * upd logging * refactor reranking * 
lint * remove prints * 1.16.0 Automatically generated by python-semantic-release * fix: Add Retrieval SK Quad dataset for Slovak search evaluation (#1276) * Add Retrieval SK Quad dataset for Slovak search evaluation This commit introduces the Retrieval SK Quad dataset, designed to assess Slovak search performance. The dataset is derived from SK-QuAD and includes questions with their best answers categorized post-annotation. This addition provides a significant resource for advancing Slovak language search evaluation and supporting further research and development. * Add Retrieval SK Quad dataset for Slovak search evaluation 2 Added the requested changes on the SKQuadRetrieval.py file * add task to init * add missing task metadata --------- Co-authored-by: Isaac Chung <[email protected]> * Update tasks table * 1.16.1 Automatically generated by python-semantic-release * fix: Add Slovak Hate Speech and Offensive Language Dataset (#1274) * Add Slovak Hate Speech and Offensive Language Dataset This commit introduces the Slovak Hate Speech and Offensive Language Database to MTEB. The dataset includes posts from a social network, annotated by humans for hate speech and offensive content. Additionally, the corresponding task has been added to the tasks.md table to reflect this update. * Add Slovak Hate Speech and Offensive Language Dataset - Updated __init__.py to include the new SlovakHateSpeechClassification task. - Modified SlovakHateSpeechClassification.py as per review suggestions to enhance functionality and readability. * Did requested changes: - Updated __init__.py to include the new SlovakHateSpeechClassification task. - Modified SlovakHateSpeechClassification.py as per review suggestions to enhance functionality and readability. 
* resolve linting issues by running `make lint` * Update tasks table * WIP: Leaderboard UI improvements (#1312) * Fixed typos in task_results * Fixed typos in task_results * Added Tailwind, reorganized layout and fixed scrolling * Ran linting * 1.16.2 Automatically generated by python-semantic-release * fix: remove duplicate multilingual * 1.16.3 Automatically generated by python-semantic-release * fix: Re-upload dataset to hub to avoid using script upload (#1322) * fix dataset upload * add linting * Update tasks table * 1.16.4 Automatically generated by python-semantic-release * fix: Add implementations of common reranker models (#1309) * init * revert * revert * add metadata * lint * add reqs * change to float16 * benchmark lint fix * 1.16.5 Automatically generated by python-semantic-release * Add multilingual mFollowIR dataset (#1308) * add mFollowIR * paper name * edit warning->info * convert to parquet * lint * Update tasks table * Cache the embeddings when requested (#1307) * add caching * update test to use close * change from json to pkl * fix for window * cleanup on Windows again * infer dimension * move cachewrapper * add wrapper * fix * updates * fix tests * fix lint * lint * add test * WIP: Leaderboard UI improvements (#1320) * Fixed typos in task_results * Fixed typos in task_results * Added Tailwind, reorganized layout and fixed scrolling * Ran linting * Removed faux benchmark * Updated layout * Changed table number format * Table highlights highest values by making them bold * Added rank to table, removed organization from model_name * Added mean rank to table * Ran linting * feat: Update metadata for all models (#1316) * Added model meta * format * fixed metadata * Metadata update for voyage models * Update mteb/models/cohere_models.py Co-authored-by: Roman Solomatin <[email protected]> * Update mteb/models/cohere_models.py Co-authored-by: Roman Solomatin <[email protected]> * Added corrections from review * fix spelling error --------- 
Co-authored-by: Roman Solomatin <[email protected]> * resolved bugs from pytest --collect-only * Avoid wrapping all models with the SentenceTransformerWrapper * Added normalize_embeddings_to_numpy to ensure standard embeddings during evaluations * fixed moved on correction from @Samoed * conditionally set .predict method on SentenceTransformerWrapper --------- Signed-off-by: mr.Shu <[email protected]> Co-authored-by: HSILA <[email protected]> Co-authored-by: Ali Shiraee <[email protected]> Co-authored-by: github-actions <[email protected]> Co-authored-by: Thomas van Dongen <[email protected]> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: Niklas Muennighoff <[email protected]> Co-authored-by: Orion Weller <[email protected]> Co-authored-by: John Yang <[email protected]> Co-authored-by: Imene Kerboua <[email protected]> Co-authored-by: Marek Šuppa <[email protected]> Co-authored-by: Isaac Chung <[email protected]> Co-authored-by: Xa9aX ツ <[email protected]> Co-authored-by: Roman Solomatin <[email protected]> Co-authored-by: Daniel Buades Marcos <[email protected]> Co-authored-by: Daniel Buades Marcos <[email protected]> Co-authored-by: Sathvik Nallamalli <[email protected]> Co-authored-by: Michael Graczyk <[email protected]> Co-authored-by: Mariya Hendriksen <[email protected]> Co-authored-by: Santiago Castro <[email protected]> Co-authored-by: Joey Xia <[email protected]> Co-authored-by: Márton Kardos <[email protected]> Co-authored-by: Oliver <[email protected]>
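The fix in #1203 listed above is described only as ensuring that the 'sentences' list passed to the OpenAI API does not exceed 2048 elements. A minimal sketch of that idea, assuming a simple fixed-size chunking approach: the helper names `batched` and `embed_all`, and the `embed_batch` callable, are hypothetical and are not taken from the mteb code base or the OpenAI client.

```python
from typing import Callable, Iterator, List, Sequence

# Limit quoted in the commit message; treat it as an assumption, not an API constant.
MAX_INPUTS_PER_REQUEST = 2048


def batched(
    sentences: Sequence[str], max_size: int = MAX_INPUTS_PER_REQUEST
) -> Iterator[List[str]]:
    """Yield consecutive chunks of `sentences`, each with at most `max_size` items."""
    if max_size < 1:
        raise ValueError("max_size must be at least 1")
    for start in range(0, len(sentences), max_size):
        yield list(sentences[start : start + max_size])


def embed_all(
    sentences: Sequence[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    max_size: int = MAX_INPUTS_PER_REQUEST,
) -> List[List[float]]:
    """Embed every sentence by calling `embed_batch` once per chunk.

    `embed_batch` is a hypothetical stand-in for whatever function sends
    one request to the embedding service and returns one vector per input.
    """
    embeddings: List[List[float]] = []
    for chunk in batched(sentences, max_size):
        embeddings.extend(embed_batch(chunk))
    return embeddings


if __name__ == "__main__":
    # Fake embedder used only for illustration: one 1-d vector per sentence.
    def fake_embed(chunk: List[str]) -> List[List[float]]:
        return [[float(len(text))] for text in chunk]

    vectors = embed_all(["hello"] * 5000, fake_embed)
    print(len(vectors))
```

Chunking on the caller's side keeps every request under the limit while preserving input order, so the returned vectors line up one-to-one with the original sentences.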
* [MIEB] Adding DataComp CLIP models (#1283) * adding data comp CLIP models * update model and caltech101 results * make lint * [mieb] Any2TextMultipleChoice Abstask&Evaluator & four tasks in CV-bench (#1287) * fix ImageTextPair dataloading for large datasets; more compositionality evaluation datasets * fix meta data * fix validate points * CV-Bench * evaluator args comment * fix --------- Co-authored-by: Isaac Chung <[email protected]> * [mieb] adding 10 tasks (#1290) * fix ImageTextPair dataloading for large datasets; more compositionality evaluation datasets * add vidore benchmark 10 tasks * fix reference * fix old metadata * fix meta * [mieb] Adding MOCOv3 models (#1293) * add moco models first try * add as a timm model * add large model results * make lint * [mieb] Add more Any2AnyRetrieval datasets (#1285) * wip: start adding BLIP models * add other blip variants * wip: add blip2_models.py * make lint * wip: implement blip2 wrapper * feat: add blip2 models, still mismatched names * fix: remove projections from image and text embeddings * make lint * wip: add coco BLIP2 * fix: BLIP2 better zero-shot classification without text_proj and vision_proj * tidy blip2 * add imagenet-dog-15 dataset * tidy and lint * remove unused import * add cluster_accuracy, ari and nmi to Image.ClusteringEvaluator * add imagenet-10 clustering task * add SOPI2IRetrieval * add results forclip on ImageNet10Clustering and ImageNetDog15Clustering * add SOPI2IRetrieval results for clip 32 * add results for clip vit 32/SOPI2IRetrieval * resolve conflict * add RP2kI2IRetrieval dataset * add RP2kI2IRetrieval results with clip-vit-base-patch32 * update image retrieval __init__.py * add RP2kI2IRetrieval and METI2IRetrieval * add METI2IRetreival * add SOP results * make lign * new revision for METI2IRetrieval * make lint * reset corpus chunk size * remove wrong classification import * add Flickr30k T2I and I2T * add Flickr30k T2I retriebal * reduced-size MET revision * fix: add Flickr30k T2I * 
make lint * add two landmark datasets and results * add Sketchy i2i retrieval * add task metadata * add BLINKIT2IRetrieval dataset * add BLINKIT2TRetrieval * add ImageCoDeT2IRetrieval * make lint * add vizwiz retrieval and results * fix vizwiz duplicate texts * add new vizwiz results * add VQA2 results * add GLD v2 I2T retrieval * add gld v2 i2i retrieval * make lint * remove GLDv2I2IRetrieval * [mieb] Add any2any multiple choice evaluator and abstask (and one task) (#1301) * wip: start adding BLIP models * add other blip variants * wip: add blip2_models.py * make lint * wip: implement blip2 wrapper * feat: add blip2 models, still mismatched names * fix: remove projections from image and text embeddings * make lint * wip: add coco BLIP2 * fix: BLIP2 better zero-shot classification without text_proj and vision_proj * tidy blip2 * add imagenet-dog-15 dataset * tidy and lint * remove unused import * add cluster_accuracy, ari and nmi to Image.ClusteringEvaluator * add imagenet-10 clustering task * add SOPI2IRetrieval * add results forclip on ImageNet10Clustering and ImageNetDog15Clustering * add SOPI2IRetrieval results for clip 32 * add results for clip vit 32/SOPI2IRetrieval * resolve conflict * add RP2kI2IRetrieval dataset * add RP2kI2IRetrieval results with clip-vit-base-patch32 * update image retrieval __init__.py * add RP2kI2IRetrieval and METI2IRetrieval * add METI2IRetreival * add SOP results * make lign * new revision for METI2IRetrieval * make lint * reset corpus chunk size * remove wrong classification import * add Flickr30k T2I and I2T * add Flickr30k T2I retriebal * reduced-size MET revision * fix: add Flickr30k T2I * make lint * add two landmark datasets and results * add Sketchy i2i retrieval * add task metadata * add BLINKIT2IRetrieval dataset * add BLINKIT2TRetrieval * add ImageCoDeT2IRetrieval * make lint * add vizwiz retrieval and results * fix vizwiz duplicate texts * add new vizwiz results * add VQA2 results * add GLD v2 I2T retrieval * add gld v2 
i2i retrieval * make lint * add AbsTaskAny2AnyMultiChoice * make lint * remove GLDv2I2IRetrieval * exclude AbsTaskAny2AnyMultiChoice from test_load_data * [mieb] Fix FORB dataset (#1306) * correct format * update results * add more results * add more results * [mieb] run tasks fix (#1302) * fix ImageTextPair dataloading for large datasets; more compositionality evaluation datasets * fix e5v&vista * task type fix for running tasks * fix wrong meta * run mieb script * script * lint * align * [mieb] split RParisI2IRetrieval and ROxfordI2IRetrieval into easy, medium and hard versions (#1305) * wip: start adding BLIP models * add other blip variants * wip: add blip2_models.py * make lint * wip: implement blip2 wrapper * feat: add blip2 models, still mismatched names * fix: remove projections from image and text embeddings * make lint * wip: add coco BLIP2 * fix: BLIP2 better zero-shot classification without text_proj and vision_proj * tidy blip2 * add imagenet-dog-15 dataset * tidy and lint * remove unused import * add cluster_accuracy, ari and nmi to Image.ClusteringEvaluator * add imagenet-10 clustering task * add SOPI2IRetrieval * add results forclip on ImageNet10Clustering and ImageNetDog15Clustering * add SOPI2IRetrieval results for clip 32 * add results for clip vit 32/SOPI2IRetrieval * resolve conflict * add RP2kI2IRetrieval dataset * add RP2kI2IRetrieval results with clip-vit-base-patch32 * update image retrieval __init__.py * fix ImageTextPair dataloading for large datasets; more compositionality evaluation datasets * add RP2kI2IRetrieval and METI2IRetrieval * add METI2IRetreival * add SOP results * make lign * new revision for METI2IRetrieval * make lint * reset corpus chunk size * remove wrong classification import * add Flickr30k T2I and I2T * add Flickr30k T2I retriebal * reduced-size MET revision * fix: add Flickr30k T2I * make lint * add two landmark datasets and results * add Sketchy i2i retrieval * add task metadata * add BLINKIT2IRetrieval dataset * 
add BLINKIT2TRetrieval * add ImageCoDeT2IRetrieval * make lint * add vizwiz retrieval and results * fix vizwiz duplicate texts * add new vizwiz results * add VQA2 results * add GLD v2 I2T retrieval * add gld v2 i2i retrieval * make lint * add AbsTaskAny2AnyMultiChoice * make lint * remove GLDv2I2IRetrieval * exclude AbsTaskAny2AnyMultiChoice from test_load_data * fix e5v&vista * remove duplicate corpus entries from BLINKIT2TRetreival dataset * task type fix for running tasks * update BLINKIT2T metadata * fix wrong meta * run mieb script * split ROxford, RParis into easy, medium and hard * make lint --------- Co-authored-by: gowitheflow-1998 <[email protected]> * [mieb] run tasks small fix (#1310) * fix ImageTextPair dataloading for large datasets; more compositionality evaluation datasets * fix e5v&vista * task type fix for running tasks * fix wrong meta * run mieb script * script * lint * align * fix * linting * [mieb] Add VLM2vec (#1323) * wip vlm2vec model * making i2t classification work wit Calteh101 * test vlm2vec on other task types * move peft into class * feat: Merge main into MIEB (#1329)
lint * remove prints * 1.16.0 Automatically generated by python-semantic-release * fix: Add Retrieval SK Quad dataset for Slovak search evaluation (#1276) * Add Retrieval SK Quad dataset for Slovak search evaluation This commit introduces the Retrieval SK Quad dataset, designed to assess Slovak search performance. The dataset is derived from SK-QuAD and includes questions with their best answers categorized post-annotation. This addition provides a significant resource for advancing Slovak language search evaluation and supporting further research and development. * Add Retrieval SK Quad dataset for Slovak search evaluation 2 Added the requested changes on the SKQuadRetrieval.py file * add task to init * add missing task metadata --------- Co-authored-by: Isaac Chung <[email protected]> * Update tasks table * 1.16.1 Automatically generated by python-semantic-release * fix: Add Slovak Hate Speech and Offensive Language Dataset (#1274) * Add Slovak Hate Speech and Offensive Language Dataset This commit introduces the Slovak Hate Speech and Offensive Language Database to MTEB. The dataset includes posts from a social network, annotated by humans for hate speech and offensive content. Additionally, the corresponding task has been added to the tasks.md table to reflect this update. * Add Slovak Hate Speech and Offensive Language Dataset - Updated __init__.py to include the new SlovakHateSpeechClassification task. - Modified SlovakHateSpeechClassification.py as per review suggestions to enhance functionality and readability. * Did requested changes: - Updated __init__.py to include the new SlovakHateSpeechClassification task. - Modified SlovakHateSpeechClassification.py as per review suggestions to enhance functionality and readability. 
* resolve linting issues by running `make lint` * Update tasks table * WIP: Leaderboard UI improvements (#1312) * Fixed typos in task_results * Fixed typos in task_results * Added Tailwind, reorganized layout and fixed scrolling * Ran linting * 1.16.2 Automatically generated by python-semantic-release * fix: remove duplicate multilingual * 1.16.3 Automatically generated by python-semantic-release * fix: Re-upload dataset to hub to avoid using script upload (#1322) * fix dataset upload * add linting * Update tasks table * 1.16.4 Automatically generated by python-semantic-release * fix: Add implementations of common reranker models (#1309) * init * revert * revert * add metadata * lint * add reqs * change to float16 * benchmark lint fix * 1.16.5 Automatically generated by python-semantic-release * Add multilingual mFollowIR dataset (#1308) * add mFollowIR * paper name * edit warning->info * convert to parquet * lint * Update tasks table * Cache the embeddings when requested (#1307) * add caching * update test to use close * change from json to pkl * fix for window * cleanup on Windows again * infer dimension * move cachewrapper * add wrapper * fix * updates * fix tests * fix lint * lint * add test * WIP: Leaderboard UI improvements (#1320) * Fixed typos in task_results * Fixed typos in task_results * Added Tailwind, reorganized layout and fixed scrolling * Ran linting * Removed faux benchmark * Updated layout * Changed table number format * Table highlights highest values by making them bold * Added rank to table, removed organization from model_name * Added mean rank to table * Ran linting * feat: Update metadata for all models (#1316) * Added model meta * format * fixed metadata * Metadata update for voyage models * Update mteb/models/cohere_models.py Co-authored-by: Roman Solomatin <[email protected]> * Update mteb/models/cohere_models.py Co-authored-by: Roman Solomatin <[email protected]> * Added corrections from review * fix spelling error --------- 
Co-authored-by: Roman Solomatin <[email protected]> * resolved bugs from pytest --collect-only * Avoid wrapping all models with the SentenceTransformerWrapper * Added normalize_embeddings_to_numpy to ensure standard embeddings during evaluations * fixed moved on correction from @Samoed * conditionally set .predict method on SentenceTransformerWrapper --------- Signed-off-by: mr.Shu <[email protected]> Co-authored-by: HSILA <[email protected]> Co-authored-by: Ali Shiraee <[email protected]> Co-authored-by: github-actions <[email protected]> Co-authored-by: Thomas van Dongen <[email protected]> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: Niklas Muennighoff <[email protected]> Co-authored-by: Orion Weller <[email protected]> Co-authored-by: John Yang <[email protected]> Co-authored-by: Imene Kerboua <[email protected]> Co-authored-by: Marek Šuppa <[email protected]> Co-authored-by: Isaac Chung <[email protected]> Co-authored-by: Xa9aX ツ <[email protected]> Co-authored-by: Roman Solomatin <[email protected]> Co-authored-by: Daniel Buades Marcos <[email protected]> Co-authored-by: Daniel Buades Marcos <[email protected]> Co-authored-by: Sathvik Nallamalli <[email protected]> Co-authored-by: Michael Graczyk <[email protected]> Co-authored-by: Mariya Hendriksen <[email protected]> Co-authored-by: Santiago Castro <[email protected]> Co-authored-by: Joey Xia <[email protected]> Co-authored-by: Márton Kardos <[email protected]> Co-authored-by: Oliver <[email protected]> * [mieb] Add OpenCLIP models (#1335) * add open clip models * Update __init__.py * lint * fix model overview * update jina clip --------- Co-authored-by: chenghao xiao <[email protected]> Co-authored-by: gowitheflow-1998 <[email protected]> Co-authored-by: gowitheflow-1998 <[email protected]> * [mieb] new version with downsampled train split to 32 per class (#1327) * new version with downsampled train split to 32 per class * force load truncated image 
file * make lint * add open clip models * Update __init__.py * lint * fix model overview * fix ImageCLS undersample; run birdsnap * make lint * make lint --------- Co-authored-by: chenghao xiao <[email protected]> Co-authored-by: gowitheflow-1998 <[email protected]> Co-authored-by: gowitheflow-1998 <[email protected]> * [mieb] Fix Jina CLIP (#1349) fix jina clip v1 * fix: Add clevr license (#1356) * Add BLINK as multi-choice tasks (#1348) * wip: start adding BLIP models * add other blip variants * wip: add blip2_models.py * make lint * wip: implement blip2 wrapper * feat: add blip2 models, still mismatched names * fix: remove projections from image and text embeddings * make lint * wip: add coco BLIP2 * fix: BLIP2 better zero-shot classification without text_proj and vision_proj * tidy blip2 * add imagenet-dog-15 dataset * tidy and lint * remove unused import * add cluster_accuracy, ari and nmi to Image.ClusteringEvaluator * add imagenet-10 clustering task * add SOPI2IRetrieval * add results forclip on ImageNet10Clustering and ImageNetDog15Clustering * add SOPI2IRetrieval results for clip 32 * add results for clip vit 32/SOPI2IRetrieval * resolve conflict * add RP2kI2IRetrieval dataset * add RP2kI2IRetrieval results with clip-vit-base-patch32 * update image retrieval __init__.py * fix ImageTextPair dataloading for large datasets; more compositionality evaluation datasets * add RP2kI2IRetrieval and METI2IRetrieval * add METI2IRetreival * add SOP results * make lign * new revision for METI2IRetrieval * make lint * reset corpus chunk size * remove wrong classification import * add Flickr30k T2I and I2T * add Flickr30k T2I retriebal * reduced-size MET revision * fix: add Flickr30k T2I * make lint * add two landmark datasets and results * add Sketchy i2i retrieval * add task metadata * add BLINKIT2IRetrieval dataset * add BLINKIT2TRetrieval * add ImageCoDeT2IRetrieval * make lint * add vizwiz retrieval and results * fix vizwiz duplicate texts * add new vizwiz results * 
add VQA2 results * add GLD v2 I2T retrieval * add gld v2 i2i retrieval * make lint * add AbsTaskAny2AnyMultiChoice * make lint * remove GLDv2I2IRetrieval * exclude AbsTaskAny2AnyMultiChoice from test_load_data * fix e5v&vista * remove duplicate corpus entries from BLINKIT2TRetreival dataset * task type fix for running tasks * update BLINKIT2T metadata * fix wrong meta * run mieb script * split ROxford, RParis into easy, medium and hard * make lint * add BLINK as multi choice tasks * fix: license metadata in wrong format --------- Co-authored-by: gowitheflow-1998 <[email protected]> * [mieb] add Eva CLIP models (#1369) * add Eva CLIP models * make lint * [mieb] add siglip, cohere multimodal & some fixes for final run (#1357) * fix dataset type error * fix clustering metrics * add siglip & cohere * update mieb run script * cohere-v import * fix * api key name * [mieb] fixes for final run (#1374) * e5_v device arg * dataloader num_workers * vista doc * vista doc * run mieb * fix * Update run_vista.md * [mieb] Fix torch no grad (#1378) Fix torch no grad * [mieb] Fix vlm2vec (#1380) * fix vlm2vec return dtype * make lint * [mieb] Remove null entries from corpus of ROxford, RParis (#1371) * wip: start adding BLIP models * add other blip variants * wip: add blip2_models.py * make lint * wip: implement blip2 wrapper * feat: add blip2 models, still mismatched names * fix: remove projections from image and text embeddings * make lint * wip: add coco BLIP2 * fix: BLIP2 better zero-shot classification without text_proj and vision_proj * tidy blip2 * add imagenet-dog-15 dataset * tidy and lint * remove unused import * add cluster_accuracy, ari and nmi to Image.ClusteringEvaluator * add imagenet-10 clustering task * add SOPI2IRetrieval * add results forclip on ImageNet10Clustering and ImageNetDog15Clustering * add SOPI2IRetrieval results for clip 32 * add results for clip vit 32/SOPI2IRetrieval * resolve conflict * add RP2kI2IRetrieval dataset * add RP2kI2IRetrieval results with 
clip-vit-base-patch32 * update image retrieval __init__.py * fix ImageTextPair dataloading for large datasets; more compositionality evaluation datasets * add RP2kI2IRetrieval and METI2IRetrieval * add METI2IRetreival * add SOP results * make lign * new revision for METI2IRetrieval * make lint * reset corpus chunk size * remove wrong classification import * add Flickr30k T2I and I2T * add Flickr30k T2I retriebal * reduced-size MET revision * fix: add Flickr30k T2I * make lint * add two landmark datasets and results * add Sketchy i2i retrieval * add task metadata * add BLINKIT2IRetrieval dataset * add BLINKIT2TRetrieval * add ImageCoDeT2IRetrieval * make lint * add vizwiz retrieval and results * fix vizwiz duplicate texts * add new vizwiz results * add VQA2 results * add GLD v2 I2T retrieval * add gld v2 i2i retrieval * make lint * add AbsTaskAny2AnyMultiChoice * make lint * remove GLDv2I2IRetrieval * exclude AbsTaskAny2AnyMultiChoice from test_load_data * fix e5v&vista * remove duplicate corpus entries from BLINKIT2TRetreival dataset * task type fix for running tasks * update BLINKIT2T metadata * fix wrong meta * run mieb script * split ROxford, RParis into easy, medium and hard * make lint * add BLINK as multi choice tasks * fix: license metadata in wrong format * remove null examples from corpus of ROxford and RParis --------- Co-authored-by: gowitheflow-1998 <[email protected]> * [mieb] fixes (#1390) * Fix torch no grad * simplify * make lint --------- Co-authored-by: Isaac Chung <[email protected]> * [MIEB] Remove non-existent method for blip (#1394) remove non-existent method for blip * [mieb] fix ALIGN; update Winoground revision id; update run script (#1391) * fix align & winoground * lint * Convert task category to i2i for tasks that only calls image encode * update categories should include img cls, clustering, and multi label clf * no op * no op * make lint --------- Co-authored-by: Isaac Chung <[email protected]> * [mieb] Fix open clip for cv bench count 
(#1397) fix shape mismatch * [mieb] Update subtasks of BLINKIT2TMultiChoice and BLINKIT2IMultiChoice (#1403) * wip: start adding BLIP models * add other blip variants * wip: add blip2_models.py * make lint * wip: implement blip2 wrapper * feat: add blip2 models, still mismatched names * fix: remove projections from image and text embeddings * make lint * wip: add coco BLIP2 * fix: BLIP2 better zero-shot classification without text_proj and vision_proj * tidy blip2 * add imagenet-dog-15 dataset * tidy and lint * remove unused import * add cluster_accuracy, ari and nmi to Image.ClusteringEvaluator * add imagenet-10 clustering task * add SOPI2IRetrieval * add results forclip on ImageNet10Clustering and ImageNetDog15Clustering * add SOPI2IRetrieval results for clip 32 * add results for clip vit 32/SOPI2IRetrieval * resolve conflict * add RP2kI2IRetrieval dataset * add RP2kI2IRetrieval results with clip-vit-base-patch32 * update image retrieval __init__.py * fix ImageTextPair dataloading for large datasets; more compositionality evaluation datasets * add RP2kI2IRetrieval and METI2IRetrieval * add METI2IRetreival * add SOP results * make lign * new revision for METI2IRetrieval * make lint * reset corpus chunk size * remove wrong classification import * add Flickr30k T2I and I2T * add Flickr30k T2I retriebal * reduced-size MET revision * fix: add Flickr30k T2I * make lint * add two landmark datasets and results * add Sketchy i2i retrieval * add task metadata * add BLINKIT2IRetrieval dataset * add BLINKIT2TRetrieval * add ImageCoDeT2IRetrieval * make lint * add vizwiz retrieval and results * fix vizwiz duplicate texts * add new vizwiz results * add VQA2 results * add GLD v2 I2T retrieval * add gld v2 i2i retrieval * make lint * add AbsTaskAny2AnyMultiChoice * make lint * remove GLDv2I2IRetrieval * exclude AbsTaskAny2AnyMultiChoice from test_load_data * fix e5v&vista * remove duplicate corpus entries from BLINKIT2TRetreival dataset * task type fix for running tasks * update 
BLINKIT2T metadata * fix wrong meta * run mieb script * split ROxford, RParis into easy, medium and hard * make lint * add BLINK as multi choice tasks * fix: license metadata in wrong format * remove null examples from corpus of ROxford and RParis * fix: add/remove subtasks from BLINKIT2IMultiChoice and BLINKIT2TMultiChoice * update blink metadata * add updated BLINK results --------- Co-authored-by: gowitheflow-1998 <[email protected]> * [mieb] Fix EVA CLIP for CV Bench (#1414) * unsqueeze after preprocess * make lint * [mieb] Add calculate probs for vlm2vec (#1418) * add method * make lint * [mieb] Fix siglip bug & add retrieval datasets (#1424) * fix siglip * add edis&gld-v2 i2i * results * siglip updated results * fix siglip non-dataloader tasks * [mieb] use Logistic Regression classifier for AbsTaskImageMultilabelClassification (#1420) * use moc-lr classifier * set n_experiments=5 * run dinov2 and some laion models * add dinov2-giant results * [mieb] mieb scripts (siglip rerun & linear probing ablation & params count) (#1429) * mieb scripts * lint * [MIEB] Change Flickr30k to test split (#1449) * wip: start adding BLIP models * add other blip variants * wip: add blip2_models.py * make lint * wip: implement blip2 wrapper * feat: add blip2 models, still mismatched names * fix: remove projections from image and text embeddings * make lint * wip: add coco BLIP2 * fix: BLIP2 better zero-shot classification without text_proj and vision_proj * tidy blip2 * add imagenet-dog-15 dataset * tidy and lint * remove unused import * add cluster_accuracy, ari and nmi to Image.ClusteringEvaluator * add imagenet-10 clustering task * add SOPI2IRetrieval * add results forclip on ImageNet10Clustering and ImageNetDog15Clustering * add SOPI2IRetrieval results for clip 32 * add results for clip vit 32/SOPI2IRetrieval * resolve conflict * add RP2kI2IRetrieval dataset * add RP2kI2IRetrieval results with clip-vit-base-patch32 * update image retrieval __init__.py * fix ImageTextPair 
dataloading for large datasets; more compositionality evaluation datasets * add RP2kI2IRetrieval and METI2IRetrieval * add METI2IRetreival * add SOP results * make lign * new revision for METI2IRetrieval * make lint * reset corpus chunk size * remove wrong classification import * add Flickr30k T2I and I2T * add Flickr30k T2I retriebal * reduced-size MET revision * fix: add Flickr30k T2I * make lint * add two landmark datasets and results * add Sketchy i2i retrieval * add task metadata * add BLINKIT2IRetrieval dataset * add BLINKIT2TRetrieval * add ImageCoDeT2IRetrieval * make lint * add vizwiz retrieval and results * fix vizwiz duplicate texts * add new vizwiz results * add VQA2 results * add GLD v2 I2T retrieval * add gld v2 i2i retrieval * make lint * add AbsTaskAny2AnyMultiChoice * make lint * remove GLDv2I2IRetrieval * exclude AbsTaskAny2AnyMultiChoice from test_load_data * fix e5v&vista * remove duplicate corpus entries from BLINKIT2TRetreival dataset * task type fix for running tasks * update BLINKIT2T metadata * fix wrong meta * run mieb script * split ROxford, RParis into easy, medium and hard * make lint * add BLINK as multi choice tasks * fix: license metadata in wrong format * remove null examples from corpus of ROxford and RParis * fix: add/remove subtasks from BLINKIT2IMultiChoice and BLINKIT2TMultiChoice * update blink metadata * add updated BLINK results * merge upstream mieb * change Flickr30k to test split * change flickr to test split --------- Co-authored-by: gowitheflow-1998 <[email protected]> * [mieb] Fix VLM2vec dtype (#1462) * propagate dtype * fix fuse embeddings using list of PIL images * [mieb] run script for missing results (#1472) * task type fix * scripts * [mieb] Fix Moco model on CIFAR10Clustering (#1487) Fix Moco model on CIFAR10Clustering * [mieb] Fix Flickr30k I2T and T2I (#1505) * remake flickr30k it2 and t2i * add openai clip vit-b32 b16 and jina-clip results * make lint * [MIEB] add missing siglip models (#1533) * add udpates * 
lint errors * fix typo (#1535) * add udpates * lint errors * fix small typo * [mieb] Fix numbers of CIRR, Fashion200k, FashionIQ, Flickr30k, MSCOCO data statistics (#1544) fix numbers * Add Voyage's multimodal embedding (#1555) * add voyage multimodal & ran 17 tasks * lint * typo * clean * [mieb] update script for final re-run (#1576) * mieb final runs * lint * fix: no longer using same query text for all of BLINKIT2TMultiChoice (#1572) * fix: no longer using same query text for all of BLINKIT2TMultiChoice * fix: remove blink subtask * fix: remove subtask from blink it2i * fix: align BLINK retrieval to multi choice * add ROxford and RParis I2I multi choice * add retrieval metrics to multi choice evaluator * fix: remove wrong negatives from revisiting multichoice datasets * fix revisiting datasets * add new results for revisiting multichoice --------- Signed-off-by: mr.Shu <[email protected]> Co-authored-by: Isaac Chung <[email protected]> Co-authored-by: chenghao xiao <[email protected]> Co-authored-by: Jamie-Stirling <[email protected]> Co-authored-by: gowitheflow-1998 <[email protected]> Co-authored-by: Kenneth Enevoldsen <[email protected]> Co-authored-by: HSILA <[email protected]> Co-authored-by: Ali Shiraee <[email protected]> Co-authored-by: github-actions <[email protected]> Co-authored-by: Thomas van Dongen <[email protected]> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: Orion Weller <[email protected]> Co-authored-by: John Yang <[email protected]> Co-authored-by: Imene Kerboua <[email protected]> Co-authored-by: Marek Šuppa <[email protected]> Co-authored-by: Xa9aX ツ <[email protected]> Co-authored-by: Roman Solomatin <[email protected]> Co-authored-by: Daniel Buades Marcos <[email protected]> Co-authored-by: Daniel Buades Marcos <[email protected]> Co-authored-by: Sathvik Nallamalli <[email protected]> Co-authored-by: Michael Graczyk <[email protected]> Co-authored-by: Mariya Hendriksen <[email 
protected]> Co-authored-by: Santiago Castro <[email protected]> Co-authored-by: Joey Xia <[email protected]> Co-authored-by: Márton Kardos <[email protected]> Co-authored-by: Oliver <[email protected]> Co-authored-by: gowitheflow-1998 <[email protected]> Co-authored-by: Saiteja Utpala <[email protected]> Co-authored-by: Xin Zhang <[email protected]>
MTEB(code) to benchmarks
MTEB(code) to Appendix H