Add model jxm/cde-small-v1 #1521

Open
wants to merge 28 commits into
base: main
d894bbc
Fix verbosity handling in MTEB.py for consistent logging
YashDThapliyal Nov 4, 2024
0dc2a8a
updates
YashDThapliyal Nov 4, 2024
00032aa
update docstrings
YashDThapliyal Nov 4, 2024
7ae0583
linting code
YashDThapliyal Nov 4, 2024
838253d
Create cde-small-v1_model.py
YashDThapliyal Nov 27, 2024
eb04a87
update code for cde-small-v1 model
YashDThapliyal Nov 28, 2024
be6790e
Merge branch 'main' of https://github.com/embeddings-benchmark/mteb
YashDThapliyal Nov 28, 2024
5dc100f
Merge branch 'embeddings-benchmark:main' into main
YashDThapliyal Nov 28, 2024
c95b672
make lint and make test
YashDThapliyal Nov 28, 2024
3a692ec
Merge branch 'main' of https://github.com/YashDThapliyal/mteb
YashDThapliyal Nov 28, 2024
6ab8ffb
Update cde-small-v1_model.py
YashDThapliyal Nov 28, 2024
36d6702
Update cde-small-v1_model.py
YashDThapliyal Nov 28, 2024
200029d
create a test
YashDThapliyal Dec 23, 2024
5f48218
Merge branch 'embeddings-benchmark:main' into main
YashDThapliyal Dec 23, 2024
64b1c41
add model meta data card
YashDThapliyal Dec 23, 2024
93b5794
Merge branch 'main' of https://github.com/YashDThapliyal/mteb
YashDThapliyal Dec 24, 2024
9a25f97
remove zero_shot_benchmark as discussed on PR
YashDThapliyal Dec 24, 2024
06f5f01
clean up comments/add liscense
YashDThapliyal Dec 24, 2024
3a2d2e3
begin implementing cde
YashDThapliyal Dec 25, 2024
9222601
add corpus for the model to use
YashDThapliyal Dec 25, 2024
d5413fd
add model implementation via following HF refrence
YashDThapliyal Dec 25, 2024
e269da6
syntax error fix (delete ';' )
YashDThapliyal Dec 25, 2024
8af60b6
Update cde-small-v1_model.py
YashDThapliyal Dec 25, 2024
919c2b1
update implementation code
YashDThapliyal Dec 25, 2024
5be4ff5
Update cde-small-v1_model.py
YashDThapliyal Dec 25, 2024
5a05876
add results folder
YashDThapliyal Dec 25, 2024
449111e
Delete mteb/models/results directory
YashDThapliyal Dec 25, 2024
9815f36
results directory
YashDThapliyal Dec 25, 2024
93 changes: 93 additions & 0 deletions mteb/models/cde-small-v1_model.py
@@ -0,0 +1,93 @@
from __future__ import annotations
import mteb
from mteb import MTEB
from sentence_transformers import SentenceTransformer
from functools import partial
from mteb.model_meta import ModelMeta, sentence_transformers_loader
import random


cde_small_v1_meta = ModelMeta(
    loader=partial(
        sentence_transformers_loader,
        name="jxm/cde-small-v1",
        revision="6a8c2f9f0a8184480f2e58f7d1413320b7b6392d",
        model_prompts={
            "query": "search_query: ",
            "passage": "search_document: ",
        }
    ),

    name="jxm/cde-small-v1",
    revision="6a8c2f9f0a8184480f2e58f7d1413320b7b6392d",
    release_date="2024-10-01",
    languages=["eng-Latn"],
    n_parameters=281_000_000,
    memory_usage=None,
    max_tokens=512,
    embed_dim=None,
    license="mit",
    open_weights=True,
    public_training_data=None,
    public_training_code=True,
    framework=["Sentence Transformers", "PyTorch"],
    reference="https://huggingface.co/jxm/cde-small-v1",
    similarity_fn_name="cosine",
    use_instructions=True,

)

# Implement the model
model = SentenceTransformer("jxm/cde-small-v1", trust_remote_code=True)
model.prompts = {
    "query": "search_query: ",
    "passage": "search_document: ",  # Use 'passage' instead of 'document' for consistency with MTEB
}

corpus_file = "random_strings_cde.txt"

with open(corpus_file, "r") as file:
    random_corpus = [line.strip() for line in file]

minicorpus_size = 512
assert len(random_corpus) >= minicorpus_size, "Corpus size is smaller than required!"

minicorpus_docs = random.sample(random_corpus, k=minicorpus_size)

print("Generating dataset embeddings...")
dataset_embeddings = model.encode(
    minicorpus_docs,
    prompt_name="passage",
    convert_to_tensor=True
)

print("Dataset embeddings shape:", dataset_embeddings.shape)


tasks = [
    # classification
    "AmazonCounterfactualClassification",
    # clustering
    "RedditClustering",
    # pair classification
    "TwitterSemEval2015",
    # reranking
    "AskUbuntuDupQuestions",
    # retrieval
    "SCIDOCS",
    # semantic textual similarity
    "STS22",
    # summarization
    "SummEval",
]


evaluation = MTEB(tasks=tasks)

print("Running MTEB evaluation...")
results = evaluation.run(
    model=model,
    output_folder="results",
    extra_kwargs={"batch_size": 8},
    overwrite_results=True,
)
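
Review note on the corpus sampling step in the diff above: `random.sample` is called without a seed, so each run draws a different minicorpus, and blank lines in `random_strings_cde.txt` would be sampled as empty documents. A minimal sketch of a reproducible loader follows. The function name `load_minicorpus` and the default seed are hypothetical and not part of this PR.

```python
import random
from pathlib import Path


def load_minicorpus(path, size=512, seed=42):
    """Read one document per line, skip blank lines, and draw a seeded sample.

    Hypothetical helper for illustration; not part of the PR.
    """
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    docs = [line.strip() for line in lines if line.strip()]
    if len(docs) < size:
        raise ValueError(
            f"Corpus has {len(docs)} documents, need at least {size}"
        )
    # A private Random instance keeps the draw reproducible without
    # touching the global random state.
    rng = random.Random(seed)
    return rng.sample(docs, k=size)
```

With this helper, `minicorpus_docs = load_minicorpus(corpus_file, size=minicorpus_size)` would replace the `open`/`assert`/`random.sample` sequence, and a failed size check raises an exception instead of relying on `assert`, which is skipped under `python -O`.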
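
Review note on the evaluation step: in the diff, `dataset_embeddings` is computed and printed but never passed to `evaluation.run`, so the evaluated model does not see the first-stage context. One possible shape for a fix is sketched below, assuming the model's `encode` accepts a `dataset_embeddings` keyword, as the referenced Hugging Face model card appears to describe. The class name `ContextualEncoder` is hypothetical, and whether the keyword arguments MTEB passes to `encode` are all accepted by the underlying model has not been checked here.

```python
class ContextualEncoder:
    """Hypothetical wrapper that injects dataset_embeddings into encode().

    Assumes the wrapped model's encode() accepts a `dataset_embeddings`
    keyword argument; this is an assumption, not verified against the PR.
    """

    def __init__(self, model, dataset_embeddings):
        self.model = model
        self.dataset_embeddings = dataset_embeddings

    def encode(self, sentences, **kwargs):
        # Only fill in the context if the caller did not supply its own.
        kwargs.setdefault("dataset_embeddings", self.dataset_embeddings)
        return self.model.encode(sentences, **kwargs)
```

Under that assumption, the call would become `evaluation.run(model=ContextualEncoder(model, dataset_embeddings), ...)`. Using `setdefault` leaves room for a caller to supply task-specific context embeddings instead of the fixed minicorpus.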