This repository contains the code and data for our paper [How Well Do Text Embedding Models Understand Syntax?], accepted to the Findings of EMNLP 2023.
We establish an evaluation set, named SR, to scrutinize how well text embedding models understand syntax from two crucial syntactic aspects: Structural heuristics and Relational understanding among concepts.
Our SR benchmark contains source sentences from STS-B, CQADupStack, Twitter, BIOSSES, SICK-R, and AskUbuntu.
```shell
pip install -r requirements.txt
```
Take SentenceTransformer as an example:

```shell
cd eval
python sbert_test.py
```
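At its core, this kind of evaluation reduces to comparing sentence embeddings. Below is a minimal sketch of the scoring step, not the repository's actual script: the model name in the comment and the toy vectors are illustrative assumptions, standing in for embeddings a SentenceTransformer model would produce.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# In a real run the vectors would come from a model, e.g.
#   model = SentenceTransformer("all-MiniLM-L6-v2")  # model name is an assumption
#   u, v = model.encode(["The dog chased the cat.", "The cat chased the dog."])
# Here toy vectors stand in so the scoring step itself is visible.
u = [1.0, 0.0, 1.0]
v = [1.0, 1.0, 0.0]
score = cosine_similarity(u, v)
```

A syntax-sensitive model should assign a lower score to pairs whose words overlap but whose structure (and hence meaning) differs.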
Set your own API keys in `openai_eval.py`.
Take Pinecone as an example: fill in your index name and API key in `pinecone.py`.
Set your OpenAI keys in `.env`.
Check the prompts used in `/action`.
This project is built on LangChain; feel free to write your own prompt/template to generate your sentences. Because of batch inference, you need to change both the template and the JSON parser if you change the `batch_size` in `collect.py`.
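To illustrate why the template and parser are coupled to the batch size, here is a self-contained sketch under our own assumptions (the template wording, function names, and `BATCH_SIZE` constant are illustrative, not the repository's actual code): the prompt asks the model for one JSON list entry per input sentence, so both sides must agree on how many sentences a batch contains.

```python
import json

BATCH_SIZE = 3  # must match the number of sentences the template expects

TEMPLATE = (
    "Rewrite each of the following {n} sentences and return a JSON list "
    "of exactly {n} strings:\n{sentences}"
)

def build_prompt(batch):
    # One numbered slot per sentence; changing BATCH_SIZE changes the prompt.
    assert len(batch) == BATCH_SIZE
    listed = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(batch))
    return TEMPLATE.format(n=BATCH_SIZE, sentences=listed)

def parse_response(text):
    # The parser likewise expects exactly BATCH_SIZE items back.
    items = json.loads(text)
    if len(items) != BATCH_SIZE:
        raise ValueError(f"expected {BATCH_SIZE} items, got {len(items)}")
    return items

prompt = build_prompt(["A dog barks.", "Rain falls.", "Birds sing."])
outputs = parse_response(
    '["A dog is barking.", "It is raining.", "Birds are singing."]'
)
```

If `BATCH_SIZE` changes, both `TEMPLATE` and `parse_response` must change together, which is exactly the coupling noted above.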
Run a demo first to check that the project works, then replace `example.csv` with your own file:

```shell
python collect.py
```
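If you supply your own file in place of `example.csv`, it should be readable the same way. As a hedged sketch of that loading step (the `sentence` column name and sample rows below are assumptions, not necessarily the repository's actual schema), a CSV of source sentences can be consumed like this:

```python
import csv
import io

# Stand-in for the contents of example.csv; the "sentence" header
# is an assumed column name for illustration only.
SAMPLE = "sentence\nThe cat sat on the mat.\nShe gave him the book.\n"

def load_sentences(fileobj):
    """Read one sentence per row from a CSV file with a header row."""
    return [row["sentence"] for row in csv.DictReader(fileobj)]

# In practice: load_sentences(open("example.csv", encoding="utf-8"))
sentences = load_sentences(io.StringIO(SAMPLE))
```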
```bibtex
@inproceedings{zhang2023well,
  title={How Well Do Text Embedding Models Understand Syntax?},
  author={Zhang, Yan and Feng, Zhaopeng and Teng, Zhiyang and Liu, Zuozhu and Li, Haizhou},
  booktitle={Findings of the Association for Computational Linguistics: EMNLP 2023},
  pages={9717--9728},
  year={2023}
}
```