This repository contains the code and data for our paper [How Well Do Text Embedding Models Understand Syntax?], accepted to the Findings of EMNLP 2023.
We establish an evaluation set, named SR, to scrutinize how well text embedding models understand syntax from two crucial syntactic aspects: Structural heuristics and Relational understanding among concepts.
Our SR benchmark contains source sentences from STS-B, CQADupStack, Twitter, BIOSSES, SICK-R, and AskUbuntu.
```shell
pip install -r requirements.txt
```
Take SentenceTransformer as an example:

```shell
cd eval
python sbert_test.py
```
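At its core, this kind of evaluation reduces to comparing sentence embeddings. Below is a minimal sketch of the scoring step, not the repository's actual script: the model name in the comment and the toy vectors are illustrative assumptions, standing in for embeddings a SentenceTransformer model would produce.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# In a real run the vectors would come from a model, e.g.
#   model = SentenceTransformer("all-MiniLM-L6-v2")  # model name is an assumption
#   u, v = model.encode(["The dog chased the cat.", "The cat chased the dog."])
# Here toy vectors stand in so the scoring step itself is visible.
u = [1.0, 0.0, 1.0]
v = [1.0, 1.0, 0.0]
score = cosine_similarity(u, v)
```

A syntax-sensitive model should assign a lower score to pairs whose words overlap but whose structure (and hence meaning) differs.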
Set your own API keys in `openai_eval.py`.
Take Pinecone as an example: fill in your index name and API key in `pinecone.py`.
Set your OpenAI keys in `.env`.
Check the prompts used in `/action`.
This project is built on LangChain; feel free to write your own prompt/template to generate your sentences. Because of batch inference, you need to change both the template and the JSON parser if you change the `batch_size` in `collect.py`.
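To illustrate why the template and parser are coupled to the batch size, here is a self-contained sketch under our own assumptions (the template wording, function names, and `BATCH_SIZE` constant are illustrative, not the repository's actual code): the prompt asks the model for one JSON list entry per input sentence, so both sides must agree on how many sentences a batch contains.

```python
import json

BATCH_SIZE = 3  # must match the number of sentences the template expects

TEMPLATE = (
    "Rewrite each of the following {n} sentences and return a JSON list "
    "of exactly {n} strings:\n{sentences}"
)

def build_prompt(batch):
    # One numbered slot per sentence; changing BATCH_SIZE changes the prompt.
    assert len(batch) == BATCH_SIZE
    listed = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(batch))
    return TEMPLATE.format(n=BATCH_SIZE, sentences=listed)

def parse_response(text):
    # The parser likewise expects exactly BATCH_SIZE items back.
    items = json.loads(text)
    if len(items) != BATCH_SIZE:
        raise ValueError(f"expected {BATCH_SIZE} items, got {len(items)}")
    return items

prompt = build_prompt(["A dog barks.", "Rain falls.", "Birds sing."])
outputs = parse_response(
    '["A dog is barking.", "It is raining.", "Birds are singing."]'
)
```

If `BATCH_SIZE` changes, both `TEMPLATE` and `parse_response` must change together, which is exactly the coupling noted above.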
Run a demo first to check that the project works, then replace `example.csv` with your own file:

```shell
python collect.py
```
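If you supply your own file in place of `example.csv`, it should be readable the same way. As a hedged sketch of that loading step (the `sentence` column name and sample rows below are assumptions, not necessarily the repository's actual schema), a CSV of source sentences can be consumed like this:

```python
import csv
import io

# Stand-in for the contents of example.csv; the "sentence" header
# is an assumed column name for illustration only.
SAMPLE = "sentence\nThe cat sat on the mat.\nShe gave him the book.\n"

def load_sentences(fileobj):
    """Read one sentence per row from a CSV file with a header row."""
    return [row["sentence"] for row in csv.DictReader(fileobj)]

# In practice: load_sentences(open("example.csv", encoding="utf-8"))
sentences = load_sentences(io.StringIO(SAMPLE))
```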
```bibtex
@inproceedings{zhang2023well,
  title={How Well Do Text Embedding Models Understand Syntax?},
  author={Zhang, Yan and Feng, Zhaopeng and Teng, Zhiyang and Liu, Zuozhu and Li, Haizhou},
  booktitle={Findings of the Association for Computational Linguistics: EMNLP 2023},
  pages={9717--9728},
  year={2023}
}
```