Source code for the ACL 2021 paper "Interpretable and Low-Resource Entity Matching via Decoupling Feature Learning from Decision Making".
We provide our external resources on the Tsinghua Cloud: the fastText word embeddings and the conda environment that we used.
Because of file size limits, we compress the files and split them into pieces. The pieces were created with:
tar -zcvf - fasttext.wiki.en.300d.bin | split -b 1024m - embedding.tar.gz.
tar -zcvf - em2 | split -b 2048m - em2.tar.gz.
To restore the files, concatenate the pieces and extract the archives:
cat embedding.tar.gz.a* > embedding.tar.gz
tar -xf embedding.tar.gz
cat em2.tar.gz.a* > em2.tar.gz
tar -xf em2.tar.gz
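The restore commands above can be wrapped in a small helper. This is only a sketch: the function name reassemble is hypothetical and not part of the repository, and it assumes the pieces sit in the current directory with the suffixes that split produces by default (aa, ab, ...).

```shell
# Hypothetical helper (not part of the repository): rebuild an archive
# from its split pieces and extract it in the current directory.
# Usage: reassemble embedding.tar.gz
#        reassemble em2.tar.gz
reassemble() {
    archive="$1"
    # ls fails when no piece matches, which lets us stop before writing anything.
    if ! ls "${archive}".a* >/dev/null 2>&1; then
        echo "no pieces found for ${archive}" >&2
        return 1
    fi
    # The shell expands the glob in sorted order, so the pieces are joined in sequence.
    cat "${archive}".a* > "${archive}" && tar -xf "${archive}"
}
```

Checking for the pieces first avoids leaving an empty archive behind when the download is incomplete or the command is run from the wrong directory.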
- Download fasttext.wiki.en.300d.bin from the Tsinghua Cloud.
- Create a new directory at $HOME/.vector_cache/fasttext (if it does not exist).
- Place fasttext.wiki.en.300d.bin at $HOME/.vector_cache/fasttext
- Check it with the following command:
ls -al ~/.vector_cache/fasttext/fasttext.wiki.en.300d.bin
You should get output like this:
-rw-r--r-- 1 zijun zijun 8493673445 Jan 14 20:48 /home/zijun/.vector_cache/fasttext/fasttext.wiki.en.300d.bin
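The placement steps above could be scripted roughly as follows. The function name install_embedding is hypothetical, and the sketch assumes the extracted file is passed as the first argument.

```shell
# Hypothetical helper (not part of the repository): move the extracted
# embedding file into the cache directory named in the steps above.
# Usage: install_embedding ./fasttext.wiki.en.300d.bin
install_embedding() {
    src="$1"
    dest_dir="${HOME}/.vector_cache/fasttext"
    if [ ! -f "$src" ]; then
        echo "embedding file not found: ${src}" >&2
        return 1
    fi
    # -p creates missing parent directories and is a no-op if the directory exists.
    mkdir -p "$dest_dir"
    mv "$src" "$dest_dir/"
    # Same check as in the steps above: list the file at its final location.
    ls -al "$dest_dir/$(basename "$src")"
}
```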
We recommend installing Anaconda (or Miniconda) and creating a new environment for our code by cloning the one we provide on the Tsinghua Cloud.
- Download our environment from the Tsinghua Cloud and name it em2
- Create a new virtual environment: conda create -n em --clone em2
- Enter the new environment: conda activate em
- Go to the dataset directory: cd dataset
- Run 1.bigtable-attrdrop-ind.py, 2.mag-table.py, 4.mag.py, and 5.traditinal_feature.py in sequence.
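One way to run the preprocessing steps above in order is sketched below. The function name run_preprocessing is hypothetical, and the sketch assumes each script is started with python and takes no command-line arguments, which the steps above do not state.

```shell
# Hypothetical helper (not part of the repository): run the preprocessing
# scripts in the order listed above, stopping at the first failure.
# Assumption: the scripts are launched as "python <script>" with no arguments.
# Usage (from the repository root): run_preprocessing
run_preprocessing() {
    dir="${1:-dataset}"
    (
        # The subshell keeps the caller's working directory unchanged.
        cd "$dir" || exit 1
        for script in 1.bigtable-attrdrop-ind.py 2.mag-table.py 4.mag.py 5.traditinal_feature.py; do
            echo "running ${script}"
            python "$script" || exit 1
        done
    )
}
```

Stopping at the first failure matters here because the scripts are meant to run in sequence, so a later script should not run after an earlier one has failed.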
Note that we have already provided the data for reproducing Table 3 and Table 4.
To reproduce Figure 3, you need to prepare the dataset by running our data preprocessing code with different drop_rate and train_rate values.
The dataset names used in the code correspond to the dataset names in the paper as follows:
music: I-A_1
citation: D-S_1
citeacm: D-A_1
dmusic: I-A_2
dcitation: D-S_2
dciteacm: D-A_2
For commercial reasons, we are unable to publish the Real dataset.
cd 1-HRF-dt
bash run.sh
cd 1-HRF-gini
bash run.sh
cd 1-HRF-xgb
bash run.sh
The final results are recorded in the logs directory.
cd 1-HRF-dt
bash run_full.sh
cd 1-HRF-gini
bash run_full.sh
cd 1-HRF-xgb
bash run_full.sh
The final results are recorded in the logs directory.
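The per-model commands above can also be run in one pass from the repository root. The following is a sketch with a hypothetical function name, run_all, and it assumes the three model directories are siblings in the current directory.

```shell
# Hypothetical helper (not part of the repository): run the given script
# (run.sh by default) in each of the three model directories in turn.
# Usage: run_all             # runs run.sh in each directory
#        run_all run_full.sh # runs run_full.sh in each directory
run_all() {
    script="${1:-run.sh}"
    for dir in 1-HRF-dt 1-HRF-gini 1-HRF-xgb; do
        # The subshell returns to the starting directory after each run,
        # so the next cd is resolved from the same place.
        ( cd "$dir" && bash "$script" ) || return 1
    done
}
```

Running each step in a subshell avoids having to cd back to the repository root between models.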
If you use the code, please cite this paper:
Zijun Yao, Chengjiang Li, Tiansi Dong, Xin Lv, Jifan Yu, Lei Hou, Juanzi Li, Yichi Zhang, Zelin Dai. Interpretable and Low-Resource Entity Matching via Decoupling Feature Learning from Decision Making. The Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP 2021).
Alternatively, use the following BibTeX entry:
@inproceedings{yao-etal-2021-interpretable,
title = "Interpretable and Low-Resource Entity Matching via Decoupling Feature Learning from Decision Making",
author = "Yao, Zijun and Li, Chengjiang and Dong, Tiansi and Lv, Xin and Yu, Jifan and Hou, Lei and Li, Juanzi and Zhang, Yichi and Dai, Zelin",
booktitle = "ACL'21",
year = "2021",
url = "https://aclanthology.org/2021.acl-long.215",
}