Commit cbd9df4 (parent: ee55d2c), committed by garrafao on Jun 26, 2019.
Showing 124 changed files with 8,377 additions and 1 deletion.
Changed file (from 2 to 53 lines):

# LSCDetection
Data Sets and Models for Evaluation of Lexical Semantic Change Detection.

If you use this software for academic research, [please cite this paper](#bibtex) and make sure you give appropriate credit to the software listed below, on which this repository strongly depends.

The code relies heavily on [DISSECT](http://clic.cimec.unitn.it/composes/toolkit/introduction.html) (modules/composes). For aligning embeddings (SGNS/SVD/RI) we used [VecMap](https://github.com/artetxem/vecmap) (alignment/map_embeddings.py). For SGNS we used the [gensim](https://github.com/rare-technologies/gensim) implementation.

### Testsets

In `testsets/` we provide the testset versions of DURel and SURel as used in the paper.

### Usage Note

The scripts should be run directly from the main directory. If you wish to run them from elsewhere, you may have to adjust the path passed to `sys.path.append('./modules/')` in the scripts. All scripts can be run directly from the command line, e.g.:

    python representations/count.py <windowSize> <corpDir> <outPath> <lowerBound> <upperBound>
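
For instance, a hypothetical invocation on the test corpus might look like the following (the argument values here are illustrative only, not taken from the repository; consult each script's documentation string for the exact meaning of its parameters):

    python representations/count.py 5 corpora/test/corpus.txt.gz matrices/test_count 1920 1990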

We recommend running the scripts with the Anaconda distribution of Python (Python 2.7.15); only VecMap requires Python 3. You will have to install some additional packages, such as docopt and gensim. Packages that are not available from the Anaconda installer can be installed via EasyInstall or by running `pip install -r requirements.txt`.
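
A minimal environment setup might look as follows (a sketch assuming a conda installation; the environment name `lscd` is arbitrary):

    conda create -n lscd python=2.7.15
    conda activate lscd
    pip install -r requirements.txt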

### Pipeline

Under `scripts/` you will find an example of a full pipeline for the models on a small test corpus. Assuming you are working on a UNIX-based system, first make the scripts executable with

    chmod 755 scripts/*.sh

Then run one of

    bash -e scripts/make_results_sim.sh
    bash -e scripts/make_results_disp.sh
    bash -e scripts/make_results_wi.sh

The script `make_results_sim.sh` produces results for the similarity measures (Cosine Distance, Local Neighborhood Distance) for all vector space and alignment types except Word Injection. It first reads the gzipped test corpus in `corpora/test/corpus.txt.gz`, where each line has the following format:

    year [tab] word1 word2 word3...
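
As an illustration, a line in this format could be parsed as in the following sketch (not part of the repository; written for Python 3):

```
import gzip

# Each corpus line is "year<TAB>word1 word2 word3...".
with gzip.open('corpora/test/corpus.txt.gz', 'rt', encoding='utf-8') as f:
    for line in f:
        year, text = line.strip().split('\t', 1)
        words = text.split()
        print(year, len(words))
```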

It then produces model predictions for the targets in `testsets/test/targets.tsv`, writes them under `results/`, and correlates the predictions with the gold rank in `testsets/test/gold.tsv`. Finally, it writes the Spearman correlation between each model's predictions and the gold rank under `results/`.
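
The correlation step corresponds to a standard Spearman computation; a minimal sketch of the idea (the values below are made up, and this is not the repository's actual evaluation code):

```
from scipy.stats import spearmanr

predictions = [0.10, 0.42, 0.35, 0.80]  # hypothetical change scores per target
gold = [1, 3, 2, 4]                     # hypothetical gold ranks for the same targets
rho, p = spearmanr(predictions, gold)
print(rho, p)
```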

The scripts `make_results_disp.sh` and `make_results_wi.sh` do the same for the dispersion measures (Frequency, Types, Entropy Difference) and for the similarity measures with Word Injection, respectively.

BibTex
--------

```
@inproceedings{Schlechtwegetal19,
    title = {{A Wind of Change: Detecting and Evaluating Lexical Semantic Change across Times and Domains}},
    author = {Dominik Schlechtweg and Anna H\"{a}tty and Marco del Tredici and Sabine {Schulte im Walde}},
    booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics"
}
```

New file added in this commit (83 lines); its usage string identifies it as `ci_align.py`:

```
import sys
sys.path.append('./modules/')

from docopt import docopt
from dsm import load_pkl_files, save_pkl_files
from composes.semantic_space.space import Space
from composes.matrix.sparse_matrix import SparseMatrix
from scipy.sparse import linalg
import logging
import time


def main():
    """
    Align two sparse matrices by intersecting their columns.
    """

    # Get the arguments
    args = docopt('''Align two sparse matrices by intersecting their columns.

    Usage:
        ci_align.py [-l] <outPath1> <outPath2> <spacePrefix1> <spacePrefix2>

        <outPath1> = output path for aligned space 1
        <outPath2> = output path for aligned space 2
        <spacePrefix1> = path to pickled space1 without suffix
        <spacePrefix2> = path to pickled space2 without suffix

    Options:
        -l, --len  normalize final vectors to unit length

    ''')

    is_len = args['--len']
    spacePrefix1 = args['<spacePrefix1>']
    spacePrefix2 = args['<spacePrefix2>']
    outPath1 = args['<outPath1>']
    outPath2 = args['<outPath2>']

    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
    logging.info(__file__.upper())
    start_time = time.time()

    # Get the two matrices as spaces and intersect their columns
    # (restricting both spaces to their shared context words makes the
    # column dimensions comparable across the two corpora)
    space1 = load_pkl_files(spacePrefix1)
    space2 = load_pkl_files(spacePrefix2)
    id2row1 = space1.get_id2row()
    id2row2 = space2.get_id2row()
    id2column1 = space1.get_id2column()
    id2column2 = space2.get_id2column()
    column2id1 = space1.get_column2id()
    column2id2 = space2.get_column2id()
    intersected_columns = list(set(id2column1).intersection(id2column2))
    intersected_columns_id1 = [column2id1[item] for item in intersected_columns]
    intersected_columns_id2 = [column2id2[item] for item in intersected_columns]
    reduced_matrix1 = space1.get_cooccurrence_matrix()[:, intersected_columns_id1].get_mat()
    reduced_matrix2 = space2.get_cooccurrence_matrix()[:, intersected_columns_id2].get_mat()

    if is_len:
        # L2-normalize vectors
        l2norm1 = linalg.norm(reduced_matrix1, axis=1, ord=2)
        l2norm2 = linalg.norm(reduced_matrix2, axis=1, ord=2)
        l2norm1[l2norm1 == 0.0] = 1.0  # Convert 0 values to 1
        l2norm2[l2norm2 == 0.0] = 1.0  # Convert 0 values to 1
        reduced_matrix1 /= l2norm1.reshape(len(l2norm1), 1)
        reduced_matrix2 /= l2norm2.reshape(len(l2norm2), 1)

    # Make new spaces
    reduced_space1 = Space(SparseMatrix(reduced_matrix1), id2row1, intersected_columns)
    reduced_space2 = Space(SparseMatrix(reduced_matrix2), id2row2, intersected_columns)

    if reduced_space1.get_id2column() != reduced_space2.get_id2column():
        sys.exit('Two spaces not properly aligned!')

    # Save the Space objects in pickle format
    save_pkl_files(reduced_space1, outPath1 + '.sm', save_in_one_file=True)
    save_pkl_files(reduced_space2, outPath2 + '.sm', save_in_one_file=True)

    logging.info("--- %s seconds ---" % (time.time() - start_time))


if __name__ == '__main__':
    main()
```
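
As a usage sketch (the space prefixes and output paths below are hypothetical, and the script's location in the repository is not shown in this commit), it could be invoked as:

    python ci_align.py -l matrices/corpus1_aligned matrices/corpus2_aligned matrices/corpus1 matrices/corpus2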