models, test corpus and datasets
garrafao committed Jun 26, 2019
1 parent ee55d2c commit cbd9df4
Showing 124 changed files with 8,377 additions and 1 deletion.
53 changes: 52 additions & 1 deletion README.md
@@ -1,2 +1,53 @@
# LSCDetection
Data Sets and Models for Evaluation of Lexical Semantic Change Detection
Data Sets and Models for Evaluation of Lexical Semantic Change Detection.

If you use this software for academic research, [please cite this paper](#bibtex) and give appropriate credit to the software mentioned below, on which this repository strongly depends.

The code heavily relies on [DISSECT](http://clic.cimec.unitn.it/composes/toolkit/introduction.html) (modules/composes). For aligning embeddings (SGNS/SVD/RI) we used [VecMap](https://github.com/artetxem/vecmap) (alignment/map_embeddings.py). We used the implementation of [gensim](https://github.com/rare-technologies/gensim) for SGNS.

### Testsets

In `testsets/` we provide the testset versions of DURel and SURel as used in the paper.

### Usage Note

The scripts should be run directly from the main directory. If you wish to run them from elsewhere, you may have to change the path passed to `sys.path.append('./modules/')` in the scripts. All scripts can be run directly from the command line, e.g.:

    python representations/count.py <windowSize> <corpDir> <outPath> <lowerBound> <upperBound>

We recommend running the scripts with the Anaconda distribution of Python (Python 2.7.15); Python 3 is needed only for VecMap. You will have to install some additional packages, such as docopt and gensim. Those that aren't available from the Anaconda installer can be installed via EasyInstall, or by running `pip install -r requirements.txt`.

### Pipeline

Under `scripts/` you will find an example of a full pipeline for the models on a small test corpus. Assuming you are working on a UNIX-based system, first make the scripts executable with

    chmod 755 scripts/*.sh

Then run any of

    bash -e scripts/make_results_sim.sh
    bash -e scripts/make_results_disp.sh
    bash -e scripts/make_results_wi.sh

The script `make_results_sim.sh` produces results for the similarity measures (Cosine Distance, Local Neighborhood Distance) for all vector space and alignment types except for Word Injection. It first reads the gzipped test corpus in `corpora/test/corpus.txt.gz` with each line in the following format:

    year [tab] word1 word2 word3...
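
The line format above can be read with a few lines of Python. The following is a minimal sketch, not code from this repository: the function name `read_corpus` and the two demo sentences are made up for illustration, and the year is kept as a string because the README does not say how it is typed. It is written for Python 3.

```python
import gzip
import os
import tempfile


def read_corpus(path):
    """Yield (year, tokens) for each non-empty line of a gzipped corpus file."""
    with gzip.open(path, 'rb') as handle:
        for raw in handle:
            line = raw.decode('utf-8').rstrip('\n')
            if not line:
                continue
            # Split on the first tab only: year on the left, sentence on the right.
            year, _, text = line.partition('\t')
            yield year, text.split()


# Demo on a throwaway two-line corpus (made-up content, not the repository's test corpus).
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, 'corpus.txt.gz')
with gzip.open(demo_path, 'wb') as out:
    out.write(u'1800\tthe old word\n1900\ta new word sense\n'.encode('utf-8'))

sentences = list(read_corpus(demo_path))
print(sentences)
```

Splitting on the first tab only keeps the sentence intact even if a token happens to contain a tab-like separator later in the line.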

It then produces model predictions for the targets in `testsets/test/targets.tsv`, writes them under `results/` and correlates the predictions with the gold rank in `testsets/test/gold.tsv`. Finally, it writes the Spearman correlation between each model prediction and the gold rank under `results/`.
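
To illustrate the evaluation step, here is a minimal pure-Python sketch of Spearman's rank correlation (Pearson correlation computed on average ranks). It is not the repository's evaluation script; the function names and the example scores are hypothetical. In practice `scipy.stats.spearmanr` computes the same quantity.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equally long sequences (undefined if one is constant)."""
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(predictions, gold):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    return pearson(average_ranks(predictions), average_ranks(gold))


# Hypothetical change scores for four target words, in the same word order.
predicted = [0.10, 0.40, 0.35, 0.80]
gold = [1.0, 3.0, 2.0, 4.0]
rho = spearman(predicted, gold)
print(rho)
```

Only the ordering of the scores matters here, so predictions on any scale can be compared against the gold rank.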

The scripts `make_results_disp.sh` and `make_results_wi.sh` do the same for the dispersion measures (Frequency, Types, Entropy Difference) and for the similarity measures with Word Injection, respectively.
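
Of the similarity measures named above, Cosine Distance is the simplest to illustrate. It is commonly defined as one minus the cosine similarity of two vectors, here a target word's vectors from two aligned spaces. The sketch below uses plain Python lists and a hypothetical function name; it is an illustration under that standard definition, not the implementation used by the scripts.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cos(u, v) for two equally long, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError('cosine distance is undefined for a zero vector')
    return 1.0 - dot / (norm_u * norm_v)


# Made-up vectors for one word in two time periods.
same_direction = cosine_distance([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```

A larger distance between the two vectors of a word is read as a stronger indication of change, which is why the vector spaces have to be aligned first.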

BibTeX
--------

```
@inproceedings{Schlechtwegetal19,
title = {{A Wind of Change: Detecting and Evaluating Lexical Semantic Change across Times and Domains}},
author = {Dominik Schlechtweg and Anna H\"{a}tty and Marco del Tredici and Sabine {Schulte im Walde}},
booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
year = "2019",
address = "Florence, Italy",
publisher = "Association for Computational Linguistics"
}
```

83 changes: 83 additions & 0 deletions alignment/ci_align.py
@@ -0,0 +1,83 @@
import sys
sys.path.append('./modules/')

from docopt import docopt
from dsm import load_pkl_files, save_pkl_files
from composes.semantic_space.space import Space
from composes.matrix.sparse_matrix import SparseMatrix
from scipy.sparse import linalg
import logging
import time


def main():
    """
    Align two sparse matrices by intersecting their columns.
    """

    # Get the arguments
    args = docopt('''Align two sparse matrices by intersecting their columns.

    Usage:
        ci_align.py [-l] <outPath1> <outPath2> <spacePrefix1> <spacePrefix2>

        <outPath1> = output path for aligned space 1
        <outPath2> = output path for aligned space 2
        <spacePrefix1> = path to pickled space1 without suffix
        <spacePrefix2> = path to pickled space2 without suffix

    Options:
        -l, --len   normalize final vectors to unit length

    ''')

    is_len = args['--len']
    spacePrefix1 = args['<spacePrefix1>']
    spacePrefix2 = args['<spacePrefix2>']
    outPath1 = args['<outPath1>']
    outPath2 = args['<outPath2>']

    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
    logging.info(__file__.upper())
    start_time = time.time()

    # Get the two matrices as spaces and intersect their columns
    space1 = load_pkl_files(spacePrefix1)
    space2 = load_pkl_files(spacePrefix2)
    id2row1 = space1.get_id2row()
    id2row2 = space2.get_id2row()
    id2column1 = space1.get_id2column()
    id2column2 = space2.get_id2column()
    column2id1 = space1.get_column2id()
    column2id2 = space2.get_column2id()
    intersected_columns = list(set(id2column1).intersection(id2column2))
    intersected_columns_id1 = [column2id1[item] for item in intersected_columns]
    intersected_columns_id2 = [column2id2[item] for item in intersected_columns]
    reduced_matrix1 = space1.get_cooccurrence_matrix()[:, intersected_columns_id1].get_mat()
    reduced_matrix2 = space2.get_cooccurrence_matrix()[:, intersected_columns_id2].get_mat()

    if is_len:
        # L2-normalize vectors
        l2norm1 = linalg.norm(reduced_matrix1, axis=1, ord=2)
        l2norm2 = linalg.norm(reduced_matrix2, axis=1, ord=2)
        l2norm1[l2norm1==0.0] = 1.0 # Convert 0 values to 1
        l2norm2[l2norm2==0.0] = 1.0 # Convert 0 values to 1
        reduced_matrix1 /= l2norm1.reshape(len(l2norm1),1)
        reduced_matrix2 /= l2norm2.reshape(len(l2norm2),1)

    # Make new spaces
    reduced_space1 = Space(SparseMatrix(reduced_matrix1), id2row1, intersected_columns)
    reduced_space2 = Space(SparseMatrix(reduced_matrix2), id2row2, intersected_columns)

    if reduced_space1.get_id2column()!=reduced_space2.get_id2column():
        sys.exit('Two spaces not properly aligned!')

    # Save the Space object in pickle format
    save_pkl_files(reduced_space1, outPath1 + '.sm', save_in_one_file=True)
    save_pkl_files(reduced_space2, outPath2 + '.sm', save_in_one_file=True)

    logging.info("--- %s seconds ---" % (time.time() - start_time))


if __name__ == '__main__':
    main()
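
The column-intersection idea in `ci_align.py` can be shown without DISSECT or SciPy. The following is a simplified sketch on dense lists of lists: the helper names `intersect_columns` and `l2_normalize` and the toy matrices are hypothetical, and the shared columns are sorted for a deterministic order, whereas the script above takes them in whatever order the set intersection yields.

```python
def intersect_columns(rows1, columns1, rows2, columns2):
    """Restrict two row-major matrices to their shared column labels, in the same order."""
    shared = sorted(set(columns1) & set(columns2))
    index1 = {label: i for i, label in enumerate(columns1)}
    index2 = {label: i for i, label in enumerate(columns2)}
    reduced1 = [[row[index1[c]] for c in shared] for row in rows1]
    reduced2 = [[row[index2[c]] for c in shared] for row in rows2]
    return reduced1, reduced2, shared


def l2_normalize(rows):
    """Scale each row to unit length; all-zero rows are left unchanged."""
    normalized = []
    for row in rows:
        norm = sum(x * x for x in row) ** 0.5
        if norm == 0.0:
            norm = 1.0  # avoid division by zero, as the script does
        normalized.append([x / norm for x in row])
    return normalized


# Toy co-occurrence counts with partly overlapping context columns.
columns1 = ['cat', 'dog', 'tree']
rows1 = [[1, 2, 3], [0, 0, 5]]
columns2 = ['tree', 'bird', 'cat']
rows2 = [[4, 9, 3], [0, 7, 0]]

reduced1, reduced2, shared = intersect_columns(rows1, columns1, rows2, columns2)
unit2 = l2_normalize(reduced2)
print(shared, reduced1, reduced2, unit2)
```

After this step both matrices have identical column labels in identical order, so the rows for the same word in the two spaces can be compared dimension by dimension.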