models, test corpus and datasets

Garrafao · Jun 26, 2019 · cbd9df4
1 parent ee55d2c

Showing 124 changed files with 8,377 additions and 1 deletion.
diff --git a/README.md b/README.md
@@ -1,2 +1,53 @@
 # LSCDetection
-Data Sets and Models for Evaluation of Lexical Semantic Change Detection
+Data Sets and Models for Evaluation of Lexical Semantic Change Detection.
+
+If you use this software for academic research, [please cite this paper](#bibtex) and give appropriate credit to the software listed below, on which this repository strongly depends.
+
+The code relies heavily on [DISSECT](http://clic.cimec.unitn.it/composes/toolkit/introduction.html) (modules/composes). For aligning embeddings (SGNS/SVD/RI) we used [VecMap](https://github.com/artetxem/vecmap) (alignment/map_embeddings.py). For SGNS we used the [gensim](https://github.com/rare-technologies/gensim) implementation.
+
+### Testsets
+
+In `testsets/` we provide the testset versions of DURel and SURel as used in the paper.
+
+### Usage Note
+
+The scripts should be run directly from the main directory. If you wish to run them from elsewhere, you may have to change the path passed to `sys.path.append('./modules/')` in the scripts. All scripts can be run directly from the command line, e.g.:
+
+	python representations/count.py <windowSize> <corpDir> <outPath> <lowerBound> <upperBound>
+
+We recommend running the scripts with the Anaconda Python distribution (Python 2.7.15); only VecMap requires Python 3. You will have to install some additional packages, such as docopt and gensim. Those that aren't available from the Anaconda installer can be installed via EasyInstall, or by running `pip install -r requirements.txt`.
+
+### Pipeline
+
+Under `scripts/` you will find an example of a full pipeline for the models on a small test corpus. Assuming you are working on a UNIX-based system, first make the scripts executable with
+
+	chmod 755 scripts/*.sh
+
+Then run any of
+
+	bash -e scripts/make_results_sim.sh
+	bash -e scripts/make_results_disp.sh
+	bash -e scripts/make_results_wi.sh
+
+The script `make_results_sim.sh` produces results for the similarity measures (Cosine Distance, Local Neighborhood Distance) for all vector space and alignment types except for Word Injection. It first reads the gzipped test corpus in `corpora/test/corpus.txt.gz` with each line in the following format:
+
+	year [tab] word1 word2 word3...
+
+It then produces model predictions for the targets in `testsets/test/targets.tsv`, writes them under `results/` and correlates the predictions with the gold rank in `testsets/test/gold.tsv`. It finally writes the Spearman correlation between each model prediction and the gold rank under `results/`.
+
+The scripts `make_results_disp.sh` and `make_results_wi.sh` do the same for the dispersion measures (Frequency, Types, Entropy Difference) and for the similarity measures with Word Injection, respectively.
+
+BibTeX
+--------
+
+```
+@inproceedings{Schlechtwegetal19,
+title = {{A Wind of Change: Detecting and Evaluating Lexical Semantic Change across Times and Domains}},
+author = {Dominik Schlechtweg and Anna H\"{a}tty and Marco del Tredici and Sabine {Schulte im Walde}},
+booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
+year = "2019",
+address = "Florence, Italy",
+publisher = "Association for Computational Linguistics"
+}
+```
+
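The corpus format and the evaluation step described in the README above can be sketched in a few lines. This is a minimal, dependency-free illustration written for Python 3, not the code the pipeline scripts actually run: the function names `read_corpus`, `average_ranks` and `spearman` are hypothetical, the first field is kept as a plain string, and the scores are passed as plain lists because the layout of `targets.tsv` and `gold.tsv` is not specified here.

```python
import gzip


def read_corpus(path):
    """Yield (year, tokens) for each non-empty line of a gzipped corpus.

    Assumes the README line format: year [tab] word1 word2 word3...
    The year is returned as a string (an assumption; it is not parsed).
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue  # skip empty lines
            year, _, text = line.partition("\t")
            yield year, text.split()


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank vectors.

    Undefined (division by zero) if either input is constant.
    """
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5
```

A call such as `spearman(predicted_scores, gold_scores)` expects two equally long lists in which position `i` refers to the same target word in both; aligning the two files by target is left to the caller.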
diff --git a/alignment/ci_align.py b/alignment/ci_align.py
@@ -0,0 +1,83 @@
+import sys
+sys.path.append('./modules/')
+
+from docopt import docopt
+from dsm import load_pkl_files, save_pkl_files
+from composes.semantic_space.space import Space
+from composes.matrix.sparse_matrix import SparseMatrix
+from scipy.sparse import linalg
+import logging
+import time
+
+
+def main():
+    """
+    Align two sparse matrices by intersecting their columns.
+    """
+
+    # Get the arguments
+    args = docopt('''Align two sparse matrices by intersecting their columns.
+
+    Usage:
+        ci_align.py [-l] <outPath1> <outPath2> <spacePrefix1> <spacePrefix2>
+
+        <outPath1> = output path for aligned space 1
+        <outPath2> = output path for aligned space 2
+        <spacePrefix1> = path to pickled space1 without suffix
+        <spacePrefix2> = path to pickled space2 without suffix
+
+    Options:
+        -l, --len   normalize final vectors to unit length
+    
+    ''')
+
+    is_len = args['--len']
+    spacePrefix1 = args['<spacePrefix1>']
+    spacePrefix2 = args['<spacePrefix2>']
+    outPath1 = args['<outPath1>']
+    outPath2 = args['<outPath2>']
+
+    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
+    logging.info(__file__.upper())
+    start_time = time.time()    
+
+    # Get the two matrices as spaces and intersect their columns
+    space1 = load_pkl_files(spacePrefix1)
+    space2 = load_pkl_files(spacePrefix2)
+    id2row1 = space1.get_id2row()
+    id2row2 = space2.get_id2row()
+    id2column1 = space1.get_id2column()
+    id2column2 = space2.get_id2column()
+    column2id1 = space1.get_column2id()
+    column2id2 = space2.get_column2id()
+    intersected_columns = list(set(id2column1).intersection(id2column2))
+    intersected_columns_id1 = [column2id1[item] for item in intersected_columns]
+    intersected_columns_id2 = [column2id2[item] for item in intersected_columns]
+    reduced_matrix1 = space1.get_cooccurrence_matrix()[:, intersected_columns_id1].get_mat()
+    reduced_matrix2 = space2.get_cooccurrence_matrix()[:, intersected_columns_id2].get_mat()
+
+    if is_len:
+        # L2-normalize vectors
+        l2norm1 = linalg.norm(reduced_matrix1, axis=1, ord=2)
+        l2norm2 = linalg.norm(reduced_matrix2, axis=1, ord=2)
+        l2norm1[l2norm1==0.0] = 1.0 # Convert 0 values to 1
+        l2norm2[l2norm2==0.0] = 1.0 # Convert 0 values to 1
+        reduced_matrix1 /= l2norm1.reshape(len(l2norm1),1)
+        reduced_matrix2 /= l2norm2.reshape(len(l2norm2),1)
+
+    # Make new spaces    
+    reduced_space1 = Space(SparseMatrix(reduced_matrix1), id2row1, intersected_columns)
+    reduced_space2 = Space(SparseMatrix(reduced_matrix2), id2row2, intersected_columns)
+
+    if reduced_space1.get_id2column()!=reduced_space2.get_id2column():
+        sys.exit('Two spaces not properly aligned!')
+
+    # Save the Space object in pickle format
+    save_pkl_files(reduced_space1, outPath1 + '.sm', save_in_one_file=True)
+    save_pkl_files(reduced_space2, outPath2 + '.sm', save_in_one_file=True)
+
+    logging.info("--- %s seconds ---" % (time.time() - start_time))                   
+
+
+if __name__ == '__main__':
+    main()
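The idea behind `ci_align.py` (keep only the columns both spaces share, then optionally scale rows to unit length) can be illustrated without DISSECT. The following is a minimal sketch under stated assumptions, written for Python 3: sparse rows are modelled as plain dicts instead of `Space` and `SparseMatrix` objects, the names `intersect_columns` and `l2_normalize` are hypothetical, and the column vocabulary is derived from the non-zero entries rather than from `get_id2column()`.

```python
import math


def intersect_columns(rows1, rows2):
    """Restrict two sparse row collections to their shared column labels.

    rows1, rows2: dict mapping a row label to a dict {column label: value}.
    Returns both reduced collections and the list of shared column labels.
    """
    columns1 = {col for row in rows1.values() for col in row}
    columns2 = {col for row in rows2.values() for col in row}
    shared = sorted(columns1 & columns2)  # sorted for a reproducible order
    shared_set = set(shared)

    def restrict(rows):
        # Drop every entry whose column is missing from the other space.
        return {
            label: {col: val for col, val in row.items() if col in shared_set}
            for label, row in rows.items()
        }

    return restrict(rows1), restrict(rows2), shared


def l2_normalize(rows):
    """Scale every row to unit L2 length; all-zero rows stay unchanged."""
    normalized = {}
    for label, row in rows.items():
        norm = math.sqrt(sum(val * val for val in row.values()))
        if norm == 0.0:
            norm = 1.0  # same guard as in ci_align.py: avoid dividing by zero
        normalized[label] = {col: val / norm for col, val in row.items()}
    return normalized
```

Sorting the shared labels is a design choice of this sketch: it makes the column order reproducible across runs, whereas `list(set(...))` only guarantees that both spaces receive the same order within one run.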