Python 2 to 3, full readme, pipeline, correct VI alignment

Garrafao · Sep 1, 2019 · 9ad3ba6
1 parent cbd9df4
commit 9ad3ba6

Showing 121 changed files with 1,772 additions and 6,121 deletions.
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,5 @@
+todo.txt
+matrices
+results
+corpora/durel
+corpora/surel
diff --git a/LICENSE.txt b/LICENSE.txt
diff --git a/README.md b/README.md
@@ -1,53 +1,168 @@
 # LSCDetection
 Data Sets and Models for Evaluation of Lexical Semantic Change Detection.
 
-If you use this software for academic research, [please cite this paper](#bibtex) and make sure you give appropriate credit to the below-mentioned software this repository strongly depends on.
+If you use this software for academic research, please [cite](#bibtex) this paper:
 
-The code heavily relies on [DISSECT](http://clic.cimec.unitn.it/composes/toolkit/introduction.html) (modules/composes). For aligning embeddings (SGNS/SVD/RI) we used [VecMap](https://github.com/artetxem/vecmap) (alignment/map_embeddings.py). We used the implementation of [gensim](https://github.com/rare-technologies/gensim) for SGNS.
+- Dominik Schlechtweg, Anna Hätty, Marco del Tredici, and Sabine Schulte im Walde. 2019. [A Wind of Change: Detecting and Evaluating Lexical Semantic Change across Times and Domains](https://www.aclweb.org/anthology/papers/P/P19/P19-1072/). In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 732-746, Florence, Italy. ACL.
 
-### Testsets
+Also make sure you give appropriate credit to the below-mentioned software this repository depends on.
 
-In `testsets/` we provide the testset versions of DURel and SURel as used in the paper.
+Parts of the code rely on [DISSECT](https://github.com/composes-toolkit/dissect), [gensim](https://github.com/rare-technologies/gensim), [numpy](https://pypi.org/project/numpy/), [scikit-learn](https://pypi.org/project/scikit-learn/), [scipy](https://pypi.org/project/scipy/), [VecMap](https://github.com/artetxem/vecmap).
 
-### Usage Note
+### Usage
 
-The scripts should be run directly from the main directory. If you wish to do otherwise, you may have to change the path you add to the path attribute in `sys.path.append('./modules/')` in the scripts. All scripts can be run directly from the command line, e.g.:
+The scripts should be run directly from the main directory. If you wish to do otherwise, you may have to change the path passed to `sys.path.append('./modules/')` in the scripts. All scripts can be run directly from the command line:
 
-	python representations/count.py <windowSize> <corpDir> <outPath> <lowerBound> <upperBound>
+	python3 representations/count.py <corpDir> <outPath> <windowSize>
 
-We recommend you to run the scripts with the Python Anaconda distribution (Python 2.7.15), only for VecMap Python 3 is needed. You will have to install some additional packages such as: docopt, gensim, i.a. Those that aren't available from the Anaconda installer can be installed via EasyInstall, or by running `pip install -r requirements.txt`. 
+e.g.
 
-### Pipeline
+	python3 representations/count.py corpora/test/corpus1/ test_matrix1 1
 
-Under `scripts/` you find an example of a full pipeline for the models on a small test corpus. Assuming you are working on a UNIX-based system, first make the scripts executable with
+To see the usage of each script, run it with the help option `-h`, e.g.:
+
+	python3 representations/count.py -h
+
+We recommend running the scripts within a [virtual environment](https://pypi.org/project/virtualenv/) with Python 3.7.4. Install the required packages by running `pip install -r requirements.txt`.
+
+### Models
+
+A standard model of LSC detection executes three consecutive steps:
+
+1. learn semantic representations from corpora (`representations/`)
+2. align representations (`alignment/`)
+3. measure change (`measures/`)
+
+As an example, consider a very simple model (CNT+CI+CD) going through these steps:
+
+1. learn count vectors from each corpus to compare (`representations/count.py`)
+2. align them by intersecting their columns (`alignment/ci_align.py`)
+3. measure change with cosine distance (`measures/cd.py`)
+
+You can apply this model to the testing data using the following commands:
+
+        python3 representations/count.py corpora/test/corpus1/ test_matrix1 1
+        python3 representations/count.py corpora/test/corpus2/ test_matrix2 1
+
+        python3 alignment/ci_align.py test_matrix1 test_matrix2 test_matrix1_aligned test_matrix2_aligned
+
+        python3 measures/cd.py -s testsets/test/targets.tsv test_matrix1_aligned test_matrix2_aligned test_results.tsv
+
+__Input Format__: All the scripts in this repository can handle two types of matrix input formats:
+
+- sparse scipy matrices stored in [npz format](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.save_npz.html)
+- dense matrices stored in [word2vec plain text format](https://tedboy.github.io/nlps/generated/generated/gensim.models.Word2Vec.save_word2vec_format.html)
+
+To learn more about how matrices are loaded and stored check out `modules/utils_.py`.
+
+The scripts assume a corpus format of one sentence per line in UTF-8 encoded (optionally zipped) text files. You can specify either a file path or a folder. In the latter case the scripts will iterate over all files in the folder.
+
+#### Semantic Representations
+
+|Name | Code | Type |
+| --- | --- | --- |
+| Count | `representations/count.py` | VSM |
+| PPMI | `representations/ppmi.py` | VSM |
+| SVD | `representations/svd.py` | VSM |
+| RI | `representations/ri.py` | VSM |
+| SGNS | `representations/sgns.py` | VSM |
+| SCAN | [repository](https://github.com/ColiLea/scan) | TP |
+
+Table: VSM=Vector Space Model, TP=Topic Model
+
+Note that SCAN takes a slightly different corpus input format than the other models.
+
+#### Alignment
+
+|Name | Code | Applicability |
+| --- | --- | --- |
+| CI | `alignment/ci_align.py` | Count, PPMI |
+| SRV | `alignment/srv_align.py` | RI |
+| OP | `alignment/map_embeddings.py` | SVD, RI, SGNS |
+| VI | `alignment/sgns_vi.py` | SGNS |
+| WI | `alignment/wi.py` | Count, PPMI, SVD, RI, SGNS |
+
+The script `alignment/map_embeddings.py` is drawn from [VecMap](https://github.com/artetxem/vecmap), where you can find instructions on how to use it. Find examples of how to obtain OP, OP- and OP+ under `scripts/`.
+
+Instead of WI, consider using the more advanced [Temporal Referencing](https://github.com/Garrafao/TemporalReferencing).
+
+#### Measures
+
+|Name | Code | Applicability |
+| --- | --- | --- |
+| CD | `measures/cd.py` | Count, PPMI, SVD, RI, SGNS |
+| LND | `measures/lnd.py` | Count, PPMI, SVD, RI, SGNS |
+| JSD | - | SCAN |
+| FD | `measures/freq.py` | from corpus |
+| TD | `measures/types.py` | Count |
+| HD | `measures/entropy.py` | Count |
+
+FD, TD and HD need additional applications of `measures/diff.py` and optionally `measures/trsf.py`.
+
+### Parameter Settings
+
+For better performance, RI and SRV should be run with the `-a` option instead of specifying the seed number manually.
+
+Consider the application of column mean centering to RI and SGNS embeddings before applying a change measure.
+
+Find more detailed notes on model performances and optimal parameter settings in [these papers](#bibtex).
+
+### Evaluation
+
+The evaluation framework of this repository is based on the comparison of a set of target words across two corpora. Hence, models can be evaluated on a triple (dataset, corpus1, corpus2), where the dataset provides gold values for the change of target words between corpus1 and corpus2.
+
+| Dataset | Corpus 1 | Corpus 2 | Download |
+| --- | --- | --- | --- |
+| DURel | DTA18 | DTA19  | [Dataset](https://www.ims.uni-stuttgart.de/forschung/ressourcen/experiment-daten/durel.html), [Corpora](https://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/wocc.html) |
+| SURel | SDEWAC | COOK | [Dataset](https://www.ims.uni-stuttgart.de/forschung/ressourcen/experiment-daten/surel.html), [Corpora](https://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/wocc.html) |
+
+You don't have to download the data manually. In `testsets/` we provide the testset versions of DURel and SURel as used in Schlechtweg et al. (2019). Additionally, we provide an evaluation pipeline that downloads the corpora and evaluates the models on the above-mentioned datasets, see [pipeline](#pipeline).
+
+#### Metrics
+
+|Name | Code | Applicability |
+| --- | --- | --- |
+| Spearman correlation | `evaluation/spearman.py` | DURel, SURel |
+
+The script `evaluation/spearman.py` outputs the Spearman correlation of the two input rankings (column 3), as well as the significance of the obtained result (column 4).
+
+Consider uploading your results for DURel as a submission to the shared task [Lexical Semantic Change Detection in German](https://codalab.lri.fr/competitions/560).
+
+#### Pipeline
+
+Under `scripts/` you will find an example of a full evaluation pipeline for the models on two small test corpora. Assuming you are working on a UNIX-based system, first make the scripts executable with
 
 	chmod 755 scripts/*.sh
 
-Then run either of
+Then run
+
+	bash -e scripts/run_test.sh
+
+The script first reads the two gzipped test corpora `corpora/test/corpus1/` and `corpora/test/corpus2/`. Then it produces model predictions for the targets in `testsets/test/targets.tsv` and writes them under `results/`. It finally writes the Spearman correlation between each model's predictions and the gold rank (`testsets/test/gold.tsv`) under the respective folder in `results/`. Note that the gold values for the test data are meaningless, as they were randomly assigned.
 
-	bash -e scripts/make_results_sim.sh
-	bash -e scripts/make_results_disp.sh
-	bash -e scripts/make_results_wi.sh
+We also provide scripts to reproduce the results from Schlechtweg et al. (2019), including the corpus download. To do so, run either of
 
-The script `make_results_sim.sh` produces results for the similarity measures (Cosine Distance, Local Neighborhood Distance) for all vector space and alignment types except for Word Injection. It first reads the gzipped test corpus in `corpora/test/corpus.txt.gz` with each line in the following format:
+	bash -e scripts/run_durel.sh
+	bash -e scripts/run_surel.sh
 
-	year [tab] word1 word2 word3...
+You may want to change the parameters in `scripts/parameters_durel.sh` and `scripts/parameters_surel.sh` (e.g. vector dimensionality, iterations), as running the scripts on the full parameter set will take several days and require a large amount of disk space.
 
-It then produces model predictions for the targets in `testsets/test/targets.tsv`, writes them under `results/` and correlates the predictions with the gold rank `testsets/test/gold.tsv`. It finally writes the Spearman correlation between each model prediction and the gold rank under `results/`.
+### Important Changes
 
-The scripts `make_results_disp.sh` and `make_results_wi.sh` do similarly for the dispersion measures (Frequency, Types, Entropy Difference) and the similarity measures for Word Injection.
+September 1, 2019: Python scripts were updated from Python 2 to Python 3.
 
 BibTex
 --------
 
 ```
 @inproceedings{Schlechtwegetal19,
-title = {{A Wind of Change: Detecting and Evaluating Lexical Semantic Change across Times and Domains}},
-author = {Dominik Schlechtweg and Anna H\"{a}tty and Marco del Tredici and Sabine {Schulte im Walde}},
-booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
-year = "2019",
-address = "Florence, Italy",
-publisher = "Association for Computational Linguistics"
+	title = {{A Wind of Change: Detecting and Evaluating Lexical Semantic Change across Times and Domains}},
+	author = {Dominik Schlechtweg and Anna H\"{a}tty and Marco del Tredici and Sabine {Schulte im Walde}},
+    booktitle = {Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics},
+	year =  {2019},
+	address =  {Florence, Italy},
+	publisher =  {Association for Computational Linguistics},
+	pages     = {732--746}
 }
 ```
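
Note on the CNT+CI+CD model described in the README diff above: the column intersection (CI) and cosine distance (CD) steps can be illustrated without the repository's `Space` class. The following is a minimal sketch, assuming plain scipy sparse matrices and Python lists of column labels. The helper names `intersect_columns` and `cosine_distance` and the toy data are hypothetical and are not part of this repository; the actual scripts load and save matrices through `modules/utils_.py`.

```python
import numpy as np
from scipy.sparse import csr_matrix


def intersect_columns(matrix1, columns1, matrix2, columns2):
    """Reduce both matrices to their shared columns, in the same sorted order.

    Hypothetical helper: mirrors the idea of CI alignment, not the repository code.
    """
    column2id1 = {col: i for i, col in enumerate(columns1)}
    column2id2 = {col: i for i, col in enumerate(columns2)}
    shared = sorted(set(columns1).intersection(columns2))
    ids1 = [column2id1[col] for col in shared]
    ids2 = [column2id2[col] for col in shared]
    return matrix1[:, ids1], matrix2[:, ids2], shared


def cosine_distance(row1, row2):
    """Cosine distance between two 1 x n sparse row vectors (NaN if either is all zeros)."""
    v1 = row1.toarray().ravel()
    v2 = row2.toarray().ravel()
    denom = np.linalg.norm(v1) * np.linalg.norm(v2)
    if denom == 0.0:
        return float("nan")
    return 1.0 - float(v1 @ v2) / denom


# Toy count matrices: rows are target words, columns are context words.
# Both matrices list the targets in the same row order (an assumption of this sketch).
targets = ["t1", "t2"]
columns1 = ["a", "b", "c"]
columns2 = ["c", "a", "d"]
matrix1 = csr_matrix(np.array([[1.0, 0.0, 2.0], [3.0, 5.0, 0.0]]))
matrix2 = csr_matrix(np.array([[4.0, 2.0, 7.0], [1.0, 0.0, 1.0]]))

aligned1, aligned2, shared = intersect_columns(matrix1, columns1, matrix2, columns2)

# After alignment, column j means the same context word in both matrices,
# so rows can be compared directly.
distances = {
    target: cosine_distance(aligned1[i], aligned2[i])
    for i, target in enumerate(targets)
}
print(shared)
print(distances)
```

In this toy example `t1` has parallel vectors over the shared columns (distance close to 0), while `t2` has orthogonal ones (distance 1). Sorting the shared columns keeps the column order deterministic across runs.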
 
diff --git a/alignment/ci_align.py b/alignment/ci_align.py
@@ -2,12 +2,9 @@
 sys.path.append('./modules/')
 
 from docopt import docopt
-from dsm import load_pkl_files, save_pkl_files
-from composes.semantic_space.space import Space
-from composes.matrix.sparse_matrix import SparseMatrix
-from scipy.sparse import linalg
 import logging
 import time
+from utils_ import Space
 
 
 def main():
@@ -19,64 +16,48 @@ def main():
     args = docopt('''Align two sparse matrices by intersecting their columns.
 
     Usage:
-        ci_align.py [-l] <outPath1> <outPath2> <spacePrefix1> <spacePrefix2>
+        ci_align.py <matrix1> <matrix2> <outPath1> <outPath2>
 
-        <outPath1> = output path for aligned space 1
-        <outPath2> = output path for aligned space 2
-        <spacePrefix1> = path to pickled space1 without suffix
-        <spacePrefix2> = path to pickled space2 without suffix
-
-    Options:
-        -l, --len   normalize final vectors to unit length
+        <matrix1> = path to matrix1
+        <matrix2> = path to matrix2
+        <outPath1> = output path for aligned matrix 1
+        <outPath2> = output path for aligned matrix 2
     
     ''')
 
-    is_len = args['--len']
-    spacePrefix1 = args['<spacePrefix1>']
-    spacePrefix2 = args['<spacePrefix2>']
+    matrix1 = args['<matrix1>']
+    matrix2 = args['<matrix2>']
     outPath1 = args['<outPath1>']
     outPath2 = args['<outPath2>']
 
     logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
     logging.info(__file__.upper())
     start_time = time.time()    
 
-    # Get the two matrices as spaces and intersect their columns
-    space1 = load_pkl_files(spacePrefix1)
-    space2 = load_pkl_files(spacePrefix2)
-    id2row1 = space1.get_id2row()
-    id2row2 = space2.get_id2row()
-    id2column1 = space1.get_id2column()
-    id2column2 = space2.get_id2column()
-    column2id1 = space1.get_column2id()
-    column2id2 = space2.get_column2id()
-    intersected_columns = list(set(id2column1).intersection(id2column2))
+    # Load matrices, rows and columns
+    space1 = Space(matrix1)
+    space2 = Space(matrix2)
+    matrix1 = space1.matrix
+    rows1 = space1.rows
+    columns1 = space1.columns
+    column2id1 = space1.column2id
+    matrix2 = space2.matrix
+    rows2 = space2.rows
+    columns2 = space2.columns
+    column2id2 = space2.column2id
+
+    # Intersect columns of matrices
+    intersected_columns = sorted(list(set(columns1).intersection(columns2)))
     intersected_columns_id1 = [column2id1[item] for item in intersected_columns]
     intersected_columns_id2 = [column2id2[item] for item in intersected_columns]
-    reduced_matrix1 = space1.get_cooccurrence_matrix()[:, intersected_columns_id1].get_mat()
-    reduced_matrix2 = space2.get_cooccurrence_matrix()[:, intersected_columns_id2].get_mat()
-
-    if is_len:
-        # L2-normalize vectors
-        l2norm1 = linalg.norm(reduced_matrix1, axis=1, ord=2)
-        l2norm2 = linalg.norm(reduced_matrix2, axis=1, ord=2)
-        l2norm1[l2norm1==0.0] = 1.0 # Convert 0 values to 1
-        l2norm2[l2norm2==0.0] = 1.0 # Convert 0 values to 1
-        reduced_matrix1 /= l2norm1.reshape(len(l2norm1),1)
-        reduced_matrix2 /= l2norm2.reshape(len(l2norm2),1)
-
-    # Make new spaces    
-    reduced_space1 = Space(SparseMatrix(reduced_matrix1), id2row1, intersected_columns)
-    reduced_space2 = Space(SparseMatrix(reduced_matrix2), id2row2, intersected_columns)
-
-    if reduced_space1.get_id2column()!=reduced_space2.get_id2column():
-        sys.exit('Two spaces not properly aligned!')
+    reduced_matrix1 = matrix1[:, intersected_columns_id1]
+    reduced_matrix2 = matrix2[:, intersected_columns_id2]
 
-    # Save the Space object in pickle format
-    save_pkl_files(reduced_space1, outPath1 + '.sm', save_in_one_file=True)
-    save_pkl_files(reduced_space2, outPath2 + '.sm', save_in_one_file=True)
+    # Save matrices
+    Space(matrix=reduced_matrix1, rows=rows1, columns=intersected_columns).save(outPath1)
+    Space(matrix=reduced_matrix2, rows=rows2, columns=intersected_columns).save(outPath2)
 
-    logging.info("--- %s seconds ---" % (time.time() - start_time))                   
+    logging.info("--- %s seconds ---" % (time.time() - start_time))
 
 
 if __name__ == '__main__':