Commit

Python 2 to 3, full readme, pipeline, correct VI alignment
garrafao committed Sep 1, 2019
1 parent cbd9df4 commit 9ad3ba6
Showing 121 changed files with 1,772 additions and 6,121 deletions.
5 changes: 5 additions & 0 deletions .gitignore
@@ -0,0 +1,5 @@
todo.txt
matrices
results
corpora/durel
corpora/surel
674 changes: 674 additions & 0 deletions LICENSE.txt

Large diffs are not rendered by default.

163 changes: 139 additions & 24 deletions README.md
@@ -1,53 +1,168 @@
# LSCDetection
Data Sets and Models for Evaluation of Lexical Semantic Change Detection.

If you use this software for academic research, [please cite this paper](#bibtex) and make sure you give appropriate credit to the below-mentioned software this repository strongly depends on.
If you use this software for academic research, please [cite](#bibtex) this paper:

The code heavily relies on [DISSECT](http://clic.cimec.unitn.it/composes/toolkit/introduction.html) (modules/composes). For aligning embeddings (SGNS/SVD/RI) we used [VecMap](https://github.com/artetxem/vecmap) (alignment/map_embeddings.py). We used the implementation of [gensim](https://github.com/rare-technologies/gensim) for SGNS.
- Dominik Schlechtweg, Anna Hätty, Marco del Tredici, and Sabine Schulte im Walde. 2019. [A Wind of Change: Detecting and Evaluating Lexical Semantic Change across Times and Domains](https://www.aclweb.org/anthology/papers/P/P19/P19-1072/). In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 732-746, Florence, Italy. ACL.

### Testsets
Also make sure you give appropriate credit to the below-mentioned software this repository depends on.

In `testsets/` we provide the testset versions of DURel and SURel as used in the paper.
Parts of the code rely on [DISSECT](https://github.com/composes-toolkit/dissect), [gensim](https://github.com/rare-technologies/gensim), [numpy](https://pypi.org/project/numpy/), [scikit-learn](https://pypi.org/project/scikit-learn/), [scipy](https://pypi.org/project/scipy/), [VecMap](https://github.com/artetxem/vecmap).

### Usage Note
### Usage

The scripts should be run directly from the main directory. If you wish to do otherwise, you may have to change the path you add to the path attribute in `sys.path.append('./modules/')` in the scripts. All scripts can be run directly from the command line, e.g.:
The scripts should be run directly from the main directory. If you wish to do otherwise, you may have to change the path passed to `sys.path.append('./modules/')` in the scripts. All scripts can be run directly from the command line:

python representations/count.py <windowSize> <corpDir> <outPath> <lowerBound> <upperBound>
python3 representations/count.py <corpDir> <outPath> <windowSize>

We recommend you to run the scripts with the Python Anaconda distribution (Python 2.7.15), only for VecMap Python 3 is needed. You will have to install some additional packages such as: docopt, gensim, i.a. Those that aren't available from the Anaconda installer can be installed via EasyInstall, or by running `pip install -r requirements.txt`.
e.g.

### Pipeline
python3 representations/count.py corpora/test/corpus1/ test_matrix1 1

Under `scripts/` you find an example of a full pipeline for the models on a small test corpus. Assuming you are working on a UNIX-based system, first make the scripts executable with
The usage of each script is shown when you run it with the help option `-h`, e.g.:

python3 representations/count.py -h

We recommend running the scripts within a [virtual environment](https://pypi.org/project/virtualenv/) with Python 3.7.4. Install the required packages by running `pip install -r requirements.txt`.

### Models

A standard model of LSC detection executes three consecutive steps:

1. learn semantic representations from corpora (`representations/`)
2. align representations (`alignment/`)
3. measure change (`measures/`)

As an example, consider a very simple model (CNT+CI+CD) going through these steps:

1. learn count vectors from each corpus to compare (`representations/count.py`)
2. align them by intersecting their columns (`alignment/ci_align.py`)
3. measure change with cosine distance (`measures/cd.py`)

You can apply this model to the testing data using the following commands:

python3 representations/count.py corpora/test/corpus1/ test_matrix1 1
python3 representations/count.py corpora/test/corpus2/ test_matrix2 1

python3 alignment/ci_align.py test_matrix1 test_matrix2 test_matrix1_aligned test_matrix2_aligned

python3 measures/cd.py -s testsets/test/targets.tsv test_matrix1_aligned test_matrix2_aligned test_results.tsv
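
The three steps above can be sketched in plain Python. This is only an illustration under stated assumptions, not the repository's implementation: the toy corpora, the function names and the symmetric window are hypothetical, and dictionaries stand in for the sparse matrices the scripts actually read and write.

```python
import math
from collections import Counter, defaultdict


def count_vectors(sentences, window_size):
    """Step 1: co-occurrence counts within a symmetric window (assumed)."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window_size)
            hi = min(len(tokens), i + window_size + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def intersect_columns(vectors1, vectors2):
    """Step 2: keep only the context words (columns) seen in both spaces."""
    columns1 = {c for vec in vectors1.values() for c in vec}
    columns2 = {c for vec in vectors2.values() for c in vec}
    return sorted(columns1 & columns2)


def cosine_distance(vec1, vec2, columns):
    """Step 3: cosine distance restricted to the shared columns."""
    a = [vec1[c] for c in columns]
    b = [vec2[c] for c in columns]
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else float("nan")


# Hypothetical toy corpora, one sentence per string.
corpus1 = ["the mouse ate the cheese", "the cat chased the mouse"]
corpus2 = ["click the mouse button", "a mouse ate cheese"]

space1 = count_vectors(corpus1, window_size=1)
space2 = count_vectors(corpus2, window_size=1)
shared = intersect_columns(space1, space2)
distance = cosine_distance(space1["mouse"], space2["mouse"], shared)
print(shared, distance)
```

Context words that occur in only one corpus (here, for example, `button`) are dropped by the intersection, so they do not contribute to the distance.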

__Input Format__: All the scripts in this repository can handle two types of matrix input formats:

- sparse scipy matrices stored in [npz format](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.save_npz.html)
- dense matrices stored in [word2vec plain text format](https://tedboy.github.io/nlps/generated/generated/gensim.models.Word2Vec.save_word2vec_format.html)

To learn more about how matrices are loaded and stored check out `modules/utils_.py`.
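
As a rough illustration of the two formats, the sketch below writes and reads a toy matrix with `scipy.sparse.save_npz`/`load_npz` and with a hand-written word2vec-style text file. The file names and the toy data are hypothetical, and the way the repository stores row and column labels alongside the matrix is not reproduced here.

```python
import os
import tempfile

import numpy as np
from scipy.sparse import csr_matrix, load_npz, save_npz

rows = ["mouse", "cat"]
matrix = np.array([[1.0, 0.0, 2.0], [0.0, 3.0, 0.0]])

with tempfile.TemporaryDirectory() as tmp:
    # Sparse format: save_npz stores only the matrix, so row and column
    # labels must be kept elsewhere (assumption: handled by utils_.py).
    npz_path = os.path.join(tmp, "toy_matrix.npz")
    save_npz(npz_path, csr_matrix(matrix))
    sparse_loaded = load_npz(npz_path)

    # Dense text format: a header "<rows> <dimensions>", then one line
    # per word with the word followed by its vector values.
    w2v_path = os.path.join(tmp, "toy_matrix.w2v")
    with open(w2v_path, "w", encoding="utf-8") as f:
        f.write(f"{matrix.shape[0]} {matrix.shape[1]}\n")
        for word, vector in zip(rows, matrix):
            f.write(word + " " + " ".join(str(v) for v in vector) + "\n")

    with open(w2v_path, encoding="utf-8") as f:
        n_rows, n_dims = map(int, f.readline().split())
        loaded = {}
        for line in f:
            word, *values = line.split()
            loaded[word] = np.array([float(v) for v in values])

print(sparse_loaded.shape, n_rows, n_dims, sorted(loaded))
```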

The scripts assume a corpus format of one sentence per line in UTF-8 encoded (optionally zipped) text files. You can specify either a file path or a folder. In the latter case the scripts will iterate over all files in the folder.
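
A minimal sketch of reading such a corpus follows. It assumes that "zipped" means gzip-compressed files with a `.gz` suffix and that tokens are separated by whitespace; the helper `iter_sentences` is hypothetical and not part of the repository.

```python
import gzip
import os
import tempfile


def iter_sentences(path):
    """Yield token lists from one file, or from every file in a folder."""
    if os.path.isdir(path):
        files = sorted(os.path.join(path, name) for name in os.listdir(path))
    else:
        files = [path]
    for file_path in files:
        # Assumption: compressed files are recognizable by their suffix.
        opener = gzip.open if file_path.endswith(".gz") else open
        with opener(file_path, "rt", encoding="utf-8") as f:
            for line in f:
                tokens = line.split()
                if tokens:  # skip empty lines
                    yield tokens


# Hypothetical toy folder with one gzipped and one plain-text file.
with tempfile.TemporaryDirectory() as corpus_dir:
    gz_path = os.path.join(corpus_dir, "a.txt.gz")
    with gzip.open(gz_path, "wt", encoding="utf-8") as f:
        f.write("the mouse ate the cheese\n")
    with open(os.path.join(corpus_dir, "b.txt"), "w", encoding="utf-8") as f:
        f.write("the cat chased the mouse\n")
    sentences = list(iter_sentences(corpus_dir))
    single_file = list(iter_sentences(gz_path))

print(sentences)
```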

#### Semantic Representations

|Name | Code | Type |
| --- | --- | --- |
| Count | `representations/count.py` | VSM |
| PPMI | `representations/ppmi.py` | VSM |
| SVD | `representations/svd.py` | VSM |
| RI | `representations/ri.py` | VSM |
| SGNS | `representations/sgns.py` | VSM |
| SCAN | [repository](https://github.com/ColiLea/scan) | TP |

Table: VSM=Vector Space Model, TP=Topic Model

Note that SCAN takes a slightly different corpus input format than the other models.

#### Alignment

|Name | Code | Applicability |
| --- | --- | --- |
| CI | `alignment/ci_align.py` | Count, PPMI |
| SRV | `alignment/srv_align.py` | RI |
| OP | `alignment/map_embeddings.py` | SVD, RI, SGNS |
| VI | `alignment/sgns_vi.py` | SGNS |
| WI | `alignment/wi.py` | Count, PPMI, SVD, RI, SGNS |

The script `alignment/map_embeddings.py` is drawn from [VecMap](https://github.com/artetxem/vecmap), where you can find instructions on how to use it. Find examples of how to obtain OP, OP- and OP+ under `scripts/`.
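
Assuming OP stands for orthogonal Procrustes, as in the cited paper, the underlying idea can be sketched with `scipy.linalg.orthogonal_procrustes`. This does not use VecMap and says nothing about the OP- and OP+ variants; the random toy matrices are hypothetical, and their rows are assumed to correspond to the same words in the same order.

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

rng = np.random.default_rng(0)

# Toy embeddings: 50 shared words, 5 dimensions.
space2 = rng.normal(size=(50, 5))

# Build space1 as a rotated copy of space2, so an exact mapping exists.
rotation, _ = np.linalg.qr(rng.normal(size=(5, 5)))
space1 = space2 @ rotation.T

# Find the orthogonal matrix R that minimizes ||space1 @ R - space2||.
R, _ = orthogonal_procrustes(space1, space2)
space1_aligned = space1 @ R

print(np.allclose(space1_aligned, space2))
```

Because `R` is orthogonal, the mapping preserves distances and angles within `space1`; it only rotates or reflects the space so that it becomes comparable with `space2`.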

Instead of WI, consider using the more advanced [Temporal Referencing](https://github.com/Garrafao/TemporalReferencing).

#### Measures

|Name | Code | Applicability |
| --- | --- | --- |
| CD | `measures/cd.py` | Count, PPMI, SVD, RI, SGNS |
| LND | `measures/lnd.py` | Count, PPMI, SVD, RI, SGNS |
| JSD | - | SCAN |
| FD | `measures/freq.py` | from corpus |
| TD | `measures/types.py` | Count |
| HD | `measures/entropy.py` | Count |

FD, TD and HD need additional applications of `measures/diff.py` and optionally `measures/trsf.py`.
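
The general shape of these measures (a per-corpus statistic followed by a difference) can be sketched as below. This is an assumption-laden illustration: the function names and toy data are hypothetical, and whether the repository takes absolute or signed differences, or which normalization and transformation it applies, is not specified here.

```python
import math
from collections import Counter


def frequency(word, sentences):
    """Relative frequency of `word` in a tokenized corpus."""
    tokens = [t for sentence in sentences for t in sentence]
    return tokens.count(word) / len(tokens)


def type_count(context_counts):
    """Number of distinct context words with a non-zero count."""
    return sum(1 for c in context_counts.values() if c > 0)


def entropy(context_counts):
    """Shannon entropy (in bits) of the normalized count vector."""
    total = sum(context_counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in context_counts.values() if c > 0)


# Hypothetical corpora and context counts for the target word "mouse".
corpus1 = [["the", "mouse"], ["a", "mouse", "ran", "away"]]
corpus2 = [["click", "the", "mouse", "button"]]
counts1 = Counter({"the": 2, "ate": 2})
counts2 = Counter({"the": 1, "ate": 1, "button": 1, "click": 1})

# Assumption: the change score is the absolute difference of the statistic.
frequency_difference = abs(frequency("mouse", corpus2) - frequency("mouse", corpus1))
type_difference = abs(type_count(counts2) - type_count(counts1))
entropy_difference = abs(entropy(counts2) - entropy(counts1))
print(frequency_difference, type_difference, entropy_difference)
```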

### Parameter Settings

For better performance, RI and SRV should be run with the `-a` option instead of specifying the seed number manually.

Consider the application of column mean centering to RI and SGNS embeddings before applying a change measure.
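
Column mean centering subtracts the mean of each dimension from every vector. A minimal NumPy sketch on a hypothetical toy matrix (rows are words, columns are embedding dimensions):

```python
import numpy as np

# Toy embedding matrix: one row per word, one column per dimension.
embeddings = np.array([[1.0, 2.0, 3.0],
                       [3.0, 2.0, 1.0],
                       [2.0, 5.0, 2.0]])

# Subtract each column's mean so that every dimension averages to zero.
centered = embeddings - embeddings.mean(axis=0, keepdims=True)
print(centered.mean(axis=0))
```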

Find more detailed notes on model performances and optimal parameter settings in [these papers](#bibtex).

### Evaluation

The evaluation framework of this repository is based on the comparison of a set of target words across two corpora. Hence, models can be evaluated on a triple (dataset, corpus1, corpus2), where the dataset provides gold values for the change of target words between corpus1 and corpus2.

| Dataset | Corpus 1 | Corpus 2 | Download |
| --- | --- | --- | --- |
| DURel | DTA18 | DTA19 | [Dataset](https://www.ims.uni-stuttgart.de/forschung/ressourcen/experiment-daten/durel.html), [Corpora](https://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/wocc.html) |
| SURel | SDEWAC | COOK | [Dataset](https://www.ims.uni-stuttgart.de/forschung/ressourcen/experiment-daten/surel.html), [Corpora](https://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/wocc.html) |

You don't have to download the data manually. In `testsets/` we provide the testset versions of DURel and SURel as used in Schlechtweg et al. (2019). Additionally, we provide an evaluation pipeline that downloads the corpora and evaluates the models on the above-mentioned datasets, see [pipeline](#pipeline).

#### Metrics

|Name | Code | Applicability |
| --- | --- | --- |
| Spearman correlation | `evaluation/spearman.py` | DURel, SURel |

The script `evaluation/spearman.py` outputs the Spearman correlation of the two input rankings (column 3), as well as the significance of the obtained result (column 4).
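
The correlation itself can be computed with `scipy.stats.spearmanr`, which returns the coefficient together with a p-value. The sketch below is not the repository's script; the gold scores and predictions are invented for illustration.

```python
from scipy.stats import spearmanr

# Hypothetical gold change scores and model predictions per target word.
gold = {"mouse": 0.9, "cheese": 0.1, "cat": 0.3, "button": 0.6}
predicted = {"mouse": 0.40, "cheese": 0.05, "cat": 0.20, "button": 0.10}

# Compare the two rankings over the same targets, in the same order.
targets = sorted(gold)
rho, p_value = spearmanr([gold[t] for t in targets],
                         [predicted[t] for t in targets])
print(rho, p_value)
```

Only the rank order of the values matters, so predictions on a different scale than the gold scores can be compared directly.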

Consider uploading your results for DURel as a submission to the shared task [Lexical Semantic Change Detection in German](https://codalab.lri.fr/competitions/560).

#### Pipeline

Under `scripts/` you will find an example of a full evaluation pipeline for the models on two small test corpora. Assuming you are working on a UNIX-based system, first make the scripts executable with

chmod 755 scripts/*.sh

Then run either of
Then run

bash -e scripts/run_test.sh

The script first reads the two gzipped test corpora `corpora/test/corpus1/` and `corpora/test/corpus2/`. Then it produces model predictions for the targets in `testsets/test/targets.tsv` and writes them under `results/`. It finally writes the Spearman correlation between each model's predictions and the gold rank (`testsets/test/gold.tsv`) under the respective folder in `results/`. Note that the gold values for the test data are meaningless, as they were randomly assigned.

bash -e scripts/make_results_sim.sh
bash -e scripts/make_results_disp.sh
bash -e scripts/make_results_wi.sh
We also provide scripts to reproduce the results from Schlechtweg et al. (2019), including the corpus download. To do so, run either of

The script `make_results_sim.sh` produces results for the similarity measures (Cosine Distance, Local Neighborhood Distance) for all vector space and alignment types except for Word Injection. It first reads the gzipped test corpus in `corpora/test/corpus.txt.gz` with each line in the following format:
bash -e scripts/run_durel.sh
bash -e scripts/run_surel.sh

year [tab] word1 word2 word3...
You may want to change the parameters in `scripts/parameters_durel.sh` and `scripts/parameters_surel.sh` (e.g. vector dimensionality, iterations), as running the scripts on the full parameter set will take several days and require a large amount of disk space.

It then produces model predictions for the targets in `testsets/test/targets.tsv`, writes them under `results/` and correlates the predictions with the gold rank `testsets/test/gold.tsv`. It finally writes the Spearman correlation between each model prediction and the gold rank under `results/`.
### Important Changes

The scripts `make_results_disp.sh` and `make_results_wi.sh` do similarly for the dispersion measures (Frequency, Types, Entropy Difference) and the similarity measures for Word Injection.
September 1, 2019: Python scripts were updated from Python 2 to Python 3.

BibTex
--------

```
@inproceedings{Schlechtwegetal19,
title = {{A Wind of Change: Detecting and Evaluating Lexical Semantic Change across Times and Domains}},
author = {Dominik Schlechtweg and Anna H\"{a}tty and Marco del Tredici and Sabine {Schulte im Walde}},
booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
year = "2019",
address = "Florence, Italy",
publisher = "Association for Computational Linguistics"
title = {{A Wind of Change: Detecting and Evaluating Lexical Semantic Change across Times and Domains}},
author = {Dominik Schlechtweg and Anna H\"{a}tty and Marco del Tredici and Sabine {Schulte im Walde}},
booktitle = {Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics},
year = {2019},
address = {Florence, Italy},
publisher = {Association for Computational Linguistics},
pages = {732--746}
}
```

75 changes: 28 additions & 47 deletions alignment/ci_align.py
@@ -2,12 +2,9 @@
sys.path.append('./modules/')

from docopt import docopt
from dsm import load_pkl_files, save_pkl_files
from composes.semantic_space.space import Space
from composes.matrix.sparse_matrix import SparseMatrix
from scipy.sparse import linalg
import logging
import time
from utils_ import Space


def main():
@@ -19,64 +16,48 @@ def main():
args = docopt('''Align two sparse matrices by intersecting their columns.
Usage:
ci_align.py [-l] <outPath1> <outPath2> <spacePrefix1> <spacePrefix2>
ci_align.py <matrix1> <matrix2> <outPath1> <outPath2>
<outPath1> = output path for aligned space 1
<outPath2> = output path for aligned space 2
<spacePrefix1> = path to pickled space1 without suffix
<spacePrefix2> = path to pickled space2 without suffix
Options:
-l, --len normalize final vectors to unit length
<matrix1> = path to matrix1
<matrix2> = path to matrix2
<outPath1> = output path for aligned matrix 1
<outPath2> = output path for aligned matrix 2
''')

is_len = args['--len']
spacePrefix1 = args['<spacePrefix1>']
spacePrefix2 = args['<spacePrefix2>']
matrix1 = args['<matrix1>']
matrix2 = args['<matrix2>']
outPath1 = args['<outPath1>']
outPath2 = args['<outPath2>']

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
logging.info(__file__.upper())
start_time = time.time()

# Get the two matrices as spaces and intersect their columns
space1 = load_pkl_files(spacePrefix1)
space2 = load_pkl_files(spacePrefix2)
id2row1 = space1.get_id2row()
id2row2 = space2.get_id2row()
id2column1 = space1.get_id2column()
id2column2 = space2.get_id2column()
column2id1 = space1.get_column2id()
column2id2 = space2.get_column2id()
intersected_columns = list(set(id2column1).intersection(id2column2))
# Load matrices, rows and columns
space1 = Space(matrix1)
space2 = Space(matrix2)
matrix1 = space1.matrix
rows1 = space1.rows
columns1 = space1.columns
column2id1 = space1.column2id
matrix2 = space2.matrix
rows2 = space2.rows
columns2 = space2.columns
column2id2 = space2.column2id

# Intersect columns of matrices
intersected_columns = sorted(list(set(columns1).intersection(columns2)))
intersected_columns_id1 = [column2id1[item] for item in intersected_columns]
intersected_columns_id2 = [column2id2[item] for item in intersected_columns]
reduced_matrix1 = space1.get_cooccurrence_matrix()[:, intersected_columns_id1].get_mat()
reduced_matrix2 = space2.get_cooccurrence_matrix()[:, intersected_columns_id2].get_mat()

if is_len:
# L2-normalize vectors
l2norm1 = linalg.norm(reduced_matrix1, axis=1, ord=2)
l2norm2 = linalg.norm(reduced_matrix2, axis=1, ord=2)
l2norm1[l2norm1==0.0] = 1.0 # Convert 0 values to 1
l2norm2[l2norm2==0.0] = 1.0 # Convert 0 values to 1
reduced_matrix1 /= l2norm1.reshape(len(l2norm1),1)
reduced_matrix2 /= l2norm2.reshape(len(l2norm2),1)

# Make new spaces
reduced_space1 = Space(SparseMatrix(reduced_matrix1), id2row1, intersected_columns)
reduced_space2 = Space(SparseMatrix(reduced_matrix2), id2row2, intersected_columns)

if reduced_space1.get_id2column()!=reduced_space2.get_id2column():
sys.exit('Two spaces not properly aligned!')
reduced_matrix1 = matrix1[:, intersected_columns_id1]
reduced_matrix2 = matrix2[:, intersected_columns_id2]

# Save the Space object in pickle format
save_pkl_files(reduced_space1, outPath1 + '.sm', save_in_one_file=True)
save_pkl_files(reduced_space2, outPath2 + '.sm', save_in_one_file=True)
# Save matrices
Space(matrix=reduced_matrix1, rows=rows1, columns=intersected_columns).save(outPath1)
Space(matrix=reduced_matrix2, rows=rows2, columns=intersected_columns).save(outPath2)

logging.info("--- %s seconds ---" % (time.time() - start_time))
logging.info("--- %s seconds ---" % (time.time() - start_time))


if __name__ == '__main__':