add semcor pipeline, dimensionality parameter, epoch parameter
garrafao committed Jan 5, 2020
1 parent 7b72046 commit 361c173
Showing 47 changed files with 521 additions and 200 deletions.
8 changes: 7 additions & 1 deletion .gitignore
@@ -3,6 +3,12 @@ matrices
results
corpora/durel
corpora/surel
corpora/semcor_lsc
testsets/semcor_lsc
modules/__pycache__
modules/*.pyc
-update-git.sh
+update-git.sh
evaluation/average_results.py
evaluation/average_results.sh
evaluation/average_results1.py
evaluation/average_results1.sh
38 changes: 25 additions & 13 deletions README.md
@@ -77,7 +77,7 @@ Table: VSM=Vector Space Model, TPM=Topic Model
| CI | `alignment/ci_align.py` | Count, PPMI | |
| SRV | `alignment/srv_align.py` | RI | - use `-a` for good performance <br> - consider using the efficient and more powerful [TRIPY](https://github.com/Garrafao/TRIPY) |
| OP | `alignment/map_embeddings.py` | SVD, RI, SGNS | - drawn from [VecMap](https://github.com/artetxem/vecmap) <br> - for OP- and OP+ see `scripts/` |
-| VI | `alignment/sgns_vi.py` | SGNS | - updated 27/12/19 (see script for details) |
+| VI | `alignment/sgns_vi.py` | SGNS | - bug fixes 27/12/19 (see script for details) |
| WI | `alignment/wi.py` | Count, PPMI, SVD, RI, SGNS | - consider using the more advanced [Temporal Referencing](https://github.com/Garrafao/TemporalReferencing) |

#### Measures
@@ -99,20 +99,20 @@ Find detailed notes on model performances and optimal parameter settings in [the

The evaluation framework of this repository is based on the comparison of a set of target words across two corpora. Hence, models can be evaluated on a triple (dataset, corpus1, corpus2), where the dataset provides gold values for the change of target words between corpus1 and corpus2.

-| Dataset | Corpus 1 | Corpus 2 | Download |
-| --- | --- | --- | --- |
-| DURel | DTA18 | DTA19 | [Dataset](https://www.ims.uni-stuttgart.de/forschung/ressourcen/experiment-daten/durel.html), [Corpora](https://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/wocc.html) |
-| SURel | SDEWAC | COOK | [Dataset](https://www.ims.uni-stuttgart.de/forschung/ressourcen/experiment-daten/surel.html), [Corpora](https://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/wocc.html) |
+| Dataset | Corpus 1 | Corpus 2 | Download | Comment |
+| --- | --- | --- | --- | --- |
+| DURel | DTA18 | DTA19 | [Dataset](https://www.ims.uni-stuttgart.de/forschung/ressourcen/experiment-daten/durel.html), [Corpora](https://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/wocc.html) | - version from Schlechtweg et al. (2019) at `testsets/durel/` |
+| SURel | SDEWAC | COOK | [Dataset](https://www.ims.uni-stuttgart.de/forschung/ressourcen/experiment-daten/surel.html), [Corpora](https://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/wocc.html) | - version from Schlechtweg et al. (2019) at `testsets/surel/` |
+| SemCor LSC | SEMCOR1 | SEMCOR2 | [Dataset](https://www.ims.uni-stuttgart.de/data/lsc-simul), [Corpora](https://www.ims.uni-stuttgart.de/data/lsc-simul) | |

-You don't have to download the data manually. In `testsets/` we provide the testset versions of DURel and SURel as used in Schlechtweg et al. (2019). Additionally, we provide an evaluation pipeline, downloading the corpora and evaluating the models on the above-mentioned datasets, see [pipeline](#pipeline).
+We provide several evaluation pipelines that download the corpora and evaluate the models on the above-mentioned datasets; see [pipeline](#pipeline).

#### Metrics

-|Name | Code | Applicability |
-| --- | --- | --- |
-| Spearman correlation | `evaluation/spearman.py` | DURel, SURel |
-
-The script `evaluation/spearman.py` outputs the Spearman correlation of the two input rankings (column 3), as well as the significance of the obtained result (column 4).
+|Name | Code | Applicability | Comment |
+| --- | --- | --- | --- |
+| Spearman correlation | `evaluation/spr.py` | DURel, SURel, SemCor LSC | - outputs rho (column 3) and p-value (column 4) |
+| Average Precision | `evaluation/ap.py` | SemCor LSC | - outputs AP (column 3) and random baseline (column 4) |
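For intuition: `evaluation/spr.py` delegates to `scipy.stats.spearmanr` with `nan_policy='omit'`; a self-contained pure-Python sketch of the same statistic (average ranks for ties, then Pearson correlation of the rank vectors) looks like this — toy data, not repository code:

```python
def ranks(values):
    # 1-based average ranks; tied values share the mean of their positions.
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r


def spearman(x, y):
    # Spearman's rho = Pearson correlation of the two rank vectors.
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Toy example: gold rank vs. predicted change scores
print(spearman([1, 2, 3, 4, 5], [0.2, 0.1, 0.4, 0.3, 0.5]))  # 0.8
```

With scipy installed, `spearmanr` returns the same rho for these inputs, plus the p-value that `spr.py` prints in column 4.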

Consider uploading your results for DURel as a submission to the shared task [Lexical Semantic Change Detection in German](https://codalab.lri.fr/competitions/560).

@@ -128,16 +128,18 @@ Then run

The script first reads the two gzipped test corpora `corpora/test/corpus1/` and `corpora/test/corpus2/`. Then it produces model predictions for the targets in `testsets/test/targets.tsv` and writes them under `results/`. It finally writes the Spearman correlation between each model's predictions and the gold rank (`testsets/test/gold.tsv`) under the respective folder in `results/`. Note that the gold values for the test data are meaningless, as they were randomly assigned.

-We also provide scripts to reproduce the results from Schlechtweg et al. (2019), including the corpus download. For this run either of
+We also provide a script for each dataset that runs all models on it, including the necessary downloads. To do so, run one of

bash -e scripts/run_durel.sh
bash -e scripts/run_surel.sh
bash -e scripts/run_semcor.sh

-You may want to change the parameters in `scripts/parameters_durel.sh` and `scripts/parameters_surel.sh` (e.g. vector dimensionality, iterations), as running the scripts on the full parameter set will take several days and require a large amount of disk space.
+As is, the scripts reproduce the results from Schlechtweg et al. (2019) and Schlechtweg & Schulte im Walde (2020). You may want to change the parameters in `scripts/parameters_durel.sh`, etc. (e.g. vector dimensionality, iterations), as running the scripts on the full parameter set may take several days and require a large amount of disk space.
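For instance, a quick low-resource run could use a reduced grid in `scripts/parameters_durel.sh` — illustrative values only, not the settings used in the papers:

```shell
# Reduced grid for a quick run (illustrative values, not the paper settings)
windowSizes=(10)   # instead of (2 5 10)
ks=(5)             # one shifting parameter value
ts=(0.001)         # one subsampling threshold
iterations=(1)     # a single run instead of five
dims=(100)         # smaller vectors than the default 300
eps=(5)            # SGNS training epochs
```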

### Important Changes

September 1, 2019: Python scripts were updated from Python 2 to Python 3.
December 27, 2019: Bug fixes in `alignment/sgns_vi.py` (see script for details).

### Error Sources

@@ -157,4 +159,14 @@ BibTeX
pages = {732--746}
}
```
```
@inproceedings{SchlechtwegWalde20,
author = {Dominik Schlechtweg and Sabine {Schulte im Walde}},
booktitle = {{The Evolution of Language: Proceedings of the 13th International Conference (EVOLANGXIII)}},
editor = {C. Cuskley and M. Flaherty and H. Little and Luke McCrohon and A. Ravignani and T. Verhoef},
title = {{Simulating Lexical Semantic Change from Sense-Annotated Data}},
year = {2020}
}
```

76 changes: 76 additions & 0 deletions evaluation/ap.py
@@ -0,0 +1,76 @@
import sys
sys.path.append('./modules/')

from sklearn.metrics import average_precision_score
from collections import Counter
from docopt import docopt
import numpy as np
import logging
import time


def main():
"""
Calculate the Average Precision (AP) of full rank of targets.
"""

# Get the arguments
args = docopt("""Calculate the Average Precision (AP) of full rank of targets.
Usage:
ap.py <classFile> <resultFile> <classFileName> <resultFileName>
<classFile> = file with gold class assignments
<resultFile> = file with values assigned to targets
<classFileName> = name of class file to print
<resultFileName> = name of result file to print
Note:
Assumes tab-separated CSV files as input. Assumes same number and order of rows. classFile must contain class assignments in first column. resultFile must contain targets in first column and values in second column. Targets with nan are ignored.
""")

classFile = args['<classFile>']
resultFile = args['<resultFile>']
classFileName = args['<classFileName>']
resultFileName = args['<resultFileName>']

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
logging.info(__file__.upper())
start_time = time.time()

# Get gold data
with open(classFile, 'r', encoding='utf-8') as f_in:
classes = [float(line.strip()) for line in f_in]

# Get predictions
with open(resultFile, 'r', encoding='utf-8') as f_in:
target2values = {line.strip().split('\t')[0]:float(line.strip().split('\t')[1]) for line in f_in}

target2class = {target:classes[i] for i, target in enumerate(target2values)}

# Read in values, exclude nan and targets not present in resultFile
gold = np.array([target2class[target] for (target, value) in target2values.items() if not np.isnan(value)])
values = np.array([value for (target, value) in target2values.items() if not np.isnan(value)])
targets = np.array([target for (target, value) in target2values.items() if not np.isnan(value)])

if len(classes) != len(gold):
print('nan encountered!')

# Compute average precision
try:
ap = average_precision_score(gold, values)
mc = Counter(gold)[1.0]
rb = mc/len(gold) # approximate random baseline
except IndexError as e:
logging.info(e)
ap, rb = float('nan'), float('nan')

print('\t'.join((classFileName, resultFileName, str(ap), str(rb))))

logging.info("--- %s seconds ---" % (time.time() - start_time))


if __name__ == "__main__":
main()
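As the script shows, the metric itself is computed by `sklearn.metrics.average_precision_score`. For intuition, a minimal pure-Python sketch of the same quantity (assuming no ties among the predicted values) together with the random baseline reported in column 4:

```python
def average_precision(gold, scores):
    # gold: binary labels (1 = changed word), scores: predicted change values.
    # Rank targets by score (highest first) and average the precision
    # obtained at each true positive; assumes no ties among scores.
    ranked = sorted(zip(scores, gold), reverse=True)
    hits, precisions = 0, []
    for k, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / hits if hits else float('nan')


gold = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
print(average_precision(gold, scores))  # (1/1 + 2/3) / 2 = 0.8333...
print(sum(gold) / len(gold))  # 0.5, the approximate random baseline as in ap.py
```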
4 changes: 2 additions & 2 deletions evaluation/spearman.py → evaluation/spr.py
@@ -18,7 +18,7 @@ def main():
Usage:
-        spearman.py <filePath1> <filePath2> <filename1> <filename2> <col1> <col2>
+        spr.py <filePath1> <filePath2> <filename1> <filename2> <col1> <col2>
Arguments:
<filePath1> = path to file1
@@ -62,7 +62,7 @@ def main():
rho, p = spearmanr(data1, data2, nan_policy='omit')
except ValueError as e:
logging.info(e)
-        rho, p = 'nan', 'nan'
+        rho, p = float('nan'), float('nan')

print('\t'.join((filename1, filename2, str(rho), str(p))))

68 changes: 68 additions & 0 deletions measures/rand.py
@@ -0,0 +1,68 @@
import sys
sys.path.append('./modules/')

from docopt import docopt
import logging
import time
import random

def main():
"""
Measure assigning random values to targets (as baseline).
"""

# Get the arguments
args = docopt("""Measure assigning random values to targets (as baseline).
Usage:
rand.py [(-f | -s)] (-r) <testset> <outPath>
<testset> = path to file with tab-separated word pairs
<outPath> = output path for result file
Options:
-f, --fst write only first target in output file
-s, --scd write only second target in output file
-r, --rel assign random real numbers between 0 and 1
""")

is_fst = args['--fst']
is_scd = args['--scd']
is_rel = args['--rel']
testset = args['<testset>']
outPath = args['<outPath>']

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
logging.info(__file__.upper())
start_time = time.time()

# Load targets
with open(testset, 'r', encoding='utf-8') as f_in:
targets = [(line.strip().split('\t')[0],line.strip().split('\t')[1]) for line in f_in]

scores = {}
for (t1, t2) in targets:

if is_rel:
score = random.uniform(0, 1)

scores[(t1, t2)] = score


with open(outPath, 'w', encoding='utf-8') as f_out:
for (t1, t2) in targets:
if is_fst: # output only first target string
f_out.write('\t'.join((t1, str(scores[(t1, t2)])+'\n')))
elif is_scd: # output only second target string
f_out.write('\t'.join((t2, str(scores[(t1, t2)])+'\n')))
else: # standard outputs both target strings
f_out.write('\t'.join(('%s,%s' % (t1,t2), str(scores[(t1, t2)])+'\n')))


logging.info("--- %s seconds ---" % (time.time() - start_time))



if __name__ == '__main__':
main()
2 changes: 1 addition & 1 deletion representations/sgns.py
@@ -57,7 +57,7 @@ def main():
hs=0, # negative sampling
negative=k, # number of negative samples
sample=t, # threshold for subsampling, if None, no subsampling is performed
-                     size=dim, window=windowSize, min_count=minCount, iter=itera, workers=20)
+                     size=dim, window=windowSize, min_count=minCount, iter=itera, workers=40)

# Initialize vocabulary
vocab_sentences = PathLineSentences(corpDir)
4 changes: 4 additions & 0 deletions scripts/make_results_disp.sh
@@ -79,3 +79,7 @@ infolder1=$entropyresultfolder1
infolder2=$entropyresultfolder2
outfolder=$resultfolder
source scripts/run_DIFF.sh # Subtract entropy (Entropy Difference)

# Create random predictions as baselines
outfolder=$resultfolder
source scripts/run_RAND.sh
4 changes: 4 additions & 0 deletions scripts/make_results_sim.sh
@@ -74,3 +74,7 @@ matrixfolder2=$alignedmatrixfolder2
outfolder=$resultfolder
source scripts/run_CD.sh # Cosine Distance
source scripts/run_LND.sh # Local Neighborhood Distance

# Create random predictions as baselines
outfolder=$resultfolder
source scripts/run_RAND.sh
5 changes: 4 additions & 1 deletion scripts/make_results_wi.sh
@@ -22,7 +22,6 @@ source scripts/run_PPMI.sh # PPMI
matrixfolder=$ppmimatrixfolderwi
outfolder=$svdmatrixfolderwi
source scripts/run_SVD.sh # SVD

# Get Predictions
for matrixfolder in "${matrixfolders[@]}"
do
@@ -33,3 +32,7 @@ do
source scripts/run_CD.sh # Cosine Distance
source scripts/run_LND.sh # Local Neighborhood Distance
done

# Create random predictions as baselines
outfolder=$resultfolder
source scripts/run_RAND.sh
24 changes: 24 additions & 0 deletions scripts/make_targets.sh
@@ -0,0 +1,24 @@

## Make target input files

if [ ! -f "$targets" ];
then
echo -e "Error: No target file found at $targets."
exit 1
fi

if [ ! -f $testset ];
then
for i in `cat $targets`
do
echo -e "$i\t$i" >> $testset # general input
done
fi

if [ ! -f $testsetwi ];
then
for i in `cat $targets`
do
echo -e "${i}_\t$i" >> $testsetwi # input for word injection
done
fi
40 changes: 23 additions & 17 deletions scripts/parameters_durel.sh
@@ -1,25 +1,31 @@
shopt -s extglob # For more powerful regular expressions in shell

### Define parameters ###
-declare -a corpDir1="corpora/durel/corpus1/" # directory for corpus1 files (all files in directory will be read)
-declare -a corpDir2="corpora/durel/corpus2/" # directory for corpus2 files (all files in directory will be read)
-declare -a wiCorpDir="corpora/durel/corpus_wi/" # directory for word-injected corpus (only needed for Word Injection)
-declare -a freqnorms=(26650530 40323497) # normalization constants for token frequency (total number of tokens in first and second corpus, *before cleaning*)
-declare -a typesnorms=(252437 796365) # normalization constants for number of context types (total number of types in first and second corpus, *before cleaning*)
-declare -a windowSizes=(2 5 10) # window sizes for all models
-declare -a ks=(5 1) # values for shifting parameter k
-declare -a ts=(0.001 None) # values for subsampling parameter t
-declare -a iterations=(1 2 3 4 5) # list of iterations, each item is one iteration, for five iterations define: iterations=(1 2 3 4 5)
-declare -a dim=300 # dimensionality of low-dimensional matrices (SVD/RI/SGNS)
-declare -a testset="testsets/durel/targets.tsv" # target words for which change scores should be predicted (one target per line repeated twice with tab-separation, i.e., 'word\tword')
-declare -a testsetwi="testsets/durel/targets_wi.tsv" # target words for Word Injection (one target per line, injected version in first column, non-injected version in second column, i.e., 'word_\tword')
-declare -a goldscorefile="testsets/durel/gold.tsv" # file with gold scores for target words in same order as targets in testsets
+corpDir1="corpora/durel/corpus1/" # directory for corpus1 files (all files in directory will be read)
+corpDir2="corpora/durel/corpus2/" # directory for corpus2 files (all files in directory will be read)
+wiCorpDir="corpora/durel/corpus_wi/" # directory for word-injected corpus (only needed for Word Injection)
+freqnorms=(26650530 40323497) # normalization constants for token frequency (total number of tokens in first and second corpus, *before cleaning*)
+typesnorms=(252437 796365) # normalization constants for number of context types (total number of types in first and second corpus, *before cleaning*)
+windowSizes=(2 5 10) # window sizes for all models
+ks=(5 1) # values for shifting parameter k
+ts=(0.001 None) # values for subsampling parameter t
+iterations=(1 2 3 4 5) # list of iterations, each item is one iteration, for five iterations define: iterations=(1 2 3 4 5)
+dims=(300) # dimensionality of low-dimensional matrices (SVD/RI/SGNS)
+eps=(5) # training epochs for SGNS
+targets="testsets/durel/targets.tsv" # target words for which change scores should be predicted (one target per line)
+testset="testsets/durel/targets_input.tsv" # target words in input format (one target per line repeated twice with tab-separation, i.e., 'word\tword', will be created)
+testsetwi="testsets/durel/targets_wi.tsv" # target words in word injection format (one target per line, injected version in first column, non-injected version in second column, i.e., 'word_\tword', will be created)
+goldrankfile="testsets/durel/rank.tsv" # file with gold scores for target words in same order as targets in testsets
+goldclassfile="" # file with gold classes for target words in same order as targets in testsets (leave undefined if non-existent)

# Get normalization constants for dispersion measures
-declare -a freqnorm1=${freqnorms[0]}
-declare -a freqnorm2=${freqnorms[1]}
-declare -a typesnorm1=${typesnorms[0]}
-declare -a typesnorm2=${typesnorms[1]}
+freqnorm1=${freqnorms[0]}
+freqnorm2=${freqnorms[1]}
+typesnorm1=${typesnorms[0]}
+typesnorm2=${typesnorms[1]}

### Make folder structure ###
source scripts/make_folders.sh

### Make target input files ###
source scripts/make_targets.sh
