add semcor pipeline, dimensionality parameter, epoch parameter
garrafao committed Jan 5, 2020
1 parent 7b72046 commit 361c173
Showing 47 changed files with 521 additions and 200 deletions.
8 changes: 7 additions & 1 deletion .gitignore
@@ -3,6 +3,12 @@ matrices
results
corpora/durel
corpora/surel
corpora/semcor_lsc
testsets/semcor_lsc
modules/__pycache__
modules/*.pyc
-update-git.sh
+update-git.sh
evaluation/average_results.py
evaluation/average_results.sh
evaluation/average_results1.py
evaluation/average_results1.sh
38 changes: 25 additions & 13 deletions README.md
@@ -77,7 +77,7 @@ Table: VSM=Vector Space Model, TPM=Topic Model
| CI | `alignment/ci_align.py` | Count, PPMI | |
| SRV | `alignment/srv_align.py` | RI | - use `-a` for good performance <br> - consider using the efficient and more powerful [TRIPY](https://github.com/Garrafao/TRIPY) |
| OP | `alignment/map_embeddings.py` | SVD, RI, SGNS | - drawn from [VecMap](https://github.com/artetxem/vecmap) <br> - for OP- and OP+ see `scripts/` |
-| VI | `alignment/sgns_vi.py` | SGNS | - updated 27/12/19 (see script for details) |
+| VI | `alignment/sgns_vi.py` | SGNS | - bug fixes 27/12/19 (see script for details) |
| WI | `alignment/wi.py` | Count, PPMI, SVD, RI, SGNS | - consider using the more advanced [Temporal Referencing](https://github.com/Garrafao/TemporalReferencing) |

#### Measures
@@ -99,20 +99,20 @@ Find detailed notes on model performances and optimal parameter settings in [the

The evaluation framework of this repository is based on the comparison of a set of target words across two corpora. Hence, models can be evaluated on a triple (dataset, corpus1, corpus2), where the dataset provides gold values for the change of target words between corpus1 and corpus2.

-| Dataset | Corpus 1 | Corpus 2 | Download |
-| --- | --- | --- | --- |
-| DURel | DTA18 | DTA19 | [Dataset](https://www.ims.uni-stuttgart.de/forschung/ressourcen/experiment-daten/durel.html), [Corpora](https://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/wocc.html) |
-| SURel | SDEWAC | COOK | [Dataset](https://www.ims.uni-stuttgart.de/forschung/ressourcen/experiment-daten/surel.html), [Corpora](https://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/wocc.html) |
+| Dataset | Corpus 1 | Corpus 2 | Download | Comment |
+| --- | --- | --- | --- | --- |
+| DURel | DTA18 | DTA19 | [Dataset](https://www.ims.uni-stuttgart.de/forschung/ressourcen/experiment-daten/durel.html), [Corpora](https://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/wocc.html) | - version from Schlechtweg et al. (2019) at `testsets/durel/` |
+| SURel | SDEWAC | COOK | [Dataset](https://www.ims.uni-stuttgart.de/forschung/ressourcen/experiment-daten/surel.html), [Corpora](https://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/wocc.html) | - version from Schlechtweg et al. (2019) at `testsets/surel/` |
+| SemCor LSC | SEMCOR1 | SEMCOR2 | [Dataset](https://www.ims.uni-stuttgart.de/data/lsc-simul), [Corpora](https://www.ims.uni-stuttgart.de/data/lsc-simul) | |

-You don't have to download the data manually. In `testsets/` we provide the testset versions of DURel and SURel as used in Schlechtweg et al. (2019). Additionally, we provide an evaluation pipeline, downloading the corpora and evaluating the models on the above-mentioned datasets, see [pipeline](#pipeline).
+We provide several evaluation pipelines that download the corpora and evaluate the models on the above-mentioned datasets; see [pipeline](#pipeline).

#### Metrics

-|Name | Code | Applicability |
-| --- | --- | --- |
-| Spearman correlation | `evaluation/spearman.py` | DURel, SURel |
-
-The script `evaluation/spearman.py` outputs the Spearman correlation of the two input rankings (column 3), as well as the significance of the obtained result (column 4).
+|Name | Code | Applicability | Comment |
+| --- | --- | --- | --- |
+| Spearman correlation | `evaluation/spr.py` | DURel, SURel, SemCor LSC | - outputs rho (column 3) and p-value (column 4) |
+| Average Precision | `evaluation/ap.py` | SemCor LSC | - outputs AP (column 3) and random baseline (column 4) |
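For intuition: `evaluation/spr.py` delegates to `scipy.stats.spearmanr` with `nan_policy='omit'`; a self-contained pure-Python sketch of the same statistic (average ranks for ties, then Pearson correlation of the rank vectors) looks like this — toy data, not repository code:

```python
def ranks(values):
    # 1-based average ranks; tied values share the mean of their positions.
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r


def spearman(x, y):
    # Spearman's rho = Pearson correlation of the two rank vectors.
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Toy example: gold rank vs. predicted change scores
print(spearman([1, 2, 3, 4, 5], [0.2, 0.1, 0.4, 0.3, 0.5]))  # 0.8
```

With scipy installed, `spearmanr` returns the same rho for these inputs, plus the p-value that `spr.py` prints in column 4.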

Consider uploading your results for DURel as a submission to the shared task [Lexical Semantic Change Detection in German](https://codalab.lri.fr/competitions/560).

@@ -128,16 +128,18 @@ Then run

The script first reads the two gzipped test corpora `corpora/test/corpus1/` and `corpora/test/corpus2/`. Then it produces model predictions for the targets in `testsets/test/targets.tsv` and writes them under `results/`. It finally writes the Spearman correlation between each model's predictions and the gold rank (`testsets/test/gold.tsv`) under the respective folder in `results/`. Note that the gold values for the test data are meaningless, as they were randomly assigned.

-We also provide scripts to reproduce the results from Schlechtweg et al. (2019), including the corpus download. For this run either of
+We also provide a script for each dataset that runs all models on it, including the necessary downloads. To do so, run one of

bash -e scripts/run_durel.sh
bash -e scripts/run_surel.sh
bash -e scripts/run_semcor.sh

-You may want to change the parameters in `scripts/parameters_durel.sh` and `scripts/parameters_surel.sh` (e.g. vector dimensionality, iterations), as running the scripts on the full parameter set will take several days and require a large amount of disk space.
+As is, the scripts reproduce the results from Schlechtweg et al. (2019) and Schlechtweg & Schulte im Walde (2020). You may want to change the parameters in `scripts/parameters_durel.sh`, etc. (e.g. vector dimensionality, iterations), as running the scripts on the full parameter set may take several days and require a large amount of disk space.
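For instance, a quick low-resource run could use a reduced grid in `scripts/parameters_durel.sh` — illustrative values only, not the settings used in the papers:

```shell
# Reduced grid for a quick run (illustrative values, not the paper settings)
windowSizes=(10)   # instead of (2 5 10)
ks=(5)             # one shifting parameter value
ts=(0.001)         # one subsampling threshold
iterations=(1)     # a single run instead of five
dims=(100)         # smaller vectors than the default 300
eps=(5)            # SGNS training epochs
```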

### Important Changes

September 1, 2019: Python scripts were updated from Python 2 to Python 3.
December 27, 2019: Bug fixes in `alignment/sgns_vi.py` (see script for details).

### Error Sources

@@ -157,4 +159,14 @@ BibTeX
pages = {732--746}
}
```
```
@inproceedings{SchlechtwegWalde20,
author = {Dominik Schlechtweg and Sabine {Schulte im Walde}},
booktitle = {{The Evolution of Language: Proceedings of the 13th International Conference (EVOLANGXIII)}},
editor = {C. Cuskley and M. Flaherty and H. Little and Luke McCrohon and A. Ravignani and T. Verhoef},
title = {{Simulating Lexical Semantic Change from Sense-Annotated Data}},
year = {2020}
}
```

76 changes: 76 additions & 0 deletions evaluation/ap.py
@@ -0,0 +1,76 @@
import sys
sys.path.append('./modules/')

from sklearn.metrics import average_precision_score
from collections import Counter
from docopt import docopt
import numpy as np
import logging
import time


def main():
"""
Calculate the Average Precision (AP) of full rank of targets.
"""

# Get the arguments
args = docopt("""Calculate the Average Precision (AP) of full rank of targets.
Usage:
ap.py <classFile> <resultFile> <classFileName> <resultFileName>
<classFile> = file with gold class assignments
<resultFile> = file with values assigned to targets
<classFileName> = name of class file to print
<resultFileName> = name of result file to print
Note:
Assumes tab-separated CSV files as input. Assumes same number and order of rows. classFile must contain class assignments in first column. resultFile must contain targets in first column and values in second column. Targets with nan are ignored.
""")

classFile = args['<classFile>']
resultFile = args['<resultFile>']
classFileName = args['<classFileName>']
resultFileName = args['<resultFileName>']

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
logging.info(__file__.upper())
start_time = time.time()

# Get gold data
with open(classFile, 'r', encoding='utf-8') as f_in:
classes = [float(line.strip()) for line in f_in]

# Get predictions
with open(resultFile, 'r', encoding='utf-8') as f_in:
target2values = {line.strip().split('\t')[0]:float(line.strip().split('\t')[1]) for line in f_in}

target2class = {target:classes[i] for i, target in enumerate(target2values)}

# Read in values, exclude nan and targets not present in resultFile
gold = np.array([target2class[target] for (target, value) in target2values.items() if not np.isnan(value)])
values = np.array([value for (target, value) in target2values.items() if not np.isnan(value)])
targets = np.array([target for (target, value) in target2values.items() if not np.isnan(value)])

if len(classes) != len(gold):
print('nan encountered!')

# Compute average precision
try:
ap = average_precision_score(gold, values)
mc = Counter(gold)[1.0]
rb = mc/len(gold) # approximate random baseline
except IndexError as e:
logging.info(e)
ap, rb = float('nan'), float('nan')

print('\t'.join((classFileName, resultFileName, str(ap), str(rb))))

logging.info("--- %s seconds ---" % (time.time() - start_time))


if __name__ == "__main__":
main()
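As the script shows, the metric itself is computed by `sklearn.metrics.average_precision_score`. For intuition, a minimal pure-Python sketch of the same quantity (assuming no ties among the predicted values) together with the random baseline reported in column 4:

```python
def average_precision(gold, scores):
    # gold: binary labels (1 = changed word), scores: predicted change values.
    # Rank targets by score (highest first) and average the precision
    # obtained at each true positive; assumes no ties among scores.
    ranked = sorted(zip(scores, gold), reverse=True)
    hits, precisions = 0, []
    for k, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / hits if hits else float('nan')


gold = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
print(average_precision(gold, scores))  # (1/1 + 2/3) / 2 = 0.8333...
print(sum(gold) / len(gold))  # 0.5, the approximate random baseline as in ap.py
```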
4 changes: 2 additions & 2 deletions evaluation/spearman.py → evaluation/spr.py
@@ -18,7 +18,7 @@ def main():
Usage:
-        spearman.py <filePath1> <filePath2> <filename1> <filename2> <col1> <col2>
+        spr.py <filePath1> <filePath2> <filename1> <filename2> <col1> <col2>
Arguments:
<filePath1> = path to file1
@@ -62,7 +62,7 @@ def main():
rho, p = spearmanr(data1, data2, nan_policy='omit')
except ValueError as e:
logging.info(e)
-        rho, p = 'nan', 'nan'
+        rho, p = float('nan'), float('nan')

print('\t'.join((filename1, filename2, str(rho), str(p))))

68 changes: 68 additions & 0 deletions measures/rand.py
@@ -0,0 +1,68 @@
import sys
sys.path.append('./modules/')

from docopt import docopt
import logging
import time
import random

def main():
"""
Measure assigning random values to targets (as baseline).
"""

# Get the arguments
args = docopt("""Measure assigning random values to targets (as baseline).
Usage:
rand.py [(-f | -s)] (-r) <testset> <outPath>
<testset> = path to file with tab-separated word pairs
<outPath> = output path for result file
Options:
-f, --fst write only first target in output file
-s, --scd write only second target in output file
-r, --rel assign random real numbers between 0 and 1
""")

is_fst = args['--fst']
is_scd = args['--scd']
is_rel = args['--rel']
testset = args['<testset>']
outPath = args['<outPath>']

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
logging.info(__file__.upper())
start_time = time.time()

# Load targets
with open(testset, 'r', encoding='utf-8') as f_in:
targets = [(line.strip().split('\t')[0],line.strip().split('\t')[1]) for line in f_in]

scores = {}
for (t1, t2) in targets:

if is_rel:
score = random.uniform(0, 1)

scores[(t1, t2)] = score


with open(outPath, 'w', encoding='utf-8') as f_out:
for (t1, t2) in targets:
if is_fst: # output only first target string
f_out.write('\t'.join((t1, str(scores[(t1, t2)])+'\n')))
elif is_scd: # output only second target string
f_out.write('\t'.join((t2, str(scores[(t1, t2)])+'\n')))
else: # standard outputs both target strings
f_out.write('\t'.join(('%s,%s' % (t1,t2), str(scores[(t1, t2)])+'\n')))


logging.info("--- %s seconds ---" % (time.time() - start_time))



if __name__ == '__main__':
main()
2 changes: 1 addition & 1 deletion representations/sgns.py
@@ -57,7 +57,7 @@ def main():
hs=0, # negative sampling
negative=k, # number of negative samples
sample=t, # threshold for subsampling, if None, no subsampling is performed
-                     size=dim, window=windowSize, min_count=minCount, iter=itera, workers=20)
+                     size=dim, window=windowSize, min_count=minCount, iter=itera, workers=40)

# Initialize vocabulary
vocab_sentences = PathLineSentences(corpDir)
4 changes: 4 additions & 0 deletions scripts/make_results_disp.sh
@@ -79,3 +79,7 @@ infolder1=$entropyresultfolder1
infolder2=$entropyresultfolder2
outfolder=$resultfolder
source scripts/run_DIFF.sh # Subtract entropy (Entropy Difference)

# Create random predictions as baselines
outfolder=$resultfolder
source scripts/run_RAND.sh
4 changes: 4 additions & 0 deletions scripts/make_results_sim.sh
@@ -74,3 +74,7 @@ matrixfolder2=$alignedmatrixfolder2
outfolder=$resultfolder
source scripts/run_CD.sh # Cosine Distance
source scripts/run_LND.sh # Local Neighborhood Distance

# Create random predictions as baselines
outfolder=$resultfolder
source scripts/run_RAND.sh
5 changes: 4 additions & 1 deletion scripts/make_results_wi.sh
@@ -22,7 +22,6 @@ source scripts/run_PPMI.sh # PPMI
matrixfolder=$ppmimatrixfolderwi
outfolder=$svdmatrixfolderwi
source scripts/run_SVD.sh # SVD

# Get Predictions
for matrixfolder in "${matrixfolders[@]}"
do
@@ -33,3 +32,7 @@ do
source scripts/run_CD.sh # Cosine Distance
source scripts/run_LND.sh # Local Neighborhood Distance
done

# Create random predictions as baselines
outfolder=$resultfolder
source scripts/run_RAND.sh
24 changes: 24 additions & 0 deletions scripts/make_targets.sh
@@ -0,0 +1,24 @@

## Make target input files

if [ ! -f "$targets" ];
then
echo -e "Error: No target file found at $targets."
exit 1
fi

if [ ! -f $testset ];
then
for i in `cat $targets`
do
echo -e "$i\t$i" >> $testset # general input
done
fi

if [ ! -f $testsetwi ];
then
for i in `cat $targets`
do
echo -e "${i}_\t$i" >> $testsetwi # input for word injection
done
fi
40 changes: 23 additions & 17 deletions scripts/parameters_durel.sh
@@ -1,25 +1,31 @@
shopt -s extglob # For more powerful regular expressions in shell

### Define parameters ###
-declare -a corpDir1="corpora/durel/corpus1/" # directory for corpus1 files (all files in directory will be read)
-declare -a corpDir2="corpora/durel/corpus2/" # directory for corpus2 files (all files in directory will be read)
-declare -a wiCorpDir="corpora/durel/corpus_wi/" # directory for word-injected corpus (only needed for Word Injection)
-declare -a freqnorms=(26650530 40323497) # normalization constants for token frequency (total number of tokens in first and second corpus, *before cleaning*)
-declare -a typesnorms=(252437 796365) # normalization constants for number of context types (total number of types in first and second corpus, *before cleaning*)
-declare -a windowSizes=(2 5 10) # window sizes for all models
-declare -a ks=(5 1) # values for shifting parameter k
-declare -a ts=(0.001 None) # values for subsampling parameter t
-declare -a iterations=(1 2 3 4 5) # list of iterations, each item is one iteration, for five iterations define: iterations=(1 2 3 4 5)
-declare -a dim=300 # dimensionality of low-dimensional matrices (SVD/RI/SGNS)
-declare -a testset="testsets/durel/targets.tsv" # target words for which change scores should be predicted (one target per line repeated twice with tab-separation, i.e., 'word\tword')
-declare -a testsetwi="testsets/durel/targets_wi.tsv" # target words for Word Injection (one target per line, injected version in first column, non-injected version in second column, i.e., 'word_\tword')
-declare -a goldscorefile="testsets/durel/gold.tsv" # file with gold scores for target words in same order as targets in testsets
+corpDir1="corpora/durel/corpus1/" # directory for corpus1 files (all files in directory will be read)
+corpDir2="corpora/durel/corpus2/" # directory for corpus2 files (all files in directory will be read)
+wiCorpDir="corpora/durel/corpus_wi/" # directory for word-injected corpus (only needed for Word Injection)
+freqnorms=(26650530 40323497) # normalization constants for token frequency (total number of tokens in first and second corpus, *before cleaning*)
+typesnorms=(252437 796365) # normalization constants for number of context types (total number of types in first and second corpus, *before cleaning*)
+windowSizes=(2 5 10) # window sizes for all models
+ks=(5 1) # values for shifting parameter k
+ts=(0.001 None) # values for subsampling parameter t
+iterations=(1 2 3 4 5) # list of iterations, each item is one iteration, for five iterations define: iterations=(1 2 3 4 5)
+dims=(300) # dimensionality of low-dimensional matrices (SVD/RI/SGNS)
+eps=(5) # training epochs for SGNS
+targets="testsets/durel/targets.tsv" # target words for which change scores should be predicted (one target per line)
+testset="testsets/durel/targets_input.tsv" # target words in input format (one target per line repeated twice with tab-separation, i.e., 'word\tword', will be created)
+testsetwi="testsets/durel/targets_wi.tsv" # target words in word injection format (one target per line, injected version in first column, non-injected version in second column, i.e., 'word_\tword', will be created)
+goldrankfile="testsets/durel/rank.tsv" # file with gold scores for target words in same order as targets in testsets
+goldclassfile="" # file with gold classes for target words in same order as targets in testsets (leave undefined if non-existent)

# Get normalization constants for dispersion measures
-declare -a freqnorm1=${freqnorms[0]}
-declare -a freqnorm2=${freqnorms[1]}
-declare -a typesnorm1=${typesnorms[0]}
-declare -a typesnorm2=${typesnorms[1]}
+freqnorm1=${freqnorms[0]}
+freqnorm2=${freqnorms[1]}
+typesnorm1=${typesnorms[0]}
+typesnorm2=${typesnorms[1]}

### Make folder structure ###
source scripts/make_folders.sh

### Make target input files ###
source scripts/make_targets.sh
