diff --git a/.gitignore b/.gitignore index 86ec302c..c5ad30ba 100644 --- a/.gitignore +++ b/.gitignore @@ -3,6 +3,12 @@ matrices results corpora/durel corpora/surel +corpora/semcor_lsc +testsets/semcor_lsc modules/__pycache__ modules/*.pyc -update-git.sh \ No newline at end of file +update-git.sh +evaluation/average_results.py +evaluation/average_results.sh +evaluation/average_results1.py +evaluation/average_results1.sh \ No newline at end of file diff --git a/README.md b/README.md index 5e29d61b..a5d0e7cb 100644 --- a/README.md +++ b/README.md @@ -77,7 +77,7 @@ Table: VSM=Vector Space Model, TPM=Topic Model | CI | `alignment/ci_align.py` | Count, PPMI | | | SRV | `alignment/srv_align.py` | RI | - use `-a` for good performance
- consider using the efficient and more powerful [TRIPY](https://github.com/Garrafao/TRIPY) | | OP | `alignment/map_embeddings.py` | SVD, RI, SGNS | - drawn from [VecMap](https://github.com/artetxem/vecmap)
- for OP- and OP+ see `scripts/` | -| VI | `alignment/sgns_vi.py` | SGNS | - updated 27/12/19 (see script for details) | +| VI | `alignment/sgns_vi.py` | SGNS | - bug fixes 27/12/19 (see script for details) | | WI | `alignment/wi.py` | Count, PPMI, SVD, RI, SGNS | - consider using the more advanced [Temporal Referencing](https://github.com/Garrafao/TemporalReferencing) | #### Measures @@ -99,20 +99,20 @@ Find detailed notes on model performances and optimal parameter settings in [the The evaluation framework of this repository is based on the comparison of a set of target words across two corpora. Hence, models can be evaluated on a triple (dataset, corpus1, corpus2), where the dataset provides gold values for the change of target words between corpus1 and corpus2. -| Dataset | Corpus 1 | Corpus 2 | Download | -| --- | --- | --- | --- | -| DURel | DTA18 | DTA19 | [Dataset](https://www.ims.uni-stuttgart.de/forschung/ressourcen/experiment-daten/durel.html), [Corpora](https://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/wocc.html) | -| SURel | SDEWAC | COOK | [Dataset](https://www.ims.uni-stuttgart.de/forschung/ressourcen/experiment-daten/surel.html), [Corpora](https://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/wocc.html) | +| Dataset | Corpus 1 | Corpus 2 | Download | Comment | +| --- | --- | --- | --- | --- | +| DURel | DTA18 | DTA19 | [Dataset](https://www.ims.uni-stuttgart.de/forschung/ressourcen/experiment-daten/durel.html), [Corpora](https://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/wocc.html) | - version from Schlechtweg et al. (2019) at `testsets/durel/` | +| SURel | SDEWAC | COOK | [Dataset](https://www.ims.uni-stuttgart.de/forschung/ressourcen/experiment-daten/surel.html), [Corpora](https://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/wocc.html) | - version from Schlechtweg et al. 
(2019) at `testsets/surel/` | +| SemCor LSC | SEMCOR1 | SEMCOR2 | [Dataset](https://www.ims.uni-stuttgart.de/data/lsc-simul), [Corpora](https://www.ims.uni-stuttgart.de/data/lsc-simul) | | -You don't have to download the data manually. In `testsets/` we provide the testset versions of DURel and SURel as used in Schlechtweg et al. (2019). Additionally, we provide an evaluation pipeline, downloading the corpora and evaluating the models on the above-mentioned datasets, see [pipeline](#pipeline). +We provide several evaluation pipelines, downloading the corpora and evaluating the models on the above-mentioned datasets, see [pipeline](#pipeline). #### Metrics -|Name | Code | Applicability | -| --- | --- | --- | -| Spearman correlation | `evaluation/spearman.py` | DURel, SURel | - -The script `evaluation/spearman.py` outputs the Spearman correlation of the two input rankings (column 3), as well as the significance of the obtained result (column 4). +|Name | Code | Applicability | Comment | +| --- | --- | --- | --- | +| Spearman correlation | `evaluation/spr.py` | DURel, SURel, SemCor LSC | - outputs rho (column 3) and p-value (column 4) | +| Average Precision | `evaluation/ap.py` | SemCor LSC | - outputs AP (column 3) and random baseline (column 4) | Consider uploading your results for DURel as a submission to the shared task [Lexical Semantic Change Detection in German](https://codalab.lri.fr/competitions/560). @@ -128,16 +128,18 @@ Then run The script first reads the two gzipped test corpora `corpora/test/corpus1/` and `corpora/test/corpus2/`. Then it produces model predictions for the targets in `testsets/test/targets.tsv` and writes them under `results/`. It finally writes the Spearman correlation between each model's predictions and the gold rank (`testsets/test/gold.tsv`) under the respective folder in `results/`. Note that the gold values for the test data are meaningless, as they were randomly assigned. 
-We also provide scripts to reproduce the results from Schlechtweg et al. (2019), including the corpus download. For this run either of +We also provide a script for each dataset that runs all the models on it, including the necessary downloads. To do so, run one of bash -e scripts/run_durel.sh bash -e scripts/run_surel.sh + bash -e scripts/run_semcor.sh -You may want to change the parameters in `scripts/parameters_durel.sh` and `scripts/parameters_surel.sh` (e.g. vector dimensionality, iterations), as running the scripts on the full parameter set will take several days and require a large amount of disk space. +As is, the scripts will reproduce the results from Schlechtweg et al. (2019) and Schlechtweg & Schulte im Walde (2020). You may want to change the parameters in `scripts/parameters_durel.sh`, etc. (e.g. vector dimensionality, iterations), as running the scripts on the full parameter set may take several days and require a large amount of disk space. ### Important Changes September 1, 2019: Python scripts were updated from Python 2 to Python 3. +December 27, 2019: Bug fixes in `alignment/sgns_vi.py` (see script for details). ### Error Sources @@ -157,4 +159,14 @@ BibTex pages = {732--746} } ``` +``` +@inproceedings{SchlechtwegWalde20, + author = {Dominik Schlechtweg and Sabine {Schulte im Walde}}, + booktitle = {{The Evolution of Language: Proceedings of the 13th International Conference (EVOLANGXIII)}}, + editor = {C. Cuskley and M. Flaherty and H. Little and Luke McCrohon and A. Ravignani and T. 
Verhoef}, + title = {{Simulating Lexical Semantic Change from Sense-Annotated Data}}, + year = {2020} +} + +``` diff --git a/evaluation/ap.py b/evaluation/ap.py new file mode 100644 index 00000000..58edef0f --- /dev/null +++ b/evaluation/ap.py @@ -0,0 +1,76 @@ +import sys +sys.path.append('./modules/') + +import sys +from sklearn.metrics import average_precision_score +from collections import Counter +from docopt import docopt +import numpy as np +import logging +import time + + +def main(): + """ + Calculate the Average Precision (AP) of full rank of targets. + """ + + # Get the arguments + args = docopt("""Calculate the Average Precision (AP) of full rank of targets. + + Usage: + ap.py + + = file with gold class assignments + = file with values assigned to targets + = name of class file to print + = name of result file to print + + Note: + Assumes tab-separated CSV files as input. Assumes same number and order of rows. classFile must contain class assignments in first column. resultFile must contain targets in first column and values in second column. Targets with nan are ignored. 
+ + """) + + classFile = args[''] + resultFile = args[''] + classFileName = args[''] + resultFileName = args[''] + + logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO) + logging.info(__file__.upper()) + start_time = time.time() + + # Get gold data + with open(classFile, 'r', encoding='utf-8') as f_in: + classes = [float(line.strip()) for line in f_in] + + # Get predictions + with open(resultFile, 'r', encoding='utf-8') as f_in: + target2values = {line.strip().split('\t')[0]:float(line.strip().split('\t')[1]) for line in f_in} + + target2class = {target:classes[i] for i, target in enumerate(target2values)} + + # Read in values, exclude nan and targets not present in resultFile + gold = np.array([target2class[target] for (target, value) in target2values.items() if not np.isnan(value)]) + values = np.array([value for (target, value) in target2values.items() if not np.isnan(value)]) + targets = np.array([target for (target, value) in target2values.items() if not np.isnan(value)]) + + if len(classes)!=len(list(gold)): + print('nan encountered!') + + # Compute average precision + try: + ap = average_precision_score(gold, values) + mc = Counter(gold)[1.0] + rb = mc/len(gold) # approximate random baseline + except IndexError as e: + logging.info(e) + ap, rb = float('nan'), float('nan') + + print('\t'.join((classFileName, resultFileName, str(ap), str(rb)))) + + logging.info("--- %s seconds ---" % (time.time() - start_time)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/spearman.py b/evaluation/spr.py similarity index 91% rename from evaluation/spearman.py rename to evaluation/spr.py index ffec90ba..d7a8823b 100644 --- a/evaluation/spearman.py +++ b/evaluation/spr.py @@ -18,7 +18,7 @@ def main(): Usage: - spearman.py + spr.py Arguments: = path to file1 @@ -62,7 +62,7 @@ def main(): rho, p = spearmanr(data1, data2, nan_policy='omit') except ValueError as e: logging.info(e) - rho, p = 'nan', 'nan' + rho, p = 
float('nan'), float('nan') print('\t'.join((filename1, filename2, str(rho), str(p)))) diff --git a/measures/rand.py b/measures/rand.py new file mode 100644 index 00000000..eb28577a --- /dev/null +++ b/measures/rand.py @@ -0,0 +1,68 @@ +import sys +sys.path.append('./modules/') + +from docopt import docopt +import logging +import time +import random + +def main(): + """ + Measure assigning random values to targets (as baseline). + """ + + # Get the arguments + args = docopt("""Measure assigning random values to targets (as baseline). + + Usage: + rand.py [(-f | -s)] (-r) + + = path to file with tab-separated word pairs + = output path for result file + + Options: + -f, --fst write only first target in output file + -s, --scd write only second target in output file + -r, --rel assign random real numbers between 0 and 1 + + """) + + is_fst = args['--fst'] + is_scd = args['--scd'] + is_rel = args['--rel'] + testset = args[''] + outPath = args[''] + + logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO) + logging.info(__file__.upper()) + start_time = time.time() + + # Load targets + with open(testset, 'r', encoding='utf-8') as f_in: + targets = [(line.strip().split('\t')[0],line.strip().split('\t')[1]) for line in f_in] + + scores = {} + for (t1, t2) in targets: + + if is_rel: + score = random.uniform(0, 1) + + scores[(t1, t2)] = score + + + with open(outPath, 'w', encoding='utf-8') as f_out: + for (t1, t2) in targets: + if is_fst: # output only first target string + f_out.write('\t'.join((t1, str(scores[(t1, t2)])+'\n'))) + elif is_scd: # output only second target string + f_out.write('\t'.join((t2, str(scores[(t1, t2)])+'\n'))) + else: # standard outputs both target strings + f_out.write('\t'.join(('%s,%s' % (t1,t2), str(scores[(t1, t2)])+'\n'))) + + + logging.info("--- %s seconds ---" % (time.time() - start_time)) + + + +if __name__ == '__main__': + main() diff --git a/representations/sgns.py b/representations/sgns.py index 
be4934f8..77afa01a 100644 --- a/representations/sgns.py +++ b/representations/sgns.py @@ -57,7 +57,7 @@ def main(): hs=0, # negative sampling negative=k, # number of negative samples sample=t, # threshold for subsampling, if None, no subsampling is performed - size=dim, window=windowSize, min_count=minCount, iter=itera, workers=20) + size=dim, window=windowSize, min_count=minCount, iter=itera, workers=40) # Initialize vocabulary vocab_sentences = PathLineSentences(corpDir) diff --git a/scripts/make_results_disp.sh b/scripts/make_results_disp.sh index a4364d7d..78a677c9 100644 --- a/scripts/make_results_disp.sh +++ b/scripts/make_results_disp.sh @@ -79,3 +79,7 @@ infolder1=$entropyresultfolder1 infolder2=$entropyresultfolder2 outfolder=$resultfolder source scripts/run_DIFF.sh # Subtract entropy (Entropy Difference) + +# Create random predictions as baselines +outfolder=$resultfolder +source scripts/run_RAND.sh diff --git a/scripts/make_results_sim.sh b/scripts/make_results_sim.sh index 122b7952..51d7d802 100644 --- a/scripts/make_results_sim.sh +++ b/scripts/make_results_sim.sh @@ -74,3 +74,7 @@ matrixfolder2=$alignedmatrixfolder2 outfolder=$resultfolder source scripts/run_CD.sh # Cosine Distance source scripts/run_LND.sh # Local Neighborhood Distance + +# Create random predictions as baselines +outfolder=$resultfolder +source scripts/run_RAND.sh diff --git a/scripts/make_results_wi.sh b/scripts/make_results_wi.sh index 42a9cd69..16ccc71f 100644 --- a/scripts/make_results_wi.sh +++ b/scripts/make_results_wi.sh @@ -22,7 +22,6 @@ source scripts/run_PPMI.sh # PPMI matrixfolder=$ppmimatrixfolderwi outfolder=$svdmatrixfolderwi source scripts/run_SVD.sh # SVD - # Get Predictions for matrixfolder in "${matrixfolders[@]}" do @@ -33,3 +32,7 @@ do source scripts/run_CD.sh # Cosine Distance source scripts/run_LND.sh # Local Neighborhood Distance done + +# Create random predictions as baselines +outfolder=$resultfolder +source scripts/run_RAND.sh diff --git 
a/scripts/make_targets.sh b/scripts/make_targets.sh new file mode 100644 index 00000000..12bf91b6 --- /dev/null +++ b/scripts/make_targets.sh @@ -0,0 +1,24 @@ + +## Make target input files + +if [ ! -f $targets ]; +then + echo -e "Error: No target file found at $targets." + exit 0 +fi + +if [ ! -f $testset ]; +then + for i in `cat $targets` + do + echo -e "$i\t$i" >> $testset # general input + done +fi + +if [ ! -f $testsetwi ]; +then + for i in `cat $targets` + do + echo -e "${i}_\t$i" >> $testsetwi # input for word injection + done +fi diff --git a/scripts/parameters_durel.sh b/scripts/parameters_durel.sh index 21411cbe..88c6b645 100644 --- a/scripts/parameters_durel.sh +++ b/scripts/parameters_durel.sh @@ -1,25 +1,31 @@ shopt -s extglob # For more powerful regular expressions in shell ### Define parameters ### -declare -a corpDir1="corpora/durel/corpus1/" # directory for corpus1 files (all files in directory will be read) -declare -a corpDir2="corpora/durel/corpus2/" # directory for corpus2 files (all files in directory will be read) -declare -a wiCorpDir="corpora/durel/corpus_wi/" # directory for word-injected corpus (only needed for Word Injection) -declare -a freqnorms=(26650530 40323497) # normalization constants for token frequency (total number of tokens in first and second corpus, *before cleaning*) -declare -a typesnorms=(252437 796365) # normalization constants for number of context types (total number of types in first and second corpus, *before cleaning*) -declare -a windowSizes=(2 5 10) # window sizes for all models -declare -a ks=(5 1) # values for shifting parameter k -declare -a ts=(0.001 None) # values for subsampling parameter t -declare -a iterations=(1 2 3 4 5) # list of iterations, each item is one iteration, for five iterations define: iterations=(1 2 3 4 5) -declare -a dim=300 # dimensionality of low-dimensional matrices (SVD/RI/SGNS) -declare -a testset="testsets/durel/targets.tsv" # target words for which change scores should be predicted 
(one target per line repeated twice with tab-separation, i.e., 'word\tword') -declare -a testsetwi="testsets/durel/targets_wi.tsv" # target words for Word Injection (one target per line, injected version in first column, non-injected version in second column, i.e., 'word_\tword') -declare -a goldscorefile="testsets/durel/gold.tsv" # file with gold scores for target words in same order as targets in testsets +corpDir1="corpora/durel/corpus1/" # directory for corpus1 files (all files in directory will be read) +corpDir2="corpora/durel/corpus2/" # directory for corpus2 files (all files in directory will be read) +wiCorpDir="corpora/durel/corpus_wi/" # directory for word-injected corpus (only needed for Word Injection) +freqnorms=(26650530 40323497) # normalization constants for token frequency (total number of tokens in first and second corpus, *before cleaning*) +typesnorms=(252437 796365) # normalization constants for number of context types (total number of types in first and second corpus, *before cleaning*) +windowSizes=(2 5 10) # window sizes for all models +ks=(5 1) # values for shifting parameter k +ts=(0.001 None) # values for subsampling parameter t +iterations=(1 2 3 4 5) # list of iterations, each item is one iteration, for five iterations define: iterations=(1 2 3 4 5) +dims=(300) # dimensionality of low-dimensional matrices (SVD/RI/SGNS) +eps=(5) # training epochs for SGNS +targets="testsets/durel/targets.tsv" # target words for which change scores should be predicted (one target per line) +testset="testsets/durel/targets_input.tsv" # target words in input format (one target per line repeated twice with tab-separation, i.e., 'word\tword', will be created) +testsetwi="testsets/durel/targets_wi.tsv" # target words in word injection format (one target per line, injected version in first column, non-injected version in second column, i.e., 'word_\tword', will be created) +goldrankfile="testsets/durel/rank.tsv" # file with gold scores for target words in same 
order as targets in testsets +goldclassfile="" # file with gold classes for target words in same order as targets in testsets (leave undefined if non-existent) # Get normalization constants for dispersion measures -declare -a freqnorm1=${freqnorms[0]} -declare -a freqnorm2=${freqnorms[1]} -declare -a typesnorm1=${typesnorms[0]} -declare -a typesnorm2=${typesnorms[1]} +freqnorm1=${freqnorms[0]} +freqnorm2=${freqnorms[1]} +typesnorm1=${typesnorms[0]} +typesnorm2=${typesnorms[1]} ### Make folder structure ### source scripts/make_folders.sh + +### Make target input files ### +source scripts/make_targets.sh diff --git a/scripts/parameters_semcor.sh b/scripts/parameters_semcor.sh new file mode 100644 index 00000000..1cd18f7f --- /dev/null +++ b/scripts/parameters_semcor.sh @@ -0,0 +1,31 @@ +shopt -s extglob # For more powerful regular expressions in shell + +### Define parameters ### +corpDir1="corpora/semcor_lsc/corpus1/" # directory for corpus1 files (all files in directory will be read) +corpDir2="corpora/semcor_lsc/corpus2/" # directory for corpus2 files (all files in directory will be read) +wiCorpDir="corpora/semcor_lsc/corpus_wi_full/" # directory for word-injected corpus (only needed for Word Injection) +freqnorms=(343395 366784) # normalization constants for token frequency (total number of tokens in first and second corpus) +typesnorms=(23553 23995) # normalization constants for number of context types (total number of types in first and second corpus) +windowSizes=(10) # window sizes for all models +ks=(5) # values for shifting parameter k +ts=(None) # values for subsampling parameter t +iterations=(1 2 3 4 5) # list of iterations, each item is one iteration, for five iterations define: iterations=(1 2 3 4 5) +dims=(30 100) # dimensionality of low-dimensional matrices (SVD/RI/SGNS) +eps=(30) # training epochs for SGNS +targets="testsets/semcor_lsc/testset/targets.tsv" # target words for which change scores should be predicted (one target per line) 
+testset="testsets/semcor_lsc/testset/targets_in.tsv" # target words in input format (one target per line repeated twice with tab-separation, i.e., 'word\tword', will be created) +testsetwi="testsets/semcor_lsc/testset/targets_wi.tsv" # target words in word injection format (one target per line, injected version in first column, non-injected version in second column, i.e., 'word_\tword', will be created) +goldrankfile="testsets/semcor_lsc/testset/graded.tsv" # file with gold scores for target words in same order as targets in testsets +goldclassfile="testsets/semcor_lsc/testset/binary.tsv" # file with gold classes for target words in same order as targets in testsets (leave undefined if non-existent) + +# Get normalization constants for dispersion measures +freqnorm1=${freqnorms[0]} +freqnorm2=${freqnorms[1]} +typesnorm1=${typesnorms[0]} +typesnorm2=${typesnorms[1]} + +### Make folder structure ### +source scripts/make_folders.sh + +### Make target input files ### +source scripts/make_targets.sh diff --git a/scripts/parameters_surel.sh b/scripts/parameters_surel.sh index 58eb546c..d1c6f94c 100644 --- a/scripts/parameters_surel.sh +++ b/scripts/parameters_surel.sh @@ -1,25 +1,31 @@ shopt -s extglob # For more powerful regular expressions in shell ### Define parameters ### -declare -a corpDir1="corpora/surel/corpus1/" # directory for corpus1 files (all files in directory will be read) -declare -a corpDir2="corpora/surel/corpus2/" # directory for corpus2 files (all files in directory will be read) -declare -a wiCorpDir="corpora/surel/corpus_wi/" # directory for word-injected corpus (only needed for Word Injection) -declare -a freqnorms=(109731661 1049573) # normalization constants for token frequency (total number of tokens in first and second corpus) -declare -a typesnorms=(2417171 49187) # normalization constants for number of context types (total number of types in first and second corpus) -declare -a windowSizes=(2 5 10) # window sizes for all models -declare -a 
ks=(5 1) # values for shifting parameter k -declare -a ts=(0.001 None) # values for subsampling parameter t -declare -a iterations=(1 2 3 4 5) # list of iterations, each item is one iteration, for five iterations define: iterations=(1 2 3 4 5) -declare -a dim=300 # dimensionality of low-dimensional matrices (SVD/RI/SGNS) -declare -a testset="testsets/surel/targets.tsv" # target words for which change scores should be predicted (one target per line repeated twice with tab-separation, i.e., 'word\tword') -declare -a testsetwi="testsets/surel/targets_wi.tsv" # target words for Word Injection (one target per line, injected version in first column, non-injected version in second column, i.e., 'word_\tword') -declare -a goldscorefile="testsets/surel/gold.tsv" # file with gold scores for target words in same order as targets in testsets +corpDir1="corpora/surel/corpus1/" # directory for corpus1 files (all files in directory will be read) +corpDir2="corpora/surel/corpus2/" # directory for corpus2 files (all files in directory will be read) +wiCorpDir="corpora/surel/corpus_wi/" # directory for word-injected corpus (only needed for Word Injection) +freqnorms=(109731661 1049573) # normalization constants for token frequency (total number of tokens in first and second corpus) +typesnorms=(2417171 49187) # normalization constants for number of context types (total number of types in first and second corpus) +windowSizes=(2 5 10) # window sizes for all models +ks=(5 1) # values for shifting parameter k +ts=(0.001 None) # values for subsampling parameter t +iterations=(1 2 3 4 5) # list of iterations, each item is one iteration, for five iterations define: iterations=(1 2 3 4 5) +dims=(300) # dimensionality of low-dimensional matrices (SVD/RI/SGNS) +eps=(5) # training epochs for SGNS +targets="testsets/surel/targets.tsv" # target words for which change scores should be predicted (one target per line) +testset="testsets/surel/targets_in.tsv" # target words in input format (one 
target per line repeated twice with tab-separation, i.e., 'word\tword', will be created) +testsetwi="testsets/surel/targets_wi.tsv" # target words in word injection format (one target per line, injected version in first column, non-injected version in second column, i.e., 'word_\tword', will be created) +goldrankfile="testsets/surel/rank.tsv" # file with gold scores for target words in same order as targets in testsets +goldclassfile="" # file with gold classes for target words in same order as targets in testsets (leave undefined if non-existent) # Get normalization constants for dispersion measures -declare -a freqnorm1=${freqnorms[0]} -declare -a freqnorm2=${freqnorms[1]} -declare -a typesnorm1=${typesnorms[0]} -declare -a typesnorm2=${typesnorms[1]} +freqnorm1=${freqnorms[0]} +freqnorm2=${freqnorms[1]} +typesnorm1=${typesnorms[0]} +typesnorm2=${typesnorms[1]} ### Make folder structure ### source scripts/make_folders.sh + +### Make target input files ### +source scripts/make_targets.sh diff --git a/scripts/parameters_test.sh b/scripts/parameters_test.sh index e28d3bbf..6ed10296 100644 --- a/scripts/parameters_test.sh +++ b/scripts/parameters_test.sh @@ -1,25 +1,31 @@ shopt -s extglob # For more powerful regular expressions in shell ### Define parameters ### -declare -a corpDir1="corpora/test/corpus1/" # directory for corpus1 files (all files in directory will be read) -declare -a corpDir2="corpora/test/corpus2/" # directory for corpus2 files (all files in directory will be read) -declare -a wiCorpDir="corpora/test/corpus_wi/" # directory for word-injected corpus (only needed for Word Injection) -declare -a freqnorms=(35329 54486) # normalization constants for token frequency (total number of tokens in first and second corpus) -declare -a typesnorms=(6358 9510) # normalization constants for number of context types (total number of types in first and second corpus) -declare -a windowSizes=(1) # window sizes for all models -declare -a ks=(1) # values for shifting 
parameter k -declare -a ts=(None) # values for subsampling parameter t -declare -a iterations=(1) # list of iterations, each item is one iteration, for five iterations define: iterations=(1 2 3 4 5) -declare -a dim=30 # dimensionality of low-dimensional matrices (SVD/RI/SGNS) -declare -a testset="testsets/test/targets.tsv" # target words for which change scores should be predicted (one target per line repeated twice with tab-separation, i.e., 'word\tword') -declare -a testsetwi="testsets/test/targets_wi.tsv" # target words for Word Injection (one target per line, injected version in first column, non-injected version in second column, i.e., 'word_\tword') -declare -a goldscorefile="testsets/test/gold.tsv" # file with gold scores for target words in same order as targets in testsets +corpDir1="corpora/test/corpus1/" # directory for corpus1 files (all files in directory will be read) +corpDir2="corpora/test/corpus2/" # directory for corpus2 files (all files in directory will be read) +wiCorpDir="corpora/test/corpus_wi/" # directory for word-injected corpus (only needed for Word Injection) +freqnorms=(35329 54486) # normalization constants for token frequency (total number of tokens in first and second corpus) +typesnorms=(6358 9510) # normalization constants for number of context types (total number of types in first and second corpus) +windowSizes=(1) # window sizes for all models +ks=(1) # values for shifting parameter k +ts=(None) # values for subsampling parameter t +iterations=(1) # list of iterations, each item is one iteration, for five iterations define: iterations=(1 2 3 4 5) +dims=(30) # dimensionality of low-dimensional matrices (SVD/RI/SGNS) +eps=(1) # training epochs for SGNS +targets="testsets/test/targets.tsv" # target words for which change scores should be predicted (one target per line) +testset="testsets/test/targets_input.tsv" # target words in input format (one target per line repeated twice with tab-separation, i.e., 'word\tword', will be 
created) +testsetwi="testsets/test/targets_wi.tsv" # target words in word injection format (one target per line, injected version in first column, non-injected version in second column, i.e., 'word_\tword', will be created) +goldrankfile="testsets/test/rank.tsv" # file with gold scores for target words in same order as targets in testsets +goldclassfile="testsets/test/class.tsv" # file with gold classes for target words in same order as targets in testsets (leave undefined if non-existent) # Get normalization constants for dispersion measures -declare -a freqnorm1=${freqnorms[0]} -declare -a freqnorm2=${freqnorms[1]} -declare -a typesnorm1=${typesnorms[0]} -declare -a typesnorm2=${typesnorms[1]} +freqnorm1=${freqnorms[0]} +freqnorm2=${freqnorms[1]} +typesnorm1=${typesnorms[0]} +typesnorm2=${typesnorms[1]} ### Make folder structure ### source scripts/make_folders.sh + +### Make target input files ### +source scripts/make_targets.sh diff --git a/scripts/run_AP.sh b/scripts/run_AP.sh new file mode 100644 index 00000000..b4a1b6ff --- /dev/null +++ b/scripts/run_AP.sh @@ -0,0 +1,12 @@ + +if [ -f "$goldclassfile" ]; # Check whether gold class file exists (quoted so that an empty value is not treated as true) +then + resultfiles=($resultfolder/*.tsv) + for resultfile in "${resultfiles[@]}" + do + resultfileshort=${resultfile#$(dirname "$(dirname "$resultfile")")/} + python3 evaluation/ap.py $goldclassfile $resultfile $(basename "$goldclassfile") $resultfileshort >> $outfolder/ap.tsv # evaluate results with Average Precision + done +else + echo -e "Warning: No gold class file found at $goldclassfile." 
+fi diff --git a/scripts/run_CD.sh b/scripts/run_CD.sh index a0d38d20..d4ddb760 100644 --- a/scripts/run_CD.sh +++ b/scripts/run_CD.sh @@ -1,5 +1,5 @@ -matrices=($matrixfolder1/!(*@(_rows|_columns|.model))) +matrices=($matrixfolder1/!(*@(_rows|_columns|.model*))) for matrix in "${matrices[@]}" do diff --git a/scripts/run_CI.sh b/scripts/run_CI.sh index ee2f7967..9438b51b 100644 --- a/scripts/run_CI.sh +++ b/scripts/run_CI.sh @@ -1,5 +1,5 @@ -matrices=($matrixfolder1/!(*@(_rows|_columns|.model))) +matrices=($matrixfolder1/!(*@(_rows|_columns|.model*))) for matrix in "${matrices[@]}" do diff --git a/scripts/run_ENTR.sh b/scripts/run_ENTR.sh index 643a2f3f..4c370cdf 100644 --- a/scripts/run_ENTR.sh +++ b/scripts/run_ENTR.sh @@ -1,5 +1,5 @@ -matrices=($matrixfolder/!(*@(_rows|_columns|.model))) +matrices=($matrixfolder/!(*@(_rows|_columns|.model*))) for matrix in "${matrices[@]}" do diff --git a/scripts/run_LND.sh b/scripts/run_LND.sh index d2c5c332..c24dbe84 100644 --- a/scripts/run_LND.sh +++ b/scripts/run_LND.sh @@ -1,5 +1,5 @@ -matrices=($matrixfolder1/!(*@(_rows|_columns|.model))) +matrices=($matrixfolder1/!(*@(_rows|_columns|.model*))) for matrix in "${matrices[@]}" do diff --git a/scripts/run_NENTR.sh b/scripts/run_NENTR.sh index 5b59c6fe..be275c5f 100644 --- a/scripts/run_NENTR.sh +++ b/scripts/run_NENTR.sh @@ -1,5 +1,5 @@ -matrices=($matrixfolder/!(*@(_rows|_columns|.model))) +matrices=($matrixfolder/!(*@(_rows|_columns|.model*))) for matrix in "${matrices[@]}" do diff --git a/scripts/run_NTYPE.sh b/scripts/run_NTYPE.sh index e194928c..0e5f7e38 100644 --- a/scripts/run_NTYPE.sh +++ b/scripts/run_NTYPE.sh @@ -1,5 +1,5 @@ -matrices=($matrixfolder/!(*@(_rows|_columns|.model))) +matrices=($matrixfolder/!(*@(_rows|_columns|.model*))) for matrix in "${matrices[@]}" do diff --git a/scripts/run_OP+.sh b/scripts/run_OP+.sh index bc924556..0603966f 100644 --- a/scripts/run_OP+.sh +++ b/scripts/run_OP+.sh @@ -1,5 +1,5 @@ 
-matrices=($matrixfolder1/!(*@(_rows|_columns|.model))) +matrices=($matrixfolder1/!(*@(_rows|_columns|.model*))) for matrix in "${matrices[@]}" do diff --git a/scripts/run_OP-.sh b/scripts/run_OP-.sh index 134f696b..fcc8547a 100644 --- a/scripts/run_OP-.sh +++ b/scripts/run_OP-.sh @@ -1,5 +1,5 @@ -matrices=($matrixfolder1/!(*@(_rows|_columns|.model))) +matrices=($matrixfolder1/!(*@(_rows|_columns|.model*))) for matrix in "${matrices[@]}" do diff --git a/scripts/run_OP.sh b/scripts/run_OP.sh index 382e7599..e8574d36 100644 --- a/scripts/run_OP.sh +++ b/scripts/run_OP.sh @@ -1,5 +1,5 @@ -matrices=($matrixfolder1/!(*@(_rows|_columns|.model))) +matrices=($matrixfolder1/!(*@(_rows|_columns|.model*))) for matrix in "${matrices[@]}" do diff --git a/scripts/run_PPMI.sh b/scripts/run_PPMI.sh index 7705bea9..889dcdda 100644 --- a/scripts/run_PPMI.sh +++ b/scripts/run_PPMI.sh @@ -1,5 +1,5 @@ -matrices=($matrixfolder/!(*@(_rows|_columns|.model))) +matrices=($matrixfolder/!(*@(_rows|_columns|.model*))) for matrix in "${matrices[@]}" do diff --git a/scripts/run_RAND.sh b/scripts/run_RAND.sh new file mode 100644 index 00000000..7a07985a --- /dev/null +++ b/scripts/run_RAND.sh @@ -0,0 +1,5 @@ + +for i in {1..10} +do + python3 measures/rand.py -s -r $testset $outfolder/RAND-$i.tsv # random predictions as baseline +done diff --git a/scripts/run_RI.sh b/scripts/run_RI.sh index c3e3d721..2ab233ed 100644 --- a/scripts/run_RI.sh +++ b/scripts/run_RI.sh @@ -1,5 +1,5 @@ -matrices=($matrixfolder/!(*@(_rows|_columns|.model))) +matrices=($matrixfolder/!(*@(_rows|_columns|.model*))) for matrix in "${matrices[@]}" do @@ -7,7 +7,10 @@ do do for t in "${ts[@]}" do - python3 representations/ri.py -s 2 $matrix $outfolder/$(basename "$matrix")-t$t-iter$iteration.ri $outfolder/$(basename "$matrix")-t$t-iter$iteration.ri-elemental-space $dim $t # reduce matrix by random indexing + for dim in "${dims[@]}" + do + python3 representations/ri.py -s 2 $matrix $outfolder/$(basename 
"$matrix")-t$t-dim$dim-iter$iteration.ri $outfolder/$(basename "$matrix")-t$t-dim$dim-iter$iteration.ri-elemental-space $dim $t # reduce matrix by random indexing + done done done done diff --git a/scripts/run_SGNS.sh b/scripts/run_SGNS.sh index 675d95ef..6fcc76b6 100644 --- a/scripts/run_SGNS.sh +++ b/scripts/run_SGNS.sh @@ -7,7 +7,13 @@ do do for iteration in "${iterations[@]}" do - python3 representations/sgns.py $corpDir $outfolder/win$windowSize-k$k-t$t-iter$iteration.sgns $windowSize $dim $k $t 0 5 # construct word2vec skip-gram embeddings + for dim in "${dims[@]}" + do + for ep in "${eps[@]}" + do + python3 representations/sgns.py $corpDir $outfolder/win$windowSize-k$k-t$t-dim$dim-ep$ep-iter$iteration.sgns $windowSize $dim $k $t 0 $ep # construct word2vec skip-gram embeddings + done + done done done done diff --git a/scripts/run_SGNS_VI.sh b/scripts/run_SGNS_VI.sh index 327d5ecf..92be3a9f 100644 --- a/scripts/run_SGNS_VI.sh +++ b/scripts/run_SGNS_VI.sh @@ -7,8 +7,14 @@ do do for iteration in "${iterations[@]}" do - python3 alignment/sgns_vi.py $infolder/win$windowSize-k$k-t$t-iter$iteration.sgns.model $corpDir2 $outfolder2/win$windowSize-k$k-t$t-iter$iteration.sgns-VI # construct word2vec skip-gram embeddings with vector initialization - scp $infolder/win$windowSize-k$k-t$t-iter$iteration.sgns $outfolder1/win$windowSize-k$k-t$t-iter$iteration.sgns-VI # copy initialization vectors as matrix for first time period + for dim in "${dims[@]}" + do + for ep in "${eps[@]}" + do + python3 alignment/sgns_vi.py $infolder/win$windowSize-k$k-t$t-dim$dim-ep$ep-iter$iteration.sgns.model $corpDir2 $outfolder2/win$windowSize-k$k-t$t-dim$dim-ep$ep-iter$iteration.sgns-VI # construct word2vec skip-gram embeddings with vector initialization + scp $infolder/win$windowSize-k$k-t$t-dim$dim-ep$ep-iter$iteration.sgns $outfolder1/win$windowSize-k$k-t$t-dim$dim-ep$ep-iter$iteration.sgns-VI # copy initialization vectors as matrix for first time period + done + done done done done diff 
--git a/scripts/run_SPR.sh b/scripts/run_SPR.sh index 02a7f854..52692b31 100644 --- a/scripts/run_SPR.sh +++ b/scripts/run_SPR.sh @@ -1,7 +1,12 @@ -resultfiles=($resultfolder/*.tsv) -for resultfile in "${resultfiles[@]}" -do - resultfileshort=${resultfile#$(dirname "$(dirname "$resultfile")")/} - python3 evaluation/spearman.py $goldscorefile $resultfile $(basename "$goldscorefile") $resultfileshort 0 1 >> $outfolder/spearman_scores.tsv # evaluate results with Spearman correlation -done +if [ -f $goldrankfile ]; # Check whether gold rank file exists +then + resultfiles=($resultfolder/*.tsv) + for resultfile in "${resultfiles[@]}" + do + resultfileshort=${resultfile#$(dirname "$(dirname "$resultfile")")/} + python3 evaluation/spr.py $goldrankfile $resultfile $(basename "$goldrankfile") $resultfileshort 0 1 >> $outfolder/spr.tsv # evaluate results with Spearman correlation + done +else + echo -e "Warning: No gold rank file found at $goldrankfile." +fi diff --git a/scripts/run_SRV.sh b/scripts/run_SRV.sh index 372a0fac..2d42e86f 100644 --- a/scripts/run_SRV.sh +++ b/scripts/run_SRV.sh @@ -1,5 +1,5 @@ -matrices=($matrixfolder1/!(*@(_rows|_columns|.model))) +matrices=($matrixfolder1/!(*@(_rows|_columns|.model*))) for matrix in "${matrices[@]}" do @@ -7,7 +7,10 @@ do do for t in "${ts[@]}" do - python3 alignment/srv_align.py -s 2 $matrix $matrixfolder2/$(basename "$matrix") $outfolder1/$(basename "$matrix")-t$t-iter$iteration-SRV $outfolder2/$(basename "$matrix")-t$t-iter$iteration-SRV $outfolder1/$(basename "$matrix")-t$t-iter$iteration-elemental-space $dim $t # construct random indexing matrices from count matrices with shared random vectors + for dim in "${dims[@]}" + do + python3 alignment/srv_align.py -s 2 $matrix $matrixfolder2/$(basename "$matrix") $outfolder1/$(basename "$matrix")-t$t-dim$dim-iter$iteration-SRV $outfolder2/$(basename "$matrix")-t$t-dim$dim-iter$iteration-SRV $outfolder1/$(basename "$matrix")-t$t-dim$dim-iter$iteration-elemental-space $dim $t # 
construct random indexing matrices from count matrices with shared random vectors + done done done done diff --git a/scripts/run_SVD.sh b/scripts/run_SVD.sh index a40aeeff..d3490e84 100644 --- a/scripts/run_SVD.sh +++ b/scripts/run_SVD.sh @@ -1,10 +1,13 @@ -matrices=($matrixfolder/!(*@(_rows|_columns|.model))) +matrices=($matrixfolder/!(*@(_rows|_columns|.model*))) for matrix in "${matrices[@]}" do for iteration in "${iterations[@]}" do - python3 representations/svd.py $matrix $outfolder/$(basename "$matrix")-iter$iteration.svd $dim 0.0 # reduce matrix by SVD + for dim in "${dims[@]}" + do + python3 representations/svd.py $matrix $outfolder/$(basename "$matrix")-dim$dim-iter$iteration.svd $dim 0.0 # reduce matrix by SVD + done done done diff --git a/scripts/run_TYPE.sh b/scripts/run_TYPE.sh index adb8cd00..1e7ca7c3 100644 --- a/scripts/run_TYPE.sh +++ b/scripts/run_TYPE.sh @@ -1,5 +1,5 @@ -matrices=($matrixfolder/!(*@(_rows|_columns|.model))) +matrices=($matrixfolder/!(*@(_rows|_columns|.model*))) for matrix in "${matrices[@]}" do diff --git a/scripts/run_durel.sh b/scripts/run_durel.sh index d63b64fb..cf67a539 100644 --- a/scripts/run_durel.sh +++ b/scripts/run_durel.sh @@ -6,12 +6,12 @@ wget https://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/wocc/dta19.tx ## Define global parameters ## # DURel parameters -declare -a parameterfile=scripts/parameters_durel.sh # corpus- and testset-specific parameter specifications +parameterfile=scripts/parameters_durel.sh # corpus- and testset-specific parameter specifications ## Get predictions from models ## # All models with similarity measures -declare -a globalmatrixfolderprefix=matrices/durel_sim # parent folder for matrices -declare -a globalresultfolderprefix=results/durel_sim # parent folder for results +globalmatrixfolderprefix=matrices/durel_sim # parent folder for matrices +globalresultfolderprefix=results/durel_sim # parent folder for results source $parameterfile # get corpus- and testset-specific 
parameters source scripts/make_results_sim.sh # Evaluate results @@ -20,8 +20,8 @@ outfolder=$globalresultfolder source scripts/run_SPR.sh # Get Spearman correlation of measure predictions with gold scores # All models with dispersion measures -declare -a globalmatrixfolderprefix=matrices/durel_disp # parent folder for matrices -declare -a globalresultfolderprefix=results/durel_disp # parent folder for results +globalmatrixfolderprefix=matrices/durel_disp # parent folder for matrices +globalresultfolderprefix=results/durel_disp # parent folder for results source $parameterfile # get corpus- and testset-specific parameters source scripts/make_results_disp.sh # Evaluate results @@ -30,8 +30,8 @@ outfolder=$globalresultfolder source scripts/run_SPR.sh # Get Spearman correlation of measure predictions with gold scores # All models with word injection -declare -a globalmatrixfolderprefix=matrices/durel_wi # parent folder for matrices -declare -a globalresultfolderprefix=results/durel_wi # parent folder for results +globalmatrixfolderprefix=matrices/durel_wi # parent folder for matrices +globalresultfolderprefix=results/durel_wi # parent folder for results source $parameterfile # get corpus- and testset-specific parameters ## Make word-injected corpus ## diff --git a/scripts/run_semcor.sh b/scripts/run_semcor.sh new file mode 100644 index 00000000..74072bda --- /dev/null +++ b/scripts/run_semcor.sh @@ -0,0 +1,61 @@ +### THIS SCRIPT PRODUCES PREDICTIONS AND EVALUATES THEM FOR ALL MODELS WITH SEMCOR PARAMETERS ### + +## Download corpora and testsets ## +wget https://www.ims.uni-stuttgart.de/documents/ressourcen/experiment-daten/semcor_lsc.zip -nc -P testsets/ +cd testsets/ && unzip -o semcor_lsc.zip && rm semcor_lsc.zip && cd .. +if [ ! 
-d corpora/semcor_lsc ]; +then + mv testsets/semcor_lsc/corpora corpora/semcor_lsc +else + rm -r testsets/semcor_lsc/corpora +fi + +## Define global parameters ## +# SEMCOR parameters +parameterfile=scripts/parameters_semcor.sh # corpus- and testset-specific parameter specifications + +## Get predictions from models ## +# All models with similarity measures +globalmatrixfolderprefix=matrices/semcor_sim # parent folder for matrices +globalresultfolderprefix=results/semcor_sim # parent folder for results +source $parameterfile # get corpus- and testset-specific parameters +source scripts/make_results_sim.sh +# Evaluate results +resultfolder=$resultfolder +outfolder=$globalresultfolder +source scripts/run_SPR.sh # Get Spearman correlation of measure predictions with gold scores +source scripts/run_AP.sh # Get Average Precision of measure predictions with gold classes + +# All models with dispersion measures +globalmatrixfolderprefix=matrices/semcor_disp # parent folder for matrices +globalresultfolderprefix=results/semcor_disp # parent folder for results +source $parameterfile # get corpus- and testset-specific parameters +source scripts/make_results_disp.sh +# Evaluate results +resultfolder=$resultfolder +outfolder=$globalresultfolder +source scripts/run_SPR.sh # Get Spearman correlation of measure predictions with gold scores +source scripts/run_AP.sh # Get Average Precision of measure predictions with gold classes + +# All models with word injection +globalmatrixfolderprefix=matrices/semcor_wi # parent folder for matrices +globalresultfolderprefix=results/semcor_wi # parent folder for results +source $parameterfile # get corpus- and testset-specific parameters + +## Make word-injected corpus ## +if [ ! 
-f $wiCorpDir/corpus_wi.txt.gz ]; +then + mkdir -p $wiCorpDir + corpDir1=$corpDir1 + corpDir2=$corpDir2 + outfile=$wiCorpDir/corpus_wi.txt + source scripts/run_WI.sh # Create combined word-injected corpus from corpus1 and corpus2 +fi + +source scripts/make_results_wi.sh + +# Evaluate results +resultfolder=$resultfolder +outfolder=$globalresultfolder +source scripts/run_SPR.sh # Get Spearman correlation of measure predictions with gold scores +source scripts/run_AP.sh # Get Average Precision of measure predictions with gold classes diff --git a/scripts/run_surel.sh b/scripts/run_surel.sh index c6117bd2..812f4d54 100644 --- a/scripts/run_surel.sh +++ b/scripts/run_surel.sh @@ -8,12 +8,12 @@ wget https://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/wocc/cook.txt ## Define global parameters ## # SURel parameters -declare -a parameterfile=scripts/parameters_surel.sh # corpus- and testset-specific parameter specifications +parameterfile=scripts/parameters_surel.sh # corpus- and testset-specific parameter specifications ## Get predictions from models ## # All models with similarity measures -declare -a globalmatrixfolderprefix=matrices/surel_sim # parent folder for matrices -declare -a globalresultfolderprefix=results/surel_sim # parent folder for results +globalmatrixfolderprefix=matrices/surel_sim # parent folder for matrices +globalresultfolderprefix=results/surel_sim # parent folder for results source $parameterfile # get corpus- and testset-specific parameters source scripts/make_results_sim.sh # Evaluate results @@ -22,8 +22,8 @@ outfolder=$globalresultfolder source scripts/run_SPR.sh # Get Spearman correlation of measure predictions with gold scores # All models with dispersion measures -declare -a globalmatrixfolderprefix=matrices/surel_disp # parent folder for matrices -declare -a globalresultfolderprefix=results/surel_disp # parent folder for results +globalmatrixfolderprefix=matrices/surel_disp # parent folder for matrices 
+globalresultfolderprefix=results/surel_disp # parent folder for results source $parameterfile # get corpus- and testset-specific parameters source scripts/make_results_disp.sh # Evaluate results @@ -32,8 +32,8 @@ outfolder=$globalresultfolder source scripts/run_SPR.sh # Get Spearman correlation of measure predictions with gold scores # All models with word injection -declare -a globalmatrixfolderprefix=matrices/surel_wi # parent folder for matrices -declare -a globalresultfolderprefix=results/surel_wi # parent folder for results +globalmatrixfolderprefix=matrices/surel_wi # parent folder for matrices +globalresultfolderprefix=results/surel_wi # parent folder for results source $parameterfile # get corpus- and testset-specific parameters ## Make word-injected corpus ## diff --git a/scripts/run_test.sh b/scripts/run_test.sh index 7eb6a6d6..c9aaa2d1 100644 --- a/scripts/run_test.sh +++ b/scripts/run_test.sh @@ -2,35 +2,38 @@ ## Define global parameters ## # Test parameters -declare -a parameterfile=scripts/parameters_test.sh # corpus- and testset-specific parameter specifications +parameterfile=scripts/parameters_test.sh # corpus- and testset-specific parameter specifications ## Get predictions from models ## # All models with similarity measures -declare -a globalmatrixfolderprefix=matrices/test_sim # parent folder for matrices -declare -a globalresultfolderprefix=results/test_sim # parent folder for results +globalmatrixfolderprefix=matrices/test_sim # parent folder for matrices +globalresultfolderprefix=results/test_sim # parent folder for results source $parameterfile # get corpus- and testset-specific parameters source scripts/make_results_sim.sh # Evaluate results resultfolder=$resultfolder outfolder=$globalresultfolder source scripts/run_SPR.sh # Get Spearman correlation of measure predictions with gold scores +source scripts/run_AP.sh # Get Average Precision of measure predictions with gold classes # All models with dispersion measures -declare -a 
globalmatrixfolderprefix=matrices/test_disp # parent folder for matrices -declare -a globalresultfolderprefix=results/test_disp # parent folder for results +globalmatrixfolderprefix=matrices/test_disp # parent folder for matrices +globalresultfolderprefix=results/test_disp # parent folder for results source $parameterfile # get corpus- and testset-specific parameters source scripts/make_results_disp.sh # Evaluate results resultfolder=$resultfolder outfolder=$globalresultfolder source scripts/run_SPR.sh # Get Spearman correlation of measure predictions with gold scores +source scripts/run_AP.sh # Get Average Precision of measure predictions with gold classes # All models with word injection -declare -a globalmatrixfolderprefix=matrices/test_wi # parent folder for matrices -declare -a globalresultfolderprefix=results/test_wi # parent folder for results +globalmatrixfolderprefix=matrices/test_wi # parent folder for matrices +globalresultfolderprefix=results/test_wi # parent folder for results source $parameterfile # get corpus- and testset-specific parameters source scripts/make_results_wi.sh # Evaluate results resultfolder=$resultfolder outfolder=$globalresultfolder source scripts/run_SPR.sh # Get Spearman correlation of measure predictions with gold scores +source scripts/run_AP.sh # Get Average Precision of measure predictions with gold classes diff --git a/testsets/durel/gold.tsv b/testsets/durel/rank.tsv similarity index 100% rename from testsets/durel/gold.tsv rename to testsets/durel/rank.tsv diff --git a/testsets/durel/targets.tsv b/testsets/durel/targets.tsv index 19803af6..dfee3c5f 100644 --- a/testsets/durel/targets.tsv +++ b/testsets/durel/targets.tsv @@ -1,19 +1,19 @@ -Abend Abend -Anstalt Anstalt -Anstellung Anstellung -Bilanz Bilanz -billig billig -Donnerwetter Donnerwetter -englisch englisch -Feder Feder -Feine Feine -geharnischt geharnischt -locker locker -Motiv Motiv -Museum Museum -packen packen -Presse Presse -Reichstag Reichstag -technisch 
technisch -Vorwort Vorwort -Zufall Zufall +Abend +Anstalt +Anstellung +Bilanz +billig +Donnerwetter +englisch +Feder +Feine +geharnischt +locker +Motiv +Museum +packen +Presse +Reichstag +technisch +Vorwort +Zufall diff --git a/testsets/durel/targets_wi.tsv b/testsets/durel/targets_wi.tsv deleted file mode 100644 index 89f8426f..00000000 --- a/testsets/durel/targets_wi.tsv +++ /dev/null @@ -1,19 +0,0 @@ -Abend_ Abend -Anstalt_ Anstalt -Anstellung_ Anstellung -Bilanz_ Bilanz -billig_ billig -Donnerwetter_ Donnerwetter -englisch_ englisch -Feder_ Feder -Feine_ Feine -geharnischt_ geharnischt -locker_ locker -Motiv_ Motiv -Museum_ Museum -packen_ packen -Presse_ Presse -Reichstag_ Reichstag -technisch_ technisch -Vorwort_ Vorwort -Zufall_ Zufall diff --git a/testsets/surel/gold.tsv b/testsets/surel/rank.tsv similarity index 100% rename from testsets/surel/gold.tsv rename to testsets/surel/rank.tsv diff --git a/testsets/surel/targets.tsv b/testsets/surel/targets.tsv index beb4f65b..d9c2f3c5 100644 --- a/testsets/surel/targets.tsv +++ b/testsets/surel/targets.tsv @@ -1,21 +1,21 @@ -abschrecken abschrecken -Blech Blech -Eiweiß Eiweiß -Form Form -Gemüse Gemüse -Gericht Gericht -Glas Glas -Hamburger Hamburger -Mandel Mandel -Messer Messer -Paprika Paprika -Prise Prise -Rum Rum -Salz Salz -schlagen schlagen -Schnee Schnee -Schnittlauch Schnittlauch -Schokolade Schokolade -Schuß Schuß -Strudel Strudel -trennen trennen +abschrecken +Blech +Eiweiß +Form +Gemüse +Gericht +Glas +Hamburger +Mandel +Messer +Paprika +Prise +Rum +Salz +schlagen +Schnee +Schnittlauch +Schokolade +Schuß +Strudel +trennen diff --git a/testsets/surel/targets_wi.tsv b/testsets/surel/targets_wi.tsv deleted file mode 100644 index d58d7d20..00000000 --- a/testsets/surel/targets_wi.tsv +++ /dev/null @@ -1,21 +0,0 @@ -abschrecken_ abschrecken -Blech_ Blech -Eiweiß_ Eiweiß -Form_ Form -Gemüse_ Gemüse -Gericht_ Gericht -Glas_ Glas -Hamburger_ Hamburger -Mandel_ Mandel -Messer_ Messer -Paprika_ Paprika -Prise_ 
Prise -Rum_ Rum -Salz_ Salz -schlagen_ schlagen -Schnee_ Schnee -Schnittlauch_ Schnittlauch -Schokolade_ Schokolade -Schuß_ Schuß -Strudel_ Strudel -trennen_ trennen diff --git a/testsets/test/class.tsv b/testsets/test/class.tsv new file mode 100644 index 00000000..5a4d7b6a --- /dev/null +++ b/testsets/test/class.tsv @@ -0,0 +1,4 @@ +0 +1 +0 +0 diff --git a/testsets/test/gold.tsv b/testsets/test/rank.tsv similarity index 100% rename from testsets/test/gold.tsv rename to testsets/test/rank.tsv diff --git a/testsets/test/targets.tsv b/testsets/test/targets.tsv index 0ed987f5..d2d79946 100644 --- a/testsets/test/targets.tsv +++ b/testsets/test/targets.tsv @@ -1,4 +1,4 @@ -Gott Gott -und und -haben haben -göttlich göttlich +Gott +und +haben +göttlich diff --git a/testsets/test/targets_input.tsv b/testsets/test/targets_input.tsv new file mode 100644 index 00000000..0ed987f5 --- /dev/null +++ b/testsets/test/targets_input.tsv @@ -0,0 +1,4 @@ +Gott Gott +und und +haben haben +göttlich göttlich
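
Review note on the glob change that recurs across the `run_*.sh` scripts above (`.model` becomes `.model*`): the pattern `!(*@(_rows|_columns|.model*))` selects every file in the matrix folder whose name does not match one of the listed alternatives. With the old pattern only names ending exactly in `.model` were skipped; with the new one, any name containing `.model` followed by further characters is skipped as well. The following is a minimal sketch of that difference, not part of the commit. The file names and the scratch folder are hypothetical, and it assumes `extglob` is enabled before the pattern is parsed, which this diff does not show.

```bash
#!/usr/bin/env bash
# Sketch only: compares the old and new extglob patterns on hypothetical files.
shopt -s extglob nullglob        # extglob enables !(...) and @(...); assumed to be set in the real scripts

matrixfolder=$(mktemp -d)        # hypothetical scratch folder standing in for $matrixfolder
touch "$matrixfolder/m.sgns" \
      "$matrixfolder/m.sgns_rows" \
      "$matrixfolder/m.sgns_columns" \
      "$matrixfolder/m.sgns.model" \
      "$matrixfolder/m.sgns.model.aux.npy"   # hypothetical auxiliary file written next to a model

# Old pattern: skips only names ending in _rows, _columns or .model
old=("$matrixfolder"/!(*@(_rows|_columns|.model)))
# New pattern: also skips names where .model is followed by more characters
new=("$matrixfolder"/!(*@(_rows|_columns|.model*)))

printf 'old pattern matched: %s\n' "${old[@]##*/}"
printf 'new pattern matched: %s\n' "${new[@]##*/}"
```

Under these assumptions the old pattern would still hand the auxiliary file to the loop as if it were a matrix, while the new pattern leaves only `m.sgns`.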