Make former `-a` option the standard, improve efficiency
garrafao committed Mar 23, 2020
1 parent c3def01 commit a1cf695
Showing 7 changed files with 69 additions and 239 deletions. (Only the diffs for README.md, alignment/sgns_vi.py, and alignment/srv_align.py are reproduced below.)
README.md (31 changes: 17 additions & 14 deletions)
@@ -64,7 +64,7 @@ The scripts assume a corpus format of one sentence per line in UTF-8 encoded (op
 | Count | `representations/count.py` | VSM | |
 | PPMI | `representations/ppmi.py` | VSM | |
 | SVD | `representations/svd.py` | VSM | |
-| RI | `representations/ri.py` | VSM | - use `-a` for good performance |
+| RI | `representations/ri.py` | VSM | |
 | SGNS | `representations/sgns.py` | VSM | |
 | SCAN | [repository](https://github.com/ColiLea/scan) | TPM | - different corpus input format |

@@ -75,7 +75,7 @@ Table: VSM=Vector Space Model, TPM=Topic Model
 |Name | Code | Applicability | Comment |
 | --- | --- | --- | --- |
 | CI | `alignment/ci_align.py` | Count, PPMI | |
-| SRV | `alignment/srv_align.py` | RI | - use `-a` for good performance <br> - consider using the efficient and more powerful [TRIPY](https://github.com/Garrafao/TRIPY) |
+| SRV | `alignment/srv_align.py` | RI | - consider using the more powerful [TRIPY](https://github.com/Garrafao/TRIPY) |
 | OP | `alignment/map_embeddings.py` | SVD, RI, SGNS | - drawn from [VecMap](https://github.com/artetxem/vecmap) <br> - for OP- and OP+ see `scripts/` |
 | VI | `alignment/sgns_vi.py` | SGNS | - bug fixes 27/12/19 (see script for details) |
 | WI | `alignment/wi.py` | Count, PPMI, SVD, RI, SGNS | - consider using the more advanced [Temporal Referencing](https://github.com/Garrafao/TemporalReferencing) |
@@ -99,11 +99,11 @@ Find detailed notes on model performances and optimal parameter settings in [the
 
 The evaluation framework of this repository is based on the comparison of a set of target words across two corpora. Hence, models can be evaluated on a triple (dataset, corpus1, corpus2), where the dataset provides gold values for the change of target words between corpus1 and corpus2.
 
-| Dataset | Corpus 1 | Corpus 2 | Download | Comment |
-| --- | --- | --- | --- | --- |
-| DURel | DTA18 | DTA19 | [Dataset](https://www.ims.uni-stuttgart.de/data/durel), [Corpora](https://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/wocc) | - version from Schlechtweg et al. (2019) at `testsets/durel/` |
-| SURel | SDEWAC | COOK | [Dataset](https://www.ims.uni-stuttgart.de/data/surel), [Corpora](https://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/wocc) | - version from Schlechtweg et al. (2019) at `testsets/surel/` |
-| SemCor LSC | SEMCOR1 | SEMCOR2 | [Dataset](https://www.ims.uni-stuttgart.de/data/lsc-simul), [Corpora](https://www.ims.uni-stuttgart.de/data/lsc-simul) | |
+| Dataset | Language | Corpus 1 | Corpus 2 | Download | Comment |
+| --- | --- | --- | --- | --- | --- |
+| DURel | German | DTA18 | DTA19 | [Dataset](https://www.ims.uni-stuttgart.de/data/durel), [Corpora](https://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/wocc) | - version from Schlechtweg et al. (2019) at `testsets/durel/` |
+| SURel | German | SDEWAC | COOK | [Dataset](https://www.ims.uni-stuttgart.de/data/surel), [Corpora](https://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/wocc) | - version from Schlechtweg et al. (2019) at `testsets/surel/` |
+| SemCor LSC | English | SEMCOR1 | SEMCOR2 | [Dataset](https://www.ims.uni-stuttgart.de/data/lsc-simul), [Corpora](https://www.ims.uni-stuttgart.de/data/lsc-simul) | |
 
 We provide several evaluation pipelines, downloading the corpora and evaluating the models on the above-mentioned datasets, see [pipeline](#pipeline).
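(Editorial note: the evaluation described in the context above boils down to rank-correlating a model's predicted change scores with the dataset's gold values. The following is a minimal sketch under assumed file names and a word<TAB>score file format, not the repository's actual pipeline code.)

```python
# Sketch: Spearman correlation between predicted and gold change scores.
# File paths and the word<TAB>score format are assumptions for illustration.
from scipy.stats import spearmanr

def load_scores(path):
    """Read 'word<TAB>score' lines into a dict."""
    scores = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            word, value = line.strip().split('\t')
            scores[word] = float(value)
    return scores

gold = load_scores('testsets/durel/gold.tsv')         # hypothetical path
predicted = load_scores('results/change_scores.tsv')  # hypothetical path

# Evaluate only targets present in both rankings
targets = sorted(set(gold) & set(predicted))
rho, p = spearmanr([gold[w] for w in targets],
                   [predicted[w] for w in targets])
print(f"Spearman's rho = {rho:.3f} (p = {p:.3g})")
```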

@@ -140,6 +140,7 @@ As is the scripts will reproduce the results from Schlechtweg et al. (2019) and
 
 - September 1, 2019: Python scripts were updated from Python 2 to Python 3.
 - December 27, 2019: bug fixes in `alignment/sgns_vi.py` (see script for details)
+- March 23, 2020: updates in `representations/ri.py` and `alignment/srv_align.py` (see scripts for details)
 
 ### Error Sources
 
@@ -153,19 +154,21 @@ BibTex
     title = {{A Wind of Change: Detecting and Evaluating Lexical Semantic Change across Times and Domains}},
     author = {Dominik Schlechtweg and Anna H\"{a}tty and Marco del Tredici and Sabine {Schulte im Walde}},
     booktitle = {Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics},
-    year = {2019},
-    address = {Florence, Italy},
-    publisher = {Association for Computational Linguistics},
-    pages = {732--746}
+    year = {2019},
+    address = {Florence, Italy},
+    publisher = {Association for Computational Linguistics},
+    pages = {732--746},
+    doi = {10.18653/v1/P19-1072}
 }
 ```
 ```
 @inproceedings{SchlechtwegWalde20,
     title = {{Simulating Lexical Semantic Change from Sense-Annotated Data}},
     author = {Dominik Schlechtweg and Sabine {Schulte im Walde}},
     year = {2020},
-    booktitle = {{The Evolution of Language: Proceedings of the 13th International Conference (EVOLANGXIII)}},
-    editor = {C. Cuskley and M. Flaherty and H. Little and Luke McCrohon and A. Ravignani and T. Verhoef},
-    publisher = {Online at {}},
+    booktitle = {{The Evolution of Language: Proceedings of the 13th International Conference (EvoLang13)}},
+    editor = {Ravignani, A. and Barbieri, C. and Martins, M. and Flaherty, M. and Jadoul, Y. and Lattenkamp, E. and Little, H. and Mudd, K. and Verhoef, T.},
+    url = {http://brussels.evolang.org/proceedings/paper.html?nr=9},
+    doi = {10.17617/2.3190925}
 }
 ```
alignment/sgns_vi.py (4 changes: 2 additions & 2 deletions)
@@ -26,7 +26,7 @@ def main():
     Arguments:
         <modelPath> = model for initialization
-        <corpDir> = path to corpus directory with zipped files, each sentence in form 'year\tword1 word2 word3...'
+        <corpDir> = path to corpus directory with zipped files
        <outPath> = output path for vectors
 
     Options:
@@ -58,7 +58,7 @@ def main():
     # Load model
     model = Word2Vec.load(modelPath)
 
-    # Intersect vocabulary
+    # Build vocabulary
     vocab_sentences = PathLineSentences(corpDir)
     logging.getLogger('gensim').setLevel(logging.ERROR)
     model.build_vocab(vocab_sentences, update=True)
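(Editorial note: the lines above implement the Vector Initialization idea, an SGNS model trained on corpus 1 is loaded, its vocabulary is extended with corpus 2, and training then continues on corpus 2. Below is a compressed, self-contained sketch of that pattern; paths and the save format are placeholders, not the script's actual interface.)

```python
# Sketch of SGNS with Vector Initialization (VI): continue training an
# existing model on a second corpus. Paths are placeholders.
from gensim.models.word2vec import Word2Vec, PathLineSentences

model = Word2Vec.load('vectors/corpus1.model')     # SGNS trained on corpus 1
sentences = PathLineSentences('corpora/corpus2/')  # corpus 2, one sentence per line

model.build_vocab(sentences, update=True)          # add corpus 2's new words
model.train(sentences, total_examples=model.corpus_count, epochs=model.epochs)
model.wv.save_word2vec_format('vectors/corpus2.w2v')
```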
alignment/srv_align.py (142 changes: 21 additions & 121 deletions)
@@ -6,7 +6,7 @@
 import time
 import numpy as np
 from sklearn.random_projection import sparse_random_matrix
-from scipy.sparse import lil_matrix, csc_matrix, hstack, vstack
+from scipy.sparse import csr_matrix
 from utils_ import Space
 
 
@@ -20,21 +20,19 @@ def main():
     args = docopt('''Create two aligned low-dimensional vector spaces by sparse random indexing from two co-occurrence matrices.
 
     Usage:
-        srv_align.py [-l] (-s <seeds> | -a) <matrixPath1> <matrixPath2> <outPath1> <outPath2> <outPathElement> <dim> <t>
+        srv_align.py [-l] <matrixPath1> <matrixPath2> <outPath1> <outPath2> <dim>
 
-        <seeds> = number of non-zero values in each random vector
         <matrixPath1> = path to matrix1
         <matrixPath2> = path to matrix2
         <outPath1> = output path for aligned space 1
         <outPath2> = output path for aligned space 2
-        <outPathElement> = output path for elemental space (context vectors)
         <dim> = number of dimensions for random vectors
-        <t> = threshold for downsampling (if t=None, no subsampling is applied)
 
     Options:
         -l, --len normalize final vectors to unit length
-        -s, --see specify number of seeds manually
-        -a, --aut calculate number of seeds automatically as proposed in [1,2]
+    Note:
+        Assumes intersected and ordered columns. Parameters -s, -a and <t> have been removed from an earlier version for efficiency. Also columns are now intersected instead of unified.
 
     References:
         [1] Ping Li, T. Hastie and K. W. Church, 2006,
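(Editorial note: since the rewritten script assumes intersected and ordered columns, see the Note in the docstring above, here is a minimal sketch of that preprocessing step. It relies only on the `Space` attributes that appear elsewhere in this diff, `matrix`, `rows`, `columns`, `column2id`; the constructor call with a path, the `save` method, and the file paths are taken from the diff or hypothetical.)

```python
# Sketch: intersect and order the columns of two count spaces before
# running srv_align.py. Space attributes follow utils_.py as used in
# this diff; the paths are hypothetical.
from utils_ import Space

space1 = Space('matrices/corpus1.count')
space2 = Space('matrices/corpus2.count')

# Shared context words, in one fixed order for both spaces
common = sorted(set(space1.columns) & set(space2.columns))
idx1 = [space1.column2id[c] for c in common]
idx2 = [space2.column2id[c] for c in common]

# Reorder the columns of each matrix to the shared order
matrix1 = space1.matrix.tocsc()[:, idx1].tocsr()
matrix2 = space2.matrix.tocsc()[:, idx2].tocsr()

Space(matrix=matrix1, rows=space1.rows, columns=common).save('matrices/corpus1.intersected')
Space(matrix=matrix2, rows=space2.rows, columns=common).save('matrices/corpus2.intersected')
```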
@@ -46,134 +44,37 @@ def main():
     ''')
 
     is_len = args['--len']
-    is_seeds = args['--see']
-    if is_seeds:
-        seeds = int(args['<seeds>'])
-    is_aut = args['--aut']
     matrixPath1 = args['<matrixPath1>']
     matrixPath2 = args['<matrixPath2>']
     outPath1 = args['<outPath1>']
     outPath2 = args['<outPath2>']
-    outPathElement = args['<outPathElement>']
     dim = int(args['<dim>'])
-    if args['<t>']=='None':
-        t = None
-    else:
-        t = float(args['<t>'])
 
 
     logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
     logging.info(__file__.upper())
     start_time = time.time()
 
     # Load input matrices
-    space1 = Space(matrixPath1)
-    matrix1 = space1.matrix
-    space2 = Space(matrixPath2)
-    matrix2 = space2.matrix
-
-    # Get mappings between rows/columns and words
-    rows1 = space1.rows
-    id2row1 = space1.id2row
-    row2id1 = space1.row2id
-    columns1 = space1.columns
-    column2id1 = space1.column2id
-    rows2 = space2.rows
-    id2row2 = space2.id2row
-    row2id2 = space2.row2id
-    columns2 = space2.columns
-    column2id2 = space2.column2id
+    countSpace1 = Space(matrixPath1)
+    countMatrix1 = countSpace1.matrix
+    rows1 = countSpace1.rows
+    columns1 = countSpace1.columns
 
+    countSpace2 = Space(matrixPath2)
+    countMatrix2 = countSpace2.matrix
+    rows2 = countSpace2.rows
+    columns2 = countSpace2.columns
 
-    # Get union of rows and columns in both spaces
-    unified_rows = sorted(list(set(rows1).union(rows2)))
-    unified_columns = sorted(list(set(columns1).union(columns2)))
-    columns_diff1 = sorted(list(set(unified_columns) - set(columns1)))
-    columns_diff2 = sorted(list(set(unified_columns) - set(columns2)))
+    # Generate random vectors
+    randomMatrix = csr_matrix(sparse_random_matrix(dim,len(columns1)).toarray().T)
 
-    # Get mappings of indices of columns in original spaces to indices of columns in unified space
-    c2i = {w: i for i, w in enumerate(unified_columns)}
-    cj2i1 = {j: c2i[w] for j, w in enumerate(columns1+columns_diff1)}
-    cj2i2 = {j: c2i[w] for j, w in enumerate(columns2+columns_diff2)}
-
-    if t!=None:
-        rows_diff1 = list(set(unified_rows) - set(rows1))
-        rows_diff2 = list(set(unified_rows) - set(rows2))
-
-        r2i = {w: i for i, w in enumerate(unified_rows)}
-        rj2i1 = {j: r2i[w] for j, w in enumerate(rows1+rows_diff1)}
-        rj2i2 = {j: r2i[w] for j, w in enumerate(rows2+rows_diff2)}
-
-    # Build spaces with unified COLUMNS
-    new_columns1 = csc_matrix((len(rows1),len(columns_diff1))) # Get empty columns for additional context words
-    unified_matrix1 = csc_matrix(hstack((matrix1,new_columns1)))[:,sorted(cj2i1, key=cj2i1.get)] # First concatenate matrix and empty columns and then order columns according to unified_columns
-
-    new_columns2 = csc_matrix((len(rows2),len(columns_diff2)))
-    unified_matrix2 = csc_matrix(hstack((matrix2,new_columns2)))[:,sorted(cj2i2, key=cj2i2.get)]
+    logging.info("Multiplying matrices")
+    reducedMatrix1 = np.dot(countMatrix1,randomMatrix)
+    reducedMatrix2 = np.dot(countMatrix2,randomMatrix)
 
-    # Build spaces with unified ROWS
-    new_rows1 = csc_matrix((len(rows_diff1),len(unified_columns)))
-    final_unified_matrix1 = csc_matrix(vstack((unified_matrix1,new_rows1)))[sorted(rj2i1, key=rj2i1.get)]
-
-    new_rows2 = csc_matrix((len(rows_diff2),len(unified_columns)))
-    final_unified_matrix2 = csc_matrix(vstack((unified_matrix2,new_rows2)))[sorted(rj2i2, key=rj2i2.get)]
-
-    # Add up final unified matrices
-    common_unified_matrix = np.add(final_unified_matrix1,final_unified_matrix2)
-
-    # Get number of total occurrences of any word
-    totalOcc = np.sum(common_unified_matrix)
-
-    # Define function for downsampling
-    downsample = lambda f: np.sqrt(float(t)/f) if f>t else 1.0
-    downsample = np.vectorize(downsample)
-
-    # Get total normalized co-occurrence frequency of all contexts in both spaces
-    context_freqs = np.array(common_unified_matrix.sum(axis=0)/totalOcc)[0]
-
-
-    ## Generate ternary random vectors
-    if is_seeds:
-        elementalMatrix = lil_matrix((len(unified_columns),dim))
-        # Generate base vector for random vectors
-        baseVector = np.zeros(dim) # Note: Make sure that number of seeds is not greater than dimensions
-        for i in range(0,int(seeds/2)):
-            baseVector[i] = 1.0
-        for i in range(int(seeds/2),seeds):
-            baseVector[i] = -1.0
-        for i in range(len(unified_columns)): # To-do: make this more efficient by generating random indices for a whole array
-            np.random.shuffle(baseVector)
-            elementalMatrix[i] = baseVector
-    if is_aut:
-        elementalMatrix = sparse_random_matrix(dim,len(unified_columns)).T
-
-    # Initialize target vectors
-    alignedMatrix1 = np.zeros((len(rows1),dim))
-    alignedMatrix2 = np.zeros((len(rows2),dim))
-
-
-    # Iterate over rows of space, find context words and update aligned matrix with low-dimensional random vectors of these context words
-    for (matrix,id2row,cj2i,alignedMatrix) in [(matrix1,id2row1,cj2i1,alignedMatrix1),(matrix2,id2row2,cj2i2,alignedMatrix2)]:
-        # Iterate over targets
-        for i in id2row:
-            # Get co-occurrence values as matrix
-            m = matrix[i]
-            # Get nonzero indexes
-            nonzeros = m.nonzero()
-            nonzeros = [cj2i[j] for j in nonzeros[1]]
-            data = m.data
-            pos_context_vectors = elementalMatrix[nonzeros]
-            if t!=None:
-                # Apply subsampling
-                rfs = context_freqs[nonzeros]
-                rfs = downsample(rfs)
-                data *= rfs
-            # Weight context vectors by occurrence frequency
-            pos_context_vectors = pos_context_vectors.multiply(data.reshape(-1,1))
-            # Add up context vectors and store as row for target
-            alignedMatrix[i] = np.sum(pos_context_vectors, axis=0)
-
-    outSpace1 = Space(matrix=alignedMatrix1, rows=rows1, columns=[])
-    outSpace2 = Space(matrix=alignedMatrix2, rows=rows2, columns=[])
+    outSpace1 = Space(matrix=reducedMatrix1, rows=rows1, columns=[])
+    outSpace2 = Space(matrix=reducedMatrix2, rows=rows2, columns=[])
 
     if is_len:
         # L2-normalize vectors
@@ -183,7 +84,6 @@ def main():
     # Save the matrices
     outSpace1.save(outPath1)
    outSpace2.save(outPath2)
-    Space(matrix=elementalMatrix, rows=unified_columns, columns=[]).save(outPathElement)
 
     logging.info("--- %s seconds ---" % (time.time() - start_time))

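(Editorial note: the net effect of the srv_align.py rewrite above is that both count matrices are reduced with one shared sparse random projection, which is what keeps the two low-dimensional spaces aligned. Below is a self-contained sketch of that operation with toy dimensions and random data; the real script loads its matrices through `utils_.Space`.)

```python
# Sketch: align two count matrices by multiplying both with the same
# sparse random projection. Toy dimensions and random data for illustration.
from scipy.sparse import csr_matrix, random as sparse_random
from sklearn.random_projection import sparse_random_matrix

vocab_size, dim = 1000, 300

# One shared projection matrix (vocab_size x dim); column i of both count
# matrices must refer to the same context word for the sharing to make sense.
projection = csr_matrix(sparse_random_matrix(dim, vocab_size).toarray().T)

count1 = sparse_random(50, vocab_size, density=0.05, format='csr')  # toy counts, corpus 1
count2 = sparse_random(60, vocab_size, density=0.05, format='csr')  # toy counts, corpus 2

reduced1 = count1.dot(projection)  # (50 x dim) aligned space 1
reduced2 = count2.dot(projection)  # (60 x dim) aligned space 2
```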