Make former `-a` option the standard, improve efficiency
garrafao committed Mar 23, 2020
1 parent c3def01 commit a1cf695
Showing 7 changed files with 69 additions and 239 deletions. (Only the diffs for README.md, alignment/sgns_vi.py, and alignment/srv_align.py are reproduced below.)
README.md (31 changes: 17 additions & 14 deletions)
@@ -64,7 +64,7 @@ The scripts assume a corpus format of one sentence per line in UTF-8 encoded (op
 | Count | `representations/count.py` | VSM | |
 | PPMI | `representations/ppmi.py` | VSM | |
 | SVD | `representations/svd.py` | VSM | |
-| RI | `representations/ri.py` | VSM | - use `-a` for good performance |
+| RI | `representations/ri.py` | VSM | |
 | SGNS | `representations/sgns.py` | VSM | |
 | SCAN | [repository](https://github.com/ColiLea/scan) | TPM | - different corpus input format |

@@ -75,7 +75,7 @@ Table: VSM=Vector Space Model, TPM=Topic Model
 |Name | Code | Applicability | Comment |
 | --- | --- | --- | --- |
 | CI | `alignment/ci_align.py` | Count, PPMI | |
-| SRV | `alignment/srv_align.py` | RI | - use `-a` for good performance <br> - consider using the efficient and more powerful [TRIPY](https://github.com/Garrafao/TRIPY) |
+| SRV | `alignment/srv_align.py` | RI | - consider using the more powerful [TRIPY](https://github.com/Garrafao/TRIPY) |
 | OP | `alignment/map_embeddings.py` | SVD, RI, SGNS | - drawn from [VecMap](https://github.com/artetxem/vecmap) <br> - for OP- and OP+ see `scripts/` |
 | VI | `alignment/sgns_vi.py` | SGNS | - bug fixes 27/12/19 (see script for details) |
 | WI | `alignment/wi.py` | Count, PPMI, SVD, RI, SGNS | - consider using the more advanced [Temporal Referencing](https://github.com/Garrafao/TemporalReferencing) |
@@ -99,11 +99,11 @@ Find detailed notes on model performances and optimal parameter settings in [the
 
 The evaluation framework of this repository is based on the comparison of a set of target words across two corpora. Hence, models can be evaluated on a triple (dataset, corpus1, corpus2), where the dataset provides gold values for the change of target words between corpus1 and corpus2.
 
-| Dataset | Corpus 1 | Corpus 2 | Download | Comment |
-| --- | --- | --- | --- | --- |
-| DURel | DTA18 | DTA19 | [Dataset](https://www.ims.uni-stuttgart.de/data/durel), [Corpora](https://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/wocc) | - version from Schlechtweg et al. (2019) at `testsets/durel/` |
-| SURel | SDEWAC | COOK | [Dataset](https://www.ims.uni-stuttgart.de/data/surel), [Corpora](https://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/wocc) | - version from Schlechtweg et al. (2019) at `testsets/surel/` |
-| SemCor LSC | SEMCOR1 | SEMCOR2 | [Dataset](https://www.ims.uni-stuttgart.de/data/lsc-simul), [Corpora](https://www.ims.uni-stuttgart.de/data/lsc-simul) | |
+| Dataset | Language | Corpus 1 | Corpus 2 | Download | Comment |
+| --- | --- | --- | --- | --- | --- |
+| DURel | German | DTA18 | DTA19 | [Dataset](https://www.ims.uni-stuttgart.de/data/durel), [Corpora](https://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/wocc) | - version from Schlechtweg et al. (2019) at `testsets/durel/` |
+| SURel | German | SDEWAC | COOK | [Dataset](https://www.ims.uni-stuttgart.de/data/surel), [Corpora](https://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/wocc) | - version from Schlechtweg et al. (2019) at `testsets/surel/` |
+| SemCor LSC | English | SEMCOR1 | SEMCOR2 | [Dataset](https://www.ims.uni-stuttgart.de/data/lsc-simul), [Corpora](https://www.ims.uni-stuttgart.de/data/lsc-simul) | |
 
 We provide several evaluation pipelines, downloading the corpora and evaluating the models on the above-mentioned datasets, see [pipeline](#pipeline).
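(Editorial note: the evaluation described in the context above boils down to rank-correlating a model's predicted change scores with the dataset's gold values. The following is a minimal sketch under assumed file names and a word<TAB>score file format, not the repository's actual pipeline code.)

```python
# Sketch: Spearman correlation between predicted and gold change scores.
# File paths and the word<TAB>score format are assumptions for illustration.
from scipy.stats import spearmanr

def load_scores(path):
    """Read 'word<TAB>score' lines into a dict."""
    scores = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            word, value = line.strip().split('\t')
            scores[word] = float(value)
    return scores

gold = load_scores('testsets/durel/gold.tsv')         # hypothetical path
predicted = load_scores('results/change_scores.tsv')  # hypothetical path

# Evaluate only targets present in both rankings
targets = sorted(set(gold) & set(predicted))
rho, p = spearmanr([gold[w] for w in targets],
                   [predicted[w] for w in targets])
print(f"Spearman's rho = {rho:.3f} (p = {p:.3g})")
```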

@@ -140,6 +140,7 @@ As is the scripts will reproduce the results from Schlechtweg et al. (2019) and
 
 - September 1, 2019: Python scripts were updated from Python 2 to Python 3.
 - December 27, 2019: bug fixes in `alignment/sgns_vi.py` (see script for details)
+- March 23, 2020: updates in `representations/ri.py` and `alignment/srv_align.py` (see scripts for details)
 
 ### Error Sources
 
@@ -153,19 +154,21 @@ BibTex
     title = {{A Wind of Change: Detecting and Evaluating Lexical Semantic Change across Times and Domains}},
     author = {Dominik Schlechtweg and Anna H\"{a}tty and Marco del Tredici and Sabine {Schulte im Walde}},
     booktitle = {Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics},
-    year = {2019},
-    address = {Florence, Italy},
-    publisher = {Association for Computational Linguistics},
-    pages = {732--746}
+    year = {2019},
+    address = {Florence, Italy},
+    publisher = {Association for Computational Linguistics},
+    pages = {732--746},
+    doi = {10.18653/v1/P19-1072}
 }
 ```
 ```
 @inproceedings{SchlechtwegWalde20,
     title = {{Simulating Lexical Semantic Change from Sense-Annotated Data}},
     author = {Dominik Schlechtweg and Sabine {Schulte im Walde}},
     year = {2020},
-    booktitle = {{The Evolution of Language: Proceedings of the 13th International Conference (EVOLANGXIII)}},
-    editor = {C. Cuskley and M. Flaherty and H. Little and Luke McCrohon and A. Ravignani and T. Verhoef},
-    publisher = {Online at {}},
+    booktitle = {{The Evolution of Language: Proceedings of the 13th International Conference (EvoLang13)}},
+    editor = {Ravignani, A. and Barbieri, C. and Martins, M. and Flaherty, M. and Jadoul, Y. and Lattenkamp, E. and Little, H. and Mudd, K. and Verhoef, T.},
+    url = {http://brussels.evolang.org/proceedings/paper.html?nr=9},
+    doi = {10.17617/2.3190925}
 }
 ```
alignment/sgns_vi.py (4 changes: 2 additions & 2 deletions)
@@ -26,7 +26,7 @@ def main():
     Arguments:
         <modelPath> = model for initialization
-        <corpDir> = path to corpus directory with zipped files, each sentence in form 'year\tword1 word2 word3...'
+        <corpDir> = path to corpus directory with zipped files
        <outPath> = output path for vectors
 
     Options:
@@ -58,7 +58,7 @@ def main():
     # Load model
     model = Word2Vec.load(modelPath)
 
-    # Intersect vocabulary
+    # Build vocabulary
     vocab_sentences = PathLineSentences(corpDir)
     logging.getLogger('gensim').setLevel(logging.ERROR)
     model.build_vocab(vocab_sentences, update=True)
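(Editorial note: the lines above implement the Vector Initialization idea, an SGNS model trained on corpus 1 is loaded, its vocabulary is extended with corpus 2, and training then continues on corpus 2. Below is a compressed, self-contained sketch of that pattern; paths and the save format are placeholders, not the script's actual interface.)

```python
# Sketch of SGNS with Vector Initialization (VI): continue training an
# existing model on a second corpus. Paths are placeholders.
from gensim.models.word2vec import Word2Vec, PathLineSentences

model = Word2Vec.load('vectors/corpus1.model')     # SGNS trained on corpus 1
sentences = PathLineSentences('corpora/corpus2/')  # corpus 2, one sentence per line

model.build_vocab(sentences, update=True)          # add corpus 2's new words
model.train(sentences, total_examples=model.corpus_count, epochs=model.epochs)
model.wv.save_word2vec_format('vectors/corpus2.w2v')
```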
alignment/srv_align.py (142 changes: 21 additions & 121 deletions)
@@ -6,7 +6,7 @@
 import time
 import numpy as np
 from sklearn.random_projection import sparse_random_matrix
-from scipy.sparse import lil_matrix, csc_matrix, hstack, vstack
+from scipy.sparse import csr_matrix
 from utils_ import Space
 
 
@@ -20,21 +20,19 @@ def main():
     args = docopt('''Create two aligned low-dimensional vector spaces by sparse random indexing from two co-occurrence matrices.
 
     Usage:
-        srv_align.py [-l] (-s <seeds> | -a) <matrixPath1> <matrixPath2> <outPath1> <outPath2> <outPathElement> <dim> <t>
+        srv_align.py [-l] <matrixPath1> <matrixPath2> <outPath1> <outPath2> <dim>
 
-        <seeds> = number of non-zero values in each random vector
         <matrixPath1> = path to matrix1
         <matrixPath2> = path to matrix2
         <outPath1> = output path for aligned space 1
         <outPath2> = output path for aligned space 2
-        <outPathElement> = output path for elemental space (context vectors)
         <dim> = number of dimensions for random vectors
-        <t> = threshold for downsampling (if t=None, no subsampling is applied)
 
     Options:
         -l, --len normalize final vectors to unit length
-        -s, --see specify number of seeds manually
-        -a, --aut calculate number of seeds automatically as proposed in [1,2]
+    Note:
+        Assumes intersected and ordered columns. Parameters -s, -a and <t> have been removed from an earlier version for efficiency. Also columns are now intersected instead of unified.
 
     References:
         [1] Ping Li, T. Hastie and K. W. Church, 2006,
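(Editorial note: since the rewritten script assumes intersected and ordered columns, see the Note in the docstring above, here is a minimal sketch of that preprocessing step. It relies only on the `Space` attributes that appear elsewhere in this diff, `matrix`, `rows`, `columns`, `column2id`; the constructor call with a path, the `save` method, and the file paths are taken from the diff or hypothetical.)

```python
# Sketch: intersect and order the columns of two count spaces before
# running srv_align.py. Space attributes follow utils_.py as used in
# this diff; the paths are hypothetical.
from utils_ import Space

space1 = Space('matrices/corpus1.count')
space2 = Space('matrices/corpus2.count')

# Shared context words, in one fixed order for both spaces
common = sorted(set(space1.columns) & set(space2.columns))
idx1 = [space1.column2id[c] for c in common]
idx2 = [space2.column2id[c] for c in common]

# Reorder the columns of each matrix to the shared order
matrix1 = space1.matrix.tocsc()[:, idx1].tocsr()
matrix2 = space2.matrix.tocsc()[:, idx2].tocsr()

Space(matrix=matrix1, rows=space1.rows, columns=common).save('matrices/corpus1.intersected')
Space(matrix=matrix2, rows=space2.rows, columns=common).save('matrices/corpus2.intersected')
```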
@@ -46,134 +44,37 @@ def main():
     ''')
 
     is_len = args['--len']
-    is_seeds = args['--see']
-    if is_seeds:
-        seeds = int(args['<seeds>'])
-    is_aut = args['--aut']
     matrixPath1 = args['<matrixPath1>']
     matrixPath2 = args['<matrixPath2>']
     outPath1 = args['<outPath1>']
     outPath2 = args['<outPath2>']
-    outPathElement = args['<outPathElement>']
     dim = int(args['<dim>'])
-    if args['<t>']=='None':
-        t = None
-    else:
-        t = float(args['<t>'])
 
 
     logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
     logging.info(__file__.upper())
     start_time = time.time()
 
     # Load input matrices
-    space1 = Space(matrixPath1)
-    matrix1 = space1.matrix
-    space2 = Space(matrixPath2)
-    matrix2 = space2.matrix
-
-    # Get mappings between rows/columns and words
-    rows1 = space1.rows
-    id2row1 = space1.id2row
-    row2id1 = space1.row2id
-    columns1 = space1.columns
-    column2id1 = space1.column2id
-    rows2 = space2.rows
-    id2row2 = space2.id2row
-    row2id2 = space2.row2id
-    columns2 = space2.columns
-    column2id2 = space2.column2id
+    countSpace1 = Space(matrixPath1)
+    countMatrix1 = countSpace1.matrix
+    rows1 = countSpace1.rows
+    columns1 = countSpace1.columns
 
+    countSpace2 = Space(matrixPath2)
+    countMatrix2 = countSpace2.matrix
+    rows2 = countSpace2.rows
+    columns2 = countSpace2.columns
 
-    # Get union of rows and columns in both spaces
-    unified_rows = sorted(list(set(rows1).union(rows2)))
-    unified_columns = sorted(list(set(columns1).union(columns2)))
-    columns_diff1 = sorted(list(set(unified_columns) - set(columns1)))
-    columns_diff2 = sorted(list(set(unified_columns) - set(columns2)))
+    # Generate random vectors
+    randomMatrix = csr_matrix(sparse_random_matrix(dim,len(columns1)).toarray().T)
 
-    # Get mappings of indices of columns in original spaces to indices of columns in unified space
-    c2i = {w: i for i, w in enumerate(unified_columns)}
-    cj2i1 = {j: c2i[w] for j, w in enumerate(columns1+columns_diff1)}
-    cj2i2 = {j: c2i[w] for j, w in enumerate(columns2+columns_diff2)}
-
-    if t!=None:
-        rows_diff1 = list(set(unified_rows) - set(rows1))
-        rows_diff2 = list(set(unified_rows) - set(rows2))
-
-        r2i = {w: i for i, w in enumerate(unified_rows)}
-        rj2i1 = {j: r2i[w] for j, w in enumerate(rows1+rows_diff1)}
-        rj2i2 = {j: r2i[w] for j, w in enumerate(rows2+rows_diff2)}
-
-    # Build spaces with unified COLUMNS
-    new_columns1 = csc_matrix((len(rows1),len(columns_diff1))) # Get empty columns for additional context words
-    unified_matrix1 = csc_matrix(hstack((matrix1,new_columns1)))[:,sorted(cj2i1, key=cj2i1.get)] # First concatenate matrix and empty columns and then order columns according to unified_columns
-
-    new_columns2 = csc_matrix((len(rows2),len(columns_diff2)))
-    unified_matrix2 = csc_matrix(hstack((matrix2,new_columns2)))[:,sorted(cj2i2, key=cj2i2.get)]
+    logging.info("Multiplying matrices")
+    reducedMatrix1 = np.dot(countMatrix1,randomMatrix)
+    reducedMatrix2 = np.dot(countMatrix2,randomMatrix)
 
-    # Build spaces with unified ROWS
-    new_rows1 = csc_matrix((len(rows_diff1),len(unified_columns)))
-    final_unified_matrix1 = csc_matrix(vstack((unified_matrix1,new_rows1)))[sorted(rj2i1, key=rj2i1.get)]
-
-    new_rows2 = csc_matrix((len(rows_diff2),len(unified_columns)))
-    final_unified_matrix2 = csc_matrix(vstack((unified_matrix2,new_rows2)))[sorted(rj2i2, key=rj2i2.get)]
-
-    # Add up final unified matrices
-    common_unified_matrix = np.add(final_unified_matrix1,final_unified_matrix2)
-
-    # Get number of total occurrences of any word
-    totalOcc = np.sum(common_unified_matrix)
-
-    # Define function for downsampling
-    downsample = lambda f: np.sqrt(float(t)/f) if f>t else 1.0
-    downsample = np.vectorize(downsample)
-
-    # Get total normalized co-occurrence frequency of all contexts in both spaces
-    context_freqs = np.array(common_unified_matrix.sum(axis=0)/totalOcc)[0]
-
-
-    ## Generate ternary random vectors
-    if is_seeds:
-        elementalMatrix = lil_matrix((len(unified_columns),dim))
-        # Generate base vector for random vectors
-        baseVector = np.zeros(dim) # Note: Make sure that number of seeds is not greater than dimensions
-        for i in range(0,int(seeds/2)):
-            baseVector[i] = 1.0
-        for i in range(int(seeds/2),seeds):
-            baseVector[i] = -1.0
-        for i in range(len(unified_columns)): # To-do: make this more efficient by generating random indices for a whole array
-            np.random.shuffle(baseVector)
-            elementalMatrix[i] = baseVector
-    if is_aut:
-        elementalMatrix = sparse_random_matrix(dim,len(unified_columns)).T
-
-    # Initialize target vectors
-    alignedMatrix1 = np.zeros((len(rows1),dim))
-    alignedMatrix2 = np.zeros((len(rows2),dim))
-
-
-    # Iterate over rows of space, find context words and update aligned matrix with low-dimensional random vectors of these context words
-    for (matrix,id2row,cj2i,alignedMatrix) in [(matrix1,id2row1,cj2i1,alignedMatrix1),(matrix2,id2row2,cj2i2,alignedMatrix2)]:
-        # Iterate over targets
-        for i in id2row:
-            # Get co-occurrence values as matrix
-            m = matrix[i]
-            # Get nonzero indexes
-            nonzeros = m.nonzero()
-            nonzeros = [cj2i[j] for j in nonzeros[1]]
-            data = m.data
-            pos_context_vectors = elementalMatrix[nonzeros]
-            if t!=None:
-                # Apply subsampling
-                rfs = context_freqs[nonzeros]
-                rfs = downsample(rfs)
-                data *= rfs
-            # Weight context vectors by occurrence frequency
-            pos_context_vectors = pos_context_vectors.multiply(data.reshape(-1,1))
-            # Add up context vectors and store as row for target
-            alignedMatrix[i] = np.sum(pos_context_vectors, axis=0)
-
-    outSpace1 = Space(matrix=alignedMatrix1, rows=rows1, columns=[])
-    outSpace2 = Space(matrix=alignedMatrix2, rows=rows2, columns=[])
+    outSpace1 = Space(matrix=reducedMatrix1, rows=rows1, columns=[])
+    outSpace2 = Space(matrix=reducedMatrix2, rows=rows2, columns=[])
 
     if is_len:
         # L2-normalize vectors
@@ -183,7 +84,6 @@ def main():
     # Save the matrices
     outSpace1.save(outPath1)
    outSpace2.save(outPath2)
-    Space(matrix=elementalMatrix, rows=unified_columns, columns=[]).save(outPathElement)
 
     logging.info("--- %s seconds ---" % (time.time() - start_time))

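(Editorial note: the net effect of the srv_align.py rewrite above is that both count matrices are reduced with one shared sparse random projection, which is what keeps the two low-dimensional spaces aligned. Below is a self-contained sketch of that operation with toy dimensions and random data; the real script loads its matrices through `utils_.Space`.)

```python
# Sketch: align two count matrices by multiplying both with the same
# sparse random projection. Toy dimensions and random data for illustration.
from scipy.sparse import csr_matrix, random as sparse_random
from sklearn.random_projection import sparse_random_matrix

vocab_size, dim = 1000, 300

# One shared projection matrix (vocab_size x dim); column i of both count
# matrices must refer to the same context word for the sharing to make sense.
projection = csr_matrix(sparse_random_matrix(dim, vocab_size).toarray().T)

count1 = sparse_random(50, vocab_size, density=0.05, format='csr')  # toy counts, corpus 1
count2 = sparse_random(60, vocab_size, density=0.05, format='csr')  # toy counts, corpus 2

reduced1 = count1.dot(projection)  # (50 x dim) aligned space 1
reduced2 = count2.dot(projection)  # (60 x dim) aligned space 2
```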