merge sgns_vi.py and sgns_vi2.py, initialize on full model, join vocabulary
garrafao committed Dec 27, 2019
1 parent 73c694b commit 7b72046
Showing 4 changed files with 27 additions and 185 deletions.
60 changes: 24 additions & 36 deletions README.md
@@ -59,53 +59,41 @@ The scripts assume a corpus format of one sentence per line in UTF-8 encoded (op

#### Semantic Representations

|Name | Code | Type |
| --- | --- | --- |
| Count | `representations/count.py` | VSM |
| PPMI | `representations/ppmi.py` | VSM |
| SVD | `representations/svd.py` | VSM |
| RI | `representations/ri.py` | VSM |
| SGNS | `representations/sgns.py` | VSM |
| SCAN | [repository](https://github.com/ColiLea/scan) | TPM |
|Name | Code | Type | Comment |
| --- | --- | --- | --- |
| Count | `representations/count.py` | VSM | |
| PPMI | `representations/ppmi.py` | VSM | |
| SVD | `representations/svd.py` | VSM | |
| RI | `representations/ri.py` | VSM | - use `-a` for good performance |
| SGNS | `representations/sgns.py` | VSM | |
| SCAN | [repository](https://github.com/ColiLea/scan) | TPM | - different corpus input format |

Table: VSM=Vector Space Model, TPM=Topic Model

Note that SCAN takes a slightly different corpus input format than the other models.

#### Alignment

|Name | Code | Applicability |
| --- | --- | --- |
| CI | `alignment/ci_align.py` | Count, PPMI |
| SRV | `alignment/srv_align.py` | RI |
| OP | `alignment/map_embeddings.py` | SVD, RI, SGNS |
| VI | `alignment/sgns_vi.py` | SGNS |
| WI | `alignment/wi.py` | Count, PPMI, SVD, RI, SGNS |

The script `alignment/map_embeddings.py` is drawn from [VecMap](https://github.com/artetxem/vecmap), where you can find instructions how to use it. Find examples of how to obtain OP, OP- and OP+ under `scripts/`.

For SRV, consider using the efficient and more powerful [TRIPY](https://github.com/Garrafao/TRIPY). Instead of WI, consider using the more advanced [Temporal Referencing](https://github.com/Garrafao/TemporalReferencing).
|Name | Code | Applicability | Comment |
| --- | --- | --- | --- |
| CI | `alignment/ci_align.py` | Count, PPMI | |
| SRV | `alignment/srv_align.py` | RI | - use `-a` for good performance <br> - consider using the efficient and more powerful [TRIPY](https://github.com/Garrafao/TRIPY) |
| OP | `alignment/map_embeddings.py` | SVD, RI, SGNS | - drawn from [VecMap](https://github.com/artetxem/vecmap) <br> - for OP- and OP+ see `scripts/` |
| VI | `alignment/sgns_vi.py` | SGNS | - updated 27/12/19 (see script for details) |
| WI | `alignment/wi.py` | Count, PPMI, SVD, RI, SGNS | - consider using the more advanced [Temporal Referencing](https://github.com/Garrafao/TemporalReferencing) |

#### Measures

|Name | Code | Applicability |
| --- | --- | --- |
| CD | `measures/cd.py` | Count, PPMI, SVD, RI, SGNS |
| LND | `measures/lnd.py` | Count, PPMI, SVD, RI, SGNS |
| JSD | - | SCAN |
| FD | `measures/freq.py` | from corpus |
| TD | `measures/typs.py` |Count|
| HD | `measures/entropy.py` | Count |

FD, TD and HD need additional applications of `measures/diff.py` and optionally `measures/trsf.py`.
|Name | Code | Applicability | Comment |
| --- | --- | --- | --- |
| CD | `measures/cd.py` | Count, PPMI, SVD, RI, SGNS | |
| LND | `measures/lnd.py` | Count, PPMI, SVD, RI, SGNS | |
| JSD | - | SCAN | |
| FD | `measures/freq.py` | from corpus | - log-transform with `measures/trsf.py` <br> - get difference with `measures/diff.py` |
| TD | `measures/typs.py` | Count | as above |
| HD | `measures/entropy.py` | Count | as above |
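
As a rough illustration of the FD comment in the table above (log-transform, then take the difference), the sketch below computes a log-transformed frequency difference for a few target words. It does not reproduce `measures/freq.py`, `measures/trsf.py` or `measures/diff.py`; the function names, the normalization by corpus size, and the use of an absolute difference are assumptions made for illustration only.

```python
import math
from collections import Counter


def normalized_freqs(sentences):
    """Relative frequency of each token; sentences are whitespace-tokenized strings."""
    counts = Counter(tok for s in sentences for tok in s.split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def log_freq_difference(freqs1, freqs2, targets):
    """Absolute difference of log frequencies (assumed form), for targets found in both corpora."""
    return {
        w: abs(math.log(freqs2[w]) - math.log(freqs1[w]))
        for w in targets
        if w in freqs1 and w in freqs2
    }


# Toy corpora, one sentence per line (hypothetical data).
corpus1 = ["the cat sat", "the dog sat"]
corpus2 = ["the cat ran", "a cat sat"]

fd = log_freq_difference(
    normalized_freqs(corpus1), normalized_freqs(corpus2), ["cat", "sat", "dog"]
)
```

Targets missing from either corpus (here `dog`) are skipped, since the logarithm of a zero frequency is undefined; the actual scripts may handle this case differently.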

### Parameter Settings

For better performance, RI and SRV should be run with `-a` option, instead of specifying the seed number manually.

Consider the application of column mean centering after L2-normalization to RI and SGNS embeddings before applying a change measure.

Find more detailed notes on model performances and optimal parameter settings in [these papers](#bibtex).
Find detailed notes on model performances and optimal parameter settings in [these papers](#bibtex).
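
The post-processing mentioned in this section (L2-normalization followed by column mean centering) can be sketched as follows. This is a minimal, dependency-free illustration on nested lists, not code from the repository; the function names and the toy matrix are assumptions, and a real pipeline would more likely operate on NumPy or SciPy matrices.

```python
import math


def l2_normalize_rows(matrix):
    """Scale each row vector to unit Euclidean length (zero rows are left unchanged)."""
    normalized = []
    for row in matrix:
        norm = math.sqrt(sum(x * x for x in row))
        normalized.append([x / norm for x in row] if norm > 0 else list(row))
    return normalized


def mean_center_columns(matrix):
    """Subtract the mean of each column (dimension) from every row."""
    n_rows = len(matrix)
    col_means = [sum(col) / n_rows for col in zip(*matrix)]
    return [[x - m for x, m in zip(row, col_means)] for row in matrix]


# Toy embedding matrix: one row per word, one column per dimension (hypothetical values).
embeddings = [[3.0, 4.0], [0.0, 2.0], [1.0, 0.0]]
centered = mean_center_columns(l2_normalize_rows(embeddings))
```

After centering, every column sums to zero, so the rows are generally no longer unit length; the order of the two steps therefore matters.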

### Evaluation

19 changes: 2 additions & 17 deletions alignment/sgns_vi.py
@@ -21,19 +21,13 @@ def main():
args = docopt("""Make comparable embedding vector spaces with Skip-Gram with Negative Sampling and Vector Initialization from corpus.
Usage:
sgns_vi.py [-l] <modelPath> <corpDir> <outPath> <windowSize> <dim> <k> <t> <minCount> <itera>
sgns_vi.py [-l] <modelPath> <corpDir> <outPath>
Arguments:
<modelPath> = model for initialization
<corpDir> = path to corpus directory with zipped files, each sentence in form 'year\tword1 word2 word3...'
<outPath> = output path for vectors
<windowSize> = the linear distance of context words to consider in each direction
<dim> = dimensionality of embeddings
<k> = number of negative samples parameter (equivalent to shifting parameter for PPMI)
<t> = threshold for subsampling
<minCount> = number of occurrences for a word to be included in the vocabulary
<itera> = number of iterations
Options:
-l, --len normalize final vectors to unit length
@@ -48,23 +42,14 @@ def main():
Differences:
In the original version, when training on the second corpus, only the previously created embedding matrix was loaded into the new model, so the context matrix was newly initialized with random values. In the updated version, the whole model is reused for training on the second corpus, including both the embedding matrix and the context matrix.
Additionally, vocabulary
Additionally, the vocabularies of the two corpora are now unified; previously they were intersected.
""")

is_len = args['--len']
modelPath = args['<modelPath>']
corpDir = args['<corpDir>']
outPath = args['<outPath>']
windowSize = int(args['<windowSize>'])
dim = int(args['<dim>'])
k = int(args['<k>'])
if args['<t>']=='None':
t = None
else:
t = float(args['<t>'])
minCount = int(args['<minCount>'])
itera = int(args['<itera>'])

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
logging.info(__file__.upper())
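
The difference between intersecting and unifying the two vocabularies, as described in the docstring above, can be illustrated with a small sketch. It is not the code of `alignment/sgns_vi.py`: the helper names are hypothetical, the vectors are plain lists rather than a trained model, and giving words that occur only in the second corpus small random vectors is an assumption about how such words might be initialized.

```python
import random


def intersected_vocabulary(vocab1, vocab2):
    """Old behaviour as described: keep only words that occur in both corpora."""
    return sorted(set(vocab1) & set(vocab2))


def joined_vocabulary(vocab1, vocab2):
    """New behaviour as described: keep every word that occurs in either corpus."""
    return sorted(set(vocab1) | set(vocab2))


def initialize_vectors(pretrained, vocab, dim, seed=0):
    """Reuse pretrained vectors where available; unseen words get small random vectors (assumed)."""
    rng = random.Random(seed)
    vectors = {}
    for word in vocab:
        if word in pretrained:
            vectors[word] = list(pretrained[word])
        else:
            vectors[word] = [rng.uniform(-0.5 / dim, 0.5 / dim) for _ in range(dim)]
    return vectors


# Hypothetical vectors from the first time period and vocabulary of the second.
pretrained = {"cat": [0.1, 0.2], "dog": [0.3, 0.4]}
vocab2 = ["cat", "bird"]

common = intersected_vocabulary(pretrained, vocab2)
union = joined_vocabulary(pretrained, vocab2)
init = initialize_vectors(pretrained, union, dim=2)
```

With the joined vocabulary, a word such as `dog`, which is absent from the second corpus, keeps its vector from the first period, while `bird` enters training with a fresh vector.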
131 changes: 0 additions & 131 deletions alignment/sgns_vi2.py

This file was deleted.

2 changes: 1 addition & 1 deletion scripts/run_SGNS_VI.sh
@@ -7,7 +7,7 @@ do
do
for iteration in "${iterations[@]}"
do
python3 alignment/sgns_vi.py $infolder/win$windowSize-k$k-t$t-iter$iteration.sgns.model $corpDir2 $outfolder2/win$windowSize-k$k-t$t-iter$iteration.sgns-VI $windowSize $dim $k $t 0 5 # construct word2vec skip-gram embeddings with vector initialization
python3 alignment/sgns_vi.py $infolder/win$windowSize-k$k-t$t-iter$iteration.sgns.model $corpDir2 $outfolder2/win$windowSize-k$k-t$t-iter$iteration.sgns-VI # construct word2vec skip-gram embeddings with vector initialization
scp $infolder/win$windowSize-k$k-t$t-iter$iteration.sgns $outfolder1/win$windowSize-k$k-t$t-iter$iteration.sgns-VI # copy initialization vectors as matrix for first time period
done
done