merge sgns_vi.py and sgns_vi2.py, initialize on full model, join vocabulary
Garrafao · Dec 27, 2019
commit 7b72046, 1 parent 73c694b

Showing 4 changed files with 27 additions and 185 deletions.
diff --git a/README.md b/README.md
@@ -59,53 +59,41 @@ The scripts assume a corpus format of one sentence per line in UTF-8 encoded (op
 
 #### Semantic Representations
 
-|Name | Code | Type |
-| --- | --- | --- |
-| Count | `representations/count.py` | VSM |
-| PPMI | `representations/ppmi.py` | VSM |
-| SVD | `representations/svd.py` | VSM |
-| RI | `representations/ri.py` | VSM |
-| SGNS | `representations/sgns.py` | VSM |
-| SCAN | [repository](https://github.com/ColiLea/scan) | TPM |
+|Name | Code | Type | Comment |
+| --- | --- | --- | --- |
+| Count | `representations/count.py` | VSM | |
+| PPMI | `representations/ppmi.py` | VSM | |
+| SVD | `representations/svd.py` | VSM | |
+| RI | `representations/ri.py` | VSM | - use `-a` for good performance |
+| SGNS | `representations/sgns.py` | VSM | |
+| SCAN | [repository](https://github.com/ColiLea/scan) | TPM | - different corpus input format |
 
 Table: VSM=Vector Space Model, TPM=Topic Model
 
-Note that SCAN takes a slightly different corpus input format than the other models.
-
 #### Alignment
 
-|Name | Code | Applicability |
-| --- | --- | --- |
-| CI | `alignment/ci_align.py` | Count, PPMI |
-| SRV | `alignment/srv_align.py` | RI |
-| OP | `alignment/map_embeddings.py` | SVD, RI, SGNS |
-| VI | `alignment/sgns_vi.py` | SGNS |
-| WI | `alignment/wi.py` | Count, PPMI, SVD, RI, SGNS |
-
-The script `alignment/map_embeddings.py` is drawn from [VecMap](https://github.com/artetxem/vecmap), where you can find instructions how to use it. Find examples of how to obtain OP, OP- and OP+ under `scripts/`.
-
-For SRV, consider using the efficient and more powerful [TRIPY](https://github.com/Garrafao/TRIPY). Instead of WI, consider using the more advanced [Temporal Referencing](https://github.com/Garrafao/TemporalReferencing).
+|Name | Code | Applicability | Comment |
+| --- | --- | --- | --- |
+| CI | `alignment/ci_align.py` | Count, PPMI | |
+| SRV | `alignment/srv_align.py` | RI | - use `-a` for good performance <br> - consider using the efficient and more powerful [TRIPY](https://github.com/Garrafao/TRIPY) |
+| OP | `alignment/map_embeddings.py` | SVD, RI, SGNS | - drawn from [VecMap](https://github.com/artetxem/vecmap) <br> - for OP- and OP+ see `scripts/` |
+| VI | `alignment/sgns_vi.py` | SGNS | - updated 27/12/19 (see script for details) |
+| WI | `alignment/wi.py` | Count, PPMI, SVD, RI, SGNS | - consider using the more advanced [Temporal Referencing](https://github.com/Garrafao/TemporalReferencing) |
 
 #### Measures
 
-|Name | Code | Applicability |
-| --- | --- | --- |
-| CD | `measures/cd.py` | Count, PPMI, SVD, RI, SGNS |
-| LND | `measures/lnd.py` | Count, PPMI, SVD, RI, SGNS |
-| JSD | - | SCAN |
-| FD | `measures/freq.py` | from corpus |
-| TD | `measures/typs.py` |Count|
-| HD | `measures/entropy.py` | Count |
-
-FD, TD and HD need additional applications of `measures/diff.py` and optionally `measures/trsf.py`.
+|Name | Code | Applicability | Comment |
+| --- | --- | --- | --- |
+| CD | `measures/cd.py` | Count, PPMI, SVD, RI, SGNS | |
+| LND | `measures/lnd.py` | Count, PPMI, SVD, RI, SGNS | |
+| JSD | - | SCAN | |
+| FD | `measures/freq.py` | from corpus | - log-transform with `measures/trsf.py` <br> - get difference with `measures/diff.py` |
+| TD | `measures/typs.py` | Count | as above |
+| HD | `measures/entropy.py` | Count | as above |
 
 ### Parameter Settings
 
-For better performance, RI and SRV should be run with `-a` option, instead of specifying the seed number manually.
-
-Consider the application of column mean centering after L2-normalization to RI and SGNS embeddings before applying a change measure.
-
-Find more detailed notes on model performances and optimal parameter settings in [these papers](#bibtex).
+Find detailed notes on model performances and optimal parameter settings in [these papers](#bibtex).
 
 ### Evaluation
 

diff --git a/alignment/sgns_vi.py b/alignment/sgns_vi.py
@@ -21,19 +21,13 @@ def main():
     args = docopt("""Make comparable embedding vector spaces with Skip-Gram with Negative Sampling and Vector Initialization from corpus.
 
     Usage:
-        sgns_vi.py [-l] <modelPath> <corpDir> <outPath> <windowSize> <dim> <k> <t> <minCount> <itera>
+        sgns_vi.py [-l] <modelPath> <corpDir> <outPath>
         
     Arguments:
        
         <modelPath> = model for initialization
         <corpDir> = path to corpus directory with zipped files, each sentence in form 'year\tword1 word2 word3...'
         <outPath> = output path for vectors
-        <windowSize> = the linear distance of context words to consider in each direction
-        <dim> = dimensionality of embeddings
-        <k> = number of negative samples parameter (equivalent to shifting parameter for PPMI)
-        <t> = threshold for subsampling
-        <minCount> = number of occurrences for a word to be included in the vocabulary
-        <itera> = number of iterations
 
     Options:
         -l, --len   normalize final vectors to unit length
@@ -48,23 +42,14 @@ def main():
         Differences:
         In the original version, only the previously created embedding matrix was loaded into the new model for training on the second corpus, so the context matrix was newly initialized with random values. In the updated version, the whole model is reused for training on the second corpus, including both the embedding matrix and the context matrix.
 
-        Additionally, vocabulary
+        Additionally, the vocabularies of the two corpora are now unified; previously they were intersected.
 
     """)
 
     is_len = args['--len']
     modelPath = args['<modelPath>'] 
     corpDir = args['<corpDir>']
     outPath = args['<outPath>']
-    windowSize = int(args['<windowSize>'])    
-    dim = int(args['<dim>'])    
-    k = int(args['<k>'])
-    if args['<t>']=='None':
-        t = None
-    else:
-        t = float(args['<t>'])        
-    minCount = int(args['<minCount>'])    
-    itera = int(args['<itera>'])   
 
     logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
     logging.info(__file__.upper())

diff --git a/alignment/sgns_vi2.py b/alignment/sgns_vi2.py
diff --git a/scripts/run_SGNS_VI.sh b/scripts/run_SGNS_VI.sh
@@ -7,7 +7,7 @@ do
 	do		    		
 	    for iteration in "${iterations[@]}"
 	    do
-		python3 alignment/sgns_vi.py $infolder/win$windowSize-k$k-t$t-iter$iteration.sgns.model $corpDir2 $outfolder2/win$windowSize-k$k-t$t-iter$iteration.sgns-VI $windowSize $dim $k $t 0 5 # construct word2vec skip-gram embeddings with vector initialization
+		python3 alignment/sgns_vi.py $infolder/win$windowSize-k$k-t$t-iter$iteration.sgns.model $corpDir2 $outfolder2/win$windowSize-k$k-t$t-iter$iteration.sgns-VI # construct word2vec skip-gram embeddings with vector initialization
 		scp $infolder/win$windowSize-k$k-t$t-iter$iteration.sgns $outfolder1/win$windowSize-k$k-t$t-iter$iteration.sgns-VI # copy initialization vectors as matrix for first time period
 	    done	    
 	done
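
The docstring change in `alignment/sgns_vi.py` says that the vocabularies of the two corpora are now unified rather than intersected. The following is a minimal, library-free sketch of what that could mean for the start vectors. It is not the script's code, which this diff does not show, and it covers only the word vectors (the commit also reuses the context matrix). The helper `init_union_vectors`, the random range, and the toy data are assumptions made for illustration.

```python
import random


def init_union_vectors(pretrained, second_vocab, dim, seed=0):
    """Return start vectors for the union of both vocabularies (hypothetical helper)."""
    rng = random.Random(seed)
    vectors = {}
    for word in sorted(set(pretrained) | set(second_vocab)):
        if word in pretrained:
            # Known word: start from the vector trained on the first corpus.
            vectors[word] = list(pretrained[word])
        else:
            # Word seen only in the second corpus: small random start vector,
            # drawn uniformly from [-0.5/dim, 0.5/dim) (an assumed range).
            vectors[word] = [(rng.random() - 0.5) / dim for _ in range(dim)]
    return vectors


# Toy data, invented for illustration only.
first_model = {"walk": [0.1, 0.2], "tree": [0.3, -0.1]}
second_corpus_vocab = {"walk", "phone"}

start = init_union_vectors(first_model, second_corpus_vocab, dim=2)
print(sorted(start))                                   # ['phone', 'tree', 'walk']
print(sorted(set(first_model) & second_corpus_vocab))  # ['walk']
```

With the union, `tree` and `phone` both receive a vector in the second space. With the former intersection, only `walk` would have been shared between the two spaces.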