Word Embedding
The DMTK Word Embedding tool is a parallelization of the Word2Vec algorithm on top of Multiverso. It provides an efficient "scaling to industry size" solution for word embedding.
Linux Installation
cd multiverso/Applications/WordEmbedding
cmake CMakeLists.txt
make
Windows Installation
- Get and build the DMTK Framework Multiverso.
- Open Multiverso.sln, change the configuration and platform to Release and x64, and set the include and lib paths of Multiverso in the WordEmbedding project properties.
- Enable OpenMP 2.0 support.
- Build the solution.
For single machine training, run
WordEmbedding -param_name param_name
To run in a distributed environment, run with MPI. First decide which machines you will use, then create a machine file like this:
10.153.151.126
10.153.151.127
10.153.151.128
10.153.151.129
You also need to start a listening process on every machine:
smpd -d -p port_number
Then create a run.bat like the following and run it:
param_name = param_value
mpirun -m machine_file_path -port port_number WordEmbedding -param_name param_name
Several parameters need to be set in Word2Vec experiments. Here is an example showing the format of run.bat. Suppose we want to train a CBOW model with 5 negative samples and 300-dimensional word embeddings.
set size=300
set text=enwiki2014
set read_vocab="D:\Users\xxxx\Run\enwiki2014_vocab_m5.txt"
set train_file="D:\Users\xxxx\Run\enwiki2014"
set binary=1
set cbow=1
set alpha=0.05
set epoch=20
set window=5
set sample=0.001
set hs=0
set negative=5
set threads=15
set mincount=5
set sw_file="D:\Users\xxxx\Run\stopwords_simple.txt"
set stopwords=5
set data_block_size=100000000
set max_preload_data_size=300000000
set use_adagrad=0
set output=%text%_%size%.bin
set log_file=%text%_%size%.log
set is_pipeline=1
mpiexec.exe -machinefile machine_file.txt -port 9141 WordEmbedding.exe -is_pipeline %is_pipeline% -max_preload_data_size %max_preload_data_size% -alpha %alpha% -data_block_size %data_block_size% -train_file %train_file% -output %output% -threads %threads% -size %size% -binary %binary% -cbow %cbow% -epoch %epoch% -negative %negative% -hs %hs% -sample %sample% -min_count %mincount% -window %window% -stopwords %stopwords% -sw_file %sw_file% -read_vocab %read_vocab% -use_adagrad %use_adagrad% 2>&1 1>%log_file%
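The same command line can also be assembled programmatically. The Python sketch below is only an illustration: `build_command` and the `params` dictionary are hypothetical helpers, not part of the tool. It uses only the flags shown in the run.bat example and checks the rule from the parameter list that `negative` must be 0 when `hs` is 1.

```python
def build_command(params, machine_file, port, exe="WordEmbedding.exe"):
    """Return the mpiexec argument list for the given parameter dictionary.

    Hypothetical helper for illustration; flag names mirror the run.bat example.
    """
    # Rule from the parameter list: hierarchical softmax excludes negative sampling.
    if params.get("hs") == 1 and params.get("negative", 0) != 0:
        raise ValueError("negative must be 0 when hs = 1")

    command = ["mpiexec.exe", "-machinefile", machine_file, "-port", str(port), exe]
    for name, value in params.items():
        # Every parameter is passed as "-name value".
        command += ["-" + name, str(value)]
    return command


params = {"size": 300, "cbow": 1, "alpha": 0.05, "epoch": 20,
          "window": 5, "hs": 0, "negative": 5, "min_count": 5}
command = build_command(params, "machine_file.txt", 9141)
print(" ".join(command))
```

Keeping the parameters in one dictionary makes it easy to reject an invalid combination before a long distributed run is started.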
- Note that the training file `train_file` (e.g. enwiki2014) should be divided into n smaller datasets, which are placed on different machines. Every machine holds the whole vocabulary dictionary `read_vocab` and the stop word file `sw_file`.
- `size`: the word embedding size.
- `cbow`: 0 or 1, default 1; whether to use CBOW, otherwise skip-gram is used.
- `alpha`: the initial learning rate, usually set to 0.025.
- `window`: the window size.
- `sample`: the sub-sample size, default 1e-3.
- `hs`: 0 or 1, default 1; whether to use hierarchical softmax, otherwise negative sampling is used. When hs = 1, `negative` must be 0.
- `negative`: the number of negative words in negative sampling; set it to 0 when hs = 1.
- `min_count`: words with a frequency lower than min_count are removed from the dictionary.
- `use_adagrad`: 0 or 1; whether to use AdaGrad to adjust the learning rate.
- `train_file`: the training corpus file, e.g. enwiki2014.
- `read_vocab`: the file from which all the vocabulary count information is read.
- `binary`: 0 or 1; whether to write the embedding vectors in binary format.
- `output`: the output file that stores all the embedding vectors.
- `stopwords`: 0 or 1; whether to skip stop words during training.
- `sw_file`: the file that stores all the stop words; used when stopwords = 1.
- `is_pipeline`: 0 or 1; whether to use the pipeline.
- `threads`: the number of threads to run on one machine.
- `epoch`: the number of epochs.
- `data_block_size`: default 1MB; the maximum number of bytes a data block will store.
- `max_preload_data_size`: default 8GB; the maximum data size (in bytes) that the program will preload. It helps you control memory usage.
- `server_endpoint_file`: the server ZMQ socket endpoint file in the MPI-free version.
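As noted before the parameter list, `train_file` has to be divided into n smaller datasets, one per machine. Below is a minimal Python sketch of one way to do this. It assumes a line-oriented corpus and assigns lines round-robin; the function name `split_corpus` and the `.partK` suffix are hypothetical choices for illustration, not something the tool requires.

```python
def split_corpus(path, n, encoding="utf-8"):
    """Split a line-oriented corpus into n shard files, assigning lines round-robin.

    Returns the list of shard paths (path.part0 ... path.part{n-1}).
    """
    shard_paths = [f"{path}.part{i}" for i in range(n)]
    shards = [open(p, "w", encoding=encoding) for p in shard_paths]
    try:
        with open(path, "r", encoding=encoding) as corpus:
            for line_number, line in enumerate(corpus):
                # Line k goes to shard k mod n, so shard sizes differ by at most one line.
                shards[line_number % n].write(line)
    finally:
        for shard in shards:
            shard.close()
    return shard_paths
```

For example, `split_corpus("enwiki2014", 4)` would write four shard files next to the corpus; each shard is then copied to one machine and passed as that machine's `train_file`.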
The final word embedding is saved on the rank 0 machine. Below is an example of the output file. All values are separated by whitespace, so the word embedding is easy to use in other tasks.
word_number_m word_embedding_size_n
word_name_1 dimension_1_of_word_1 dimension_2_of_word_1 ... dimension_n_of_word_1
word_name_2 dimension_1_of_word_2 dimension_2_of_word_2 ... dimension_n_of_word_2
word_name_3 dimension_1_of_word_3 dimension_2_of_word_3 ... dimension_n_of_word_3
...
word_name_m dimension_1_of_word_m dimension_2_of_word_m ... dimension_n_of_word_m
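The text layout above can be read with a few lines of Python. This is a minimal sketch that assumes the text format (`binary` set to 0) and whitespace-separated values as described; `load_embeddings` is a hypothetical helper name.

```python
def load_embeddings(lines):
    """Parse the text output format: a header line "m n", then one word per line.

    `lines` is any iterable of strings, e.g. an open file object.
    Returns a dict mapping each word to its list of n floats.
    """
    iterator = iter(lines)
    word_count, size = (int(x) for x in next(iterator).split())
    embeddings = {}
    for line in iterator:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        word, values = fields[0], [float(x) for x in fields[1:]]
        if len(values) != size:
            raise ValueError(f"expected {size} values for {word!r}, got {len(values)}")
        embeddings[word] = values
    if len(embeddings) != word_count:
        raise ValueError(f"header declares {word_count} words, found {len(embeddings)}")
    return embeddings


sample = ["2 3", "king 0.1 0.2 0.3", "queen 0.4 0.5 0.6"]
vectors = load_embeddings(sample)
print(vectors["queen"])  # [0.4, 0.5, 0.6]
```

Because the function accepts any iterable of lines, an open file object can be passed directly instead of the in-memory `sample` list.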
We report the performance of the DMTK Word Embedding tool on the English version of Wiki2014, which contains 3,402,883,423 tokens. The results are given below. The experiments were run on 20 cores of an Intel Xeon E5-2670 CPU on each machine.
Program Name | Dimension | Machine | Analogical Reasoning | WS353 | Time |
---|---|---|---|---|---|
Google Word2Vec | 300 | 1 | 64.6% | 64.4% | 52340s |
DMTK Word2Vec | 300 | 4 | 65.3% | 75.1% | 23484s |
* The dataset statistics were obtained after data preprocessing.
* Analogical reasoning is evaluated by accuracy.
* WS353 is evaluated by Spearman's rank correlation.
* All the above experiments were run with the following configuration: `-cbow 1 -size 300 -alpha 0.05 -epoch 20 -window 5 -sample 0.0001 -hs 0 -negative 5 -mincount 5 -use_adagrad 0`. For DWE, the data block size is set with `-data_block_size 100000000` (100MB).
* The final result is the best result over the 20 epochs of each experiment.
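For readers who want to compute a WS353-style score themselves, Spearman's rank correlation between model similarities and human scores can be sketched as follows. This is an illustrative Python implementation, not the evaluation script used for the table: it assumes cosine similarity as the model score, and the function names and toy data are made up.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman's rank correlation: Pearson correlation of the two rank lists."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


# Toy data: three word pairs with made-up vectors and made-up human scores.
vectors = {"cat": [1.0, 0.0], "dog": [0.9, 0.1], "car": [0.0, 1.0], "bus": [0.2, 0.8]}
pairs = [("cat", "dog", 9.0), ("car", "bus", 8.0), ("cat", "car", 1.0)]
model_scores = [cosine(vectors[a], vectors[b]) for a, b, _ in pairs]
human_scores = [score for _, _, score in pairs]
print(spearman(model_scores, human_scores))
```

Only the ordering of the scores matters for this metric, so the model similarities do not have to be on the same scale as the human ratings.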
Convergence is as follows:
- Adjust the learning rate `alpha`. You can try different learning rates according to the convergence of each epoch.
- For a small dataset, you can try skip-gram by setting `cbow = 0`. For a large dataset, you can try `cbow = 1`.
- A small `sample` value may improve the quality of the word embedding.
- The initial values of the word embedding are randomly sampled from Uniform[-0.5/embedding_size, 0.5/embedding_size]. Think carefully before you change this.
- Finally, the dataset is really important.
- Hierarchical softmax can produce high-quality word embeddings in a few epochs, but negative sampling is faster, and the two perform almost the same once converged. You can set `hs = 0` and change `negative`.
- `is_pipeline = 1` means the model trains and requests parameters in parallel, which can help reduce the training time.
- You could try a larger number of `threads`, at the risk of lower accuracy.
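The initialization range mentioned in the tips can be illustrated with a short Python sketch. It assumes the upper bound is 0.5/embedding_size, mirroring the lower bound; `init_embedding` and the fixed seed are illustrative and are not taken from the tool's source.

```python
import random


def init_embedding(vocab_size, embedding_size, seed=0):
    """Return vocab_size vectors drawn from Uniform[-0.5/embedding_size, 0.5/embedding_size]."""
    rng = random.Random(seed)
    bound = 0.5 / embedding_size
    return [[rng.uniform(-bound, bound) for _ in range(embedding_size)]
            for _ in range(vocab_size)]


table = init_embedding(vocab_size=4, embedding_size=300)
bound = 0.5 / 300
print(all(-bound <= x <= bound for row in table for x in row))  # True
```

Scaling the range by the embedding size keeps the initial vectors small as the dimension grows, which is one reason to be cautious about changing it.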
[1] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. In Proceedings of Workshop at ICLR, 2013.