
Training script #5

Open
lparam opened this issue Nov 26, 2017 · 10 comments

Comments

@lparam

lparam commented Nov 26, 2017

@bheinzerling
Could you provide the training script? I want to train with my own data.

@alejandrojcastaneira

Hello. Thanks again for the great work! I'm also interested in training BPEmb embeddings on my own data. Is there a way or an example of how to do this?

Best Regards

@Danil328

+1

@bheinzerling
Owner

Most of my original training script deals with training many different embeddings for all languages on a cluster (not sure how much sense it makes to share this), but the basic procedure is quite simple:

  1. Preprocess corpus.
  2. Learn BPE model on corpus, using SentencePiece.
  3. Encode corpus with BPE model, again using SentencePiece.
  4. Learn embeddings on encoded corpus, using GloVe.
sentencepiece_dir=/install/sentencepiece/and/set/this/path
glove_dir=/install/glove/and/set/this/path

corpus=corpus.txt
corpus_preproc=corpus_preproc.txt
vocab_size=100000
emb_dim=100
model_type=bpe
model_prefix=${corpus_preproc}.${model_type}.${vocab_size}
emb_out=${model_prefix}.d${emb_dim}

# preprocessing
# you probably want to lowercase everything and replace all digits with 0
# the preprocessing I used is quite specific to Wikipedia, depending on your corpus you can do something much simpler

# remove wikipedia section header === and article title ''' markers, silly sentence split on "  " and remove initial whitespace
sed "s/===\+/\n/g;s/'''//g;s/  /\n/g" $corpus | perl -C -pe 's/\x{200B}|\x{200C}|\x{200D}|\x{200E}|\x{202C}|\x{96}//g' | tr -s [[:blank:]] " " | sed -re 's/\xc2\x91\|\xc2\x92\|\xc2\xa0\|\xe2\x80\x8e//g;s#(https?://[^">< ]+)#🔗#g;s/[0-9]/0/g;s/^ \+//'  | grep ".\{100\}" | sed "s/^ //" > $corpus_preproc

# train SentencePiece model
$sentencepiece_dir/bin/spm_train --split_by_whitespace true --input $corpus_preproc --model_prefix $model_prefix --vocab_size $vocab_size --model_type $model_type

# encode preprocessed corpus with the trained SentencePiece model
model_file=${model_prefix}.model
corpus_encoded=corpus_encoded.txt
# encoding to numerical IDs (--output_format=id) saves you headaches if your corpus contains weird whitespace characters that might be treated differently by SentencePiece and GloVe. You can leave this out if your corpus is quite clean.
cat $corpus_preproc | $sentencepiece_dir/bin/spm_encode --model $model_file --output $corpus_encoded --extra_options=bos:eos # --output_format=id

# train BPE embeddings with GloVe
$glove_dir/run.sh $corpus_encoded $emb_out $emb_dim

This will give you BPE embeddings in GloVe format in ${emb_out}.glove.txt.

I copy&pasted this from my actual scripts, let me know if this works for you.

Finally, the embeddings in GloVe format are in a different order than the subwords in the BPE vocabulary, so the last step is to reorder them. If the above works for you I can think of a way to properly add this to the repo (not just a comment) and maybe turn it into a push-button solution.
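
As a quick sanity check of the output, something like this should work (a rough sketch, assuming gensim is installed and that GloVe wrote a word2vec-style header, i.e. it was run with -write-header 1 as in the run script further down; the file name follows the variables above):

from gensim.models import KeyedVectors

# hypothetical file name, following the ${emb_out}.glove.txt convention above
emb = KeyedVectors.load_word2vec_format("corpus_preproc.txt.bpe.100000.d100.glove.txt")
print(emb.vectors.shape)  # roughly (vocab_size, emb_dim)
# with --output_format=id the "words" are numerical piece IDs, e.g. "42"
print(emb.most_similar("42", topn=5))

If your GloVe output has no header, convert it first with gensim's glove2word2vec script.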

@Danil328

Thank you very much!

@alejandrojcastaneira

alejandrojcastaneira commented Jun 6, 2019

Hello, I managed to train my own embeddings using GloVe based on your SentencePiece model, and then tried to load them into bpemb as you commented in #23, using:

from bpemb import BPEmb
from bpemb.util import sentencepiece_load, load_word2vec_file

bpemb = BPEmb(lang='en')
bpemb.spm = sentencepiece_load('/some/folder/en.wiki.bpe.vs200000.model')
bpemb.emb = load_word2vec_file('/some/folder/my_byte_pair_emb.w2v.bin')

but I still haven't reordered the vectors. Could you give me some insight on how to do this?

Best regards

@bheinzerling
Owner

bheinzerling commented Jun 10, 2019

Assuming you have a SentencePiece .vocab file for your model, let's first write a helper function for loading this:

def get_vocab(vocab_file, vocab_size):
    with vocab_file.open(encoding="utf8") as f:
        # read lines, ignoring fun characters such as 'LINE SEPARATOR' (U+2028)
        # which Python treats as line breaks when reading files
        # with the usual 'for line in f' pattern
        vocab_lines = f.read().split("\n")[:-1]
    assert len(vocab_lines) == vocab_size
    vocab, ranks = zip(*map(lambda l: l.split("\t"), vocab_lines))
    return vocab
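
For example, a hypothetical call (assuming the file names produced by the training script above and that vocab_file is a pathlib.Path):

from pathlib import Path

# spm_train writes ${model_prefix}.vocab next to ${model_prefix}.model
vocab = get_vocab(Path("corpus_preproc.txt.bpe.100000.vocab"), vocab_size=100000)
print(vocab[:5])  # SentencePiece lists the special symbols first, e.g. <unk>, <s>, </s>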

Now the function for converting from GloVe order embeddings to the proper order:

from gensim.models import keyedvectors
from dougu import to_from_idx  # https://github.com/bheinzerling/dougu/blob/d90e6c0ba92e61378c3c03df78ce5ba020f65ff8/dougu/iters.py#L70
import numpy as np

def convert_emb(glove_order_vocab_file, glove_order_emb_file, vocab_size):
    glove_order_vocab = get_vocab(glove_order_vocab_file, vocab_size)
    piece2id, id2piece = to_from_idx(glove_order_vocab)
    glove_order_emb = keyedvectors.KeyedVectors.load_word2vec_format(glove_order_emb_file)
    v = glove_order_emb.vectors
    # sample embeddings for symbols that didn't occur in the training
    # data from a normal distribution with the same mean and variance
    new_v = v.std() * np.random.randn(len(glove_order_vocab), v.shape[1]) + v.mean()
    new_vocab = {}
    # go through all entries (pieces) in the vocabulary with their corresponding id
    for id, piece in id2piece.items():
        try:
            new_v[id] = glove_order_emb[str(id)]  # str(id) assumes you used '--output_format=id', as described here https://github.com/bheinzerling/bpemb/issues/5#issuecomment-481616023
        except KeyError:
            pass
        # gensim sorts embeddings by -count when saving
        # set count to -id to preserve sentencepiece order
        assert piece not in new_vocab
        new_vocab[piece] = keyedvectors.Vocab(count=-id, index=id)

    # assemble a KeyedVectors instance in proper SentencePiece order (gensim 3.x API)
    proper_order_emb = keyedvectors.KeyedVectors(vector_size=v.shape[1])
    proper_order_emb.index2word = [id2piece[i] for i in range(len(id2piece))]
    proper_order_emb.vocab = new_vocab
    proper_order_emb.vectors = new_v
    return proper_order_emb

Copied this together from my actual scripts, let me know if this works for you.
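
To tie it together, a hypothetical end-to-end call (assuming gensim 3.x, file names following the training script above, and the loading snippet from the earlier comment):

from pathlib import Path
from bpemb import BPEmb
from bpemb.util import sentencepiece_load, load_word2vec_file

# hypothetical paths, following the naming convention of the training script above
proper_order_emb = convert_emb(
    Path("corpus_preproc.txt.bpe.100000.vocab"),
    "corpus_preproc.txt.bpe.100000.d100.glove.txt",
    vocab_size=100000)
proper_order_emb.save_word2vec_format("my_byte_pair_emb.w2v.bin", binary=True)

# then load as in the snippet above
bpemb = BPEmb(lang="en")
bpemb.spm = sentencepiece_load("corpus_preproc.txt.bpe.100000.model")
bpemb.emb = load_word2vec_file("my_byte_pair_emb.w2v.bin")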

@stefan-it
Contributor

@bheinzerling It would be awesome if the training routine could be added here (I'm currently training BPEmb embeddings for historic texts).

Currently, I'm using the default parameters as provided in the GloVe demo script (I only adjusted the dimension size to 300) 🤗

@bheinzerling
Owner

bheinzerling commented Sep 4, 2019

@stefan-it The main difference to the demo script is setting VOCAB_MIN_COUNT=0 which creates embeddings for all byte-pair symbols, not just frequent ones.

#! /usr/bin/env bash
set -eou pipefail

# set this to something else if you want to keep GloVe co-occurrence files permanently,
# say, to create embeddings of the same corpus with different dimensions
TMP=/tmp
mkdir -p $TMP

# need to set this
BUILDDIR=/SET/THIS/TO/PATH/OF/glove/build

# set this to something appropriate for your system
NUM_THREADS=24

# path of single plain text file containing the byte-pair encoded corpus
CORPUS=$1
# where the GloVe files should be saved
OUT=$2
# GloVe embedding dim
VECTOR_SIZE=$3

FNAME=$(echo $CORPUS | sed "s#/#_#g")
SAVE_FILE=$OUT.glove
VERBOSE=2
MEMORY=64.0

# we want embeddings for *all* BPE symbols
VOCAB_MIN_COUNT=0

MAX_ITER=50
WINDOW_SIZE=15
BINARY=0
X_MAX=10


# this part is probably not necessary unless you create lots of embeddings
VOCAB_FILE=$TMP/$FNAME.vocab.txt
COOCCURRENCE_FILE=$TMP/$FNAME.cooccurrence.bin
COOCCURRENCE_SHUF_FILE=$TMP/$FNAME.cooccurrence.shuf.bin
# random filenames for overflow and tempshuf files to prevent naming clashes
OVERFLOW=$TMP/${FNAME}.overflow_$(echo $RANDOM $RANDOM $RANDOM $RANDOM $RANDOM | md5sum | cut -c -8)
TEMPSHUF=$TMP/${FNAME}.tempshuf_$(echo $RANDOM $RANDOM $RANDOM $RANDOM $RANDOM | md5sum | cut -c -8)
# create vocab and cooccurrence files only once
if [ ! -f $VOCAB_FILE ]; then
	echo "$ $BUILDDIR/vocab_count -min-count $VOCAB_MIN_COUNT -verbose $VERBOSE < $CORPUS > $VOCAB_FILE"
	$BUILDDIR/vocab_count -min-count $VOCAB_MIN_COUNT -verbose $VERBOSE < $CORPUS > $VOCAB_FILE
fi
if [ ! -f $COOCCURRENCE_FILE ]; then
	echo "$ $BUILDDIR/cooccur -memory $MEMORY -vocab-file $VOCAB_FILE -verbose $VERBOSE -window-size $WINDOW_SIZE < $CORPUS > $COOCCURRENCE_FILE"
	$BUILDDIR/cooccur -memory $MEMORY -vocab-file $VOCAB_FILE -verbose $VERBOSE -window-size $WINDOW_SIZE -overflow-file $OVERFLOW < $CORPUS > $COOCCURRENCE_FILE
	if [ -f $OVERFLOW ]; then
		rm $OVERFLOW
	fi
fi
if [ ! -f $COOCCURRENCE_SHUF_FILE ]; then
	echo "$ $BUILDDIR/shuffle -memory $MEMORY -verbose $VERBOSE -temp-file $TEMPSHUF < $COOCCURRENCE_FILE > $COOCCURRENCE_SHUF_FILE"
	$BUILDDIR/shuffle -memory $MEMORY -verbose $VERBOSE -temp-file $TEMPSHUF < $COOCCURRENCE_FILE > $COOCCURRENCE_SHUF_FILE
	if [ -f $TEMPSHUF ]; then
		rm $TEMPSHUF
	fi
fi

# print the command we're running
echo "$ $BUILDDIR/glove -save-file $SAVE_FILE -threads $NUM_THREADS -input-file $COOCCURRENCE_SHUF_FILE -x-max $X_MAX -iter $MAX_ITER -vector-size $VECTOR_SIZE -binary $BINARY -vocab-file $VOCAB_FILE -verbose $VERBOSE -write-header 1 -alpha 0.75 -eta 0.03"

# the actual command
# GloVe will cause a segmentation fault for some combinations of large vocabulary sizes and large vector sizes.
# In those cases, changing alpha and eta slightly fixes the problem ¯\_(ツ)_/¯
$BUILDDIR/glove -save-file $SAVE_FILE -threads $NUM_THREADS -input-file $COOCCURRENCE_SHUF_FILE -x-max $X_MAX -iter $MAX_ITER -vector-size $VECTOR_SIZE -binary $BINARY -vocab-file $VOCAB_FILE -verbose $VERBOSE -write-header 1 -alpha 0.75 -eta 0.03

# delete the <unk> embedding, assumes that <unk> doesn't occur as part of some BPE symbol
sed -i "/<unk>/d" ${SAVE_FILE}.txt

@stephantul

For those interested: I created a Python script that trains a SentencePiece model on a training corpus, then segments the corpus and trains BPE embeddings. The end result is an embedding space that is aligned with the SentencePiece model. It doesn't use GloVe, though.

See here: https://github.com/stephantul/piecelearn
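
A rough sketch of that approach (not the piecelearn code, just the general idea: train SentencePiece via its Python API, segment the corpus, then train subword embeddings with gensim's word2vec instead of GloVe; all paths and parameters are placeholders):

import sentencepiece as spm
from gensim.models import Word2Vec

# train a BPE SentencePiece model on the (preprocessed) corpus
spm.SentencePieceTrainer.Train(
    "--input=corpus_preproc.txt --model_prefix=mymodel "
    "--vocab_size=100000 --model_type=bpe")

# segment the corpus with the trained model
# (for a large corpus you would stream instead of loading everything into memory)
sp = spm.SentencePieceProcessor()
sp.Load("mymodel.model")
with open("corpus_preproc.txt", encoding="utf8") as f:
    sentences = [sp.EncodeAsPieces(line.strip()) for line in f]

# min_count=0 keeps embeddings for all byte-pair symbols, as with GloVe above
# (size= is the gensim 3.x keyword; gensim 4.x calls it vector_size=)
w2v = Word2Vec(sentences, size=100, window=15, min_count=0)
w2v.wv.save_word2vec_format("mymodel.w2v.txt")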

@shantanu778

@bheinzerling I want to use BPEmb, but in your training script you used SentencePiece for training and encoding.
How can I use the BPEmb model for data preprocessing?
