Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Easy import of GloVe vectors using Gensim #625

Closed
wants to merge 78 commits into from

Conversation

manasRK
Copy link

@manasRK manasRK commented Mar 9, 2016

word2vec embeddings start with a line with the number of lines (tokens?) and the number of dimensions of the file. This allows gensim to allocate memory accordingly for querying the model. Larger dimensions mean larger memory is held captive. Accordingly, this line has to be inserted into the GloVe embeddings file.

word2vec embeddings start with a line with the number of lines (tokens?) and the number of dimensions of the file. This allows gensim to allocate memory accordingly for querying the model. Larger dimensions mean larger memory is held captive. Accordingly, this line has to be inserted into the GloVe embeddings file.
@tmylk
Copy link
Contributor

tmylk commented Mar 13, 2016

@manasRK Thanks for the PR.

It would be better to have this converter as a standalone script with usage like this:
python -m gensim.scripts.convert_glove_to_gensim glove_file.txt gensim_file.txt

If you could add a test for a small file <100Kb using check_output that would be great

@tmylk tmylk assigned tmylk and unassigned gojomo Mar 13, 2016
Function use to prepend lines using bash utilities in Linux.
(source: http://stackoverflow.com/a/10850588/610569)
"""
with open(infile, 'r') as old:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use binary mode: rb, wb. Otherwise won't work on Windows.

@piskvorky
Copy link
Owner

This PR has code style issue (indentation, spaces around binary operators etc -- see PEP8).

I'm not sure I understand its purpose -- does it insert one line at the beginning of a file? Isn't it easier to just use cat? Or what is the advantage of using this script?

@piskvorky
Copy link
Owner

Also, it seems to only work for a few hard-coded filenames (reads the dimensionality from filename?).

Isn't that too constraining / narrow purpose?

@manasRK
Copy link
Author

manasRK commented Mar 13, 2016

Agreed @piskvorky . The major motivation to create this code was to allow easy port of GloVe vectors into Gensim. I was already using Gensim massively in my explorations with word2vec. I didn't want to write another code for GloVe separately. I found out later that there is a minor difference in the formats and hence wanted to make it Gensim-compatible.

Not a big fan of 'hard-coding' either, but this is something that was suggested in a Github pull request earlier in my repositories. My earlier approach was to calculate thenum_lines and dims on the fly to create a new model file, something like this;

#Model File
fname="glove.6B.50d.txt" #GloVe models

#extract information to pre-pend to the final Gensim compatible file
word2vec_convert_file="word2vec_line.txt"
num_lines = sum(1 for line in open(fname))
dims= re.findall('\d+',fname.split('.')[2])

print '%d lines with %d dimensions' %(num_lines,dims)

with open(word2vec_convert_file,'wb') as f:
    f.write(str(num_lines)+ " " +str(dims) + '\n')
f.close()

model_file='glove_model.txt' #for Gensim's consumption

filenames = [word2vec_convert_file,fname]

with open(model_file, 'wb') as outfile:
    for fname in filenames:
        with open(fname) as infile:
            for line in infile:
                outfile.write(line)
outfile.close()    

Might be a bit in-efficient, but would love your suggestions to make it useful for guys who want to use Gensim for both GloVe and word2vec, without writing more code.

@tmylk I will add a test file soon as well. Thanks !

@piskvorky
Copy link
Owner

Sure. I think we can add such CLI script to convert glove => word2vec, it looks useful.

The script should accept the two filenames (glove, gensim) as CLI arguments, then use smart_open to do the actual writing/reading:

python -m gensim.scripts.glove2word2vec glove_vectors.txt word2vec_vectors.txt

In other words, rather than an ad-hoc script with print statements for a few hardcoded files, we definitely want this to be a generic script for any glove input.

@manasRK
Copy link
Author

manasRK commented Mar 13, 2016

Great, I will edit my script to do the same. Make it a CLI file, which can take in arguments as you suggested and create a file to consume in Gensim.

@piskvorky
Copy link
Owner

Thanks @manasRK .

Have a look at other scripts in the /gensim/scripts directory for inspiration and code style. Best of luck!

CLI USAGE: python glove2word2vec.py <GloVe vector file> <Output model file>

Convert GloVe vectors into Gensim compatible format to instantiate from an existing file on disk in the word2vec C format;

model = gensim.models.Word2Vec.load_word2vec_format('/tmp/vectors.txt', binary=False)  # C text format

word2vec embeddings start with a line with the number of lines (tokens?) and the number of dimensions of the file. This allows gensim to allocate memory accordingly for querying the model. Larger dimensions mean larger memory is held captive. Accordingly, this line has to be inserted into the GloVe embeddings file.
@manasRK
Copy link
Author

manasRK commented Mar 14, 2016

I have edited the code for use in commandline in current form as

python glove2word2vec.py <GloVe vector file> <Output model file>

Further suggestions welcome. :)

# -*- coding: utf-8 -*-
#
# Copyright (C) 2016 Manas Ranjan Kar <[email protected]>
# Licensed under the MIT License https://opensource.org/licenses/MIT
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gensim uses LGPL (see headers of other files).

Updated changes with @piskvorky 's suggestions.
@piskvorky
Copy link
Owner

Good progress :) I added a few more comments around Pythonic code style.

manasRK added 2 commits March 14, 2016 15:29
More suggestions integrated. Hopefully, didn't miss anything.
Updated with more suggestions from @piskvorky
@manasRK
Copy link
Author

manasRK commented Mar 15, 2016

@piskvorky I have updated the file with your recommended changes.


num_lines, dims= get_info(glove_vector_file)
gensim_first_line = "{} {}".format(num_lines, dims)
model_file=prepend_line(glove_vector_file, output_model_file, gensim_first_line)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PEP8: Spaces around binary operator (here, above, below).

@manasRK
Copy link
Author

manasRK commented Mar 24, 2016

@tmylk Finally managed to get rid of string literal errors with Python 3. And I officially hate Python 3 now. My PR is ready to be merged . :)

Helpful link for anyone who follows this thread later, difference between Python 3 and Python 2.

http://lucumr.pocoo.org/2014/1/5/unicode-in-2-and-3/

@tmylk
Copy link
Contributor

tmylk commented Mar 24, 2016

Great job. Thanks a lot. Did you run pep8 on it?

@manasRK
Copy link
Author

manasRK commented Mar 24, 2016

Yes, checked with PEP8 and PEP257 python packages as well.

@@ -114,6 +104,11 @@ def readfile(fname):
import wheelhouse_uploader.cmd
cmdclass.update(vars(wheelhouse_uploader.cmd))

python_2_6_backports = ''
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does an "empty string" inside setup's install_requires actually do? Is that ok?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comes from #634 . @tmylk can shed some light on this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We install argparse only if Python version is 2.6. By default we don't need to install any backports, hence an empty string.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question was, what does an empty string imply in install_requires. Ignored? Warning? Exception?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The empty string doesn't cause any output in Travis.
https://travis-ci.org/piskvorky/gensim/jobs/118421963#L290

@@ -6,7 +6,6 @@

"""
Run with:

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@manasRK I still see many such deleted documentation lines in the diff (top tab, "Files changed").

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-added deleted blank lines.

@manasRK
Copy link
Author

manasRK commented Mar 28, 2016

@tmylk @piskvorky Codes are now okay? Will close the PR if you are ready to merge :)

@piskvorky piskvorky added the feature Issue described a new feature label Mar 28, 2016
@piskvorky
Copy link
Owner

Yes, this is in @tmylk 's hands now, thanks @manasRK !

# -*- coding: utf-8 -*-
#
# Copyright (C) 2016 Radim Rehurek <[email protected]>
# Copyright (C) 2016 Manas Ranjan Kar <[email protected]>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just checking that you are aware of the legal guidelines of contributing. "By submitting your contribution to be included in the gensim project, you agree to assign joint ownership of your changes (your code patch, documentation fix, whatever) to me, Radim Řehůřek."

https://github.com/piskvorky/gensim/wiki/Developer-page#legal

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, surely. That's why I added @piskvorky 's name in the code.

@tmylk
Copy link
Contributor

tmylk commented Mar 28, 2016

Please add note about this new feature to changelog file.
Also see comments about copyright and removing word2vec_standalone.

If you resubmit the PR against the develop branch then it will be merged in. (Master branch is always the latest pypi release.)

Also please squash the commits into one using rebase command.

@piskvorky
Copy link
Owner

My notes:

Doc fixes in word2vec_standalone are fine. Joint copyright ownership also fine.

If the PR against develop branch or squashing the commits gives you trouble @manasRK, @tmylk can help you. Let's get this merged ASAP.

@tmylk
Copy link
Contributor

tmylk commented Mar 29, 2016

@manasRK Happy to talk in https://gitter.im/piskvorky/gensim if needed

manasRK added a commit to manasRK/gensim that referenced this pull request Mar 29, 2016
@manasRK manasRK mentioned this pull request Mar 29, 2016
tmylk added a commit that referenced this pull request Apr 11, 2016
@tmylk
Copy link
Contributor

tmylk commented Apr 11, 2016

Merged in 982dc3c

@tmylk tmylk closed this Apr 11, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature Issue described a new feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants