Easy import of GloVe vectors using Gensim #625
A word2vec embeddings file in text format starts with a header line giving the number of lines (one vector per token) and the number of dimensions of the vectors. This lets gensim allocate memory up front when loading the model; more dimensions mean more memory is held. GloVe files have no such header, so this line has to be inserted at the top of the GloVe embeddings file.
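As a rough illustration of the header described above, here is a minimal sketch that counts the vectors and dimensions in a GloVe text file and writes a copy with the word2vec header line prepended. The function names `glove_shape` and `add_word2vec_header` are hypothetical and are not the names used in this PR; the sketch assumes each GloVe line is a token followed by whitespace-separated floats.

```python
def glove_shape(glove_path):
    """Return (num_vectors, num_dims) for a GloVe text file.

    Hypothetical helper: assumes one vector per line, formatted as
    "<token> <float> <float> ...".
    """
    num_vectors = 0
    num_dims = 0
    with open(glove_path, "rb") as fin:
        for line in fin:
            if num_vectors == 0:
                # Everything after the token is a vector component.
                num_dims = len(line.split()) - 1
            num_vectors += 1
    return num_vectors, num_dims


def add_word2vec_header(glove_path, output_path):
    """Copy glove_path to output_path with a "<count> <dims>" first line."""
    num_vectors, num_dims = glove_shape(glove_path)
    # Binary mode keeps line endings untouched on every platform.
    with open(glove_path, "rb") as fin, open(output_path, "wb") as fout:
        fout.write("{} {}\n".format(num_vectors, num_dims).encode("utf-8"))
        for line in fin:
            fout.write(line)
    return num_vectors, num_dims
```

The file is read twice (once to count, once to copy) so that memory use stays flat even for large embedding files.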
@manasRK Thanks for the PR. It would be better to have this converter as a standalone script with usage like this: If you could add a test for a small file (<100 KB) using check_output, that would be great.
Function use to prepend lines using bash utilities in Linux.
(source: http://stackoverflow.com/a/10850588/610569)
"""
with open(infile, 'r') as old:
Use binary mode: rb, wb. Otherwise it won't work on Windows.
This PR has code style issues (indentation, spaces around binary operators etc -- see PEP8). I'm not sure I understand its purpose -- does it insert one line at the beginning of a file? Isn't it easier to just use
Also, it seems to only work for a few hard-coded filenames (reads the dimensionality from the filename?). Isn't that too constraining / too narrow a purpose?
Agreed @piskvorky. The major motivation for this code was to allow easy porting of GloVe vectors into Gensim. I was already using Gensim heavily in my explorations with word2vec, and I didn't want to write separate code for GloVe. I found out later that there is a minor difference between the formats and hence wanted to make the GloVe files Gensim-compatible. Not a big fan of 'hard-coding' either, but this is something that was suggested in an earlier GitHub pull request in my repositories. My earlier approach was to calculate the
Might be a bit inefficient, but I would love your suggestions to make it useful for people who want to use Gensim for both GloVe and word2vec without writing more code. @tmylk I will add a test file soon as well. Thanks!
Sure. I think we can add such a CLI script to convert glove => word2vec, it looks useful. The script should accept the two filenames (glove, gensim) as CLI arguments, then use

python -m gensim.scripts.glove2word2vec glove_vectors.txt word2vec_vectors.txt

In other words, rather than an ad-hoc script with print statements for a few hardcoded files, we definitely want this to be a generic script for any glove input.
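A generic command-line converter along the lines requested above might look roughly like the following sketch. It is not the script that was merged; `build_parser`, `convert`, and `main` are hypothetical names, and the only assumption about the input is that it is a GloVe text file with one vector per line.

```python
import argparse
import shutil


def build_parser():
    """Two positional arguments: the GloVe input and the word2vec output."""
    parser = argparse.ArgumentParser(
        description="Convert GloVe vectors to the word2vec text format.")
    parser.add_argument("glove_file", help="input GloVe vectors (text)")
    parser.add_argument("word2vec_file", help="output file with header line")
    return parser


def convert(glove_file, word2vec_file):
    """Write word2vec_file as "<count> <dims>" followed by the GloVe lines."""
    with open(glove_file, "rb") as fin:
        first = fin.readline()
        num_dims = len(first.split()) - 1 if first else 0
        # Count the first line (if any) plus all remaining lines.
        num_vectors = (1 if first else 0) + sum(1 for _ in fin)
    with open(glove_file, "rb") as fin, open(word2vec_file, "wb") as fout:
        fout.write("{} {}\n".format(num_vectors, num_dims).encode("utf-8"))
        shutil.copyfileobj(fin, fout)
    return num_vectors, num_dims


def main(argv=None):
    """Parse argv (defaults to sys.argv[1:]) and run the conversion."""
    args = build_parser().parse_args(argv)
    return convert(args.glove_file, args.word2vec_file)
```

Calling `main(["glove_vectors.txt", "word2vec_vectors.txt"])` converts those two files; calling `main()` under an `if __name__ == "__main__":` guard would read the filenames from the command line instead.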
Great, I will edit my script to do the same: make it a CLI file that takes arguments as you suggested and creates a file to consume in Gensim.
Thanks @manasRK. Have a look at other scripts in the
CLI USAGE: python glove2word2vec.py <GloVe vector file> <Output model file>

Convert GloVe vectors into Gensim compatible format to instantiate from an existing file on disk in the word2vec C format;

model = gensim.models.Word2Vec.load_word2vec_format('/tmp/vectors.txt', binary=False) # C text format

word2vec embeddings start with a line with the number of lines (tokens?) and the number of dimensions of the file. This allows gensim to allocate memory accordingly for querying the model. Larger dimensions mean larger memory is held captive. Accordingly, this line has to be inserted into the GloVe embeddings file.
I have edited the code for use on the command line, in its current form as
Further suggestions welcome. :)
# -*- coding: utf-8 -*-
#
# Copyright (C) 2016 Manas Ranjan Kar <[email protected]>
# Licensed under the MIT License https://opensource.org/licenses/MIT
Gensim uses LGPL (see headers of other files).
Updated changes with @piskvorky's suggestions.
Good progress :) I added a few more comments around Pythonic code style.
More suggestions integrated. Hopefully, didn't miss anything.
Updated with more suggestions from @piskvorky
@piskvorky I have updated the file with your recommended changes.
num_lines, dims= get_info(glove_vector_file)
gensim_first_line = "{} {}".format(num_lines, dims)
model_file=prepend_line(glove_vector_file, output_model_file, gensim_first_line)
PEP8: Spaces around binary operator (here, above, below).
@tmylk Finally managed to get rid of the string literal errors with Python 3. And I officially hate Python 3 now. My PR is ready to be merged. :) Helpful link for anyone who follows this thread later: difference between Python 3 and Python 2.
Great job. Thanks a lot. Did you run pep8 on it?
Yes, checked with the PEP8 and PEP257 python packages as well.
@@ -114,6 +104,11 @@ def readfile(fname):
import wheelhouse_uploader.cmd
cmdclass.update(vars(wheelhouse_uploader.cmd))

python_2_6_backports = ''
What does an "empty string" inside setup's install_requires actually do? Is that ok?
We install argparse only if Python version is 2.6. By default we don't need to install any backports, hence an empty string.
Question was, what does an empty string imply in install_requires. Ignored? Warning? Exception?
The empty string doesn't cause any output in Travis.
https://travis-ci.org/piskvorky/gensim/jobs/118421963#L290
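One way to sidestep the empty-string question entirely is to build the requirements as a list and append the backport only when it is needed. The sketch below is an alternative to what the PR does, not a description of it: `backports_for` and `BASE_REQUIREMENTS` are hypothetical names, and the base requirement shown is a placeholder rather than gensim's actual dependency list.

```python
import sys

# Placeholder for the package's real requirements (hypothetical value).
BASE_REQUIREMENTS = ["example-package"]


def backports_for(version_info):
    """Return extra requirements needed by the given interpreter version."""
    # argparse entered the standard library in Python 2.7, so only
    # Python 2.6 needs the backport package.
    if tuple(version_info[:2]) == (2, 6):
        return ["argparse"]
    return []


# No empty-string entry is ever passed on to setup().
install_requires = BASE_REQUIREMENTS + backports_for(sys.version_info)
```

With this shape, `install_requires` contains only real package names on every Python version, so how setuptools treats an empty string no longer matters.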
@@ -6,7 +6,6 @@

"""
Run with:

@manasRK I still see many such deleted documentation lines in the diff (top tab, "Files changed").
Re-added deleted blank lines.
@tmylk @piskvorky Is the code okay now? Will close the PR if you are ready to merge :)
# -*- coding: utf-8 -*-
#
# Copyright (C) 2016 Radim Rehurek <[email protected]>
# Copyright (C) 2016 Manas Ranjan Kar <[email protected]>
Just checking that you are aware of the legal guidelines of contributing. "By submitting your contribution to be included in the gensim project, you agree to assign joint ownership of your changes (your code patch, documentation fix, whatever) to me, Radim Řehůřek."
https://github.com/piskvorky/gensim/wiki/Developer-page#legal
Yes, surely. That's why I added @piskvorky's name in the code.
Please add a note about this new feature to the changelog file. If you resubmit the PR against the develop branch, then it will be merged in. (The master branch is always the latest PyPI release.) Also, please squash the commits into one using the rebase command.
@manasRK Happy to talk in https://gitter.im/piskvorky/gensim if needed
Merged in 982dc3c