Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New feature: wordrank wrapper #1066

Merged
merged 20 commits into from
Jan 23, 2017
Merged

Conversation

parulsethi
Copy link
Contributor

@parulsethi parulsethi commented Dec 30, 2016

This PR adds python wrapper for Wordrank.
utils.check_output is modified here to pass an open file as stdout which is required in wrapper's train method.
Todo:

  • tutorial for using the wrapper
  • comparison with word2vec and fasttext blog and code
  • run dtm/mallet wrapper tests to make sure the check_output is not broken

@@ -118,7 +118,7 @@ def readfile(fname):

python_2_6_backports = ''
if sys.version_info[:2] < (2, 7):
python_2_6_backports = ['argparse', 'subprocess32']
python_2_6_backports = ['argparse', 'subprocess32', 'backport_collections']
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should subprocess32 be removed now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

@@ -1154,7 +1154,7 @@ def check_output(*popenargs, **kwargs):
Added extra KeyboardInterrupt handling
"""
try:
process = subprocess.Popen(stdout=subprocess.PIPE, *popenargs, **kwargs)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stdout default has to be specified for other wrappers to work

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you confirm that ldamallet wrapper works even without the default stdout specified? what prevents you from keeping it as default?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ldamallet wrapper test passes without the default stdout

@tmylk
Copy link
Contributor

tmylk commented Jan 10, 2017

Please add the new class to gensim/docs/src/apiref.rst and create an RST files as in #961

Copy link
Contributor

@tmylk tmylk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor changes

@@ -0,0 +1,286 @@
{
"cells": [
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's call it "WordRank_wrapper_quickstart.ipynb"

],
"source": [
"word_similarity_file = 'datasets/ws-353.txt'\n",
"model.wv.evaluate_word_pairs(word_similarity_file)"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

merge in latest develop to get correct output of this cell

"source": [
"# Comparison of WordRank, Word2Vec and FastText\n",
"\n",
"Wordrank is a fresh new approach to the word embeddings, which formulates it as a ranking problem. That is, given a word w, it aims to output an ordered list (c1, c2, · · ·) of context words such that words that co-occur with w appear at the top of the list. This formulation fits naturally to popular word embedding tasks such as word similarity/analogy since instead of the likelihood of each word, we are interested in finding the most relevant words\n",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

link to wordrank from here. same link as in references.

"metadata": {},
"source": [
"# Comparison of WordRank, Word2Vec and FastText\n",
"\n",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a link to your blog. say "this ipynb accompanies a more theoretical blog post [link] "

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

adding dummy url, will change if final url changes

>>> print model[word] # prints vector for given words

.. [1] https://bitbucket.org/shihaoji/wordrank/
.. [2] https://arxiv.org/pdf/1506.02761v3.pdf
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add

 # Copyright (C) 2017 Parul Sethi <email>
 # Copyright (C) 2017 Radim Rehurek <[email protected]>
# Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl.html

copyfile(corpus_file, os.path.join(meta_dir, corpus_file.split('/')[-1]))
os.chdir(meta_dir)

cmd0 = ['../../glove/vocab_count', '-min-count', str(min_count), '-max-vocab', str(max_vocab_size)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please give meaningful names to variables like 'cmd_vocab_count'

cmd3 = ['cut', '-d', " ", '-f', '1', temp_vocab_file]
cmds = [cmd0, cmd1, cmd2, cmd3]
logger.info("Preparing training data using glove code '%s'", cmds)
o0 = smart_open(temp_vocab_file, 'w')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

meaningful names here too please

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's safer to open files in binary mode (wb), and explicitly encode all strings written there.

Has this been tested on unicode (non-ASCII) data?

Copy link
Contributor Author

@parulsethi parulsethi Jan 12, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tried on hindi characters, it works

self.wv.vocab[word].count = counts[word]

def ensemble_embedding(self, word_embedding, context_embedding):
"""Addition of two embeddings."""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better docstring Replace syn0 with the sum of context and word embeddings

glove2word2vec(context_embedding, context_embedding+'.w2vformat')
w_emb = Word2Vec.load_word2vec_format('%s.w2vformat' % word_embedding)
c_emb = Word2Vec.load_word2vec_format('%s.w2vformat' % context_embedding)
assert Counter(w_emb.wv.index2word) == Counter(c_emb.wv.index2word), 'Vocabs are not same for both embeddings'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a vocab comparison would do. no need for Counter

outputs = [o0, o1, o2, o3]
inputs = [i0, i1, i2, i3]
prepare_train_data = [utils.check_output(cmd, stdin=inp, stdout=out) for cmd, inp, out in zip(cmds, inputs, outputs)]
o0.close()
Copy link
Owner

@piskvorky piskvorky Jan 12, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Best practice is to use context managers for opening files for writing (with smart_open() as input0: ...).

This whole code section would read better if rewritten as a loop (for command, input_fname, output_fname in zip(commands, input_fnames, output_fnames): with smart_open(...): ...).

glove2word2vec(context_embedding, context_embedding+'.w2vformat')
w_emb = Word2Vec.load_word2vec_format('%s.w2vformat' % word_embedding)
c_emb = Word2Vec.load_word2vec_format('%s.w2vformat' % context_embedding)
assert Counter(w_emb.wv.index2word) == Counter(c_emb.wv.index2word), 'Vocabs are not same for both embeddings'
assert set(w_emb.wv.index2word) == set(c_emb.wv.index2word), 'Vocabs are not same for both embeddings'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it possible to compare wv.vocab?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yes, similarly using set(wv.vocab)
correcting in next commit

@@ -0,0 +1,353 @@
love sex 6.77
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -0,0 +1,999 @@
old new 1.58
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please move to test/test_data so it can be used in other code. Similar to https://github.com/parulsethi/gensim/blob/develop/gensim/test/test_data/wordsim353.tsv

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

"# WordRank wrapper tutorial on Lee Corpus\n",
"\n",
"WordRank is a new word embedding algorithm which captures the semantic similarities in a text data well. See this [notebook](https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/Wordrank_comparisons.ipynb) for it's comparisons to other popular embedding models. This tutorial will serve as a guide to use the WordRank wrapper in gensim. You need to install [WordRank](https://bitbucket.org/shihaoji/wordrank) before proceeding with this tutorial.\n",
"\n",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@@ -234,7 +234,7 @@ def show_topics(self, num_topics=10, num_words=10, log=False, formatted=True):
if formatted:
topic = self.print_topic(i, topn=num_words)
else:
topic = self.show_topic(i, topn=num_words)
topic = self.show_topic(i, num_words=num_words)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this shouldn't be in this pr

Copy link
Contributor

@tmylk tmylk Jan 15, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree that num_words is correct

Copy link
Contributor Author

@parulsethi parulsethi Jan 16, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(just to mention in review thread)
show_topic() doesn't have topn keyword argument but num_words, this fixes it

@@ -1146,15 +1146,16 @@ def keep_vocab_item(word, count, min_count, trim_rule=None):
else:
return default_res

def check_output(*popenargs, **kwargs):
def check_output(*popenargs, stdout=subprocess.PIPE, **kwargs):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will keep the previous default stdout=subprocess.PIPE for other wrappers, and a different stdout can be defined for wordrank wrapper

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm bit doubtful about a workaround for this, it gave a syntax error for python2 in above check.
If I specify stdout=subprocess.PIPE as first argument for making py2 compatible, it won't work for other wrappers as they have cmd(basic command) as their first argument

@@ -103,7 +103,7 @@ def train(cls, wr_path, corpus_file, out_path, size=100, window=15, symmetric=1,
for command, input_fname, output_fname in zip(commands, input_fnames, output_fnames):
with smart_open(input_fname, 'rb') as r:
with smart_open(output_fname, 'wb') as w:
utils.check_output(command, stdin=r, stdout=w)
utils.check_output(command, stdout=w, stdin=r)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed as per current utils.check_output update in this PR

@tmylk tmylk changed the title added wordrank wrapper New feature: wordrank wrapper Jan 22, 2017
@@ -1146,15 +1146,15 @@ def keep_vocab_item(word, count, min_count, trim_rule=None):
else:
return default_res

def check_output(*popenargs, stdout=subprocess.PIPE, **kwargs):
def check_output(stdout=subprocess.PIPE, *popenargs, **kwargs):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this keeps default stdout=subprocess.PIPE. and check_output calls in all wrappers are updated to specify cmd as keyword argument rather than positional to work with this change

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants