Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor documentation for gensim.models.phrases #1950

Merged
merged 21 commits into from
Apr 3, 2018

Conversation

CLearERR
Copy link
Contributor

@CLearERR CLearERR commented Mar 4, 2018

@menshikh-iv menshikh-iv added the incubator project PR is RaRe incubator project label Mar 5, 2018

Parameters
----------
worda : str
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't forget about descriptions

Parameters
----------
args : object
Sequence of arguments, see :meth:`...` for more information.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...? you should link to SaveLoad.load I think

and `phrases[corpus]` syntax.

"""Detect phrases, based on collected collocation counts. Adjacent words that appear together more frequently than
expected are joined together with the `_` character. It can be used to generate phrases on the fly,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_ - this can be changed

setting. `scoring` can be set with either a string that refers to a built-in scoring function,
or with a function with the expected parameter names. Two built-in scoring functions are available
by setting `scoring` to a string:
sentences : list of str, optional
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

iterable of list of str

min_count : int, optional
Ignore all words and bigrams with total collected count lower
than this.
threshold : int, optional
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

float

available memory you have.
delimiter : str, optional
Glue character used to join collocation tokens, should be a byte string (e.g. b'_').
scoring : {'default', 'npmi'} http://www.sphinx-doc.org/en/master/rest.html
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's a link?

Specify how potential phrases are scored for comparison to the `threshold` setting.
`scoring` can be set with either a string that refers to a built-in scoring function, or with a function
with the expected parameter names. Two built-in scoring functions are available by setting `scoring` to a
string:

'default': from "Efficient Estimaton of Word Representations in Vector Space" by
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing part (don't forget to use enumerate list)

Parameters
----------
args : object
Sequence of arguments, see :meth:`...` for more information.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comments as for previous load

@@ -373,7 +408,17 @@ def __str__(self):
@staticmethod
def learn_vocab(sentences, max_vocab_size, delimiter=b'_', progress_per=10000,
common_terms=frozenset()):
"""Collect unigram/bigram counts from the `sentences` iterable."""
"""Collect unigram/bigram counts from the `sentences` iterable. #TODO: Через пустой Phrasers
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#TODO: Через пустой Phrasers - so Russian :D

try:
return self.phrasegrams[tuple(components)][1]
except KeyError:
return -1

def __getitem__(self, sentence):
"""
Convert the input tokens `sentence` (=list of unicode strings) into phrase
"""Convert the input tokens `sentence` (=list of unicode strings) into phrase
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't use this (=list of unicode strings), better to write concrete types for arguments.

@menshikh-iv menshikh-iv changed the title Create PRRefactor documentation for gensim.models.phrases. Refactor documentation for gensim.models.phrases Mar 14, 2018
@@ -68,11 +70,6 @@
>>> print(bigram[sent])
[u'the', u'mayor', u'shows', u'his', u'lack_of_interest']
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should fix this example

with the expected parameter names. Two built-in scoring functions are available by setting `scoring` to a
string:

1. `default` - :meth:`~gensim.models.phrases.original_scorer`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is :func:, not :meth:

>>> from gensim.models.phrases import Phrases
>>> sentences = Text8Corpus(datapath('testcorpus.txt'))
>>> bigram = Phrases(sentences, min_count=5, threshold=100)
>>> print bigram
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a bad example, you should to show bigram extraction here

Parameters
----------
args : object
Sequence of arguments, see :class:`~gensim.models.phrases.Phrases` for more information.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incorrect links, you should refer to parent class (i.e. SaveLoad.load)

>>> from gensim.models.phrases import Phrases
>>> sentences = Text8Corpus(datapath('testcorpus.txt'))
>>> learned = Phrases.learn_vocab(sentences,40000)
>>> print learned
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is it?

>>> from gensim.models.phrases import Phrases, Phraser
>>> sentences = Text8Corpus(datapath('testcorpus.txt'))
>>> phrases_model = Phrases(sentences, min_count=5, threshold=100)
>>> phraser_model = Phraser(phrases_model)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and what? how to use this classes?

>>> phrases_model = Phrases(sentences, min_count=5, threshold=100)
>>> phraser_model = Phraser(phrases_model)
>>> pseudo = phraser_model.pseudocorpus(phrases_model)
//>>> phraser_model.score_item("tree","human",pseudo,'default')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

???

>>> phraser_model = Phraser(phrases_model)
>>> pseudo = phraser_model.pseudocorpus(phrases_model)
//>>> phraser_model.score_item("tree","human",pseudo,'default')
>>> phraser_model.score_item(u"tree",u"human",pseudo,'default')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about demonstration of this function, this isn't really needed

>>> sentences = Text8Corpus(datapath('testcorpus.txt'))
>>> phrases_model = Phrases(sentences, min_count=5, threshold=100)
>>> phraser_model = Phraser(phrases_model)
>>> pseudo = phraser_model.pseudocorpus(phrases_model)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why did this need?

>>> phrases_model = Phrases(sentences, min_count=5, threshold=100)
>>> phraser_model = Phraser(phrases_model)
>>> pseudo = phraser_model.pseudocorpus(phrases_model)
>>> phraser_model["tree", "human"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

incorrect, please use phraser_model[["tree", "human"]]

@menshikh-iv
Copy link
Contributor

Good work @CLearERR 👍

@menshikh-iv menshikh-iv merged commit 5677ab3 into piskvorky:develop Apr 3, 2018
Copy link
Owner

@piskvorky piskvorky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor code style fixes needed.

@@ -933,12 +956,40 @@ def score_item(self, worda, wordb, components, scorer):
>>> from gensim.models.word2vec import Text8Corpus
>>> from gensim.models.phrases import Phrases, Phraser
>>> sentences = Text8Corpus(datapath('testcorpus.txt'))
>>> #train the detector with
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PEP8: # followed by one space (here and elsewhere).

>>> #So we get 2 phrases
>>> res = phraser_model[sent]
>>> for phrase in res:
>>> print phrase
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Best use brackets, for py3k compatibility.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
incubator project PR is RaRe incubator project
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants