Refactor documentation for gensim.models.phrases
#1950
gensim/models/phrases.py
Outdated
|
||
Parameters | ||
---------- | ||
worda : str |
don't forget about descriptions
gensim/models/phrases.py
Outdated
Parameters | ||
---------- | ||
args : object | ||
Sequence of arguments, see :meth:`...` for more information. |
`...`? I think you should link to `SaveLoad.load`.
gensim/models/phrases.py
Outdated
and `phrases[corpus]` syntax. | ||
|
||
"""Detect phrases, based on collected collocation counts. Adjacent words that appear together more frequently than | ||
expected are joined together with the `_` character. It can be used to generate phrases on the fly, |
`_` - this can be changed.
gensim/models/phrases.py
Outdated
setting. `scoring` can be set with either a string that refers to a built-in scoring function, | ||
or with a function with the expected parameter names. Two built-in scoring functions are available | ||
by setting `scoring` to a string: | ||
sentences : list of str, optional |
iterable of list of str
gensim/models/phrases.py
Outdated
min_count : int, optional | ||
Ignore all words and bigrams with total collected count lower | ||
than this. | ||
threshold : int, optional |
float
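A minimal sketch of how the two type corrections above (`iterable of list of str` for `sentences`, `float` for `threshold`) might look in a numpydoc-style docstring. The function name `train_phrases_stub`, its default values, its body, and the wording of the `sentences` and `threshold` descriptions are assumptions made for illustration; only the parameter names and the corrected types come from the review.

```python
def train_phrases_stub(sentences=None, min_count=5, threshold=10.0):
    """Hypothetical stub that only illustrates the docstring layout.

    Parameters
    ----------
    sentences : iterable of list of str, optional
        Stream of tokenized sentences, e.g. ``[["new", "york"], ["hello"]]``.
    min_count : int, optional
        Ignore all words and bigrams with total collected count lower than this.
    threshold : float, optional
        Score that a candidate bigram must exceed to be treated as a phrase.

    """
    # The stub does no training; it echoes the numeric settings back.
    return min_count, float(threshold)
```

Writing the element type explicitly (`list of str`) matches the later request in this review to give concrete argument types instead of notes such as `(=list of unicode strings)`.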
gensim/models/phrases.py
Outdated
available memory you have. | ||
delimiter : str, optional | ||
Glue character used to join collocation tokens, should be a byte string (e.g. b'_'). | ||
scoring : {'default', 'npmi'} http://www.sphinx-doc.org/en/master/rest.html |
What is this link for?
gensim/models/phrases.py
Outdated
Specify how potential phrases are scored for comparison to the `threshold` setting. | ||
`scoring` can be set with either a string that refers to a built-in scoring function, or with a function | ||
with the expected parameter names. Two built-in scoring functions are available by setting `scoring` to a | ||
string: | ||
|
||
'default': from "Efficient Estimaton of Word Representations in Vector Space" by |
Missing part (don't forget to use an enumerated list).
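For readers following the `scoring` discussion, here is a rough sketch of the two built-in scores in plain Python. The formulas are written from memory of gensim's default and NPMI scorers and should be treated as an assumption to check against the gensim source; the function names `default_score_sketch` and `npmi_score_sketch` are hypothetical.

```python
import math


def default_score_sketch(worda_count, wordb_count, bigram_count, len_vocab, min_count):
    """Assumed form of the 'default' score: discounted bigram count over unigram counts."""
    return (bigram_count - min_count) / worda_count / wordb_count * len_vocab


def npmi_score_sketch(worda_count, wordb_count, bigram_count, corpus_word_count):
    """Assumed form of the 'npmi' score: normalized pointwise mutual information."""
    pa = worda_count / corpus_word_count
    pb = wordb_count / corpus_word_count
    pab = bigram_count / corpus_word_count
    return math.log(pab / (pa * pb)) / -math.log(pab)
```

Under these assumptions the default score is unbounded above, while the NPMI score is 1 when two words only ever occur together and 0 when they co-occur exactly as often as chance predicts. That difference in range would explain why `threshold` needs to accept a float.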
gensim/models/phrases.py
Outdated
Parameters | ||
---------- | ||
args : object | ||
Sequence of arguments, see :meth:`...` for more information. |
Same comments as for the previous `load`.
gensim/models/phrases.py
Outdated
@@ -373,7 +408,17 @@ def __str__(self): | |||
@staticmethod | |||
def learn_vocab(sentences, max_vocab_size, delimiter=b'_', progress_per=10000, | |||
common_terms=frozenset()): | |||
"""Collect unigram/bigram counts from the `sentences` iterable.""" | |||
"""Collect unigram/bigram counts from the `sentences` iterable. #TODO: Через пустой Phrasers |
`#TODO: Через пустой Phrasers` ("via an empty Phrasers") - so Russian :D
gensim/models/phrases.py
Outdated
try: | ||
return self.phrasegrams[tuple(components)][1] | ||
except KeyError: | ||
return -1 | ||
|
||
def __getitem__(self, sentence): | ||
""" | ||
Convert the input tokens `sentence` (=list of unicode strings) into phrase | ||
"""Convert the input tokens `sentence` (=list of unicode strings) into phrase |
Don't use this `(=list of unicode strings)`; better to write concrete types for arguments.
gensim.models.phrases
.gensim.models.phrases
gensim/models/phrases.py
Outdated
@@ -68,11 +70,6 @@ | |||
>>> print(bigram[sent]) | |||
[u'the', u'mayor', u'shows', u'his', u'lack_of_interest'] |
You should fix this example
gensim/models/phrases.py
Outdated
with the expected parameter names. Two built-in scoring functions are available by setting `scoring` to a | ||
string: | ||
|
||
1. `default` - :meth:`~gensim.models.phrases.original_scorer`. |
This is `:func:`, not `:meth:`.
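In Sphinx, the `:func:` role cross-references a module-level function and `:meth:` a method on a class, so a scorer referenced by its module path takes `:func:`. A sketch of the corrected docstring fragment, written as an enumerated list, follows. The stub name `scoring_doc_stub` is hypothetical, and the second entry (`npmi_scorer`) is an assumption, since only the first list item appears in the diff.

```python
def scoring_doc_stub(scoring='default'):
    """Hypothetical stub that only illustrates the cross-reference roles.

    Parameters
    ----------
    scoring : {'default', 'npmi'}, optional
        Two built-in scoring functions are available by setting `scoring` to a string:

        1. `default` - :func:`~gensim.models.phrases.original_scorer`.
        2. `npmi` - :func:`~gensim.models.phrases.npmi_scorer`.

    """
    # Nothing is scored here; the docstring is the point of the example.
    return scoring
```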
gensim/models/phrases.py
Outdated
>>> from gensim.models.phrases import Phrases | ||
>>> sentences = Text8Corpus(datapath('testcorpus.txt')) | ||
>>> bigram = Phrases(sentences, min_count=5, threshold=100) | ||
>>> print bigram |
That's a bad example; you should show bigram extraction here.
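To make "show bigram extraction" concrete without depending on the test corpus, here is a self-contained toy in plain Python. It is not gensim's implementation: the helper names `learn_counts` and `join_phrases`, the tiny corpus, and the scoring formula (assumed to mirror the default scorer) are all illustrative assumptions.

```python
from collections import Counter


def learn_counts(sentences):
    """Count unigrams and adjacent-word bigrams over tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        unigrams.update(sentence)
        bigrams.update(zip(sentence, sentence[1:]))
    return unigrams, bigrams


def join_phrases(sentence, unigrams, bigrams, min_count=1, threshold=1.0, delimiter="_"):
    """Greedily join adjacent words whose score exceeds `threshold`."""
    len_vocab = len(unigrams)
    out, i = [], 0
    while i < len(sentence):
        if i + 1 < len(sentence):
            a, b = sentence[i], sentence[i + 1]
            count = bigrams[(a, b)]
            # A seen bigram implies both unigram counts are non-zero.
            if count > 0 and count >= min_count:
                score = (count - min_count) / unigrams[a] / unigrams[b] * len_vocab
                if score > threshold:
                    out.append(a + delimiter + b)
                    i += 2
                    continue
        out.append(sentence[i])
        i += 1
    return out


corpus = [
    ["new", "york", "is", "big"],
    ["i", "love", "new", "york"],
    ["new", "york", "never", "sleeps"],
    ["the", "city", "is", "big"],
]
unigrams, bigrams = learn_counts(corpus)
result = join_phrases(["we", "visited", "new", "york"], unigrams, bigrams)
print(result)  # ['we', 'visited', 'new_york']
```

A docstring example in this spirit shows the transformation from tokens to joined phrases, which printing the model object does not. Passing `delimiter=" "` to the toy helper yields `new york` instead, matching the earlier remark that the joining character can be changed.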
gensim/models/phrases.py
Outdated
Parameters | ||
---------- | ||
args : object | ||
Sequence of arguments, see :class:`~gensim.models.phrases.Phrases` for more information. |
Incorrect link; you should refer to the parent class (i.e. `SaveLoad.load`).
gensim/models/phrases.py
Outdated
>>> from gensim.models.phrases import Phrases | ||
>>> sentences = Text8Corpus(datapath('testcorpus.txt')) | ||
>>> learned = Phrases.learn_vocab(sentences,40000) | ||
>>> print learned |
What is this?
gensim/models/phrases.py
Outdated
>>> from gensim.models.phrases import Phrases, Phraser | ||
>>> sentences = Text8Corpus(datapath('testcorpus.txt')) | ||
>>> phrases_model = Phrases(sentences, min_count=5, threshold=100) | ||
>>> phraser_model = Phraser(phrases_model) |
And then what? How do you use these classes?
gensim/models/phrases.py
Outdated
>>> phrases_model = Phrases(sentences, min_count=5, threshold=100) | ||
>>> phraser_model = Phraser(phrases_model) | ||
>>> pseudo = phraser_model.pseudocorpus(phrases_model) | ||
//>>> phraser_model.score_item("tree","human",pseudo,'default') |
???
gensim/models/phrases.py
Outdated
>>> phraser_model = Phraser(phrases_model) | ||
>>> pseudo = phraser_model.pseudocorpus(phrases_model) | ||
//>>> phraser_model.score_item("tree","human",pseudo,'default') | ||
>>> phraser_model.score_item(u"tree",u"human",pseudo,'default') |
I'm not sure this function needs a demonstration; it isn't really needed.
gensim/models/phrases.py
Outdated
>>> sentences = Text8Corpus(datapath('testcorpus.txt')) | ||
>>> phrases_model = Phrases(sentences, min_count=5, threshold=100) | ||
>>> phraser_model = Phraser(phrases_model) | ||
>>> pseudo = phraser_model.pseudocorpus(phrases_model) |
Why is this needed?
gensim/models/phrases.py
Outdated
>>> phrases_model = Phrases(sentences, min_count=5, threshold=100) | ||
>>> phraser_model = Phraser(phrases_model) | ||
>>> pseudo = phraser_model.pseudocorpus(phrases_model) | ||
>>> phraser_model["tree", "human"] |
Incorrect, please use `phraser_model[["tree", "human"]]`.
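The difference between the two spellings is plain Python indexing semantics: `obj[a, b]` hands `__getitem__` a tuple, while `obj[[a, b]]` hands it a list. The sketch below uses a hypothetical `ShowKey` class to make that visible. It says nothing about how the phraser itself treats a tuple; presumably the double-bracket form is requested because the docstring describes the input as a list of tokens.

```python
class ShowKey(object):
    """Hypothetical helper that returns whatever key ``__getitem__`` receives."""

    def __getitem__(self, key):
        return key


probe = ShowKey()
as_tuple = probe["tree", "human"]    # key arrives as the tuple ("tree", "human")
as_list = probe[["tree", "human"]]   # key arrives as the list ["tree", "human"]
```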
Good work @CLearERR 👍
Minor code style fixes needed.
gensim/models/phrases.py
Outdated
@@ -933,12 +956,40 @@ def score_item(self, worda, wordb, components, scorer): | |||
>>> from gensim.models.word2vec import Text8Corpus | |||
>>> from gensim.models.phrases import Phrases, Phraser | |||
>>> sentences = Text8Corpus(datapath('testcorpus.txt')) | |||
>>> #train the detector with |
PEP8: `#` followed by one space (here and elsewhere).
gensim/models/phrases.py
Outdated
>>> #So we get 2 phrases | ||
>>> res = phraser_model[sent] | ||
>>> for phrase in res: | ||
>>> print phrase |
Better to use parentheses with `print`, for py3k compatibility.
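A minimal sketch of the py3k-compatible form, using a hypothetical `show_phrases` helper and made-up phrase strings. `print(phrase)` with a single argument behaves the same on Python 2 and Python 3, and doctest expects `...` rather than `>>>` on the continuation lines of the loop. The last two statements parse the docstring and run its examples, so the snippet checks itself.

```python
import doctest


def show_phrases(res):
    """Print each phrase on its own line.

    >>> res = ["graph_minors", "trees"]  # made-up phrases
    >>> for phrase in res:
    ...     print(phrase)
    graph_minors
    trees

    """
    for phrase in res:
        print(phrase)


# Parse the docstring and run its examples to confirm they pass as written.
parsed = doctest.DocTestParser().get_doctest(show_phrases.__doc__, {}, "show_phrases", None, 0)
results = doctest.DocTestRunner(verbose=False).run(parsed)
```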
@menshikh-iv