Runtime error in phrases.py #3031

Closed
thalishsajeed opened this issue Jan 20, 2021 · 21 comments · Fixed by #3041

@thalishsajeed
Contributor

thalishsajeed commented Jan 20, 2021

Problem description

Calling the export_phrases function on a Phrases model raises a RuntimeError instead of returning the phrases.

Steps/code/corpus to reproduce

from gensim.models.phrases import Phrases, ENGLISH_CONNECTOR_WORDS

documents = [
    "I am interested in machine learning projects",
    "machine learning projects can be useful sometimes",
    "I love working on machine learning projects",
    "interested does not mean the same thing as likes",
    "i am interested in blockchain",
]

sentence_stream = [doc.split(" ") for doc in documents]
bigrams = Phrases(sentence_stream, min_count=2, threshold=1, connector_words=ENGLISH_CONNECTOR_WORDS)
trigrams = Phrases(bigrams[sentence_stream], min_count=2, threshold=1)
trigrams.export_phrases()
RuntimeError                              Traceback (most recent call last)
<ipython-input-190-0f7e41471301> in <module>
----> 1 trigrams.export_phrases()

~\Anaconda3\lib\site-packages\gensim\models\phrases.py in export_phrases(self)
    716         """
    717         result, source_vocab = {}, self.vocab
--> 718         for token in source_vocab:
    719             unigrams = token.split(self.delimiter)
    720             if len(unigrams) < 2:

RuntimeError: dictionary changed size during iteration

Versions

Please provide the output of:

Windows-10-10.0.17763-SP0
Python 3.8.3 (default, Jul  2 2020, 17:30:36) [MSC v.1916 64 bit (AMD64)]
Bits 64
NumPy 1.19.5
SciPy 1.5.2
gensim 4.0.0beta
FAST_VERSION 0
@piskvorky piskvorky added the bug Issue described a bug label Jan 20, 2021
@piskvorky piskvorky added this to the 4.0.0 milestone Jan 20, 2021
@piskvorky
Owner

piskvorky commented Jan 20, 2021

OK, I can see the issue. Phrases employs defaultdict, which modifies its content on (unsuccessful) access: self.vocab[not_in_vocab] changes self.vocab.
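The mutation described here can be reproduced without gensim. The following is a minimal sketch using a toy dictionary as a stand-in for the model's vocab; the helper name `iterate_and_look_up` and the `"_x"` suffix are made up for illustration and are not part of phrases.py.

```python
from collections import defaultdict

# Toy stand-in for a token -> count vocabulary (not gensim's actual data).
vocab = defaultdict(int, {"machine": 3, "learning": 3})

# Merely reading a missing key inserts it with the default value 0.
_ = vocab["machine_learning"]
size_after_read = len(vocab)  # the dict grew from 2 to 3 entries


def iterate_and_look_up(counts):
    """Mimic a 'read-only' loop that looks up derived keys while iterating."""
    try:
        for token in counts:
            _ = counts[token + "_x"]  # missing key: silently inserted
    except RuntimeError as err:
        return str(err)
    return None


error_message = iterate_and_look_up(vocab)
print(size_after_read, error_message)
```

With a plain dict and `counts.get(token + "_x", 0)` in the loop, nothing is inserted and the iteration completes normally.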

The proper fix will be to get rid of this unwanted mutation – either by replacing vocab[x] by vocab.get(x, 0) in all const functions, or (better, easier to reason about) by replacing the entire defaultdict by a plain dict.

@thalishsajeed are you up for it?

@thalishsajeed
Contributor Author

thalishsajeed commented Jan 20, 2021

@piskvorky Yep, I vote for replacing defaultdict. If you agree with that, I'll make a PR with the change. (after checking for any other unwanted issues)

@piskvorky
Owner

piskvorky commented Jan 20, 2021

Yes, please go ahead.

You can re-use #3030 , no need to start a new PR.

@thalishsajeed
Contributor Author

thalishsajeed commented Jan 21, 2021

@piskvorky Is it okay to replace instances of the following code pattern

self.vocab[word] += 1

with

self.vocab[word] = self.vocab.get(word, 0) + 1

I don't know of any way to do updates in-place in a regular dict.
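As a rough illustration of the replacement pattern, here is a minimal sketch of counting into a plain dict. The function name `count_tokens` and the sample sentences are assumptions made for this example; this is not the actual gensim code.

```python
def count_tokens(sentences):
    """Count token occurrences using a plain dict instead of defaultdict."""
    vocab = {}
    for sentence in sentences:
        for word in sentence:
            # .get supplies the default without inserting a missing key.
            vocab[word] = vocab.get(word, 0) + 1
    return vocab


counts = count_tokens([["machine", "learning"], ["machine", "vision"]])

# A read-only lookup of an unseen token leaves the dict untouched.
missing = counts.get("blockchain", 0)
print(counts, missing)
```

The write path still creates keys, as it should, while read paths that use `.get(key, 0)` can no longer change the dictionary.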

There's also this line, which needs to be modified for similar reasons, while making sure the generator isn't called twice if the .get method is used.

https://github.com/RaRe-Technologies/gensim/blob/a21d9cc768598640f38e4bd03d368f8712a9aa77/gensim/models/phrases.py#L596

Still comfortable moving to dict, right? Also, I read some SO posts saying that defaultdict is more performant than dict. I'm wondering whether that is still the case and needs to be considered for this change.

@piskvorky
Owner

piskvorky commented Jan 21, 2021

Yes, that's the way to do it!

And yes, I'd expect defaultdict to be (slightly) more performant than dict. But having correct, readable code is more important.

If we ever optimize Phrases, it'll be by translating its code to Cython / C, for a proper 10x+ boost. Not chasing a few percent here and there with defaultdict.
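For anyone who wants to check the size of the difference on their own machine, a micro-benchmark along these lines could be used. It is only a sketch: the word list and repeat count are arbitrary choices, and the timings will vary with Python version and hardware.

```python
import timeit
from collections import defaultdict

# Arbitrary toy workload; not representative of a real corpus.
WORDS = ["machine", "learning", "projects", "interested"] * 1000


def count_with_defaultdict():
    vocab = defaultdict(int)
    for word in WORDS:
        vocab[word] += 1
    return vocab


def count_with_dict():
    vocab = {}
    for word in WORDS:
        vocab[word] = vocab.get(word, 0) + 1
    return vocab


defaultdict_seconds = timeit.timeit(count_with_defaultdict, number=50)
dict_seconds = timeit.timeit(count_with_dict, number=50)
print(f"defaultdict: {defaultdict_seconds:.4f}s, dict: {dict_seconds:.4f}s")
```

Both functions produce the same counts, so any difference is purely in speed.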

@thalishsajeed
Contributor Author

thalishsajeed commented Jan 21, 2021

@piskvorky Thanks! Well, I'm done with the changes, can you point me to how I can run the unit test suite locally?

@piskvorky
Owner

piskvorky commented Jan 21, 2021

Hmm. @mpenkov will python setup.py test work, for local testing? I don't see that on the Developer page.

@mpenkov
Collaborator

mpenkov commented Jan 22, 2021

I usually do pytest gensim.

You may need to do something like pip install -e .[test] to get the dependencies installed first.

@thalishsajeed
Contributor Author

thalishsajeed commented Jan 22, 2021

@mpenkov Thanks! Do you have anything for me where I can read regarding the current tests which are passing?

Also, I need some help with writing test cases. Why does the test case called testExportPhrases call a different function, find_phrases, inside? Do I need to create a separate test case, or is there an existing test case for export_phrases that I need to fix?

https://github.com/RaRe-Technologies/gensim/blob/a21d9cc768598640f38e4bd03d368f8712a9aa77/gensim/test/test_phrases.py#L215

@piskvorky
Owner

piskvorky commented Jan 22, 2021

Yes, add a new test case please.

I don't know why a test called testExportPhrases doesn't test exporting phrases, that's really weird! Can you maybe rename that (to testFindPhrases?), and call your new test testExportPhrases?

@mpenkov
Collaborator

mpenkov commented Jan 22, 2021

Thanks! Do you have anything for me where I can read regarding the current tests which are passing?

Other than the developer wiki, which @piskvorky mentioned above, no, I'm not aware of any unit-test related documentation.

Looks like export_phrases got renamed to find_phrases, but the test case wasn't renamed. Renaming the test case should resolve the problem.

@piskvorky piskvorky self-assigned this Jan 31, 2021
@piskvorky
Owner

@thalishsajeed did you manage?

@thalishsajeed
Contributor Author

@thalishsajeed did you manage?

Hi yes, I'll work on this tomorrow and update you :)

@thalishsajeed
Contributor Author

@piskvorky: So, just a quick update. The current implementation of analyze_sentence seems to modify the vocabulary: unseen tokens get added with a count of zero, naturally, because of defaultdict.

So doing something like -

bigrams.analyze_sentence(["dis", "is", "good", "machine", "interested"])

This will modify the vocabulary, so that bigrams.vocab ends up looking like this -

{..., 'dis': 0, 'is': 0, 'good': 0, 'machine_interested': 0}

This then seems to affect bigrams.find_phrases: you end up with different scores every time you run it after calling bigrams.analyze_sentence.

I guess my question is - is this expected behavior?
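One way the zero-count entries could change scores is through the vocabulary size, if the scorer multiplies by it. The sketch below assumes a scorer of the form (bigram_count - min_count) / (count_a * count_b) * len(vocab), which is modelled on gensim's default scorer but is a simplified, hypothetical stand-in. All names in it (`toy_score`, `leaky_lookup`, `safe_lookup`) are invented for this example.

```python
from collections import defaultdict


def leaky_lookup(vocab, key):
    """Indexing: on a defaultdict this inserts missing keys with count 0."""
    return vocab[key]


def safe_lookup(vocab, key):
    """.get never inserts, so the vocabulary stays unchanged."""
    return vocab.get(key, 0)


def toy_score(vocab, word_a, word_b, count_of, min_count=1):
    """Hypothetical scorer that depends on the vocabulary size."""
    count_a = count_of(vocab, word_a)
    count_b = count_of(vocab, word_b)
    bigram = count_of(vocab, f"{word_a}_{word_b}")
    if count_a == 0 or count_b == 0:
        return float("-inf")
    return (bigram - min_count) / (count_a * count_b) * len(vocab)


trained = {"machine": 3, "learning": 3, "machine_learning": 3}

# defaultdict: scoring an unseen pair grows the vocabulary as a side effect.
leaky = defaultdict(int, trained)
leaky_before = toy_score(leaky, "machine", "learning", leaky_lookup)
toy_score(leaky, "dis", "is", leaky_lookup)  # adds 'dis', 'is', 'dis_is'
leaky_after = toy_score(leaky, "machine", "learning", leaky_lookup)

# plain dict with .get: the same sequence of calls leaves scores stable.
plain = dict(trained)
safe_before = toy_score(plain, "machine", "learning", safe_lookup)
toy_score(plain, "dis", "is", safe_lookup)
safe_after = toy_score(plain, "machine", "learning", safe_lookup)

print(leaky_before, leaky_after, safe_before, safe_after)
```

In this toy setup the score for the same pair doubles after the stray lookups, because the vocabulary grew from 3 to 6 entries.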

P.S. I hope I was able to explain the issue; otherwise, let me know if you want a more detailed write-up of this behaviour.

@piskvorky
Owner

piskvorky commented Feb 3, 2021

Yes, not desirable, as discussed above. Can you replace the defaultdict with normal dict?

@thalishsajeed
Contributor Author

@piskvorky Yes, I've done that bit. I stumbled upon this while trying to create a test case. My test cases kept failing because the phrase scores did not match when I compared the current branch to the bugfix branch, and it took some time to figure out what was happening.

@piskvorky
Owner

Yes, I've done that bit

Wait, this still happens after getting rid of defaultdict? Something else must be afoot then.

@thalishsajeed
Contributor Author

thalishsajeed commented Feb 3, 2021

Wait, this still happens after getting rid of defaultdict? Something else must be afoot then.

No, no. Let me explain. After getting rid of defaultdict, I tried to make sure everything else works exactly the same. That's when I stumbled upon this phenomenon, where the scores for phrases were different between the develop branch and the bugfix branch. Naturally, I assumed that I had made some mistake while getting rid of defaultdict and that this was why the scores were different. After diving a little deeper in debug mode, I realized the root cause, i.e. the vocab being changed when calling analyze_sentence in the develop branch. So I just wanted to be sure that I'm not messing with some expected behavior.

@piskvorky
Owner

Not at all, you found a nasty hidden bug – well done!

@thalishsajeed
Contributor Author

@piskvorky Hi, what is the workflow for closing this issue?

@piskvorky
Owner

The issue will be automatically closed once its corresponding PR gets merged.

mpenkov added a commit that referenced this issue Feb 13, 2021
* fix typo

* fix test cases for test_export_phrases

* add test cases for test_find_phrases

* Fix #3031 Runtime error in phrases.py

* remove unused variable reference

* fix newline to end of file

* fix formattingpy

* Update CHANGELOG.md

* Update CHANGELOG.md

Co-authored-by: Michael Penkov <[email protected]>