Runtime error in phrases.py #3031

thalishsajeed · 2021-01-20T07:11:54Z

Problem description

Calling export_phrases() on a Phrases model raises a RuntimeError instead of returning the phrases.

Steps/code/corpus to reproduce

from gensim.models.phrases import Phrases, ENGLISH_CONNECTOR_WORDS

documents = ["I am interested in machine learning projects",
             "machine learning projects can be useful sometimes",
             "I love working on machine learning projects",
             "interested does not mean the same thing as likes",
             "i am interested in blockchain"]

sentence_stream = [doc.split(" ") for doc in documents]
bigrams = Phrases(sentence_stream, min_count=2, threshold=1, connector_words=ENGLISH_CONNECTOR_WORDS)
trigrams = Phrases(bigrams[sentence_stream], min_count=2, threshold=1)
trigrams.export_phrases()

RuntimeError                              Traceback (most recent call last)
<ipython-input-190-0f7e41471301> in <module>
----> 1 trigrams.export_phrases()

~\Anaconda3\lib\site-packages\gensim\models\phrases.py in export_phrases(self)
    716         """
    717         result, source_vocab = {}, self.vocab
--> 718         for token in source_vocab:
    719             unigrams = token.split(self.delimiter)
    720             if len(unigrams) < 2:

RuntimeError: dictionary changed size during iteration

Versions

Please provide the output of:

Windows-10-10.0.17763-SP0
Python 3.8.3 (default, Jul  2 2020, 17:30:36) [MSC v.1916 64 bit (AMD64)]
Bits 64
NumPy 1.19.5
SciPy 1.5.2
gensim 4.0.0beta
FAST_VERSION 0


piskvorky · 2021-01-20T09:28:06Z

OK, I can see the issue. Phrases employs defaultdict, which modifies its content on (unsuccessful) access: self.vocab[not_in_vocab] changes self.vocab.

The proper fix will be to get rid of this unwanted mutation – either by replacing vocab[x] by vocab.get(x, 0) in all const functions, or (better, easier to reason about) by replacing the entire defaultdict by a plain dict.
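A minimal sketch of the failure mode described above, using only the standard library. It is not gensim's actual code; the `scan` helper and the `_absent` suffix are made up for illustration. It shows that indexing a defaultdict with a missing key inserts that key, which breaks iteration, while `.get()` only reads.

```python
from collections import defaultdict


def scan(vocab, lookup):
    """Iterate over vocab while looking up an absent key; return the error message, if any."""
    try:
        for token in vocab:
            lookup(vocab, token + "_absent")
    except RuntimeError as err:
        return str(err)
    return None


counts = {"machine": 3, "learning": 3}

# Indexing a defaultdict inserts the missing key, so the dict grows mid-iteration.
mutating = scan(defaultdict(int, counts), lambda d, key: d[key])

# .get() only reads, so the iteration runs to completion.
read_only = scan(defaultdict(int, counts), lambda d, key: d.get(key, 0))

print(mutating)
print(read_only)
```

A plain dict would raise KeyError on `d[key]` instead of silently growing, which is part of why it is easier to reason about.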

@thalishsajeed are you up for it?

thalishsajeed · 2021-01-20T10:29:14Z

@piskvorky Yep, I vote for replacing defaultdict. If you agree with that, I'll make a PR with the change. (after checking for any other unwanted issues)

piskvorky · 2021-01-20T11:41:12Z

Yes, please go ahead.

You can re-use #3030 , no need to start a new PR.

thalishsajeed · 2021-01-21T18:29:00Z

@piskvorky Is it okay to replace instances of the following code pattern

self.vocab[word] += 1

with

self.vocab[word] = self.vocab.get(word, 0) + 1

I don't know of any way to do updates in-place in a regular dict.
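For reference, a small self-contained sketch of that counting pattern on a plain dict. The `count_tokens` function is hypothetical and only illustrates the idiom; it is not the code in phrases.py.

```python
def count_tokens(sentences):
    """Count token occurrences using a plain dict (no defaultdict)."""
    vocab = {}
    for sentence in sentences:
        for word in sentence:
            # Read with a default of 0, then write back the incremented count.
            vocab[word] = vocab.get(word, 0) + 1
    return vocab


vocab = count_tokens([["machine", "learning"], ["machine"]])
print(vocab)
```

Only the explicit assignment writes to the dict here, so a later read-only lookup of an unseen word cannot add a zero-count entry.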

There's also this line, which needs to be modified for similar reasons, while making sure the generator isn't called twice if we use the .get method:

https://github.com/RaRe-Technologies/gensim/blob/a21d9cc768598640f38e4bd03d368f8712a9aa77/gensim/models/phrases.py#L596

Are you still comfortable moving to dict? I also read some SO posts saying that defaultdict is more performant than dict; I'm wondering if that is still the case and whether it needs to be considered for this change.

piskvorky · 2021-01-21T18:42:39Z

Yes, that's the way to do it!

And yes, I'd expect defaultdict to be (slightly) more performant than dict. But having correct, readable code is more important.

If we ever optimize Phrases, it'll be by translating its code to Cython / C, for a proper 10x+ boost. Not chasing a few percent here and there with defaultdict.

thalishsajeed · 2021-01-21T20:37:56Z

@piskvorky Thanks! Well, I'm done with the changes, can you point me to how I can run the unit test suite locally?

piskvorky · 2021-01-21T20:54:50Z

Hmm. @mpenkov will python setup.py test work, for local testing? I don't see that on the Developer page.

mpenkov · 2021-01-22T00:29:16Z

I usually do pytest gensim.

You may need to do something like pip install -e .[test] to get the dependencies installed first.

thalishsajeed · 2021-01-22T19:59:09Z

@mpenkov Thanks! Do you have anything for me where I can read regarding the current tests which are passing?

Also, I need some help with writing test cases. Why does the test case called testExportPhrases call a different function, find_phrases, inside? Do I need to create a separate test case, or is there an existing test case for export_phrases that I need to fix?

https://github.com/RaRe-Technologies/gensim/blob/a21d9cc768598640f38e4bd03d368f8712a9aa77/gensim/test/test_phrases.py#L215

piskvorky · 2021-01-22T21:01:20Z

Yes, add a new test case please.

I don't know why a test called testExportPhrases doesn't test exporting phrases, that's really weird! Can you maybe rename that (to testFindPhrases?), and call your new test testExportPhrases?

mpenkov · 2021-01-22T22:41:20Z

> Thanks! Do you have anything for me where I can read regarding the current tests which are passing?

Other than the developer wiki, which @piskvorky mentioned above, no, I'm not aware of any unit-test related documentation.

Looks like export_phrases got renamed to find_phrases, but the test case wasn't renamed. Renaming the test case should resolve the problem.

piskvorky · 2021-01-31T17:35:20Z

@thalishsajeed did you manage?

thalishsajeed · 2021-01-31T17:36:51Z

> @thalishsajeed did you manage?

Hi yes, I'll work on this tomorrow and update you :)

thalishsajeed · 2021-02-03T12:00:32Z

@piskvorky: Just a quick update. The current implementation of analyze_sentence seems to modify the vocabulary: looking up an unseen token inserts it with a count of zero, because vocab is a defaultdict.

So calling something like:

bigrams.analyze_sentence(["dis", "is", "good", "machine", "interested"])

will modify the vocabulary, so that bigrams.vocab ends up looking like this:

{'dis': 0, 'is': 0, 'good': 0, 'machine_interested': 0}

This then seems to affect bigrams.find_phrases: you end up with different scores every time you run it after calling bigrams.analyze_sentence.
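A rough sketch of why zero-count entries can shift the scores, assuming a scorer that scales with the vocabulary size. The `toy_score` function below is a hypothetical stand-in written for illustration, not gensim's scorer, and the counts are made up.

```python
from collections import defaultdict


def toy_score(vocab, word_a, word_b, min_count=1):
    """Illustrative bigram score that grows with len(vocab)."""
    bigram_count = vocab.get(word_a + "_" + word_b, 0)
    return (bigram_count - min_count) / vocab[word_a] / vocab[word_b] * len(vocab)


counts = {"machine": 3, "learning": 3, "machine_learning": 3}
unseen = ["dis", "is", "good"]

# defaultdict: reading unseen tokens inserts them with a count of 0.
leaky = defaultdict(int, counts)
before = toy_score(leaky, "machine", "learning")
for token in unseen:
    leaky[token]
after = toy_score(leaky, "machine", "learning")

# Plain dict with .get(): the same reads leave the vocabulary untouched.
plain = dict(counts)
for token in unseen:
    plain.get(token, 0)
stable = toy_score(plain, "machine", "learning")

print(before, after, stable)
```

Under this assumption, the score for the same bigram changes purely because the vocabulary grew, which matches the symptom of scores differing between runs.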

I guess my question is - is this expected behavior?

P.S. I hope I was able to explain the issue; otherwise, let me know if you want a more detailed write-up of this behaviour.

piskvorky · 2021-02-03T12:48:15Z

Yes, not desirable, as discussed above. Can you replace the defaultdict with normal dict?

thalishsajeed · 2021-02-03T12:57:23Z

@piskvorky Yes, I've done that bit. I stumbled upon this while trying to create a test case. My test cases kept failing because the phrase scores did not match when I compared the current branch to the BugFix branch, and it took some time to figure out what was happening.

piskvorky · 2021-02-03T13:22:48Z

> Yes, I've done that bit

Wait, this still happens after getting rid of defaultdict? Something else must be afoot then.

thalishsajeed · 2021-02-03T13:35:21Z

> Wait, this still happens after getting rid of defaultdict? Something else must be afoot then.

No, no. Let me explain. After getting rid of defaultdict, I tried to make sure everything else works exactly the same. That's when I noticed that the phrase scores differed between the develop branch and the bugfix branch. Naturally, I assumed I had made a mistake while getting rid of defaultdict and that this was why the scores differed. After digging deeper in debug mode, I realized the root cause: the vocab is changed when calling analyze_sentence in the develop branch. I just wanted to be sure that I'm not messing with some expected behavior.

piskvorky · 2021-02-03T13:38:36Z

Not at all. You found a nasty hidden bug, well done!

thalishsajeed · 2021-02-08T15:49:16Z

@piskvorky Hi, what is the workflow for closing this issue?

piskvorky · 2021-02-08T16:38:45Z

The issue will be automatically closed once its corresponding PR gets merged.

* fix typo
* fix test cases for test_export_phrases
* add test cases for test_find_phrases
* Fix #3031 Runtime error in phrases.py
* remove unused variable reference
* fix newline to end of file
* fix formattingpy
* Update CHANGELOG.md
* Update CHANGELOG.md

Co-authored-by: Michael Penkov

piskvorky added the bug Issue described a bug label Jan 20, 2021

piskvorky added this to the 4.0.0 milestone Jan 20, 2021

piskvorky mentioned this issue Jan 20, 2021

[WIP] Fix "dictionary changed size during iteration" in Phrases #3030

Closed

piskvorky self-assigned this Jan 31, 2021

thalishsajeed added a commit to thalishsajeed/gensim that referenced this issue Feb 7, 2021

Fix piskvorky#3031 Runtime error in phrases.py

856692c

thalishsajeed mentioned this issue Feb 7, 2021

Fix RuntimeError in export_phrases (change defaultdict to dict) #3041

Merged

thalishsajeed closed this as completed Feb 7, 2021

thalishsajeed reopened this Feb 7, 2021

mpenkov closed this as completed in #3041 Feb 13, 2021
