Hi, thank you for all your work on this amazing library!
I'm running into a strange issue with reproducibility: even after setting the model seed, I'm still sometimes getting different LDA results with the same documents (a processed subset of the BBC news dataset).
My code is very simple -- it reads from a text file, where each line represents a single document with space-separated tokens, and trains an LDAModel over the data. I've turned off parallel processing to prevent any randomness from coming in there as well.
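For reference, the training script boils down to something like this (a minimal sketch using tomotopy's LDAModel; the topic count, iteration schedule, and seed are placeholder values, not the exact ones from my project):

    import tomotopy as tp

    # Fixed model seed; a single worker rules out thread-related nondeterminism.
    mdl = tp.LDAModel(k=20, seed=42)

    # One document per line, tokens separated by spaces.
    with open('docs.txt', encoding='utf-8') as f:
        for line in f:
            tokens = line.strip().split()
            if tokens:
                mdl.add_doc(tokens)

    # Train in chunks so per-iteration stats can be printed.
    for i in range(0, 1000, 100):
        mdl.train(100, workers=1)
        print(f'Iteration: {i + 100}\tLog-likelihood: {mdl.ll_per_word:.5f}')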
When I run the code, I usually get the following output:

But about 30% of the time I get the following output instead, where the stats seem to diverge at iteration 300:

The results seem to switch randomly between these two possibilities (I haven't seen any other variations turn up), but I just can't figure out where the indeterminacy is coming from. Would appreciate any advice or help you could provide!

Attached:
docs.txt
Discovered that PYTHONHASHSEED seems to be affecting the results: invoking the script as

    PYTHONHASHSEED=429467291 python lda.py

always gives the first set of results, and invoking it as

    PYTHONHASHSEED=429467292 python lda.py

always gives the second set of results.
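For anyone else chasing this, a quick standard-library check (just a sketch, nothing specific to the library) confirms what hash behaviour a given run actually has:

    import os, sys

    # hash('...') varies between runs when PYTHONHASHSEED is unset, and is
    # stable for any fixed PYTHONHASHSEED value; setting it to 0 disables
    # randomization entirely, which is what sys.flags.hash_randomization reports.
    print('PYTHONHASHSEED =', os.environ.get('PYTHONHASHSEED', '(unset)'))
    print('sys.flags.hash_randomization =', sys.flags.hash_randomization)
    print("hash('token') =", hash('token'))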
I wonder if it would be possible to have the algorithm give stable results across different hash seeds?
Update: After a long, fruitless wild goose chase trying to track down the source of the indeterminacy, something seems to have fixed it, and I now only get the second set of results no matter what PYTHONHASHSEED is set to.
Chalk it up to the oddest of heisenbugs. I'll be setting PYTHONHASHSEED for my project from now on to be safe, but I don't think I can reproduce this anymore.
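For the record, PYTHONHASHSEED has to be in the environment before the interpreter starts, so assigning os.environ at the top of lda.py isn't enough on its own; the workaround I'm using (a sketch only, and the seed value is arbitrary) is to have the script re-exec itself once with the variable pinned:

    import os, sys

    # The hash seed is fixed at interpreter startup, so after setting the
    # variable we replace this process with a fresh interpreter that sees it.
    if os.environ.get('PYTHONHASHSEED') != '429467292':
        os.environ['PYTHONHASHSEED'] = '429467292'
        os.execv(sys.executable, [sys.executable] + sys.argv)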
As far as I can tell, it was clearing the Windows 10 Prefetch cache that did it, so on the off chance that someone else on Windows runs into the same kind of behaviour, that's something that might help!