Reproducibility issues even after setting model seed #60

Open
ZechyW opened this issue Jun 10, 2020 · 3 comments
Labels
bug Something isn't working

Comments


ZechyW commented Jun 10, 2020

Hi, thank you for all your work on this amazing library!

I'm running into a strange issue with reproducibility: even after setting the model seed, I sometimes get different LDA results with the same documents (a processed subset of the BBC news dataset).

My code is very simple -- it reads from a text file, where each line represents a single document with space-separated tokens, and trains an LDAModel on the data. I've turned off parallel processing to rule out any randomness from that source as well.

import tomotopy as tp

# One document per line, tokens separated by spaces
with open("docs.txt", "r", encoding="utf8") as fp:
    model = tp.LDAModel(k=5, seed=123456789)
    for line in fp:
        model.add_doc(line.split())

# Train in 100-iteration chunks, printing the per-word log-likelihood
for i in range(0, 1000, 100):
    model.train(100, workers=1, parallel=tp.ParallelScheme.NONE)
    print(f"Iteration: {i + 100} LL: {model.ll_per_word:.5f}")

When I run the code, I usually get the following output:

Iteration: 100 LL: -7.94113
Iteration: 200 LL: -7.90128
Iteration: 300 LL: -7.88406
Iteration: 400 LL: -7.86940
Iteration: 500 LL: -7.85939
Iteration: 600 LL: -7.84511
Iteration: 700 LL: -7.84116
Iteration: 800 LL: -7.83339
Iteration: 900 LL: -7.83029
Iteration: 1000 LL: -7.82927

But about 30% of the time I get the following output instead, where the log-likelihoods diverge at iteration 300:

Iteration: 100 LL: -7.94113
Iteration: 200 LL: -7.90128
Iteration: 300 LL: -7.88715
Iteration: 400 LL: -7.87158
Iteration: 500 LL: -7.86242
Iteration: 600 LL: -7.84669
Iteration: 700 LL: -7.84028
Iteration: 800 LL: -7.82794
Iteration: 900 LL: -7.82512
Iteration: 1000 LL: -7.82317

The results switch randomly between these two possibilities (I haven't seen any other variations turn up), but I just can't figure out where the indeterminacy is coming from. I'd appreciate any advice or help you could provide!

Attached:
docs.txt


ZechyW commented Jun 10, 2020

Discovered that PYTHONHASHSEED seems to be affecting the results.

Invoking the script as

PYTHONHASHSEED=429467291 python lda.py

always gives the first set of results, and invoking it as

PYTHONHASHSEED=429467292 python lda.py

always gives the second set of results.

I wonder if it would be possible to have the algorithm give stable results across different hash seeds?
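
In the meantime, a possible workaround from inside the script itself -- a minimal sketch, not from this thread, assuming a standard CPython interpreter: since PYTHONHASHSEED is only read at interpreter startup, the script can re-execute itself with the variable pinned before tomotopy is imported.

import os
import sys

# PYTHONHASHSEED is read once at interpreter startup, so setting it here and
# re-executing pins str hashing for the rest of the run. The seed value is
# arbitrary; 429467291 is just the one from the commands above.
if os.environ.get("PYTHONHASHSEED") != "429467291":
    os.environ["PYTHONHASHSEED"] = "429467291"
    os.execv(sys.executable, [sys.executable] + sys.argv)

import tomotopy as tp  # imported only after the hash seed is pinned

(On Windows, os.execv spawns a replacement process rather than overlaying the current one, but the new process inherits the environment either way.)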


bab2min commented Jun 12, 2020

Thank you for reporting a potential bug.
I'll examine your code and data and figure out why PYTHONHASHSEED affects the results.


ZechyW commented Jun 15, 2020

Update: After a long, fruitless wild-goose chase trying to track down the source of the indeterminacy, something seems to have fixed it, and I now get only the second set of results no matter what PYTHONHASHSEED is set to.
Chalk it up to the oddest of heisenbugs; I'll be setting PYTHONHASHSEED for my project from now on to be safe, but I don't think I can reproduce this anymore.

As far as I can tell, it was clearing the Windows 10 Prefetch cache that did it, so on the off chance that someone else on Windows runs into the same kind of behaviour, that's something that might help!
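
For anyone wanting a quick sanity check, here's a minimal sketch (the train_ll helper is hypothetical, just the script from above wrapped in a function) that verifies two identically seeded runs in the same process produce the same result:

import tomotopy as tp

def train_ll(path, seed=123456789, iters=1000):
    # Build and train one model, returning its final per-word log-likelihood
    model = tp.LDAModel(k=5, seed=seed)
    with open(path, "r", encoding="utf8") as fp:
        for line in fp:
            model.add_doc(line.split())
    model.train(iters, workers=1, parallel=tp.ParallelScheme.NONE)
    return model.ll_per_word

# Two identically seeded, single-threaded runs should match exactly
assert train_ll("docs.txt") == train_ll("docs.txt")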

bab2min added the bug label Jul 14, 2020