Hi, thank you for all your work on this amazing library!
I'm running into a strange issue with reproducibility: even after setting the model seed, I'm still sometimes getting different LDA results with the same documents (a processed subset of the BBC news dataset).
My code is very simple -- it reads from a text file, where each line represents a single document with space-separated tokens, and trains an LDAModel over the data. I've turned off parallel processing to prevent any randomness from coming in there as well.
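For reference, the training script boils down to something like this (a minimal sketch using tomotopy's LDAModel; the topic count, iteration schedule, and seed are placeholder values, not the exact ones from my project):

    import tomotopy as tp

    # Fixed model seed; a single worker rules out thread-related nondeterminism.
    mdl = tp.LDAModel(k=20, seed=42)

    # One document per line, tokens separated by spaces.
    with open('docs.txt', encoding='utf-8') as f:
        for line in f:
            tokens = line.strip().split()
            if tokens:
                mdl.add_doc(tokens)

    # Train in chunks so per-iteration stats can be printed.
    for i in range(0, 1000, 100):
        mdl.train(100, workers=1)
        print(f'Iteration: {i + 100}\tLog-likelihood: {mdl.ll_per_word:.5f}')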
When I run the code, I usually get the following output:

But about 30% of the time I get the following output instead, where the stats seem to diverge at iteration 300:

The results seem to switch randomly between these two possibilities (I haven't seen any other variations turn up), but I just can't figure out where the indeterminacy is coming from. Would appreciate any advice or help you could provide!

Attached:
docs.txt
Discovered that PYTHONHASHSEED seems to be affecting the results: invoking the script as

    PYTHONHASHSEED=429467291 python lda.py

always gives the first set of results, and invoking it as

    PYTHONHASHSEED=429467292 python lda.py

always gives the second set of results.
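For anyone else chasing this, a quick standard-library check (just a sketch, nothing specific to the library) confirms what hash behaviour a given run actually has:

    import os, sys

    # hash('...') varies between runs when PYTHONHASHSEED is unset, and is
    # stable for any fixed PYTHONHASHSEED value; setting it to 0 disables
    # randomization entirely, which is what sys.flags.hash_randomization reports.
    print('PYTHONHASHSEED =', os.environ.get('PYTHONHASHSEED', '(unset)'))
    print('sys.flags.hash_randomization =', sys.flags.hash_randomization)
    print("hash('token') =", hash('token'))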
I wonder if it would be possible to have the algorithm give stable results across different hash seeds?
Update: After a long, fruitless wild goose chase trying to track down the source of the indeterminacy, something seems to have fixed it, and I now only get the second set of results no matter what PYTHONHASHSEED is set to.
Chalk it up to the oddest of heisenbugs. I'll be setting PYTHONHASHSEED for my project from now on to be safe, but I don't think I can reproduce this anymore.
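For the record, PYTHONHASHSEED has to be in the environment before the interpreter starts, so assigning os.environ at the top of lda.py isn't enough on its own; the workaround I'm using (a sketch only, and the seed value is arbitrary) is to have the script re-exec itself once with the variable pinned:

    import os, sys

    # The hash seed is fixed at interpreter startup, so after setting the
    # variable we replace this process with a fresh interpreter that sees it.
    if os.environ.get('PYTHONHASHSEED') != '429467292':
        os.environ['PYTHONHASHSEED'] = '429467292'
        os.execv(sys.executable, [sys.executable] + sys.argv)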
As far as I can tell, it was clearing the Windows 10 Prefetch cache that did it, so on the off chance that someone else on Windows runs into the same kind of behaviour, that's something that might help!