intermittent failure: Test CBOW w/ hierarchical softmax - #531
An odd thing here is that the tolerances were picked such that I saw no failures in 100-200 local test runs... but sometimes (and almost in a bursty fashion?!) the auto-testing systems seem far more prone to outlier results. Especially given a Windows build that requires many tries to pass, compared to Linux builds that usually pass, this might indicate a true training degradation on that platform, requiring a fix other than testing adjustments. Have you noticed if it's always this exact test (test_cbow_hs) that fails?

More generally, I see the problem as: we're attempting quick tests, on a necessarily-thin amount of bundled data, of algorithms that have some inherent variance. We may need to seed the tests to achieve perfect run-for-run reproducibility. At least then, many true bugs will break a test that previously passed predictably. Simply expanding the tolerances on a still-random process can make the failures less frequent, but might not eliminate them entirely. Then, on the rare occasions test failures still happen, we'd be unsure whether the code really changed in some tangible way or it was just a very unlucky outlier run. And making the tolerances so generous that we "never" (practically) see a failure could also hide many other kinds of calculation-degradation bugs we'd usually want tests to catch.
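For concreteness, a minimal sketch of what seeding such a test could look like with gensim's Word2Vec; the tiny corpus and the exact parameter values here are placeholder assumptions, not the actual test setup:

```python
# Sketch only: one way to make a Word2Vec training run reproducible.
# The corpus below is a placeholder; the real test uses gensim's bundled data.
from gensim.models import Word2Vec

sentences = [["human", "interface", "computer"],
             ["graph", "trees", "minors", "survey"]] * 100

# sg=0 selects CBOW; hs=1 with negative=0 selects hierarchical softmax.
# A fixed seed alone is not enough for run-for-run reproducibility:
# with workers > 1, thread scheduling still reorders updates, and on
# Python 3 hash-based ordering also depends on PYTHONHASHSEED.
model = Word2Vec(sentences, sg=0, hs=1, negative=0,
                 min_count=1, seed=42, workers=1)
```

Fixing the seed trades statistical coverage for predictability: the same run then passes or fails deterministically, which is exactly the property debated in the next comment.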
I'm -1 on seeding tests with an exact RNG. If the inherent variance is such that the range of accepted values must become effectively meaningless in order for the test to pass, then the test is not very useful in the first place (it only checks syntax). So let's try to come up with other solutions. Would more data help? Different parameters (iterations)?
@gojomo Do you have ideas one could try to fix this? More data or different params?
PR #581 has a tuned version of this test (test_cbow_hs).

I suspect the trigger has been thread-scheduling that introduces far more randomness on the build machines than in my local tests. (That may be further aggravated by how large the jobs are compared to the small unit-test dataset, and maybe even further aggravated by the race issue that was in #571.)

The parameters before this check-in had been chosen after no failures in (well over) 200 runs. And, testing them again, they ran thousands of times without a failure on my OSX test machine, in both Py2.7 and Py3.4. But I did notice a slightly higher spread of (passing) values on Py3.4, probably due to some differences in thread-scheduling and CPU time-slicing. And by forcing far more randomness into my local tests (explicitly seeding with a random number), I could force failure rates much like those seen on the CI machines... until the adjusted parameters in this commit, which have executed 1500 times on both Py27 and Py34 without failure. The CI machines may yet hold surprises; we'll see.

There are also still some mysteries in this particular test, in my local runs. In a number of settings I tried, it seemed like increasing the number of iterations could make the expected results (related words close to each other, tests passing) less likely: 10 iterations was doing well, 30+ iterations was doing awful. That's suspicious and non-intuitive enough that it might be indicative of something wrong with this exact training mode, but I'm stumped as to what it could be.
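To make the iterations observation concrete, here is a rough sketch of the kind of local stress loop described above, re-training with random seeds at different iteration counts and counting rank-check failures. The toy corpus and word pair are assumptions for illustration, and it uses the 0.12-era `iter`/`most_similar` API rather than the actual model_sanity() code:

```python
# Sketch of a local stress loop (assumed toy corpus and word pair;
# the real test runs model_sanity() on gensim's bundled corpus).
import random
from gensim.models import Word2Vec

sentences = [["war", "terrorism", "attack"],
             ["graph", "trees", "minors", "survey"]] * 200

for n_iter in (10, 30):
    failures = 0
    runs = 100
    for _ in range(runs):
        model = Word2Vec(sentences, sg=0, hs=1, negative=0, min_count=1,
                         iter=n_iter, seed=random.randint(0, 2 ** 31 - 1))
        # Rank of "terrorism" among words most similar to "war";
        # the real sanity check asserts this rank stays below 50.
        sims = model.most_similar("war", topn=len(model.vocab))
        rank = [w for w, _ in sims].index("terrorism")
        if rank >= 50:
            failures += 1
    print("iter=%d: %d/%d failures" % (n_iter, failures, runs))
```

On a toy corpus this mostly just demonstrates the harness; the 10-vs-30-iteration discrepancy reported above would only surface on data and thresholds comparable to the bundled test setup.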
Need to restart Travis and Appveyor builds a couple of times in order to get this test set to pass. Is there a way to make it more robust?
FAIL: Test CBOW w/ hierarchical softmax
Traceback (most recent call last):
  File "C:\Python27\lib\site-packages\gensim\test\test_word2vec.py", line 246, in test_cbow_hs
    self.model_sanity(model)
  File "C:\Python27\lib\site-packages\gensim\test\test_word2vec.py", line 226, in model_sanity
    self.assertLess(t_rank, 50)
AssertionError: 64 not less than 50
-------------------- >> begin captured logging << --------------------
gensim.models.word2vec: INFO: collecting all words and their counts
-------------------- >> end captured logging << ---------------------
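The assertion that fails above is model_sanity's rank check. One possible mitigation, sketched below as an illustration rather than anything proposed in this thread, is to retry just the flaky check a bounded number of times inside the test instead of restarting whole CI builds:

```python
# Sketch (hypothetical helper, not the actual gensim test code): retry a
# statistically flaky check a few times before declaring failure.
def assert_rank_eventually(train_and_rank, threshold=50, attempts=3):
    """Re-run a randomized training/ranking closure up to `attempts` times;
    pass as soon as one run beats the threshold."""
    last_rank = None
    for _ in range(attempts):
        last_rank = train_and_rank()
        if last_rank < threshold:
            return
    raise AssertionError("%d not less than %d after %d attempts"
                         % (last_rank, threshold, attempts))
```

The cost, as the first comment warns, is reduced power: a genuine regression that fails only intermittently could slip through the retries.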