PARQUET-2361: Reduce failure rate of unit test #1170
Conversation
parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestParquetWriter.java
Force-pushed from 0d3a20d to 7460018
After this modification, the unit tests are still failing. It seems we need to explore other approaches, such as fixing the random seed or increasing the tolerance above 10% (currently fpp * 1.1).
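If the tolerance route is taken, the check could look something like the sketch below. The numbers and the helper name `withinTolerance` are made up for illustration; the real test derives the false positive count from bloom filter probes against the written Parquet file.

```java
public class FppToleranceSketch {
    // Returns true when the measured false positive rate stays under
    // the expected FPP widened by the given tolerance factor.
    static boolean withinTolerance(int falsePositives, int probes,
                                   double fpp, double tolerance) {
        double measured = (double) falsePositives / probes;
        return measured < fpp * tolerance;
    }

    public static void main(String[] args) {
        // 2,250 false positives out of 200,000 probes => measured rate 0.01125.
        // That fails at 10% tolerance (0.011) but passes at 15% (0.0115).
        System.out.println(withinTolerance(2_250, 200_000, 0.01, 1.10)); // false
        System.out.println(withinTolerance(2_250, 200_000, 0.01, 1.15)); // true
    }
}
```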
int falsePositive = 0;
Set<String> distinctStringsForProbe = new HashSet<>();
while (distinctStringsForProbe.size() < testBloomFilterCount) {
  String str = RandomStringUtils.randomAlphabetic(randomStrLen - 1);
Is there any reason this cannot be randomStrLen, by the way?
@amousavigourabi The purpose of this unit test is to verify that the false positive rate of the bloom filter meets expectations. The current approach builds a bloom filter from many strings of length 12. Any string with a length other than 12 is guaranteed not to exist in the original data, so for such strings (length != 12), a bloom filter hit is by definition a false positive, and the false positive rate is calculated from these cases. If we instead wanted to probe with strings of length 12, we would have to randomly generate strings and check whether they exist in the original data; only the ones that do not exist could be used to measure the false positive rate.
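The build-then-probe approach described above can be sketched as follows. This is a simplified illustration, not the actual test: a toy bloom filter (BitSet plus double hashing) stands in for Parquet's real filter implementation, a local `randomAlphabetic` helper stands in for commons-lang's `RandomStringUtils`, and the sizes and seed are arbitrary.

```java
import java.util.BitSet;
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

public class BloomFppSketch {
    // Toy bloom filter: k indexes per key via double hashing over a BitSet.
    static class ToyBloomFilter {
        private final BitSet bits;
        private final int numBits, numHashes;
        ToyBloomFilter(int numBits, int numHashes) {
            this.bits = new BitSet(numBits);
            this.numBits = numBits;
            this.numHashes = numHashes;
        }
        private int index(String s, int i) {
            int h1 = s.hashCode();
            int h2 = Integer.rotateLeft(h1, 16) ^ 0x9E3779B9;
            return Math.floorMod(h1 + i * h2, numBits);
        }
        void insert(String s) {
            for (int i = 0; i < numHashes; i++) bits.set(index(s, i));
        }
        boolean mightContain(String s) {
            for (int i = 0; i < numHashes; i++)
                if (!bits.get(index(s, i))) return false;
            return true;
        }
    }

    // Stand-in for RandomStringUtils.randomAlphabetic.
    static String randomAlphabetic(Random rnd, int len) {
        StringBuilder sb = new StringBuilder(len);
        for (int i = 0; i < len; i++) sb.append((char) ('a' + rnd.nextInt(26)));
        return sb.toString();
    }

    public static void main(String[] args) {
        int randomStrLen = 12;
        Random rnd = new Random(42); // fixed seed, as suggested in the thread
        ToyBloomFilter filter = new ToyBloomFilter(1 << 20, 7);

        // Build the filter from distinct strings of length 12.
        Set<String> inserted = new HashSet<>();
        while (inserted.size() < 10_000) {
            String s = randomAlphabetic(rnd, randomStrLen);
            if (inserted.add(s)) filter.insert(s);
        }

        // Probe with strings of length 11: since length != 12, they were
        // never inserted, so every filter hit is a false positive.
        int probes = 100_000, falsePositives = 0;
        Set<String> probed = new HashSet<>();
        while (probed.size() < probes) {
            String s = randomAlphabetic(rnd, randomStrLen - 1);
            if (probed.add(s) && filter.mightContain(s)) falsePositives++;
        }
        System.out.println("measured fpp = " + (double) falsePositives / probes);
    }
}
```

The key property being relied on is that bloom filters never produce false negatives, so the only source of error in the probe loop is false positives.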
@fengjiajie As the test now evaluates the false positive rate with significantly more samples than we use to build the filter, or provide as NDV, might this not push the bloom filter's false positive rate above the FPP? Perhaps we could try increasing the NDV, or maybe an adaptive bloom filter would be more appropriate? WDYT?
Nevermind, I don't think this should be an issue. I'll run the test locally in a loop to see if I can reproduce the flake and check how often it might occur.
I got 4 failures in 10k runs. This would mean 4 failures in 1,250 full GitHub Actions runs. These failures were all very slightly outside the expected range for the 0.01 FPP case. Given that we take 200k samples, this might indicate a flaw in the code, as 200k samples of some i.i.d. random variable ~ Bern(0.01) really should not exceed 2,200 hits that often, if ever. If we just want to fix this test, I suggest raising the tolerance to 15%. That should keep it from failing.
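The statistical claim above can be sanity-checked with a normal approximation of the binomial. A minimal sketch (the helper name `zScore` is just for illustration):

```java
public class FlakeOddsSketch {
    // Z-score of observing `hits` successes out of n i.i.d. Bernoulli(p)
    // trials, under the normal approximation of the binomial distribution.
    static double zScore(int hits, int n, double p) {
        double mean = n * p;
        double sd = Math.sqrt(n * p * (1 - p));
        return (hits - mean) / sd;
    }

    public static void main(String[] args) {
        // mean = 2,000, sd = sqrt(1,980) ~ 44.5, so 2,200 hits sits
        // roughly 4.5 standard deviations above the mean.
        System.out.printf("z = %.2f%n", zScore(2_200, 200_000, 0.01));
    }
}
```

A 4.5-sigma event has a one-sided tail probability on the order of 1e-6, so seeing it 4 times in 10k runs does suggest the effective false positive rate is above the nominal 0.01, consistent with the comment's suspicion.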
Force-pushed from 7460018 to 9c54147
@amousavigourabi I agree with increasing the tolerance. Thank you very much for your review and testing.
Reduce failure rate of unit test testParquetFileWithBloomFilterWithFpp
Change-Id: Ic230f197b0996333a082bb05bd201963d05d862e
Multiple different PRs triggered this failure:
I found two issues that may cause the failures to occur easily: