
PARQUET-2361: Reduce failure rate of unit test #1170

Merged · 1 commit · Oct 19, 2023

Conversation

@fengjiajie (Contributor) commented Oct 14, 2023:

Reduce failure rate of unit test testParquetFileWithBloomFilterWithFpp

Change-Id: Ic230f197b0996333a082bb05bd201963d05d862e

```
[INFO] Results:
[INFO]
Error:  Failures:
Error:    TestParquetWriter.testParquetFileWithBloomFilterWithFpp:342
```

Multiple different PRs have triggered this failure:

  1. Bump jmh.version from 1.21 to 1.36 #1062
  2. https://github.com/apache/parquet-mr/actions/runs/5420924489/job/14680382407
  3. https://github.com/apache/parquet-mr/actions/runs/6336014897
  4. https://github.com/apache/parquet-mr/actions/runs/6381223319
  5. https://github.com/apache/parquet-mr/actions/runs/6394826826/job/17357106390

I found two issues that make these failures likely:

  1. This may be a bug: 'distinctStrings' is used to generate the test files, but it is cleared after the first round and then refilled for the probe phase. As a result, the next round's 'fpp' file is generated from the previous round's probe data, whose string length (10, chosen so the probe strings are guaranteed absent from the bloom filter) differs from the intended build length (12). See the sketch after this list.
  2. The number of probe iterations is too small, so the measured false positive rate is unstable.
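
For illustration, a minimal sketch of the problematic pattern (the class and helper names such as `buildFileWithBloomFilter` are hypothetical, not the actual test code):

```java
import java.util.HashSet;
import java.util.Set;
import org.apache.commons.lang3.RandomStringUtils;

class DistinctStringsReuseSketch {
  static final int NDV = 10_000; // hypothetical build count

  void run() {
    Set<String> distinctStrings = new HashSet<>();
    for (double fpp : new double[] {0.01, 0.05}) {
      // Intended: fill with fresh length-12 strings and build the filter.
      // After round 1 the set is already full of length-10 probe leftovers,
      // so round 2's build loop adds nothing and the file is built from
      // the previous round's probe data.
      while (distinctStrings.size() < NDV) {
        distinctStrings.add(RandomStringUtils.randomAlphabetic(12));
      }
      buildFileWithBloomFilter(distinctStrings, fpp);

      // The bug: the build set is cleared and reused for probing...
      distinctStrings.clear();
      while (distinctStrings.size() < NDV) {
        // ...with length-10 strings, guaranteed absent from the filter.
        distinctStrings.add(RandomStringUtils.randomAlphabetic(10));
      }
      countFalsePositives(distinctStrings);
    }
  }

  void buildFileWithBloomFilter(Set<String> values, double fpp) { /* hypothetical */ }
  void countFalsePositives(Set<String> probes) { /* hypothetical */ }
}
```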

Make sure you have checked all steps below.

Jira

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain Javadoc that explain what it does

@fengjiajie (Contributor, Author) commented:

After this modification, the unit test still fails occasionally. It seems we need to explore other approaches, such as fixing the random seed or raising the tolerance above 10% (currently fpp * 1.1).

```java
int falsePositive = 0;
Set<String> distinctStringsForProbe = new HashSet<>();
while (distinctStringsForProbe.size() < testBloomFilterCount) {
  String str = RandomStringUtils.randomAlphabetic(randomStrLen - 1);
```
@amousavigourabi (Contributor) commented on this line:

Is there any reason this cannot be randomStrLen by the way?

@fengjiajie (Contributor, Author) replied:

@amousavigourabi The purpose of the unit test is to verify that the bloom filter's false positive rate meets expectations. The current approach builds the filter from many strings of length 12; any string of a different length is guaranteed not to exist in the original data, so whenever the filter returns true for such a string (length != 12) it is a false positive, and the false positive rate is computed from these cases. If we instead wanted to probe with strings of length 12, we would have to randomly generate strings, check whether each exists in the original data, and use only the absent ones to measure the rate.
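
A minimal, self-contained sketch of that measurement scheme (my own illustration using Guava's BloomFilter as a stand-in for Parquet's; the counts are hypothetical):

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;
import org.apache.commons.lang3.RandomStringUtils;

public class FppMeasurementSketch {
  public static void main(String[] args) {
    int ndv = 10_000;     // hypothetical number of distinct build values
    int probes = 200_000; // hypothetical number of probe values
    double fpp = 0.01;

    BloomFilter<String> filter =
        BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), ndv, fpp);

    // Build phase: distinct length-12 strings.
    Set<String> distinct = new HashSet<>();
    while (distinct.size() < ndv) {
      distinct.add(RandomStringUtils.randomAlphabetic(12));
    }
    distinct.forEach(filter::put);

    // Probe phase: a length-10 string can never equal a length-12 string,
    // so every positive answer here is a false positive by construction.
    Set<String> probeSet = new HashSet<>();
    while (probeSet.size() < probes) {
      probeSet.add(RandomStringUtils.randomAlphabetic(10));
    }
    long falsePositives = probeSet.stream().filter(filter::mightContain).count();
    System.out.printf("measured fpp = %.4f%n", (double) falsePositives / probes);
  }
}
```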

@amousavigourabi (Contributor) commented:

@fengjiajie As the test now evaluates the false positive rate with significantly more samples than are used to build the filter (or are provided as the NDV), might that push the bloom filter's observed false positive rate above the FPP? Perhaps we could try increasing the NDV, or maybe an adaptive bloom filter would be more appropriate? WDYT?

@amousavigourabi (Contributor) replied:

Nevermind, I don't think this should be an issue. I'll be running the test locally in a loop to see if I can reproduce the flake and check how often it might occur.

@amousavigourabi (Contributor) followed up:
I got 4 failures in 10k runs. This would mean 4 failures in 1250 full actions runs. These failures were all very slightly out of the expected range for the 0.01 fpp case. Given that we take 200k samples, this might indicate a flaw in the code, as 200k samples of some i.i.d. random variable ~ Bern(0.01) really should not have over 2200 hits that often, if ever. If we want to just fix this test I suggest raising the tolerance to 15%. That should keep it from failing.
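
A back-of-the-envelope check of that claim (my own arithmetic, assuming the probes really were i.i.d.):

$$X \sim \mathrm{Binomial}(n = 2 \times 10^5,\ p = 0.01), \qquad \mu = np = 2000, \qquad \sigma = \sqrt{np(1-p)} \approx 44.5$$

The 10% tolerance puts the failure threshold at 2200 hits, about $4.5\sigma$ above the mean, so by the normal approximation $P(X > 2200) \approx \Phi(-4.5) \approx 3 \times 10^{-6}$ — roughly 0.03 expected failures per 10k runs, not 4. The observed rate is far too high for independent draws at the nominal fpp, which supports the suspicion of a flaw (or a true fpp above 0.01).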

@fengjiajie (Contributor, Author) replied:

@amousavigourabi I agree with raising the tolerance to 15%. Thank you very much for your review and testing.
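
For reference, the agreed fix amounts to loosening the assertion bound along these lines (a sketch with hypothetical variable names, not the actual diff):

```java
import static org.junit.Assert.assertTrue;

// Inside the test, after counting false positives over the probe set;
// falsePositive, totalCount, and fpp are hypothetical names here.
double measuredFpp = (double) falsePositive / totalCount;
// Allow 15% slack instead of 10%, to absorb sampling noise at ~200k probes.
assertTrue("measured fpp " + measuredFpp + " exceeds " + (fpp * 1.15),
    measuredFpp < fpp * 1.15);
```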

@wgtmac merged commit 354ddeb into apache:master on Oct 19, 2023
9 checks passed