PARQUET-2361: Reduce failure rate of unit test #1170

Merged: 1 commit, Oct 19, 2023
@@ -288,14 +288,15 @@ public void testParquetFileWithBloomFilter() throws IOException {

   @Test
   public void testParquetFileWithBloomFilterWithFpp() throws IOException {
-    int totalCount = 100000;
-    double[] testFpp = {0.01, 0.05, 0.10, 0.15, 0.20, 0.25};
+    int buildBloomFilterCount = 100000;
+    double[] testFpps = {0.01, 0.05, 0.10, 0.15, 0.20, 0.25};
     int randomStrLen = 12;
+    final int testBloomFilterCount = 200000;

-    Set<String> distinctStrings = new HashSet<>();
-    while (distinctStrings.size() < totalCount) {
+    Set<String> distinctStringsForFileGenerate = new HashSet<>();
+    while (distinctStringsForFileGenerate.size() < buildBloomFilterCount) {
       String str = RandomStringUtils.randomAlphabetic(randomStrLen);
-      distinctStrings.add(str);
+      distinctStringsForFileGenerate.add(str);
     }

     MessageType schema = Types.buildMessage().
@@ -305,7 +306,7 @@ public void testParquetFileWithBloomFilterWithFpp() throws IOException {
     GroupWriteSupport.setSchema(schema, conf);

     GroupFactory factory = new SimpleGroupFactory(schema);
-    for (int i = 0; i < testFpp.length; i++) {
+    for (double testFpp : testFpps) {
       File file = temp.newFile();
       file.delete();
       Path path = new Path(file.getAbsolutePath());
@@ -314,32 +315,32 @@ public void testParquetFileWithBloomFilterWithFpp() throws IOException {
           .withConf(conf)
           .withDictionaryEncoding(false)
           .withBloomFilterEnabled("name", true)
-          .withBloomFilterNDV("name", totalCount)
-          .withBloomFilterFPP("name", testFpp[i])
+          .withBloomFilterNDV("name", buildBloomFilterCount)
+          .withBloomFilterFPP("name", testFpp)
           .build()) {
-        java.util.Iterator<String> iterator = distinctStrings.iterator();
-        while (iterator.hasNext()) {
-          writer.write(factory.newGroup().append("name", iterator.next()));
+        for (String str : distinctStringsForFileGenerate) {
+          writer.write(factory.newGroup().append("name", str));
         }
       }
-      distinctStrings.clear();

       try (ParquetFileReader reader = ParquetFileReader.open(HadoopInputFile.fromPath(path, new Configuration()))) {
         BlockMetaData blockMetaData = reader.getFooter().getBlocks().get(0);
         BloomFilter bloomFilter = reader.getBloomFilterDataReader(blockMetaData)
             .readBloomFilter(blockMetaData.getColumns().get(0));

-        // The exist counts the number of times FindHash returns true.
-        int exist = 0;
-        while (distinctStrings.size() < totalCount) {
-          String str = RandomStringUtils.randomAlphabetic(randomStrLen - 2);
-          if (distinctStrings.add(str) &&
+        // The false positive counts the number of times FindHash returns true.
+        int falsePositive = 0;
+        Set<String> distinctStringsForProbe = new HashSet<>();
+        while (distinctStringsForProbe.size() < testBloomFilterCount) {
+          String str = RandomStringUtils.randomAlphabetic(randomStrLen - 1);
amousavigourabi (Contributor) commented on this line: Is there any reason this cannot be randomStrLen by the way?

fengjiajie (Contributor, Author) replied: @amousavigourabi The purpose of this unit test is to ensure that the false positive rate of the bloom filter meets expectations. The current approach generates a bloom filter from many strings of length 12. Any string with a length other than 12 is guaranteed not to exist in the original data, so for such a string (length != 12), a bloom filter hit is by definition a false positive, and the false positive rate is calculated from these cases. Alternatively, if we wanted to probe with strings of length 12, we would have to randomly generate strings and check whether each exists in the original data; only the ones that do not exist could be used to measure the false positive rate.
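The methodology described in this reply can be sketched end to end without any Parquet dependency. Below is a hypothetical, self-contained toy Bloom filter (the class name, filter sizing, and hash scheme are assumptions for illustration, not the Parquet BloomFilter API): build the filter from distinct length-12 strings, probe with distinct length-11 strings, and count every hit as a false positive, since no length-11 string can collide with a length-12 key by equality.

```java
import java.util.BitSet;
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Toy Bloom filter used only to illustrate the test's methodology.
// NUM_BITS and NUM_HASHES are illustrative assumptions.
public class BloomFppSketch {
  static final int NUM_BITS = 1 << 21;  // bit-array size (assumed)
  static final int NUM_HASHES = 7;      // probes per key (assumed)

  final BitSet bits = new BitSet(NUM_BITS);

  void insert(String s) {
    for (int i = 0; i < NUM_HASHES; i++) {
      bits.set(index(s, i));
    }
  }

  boolean mightContain(String s) {
    for (int i = 0; i < NUM_HASHES; i++) {
      if (!bits.get(index(s, i))) {
        return false;  // a single unset bit proves the key was never inserted
      }
    }
    return true;  // all bits set: either present or a false positive
  }

  // Kirsch-Mitzenmacher style double hashing: base hash plus i times an odd step.
  static int index(String s, int i) {
    int h1 = s.hashCode();
    int h2 = h1 * 0x9E3779B9;
    h2 = (h2 ^ (h2 >>> 16)) | 1;  // force an odd step
    return Math.floorMod(h1 + i * h2, NUM_BITS);
  }

  static String randomAlphabetic(Random rnd, int len) {
    StringBuilder sb = new StringBuilder(len);
    for (int i = 0; i < len; i++) {
      sb.append((char) ('a' + rnd.nextInt(26)));
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    Random rnd = new Random(42);
    BloomFppSketch filter = new BloomFppSketch();

    // Build phase: distinct strings of length 12 populate the filter.
    Set<String> build = new HashSet<>();
    while (build.size() < 100_000) {
      build.add(randomAlphabetic(rnd, 12));
    }
    for (String s : build) {
      filter.insert(s);
    }

    // Probe phase: distinct strings of length 11. None can equal a
    // length-12 build string, so every positive is a false positive.
    Set<String> probe = new HashSet<>();
    int falsePositive = 0;
    while (probe.size() < 200_000) {
      String s = randomAlphabetic(rnd, 11);
      if (probe.add(s) && filter.mightContain(s)) {
        falsePositive++;
      }
    }
    System.out.println("observed fpp = " + (double) falsePositive / 200_000);
  }
}
```

The same structure is what the patched test does against the real Parquet Bloom filter: the probe set is disjoint from the build set by construction, so the hit count divided by the probe count estimates the filter's false positive rate directly.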

+          if (distinctStringsForProbe.add(str) &&
              bloomFilter.findHash(LongHashFunction.xx(0).hashBytes(Binary.fromString(str).toByteBuffer()))) {
-            exist++;
+            falsePositive++;
           }
         }
-        // The exist should be less than totalCount * fpp. Add 10% here for error space.
-        assertTrue(exist < totalCount * (testFpp[i] * 1.1) && exist > 0);
+        // The false positive should be less than totalCount * fpp. Add 15% here for error space.
+        double expectedFalsePositiveMaxCount = Math.floor(testBloomFilterCount * (testFpp * 1.15));
+        assertTrue(falsePositive < expectedFalsePositiveMaxCount && falsePositive > 0);
       }
     }
   }
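For reference, the margin in the new assertion can be made concrete with plain arithmetic. The sketch below (hypothetical and standalone; the class name is invented and nothing here calls Parquet) tabulates the maximum false-positive count the patched assertion permits for each tested fpp, i.e. floor(testBloomFilterCount * testFpp * 1.15).

```java
// Upper bounds allowed by the patched assertion, using the constants
// from the diff: 200,000 probes, fpp values 0.01 through 0.25, 15% margin.
public class FppBoundTable {
  public static void main(String[] args) {
    final int testBloomFilterCount = 200_000;
    double[] testFpps = {0.01, 0.05, 0.10, 0.15, 0.20, 0.25};
    for (double testFpp : testFpps) {
      // Note: floating-point rounding can put the floored bound one below
      // the nominal product (e.g. near 2300 for fpp = 0.01).
      double maxCount = Math.floor(testBloomFilterCount * (testFpp * 1.15));
      System.out.printf("fpp=%.2f -> at most %.0f false positives of 200000 probes%n",
          testFpp, maxCount);
    }
  }
}
```

The 15% headroom (up from 10% before the patch), together with doubling the probe count to 200,000, is what reduces the flakiness: the observed false-positive count concentrates more tightly around its expectation with more probes, and the wider margin absorbs the remaining sampling noise.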