GH-3026: ParquetRewriter fails when you try to nullify and encrypt 2 different columns #3027

Merged

Contributor

@MaxNevermind MaxNevermind commented Oct 7, 2024

GitHub issue: ParquetRewriter fails when you try to nullify and encrypt 2 different columns #3026
This issue was previously reported in PR: PARQUET-2430: Add parquet joiner v2 #1335

Issue description

ParquetRewriter fails when you nullify one column and encrypt a different one. There is a related test, but it nullifies and encrypts the same column, so it does not reproduce the bug. The bug can be reproduced by changing a single line in the testNullifyAndEncryptColumn() method of ParquetRewriterTest, from maskColumns.put("DocId", MaskMode.NULLIFY); to maskColumns.put("Links.Forward", MaskMode.NULLIFY);. With that change, the test fails with the exception below:

org.apache.parquet.crypto.ParquetCryptoRuntimeException: Column ordinal doesnt match [Links, Forward]: 0, 6

	at org.apache.parquet.crypto.InternalFileEncryptor.getColumnSetup(InternalFileEncryptor.java:92)
	at org.apache.parquet.hadoop.ColumnChunkPageWriteStore.<init>(ColumnChunkPageWriteStore.java:634)
	at org.apache.parquet.hadoop.rewrite.ParquetRewriter.nullifyColumn(ParquetRewriter.java:889)
	at org.apache.parquet.hadoop.rewrite.ParquetRewriter.processBlock(ParquetRewriter.java:445)
	at org.apache.parquet.hadoop.rewrite.ParquetRewriter.processBlocks(ParquetRewriter.java:395)
	at org.apache.parquet.hadoop.rewrite.ParquetRewriterTest.testNullifyAndEncryptColumn(ParquetRewriterTest.java:474)
	at org.apache.parquet.hadoop.rewrite.ParquetRewriterTest.testNullifyEncryptSingleFile(ParquetRewriterTest.java:521)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
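
The one-line change from the issue description can be sketched as follows. This is only an illustration of the two configurations, not the test itself: MaskMode is a local stand-in enum so the snippet compiles without parquet on the classpath, the class and helper names are hypothetical, and the encrypted column is assumed to be DocId because the description says the existing test nullifies and encrypts the same column.

```java
import java.util.HashMap;
import java.util.Map;

final class MaskColumnsRepro {
    // Local stand-in for the MaskMode enum referenced in ParquetRewriterTest,
    // declared here only so the sketch is self-contained.
    enum MaskMode { NULLIFY }

    // Assumption: the test encrypts DocId (inferred from the description).
    static final String ENCRYPTED_COLUMN = "DocId";

    // Configuration of the existing test: nullify the same column that is encrypted.
    static Map<String, MaskMode> originalMask() {
        Map<String, MaskMode> maskColumns = new HashMap<>();
        maskColumns.put("DocId", MaskMode.NULLIFY);
        return maskColumns;
    }

    // Configuration that reproduces the bug: nullify a different column.
    static Map<String, MaskMode> reproducingMask() {
        Map<String, MaskMode> maskColumns = new HashMap<>();
        maskColumns.put("Links.Forward", MaskMode.NULLIFY);
        return maskColumns;
    }

    // True when the nullified column is also the encrypted column.
    static boolean nullifiesEncryptedColumn(Map<String, MaskMode> maskColumns) {
        return maskColumns.get(ENCRYPTED_COLUMN) == MaskMode.NULLIFY;
    }

    public static void main(String[] args) {
        System.out.println(nullifiesEncryptedColumn(originalMask()));    // true
        System.out.println(nullifiesEncryptedColumn(reproducingMask())); // false
    }
}
```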

Issue root cause

The failure occurs because nullification creates a single-column schema, MessageType newSchema = newSchema(schema, descriptor). This custom schema is needed because only the specified column should be nullified. However, the default encryptor created during ParquetRewriter construction cannot be reused with the new schema: it checks the encrypted columns' metadata internally, and that check fails because of the schema discrepancy (the exception above reports column ordinal 0 against 6).
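
The mismatch can be illustrated with a small toy model. This is a sketch under stated assumptions, not the library's implementation: ToyEncryptor is a hypothetical stand-in that remembers each column's ordinal on first use and rejects a later call with a different ordinal, and FULL_SCHEMA is a made-up column list (with placeholder names) chosen so that DocId sits at position 0 and Links.Forward at position 6, matching the ordinals in the exception message.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class OrdinalMismatchSketch {
    // Toy stand-in for an encryptor that records a column's ordinal the first
    // time it sees the column and rejects any later, different ordinal.
    static final class ToyEncryptor {
        private final Map<String, Integer> ordinalByPath = new HashMap<>();

        void getColumnSetup(String path, int ordinal) {
            Integer known = ordinalByPath.putIfAbsent(path, ordinal);
            if (known != null && known != ordinal) {
                throw new IllegalStateException(
                        "Column ordinal doesnt match " + path + ": " + ordinal + ", " + known);
            }
        }
    }

    // Hypothetical full schema; Col1..Col4 are placeholders, not real column names.
    static final List<String> FULL_SCHEMA = Arrays.asList(
            "DocId", "Col1", "Col2", "Col3", "Col4", "Links.Backward", "Links.Forward");

    // Models a shared encryptor that has already seen every column at its
    // position in the full schema (assumption about when ordinals are recorded).
    static ToyEncryptor encryptorForFullSchema() {
        ToyEncryptor encryptor = new ToyEncryptor();
        for (int i = 0; i < FULL_SCHEMA.size(); i++) {
            encryptor.getColumnSetup(FULL_SCHEMA.get(i), i);
        }
        return encryptor;
    }

    // Models nullification: the single-column schema places the column at ordinal 0.
    // Returns null on success, or the error message when the ordinal check fails.
    static String nullifyWith(ToyEncryptor encryptor, String column) {
        try {
            encryptor.getColumnSetup(column, 0);
            return null;
        } catch (IllegalStateException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // DocId is at ordinal 0 in both schemas, so the check passes: prints null
        System.out.println(nullifyWith(encryptorForFullSchema(), "DocId"));
        // prints: Column ordinal doesnt match Links.Forward: 0, 6
        System.out.println(nullifyWith(encryptorForFullSchema(), "Links.Forward"));
        // An encryptor that never saw the full schema accepts ordinal 0: prints null
        System.out.println(nullifyWith(new ToyEncryptor(), "Links.Forward"));
    }
}
```

In this model the existing test passes only because the nullified column happens to have the same ordinal in both schemas. The last call shows that an encryptor which has not seen the full schema raises no mismatch; whether the merged change takes that exact approach is not stated in this description.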

Close #3026

@MaxNevermind
Contributor Author

@wgtmac
This is the fix you asked for here: #1335 (comment)

Member

@wgtmac wgtmac left a comment

+1

Sorry for the late review and thanks for the fix!

@wgtmac wgtmac changed the title GH-3026 ParquetRewriter fails when you try to nullify and encrypt 2 different columns GH-3026: ParquetRewriter fails when you try to nullify and encrypt 2 different columns Oct 12, 2024
@wgtmac wgtmac merged commit 5baa903 into apache:master Oct 12, 2024
9 checks passed