ParquetRewriter fails when you try to nullify and encrypt 2 different columns #3026

Closed
MaxNevermind opened this issue Oct 7, 2024 · 0 comments · Fixed by #3027

Describe the bug, including details regarding any error messages, version, and platform.

This issue was previously reported in PR: PARQUET-2430: Add parquet joiner v2 #1335

Issue description

When you try to nullify and encrypt different columns using ParquetRewriter, it fails. There is a related test, but it nullifies and encrypts the same column, which doesn't reproduce the bug. The bug can be reproduced by changing a single line in the testNullifyAndEncryptColumn() method of ParquetRewriterTest, from maskColumns.put("DocId", MaskMode.NULLIFY); to maskColumns.put("Links.Forward", MaskMode.NULLIFY);. With that change, the test starts to fail with the exception below:

org.apache.parquet.crypto.ParquetCryptoRuntimeException: Column ordinal doesnt match [Links, Forward]: 0, 6

	at org.apache.parquet.crypto.InternalFileEncryptor.getColumnSetup(InternalFileEncryptor.java:92)
	at org.apache.parquet.hadoop.ColumnChunkPageWriteStore.<init>(ColumnChunkPageWriteStore.java:634)
	at org.apache.parquet.hadoop.rewrite.ParquetRewriter.nullifyColumn(ParquetRewriter.java:889)
	at org.apache.parquet.hadoop.rewrite.ParquetRewriter.processBlock(ParquetRewriter.java:445)
	at org.apache.parquet.hadoop.rewrite.ParquetRewriter.processBlocks(ParquetRewriter.java:395)
	at org.apache.parquet.hadoop.rewrite.ParquetRewriterTest.testNullifyAndEncryptColumn(ParquetRewriterTest.java:474)
	at org.apache.parquet.hadoop.rewrite.ParquetRewriterTest.testNullifyEncryptSingleFile(ParquetRewriterTest.java:521)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

Issue root cause

The failure occurs because, during nullification, we create a single-column schema: MessageType newSchema = newSchema(schema, descriptor). This is needed because only the specified column must be nullified, so we create a custom schema for that purpose. But we can't reuse the default encryptor created during ParquetRewriter construction with that new custom schema, because the default encryptor expects the main output schema used during ParquetRewriter construction. InternalFileEncryptor performs schema checks, and they fail because of the schema discrepancy.
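
The mismatch can be illustrated with a small, self-contained toy model. This is only a sketch and not the actual Parquet code: ToyEncryptor, its methods, and the column list are hypothetical stand-ins, and the model assumes the encryptor remembers the ordinal it first saw for each column path and rejects a later call that passes a different one. The column list is chosen so that Links.Forward sits at ordinal 6, matching the numbers in the exception above.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class OrdinalMismatchDemo {

    // Hypothetical stand-in for an encryptor that pins each column path to one ordinal.
    static class ToyEncryptor {
        private final Map<String, Integer> ordinals = new HashMap<>();

        // Registers the column on first sight; later calls must pass the same ordinal.
        void getColumnSetup(String path, int ordinal) {
            Integer known = ordinals.putIfAbsent(path, ordinal);
            if (known != null && known != ordinal) {
                throw new IllegalStateException(
                        "Column ordinal doesnt match [" + path + "]: " + ordinal + ", " + known);
            }
        }

        // Registers every column of a schema using its position in that schema.
        void registerSchema(List<String> schema) {
            for (int i = 0; i < schema.size(); i++) {
                getColumnSetup(schema.get(i), i);
            }
        }
    }

    // Assumed column order of the full output schema (illustrative only).
    static final List<String> FULL_SCHEMA = Arrays.asList(
            "DocId", "Name", "Gender", "FloatFraction", "DoubleFraction",
            "Links.Backward", "Links.Forward");

    // Mimics nullification: builds a single-column schema and registers it with the
    // given encryptor. Returns the error message, or null if registration succeeded.
    static String nullify(ToyEncryptor encryptor, String column) {
        List<String> singleColumnSchema = Arrays.asList(column);
        try {
            encryptor.registerSchema(singleColumnSchema);
            return null;
        } catch (IllegalStateException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        ToyEncryptor shared = new ToyEncryptor();
        shared.registerSchema(FULL_SCHEMA);

        // In this model DocId is ordinal 0 in both schemas, so no conflict arises.
        System.out.println(nullify(shared, "DocId"));
        // Links.Forward is ordinal 0 in the single-column schema but 6 in the full one.
        System.out.println(nullify(shared, "Links.Forward"));
        // An encryptor that has not seen the full schema has nothing to conflict with.
        System.out.println(nullify(new ToyEncryptor(), "Links.Forward"));
    }
}
```

In this model, the existing test passes only because the nullified column happens to be first in both schemas; whether that is exactly why the real test passes is an assumption. The last call hints at one possible direction, not using the shared encryptor with the single-column schema, but it says nothing about how the actual fix is implemented.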

Component(s)

Core
