PARQUET-2139: fix file_offset field in ColumnChunk metadata #1369
new ColumnChunk(columnMetaData.getFirstDataPageOffset()); // verify this is the right offset
// There is no ColumnMetaData written after the chunk data, so set the ColumnChunk
// file_offset to 0
ColumnChunk columnChunk = new ColumnChunk(0);
This is something that @etseidl and I have discussed in https://issues.apache.org/jira/browse/PARQUET-2139. The best fix is to write ColumnMetaData at the end of each column chunk (currently parquet-java does not) and store the correct offset here. However, it has been wrong since day 1 and it would take some effort to make it right. Since we have not seen any issues around this over the years, I'm inclined to deprecate this field together with the v3 discussion. Therefore I'm fine with setting an invalid value here (0 or -1). WDYT? @gszadovszky @julienledem
I also brought this up on the mailing list.
@wgtmac, I agree with writing an invalid value here (0 is as invalid as -1 because of the magic bytes at the beginning of the file) and removing the field for v3.
I don't think we intended to write the ColumnMetaData at the end of the column chunk though. Is it something that is ambiguous in the spec?
(I followed up on the mailing list on the thread above)
parquet-cpp actually writes ColumnMetaData right after the last page of the column and stores its offset in the file_offset field: https://github.com/apache/arrow/blob/6800be9331d88024bf550c77865a06c592a22699/cpp/src/parquet/metadata.cc#L1473-L1478
I don't remember all the context, but if this is completely wrong, I'd rather deprecate the field and document it should not be used rather than setting the value to zero.
What do other implementations put in this field? (If no other implementation sets this, then this might be a different story.)
I agree with deprecating, but I'm less sanguine about leaving an incorrect value in parquet-java, especially given the fact that arrow-cpp (and arrow-rs I believe) populate this field correctly. Having such a big difference between major implementations is IMO more confusing than stating the field should be set to 0 (or -1) if there is no second copy of the
Implementations that do this will break anyway if they try to read a file produced by arrow, so I don't know how big of a concern this is. That said, if the consensus is to just leave this be, that's fine too... we'd just have to make note of differing interpretations in the format documents.
@wgtmac, now that this field has been deprecated, do you think this should move forward?
Thanks for the reminder! I'll merge this.
Fixes the referenced issue wherein the file_offset field of the ColumnChunk object is improperly set to the offset of the first page in the column chunk. Because parquet-java does not write a copy of ColumnMetaData after the column chunk, this PR simply sets the value of file_offset to 0 (per apache/parquet-format/#440).

Closes #2678.
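
For readers wondering how a chunk can still be located once file_offset is 0: the page offsets stored in ColumnMetaData (dictionary_page_offset, which is optional, and data_page_offset) are enough to find the first page. The following is a minimal sketch of that idea, not code from this PR or from parquet-java. The class name ColumnChunkStart and the method firstPageOffset are hypothetical, and the sanity checks on the dictionary offset are an assumption about how a defensive reader might behave.

```java
import java.util.OptionalLong;

/** Hypothetical helper for illustration only; not part of parquet-java. */
final class ColumnChunkStart {
    private ColumnChunkStart() {}

    /**
     * Returns the byte offset of the first page of a column chunk using only
     * the page offsets from ColumnMetaData, ignoring ColumnChunk.file_offset.
     *
     * @param dictionaryPageOffset dictionary_page_offset, empty if the chunk has no dictionary page
     * @param dataPageOffset       data_page_offset (offset of the first data page)
     */
    static long firstPageOffset(OptionalLong dictionaryPageOffset, long dataPageOffset) {
        // Assumption: a dictionary page, when present, precedes the first data page.
        // An offset of 0 cannot be a real page because the file begins with magic bytes.
        if (dictionaryPageOffset.isPresent()) {
            long dict = dictionaryPageOffset.getAsLong();
            if (dict > 0 && dict < dataPageOffset) {
                return dict;
            }
        }
        return dataPageOffset;
    }
}
```

With this approach, ColumnChunkStart.firstPageOffset(OptionalLong.of(4L), 120L) would select the dictionary page at offset 4, while a chunk without a dictionary page falls back to data_page_offset.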
Tests
The footer metadata lacks the file_offset field, so unit testing is difficult. Manual inspection of generated files confirms the desired output.