PARQUET-1482: [C++] Add branch to TypedRecordReader::ReadNewPage for … #3312
The parquetjs project should probably be warned about cross-compatibility -- V2 data pages are likely unreadable in a number of places.
Hi Wes, thanks for responding on this! I just opened an issue with the parquetjs project to make them aware of this too: ironSource/parquetjs#78. I've noticed the same issue (from PARQUET-1482) occurring with files written using the […]. I can't reproduce this using the newest release of […]. I see that there are Travis CI build failures, but these seem to be unrelated to the changes in this commit. I was thinking of pulling in changes from apache:master after the build is passing, rebasing the […]
In general I'm happy with the changes, but we definitely should have a test for this code. You could probably add one, in its simplest form, to https://github.com/apache/arrow/blob/master/cpp/src/parquet/file-deserialize-test.cc
I agree that a unit test would be a good idea. When we were building this library initially, I tried to make the file deserialization internals a bit more accessible to unit testing; otherwise triggering and testing various code paths would be a lot more difficult.
Hi Uwe and Wes, thank you for the recommendations. I added a test in […]. My understanding is that the new test will not directly test the changes to […]. I was thinking of creating a new JIRA story for writing DataPageV2 headers using the Arrow interface for Parquet, but I found PARQUET-458, which seems to be similar. Is it alright if I assign this JIRA story to myself?
@rdmello Feel free to assign PARQUET-458 to yourself. To enable writing V2 pages in Arrow, you should add an option to https://github.com/apache/arrow/blob/master/cpp/src/parquet/properties.h#L142-L433 to switch between V1 and V2 pages. This option then needs to be passed through the layers down to where the page is actually created.
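As a rough illustration of what "an option passed through the layers" could look like, here is a minimal, self-contained sketch. All names in it (`DataPageVersion`, `WriterOptions`, `ColumnWriter`, `PageWriter`) are hypothetical stand-ins invented for this example and are not the parquet-cpp API; the real change would live in `properties.h` and the existing writer classes.

```cpp
#include <cassert>
#include <string>

// Hypothetical option type; not part of parquet-cpp.
enum class DataPageVersion { V1, V2 };

// Hypothetical stand-in for a writer properties object.
struct WriterOptions {
  DataPageVersion data_page_version = DataPageVersion::V1;  // V1 stays the default
};

// Lowest layer: the place where the page header is actually created.
class PageWriter {
 public:
  explicit PageWriter(const WriterOptions& options)
      : version_(options.data_page_version) {}

  // Name of the page type this writer would emit.
  std::string PageTypeName() const {
    switch (version_) {
      case DataPageVersion::V1:
        return "DATA_PAGE";
      case DataPageVersion::V2:
        return "DATA_PAGE_V2";
    }
    assert(false && "unhandled DataPageVersion");
    return "";
  }

 private:
  DataPageVersion version_;
};

// Intermediate layer: forwards the options without interpreting them.
class ColumnWriter {
 public:
  explicit ColumnWriter(const WriterOptions& options) : pager_(options) {}
  std::string PageTypeName() const { return pager_.PageTypeName(); }

 private:
  PageWriter pager_;
};
```

The point of the sketch is only that the intermediate layer forwards the option untouched, so the V1/V2 decision is made in exactly one place.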
I've gone through these changes and they look good to me. Would it be ok to accept this PR and follow up with a new PR to respond to the feedback? @rdmello could use PARQUET-458 to complete the write implementation and add the suggested tests.
Sorry for the delay, will review this so it gets into 0.13.0
…PageType::DATA_PAGE_V2 to address incompatibility with parquetjs.

Tests

This commit doesn't include tests; I am working on them now. I may need to use an actual file generated by parquetjs to test this issue, so I wonder if adding feeds1kMicros.parquet from the JIRA task to the parquet-testing repository is an option.

Description

parquetjs seems to be writing Parquet V2 files with DataPageV2 pages, while parquet-cpp writes Parquet V2 files with DataPage pages.

Since TypedRecordReader::ReadNewPage() only had a branch for PageType::DATA_PAGE, the reader would return without reading any data for records that have DATA_PAGE_V2 pages. This explains the behavior observed in PARQUET-1482.

This commit adds a new if-else branch for the DataPageV2 case in TypedRecordReader::ReadNewPage(). Since the DataPageV2 branch needed to reuse the code from the DataPage case, I refactored the repetition/definition level decoder initialization and the data decoder initialization to two new methods in the TypedRecordReader class. These new methods are now called by the DataPage and DataPageV2 initialization branches in TypedRecordReader::ReadNewPage().

There is an alternate implementation possible (with a smaller diff) by sharing the same else-if branch between DataPage and DataPageV2 using a pointer-to-derived shared_ptr<Page>. However, since the Page superclass doesn't have the necessary encoding() or num_values() methods, I would need to add a common superclass to both DataPage and DataPageV2 that defined these methods. I didn't do this because I was hesitant to modify the Page class hierarchy for this commit.
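The branch structure described in this commit message can be sketched with a toy model. This is not the parquet-cpp code: `ToyRecordReader`, the simplified `Page`, `DataPage`, and `DataPageV2` structs, and the single `InitializeDecoders` helper are assumptions made for illustration. It only shows why a templated helper is needed when the two page classes share no base that exposes `num_values()`.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>

enum class PageType { DATA_PAGE, DATA_PAGE_V2, DICTIONARY_PAGE };

// Simplified stand-ins: the base class knows only its type,
// so num_values() is not reachable through a Page pointer.
struct Page {
  explicit Page(PageType t) : type(t) {}
  virtual ~Page() = default;
  PageType type;
};

struct DataPage : Page {
  explicit DataPage(int32_t n) : Page(PageType::DATA_PAGE), values(n) {}
  int32_t num_values() const { return values; }
  int32_t values;
};

struct DataPageV2 : Page {
  explicit DataPageV2(int32_t n) : Page(PageType::DATA_PAGE_V2), values(n) {}
  int32_t num_values() const { return values; }
  int32_t values;
};

class ToyRecordReader {
 public:
  // Returns true if the page was a data page and was picked up for decoding.
  bool ReadNewPage(const std::shared_ptr<Page>& page) {
    assert(page != nullptr);
    if (page->type == PageType::DATA_PAGE) {
      InitializeDecoders(std::static_pointer_cast<DataPage>(page));
      return true;
    } else if (page->type == PageType::DATA_PAGE_V2) {
      // Without this branch, V2 pages would be skipped and no data read.
      InitializeDecoders(std::static_pointer_cast<DataPageV2>(page));
      return true;
    }
    return false;
  }

  int64_t num_buffered_values() const { return num_buffered_values_; }

 private:
  // Templated because the two page classes have no common data-page base here.
  template <typename PageT>
  void InitializeDecoders(const std::shared_ptr<PageT>& page) {
    num_buffered_values_ = page->num_values();
  }

  int64_t num_buffered_values_ = 0;
};
```

Both branches call the same helper, which is the code reuse the refactoring aims for; the cost is one template instantiation per page class.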
…on and deserialization.
Rebased. Have you reported an issue to upstream parquetjs?
This looks fine. The fact that DataPage and DataPageV2 do not have a common base seems like an eyesore to me. I'm going to quickly fix that.
template <typename PageType>
int64_t InitializeLevelDecoders(const std::shared_ptr<PageType> page,
                                const Encoding::type repetition_level_encoding,
                                const Encoding::type definition_level_encoding);
const is not needed with Encoding::type
template <typename PageType>
void InitializeDataDecoder(const std::shared_ptr<PageType> page,
                           const int64_t levels_bytes);
const not needed with int64
// Have not decoded any values from the data page yet
num_decoded_values_ = 0;

const uint8_t* buffer = page->data();
Only data() and num_values() are used here. We should have a common base class for DataPageV1 and DataPageV2 instead of having this template.
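A common base class along the lines suggested in this review comment might look like the following sketch. The class names follow the comment (`DataPageV1`, `DataPageV2`, with a shared `DataPage` base), but the members, constructors, and the `CountBufferedValues` helper are simplified assumptions for illustration, not the classes that were eventually merged.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

class Page {
 public:
  virtual ~Page() = default;
};

// Shared base holding what both data page versions have in common.
class DataPage : public Page {
 public:
  const uint8_t* data() const { return buffer_.data(); }
  int64_t size() const { return static_cast<int64_t>(buffer_.size()); }
  int32_t num_values() const { return num_values_; }

 protected:
  DataPage(std::vector<uint8_t> buffer, int32_t num_values)
      : buffer_(std::move(buffer)), num_values_(num_values) {
    assert(num_values_ >= 0);
  }

 private:
  std::vector<uint8_t> buffer_;
  int32_t num_values_;
};

class DataPageV1 : public DataPage {
 public:
  DataPageV1(std::vector<uint8_t> buffer, int32_t num_values)
      : DataPage(std::move(buffer), num_values) {}
};

class DataPageV2 : public DataPage {
 public:
  DataPageV2(std::vector<uint8_t> buffer, int32_t num_values)
      : DataPage(std::move(buffer), num_values) {}
};

// Hypothetical helper: accepts either version through the base class,
// so no template parameter is required.
int64_t CountBufferedValues(const DataPage& page) { return page.num_values(); }
```

With the accessors on the base class, code that only needs `data()`, `size()`, or `num_values()` can take a `DataPage` and stay out of headers as an ordinary function.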
void TypedRecordReader<DType>::InitializeDataDecoder(const std::shared_ptr<PageType> page,
                                                     const int64_t levels_byte_size) {
  const uint8_t* buffer = page->data() + levels_byte_size;
  const int64_t data_size = page->size() - levels_byte_size;
Ditto.
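For readers following the snippet under review, the pointer arithmetic amounts to skipping the level bytes at the front of the page buffer and treating the remainder as encoded values. A minimal standalone sketch, assuming the levels occupy a contiguous prefix of the buffer (`DataRegion` and `LocateValues` are hypothetical names introduced here):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical view of the value bytes that follow the encoded levels.
struct DataRegion {
  const uint8_t* buffer;
  int64_t size;
};

// Skips levels_byte_size bytes at the start of the page and returns the rest.
DataRegion LocateValues(const uint8_t* page_data, int64_t page_size,
                        int64_t levels_byte_size) {
  assert(levels_byte_size >= 0 && levels_byte_size <= page_size);
  return {page_data + levels_byte_size, page_size - levels_byte_size};
}
```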
Hi Wes, thanks for rebasing the code and providing feedback. Yes, I did report this to parquetjs earlier: ironSource/parquetjs#78. I've looked through some of the parquetjs code base, and my understanding is that there is a way to provide an option to the ParquetWriter class in parquetjs that would write DataPageV1 pages. However, the default behavior for parquetjs is still to write DataPageV2 pages.
+1. Will merge once the build looks good
Appveyor is about half passed. Merging
…PageType::DATA_PAGE_V2 to address incompatibility with parquetjs.

**Tests**

This commit doesn't include tests right now; I am working on adding tests and was hoping for some initial feedback on the code changes. I may need to use an actual file generated by `parquetjs` to test this issue, so I wonder if adding `feeds1kMicros.parquet` from the JIRA task to the parquet-testing repository is an option for this.

**Description**

`parquetjs` seems to be writing Parquet V2 files with [`DataPageV2`](https://github.com/apache/parquet-format/blob/e93dd628d90aa076745558998f0bf5d9c262bf22/src/main/thrift/parquet.thrift#L529) pages, while `parquet-cpp` writes Parquet V2 files with [`DataPage`](https://github.com/apache/parquet-format/blob/e93dd628d90aa076745558998f0bf5d9c262bf22/src/main/thrift/parquet.thrift#L492) pages.

Since `TypedRecordReader::ReadNewPage()` only had a branch for `PageType::DATA_PAGE`, the reader would return without reading any data for records that have `DATA_PAGE_V2` pages. This explains the behavior observed in [PARQUET-1482](https://issues.apache.org/jira/projects/PARQUET/issues/PARQUET-1482?filter=allopenissues).

This commit adds a new if-else branch for the `DataPageV2` case in `TypedRecordReader::ReadNewPage()`. Since the `DataPageV2` branch needed to reuse the code from the `DataPage` case, I refactored the repetition/definition level decoder initialization and the data decoder initialization to two new methods in the `TypedRecordReader` class. These new methods are now called by the `DataPage` and `DataPageV2` initialization branches in `TypedRecordReader::ReadNewPage()`.

There is an alternate implementation possible (with a smaller diff) by sharing the same else-if branch between `DataPage` and `DataPageV2` using a pointer-to-derived `shared_ptr<Page>`. However, since the Page superclass doesn't have the necessary `encoding()` or `num_values()` methods, I would need to add a common superclass to both `DataPage` and `DataPageV2` that defined these methods. I didn't do this because I was hesitant to modify the `Page` class hierarchy for this commit.

Author: Wes McKinney <[email protected]>
Author: rdmello <[email protected]>
Author: Rylan Dmello <[email protected]>

Closes #3312 from rdmello/parquet_1482 and squashes the following commits:

c5cb0f3 <Wes McKinney> Add DataPage base class for DataPageV1 and DataPageV2
8df8328 <rdmello> PARQUET-1482: Adding basic unit test for DataPageV2 serialization and deserialization.
9df3222 <Rylan Dmello> PARQUET-1482: Add branch to TypedRecordReader::ReadNewPage for PageType::DATA_PAGE_V2 to address incompatibility with parquetjs.