Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix incomplete row reading issue in parquet files #1262

Merged
merged 1 commit into from
Jan 7, 2021

Conversation

yongtang
Copy link
Member

@yongtang yongtang commented Jan 6, 2021

This PR tries to address the issue raised in #1254 where reading parquet
files will results in InvalidArgumentError: null value in column

The issue comes from the fact that parquet's ColumnReader C++ API
ReadBatch(...) does not necessarily respect the number of rows
requested and may return less instead.

This PR fixes #1254.

Signed-off-by: Yong Tang [email protected]

This PR tries to address the issue raised in 1254 where reading parquet
files will results in `InvalidArgumentError: null value in column`

The issue comes from the fact that parquet's ColumnReader C++ API
`ReadBatch(...)` does not necessarily respect the number of rows
requested and may return less instead.

This PR fixes 1254.

Signed-off-by: Yong Tang <[email protected]>
Comment on lines +165 to +167
// Note: ReadBatch may not be able to read the elements requested
// (row_to_read_count) in one shot, as such we use while loop of
// `while (row_left > 0) {...}` to read until complete.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yongtang the puzzling thing was that this happened only for a particular column in the parquet dataset that was provided. Any reference as to why this might happen?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it'd be great to understand how general this error is. Any workaround for now? When might this fix get into a release?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dgoldenberg-audiomack this PR should fix the issue. Also, can we use your sample parquet dataset in our CI tests (if that is fine with you)?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kvignesh1420 Use the parquet dataset - yes, you can absolutely. So the fix will be in the next release; do you know when that's slated for?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dgoldenberg-audiomack I am not sure about the next release, but you can always use tensorflow-io-nightly for using these immediate fixes.

Copy link
Member

@kvignesh1420 kvignesh1420 Jan 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will merge this PR after the tests pass.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @kvignesh1420 @dgoldenberg-audiomack. The ReadBatch was only encountered for ByteArray. This is related to the internal implementation of Arrow's Parquet cpp. One discrepancy I can think of, is that ByteArray are different from other types where other types are more uniform with size known before hand (e.g., the size can be preallocated for float/bool/int as long as the number of rows are known). For ByteArray (variable size) the allocation might be hard to pre-allocate.

Copy link
Member

@kvignesh1420 kvignesh1420 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks, @yongtang for the info and fix.

@kvignesh1420 kvignesh1420 merged commit a49a806 into tensorflow:master Jan 7, 2021
@yongtang yongtang deleted the 1254-parquet branch January 7, 2021 11:55
i-ony pushed a commit to i-ony/io that referenced this pull request Feb 8, 2021
This PR tries to address the issue raised in 1254 where reading parquet
files will results in `InvalidArgumentError: null value in column`

The issue comes from the fact that parquet's ColumnReader C++ API
`ReadBatch(...)` does not necessarily respect the number of rows
requested and may return less instead.

This PR fixes 1254.

Signed-off-by: Yong Tang <[email protected]>
i-ony pushed a commit to i-ony/io that referenced this pull request Mar 8, 2021
This PR tries to address the issue raised in 1254 where reading parquet
files will results in `InvalidArgumentError: null value in column`

The issue comes from the fact that parquet's ColumnReader C++ API
`ReadBatch(...)` does not necessarily respect the number of rows
requested and may return less instead.

This PR fixes 1254.

Signed-off-by: Yong Tang <[email protected]>
michaelbanfield pushed a commit to michaelbanfield/io that referenced this pull request Mar 30, 2021
This PR tries to address the issue raised in 1254 where reading parquet
files will results in `InvalidArgumentError: null value in column`

The issue comes from the fact that parquet's ColumnReader C++ API
`ReadBatch(...)` does not necessarily respect the number of rows
requested and may return less instead.

This PR fixes 1254.

Signed-off-by: Yong Tang <[email protected]>
@dgoldenberg-audiomack
Copy link

Could folks pls provide any info on what version of TF this fix is/will be available in? Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants