Fix incomplete row reading issue in parquet files #1262
This PR tries to address the issue raised in #1254, where reading parquet files results in `InvalidArgumentError: null value in column`. The issue comes from the fact that parquet's ColumnReader C++ API `ReadBatch(...)` does not necessarily respect the number of rows requested and may return fewer instead. This PR fixes #1254. Signed-off-by: Yong Tang <[email protected]>
// Note: ReadBatch may not be able to read the elements requested
// (row_to_read_count) in one shot, as such we use while loop of
// `while (row_left > 0) {...}` to read until complete.
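The pattern described in the comment above can be sketched in isolation. This is a minimal illustration, not the code from this PR: `MockColumnReader` and `ReadRows` are hypothetical names, and the mock's `ReadBatch` uses a simplified signature rather than the real Parquet C++ one. The only behavior it borrows from the discussion is that a single call may return fewer values than requested.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for a Parquet column reader. Like the real
// ReadBatch, a single call may deliver fewer values than requested.
class MockColumnReader {
 public:
  MockColumnReader(std::vector<int32_t> data, int64_t max_per_call)
      : data_(std::move(data)), max_per_call_(max_per_call) {}

  // Copies at most min(batch_size, max_per_call_, remaining) values into
  // `out` and reports the count through `values_read`.
  void ReadBatch(int64_t batch_size, int32_t* out, int64_t* values_read) {
    const int64_t remaining = static_cast<int64_t>(data_.size()) - pos_;
    const int64_t n = std::min({batch_size, max_per_call_, remaining});
    std::copy(data_.begin() + pos_, data_.begin() + pos_ + n, out);
    pos_ += n;
    *values_read = n;
  }

 private:
  std::vector<int32_t> data_;
  int64_t max_per_call_;
  int64_t pos_ = 0;
};

// Calls ReadBatch repeatedly until `row_to_read_count` rows have been read
// or the reader stops making progress.
std::vector<int32_t> ReadRows(MockColumnReader& reader,
                              int64_t row_to_read_count) {
  std::vector<int32_t> out(row_to_read_count);
  int64_t row_left = row_to_read_count;
  while (row_left > 0) {
    int64_t values_read = 0;
    // Write the next chunk right after the rows already read.
    reader.ReadBatch(row_left, out.data() + (row_to_read_count - row_left),
                     &values_read);
    assert(values_read <= row_left);
    if (values_read == 0) break;  // reader exhausted; avoid spinning forever
    row_left -= values_read;
  }
  out.resize(row_to_read_count - row_left);  // drop any unfilled tail
  return out;
}
```

With a mock that hands back at most 3 values per call, `ReadRows(reader, 7)` needs three calls to collect all 7 rows, whereas a single call would have left the last 4 slots unfilled. The early `break` on zero progress is an addition for the sketch so that a short input cannot cause an infinite loop.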
@yongtang the puzzling thing was that this happened only for a particular column in the parquet dataset that was provided. Any reference as to why this might happen?
Yes, it'd be great to understand how general this error is. Any workaround for now? When might this fix get into a release?
@dgoldenberg-audiomack this PR should fix the issue. Also, can we use your sample parquet dataset in our CI tests (if that is fine with you)?
@kvignesh1420 Use the parquet dataset - yes, you can absolutely. So the fix will be in the next release; do you know when that's slated for?
@dgoldenberg-audiomack I am not sure about the next release, but you can always use `tensorflow-io-nightly` to get these fixes immediately.
Will merge this PR after the tests pass.
Thanks @kvignesh1420 @dgoldenberg-audiomack. The `ReadBatch` issue was only encountered for ByteArray. This is related to the internal implementation of Arrow's Parquet C++ library. One discrepancy I can think of is that ByteArray differs from the other types, which are more uniform, with sizes known beforehand (e.g., the buffer can be preallocated for float/bool/int as long as the number of rows is known). For ByteArray (variable size), the memory may be hard to preallocate.
LGTM. Thanks, @yongtang for the info and fix.
Could folks pls provide any info on what version of TF this fix is/will be available in? Thanks.