Fix incomplete row reading issue in parquet files #1262

yongtang · 2021-01-06T19:13:53Z

This PR tries to address the issue raised in #1254, where reading parquet
files results in InvalidArgumentError: null value in column.

The issue comes from the fact that parquet's ColumnReader C++ API
ReadBatch(...) does not necessarily respect the number of rows
requested and may return fewer instead.
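
The loop-until-complete pattern described here can be sketched as follows. This is a minimal illustration and not the code in this PR: `MockColumnReader` and `ReadRows` are hypothetical names, and the mock only imitates a reader whose `ReadBatch` may hand back fewer values than requested. It is not the parquet-cpp API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for a column reader whose ReadBatch may return
// fewer values than requested. This is NOT the parquet-cpp API.
class MockColumnReader {
 public:
  MockColumnReader(std::vector<int64_t> data, int64_t max_per_call)
      : data_(std::move(data)), max_per_call_(max_per_call) {}

  // Copies up to min(batch_size, max_per_call_, remaining) values into
  // `out` and returns the number of values actually copied.
  int64_t ReadBatch(int64_t batch_size, int64_t* out) {
    const int64_t remaining = static_cast<int64_t>(data_.size()) - pos_;
    const int64_t n = std::min({batch_size, max_per_call_, remaining});
    std::copy(data_.begin() + pos_, data_.begin() + pos_ + n, out);
    pos_ += n;
    return n;
  }

 private:
  std::vector<int64_t> data_;
  int64_t max_per_call_;
  int64_t pos_ = 0;
};

// Reads exactly `row_to_read_count` values by looping until none are left.
// Returns false if the reader runs dry before the request is satisfied.
bool ReadRows(MockColumnReader& reader, int64_t row_to_read_count,
              std::vector<int64_t>* out) {
  out->resize(row_to_read_count);
  int64_t row_left = row_to_read_count;
  while (row_left > 0) {
    int64_t* dest = out->data() + (row_to_read_count - row_left);
    const int64_t values_read = reader.ReadBatch(row_left, dest);
    assert(values_read <= row_left);
    if (values_read == 0) return false;  // no progress: stop instead of spinning
    row_left -= values_read;
  }
  return true;
}
```

A single call to `ReadBatch` on this mock would leave the tail of the output buffer unfilled whenever `max_per_call` is smaller than the request. The loop closes that gap, and the zero-progress check keeps it from spinning forever on a truncated column.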

This PR fixes #1254.

Signed-off-by: Yong Tang [email protected]


kvignesh1420 · 2021-01-06T19:26:35Z

tensorflow_io/core/kernels/parquet_kernels.cc

+      // Note: ReadBatch may not be able to read the elements requested
+      // (row_to_read_count) in one shot, as such we use while loop of
+      // `while (row_left > 0) {...}` to read until complete.


@yongtang the puzzling thing was that this happened only for a particular column in the parquet dataset that was provided. Any reference as to why this might happen?

Yes, it'd be great to understand how general this error is. Any workaround for now? When might this fix get into a release?

@dgoldenberg-audiomack this PR should fix the issue. Also, can we use your sample parquet dataset in our CI tests (if that is fine with you)?

@kvignesh1420 Use the parquet dataset - yes, you can absolutely. So the fix will be in the next release; do you know when that's slated for?

@dgoldenberg-audiomack I am not sure about the next release, but you can always use tensorflow-io-nightly to get these fixes right away.

Will merge this PR after the tests pass.

Thanks @kvignesh1420 @dgoldenberg-audiomack. The incomplete ReadBatch was only encountered for ByteArray. This is related to the internal implementation of Arrow's Parquet cpp. One discrepancy I can think of is that ByteArray differs from the other types: the other types are uniform, with a size known beforehand (e.g., the buffer for float/bool/int can be preallocated as long as the number of rows is known). For ByteArray (variable size), the buffer might be hard to preallocate.
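
To make the fixed-size versus variable-size distinction concrete, here is a small sketch. It only illustrates the hypothesis above and says nothing about how Arrow's Parquet C++ reader actually allocates memory. `FixedWidthBytes` and `VariableWidthBytes` are hypothetical helper names, and `std::string` stands in for a variable-length byte array.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Fixed-width values: the buffer size follows from the row count alone,
// so it can be computed before any value is decoded.
template <typename T>
std::size_t FixedWidthBytes(std::size_t row_count) {
  return row_count * sizeof(T);
}

// Variable-length values: the total size depends on each value's length,
// so it is only known once the values themselves are available.
std::size_t VariableWidthBytes(const std::vector<std::string>& values) {
  std::size_t total = 0;
  for (const std::string& v : values) total += v.size();
  return total;
}
```

For four `int32_t` rows the size is known up front, whereas three strings can occupy anything from zero bytes upward depending on their contents.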

kvignesh1420

LGTM. Thanks, @yongtang, for the info and the fix.


dgoldenberg-audiomack · 2021-05-13T19:44:14Z

Could folks pls provide any info on what version of TF this fix is/will be available in? Thanks.

yongtang mentioned this pull request Jan 6, 2021

"InvalidArgumentError: null value in column" when loading Parquet into tfio.IODataset when column has no actual null values #1254

Closed

kvignesh1420 reviewed Jan 6, 2021


kvignesh1420 approved these changes Jan 7, 2021


kvignesh1420 merged commit a49a806 into tensorflow:master Jan 7, 2021

yongtang deleted the 1254-parquet branch January 7, 2021 11:55

dgoldenberg-audiomack mentioned this pull request Apr 2, 2021

NotImplementedError: unable to open file: libtensorflow_io.so #1313

Closed
