Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Gracefully handle malformed fields with trailing bytes in the data #413

Merged
merged 5 commits into from
Nov 14, 2023

Conversation

mukunku
Copy link
Contributor

@mukunku mukunku commented Oct 7, 2023

Summary

One of my users recently shared a bug where they couldn't read INT64 columns exported from Oracle: mukunku/ParquetViewer#81

The error being:

Destination is too short. (Parameter 'destination')
at Parquet.Encodings.ParquetPlainEncoder.Decode(Span`1 source, Span`1 data)
   at Parquet.Encodings.ParquetPlainEncoder.Decode(Array data, Int32 offset, Int32 count, SchemaElement tse, Span`1 source, Int32& elementsRead)
   at Parquet.File.DataColumnReader.ReadColumn(Span`1 src, Encoding encoding, Int64 totalValuesInChunk, Int32 totalValuesInPage, PackedColumn pc)
   at Parquet.File.DataColumnReader.ReadDataPageV1Async(PageHeader ph, PackedColumn pc)
   at Parquet.File.DataColumnReader.ReadAsync(CancellationToken cancellationToken)
   at ParquetViewer.Engine.ParquetEngine.ReadPrimitiveField(DataTable dataTable, ParquetRowGroupReader groupReader, Int32 rowBeginIndex, ParquetSchemaElement field, Int64 skipRecords, Int64 readRecords, Boolean isFirstColumn, Dictionary`2 rowLookupCache, CancellationToken cancellationToken, IProgress`1 progress)
   at ParquetViewer.Engine.ParquetEngine.ProcessRowGroup(DataTable dataTable, ParquetRowGroupReader groupReader, Int64 skipRecords, Int64 readRecords, CancellationTok

At the time I investigated this issue and concluded that the file must be malformed. But I've been monitoring for this exception since then and I've noticed a few more users continuing to get the same error:
image

I also noticed other libraries don't seem to have issues opening this file. So even though the file is malformed in my opinion, would it still be worth gracefully processing such malformed fields so parquet-dotnet doesn't fall behind the competition? I mean, even if a file is malformed, if other libraries are supporting it but parquet-dotnet isn't that might cause people to prefer other libraries over this one.

However if this PR doesn't make sense I'm happy to close it out. Just wanted to get your opinion.

Copy link
Owner

@aloneguid aloneguid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good to me, thank you!

@aloneguid aloneguid merged commit 56bd861 into aloneguid:master Nov 14, 2023
@aloneguid aloneguid added this to the 4.17.0 milestone Nov 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants