Added support for remaining non-nested datatypes #336

jorgecarleitao · 2021-08-24T19:07:44Z

This PR adds support for dictionary encoding (parquet) for the remaining non-nested datatypes.

This is demonstrated by:

- a roundtrip test of all columns in the generated_dictionary IPC file pair (containing dictionaries with different types and validities)
- a rust -> pyspark test (equality over all values)

Together, they demonstrate that

- spark reads dictionary-encoded arrays written by this crate
- this crate reads dictionary-encoded arrays written by this crate

where "read" is in the sense that data integrity is preserved over all values (nulls or not).
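
To illustrate the property the roundtrip test checks, here is a minimal std-only sketch of dictionary encoding with nulls. It does not use this crate's API: `DictColumn`, `encode`, and `decode` are hypothetical names chosen for the example, and the real implementation works on Arrow arrays and parquet pages rather than plain vectors.

```rust
use std::collections::HashMap;

/// Hypothetical dictionary-encoded column: each slot is an optional key
/// into `values`; `None` marks a null slot (the validity information).
#[derive(Debug, PartialEq)]
struct DictColumn {
    keys: Vec<Option<u32>>,
    values: Vec<String>,
}

/// Encode a nullable string column, assigning keys in order of first appearance.
fn encode(column: &[Option<&str>]) -> DictColumn {
    let mut index: HashMap<&str, u32> = HashMap::new();
    let mut values: Vec<String> = Vec::new();
    let mut keys = Vec::with_capacity(column.len());
    for slot in column {
        let key = slot.map(|s| {
            *index.entry(s).or_insert_with(|| {
                values.push(s.to_string());
                (values.len() - 1) as u32
            })
        });
        keys.push(key);
    }
    DictColumn { keys, values }
}

/// Decode back to a nullable column by looking up each key.
fn decode(column: &DictColumn) -> Vec<Option<String>> {
    column
        .keys
        .iter()
        .map(|key| key.map(|k| column.values[k as usize].clone()))
        .collect()
}

fn main() {
    let original = vec![Some("a"), None, Some("b"), Some("a"), None];
    let encoded = encode(&original);
    let decoded = decode(&encoded);

    // Integrity over all values, nulls included: decoding must reproduce the input.
    let expected: Vec<Option<String>> =
        original.iter().map(|v| v.map(str::to_string)).collect();
    assert_eq!(decoded, expected);
}
```

The assertion compares every slot, so a null that came back as a value (or the reverse) would fail the check just as a wrong value would.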

Tests against pyarrow currently fail, likely due to ARROW-13486 and ARROW-13487.

codecov · 2021-08-24T19:32:04Z

Codecov Report

Merging #336 (b7e2352) into main (8e96ec4) will increase coverage by 0.01%.
The diff coverage is 66.03%.

@@            Coverage Diff             @@
##             main     #336      +/-   ##
==========================================
+ Coverage   80.96%   80.97%   +0.01%     
==========================================
  Files         325      326       +1     
  Lines       21077    21167      +90     
==========================================
+ Hits        17066    17141      +75     
- Misses       4011     4026      +15

| Impacted Files | Coverage Δ | |
|---|---|---|
| src/io/parquet/read/mod.rs | `51.21% <30.55%> (-18.79%)` | ⬇️ |
| src/io/parquet/read/binary/dictionary.rs | `80.00% <80.00%> (ø)` | |
| src/io/parquet/write/dictionary.rs | `72.72% <100.00%> (+29.48%)` | ⬆️ |
| src/io/parquet/write/mod.rs | `51.28% <100.00%> (ø)` | |
| src/io/parquet/write/record_batch.rs | `91.30% <100.00%> (ø)` | |
| tests/it/io/parquet/mod.rs | `94.93% <100.00%> (ø)` | |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 8e96ec4...b7e2352.

jorgecarleitao added the enhancement (An improvement to an existing feature) label Aug 24, 2021

Added support for more dictionary-encoded types.

b7e2352

jorgecarleitao force-pushed the dict_parquet branch from fa27215 to b7e2352 August 24, 2021 19:19

jorgecarleitao marked this pull request as ready for review August 24, 2021 19:20

jorgecarleitao changed the title ~~Added support for remaining non-nested datatypes.~~ Added support for remaining non-nested datatypes Aug 24, 2021

jorgecarleitao merged commit b2a1233 into main Aug 24, 2021

jorgecarleitao deleted the dict_parquet branch August 24, 2021 21:06
