This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Added support for remaining non-nested datatypes #336

Merged 1 commit on Aug 24, 2021

Conversation

jorgecarleitao (Owner) commented Aug 24, 2021

This PR adds support for dictionary encoding (parquet) for the remaining non-nested datatypes.

This is demonstrated by:

  1. a roundtrip test of all columns in the generated_dictionary IPC file pair (containing dictionaries with different types and validities)
  2. a rust -> pyspark test (equality over all values)

Together, they demonstrate that

  • spark reads dictionary-encoded arrays written by this crate
  • this crate reads dictionary-encoded arrays written by this crate

Here "reads" means that data integrity is preserved over all values, null or not.
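
As an illustration of the roundtrip property described above, here is a minimal sketch using only the Rust standard library. It is not this crate's API: `DictColumn`, `encode` and `decode` are hypothetical names introduced for the example, and the column is a plain slice of optional strings rather than an Arrow array. It only shows what "integrity is preserved over all values (nulls or not)" means for a dictionary-encoded column.

```rust
use std::collections::HashMap;

/// Hypothetical dictionary-encoded column (not a type from this crate):
/// each slot is either null (`None`) or an index into `values`.
#[derive(Debug, PartialEq)]
struct DictColumn {
    keys: Vec<Option<u32>>,
    values: Vec<String>,
}

/// Encode a column of optional strings, storing each distinct value once.
fn encode(column: &[Option<&str>]) -> DictColumn {
    let mut index: HashMap<&str, u32> = HashMap::new();
    let mut values: Vec<String> = Vec::new();
    let mut keys: Vec<Option<u32>> = Vec::with_capacity(column.len());

    for slot in column {
        let key = match slot {
            Some(s) => {
                // Reuse the existing key if the value was seen before,
                // otherwise append it to the dictionary.
                let next = values.len() as u32;
                let k = *index.entry(*s).or_insert(next);
                if k == next {
                    values.push(s.to_string());
                }
                Some(k)
            }
            // Nulls carry no key; validity is preserved as `None`.
            None => None,
        };
        keys.push(key);
    }
    DictColumn { keys, values }
}

/// Decode back to the original optional strings.
fn decode(dict: &DictColumn) -> Vec<Option<&str>> {
    dict.keys
        .iter()
        .map(|k| k.map(|i| dict.values[i as usize].as_str()))
        .collect()
}
```

A roundtrip check in this sketch amounts to asserting `decode(&encode(&column)) == column` for a column that mixes repeated values and nulls, which is the same shape of assertion as the tests listed above, only without the parquet file in between.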

Tests against pyarrow currently fail, likely due to ARROW-13486 and ARROW-13487.

@jorgecarleitao jorgecarleitao added the enhancement An improvement to an existing feature label Aug 24, 2021
@jorgecarleitao jorgecarleitao marked this pull request as ready for review August 24, 2021 19:20
@jorgecarleitao jorgecarleitao changed the title Added support for remaining non-nested datatypes. Added support for remaining non-nested datatypes Aug 24, 2021
codecov bot commented Aug 24, 2021

Codecov Report

Merging #336 (b7e2352) into main (8e96ec4) will increase coverage by 0.01%.
The diff coverage is 66.03%.

@@            Coverage Diff             @@
##             main     #336      +/-   ##
==========================================
+ Coverage   80.96%   80.97%   +0.01%     
==========================================
  Files         325      326       +1     
  Lines       21077    21167      +90     
==========================================
+ Hits        17066    17141      +75     
- Misses       4011     4026      +15     

Impacted Files                              Coverage Δ
src/io/parquet/read/mod.rs                  51.21% <30.55%> (-18.79%) ⬇️
src/io/parquet/read/binary/dictionary.rs    80.00% <80.00%> (ø)
src/io/parquet/write/dictionary.rs          72.72% <100.00%> (+29.48%) ⬆️
src/io/parquet/write/mod.rs                 51.28% <100.00%> (ø)
src/io/parquet/write/record_batch.rs        91.30% <100.00%> (ø)
tests/it/io/parquet/mod.rs                  94.93% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8e96ec4...b7e2352.

@jorgecarleitao jorgecarleitao merged commit b2a1233 into main Aug 24, 2021
@jorgecarleitao jorgecarleitao deleted the dict_parquet branch August 24, 2021 21:06