Skip to content
This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Sampling tests for parquet round trips #1519

Merged
merged 1 commit into from
Jul 27, 2023

Conversation

AnIrishDuck
Copy link
Contributor

This uses sample-test, and sample-arrow2, which we built specifically for this purpose. See the README for why we felt quickcheck and proptest were unsuitable.

I'm not sure if we'd rather have the libraries be optional [dependencies] instead of [dev-dependencies] (which cannot be optional). I figured this should be behind a feature flag though.

When run exhaustively (see the commented TODO lines), this appears to unearth more errors in the parquet IO code. Issues appear to trigger with nesting and nullable fields in combination. Some examples:

"assertion failed: `(left == right)`\n 
left: `[Chunk { arrays: [ListArray[[{wttdx: -3458878700, yjsyrg: [None]}]]] }]`,\n
right: `[Chunk { arrays: [ListArray[[{wttdx: -3458878700, yjsyrg: None}]]] }]`"
"assertion failed: `(left == right)`\n  
left: `[Chunk { arrays: [ListArray[[[None]]]] }]`,\n 
right: `[Chunk { arrays: [ListArray[[None]]] }]`"

My prior experience with the def/rep level encoder obviously leads me to suspect that code. I know it was recently rewritten, but it's a very complex subject and I'm not shocked that it may need more work.

Let me know how I can help assist. In particular, the shrinking behavior in sample-arrow2 is suboptimal due to chained resampling and some implementation hacks that can probably be improved. I can definitely assist if you're playing around with this and are having trouble shrinking back to useful exemplars.

Setting the chunk length to a small value appears to be generating good counterexamples for now.

@codecov
Copy link

codecov bot commented Jul 7, 2023

Codecov Report

Patch coverage has no change and project coverage change: -0.01 ⚠️

Comparison is base (8ee5ad8) 83.51% compared to head (1e624c0) 83.50%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1519      +/-   ##
==========================================
- Coverage   83.51%   83.50%   -0.01%     
==========================================
  Files         388      388              
  Lines       42024    42024              
==========================================
- Hits        35096    35093       -3     
- Misses       6928     6931       +3     

see 5 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

@ritchie46
Copy link
Collaborator

it's a very complex subject and I'm not shocked that it may need more work.

Me neither. I don't understand this well enough to fix atm, so any help is great. I am working on a separate parquet implementation, but progress is slow.

@ritchie46 ritchie46 merged commit f175c1c into jorgecarleitao:main Jul 27, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants