Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: support type coercion in Parquet Reader #6458

Merged
merged 15 commits into from
Jun 6, 2023

Conversation

e1ijah1
Copy link
Contributor

@e1ijah1 e1ijah1 commented May 26, 2023

Which issue does this PR close?

Closes #6427 .

Rationale for this change

What changes are included in this PR?

Support type coercion in Parquet Reader

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added the core Core DataFusion crate label May 26, 2023
@e1ijah1 e1ijah1 marked this pull request as ready for review May 29, 2023 02:07
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @e1ijah1 ! This is looking very cool

I think to really complete this feature we should have an "end to end test" -- like actually creating parquet files with two different schemas and showing how they can be read as a single table using this feature

Perhaps we could add a test to https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/tests/parquet, following the model of

https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/tests/parquet/custom_reader.rs

@tustvold since you filed #6427 do you have some idea about how this feature would be used?

datafusion/core/src/physical_plan/file_format/mod.rs Outdated Show resolved Hide resolved
@alamb alamb marked this pull request as draft May 30, 2023 13:36
@alamb
Copy link
Contributor

alamb commented May 30, 2023

Marking as draft as we are waiting on feedback

@tustvold
Copy link
Contributor

do you have some idea about how this feature would be used

The use-case is where you have one or more parquet files, with different schema. You can provide a file_schema to FileScanConfig (or to ListingTableConfig) and have underlying data coerced to that schema on read. An example might be if a column has changed from a StringArray to DictionaryArray or vice-versa

@e1ijah1
Copy link
Contributor Author

e1ijah1 commented May 30, 2023

Thank you for your valuable feedback, @alamb and @tustvold ❤️. I've added an end-to-end test to ensure the type coercion functionality in the Parquet reader works as expected. Please let me know if there are any other improvements you'd like to see.

@e1ijah1 e1ijah1 force-pushed the fix/6427/type-cast-in-parquet branch from a853cce to 6eb5353 Compare May 30, 2023 22:50
@e1ijah1 e1ijah1 marked this pull request as ready for review June 1, 2023 02:29
@e1ijah1 e1ijah1 requested a review from alamb June 1, 2023 03:24
@e1ijah1 e1ijah1 force-pushed the fix/6427/type-cast-in-parquet branch from 5287f6d to 030a631 Compare June 3, 2023 02:57
@e1ijah1 e1ijah1 requested a review from tustvold June 5, 2023 01:37
Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you this looks good to me

@tustvold tustvold merged commit 70d633a into apache:main Jun 6, 2023
@tustvold
Copy link
Contributor

tustvold commented Jun 6, 2023

Getting this in to avoid conflicts with #6374

@tustvold
Copy link
Contributor

tustvold commented Jun 6, 2023

I filed a follow on PR #6563 that avoids needing to recompute the mapping for each batch

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support Type Coercion in Parquet Reader
3 participants