Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix partition columns when they they are not at the end of the schema (fix Removing Partition Columns from RecordBatch) #9276

Merged
merged 4 commits into from
Feb 21, 2024

Conversation

devinjdangelo
Copy link
Contributor

@devinjdangelo devinjdangelo commented Feb 19, 2024

Which issue does this PR close?

closes #9290

Rationale for this change

The current logic for removing partition_by columns assumes that the columns are at the end of the record batch. This is not always true when using COPY with the partition_by option.

What changes are included in this PR?

Removes columns from record batch based on name rather than position.

Are these changes tested?

Yes, modified copy.slt to cover

Are there any user-facing changes?

Yes, partitioned COPY statements will work correctly even if specifying columns that do not come at the end of the schema

@github-actions github-actions bot added core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels Feb 19, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @devinjdangelo -- I don't fully understand what problem this PR is fixing. Is this fixing wrong results? Or a panic? Or something else 🤔

Could you possibly add a new test (rather than modify the existing one) so it is clearer?

@@ -53,32 +53,32 @@ select * from validate_partitioned_parquet_bar order by col1;
2

# Copy to directory as partitioned files
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be better to add a new test (not modify an existing test) to show that the existing behavior is not changed

@devinjdangelo
Copy link
Contributor Author

Thanks @devinjdangelo -- I don't fully understand what problem this PR is fixing. Is this fixing wrong results? Or a panic? Or something else 🤔

Could you possibly add a new test (rather than modify the existing one) so it is clearer?

Apologies for not properly documenting the issue up front. I created #9290 to document the incorrect behavior that this PR is fixing. Essentially, ListingTables enforced a particular ordering on the columns which Copy alone does not. The logic for eliminating the partition columns from the RecordBatches to be written to disk needed to updated now that it cannot rely on the fixed column ordering.

I also restored the original test and added a new one instead of modifying.

@alamb
Copy link
Contributor

alamb commented Feb 21, 2024

Thanks @devinjdangelo -- I don't fully understand what problem this PR is fixing. Is this fixing wrong results? Or a panic? Or something else 🤔
Could you possibly add a new test (rather than modify the existing one) so it is clearer?

Apologies for not properly documenting the issue up front. I created #9290 to document the incorrect behavior that this PR is fixing. Essentially, ListingTables enforced a particular ordering on the columns which Copy alone does not. The logic for eliminating the partition columns from the RecordBatches to be written to disk needed to updated now that it cannot rely on the fixed column ordering.

I also restored the original test and added a new one instead of modifying.

Thank you so much -- I am sorry for not being able to see it immediately. I struggle these days to keep up with everthing going on and don't seem to have the ability to research things as much as I would like to

@alamb alamb changed the title Fix Logic for Removing Partition Columns from RecordBatch Support partition columns when they they are not at the end of the schema (fix Removing Partition Columns from RecordBatch) Feb 21, 2024
@alamb alamb changed the title Support partition columns when they they are not at the end of the schema (fix Removing Partition Columns from RecordBatch) Fix partition columns when they they are not at the end of the schema (fix Removing Partition Columns from RecordBatch) Feb 21, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @devinjdangelo -- this looks great

@alamb alamb merged commit a22b413 into apache:main Feb 21, 2024
25 of 26 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CopyTo with partition_by gives incorrect result if partition cols are not at end of schema
2 participants