Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Minor: Add repartition_file.slt end to end test for repartitioning files, and supporting tweaks #8505

Merged
merged 2 commits into from
Dec 13, 2023

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Dec 11, 2023

Which issue does this PR close?

This is part of #8451

Rationale for this change

There are no good end to end tests of repartitioned scans that I could find (because the default size threshold for repartitioning is 10MB, and most tests use files smaller than this)

What changes are included in this PR?

  1. Add end to end tests for repartitioning
  2. Fix a bug in sqllogictest harness that meant --complete mode didn't clear scratch files (I think @tustvold hit this at some point)
  3. FIx a bug in SINGLE_FILE_OUTPUT that caused it to fail if the parent directory didn't exist
  4. Sort files by path beore splitting them in ListingTable to ensure deterministic order

Are these changes tested?

Yes -- there are several new tests added

Are there any user-facing changes?

Two bug fixes as described above

@@ -131,6 +131,10 @@ impl ListingTableUrl {
if is_directory {
fs::create_dir_all(path)?;
} else {
// ensure parent directory exists
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without this I can't write into /dir/table_name/1.parquet if /dir/table_name doesn't exist already

This is not enabled by deafult, it only affects when the user has specified SINGLE_FILE_OUTPUT explicitly

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aah, I still need to fix this, this is a weird hold over from append

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What needs to be fixed? I would like to write up a catalog

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no reason you should need to create these files ahead of time anymore, it was a consequence of the head request being performed to accommodate the append semantics - I can add it to my list to clean it up

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As in the object_store LocalFile::put implementation should automatically create the paths?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR in #8539

@@ -159,6 +159,7 @@ async fn run_test_file_with_postgres(test_file: TestFile) -> Result<()> {
relative_path,
} = test_file;
info!("Running with Postgres runner: {}", path.display());
setup_scratch_dir(&relative_path)?;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

without this complete mode didn't clear the scratch dir resulting in confusing errors

SortPreservingMergeExec: [column1@0 ASC NULLS LAST]
--CoalesceBatchesExec: target_batch_size=8192
----FilterExec: column1@0 != 42
------ParquetExec: file_groups={4 groups: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/1.parquet:0..200], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/1.parquet:200..400], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/1.parquet:400..403, WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/2.parquet:0..197], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/2.parquet:197..394]]}, projection=[column1], predicate=column1@0 != 42, pruning_predicate=column1_min@0 != 42 OR 42 != column1_max@1
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually incorrect because data from file1 appended to file2 is not sorted by column1 anymore -- #8451

(note the query gets the right answer because only the first group actually makes any values for 1.parquet as it is tiny)

@alamb alamb marked this pull request as ready for review December 11, 2023 22:22
@alamb
Copy link
Contributor Author

alamb commented Dec 12, 2023

The test is failing because the order of files listed is not consistent. I am working on a fix for these.

Update: fixed

@alamb alamb marked this pull request as draft December 12, 2023 12:11
@github-actions github-actions bot added core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels Dec 12, 2023
@alamb alamb marked this pull request as ready for review December 12, 2023 13:26
@alamb
Copy link
Contributor Author

alamb commented Dec 13, 2023

Thank you for the review @tustvold 🙏

@alamb alamb merged commit 5bf80d6 into apache:main Dec 13, 2023
23 checks passed
@alamb alamb deleted the alamb/repartition_tests branch December 14, 2023 14:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants