
Support SortMergeJoin spilling #11218

Merged (11 commits) Jul 22, 2024
Conversation


@comphead comphead commented Jul 2, 2024

Which issue does this PR close?

Closes #9359 .

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@comphead comphead marked this pull request as draft July 2, 2024 17:01
@github-actions github-actions bot added the core Core DataFusion crate label Jul 2, 2024
comphead commented Jul 4, 2024

All existing spilling tests pass; I will add three more tests to exercise the spilling.

@github-actions github-actions bot added sql SQL Planner logical-expr Logical plan and expressions physical-expr Physical Expressions optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt) substrait labels Jul 8, 2024
@github-actions github-actions bot removed sql SQL Planner logical-expr Logical plan and expressions physical-expr Physical Expressions optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt) substrait labels Jul 8, 2024
comphead commented Jul 8, 2024

The multi-batch spill tests still fail.

comphead commented Jul 9, 2024

All initial tests passed. I'm planning to add more tests for result correctness in a separate PR.

-                "Spill file {:?} does not exist",
-                spill.path()
-            )));
+            return internal_err!("Spill file {:?} does not exist", spill.path());
comphead (Contributor Author):

This is a cleanup.
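As background, here is a hypothetical sketch (not DataFusion's actual definition) of how an `internal_err!`-style macro collapses the format-and-return-`Err` boilerplate seen in the removed lines; the error type and function names below are stand-ins:

```rust
// Hypothetical stand-in for DataFusion's error type; the real one is
// DataFusionError in DataFusion's common crate.
#[derive(Debug, PartialEq)]
enum AppError {
    Internal(String),
}

// A macro in the spirit of `internal_err!`: it formats the message and
// builds the `Err` in one expression, replacing the longer
// `return Err(AppError::Internal(format!(...)))` form from the diff.
macro_rules! internal_err {
    ($($arg:tt)*) => {
        Err(AppError::Internal(format!($($arg)*)))
    };
}

// Stand-in for the spill-file existence check in the PR.
fn check_spill_file(path: &str, exists: bool) -> Result<(), AppError> {
    if !exists {
        return internal_err!("Spill file {:?} does not exist", path);
    }
    Ok(())
}

fn main() {
    assert!(check_spill_file("/tmp/spill-0", true).is_ok());
    assert!(check_spill_file("/tmp/spill-1", false).is_err());
    println!("ok");
}
```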

Contributor:

drive by cleanups: 👍

@comphead comphead marked this pull request as ready for review July 9, 2024 16:26
@comphead comphead requested review from viirya and alamb July 9, 2024 16:26
viirya commented Jul 9, 2024

I will review this in the next few days.

@alamb alamb changed the title Support SortMerge spilling Support SortMergeJoin spilling Jul 9, 2024
Comment on lines 192 to 204
TestCase::new()
.with_query(
"select t1.* from t t1 JOIN t t2 ON t1.pod = t2.pod AND t1.time = t2.time",
)
.with_memory_limit(1_000)
.with_config(config)
.with_disk_manager_config(DiskManagerConfig::NewOs)
.run()
.await
viirya (Member):

I wonder how we know whether it triggers spilling or not?

viirya (Member):

Can we check metrics?

comphead (Contributor Author):

Yeah, that's a great idea. I was overthinking how to check that the file spilled to disk, but metrics are much easier; I'm adding it.

comphead (Contributor Author):

Done

comphead (Contributor Author):

@viirya I added metrics tests in sort_merge_join.rs like https://github.com/apache/datafusion/pull/11218/files#diff-825342e035aec56595dce761afb00dd54e3ae663a2e24ebf3a597123e636f9e2R3140

For this exact test, which runs at the SQL level, I'm thinking about whether I can access the metrics somehow.

comphead (Contributor Author):

It doesn't seem possible to access any metrics in this case. We can rely on the fact that if the test with spilling disabled fails on memory issues, then the same test with spilling enabled passes. Hope that is enough.

@comphead comphead requested a review from viirya July 12, 2024 20:58
alamb commented Jul 12, 2024

I plan to review this PR later today -- sorry for the delay

}

#[tokio::test]
async fn sort_merge_join_spill() {
viirya (Member):

This test case can only make sure the query can run; it may or may not be spilling.

We should have some way to verify that spilling actually happened.

comphead (Contributor Author):

Unfortunately, in exactly this test case we cannot access any spilling metrics. However, there is another test above, sort_merge_join_no_spill, which is exactly the same but has spilling explicitly disabled and expectedly fails with a memory issue. This test passes without issues with spilling enabled, so we can conclude that spilling happened.

Comment on lines 253 to 258
self.join_type,
on,
self.filter
.as_ref()
.map(|f| format!(", filter={}", f.expression()))
.unwrap_or("".to_string())
viirya (Member):

Why move the code?

comphead (Contributor Author):

Inlined the filter display and changed map_or_else to map with a default.
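The refactor described here is a general `Option` idiom; a minimal sketch with a plain `&str` standing in for the PR's filter type (whose display goes through `f.expression()`):

```rust
// Stand-in for the PR's display logic: `filter` here is a plain Option<&str>
// rather than DataFusion's join filter type.
fn display_filter(filter: Option<&str>) -> String {
    // After the refactor: `map` plus a default via `unwrap_or`, replacing
    // the earlier `map_or_else(|| "".to_string(), ...)` form.
    filter.map(|f| format!(", filter={f}")).unwrap_or("".to_string())
}

fn main() {
    // Both spellings produce identical output.
    let before = Some("a > 1").map_or_else(|| "".to_string(), |f| format!(", filter={f}"));
    assert_eq!(before, display_filter(Some("a > 1")));
    assert_eq!(display_filter(Some("a > 1")), ", filter=a > 1");
    assert_eq!(display_filter(None), "");
    println!("ok");
}
```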

Comment on lines +894 to +897
if buffered_batch.spill_file.is_none() && buffered_batch.batch.is_some() {
self.reservation
.try_shrink(buffered_batch.size_estimation)?;
}
viirya (Member):

We should also handle the other cases, i.e., spill_file is Some and batch is also Some, both are None, etc.

alamb (Contributor):

I think those cases are not possible but the current code doesn't make that clear

Here is a proposal that I think makes it clearer what states are possible: comphead#297

// If the batch was spilled to disk, less likely
(Some(spill_file), None) => {
let mut buffered_cols: Vec<ArrayRef> =
Vec::with_capacity(buffered_indices.len());
viirya (Member):

buffered_indices.len() is the length of the arrays. I think the capacity should be the number of columns in the batch.

comphead (Contributor Author):

They should be the same, right? The take kernel will check the bounds.
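For background on why the capacity hint is a sizing nit rather than a memory-safety issue (the take kernel separately bounds-checks indices): Rust's `Vec::with_capacity` only pre-allocates, as this minimal std-only sketch shows. The `preallocated` helper is hypothetical, not from the PR:

```rust
// `Vec::with_capacity` only pre-allocates; the length starts at zero.
fn preallocated(hint: usize) -> Vec<String> {
    Vec::with_capacity(hint)
}

fn main() {
    // An oversized hint (e.g. row count instead of column count) wastes some
    // allocation but cannot cause out-of-bounds behavior on its own.
    let mut cols = preallocated(1_000);
    assert_eq!(cols.len(), 0);
    assert!(cols.capacity() >= 1_000);

    // Pushing fewer elements than the capacity is perfectly fine.
    cols.push("col_a".to_string());
    cols.push("col_b".to_string());
    assert_eq!(cols.len(), 2);
    println!("ok");
}
```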

/// Spill the `RecordBatch` to disk as smaller batches
/// split by `batch_size_rows`
/// Return `total_rows`, the number of rows spilled
pub fn spill_record_batch_by_size(
viirya (Member):

Where is this function used other than in tests? I can't find it.

alamb (Contributor):

I think it may be left over from an earlier version of this PR

comphead (Contributor Author):

Yes, I'm planning to keep it and reuse it in row_hash in a following PR; the sub-batch slicing is basically from row_hash.rs.
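The sub-batch slicing described above can be sketched as follows: a std-only illustration with a hypothetical `sub_batches` helper returning `(offset, length)` pairs, standing in for slicing a real `RecordBatch`:

```rust
// Stand-in for walking a batch with Arrow-style zero-copy slices: split
// `total_rows` rows into sub-batches of at most `batch_size_rows`, returning
// (offset, length) pairs in the order a spill routine would write them.
fn sub_batches(total_rows: usize, batch_size_rows: usize) -> Vec<(usize, usize)> {
    let mut out = Vec::new();
    let mut offset = 0;
    while offset < total_rows {
        let len = batch_size_rows.min(total_rows - offset);
        out.push((offset, len));
        offset += len;
    }
    out
}

fn main() {
    // 10 rows spilled in sub-batches of 4 rows → 4 + 4 + 2.
    let parts = sub_batches(10, 4);
    assert_eq!(parts, vec![(0, 4), (4, 4), (8, 2)]);
    // The total spilled row count equals the input row count.
    assert_eq!(parts.iter().map(|(_, len)| len).sum::<usize>(), 10);
    println!("ok");
}
```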

comphead (Contributor Author):

Thanks @viirya for your review, I'll address the comments today/tomorrow

@alamb alamb left a comment

Thank you @comphead and @viirya

I think this code is now correct, though I also think it could be improved (with the comments from @viirya and my suggestion in comphead#297, as well as more testing)

Specifically, for testing, given the subtlety of the code involved I am not 100% sure it works for all corner cases. I suggest (as a follow on) we invest in fuzz testing both for SMJ in general as well as for spilling SMJ

https://github.com/apache/datafusion/blob/6c0e4fb5d9ac7a0a2f2b91f8b88d21f0bc0b4424/datafusion/core/tests/fuzz_cases/join_fuzz.rs#L50-L49

In particular, I think we should make sure the random inputs have varying numbers of repeated values (as the code in this PR is only exercised when there are many identical join keys, I think)
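One way to sketch inputs with controlled key duplication for such a fuzz test — a std-only illustration with a hypothetical `keys_with_repeats` helper, not the actual join_fuzz.rs harness:

```rust
// A deterministic generator (std-only, no `rand` crate) sketching how a
// join-fuzz input could control key duplication: `distinct` distinct key
// values spread over `len` rows, so a smaller `distinct` means more repeats.
fn keys_with_repeats(len: usize, distinct: u64, seed: u64) -> Vec<u64> {
    let mut state = seed.max(1); // xorshift64 must not start at zero
    (0..len)
        .map(|_| {
            // xorshift64: a tiny, well-known PRNG.
            state ^= state << 13;
            state ^= state >> 7;
            state ^= state << 17;
            state % distinct.max(1)
        })
        .collect()
}

fn main() {
    let many_repeats = keys_with_repeats(1000, 3, 42);
    let few_repeats = keys_with_repeats(1000, 1000, 42);
    // Only 3 distinct keys → heavy duplication, the case that stresses SMJ.
    assert!(many_repeats.iter().all(|k| *k < 3));
    // Same seed → same data, so fuzz failures are reproducible.
    assert_eq!(many_repeats, keys_with_repeats(1000, 3, 42));
    assert_eq!(few_repeats.len(), 1000);
    println!("ok");
}
```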


@@ -565,7 +583,7 @@ impl StreamedBatch {
#[derive(Debug)]
struct BufferedBatch {
/// The buffered record batch
-    pub batch: RecordBatch,
+    pub batch: Option<RecordBatch>,
alamb (Contributor):

While reviewing this PR, I found having to reason about what the valid batch or spill_file combinations was confusing (like there is an invariant I think that they can't both be Some)

Rather than use two fields, I tried making an enum that encoded the state and I thought it was easier to reason about. Here is a proposal here: comphead#297

comphead (Contributor Author):

I think it's a great idea. I'll include this in a follow-up to simplify the double-Option check in favor of an enum.
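A minimal sketch of the enum idea from comphead#297, using stand-in types and hypothetical names (the actual proposal lives in that PR):

```rust
use std::path::PathBuf;

// Stand-in for the real Arrow type.
struct RecordBatch {
    rows: usize,
}

// Instead of `batch: Option<RecordBatch>` plus `spill_file: Option<...>`,
// an enum makes the invalid states (both Some, or both None)
// unrepresentable, so no runtime check of the combination is needed.
enum BufferedData {
    /// Batch is held in memory and counted against the memory reservation.
    InMemory(RecordBatch),
    /// Batch was spilled to disk; only the spill file path is kept.
    Spilled(PathBuf),
}

impl BufferedData {
    /// Only an in-memory batch should shrink the memory reservation,
    /// mirroring the `spill_file.is_none() && batch.is_some()` check.
    fn shrinks_reservation(&self) -> bool {
        matches!(self, BufferedData::InMemory(_))
    }
}

fn main() {
    let in_mem = BufferedData::InMemory(RecordBatch { rows: 100 });
    let spilled = BufferedData::Spilled(PathBuf::from("/tmp/smj-spill-0"));
    assert!(in_mem.shrinks_reservation());
    assert!(!spilled.shrinks_reservation());
    println!("ok");
}
```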


comphead (Contributor Author):

Filed #11541 for the fuzz-testing follow-up suggested in the review above.

@comphead comphead requested a review from viirya July 19, 2024 01:34
@comphead comphead merged commit 63efaee into apache:main Jul 22, 2024
23 checks passed
Lordworms pushed a commit to Lordworms/arrow-datafusion that referenced this pull request Jul 23, 2024
wiedld pushed a commit to influxdata/arrow-datafusion that referenced this pull request Jul 31, 2024
Linked issue closed by this PR: Add spilling in SortMergeJoin