-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Making stream joins extensible: A new Trait implementation for SHJ #8234
Conversation
I plan to review this PR, hopefully later today |
For context: This enables downstream users (like us) to implement various join algorithms without duplicating a bunch of code every time. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @metesynnada -- this looks like a nice refactoring to me. I had some suggestions on documentation and API design but I think they could also be done as follow on PRs
I had some questions:
- Do you plan to add new join implementations to DataFusion, and if so are your plans written anywhere? A trait is a nice way to keep specialized implementation in other crates as well.
- Is it possible to extend "eager join" to MergeJoin? Would that even be a good idea?
- I don't understand how always reading alternately from left and right inputs would work (I left a more detailed question / comment below)
/// | 0 | 0 | 0 | 2 | 4 | <--- hash value 10 maps to 5,4,2 (which means indices values 4,3,1) | ||
/// --------------------- | ||
/// ``` | ||
pub struct JoinHashMap { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are going to move this structure anyways perhaps we can put it into its own module (e.g datafusion/physical-plan/src/joins/join_hash_map.rs
or something)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is only struct that hash join uses different from other joins, this is why I put it under utils. I can create a new folder as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, it will be a single struct in the new file. We can split more similar structs into a new file in future.
/// # Returns | ||
/// | ||
/// * `Result<StreamJoinStateResult<Option<RecordBatch>>>` - The state result after pulling the batch. | ||
async fn fetch_next_from_right_stream( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am somewhat confused by this default implementation as it implies that join will always "ping pong" back and forth between fetching left and right inputs, while in realty I think the details of how the stream is implemented and how the join keys are distributed across batches could require fetching multiple batches from one (or both) inputs before progress can be made.
I am thinking of a join on a = b
where all the rows in the batch have the same join key, for example:
Batch 1
a |
---|
100 |
100 |
Batch 2
a |
---|
100 |
100 |
Batch 3
a |
---|
100 |
200 |
Wouldn't the symmetric hash join have to read all three batches to find the next join key (200
) before reading a batch from the other input / producing output?
I didn't see how the symmetric hash handles this case, so I must be missing something
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both this and previous implementations buffer rows on both sides, adhering to deletion criteria set by the interval library; this aspect remains unchanged.
The core proposal focuses on retaining control over the SendableRecordBatch streams instead of merging them with futures::select. Additionally, there are plans to develop more efficient yielding strategies in the future. The suggested alternation strategy is expected to advance into sophisticated load-balancing techniques. This characteristic is fundamental to the realization of these advanced features.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both this and previous implementations buffer rows on both sides, adhering to deletion criteria set by the interval library; this aspect remains unchanged.
I agree this PR doesn't seem to change any behavior.
Additionally, there are plans to develop more efficient yielding strategies in the future. The suggested alternation strategy is expected to advance into sophisticated load-balancing techniques.
Are these plans described anywhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these plans described anywhere?
Not yet, but hopefully soon. We will continue publishing blog posts about this stuff in Datafusion and will talk about future goodies there :)
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for reviewing @alamb. We enriched the comments per your suggestions and will short open an issue to track the work on SortMergeJoin
, which will be addressed in a follow-on PR.
SortMergeJoin: #8273 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the improved comments 🙏
/// Represents the asynchronous trait for an eager join stream. | ||
/// This trait defines the core methods for handling asynchronous join operations | ||
/// between two streams (left and right). | ||
/// `EagerJoinStream` is an asynchronous trait designed for managing incremental join operations |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Which issue does this PR close?
Closes #.
Rationale for this change
In our ongoing efforts to increase join use cases in our Datafusion, this PR introduces a trait:
EagerJoinStream
. This trait is designed to provide a more structured and efficient way to implement more join use cases in future, ensuring better maintainability and ease of use.EagerJoinStream
: This trait ensures that all join operations are evaluated eagerly, providing faster response times for scenarios where immediate results are required.What changes are included in this PR?
This PR includes the implementation of the
EagerJoinStream
trait, along with necessary modifications to existing code to integrate these new traits. The changes are as follows:EagerJoinStream
: This trait is implemented to support eager evaluation of join operations.stream_hash_utils.rs
introduced to separate responsibility on HJ and SHJ. It becomes easier to maintain.datafusion.proto
file and ser/de features are updated to support SHJ in proto.Changes are mostly code restructuring and proto implementations, instead of adding new functionality to the joins.
Are these changes tested?
Yes, comprehensive tests have been added to cover the new functionality introduced by these traits. The tests ensure the new features' correctness, performance, and reliability.
Are there any user-facing changes?
NA