refactor: HashJoinStream state machine #8538

Merged on Dec 18, 2023 (7 commits)

@korowa korowa commented Dec 14, 2023

Which issue does this PR close?

Part of #8130.

Rationale for this change

Structures the HashJoinStream processing logic as a state machine, based on @alamb's design suggestion in the issue.

What changes are included in this PR?

  • HashJoinStream::poll_next_impl is split into more granular functions that act as handlers, each modifying the stream state as its result.
  • left_fut and visited_left_side are moved into the HashJoinStream.build_side attribute (of type BuildSide). Build-side data is stored separately from the stream state to avoid redundant cloning of build-side contents on every state change (all handlers operate on &mut self: HashJoinStream), while keeping the build side available for reading and mutation through references.
  • Utility structures and macros reused for hash join state management are moved from stream_join_utils.rs to utils.rs, and StreamJoinStateResult is renamed to StatefulStreamResult. The rename seems reasonable because the type is used for stateful streams in general, not only for stream-like joins. Suggestions on naming, or on rolling this back, are welcome and appreciated.
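To illustrate the shape of the changes listed above, here is a minimal, synchronous sketch of the handler pattern. It is not the code from this PR: the real stream is async and works on Arrow record batches. Only the names StatefulStreamResult, BuildSide, build_side, and process_probe_batch come from this conversation; the variants, the `JoinStream` type, the other handler names, and the integer "batches" are assumptions made for illustration.

```rust
/// Result of one handler call. The name comes from the PR description;
/// the two variants are an assumption of this sketch.
enum StatefulStreamResult<T> {
    /// A value is ready to be returned to the caller.
    Ready(T),
    /// Only the state changed; the driver loop should run again.
    Continue,
}

/// Build-side data lives outside the state enum, so state transitions
/// never have to clone or move it.
enum BuildSide {
    /// Not collected yet (stands in for the pending build-side future).
    Initial(Vec<i32>),
    /// Collected rows plus one "visited" flag per build row.
    Ready { rows: Vec<i32>, visited: Vec<bool> },
}

/// Hypothetical stream states for this sketch.
enum State {
    WaitBuildSide,
    FetchProbeBatch,
    ProcessProbeBatch(Vec<i32>),
    Completed,
}

struct JoinStream {
    state: State,
    build_side: BuildSide,
    probe: std::vec::IntoIter<Vec<i32>>,
}

type HandlerResult = StatefulStreamResult<Option<Vec<i32>>>;

impl JoinStream {
    fn new(build: Vec<i32>, probe: Vec<Vec<i32>>) -> Self {
        Self {
            state: State::WaitBuildSide,
            build_side: BuildSide::Initial(build),
            probe: probe.into_iter(),
        }
    }

    /// Handler: materialize the build side, then move to fetching probe batches.
    fn collect_build_side(&mut self) -> HandlerResult {
        if let BuildSide::Initial(rows) = &mut self.build_side {
            let rows = std::mem::take(rows);
            let visited = vec![false; rows.len()];
            self.build_side = BuildSide::Ready { rows, visited };
        }
        self.state = State::FetchProbeBatch;
        StatefulStreamResult::Continue
    }

    /// Handler: pull the next probe batch, or finish when the probe side is exhausted.
    fn fetch_probe_batch(&mut self) -> HandlerResult {
        self.state = match self.probe.next() {
            Some(batch) => State::ProcessProbeBatch(batch),
            None => State::Completed,
        };
        StatefulStreamResult::Continue
    }

    /// Handler: join one probe batch against the build side through a
    /// mutable reference, without cloning the build-side contents.
    fn process_probe_batch(&mut self) -> HandlerResult {
        let State::ProcessProbeBatch(batch) =
            std::mem::replace(&mut self.state, State::FetchProbeBatch)
        else {
            unreachable!("handler called in the wrong state")
        };
        let BuildSide::Ready { rows, visited } = &mut self.build_side else {
            unreachable!("build side must be collected before probing")
        };
        let mut out = Vec::new();
        for key in batch {
            for (i, row) in rows.iter().enumerate() {
                if *row == key {
                    visited[i] = true;
                    out.push(key);
                }
            }
        }
        StatefulStreamResult::Ready(Some(out))
    }

    /// Number of build rows matched so far.
    fn visited_count(&self) -> usize {
        match &self.build_side {
            BuildSide::Ready { visited, .. } => visited.iter().filter(|v| **v).count(),
            BuildSide::Initial(_) => 0,
        }
    }

    /// Driver loop: dispatch on the current state until a handler yields a value.
    fn next_batch(&mut self) -> Option<Vec<i32>> {
        loop {
            let result = match self.state {
                State::WaitBuildSide => self.collect_build_side(),
                State::FetchProbeBatch => self.fetch_probe_batch(),
                State::ProcessProbeBatch(_) => self.process_probe_batch(),
                State::Completed => return None,
            };
            if let StatefulStreamResult::Ready(batch) = result {
                return batch;
            }
        }
    }
}
```

Calling `next_batch` repeatedly on `JoinStream::new(build, probe)` yields one joined batch per probe batch and then `None`. Because the handlers only replace `self.state` and reach the build side through `&mut self`, no transition copies the build-side rows.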

Are these changes tested?

Covered by existing test cases.

Are there any user-facing changes?

No.

@korowa
Contributor Author

korowa commented Dec 14, 2023

Benchmark results for tpch_mem:

--------------------
Benchmark tpch_mem.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃   master ┃ hash_join_state ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │ 238.62ms │        239.68ms │     no change │
│ QQuery 2     │  83.95ms │         81.64ms │     no change │
│ QQuery 3     │ 169.21ms │        173.42ms │     no change │
│ QQuery 4     │ 142.54ms │        140.95ms │     no change │
│ QQuery 5     │ 324.32ms │        324.17ms │     no change │
│ QQuery 6     │  27.11ms │         25.77ms │     no change │
│ QQuery 7     │ 666.41ms │        652.17ms │     no change │
│ QQuery 8     │ 126.28ms │        133.17ms │  1.05x slower │
│ QQuery 9     │ 179.83ms │        179.57ms │     no change │
│ QQuery 10    │ 306.72ms │        308.34ms │     no change │
│ QQuery 11    │  70.66ms │         65.59ms │ +1.08x faster │
│ QQuery 12    │ 136.88ms │        137.92ms │     no change │
│ QQuery 13    │ 142.20ms │        147.34ms │     no change │
│ QQuery 14    │  56.03ms │         49.15ms │ +1.14x faster │
│ QQuery 15    │ 144.09ms │        143.68ms │     no change │
│ QQuery 16    │  62.33ms │         59.99ms │     no change │
│ QQuery 17    │ 241.87ms │        235.84ms │     no change │
│ QQuery 18    │ 577.50ms │        586.59ms │     no change │
│ QQuery 19    │  75.16ms │         73.42ms │     no change │
│ QQuery 20    │ 189.61ms │        184.65ms │     no change │
│ QQuery 21    │ 755.00ms │        733.98ms │     no change │
│ QQuery 22    │  36.87ms │         37.85ms │     no change │
└──────────────┴──────────┴─────────────────┴───────────────┘

@korowa
Contributor Author

korowa commented Dec 14, 2023

cc @alamb @metesynnada PTAL if you have some time and feel you have context on the issue.

@metesynnada
Contributor

I will review it soon, but it looks great.

@alamb
Contributor

alamb commented Dec 14, 2023

I also plan to review this soon. Thank you @korowa

Contributor

@alamb alamb left a comment

Thank you @korowa -- I read this PR carefully. I have a few style suggestions, but I don't think any of them are required.

As I understand it, this is a step towards being able to incrementally generate output batches for joins.

I wanted to also say the comments are really nice and make this PR a joy to read 👏

🏆

cc @Dandandan and @liukun4515

I think it would be good to get @metesynnada's review before merging this as well.

@korowa
Contributor Author

korowa commented Dec 16, 2023

> As I understand it, this is a step towards being able to incrementally generate output batches for joins.

Yes, the plan is

  1. FSM (this PR) -- the first part of the issue mentioned in the PR description
  2. changing the iteration order for both inputs of the hash join to preserve input order -- the second part, which closes the issue from the PR description
  3. another attempt at "feat: emitting partial join results in HashJoinStream" #8020 -- after 1 + 2 it should fit into HashJoinStreamState::ProcessProbeBatch / process_probe_batch and retain the original sort order (if any)
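As a rough illustration of why step 3 could fit into a ProcessProbeBatch state, the sketch below lets the state carry an offset so the handler can emit bounded chunks and stay in the same state until the batch is drained. This is not the design of #8020: `SketchState`, the `ProcessProbeBatch` payload fields, the precomputed `matches` list, and the `batch_size` parameter (assumed to be greater than zero) are all hypothetical.

```rust
/// Hypothetical payload: matched (build_idx, probe_idx) pairs for the
/// current probe batch, plus the position of the next chunk to emit.
struct ProcessProbeBatch {
    matches: Vec<(usize, usize)>,
    offset: usize,
}

/// Hypothetical subset of stream states, just enough for this sketch.
enum SketchState {
    FetchProbeBatch,
    ProcessProbeBatch(ProcessProbeBatch),
}

/// Emits at most `batch_size` pairs per call, in their original order.
/// The state only switches to FetchProbeBatch once every pair was emitted.
fn process_probe_batch(
    state: &mut SketchState,
    batch_size: usize,
) -> Option<Vec<(usize, usize)>> {
    let SketchState::ProcessProbeBatch(p) = state else {
        // Nothing to process in any other state.
        return None;
    };
    let end = (p.offset + batch_size).min(p.matches.len());
    let chunk = p.matches[p.offset..end].to_vec();
    p.offset = end;
    if p.offset == p.matches.len() {
        *state = SketchState::FetchProbeBatch;
    }
    Some(chunk)
}
```

Since chunks are sliced front to back from one list, the concatenated output keeps whatever order the matches had, which is the property the plan above relies on once input order is preserved.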

Contributor

@metesynnada metesynnada left a comment

Great job on implementing the state machine - I really appreciate the effort you put into it! I'm glad to see that we now have a design coupling with SHJ. From what I can see, there have been no changes in how we handle the join itself, so I think it all looks good. Thanks for your hard work!

@metesynnada
Contributor

@Dandandan If this looks OK, we can merge this.

Contributor

@Dandandan Dandandan left a comment

Looks great, thanks for the refactoring!

@Dandandan Dandandan merged commit a71a76a into apache:main Dec 18, 2023
22 checks passed
@Dandandan
Contributor

Thank you @korowa

4 participants