-
Notifications
You must be signed in to change notification settings - Fork 184
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve the precision of half_join
#386
Improve the precision of half_join
#386
Conversation
It seems to be solving the problem: the original query from https://github.com/MaterializeInc/database-issues/issues/5550 is not hanging anymore! I also ran I'll look through the code in detail now. @philip-stoev, could you please test this with random WMR queries with Delta joins? (The fixed
(For join inputs that are not directly coming from indexed tables, one way to create the arrangement that needs to be there for a Delta join to be planned is to put a Distinct or other Reduce at the root of the join input, where the key is the same as the join key.) |
The code makes sense to me, but it would be good if someone who knows more about Timely/Differential would also check it. Maybe @antiguru or @vmarcos. This is the Materialize issue, and a Slack thread with more details of what is happening. Also, it would be good to somehow test the |
I'll try and think through a test case for it that forces a multi-element frontier there. |
I looked this through and it makes sense to me that the changes are safe, since we only enforce higher precision on the lower bounds of times in the stash. However, I must admit I am not 100% confident that I understood the root cause of the original issue. Is it because we have a self-join and the frontiers in the arranged inputs advance at the same rate? To delve deeper, I'd probably need to invest some debugging time here. |
@ggevay or I can explain the issue in more detail, but it stems from being imprecise about which capabilities are required, which blocks downstream batch formation, which believes that these capabilities might still change data. This causes the dataflow to stall out at certain times when it should be able to continue. In simple cases (without a dataflow cycle) this was fine as everything would just drain. With a cycle, these held-back capabilities prevent event times from completing that the whole dataflow seizes up. |
I did not manage to think through a test for the multi-element antichain. Going to merge nonetheless, and will work harder on this in spare moments. |
@vmarcos, let's discuss it in the Wednesday office hours. |
Sounds good, thanks! |
This PR changes the behavior of the
half_join
methods indogsdogsdogs
to better track the required capabilities for outstanding work. The lack of precision has the potential to lead to stalls where work is not done, especially when that work would feed back around to the arrangement the operator uses to unblock its own work.This has not been tested in any form other than
cargo test
running fine. There is a Materialize issue that this may resolve, but I wouldn't be at all surprised if there are other issues in the written code. Up for inspection and discussion.cc: @ggevay