Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

distsqlrun: lookup join still unboundedly buffers, but this time in a different way #39044

Closed
jordanlewis opened this issue Jul 23, 2019 · 1 comment · Fixed by #40208
Closed
Assignees
Labels
A-sql-execution Relating to SQL execution. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-2-temp-unavailability Temp crashes or other availability problems. Can be worked around or resolved by restarting.

Comments

@jordanlewis
Copy link
Member

jordanlewis commented Jul 23, 2019

#35950 fixed one kind of unbounded buffering in lookup join - the kind caused by the KV layer, when somebody asks for a span that has an unbounded number of keys inside, without a batch limit.

But this didn't remove the potential for the joinReader to buffer in an unbounded manner, unfortunately!

The proximal cause is that joinReader doesn't use memory monitored containers. This should be a simple fix, to use a memory monitored container instead of a regular slice. At the very least, we should account for the row copies we perform when buffering output rows.

The root cause is that the algorithm that the joinReader uses to ensure that the output is ordered and that unmatched rows are accounted for (in the case of left outer and anti join) requires that all results for an input batch are present before emitting any results. This is problematic for an example where a left-side table has a single row that matches a billion rows on a right-side table - the algorithm won't emit anything until all billion rows are buffered!

I'm not sure if this algorithm can be modified to be streaming. It currently needs to see results for the whole batch because it can't guarantee that the results from the KV layer come in order, so it can't declare any of the input rows as "closed" at any point and let them get emitted. It's possible there's a way to work around this limitation, but it interacts in a funny way with the solution for #35950, which actually sorts all of the input row spans before sending them out, since KV requires it. This permitted limited KV batches, but it's painted us into a corner.

Imagine a scenario with two tables, a and b, where a contains the integers 0-100 and each value of a has 1 million rows in b. Then, feed a joinReader all of the rows in a in reversed sort order (starting 100, 99, 98...)! Now, we're beholden to output matches for 100 first, since it came first to the joinReader and we must output matches in order, but to output any matches for row 100 of a requires that all matches from 0-99 be buffered first, since we sorted the spans first and will therefore get matches from 0 first, then 1, then 2...

So this will require us to buffer 100 * 1 million rows into memory, kaboom.

I believe that a solution for this requires that we upgrade the batch limit capabilities for the KV layer to not require sorted input spans, but I'm open to other ideas.

cc @solongordon @RaduBerinde

@jordanlewis jordanlewis added the A-sql-execution Relating to SQL execution. label Jul 23, 2019
@jordanlewis jordanlewis added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-2-temp-unavailability Temp crashes or other availability problems. Can be worked around or resolved by restarting. labels Jul 23, 2019
@jordanlewis
Copy link
Member Author

I believe this is still the root cause of #36842.

@jordanlewis jordanlewis self-assigned this Jul 23, 2019
solongordon added a commit to solongordon/cockroach that referenced this issue Aug 26, 2019
In lookup joins on partial index keys, there is no limit on how many
rows might be returned by any particular lookup, so the joinreader may
be buffering an unbounded number of rows into memory. I changed
joinreader to use a disk-backed row container rather than just storing
the rows in memory with no accounting.

Fixes cockroachdb#39044

Release note (bug fix): Lookup joins now spill to disk if the index
lookups return more rows than can be stored in memory.
solongordon added a commit to solongordon/cockroach that referenced this issue Aug 28, 2019
In lookup joins on partial index keys, there is no limit on how many
rows might be returned by any particular lookup, so the joinreader may
be buffering an unbounded number of rows into memory. I changed
joinreader to use a disk-backed row container rather than just storing
the rows in memory with no accounting.

Fixes cockroachdb#39044

Release note (bug fix): Lookup joins now spill to disk if the index
lookups return more rows than can be stored in memory.
craig bot pushed a commit that referenced this issue Aug 28, 2019
40208: distsql: add disk spilling to lookup joiner r=solongordon a=solongordon

In lookup joins on partial index keys, there is no limit on how many
rows might be returned by any particular lookup, so the joinreader may
be buffering an unbounded number of rows into memory. I changed
joinreader to use a disk-backed row container rather than just storing
the rows in memory with no accounting.

Fixes #39044

Release note (bug fix): Lookup joins now spill to disk if the index
lookups return more rows than can be stored in memory.

40284: storage: issue swaps on AllocatorConsiderRebalance r=nvanbenschoten a=tbg

Change the rebalancing code so that it not only looks up a new replica to
add, but also picks one to remove. Both actions are then given to a
ChangeReplicas invocation which will carry it out atomically as long as
that feature is enabled.

Release note (bug fix): Replicas can now be moved between stores without
entering an intermediate configuration that violates the zone constraints.
Violations may still occur during zone config changes, decommissioning, and
in the presence of dead nodes (NB: the remainder be addressed in a future
change, so merge the corresponding release note)

40300: store: pull updateMVCCGauges out of StoreMetrics lock, use atomics r=nvanbenschoten a=nvanbenschoten

The operations it performs are already atomic, so we can use atomic add instructions to avoid any critical section.

This was responsible for 8.15% of mutex contention on a YCSB run.

The change also removes MVCCStats from the `storeMetrics` interface, which addresses a long-standing TODO.

40301: roachtest: Deflake clock jump test r=tbg a=bdarnell

These tests perform various clock jumps, then reverse them. The
reverse can cause a crash even if the original jump did not.
Add some sleeps to make things more deterministic and improve the
recovery process at the end of the test.

Fixes #38723

Release note: None

40305: exec: modify tests to catch bad selection vector access r=rafiss a=rafiss

The runTests helper will now cause a panic if a vectorized operator
tries to access a part of the selection vector that is out of bounds.
This identified bugs in the projection operator.

Release note: None

Co-authored-by: Solon Gordon <[email protected]>
Co-authored-by: Tobias Schottdorf <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Ben Darnell <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
@craig craig bot closed this as completed in dfcf20f Aug 28, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-sql-execution Relating to SQL execution. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-2-temp-unavailability Temp crashes or other availability problems. Can be worked around or resolved by restarting.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants