distsqlrun: lookup join still unboundedly buffers, but this time in a different way #39044
Labels

A-sql-execution: Relating to SQL execution.
C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
S-2-temp-unavailability: Temp crashes or other availability problems. Can be worked around or resolved by restarting.
jordanlewis added the C-bug and S-2-temp-unavailability labels on Jul 23, 2019.
I believe this is still the root cause of #36842.
solongordon added a commit to solongordon/cockroach that referenced this issue on Aug 26, 2019:

In lookup joins on partial index keys, there is no limit on how many rows might be returned by any particular lookup, so the joinreader may be buffering an unbounded number of rows into memory. I changed joinreader to use a disk-backed row container rather than just storing the rows in memory with no accounting. Fixes cockroachdb#39044

Release note (bug fix): Lookup joins now spill to disk if the index lookups return more rows than can be stored in memory.
solongordon added a commit to solongordon/cockroach that referenced this issue on Aug 28, 2019:

In lookup joins on partial index keys, there is no limit on how many rows might be returned by any particular lookup, so the joinreader may be buffering an unbounded number of rows into memory. I changed joinreader to use a disk-backed row container rather than just storing the rows in memory with no accounting. Fixes cockroachdb#39044

Release note (bug fix): Lookup joins now spill to disk if the index lookups return more rows than can be stored in memory.
craig bot pushed a commit that referenced this issue on Aug 28, 2019:

40208: distsql: add disk spilling to lookup joiner r=solongordon a=solongordon

In lookup joins on partial index keys, there is no limit on how many rows might be returned by any particular lookup, so the joinreader may be buffering an unbounded number of rows into memory. I changed joinreader to use a disk-backed row container rather than just storing the rows in memory with no accounting. Fixes #39044. Release note (bug fix): Lookup joins now spill to disk if the index lookups return more rows than can be stored in memory.

40284: storage: issue swaps on AllocatorConsiderRebalance r=nvanbenschoten a=tbg

Change the rebalancing code so that it not only looks up a new replica to add, but also picks one to remove. Both actions are then given to a ChangeReplicas invocation which will carry it out atomically as long as that feature is enabled. Release note (bug fix): Replicas can now be moved between stores without entering an intermediate configuration that violates the zone constraints. Violations may still occur during zone config changes, decommissioning, and in the presence of dead nodes (NB: the remainder will be addressed in a future change, so merge the corresponding release note).

40300: store: pull updateMVCCGauges out of StoreMetrics lock, use atomics r=nvanbenschoten a=nvanbenschoten

The operations it performs are already atomic, so we can use atomic add instructions to avoid any critical section. This was responsible for 8.15% of mutex contention on a YCSB run. The change also removes MVCCStats from the `storeMetrics` interface, which addresses a long-standing TODO. Release note: None

40301: roachtest: Deflake clock jump test r=tbg a=bdarnell

These tests perform various clock jumps, then reverse them. The reverse can cause a crash even if the original jump did not. Add some sleeps to make things more deterministic and improve the recovery process at the end of the test. Fixes #38723. Release note: None

40305: exec: modify tests to catch bad selection vector access r=rafiss a=rafiss

The runTests helper will now cause a panic if a vectorized operator tries to access a part of the selection vector that is out of bounds. This identified bugs in the projection operator. Release note: None

Co-authored-by: Solon Gordon <[email protected]>
Co-authored-by: Tobias Schottdorf <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Ben Darnell <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
#35950 fixed one kind of unbounded buffering in lookup join: the kind caused by the KV layer, when somebody asks for a span that has an unbounded number of keys inside, without a batch limit.

But this didn't remove the potential for the `joinReader` to buffer in an unbounded manner, unfortunately!

The proximal cause is that the `joinReader` doesn't use memory-monitored containers. This should be a simple fix: use a memory-monitored container instead of a regular slice. At the very least, we should account for the row copies we perform when buffering output rows.

The root cause is that the algorithm the `joinReader` uses to ensure that the output is ordered and that unmatched rows are accounted for (in the case of left outer and anti join) requires that all results for an input batch are present before emitting any results. This is problematic when, for example, a left-side table has a single row that matches a billion rows in a right-side table: the algorithm won't emit anything until all billion rows are buffered!

I'm not sure if this algorithm can be modified to be streaming. It currently needs to see results for the whole batch because it can't guarantee that the results from the KV layer come in order, so it can't declare any of the input rows as "closed" at any point and let them get emitted. It's possible there's a way to work around this limitation, but it interacts in a funny way with the solution for #35950, which actually sorts all of the input row spans before sending them out, since KV requires it. This permitted limited KV batches, but it's painted us into a corner.
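The "simple fix" part, accounting for row copies against a budget, can be sketched as follows. This is a minimal stand-in for the idea (CockroachDB has its own monitoring infrastructure for this; `memAccount`, `Grow`, and `bufferRow` here are hypothetical names invented for the example): every buffered copy first grows an account, and the operator gets an error it can react to instead of silently exhausting memory.

```go
package main

import (
	"errors"
	"fmt"
)

// memAccount is a hypothetical memory-accounting budget: Grow fails
// once the limit would be exceeded, instead of letting callers
// allocate without bound.
type memAccount struct {
	used, limit int64
}

var errBudget = errors.New("memory budget exceeded")

func (a *memAccount) Grow(n int64) error {
	if a.used+n > a.limit {
		return errBudget
	}
	a.used += n
	return nil
}

// bufferRow accounts for the copy before storing it, so the caller
// can error out (or spill to disk) rather than OOM the node.
func bufferRow(acct *memAccount, buf *[][]byte, row []byte) error {
	if err := acct.Grow(int64(len(row))); err != nil {
		return err
	}
	cp := append([]byte(nil), row...) // the copy we are accounting for
	*buf = append(*buf, cp)
	return nil
}

func main() {
	acct := &memAccount{limit: 10}
	var buf [][]byte
	for {
		// 4-byte rows against a 10-byte limit: the third copy fails.
		if err := bufferRow(acct, &buf, []byte("abcd")); err != nil {
			fmt.Printf("stopped after %d rows: %v\n", len(buf), err)
			return
		}
	}
}
```

In the real operator, the error path would trigger the spill to a disk-backed container rather than aborting the query.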
Imagine a scenario with two tables, `a` and `b`, where `a` contains the integers `0-100` and each value of `a` has 1 million rows in `b`. Then, feed a `joinReader` all of the rows in `a` in reversed sort order (starting `100, 99, 98` ...)! Now, we're beholden to output matches for `100` first, since it came first to the `joinReader` and we must output matches in order. But to output any matches for row `100` of `a` requires that all matches for `0-99` be buffered first, since we sorted the spans first and will therefore get matches for `0` first, then `1`, then `2` ... So this will require us to buffer `100 * 1 million` rows into memory, kaboom.

I believe that a solution for this requires that we upgrade the batch limit capabilities of the KV layer to not require sorted input spans, but I'm open to other ideas.
cc @solongordon @RaduBerinde