distsqlrun: lookup join still unboundedly buffers, but this time in a different way #39044
Labels

A-sql-execution: Relating to SQL execution.
C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
S-2-temp-unavailability: Temp crashes or other availability problems. Can be worked around or resolved by restarting.
jordanlewis added the C-bug and S-2-temp-unavailability labels on Jul 23, 2019.
I believe this is still the root cause of #36842.
solongordon added a commit to solongordon/cockroach that referenced this issue on Aug 26, 2019:

In lookup joins on partial index keys, there is no limit on how many rows might be returned by any particular lookup, so the joinreader may be buffering an unbounded number of rows into memory. I changed joinreader to use a disk-backed row container rather than just storing the rows in memory with no accounting. Fixes cockroachdb#39044

Release note (bug fix): Lookup joins now spill to disk if the index lookups return more rows than can be stored in memory.
solongordon added a commit to solongordon/cockroach that referenced this issue on Aug 28, 2019:

In lookup joins on partial index keys, there is no limit on how many rows might be returned by any particular lookup, so the joinreader may be buffering an unbounded number of rows into memory. I changed joinreader to use a disk-backed row container rather than just storing the rows in memory with no accounting. Fixes cockroachdb#39044

Release note (bug fix): Lookup joins now spill to disk if the index lookups return more rows than can be stored in memory.
craig bot pushed a commit that referenced this issue on Aug 28, 2019:

40208: distsql: add disk spilling to lookup joiner r=solongordon a=solongordon

In lookup joins on partial index keys, there is no limit on how many rows might be returned by any particular lookup, so the joinreader may be buffering an unbounded number of rows into memory. I changed joinreader to use a disk-backed row container rather than just storing the rows in memory with no accounting. Fixes #39044. Release note (bug fix): Lookup joins now spill to disk if the index lookups return more rows than can be stored in memory.

40284: storage: issue swaps on AllocatorConsiderRebalance r=nvanbenschoten a=tbg

Change the rebalancing code so that it not only looks up a new replica to add, but also picks one to remove. Both actions are then given to a ChangeReplicas invocation which will carry it out atomically as long as that feature is enabled. Release note (bug fix): Replicas can now be moved between stores without entering an intermediate configuration that violates the zone constraints. Violations may still occur during zone config changes, decommissioning, and in the presence of dead nodes (NB: the remainder will be addressed in a future change, so merge the corresponding release note).

40300: store: pull updateMVCCGauges out of StoreMetrics lock, use atomics r=nvanbenschoten a=nvanbenschoten

The operations it performs are already atomic, so we can use atomic add instructions to avoid any critical section. This was responsible for 8.15% of mutex contention on a YCSB run. The change also removes MVCCStats from the `storeMetrics` interface, which addresses a long-standing TODO. Release note: None

40301: roachtest: Deflake clock jump test r=tbg a=bdarnell

These tests perform various clock jumps, then reverse them. The reverse can cause a crash even if the original jump did not. Add some sleeps to make things more deterministic and improve the recovery process at the end of the test. Fixes #38723. Release note: None

40305: exec: modify tests to catch bad selection vector access r=rafiss a=rafiss

The runTests helper will now cause a panic if a vectorized operator tries to access a part of the selection vector that is out of bounds. This identified bugs in the projection operator. Release note: None

Co-authored-by: Solon Gordon <[email protected]>
Co-authored-by: Tobias Schottdorf <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Ben Darnell <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
#35950 fixed one kind of unbounded buffering in lookup join: the kind caused by the KV layer, when somebody asks for a span that has an unbounded number of keys inside, without a batch limit.

But this didn't remove the potential for the `joinReader` to buffer in an unbounded manner, unfortunately!

The proximal cause is that the `joinReader` doesn't use memory-monitored containers. This should be a simple fix: use a memory-monitored container instead of a regular slice. At the very least, we should account for the row copies we perform when buffering output rows.

The root cause is that the algorithm the `joinReader` uses to ensure that the output is ordered and that unmatched rows are accounted for (in the case of left outer and anti join) requires that all results for an input batch are present before emitting any results. This is problematic when, for example, a left-side table has a single row that matches a billion rows in a right-side table: the algorithm won't emit anything until all billion rows are buffered!

I'm not sure if this algorithm can be modified to be streaming. It currently needs to see results for the whole batch because it can't guarantee that the results from the KV layer come in order, so it can't declare any of the input rows as "closed" at any point and let them get emitted. It's possible there's a way to work around this limitation, but it interacts in a funny way with the solution for #35950, which actually sorts all of the input row spans before sending them out, since KV requires it. This permitted limited KV batches, but it's painted us into a corner.
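The "simple fix" part, accounting for row copies against a budget, can be sketched as follows. This is a minimal stand-in for the idea (CockroachDB has its own monitoring infrastructure for this; `memAccount`, `Grow`, and `bufferRow` here are hypothetical names invented for the example): every buffered copy first grows an account, and the operator gets an error it can react to instead of silently exhausting memory.

```go
package main

import (
	"errors"
	"fmt"
)

// memAccount is a hypothetical memory-accounting budget: Grow fails
// once the limit would be exceeded, instead of letting callers
// allocate without bound.
type memAccount struct {
	used, limit int64
}

var errBudget = errors.New("memory budget exceeded")

func (a *memAccount) Grow(n int64) error {
	if a.used+n > a.limit {
		return errBudget
	}
	a.used += n
	return nil
}

// bufferRow accounts for the copy before storing it, so the caller
// can error out (or spill to disk) rather than OOM the node.
func bufferRow(acct *memAccount, buf *[][]byte, row []byte) error {
	if err := acct.Grow(int64(len(row))); err != nil {
		return err
	}
	cp := append([]byte(nil), row...) // the copy we are accounting for
	*buf = append(*buf, cp)
	return nil
}

func main() {
	acct := &memAccount{limit: 10}
	var buf [][]byte
	for {
		// 4-byte rows against a 10-byte limit: the third copy fails.
		if err := bufferRow(acct, &buf, []byte("abcd")); err != nil {
			fmt.Printf("stopped after %d rows: %v\n", len(buf), err)
			return
		}
	}
}
```

In the real operator, the error path would trigger the spill to a disk-backed container rather than aborting the query.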
Imagine a scenario with two tables, `a` and `b`, where `a` contains the integers `0-100` and each value of `a` has 1 million rows in `b`. Then, feed a `joinReader` all of the rows in `a` in reversed sort order (starting `100, 99, 98` ...)! Now, we're beholden to output matches for `100` first, since it came first to the `joinReader` and we must output matches in order. But to output any matches for row `100` of `a` requires that all matches for `0-99` be buffered first, since we sorted the spans first and will therefore get matches for `0` first, then `1`, then `2` ... So this will require us to buffer `100 * 1 million` rows into memory, kaboom.

I believe that a solution for this requires that we upgrade the batch limit capabilities of the KV layer to not require sorted input spans, but I'm open to other ideas.
cc @solongordon @RaduBerinde