[BUG] NDS query 16 hangs at SF30K #8278
Confirmed that the issue started after the subpartition join was introduced in the 23.04 release cycle: #7794.
Do we have a full call stack for the hanging task?
@firestarman I am not 100% sure that it is a hang. It could be a really big performance regression, or even a live lock. The thing to understand is that there is a lot of skew for one of the tasks that is hanging, but the skew appears to be in the stream side of the join, not necessarily the build side.
Here is sample stderr output for the skewed task that continues to spill (and doesn't complete):
Note that when I disable the subpartition join by setting the relevant config, I hit an exception instead. Here is an exception sample:
The OOM error looks really odd to me. How in the world did we need to split data on a 40 GiB GPU when we just read in data from a shuffle that obviously fit on the GPU? Something is off here and I am not sure what it is.
I am wondering if it was due to a potential live lock in the spilling framework, judging by the spilling logs. According to the code here, after doing the spilling it will always retry the allocation as long as there are spillable buffers, even if no buffer was actually spilled. So when it re-enters the ...
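For illustration only, here is a minimal sketch (plain Scala, not the plugin's actual spill framework; names like `AllocStore`, `spillAnySpillableBuffer`, and `allocateWithRetry` are made up) of the retry pattern being described, where the allocation is retried as long as any buffer is reported as spillable, even if a spill attempt frees nothing:

```scala
// Hypothetical sketch of the retry-on-allocation-failure pattern described above.
// If hasSpillableBuffers keeps returning true while spillAnySpillableBuffer()
// frees nothing, the loop never makes progress: a potential live lock.
object RetrySketch {
  final case class Buffer(sizeBytes: Long)

  final class AllocStore(var freeBytes: Long, var spillableBytes: Long) {
    def hasSpillableBuffers: Boolean = spillableBytes > 0
    // Returns the number of bytes actually spilled; may be 0 if the "spillable"
    // buffers are currently pinned or in use and cannot really be spilled.
    def spillAnySpillableBuffer(): Long = 0L
  }

  def allocate(store: AllocStore, sizeBytes: Long): Option[Buffer] =
    if (store.freeBytes >= sizeBytes) {
      store.freeBytes -= sizeBytes
      Some(Buffer(sizeBytes))
    } else None

  def allocateWithRetry(store: AllocStore, sizeBytes: Long): Buffer = {
    var result = allocate(store, sizeBytes)
    while (result.isEmpty && store.hasSpillableBuffers) {
      val spilled = store.spillAnySpillableBuffer()
      store.freeBytes += spilled
      // No check that spilled > 0 before retrying, so this can spin forever.
      result = allocate(store, sizeBytes)
    }
    result.getOrElse(throw new OutOfMemoryError(s"could not allocate $sizeBytes bytes"))
  }
}
```

In a sketch like this, a guard that breaks out when a spill attempt frees zero bytes would turn the potential live lock into an explicit OOM instead.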
I will try to run this today and see if I can repro the hang/spill situation. |
Here's what I see after adding some debug logging. I am not as familiar with this code yet, so I am mostly noting what I have seen so far. I let the query run for ~1 hour and it eventually failed with an OOM here: https://github.com/NVIDIA/spark-rapids/blob/branch-23.08/sql-plugin/src/main/scala/org/apache/spark/sql/rapids/execution/GpuSubPartitionHashJoin.scala#L45. So my attempt had loaded and partitioned the build and stream batches as spillable, and it is trying to exit ... There is a single task running at this stage, and it has depleted ... This single task is trying to concatenate > 40 GB of memory (see the RMM allocated number) for the build side. It is actually trying to materialize 197 GiB from the build side (https://github.com/NVIDIA/spark-rapids/blob/branch-23.08/sql-plugin/src/main/scala/org/apache/spark/sql/rapids/execution/GpuSubPartitionHashJoin.scala#L423):
If I understand correctly, this table likely needs to be whole in memory as the code is currently written. Rest of the OOM part of the log:
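To make the shape of the failure concrete, here is a tiny hypothetical sketch (not the GpuSubPartitionHashJoin code; `SpillableBatch` and the sizes are illustrative) of why a concat-everything step cannot succeed here: the whole sub-partition has to be resident at once, so its minimum footprint is the sum of the buffered batch sizes, and ~197 GiB can never fit on a 40 GiB GPU no matter how much else is spilled:

```scala
// Hypothetical sketch: concatenating all buffered batches of a sub-partition
// requires the whole sub-partition to be resident at once, so the minimum
// footprint is the sum of the batch sizes; spilling other buffers cannot help
// once that sum exceeds total GPU memory.
final case class SpillableBatch(sizeBytes: Long)

def concatFootprintBytes(buffered: Seq[SpillableBatch]): Long =
  buffered.map(_.sizeBytes).sum

def canConcatOnGpu(buffered: Seq[SpillableBatch], gpuMemBytes: Long): Boolean =
  concatFootprintBytes(buffered) <= gpuMemBytes

val GiB: Long = 1024L * 1024 * 1024
// ~197 GiB worth of build-side batches for one skewed sub-partition vs a 40 GiB GPU.
val skewedSubPartition = Seq.fill(197)(SpillableBatch(GiB))
println(canConcatOnGpu(skewedSubPartition, 40 * GiB)) // false
```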
Additionally, we spend almost all of the time reading from the build/stream iterators, splitting, and making things spillable. This causes spill to happen, since we do it for all of the input in one pass, and in this case the input is massive. I didn't see a livelock situation with the spill framework, but perhaps I have missed it (the code could have race conditions, so I am happy to be wrong; @firestarman let me know if you have evidence to the contrary). In discussing with @revans2 and @jlowe, we agreed we need to change the algorithm here. We'll have to document what was discussed, but essentially the idea is to move away from pulling everything into memory and instead perform several small joins against a build table round robin, merging the individual results (see the sketch below).
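A rough sketch of that direction (hypothetical, plain Scala rather than the plugin's GPU code; `Batch`, `smallJoin`, and the round-robin driver are illustrative, and a real design also has to handle join types other than inner correctly): keep the build side as several smaller pieces, join each stream chunk against the build pieces in turn, and merge the small per-piece results instead of materializing everything and doing one giant join:

```scala
// Hypothetical sketch of the idea above: many small joins against pieces of the
// build table, merging individual results, instead of one huge in-memory join.
final case class Batch(rows: Vector[(Int, String)]) // (joinKey, payload)

// Stand-in for one bounded-size hash join of a stream chunk against one build piece.
def smallJoin(buildPiece: Batch, streamChunk: Batch): Batch = {
  val buildMap = buildPiece.rows.groupBy(_._1)
  Batch(streamChunk.rows.flatMap { case (k, v) =>
    buildMap.getOrElse(k, Vector.empty).map { case (_, b) => (k, s"$v|$b") }
  })
}

def mergeBatches(parts: Seq[Batch]): Batch = Batch(parts.flatMap(_.rows).toVector)

// For an inner join, joining each stream chunk against every build piece and
// merging the per-piece results is equivalent to joining against the whole
// build table at once, but the peak memory is bounded by piece + chunk size.
def joinRoundRobin(buildPieces: Seq[Batch], stream: Iterator[Batch]): Iterator[Batch] =
  stream.map { chunk =>
    mergeBatches(buildPieces.map(piece => smallJoin(piece, chunk)))
  }
```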
@abellina do you know how big the build side is in this case? I just don't see how we end up spilling 1.6 TiB for a build side table when the input for the entire shuffle is not that big. I think there has to be some kind of bug in the internal code, especially because previously the entire join worked without any errors, so the entire build side fit into GPU memory.
The build batch was ~190 GiB total in this particular case. Running the query again, I actually see SQL metrics with a stream side that is even bigger (630 GiB):
And that build side:
What appears to happen is that both the build and stream side are being pulled in fully, and then we get the sub-partition pairs that we call ...
Then we attempt to repartition the build side 184 ways and it looks like we put this aside in the "big batches" area. Then I see we pop it again:
And the size is slightly different (I do not know why). But we do blow up, because we say this is already partitioned, so the code tries to concatenate all of it. This is going to be bad. I do not know if all 197 GB worth of the build side have the same key. By the way, I see really odd calls to contiguousSplit in ...
Which produces 4 contiguous tables of 0 rows, then 1 table with 117 rows, then 10 more 0-row tables. I need to understand this better. I also see this:
I am not sure yet, but I believe this could be the stream side, where a particular streamed batch didn't have any rows matching the key, but I would like to get confirmation from @firestarman that this is the intention.
This should not happen if the input batch is not empty. We expect at least one sub-partition to have all of its rows. It seems something is wrong here, but I am not sure what it is yet. The indices for ...
This may be due to the aligned sizes in ...
Sorry for the confusion, this is just a guess from the code.
I did some investigation and here is what I have found so far. After loading and repartitioning all the build side data, sub-partition 4 (shown below) is quite a bit bigger (about 190 GB) than the other sub-partitions. It looks like the data is highly skewed.
I will dig more next. |
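To connect the two observations above (the mostly 0-row contiguous tables and the single huge sub-partition), here is a small hypothetical sketch (plain Scala, not the plugin's cudf-based partitioning code) of hashing rows into N sub-partitions: when a batch is tiny or its join keys are heavily skewed toward one value, one bucket gets nearly everything and the rest come out empty:

```scala
// Hypothetical sketch: hash-partitioning a small or heavily skewed batch of join
// keys into N sub-partitions leaves most sub-partitions empty, which is
// consistent with seeing many 0-row contiguous tables and one table with rows.
def subPartition[K](keys: Seq[K], numParts: Int): Vector[Vector[K]] = {
  val buckets = Vector.fill(numParts)(Vector.newBuilder[K])
  keys.foreach { k =>
    val p = Math.floorMod(k.hashCode, numParts)
    buckets(p) += k
  }
  buckets.map(_.result())
}

// 117 rows that all share a single (made up) join key, split 15 ways: one
// bucket holds all 117 rows and the other 14 buckets are empty.
val skewed = Seq.fill(117)(42)
println(subPartition(skewed, 15).map(_.size).mkString(", "))
```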
I think we understand what is happening now. Query 16 has some issues with nulls on the right hand side of the join. In the old code we used
spark-rapids/sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuCoalesceBatches.scala, lines 206 to 213 (at 5699ac1), and
spark-rapids/sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuShuffledHashJoinExec.scala, lines 122 to 134 (at 9187eb6).
This was used to remove nulls from the right side of these joins because there are some cases where Spark does not do this. LeftSemi and LeftAnti are the big ones that we need to worry about. This was missed with the new code: it is only used if we need to get a single batch, and it is never used if we think we have to partition the data.

So I think we need to change a few things here. Sadly there is no simple/good way to filter out nulls on the CPU when trying to get the build side table. It would be awesome, but I don't think we can do it. So I would propose that we start out the same way that we are doing it today, and if there is a single batch with no overflow etc. we call ... For the other cases, once we get the ... If we hit the end of the stream, we concat them all together and return it. If we overflowed, then we need to do the safe iterator trick again with everything that we buffered up and send it to be partitioned. (A rough sketch of the null-filtering idea follows below.)
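As a concrete illustration of the missing step (a hedged sketch in plain Scala rather than the plugin's ColumnarBatch/cudf code; `BuildRow`, `BuildBatch`, and `filterNullJoinKeys` are made-up names): for the LeftSemi/LeftAnti cases, build-side rows whose join key is null can never match, so they can be dropped from each build batch as it streams in, before anything is buffered, made spillable, or sub-partitioned:

```scala
// Hypothetical sketch: drop build-side rows whose join key is null before the
// batch is buffered or sub-partitioned. Such rows can never match for the
// LeftSemi / LeftAnti cases discussed above, so filtering them early shrinks
// the build side instead of dragging the nulls through partition + concat.
final case class BuildRow(joinKey: Option[Int], payload: String)
final case class BuildBatch(rows: Vector[BuildRow])

def filterNullJoinKeys(batch: BuildBatch): BuildBatch =
  BuildBatch(batch.rows.filter(_.joinKey.isDefined))

// Applied lazily per batch, so filtering happens as batches arrive, before the
// single-batch vs. sub-partition decision is made.
def filteredBuildIterator(batches: Iterator[BuildBatch]): Iterator[BuildBatch] =
  batches.map(filterNullJoinKeys).filter(_.rows.nonEmpty)
```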
Really appreciate you finding the root cause. I will try to make the fix next.
With the linked PR, query16 at the 30k dataset passed after running for 26 minutes without any spilling in my verification. [Update] I still need to verify the latest version of the PR, but that is pending because the Spark2a disks are out of space now. [Update] The latest version also has query16 passing.
[Original issue description] NDS query 16 was hanging (still running for over 1.5 hours) at SF30K on an on-prem 8-node A100 cluster.
Query seems to be hanging here:
A task failed with this error as well: