[BUG] executor crash intermittently in scala2.13-built spark332 integration tests #9659
Comments
Saw another one of these for test_repartition_df_for_round_robin, also running for spark332. |
IIUC this is failing in local mode, and it's pretty odd to get an executor heartbeat timeout when the driver and executor are in the same JVM instance. What might be happening here is that the JVM is GC'ing a lot, causing executor heartbeat timers to expire before the executor can heartbeat in time. We saw instances of this with #9829, and I'm wondering if the fix for it in #9944 would also address this issue. Unfortunately #9944 only went into 24.02 so far, so we'd have to check the 24.02 pipeline for scala2.13 to see if there's any improvement going forward. |
Unfortunately, I think this failure has started to creep into pre-release pipelines which are running integration tests with As far as significant differences between the 3.3.2 shim and the 3.4.0 shim (where the issue doesn't seem to occur), there are the shimmed versions of In 3.4.0, |
I don't think the tests described above run under the shuffle, so I am not sure that these classes are getting instantiated. |
That's news to me, but don't we always use Seq in an immutable way? Even if we relied on some mutable methods, if this changed between scala versions, wouldn't scala 2.13 fail to compile that code? |
So we don't quite do that in all places in the code. There were a lot of places in the code where we returned something like a With these new sequences being created in 2.13, it's very conceivable that there is a lot more garbage collection going on. It's possible that there will have to be further re-writing of the code for better performance under Scala 2.13. What is interesting about these Rapids classes is not the plugins' usage of them, but the underlying Spark functionality which these graft onto. One of the big 3.4.0 changes was to start using |
Seeing the commit description here apache/spark@66c6aab they describe something like Note that the shuffle classes you pointed at ( BTW, which artifacts (spark versions) are we going to compile against 2.13? |
Officially we support Spark 3.3.0 and higher with Scala 2.13 |
I also filed #9952 to track this update. I do think this needs to be fixed in many places in our code. |
Found an executor log for run 29 of the scala213 dev pipeline, and it's getting shot out of the blue:
@pxLi are we sure there aren't container memory issues? Driver isn't shooting this, so it seems like a cgroup or OOM memory threshold thing that steps in and kills this. |
Should clarify, this is the log that shows this non-graceful kill:
|
From all the metrics collected for the failed runs, memory usage was well below the limit (60Gi), unless there was a spike that the metric collector did not catch in time. Let me ask SRE to help check the failed host to see if any other unexpected processes were running that could cause the issue |
I have filed a ticket with SRE and moved one suspicious node out of the pool for now |
It appears the Spark driver+worker is the one responsible for killing the executor. From a recent scala2.13+spark332 run, the worker log has:
and this is what the corresponding executor log has:
So the issue is that somehow the executor heartbeats are not making it back to the driver. Most likely either the executor heartbeat thread is stuck or has died, or the driver is somehow not able to receive or process the heartbeat messages in a timely manner. |
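To illustrate the failure mode described in the comment above, here is a minimal Python sketch of a driver-side heartbeat timeout check. It is a toy model and not Spark's implementation: the class name `HeartbeatTracker`, its methods, and the 120-second timeout are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatTracker:
    """Toy model of a driver tracking the last heartbeat of each executor."""

    timeout_s: float  # hypothetical timeout, in seconds
    last_seen: dict = field(default_factory=dict)

    def heartbeat(self, executor_id: str, now_s: float) -> None:
        # Record the time at which the driver last heard from this executor.
        self.last_seen[executor_id] = now_s

    def expired(self, now_s: float) -> list:
        # An executor is considered lost once its last heartbeat is older
        # than the timeout, whether or not the executor is still running.
        return sorted(
            executor_id
            for executor_id, seen in self.last_seen.items()
            if now_s - seen > self.timeout_s
        )


tracker = HeartbeatTracker(timeout_s=120.0)
tracker.heartbeat("0", now_s=0.0)
tracker.heartbeat("1", now_s=0.0)

# Executor "1" keeps heartbeating; executor "0" goes silent, for example
# because its heartbeat thread is stuck or has died.
tracker.heartbeat("1", now_s=100.0)

print(tracker.expired(now_s=110.0))  # neither executor has exceeded the timeout yet
print(tracker.expired(now_s=130.0))  # only the silent executor "0" is reported
```

In this model the driver cannot tell a dead heartbeat thread from a dead executor: both look like a stale `last_seen` entry, which fits an otherwise healthy executor being killed.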
Ah, this looks like a smoking gun. In the executor log of the executor that was ultimately killed, I found this:
Looks like the heartbeat thread died near the time that explains the heartbeat timeout. From the driver exception:
That means the last heartbeat seen by the driver was approximately at 21:41:26 - 00:02:17 = 21:39:09 which is just before the exception message seen in the executor. |
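The subtraction above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the two values come from the comment, while the calendar date is a placeholder because only the time of day matters.

```python
from datetime import datetime, timedelta

# Time at which the driver reported the timeout (from the comment above).
# The date is an arbitrary placeholder.
timeout_reported = datetime(2000, 1, 1, 21, 41, 26)

# Elapsed time since the last heartbeat, as reported by the driver (00:02:17).
since_last_heartbeat = timedelta(minutes=2, seconds=17)

last_heartbeat = timeout_reported - since_last_heartbeat
print(last_heartbeat.strftime("%H:%M:%S"))  # 21:39:09
```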
This is a bug in Spark 3.3.2, fixed in Spark 3.3.3 and Spark 3.4.0 (which explains why we never see it in Spark 3.4), see SPARK-39696. We should move our Scala 2.13 pipelines off of Spark 3.3.2 and use Spark 3.3.3 as the baseline for 3.3.x testing. |
This comment on the PR that fixes the bug confirms the environment (Scala 2.13, JVM 17) in which this error shows up: apache/spark#37206 (comment) |
Previously these 2 versions were selected to cover similar dataproc serverless runtimes. Please let us know,
@sameerz @SurajAralihalli @NVnavkumar @GaryShen2008 cc @NvTimLiu |
I think the plan now would be just to replace Spark 3.3.2 scala 2.13 integration test run with Spark 3.3.3, and keep 3.4.0 and 3.5.0 runs as is. I don't think we need to update the pom files at this time (correct me if I'm wrong @jlowe) |
I don't think we need to remove 3.3.2 support, but we cannot reliably test against Apache Spark 3.3.2 due to the known issue. Re: Dataproc serverless, testing against 3.3.3 should be close enough; otherwise we would need to run against a custom Apache Spark 3.3.2 build that has the fix for SPARK-39696 applied. Re: premerge, I do think we need to modify the pom file so that premerge runs against 333 instead of 332, due to the known issue with running Spark 3.3.2 + Scala 2.13. I'll post a PR to do that shortly. |
Thanks for the update! Let me close this ticket. Feel free to reopen it if any other ops work is required. |
Describe the bug
After setting up regular integration test CI for the scala2.13-built plugin (built with the default JDK 8, run with a Java 17 runtime),
we found intermittent executor crashes with no clear cause in the spark332 integration tests (spark340 passed fine)
failed spark332
passed spark340
Steps/Code to reproduce bug
Rerun the internal rapids_integration-scala213-dev-github pipeline (reproducible in ~50% of runs)
Expected behavior
The integration tests pass.
Environment details (please complete the following information)
Additional context