[BUG]Test simple pinned blocking alloc Failed nightly tests #10585

Closed
tgravescs opened this issue Mar 13, 2024 · 2 comments · Fixed by #10615

@tgravescs
Collaborator

Describe the bug
rapids_nightly-dev-github #1073 failed because the Spark 334 unit test "simple pinned blocking alloc" failed. It looks like the test timed out.

[2024-03-13T13:29:55.623Z]              TEST THREAD APPEARS TO BE STUCK
[2024-03-13T13:29:55.623Z] thread2
[2024-03-13T13:29:55.624Z]      com.nvidia.spark.rapids.jni.SparkResourceAdaptor.preCpuAlloc(Native Method)
[2024-03-13T13:29:55.624Z]      com.nvidia.spark.rapids.jni.SparkResourceAdaptor.preCpuAlloc(SparkResourceAdaptor.java:262)
[2024-03-13T13:29:55.624Z]      com.nvidia.spark.rapids.jni.RmmSpark.preCpuAlloc(RmmSpark.java:607)
[2024-03-13T13:29:55.624Z]      com.nvidia.spark.rapids.HostAlloc.tryAllocInternal(HostAlloc.scala:162)
[2024-03-13T13:29:55.624Z]      com.nvidia.spark.rapids.HostAlloc.alloc(HostAlloc.scala:218)
[2024-03-13T13:29:55.624Z]      com.nvidia.spark.rapids.HostAlloc$.alloc(HostAlloc.scala:258)
[2024-03-13T13:29:55.624Z]      com.nvidia.spark.rapids.HostAllocSuite$AllocOnAnotherThread.$anonfun$doAlloc$1(HostAllocSuite.scala:269)
[2024-03-13T13:29:55.624Z]      com.nvidia.spark.rapids.HostAllocSuite$AllocOnAnotherThread$$Lambda$570/563871766.apply$mcV$sp(Unknown Source)
[2024-03-13T13:29:55.624Z]      scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)

[2024-03-13T13:29:55.624Z] TEST THREAD ScalaTest-main-running-HostAllocSuite
[2024-03-13T13:29:55.624Z]      java.lang.Object.wait(Native Method)
[2024-03-13T13:29:55.624Z]      com.nvidia.spark.rapids.HostAllocSuite$TaskThread$TaskThreadTrackingOp.get(HostAllocSuite.scala:111)
[2024-03-13T13:29:55.624Z]      com.nvidia.spark.rapids.HostAllocSuite$AllocOnAnotherThread.waitForAlloc(HostAllocSuite.scala:221)
[2024-03-13T13:29:55.624Z]      com.nvidia.spark.rapids.HostAllocSuite.$anonfun$new$24(HostAllocSuite.scala:454)
[2024-03-13T13:29:55.624Z]      com.nvidia.spark.rapids.HostAllocSuite.$anonfun$new$24$adapted(HostAllocSuite.scala:449)
[2024-03-13T13:29:55.624Z]      com.nvidia.spark.rapids.HostAllocSuite$$Lambda$580/1058189103.apply(Unknown Source)
[2024-03-13T13:29:55.624Z]      com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
[2024-03-13T13:29:55.624Z]      com.nvidia.spark.rapids.HostAllocSuite.$anonfun$new$23(HostAllocSuite.scala:449)
[2024-03-13T13:29:55.624Z]      com.nvidia.spark.rapids.HostAllocSuite.$anonfun$new$23$adapted(HostAllocSuite.scala:445)
[2024-03-13T13:29:55.624Z]      com.nvidia.spark.rapids.HostAllocSuite$$Lambda$579/809665906.apply(Unknown Source)
[2024-03-13T13:29:55.624Z]      com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
[2024-03-13T13:29:55.624Z]      com.nvidia.spark.rapids.HostAllocSuite.$anonfun$new$21(HostAllocSuite.scala:445)
[2024-03-13T13:29:55.624Z]      com.nvidia.spark.rapids.HostAllocSuite$$Lambda$482/860099309.apply(Unknown Source)

@tgravescs tgravescs added bug Something isn't working ? - Needs Triage Need team to review and classify labels Mar 13, 2024
@sameerz sameerz changed the title [BUG]Test simple pinned blocking alloc Faiiled nightly tests [BUG]Test simple pinned blocking alloc Failed nightly tests Mar 13, 2024
@abellina abellina self-assigned this Mar 19, 2024
@abellina
Collaborator

abellina commented Mar 19, 2024

I can repro this and am looking into it. It appears to be a race condition in which a free/notify is not seen by the state machine, but I don't yet know why. It repros pretty consistently, and adding transition logging makes it go away.

@abellina
Collaborator

OK, here's what I see so far. The state in HostAlloc and the state machine in spark-rapids-jni are both updated via MemoryBuffer.onClosed. This callback is called before we free the memory from the pinned pool. Because of that, we update the state in HostAlloc to say that pinned memory is free and an allocation should be attempted, but the allocation then fails because the pinned pool hasn't seen the free yet (onClosed has triggered, but we haven't gotten to the free).

In this case HostAlloc will NOT retry the allocation; instead it sends the failing thread to the BLOCKED state. Unfortunately, the event that would have moved this thread out of BLOCKED was already triggered via onClosed, so we "miss the free" in HostAlloc.
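
The sequence above can be replayed with a small single-threaded model. This is only a sketch under stated assumptions: `MissedFreeSketch`, `PinnedPoolModel`, and `Waiter` are hypothetical names, not the real HostAlloc or spark-rapids-jni code, and the model simply hard-codes the worst-case interleaving in which the notification reaches the waiting thread before the pool has processed the free.

```java
// Hypothetical model of the ordering problem; not the real HostAlloc code.
class MissedFreeSketch {
    /** Toy pinned pool that only tracks how many bytes are in use. */
    static final class PinnedPoolModel {
        private final long capacity;
        private long used;

        PinnedPoolModel(long capacity) {
            this.capacity = capacity;
        }

        boolean tryAlloc(long bytes) {
            if (used + bytes > capacity) {
                return false; // pool has not seen enough frees yet
            }
            used += bytes;
            return true;
        }

        void free(long bytes) {
            used -= bytes;
        }
    }

    /** Toy waiter: retries once per notification, otherwise stays BLOCKED. */
    static final class Waiter {
        private final PinnedPoolModel pool;
        private final long bytes;
        String state = "BLOCKED";

        Waiter(PinnedPoolModel pool, long bytes) {
            this.pool = pool;
            this.bytes = bytes;
        }

        void onFreeNotification() {
            if (pool.tryAlloc(bytes)) {
                state = "RUNNING";
            }
            // On failure the waiter stays BLOCKED until another notification.
        }
    }

    /** Replays one close of a buffer that fills the whole pool. */
    static String replay(boolean callbackBeforeFree) {
        PinnedPoolModel pool = new PinnedPoolModel(1024);
        pool.tryAlloc(1024);                    // first buffer fills the pool
        Waiter waiter = new Waiter(pool, 1024); // second request has to wait
        if (callbackBeforeFree) {
            waiter.onFreeNotification(); // callback fires while pool is still full
            pool.free(1024);             // the real free arrives with nobody listening
        } else {
            pool.free(1024);             // memory is returned first
            waiter.onFreeNotification(); // retry now succeeds
        }
        return waiter.state;
    }

    public static void main(String[] args) {
        System.out.println("callback before free: " + replay(true));
        System.out.println("free before callback: " + replay(false));
    }
}
```

In this model the waiter ends up `BLOCKED` when the callback runs before the free and `RUNNING` when the free runs first, which matches the "missed free" described above.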

rapids-bot bot pushed a commit to rapidsai/cudf that referenced this issue Mar 20, 2024
Closes #15350. This PR changes the order of the callback `MemoryBuffer.onClosed` to happen after our `MemoryCleaner` finishes. This is done so that we can accurately, and safely, reflect the state of the memory resource (be it device or host). This PR is needed to address a bug found in spark-rapids here: NVIDIA/spark-rapids#10585.

Authors:
  - Alessandro Bellina (https://github.com/abellina)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Gera Shegalov (https://github.com/gerashegalov)

URL: #15351
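
A minimal sketch of the ordering that the commit message describes, assuming a simplified buffer with one cleanup step and one close listener. `CloseOrderSketch`, `BufferModel`, and the two `Runnable` hooks are hypothetical and do not reflect the actual cudf `MemoryBuffer` or `MemoryCleaner` API; the sketch only shows the cleanup running before the callback.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the close ordering; not the real cudf classes.
class CloseOrderSketch {
    static final class BufferModel implements AutoCloseable {
        private final Runnable cleaner;  // stands in for releasing the memory
        private final Runnable onClosed; // stands in for the close callback
        private boolean closed;

        BufferModel(Runnable cleaner, Runnable onClosed) {
            this.cleaner = cleaner;
            this.onClosed = onClosed;
        }

        @Override
        public void close() {
            if (closed) {
                return; // ignore repeated closes in this toy model
            }
            closed = true;
            cleaner.run();  // 1. return the memory to the resource
            onClosed.run(); // 2. only then tell listeners that memory is free
        }
    }

    /** Closes one buffer and returns the order in which the hooks ran. */
    static List<String> closeAndRecord() {
        List<String> events = new ArrayList<>();
        BufferModel buffer = new BufferModel(
            () -> events.add("memory freed"),
            () -> events.add("onClosed callback"));
        buffer.close();
        buffer.close(); // second close is a no-op
        return events;
    }

    public static void main(String[] args) {
        System.out.println(closeAndRecord());
    }
}
```

With this ordering, any listener that reacts to the callback by retrying an allocation sees the memory resource after the free has been applied.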
@mattahrens mattahrens removed the ? - Needs Triage Need team to review and classify label Mar 22, 2024