
Debug Session stuck for attach remote with Go Debugging on Theia #740

Closed · quoctruong opened this issue Oct 4, 2020 · 9 comments
Labels: Debug (issues related to the debugging functionality of the extension), FrozenDueToAge

quoctruong (Contributor) commented Oct 4, 2020

Repro:

  1. Launch a program in headless mode with Delve in --continue and --accept-multiclient mode (for example, the looping program shown later in this thread).
  2. Attach to it from Theia with the VS Code Go extension.
  3. Set a breakpoint and notice that it does not go through.
  4. Disconnecting also does not work.

Why this happens

Important Caveats

  • Internally we are using a field called continueRequestRunning to keep track of whether Delve is in a running state or not.
  • If Delve is in a running state and we issue a synchronous (blocking) call to it, any subsequent calls (both synchronous and asynchronous) will not get through until Delve changes to a halted state.

Theia's and Delve's Sequence of Events

To understand the root cause, we need to go through the sequence of events from both Theia's and Delve's perspectives. Note the step where the blocking call is issued, as that is crucial to understanding the problem. I also skipped some initial calls made when initializing Delve (getVersion). Special thanks to @polinasok for helping debug the issue and for the very helpful and detailed discussion of different scenarios.

Initializing Sequence (Not important)

Theia -> DAP: Sends InitializeRequest to the VS Code Go extension's debug adapter (DAP) to initialize.
Theia <- DAP: Receives a successful response from DAP.
Theia -> DAP: Sends SetBreakpointsRequest to DAP to set breakpoints if there are any (also sends setFunctionBreakpoints and setExceptionBreakpoints if applicable). // <==== this does make a difference - please see explanation below
Theia <- DAP: Receives successful response(s) for those events from DAP.

Post-Initializing Sequence

Theia -> DAP: Sends configurationDoneRequest to DAP to indicate the end of the configuration.
DAP -> Delve: Sends asynchronous call to get Delve's state.
DAP <- Delve: Receives Delve's state and sees that it is running because we started Delve with the --continue switch. Since this is the case, the code doesn't call the this.continue() function, which means this.continueRequestRunning is still set to false.
Theia <- DAP: Receives response from DAP for the configurationDoneRequest.
Theia -> DAP: Sends ThreadsRequest to DAP.
DAP -> Delve: Since DAP sees that this.continueRequestRunning is still false, DAP thinks that Delve is not in a running state and so it sends a non-asynchronous (BLOCKING) call ListGoroutines to Delve. Since Delve is in a running state, it won't return this call until it changes its state to Halted. However, since ListGoroutines is a BLOCKING call, no other requests will get through, including the request to Halt Delve.
Theia -> DAP: Sends setBreakpointsRequest to DAP, which will not go through.

As soon as we fix ThreadsRequest() to bypass the blocking call, the same issue occurs with the next call in the request waterfall, StackTraceRequest().
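
In code terms, the stuck path looks roughly like the following. This is a simplified, self-contained TypeScript sketch with hypothetical helper names (stateNonBlocking, listGoroutinesBlocking, continueTarget); it mirrors the sequence above rather than quoting the extension's actual source.

    // Simplified sketch of the deadlock described above (hypothetical helpers).
    interface DebuggerState { Running: boolean; }

    let continueRequestRunning = false; // only set when the adapter itself continues

    // Stand-ins for the adapter's RPC plumbing to Delve.
    async function stateNonBlocking(): Promise<DebuggerState> {
        return { Running: true }; // dlv was started with --continue, so the target runs
    }
    function listGoroutinesBlocking(): Promise<unknown[]> {
        return new Promise<unknown[]>(() => { /* never resolves while the target runs */ });
    }
    function continueTarget(): void { /* would ask Delve to continue the target */ }

    async function onConfigurationDone(): Promise<void> {
        const state = await stateNonBlocking();
        if (!state.Running) {
            // Normal launch: the adapter continues the target itself and records that.
            continueTarget();
            continueRequestRunning = true;
        }
        // With `dlv --continue` the target is ALREADY running, so the branch above is
        // skipped and continueRequestRunning stays false: the adapter now wrongly
        // believes the target is halted.
    }

    async function onThreadsRequest(): Promise<void> {
        if (!continueRequestRunning) {
            // Wrong branch: the target is actually running, so this blocking call never
            // returns, and every request queued behind it (halt, setBreakpoints, ...) hangs.
            await listGoroutinesBlocking();
        }
        // (running case: reply with a dummy thread instead of calling Delve)
    }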

Remedy

  • Instead of relying on the internal tracking field this.continueRequestRunning, we should issue a non-blocking async call to get Delve's state, and only fall back to this.continueRequestRunning if that call fails.
  • For synchronous (blocking) calls in ThreadsRequest and StacktraceRequest, we need to make sure Delve is in a halted state. Otherwise, we are not stopped at a breakpoint and should send back the dummy thread or an empty response. (A sketch of this approach follows the list.)
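
A minimal sketch of that approach, again in TypeScript with hypothetical helper names (only continueRequestRunning and the idea of a non-blocking state query come from the discussion above):

    // Sketch of the proposed remedy (hypothetical names, not the final fix).
    interface DebuggerState { Running: boolean; }

    let continueRequestRunning = false;

    async function stateNonBlocking(): Promise<DebuggerState> {
        // Stand-in for a non-blocking state query to Delve (getDebugState).
        return { Running: true };
    }

    // Ask Delve directly; fall back to the local flag only if the query fails.
    async function isTargetRunning(): Promise<boolean> {
        try {
            return (await stateNonBlocking()).Running;
        } catch {
            return continueRequestRunning;
        }
    }

    async function onThreadsRequest(reply: (threadNames: string[]) => void): Promise<void> {
        if (await isTargetRunning()) {
            // Not stopped at a breakpoint: answer with a dummy thread rather than
            // issuing a blocking goroutine listing against a running target.
            reply(['Dummy']);
            return;
        }
        // Target is halted: blocking Delve calls are safe, so list the real
        // goroutines and report them as threads here.
    }
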
hyangah (Contributor) commented Oct 5, 2020

@quoctruong @polinasok Thanks for inspecting the case thoroughly.

> DAP <- Delve: Receives Delve's state and sees that it is running because we started Delve with --continue switch. Since this is the case, the code doesn't call this.continue() function and this means this.continueRequestRunning is still set to false.

Shouldn't we just fix this part - i.e., if we see that Delve is in a running state, set this.continueRequestRunning to true? I wonder what the impact on performance/correctness would be if we completely switched to issuing a non-blocking async call to get Delve's state instead of locally tracking this.continueRequestRunning.

hyangah added the Debug label on Oct 5, 2020
quoctruong (Contributor, Author) commented Oct 5, 2020

@hyangah @polinasok Correctness-wise, I think getDebugState will always be the most accurate, since we are querying Delve directly at that moment. There could be more bugs in how we are keeping track of continueRequestRunning.

Performance-wise, getDebugState is supposed to return immediately (see https://github.com/go-delve/delve/blob/master/service/rpc2/server.go#L106), so I don't see this being a big issue. And of course, in the long run, when we switch to DAP in Delve, I don't think we will have to do this anymore.

polinasok (Contributor) commented

This issue is not unique to Theia. I am able to reproduce this problem locally with just VS Code on my Mac.

package main

import (
    "fmt"
    "time"
)

// Print and sleep forever so the target keeps running under `dlv --continue`.
func main() {
    for i := 0; true; i++ {
        fmt.Println("================== i", i)
        time.Sleep(2 * time.Second)
    }
}

Launch with

    dlv debug foo.go --headless --listen=:2345 --api-version=2 --accept-multiclient --continue

You can also start the program separately and use dlv attach with the same flags.

Debug configuration:

  "request": "attach"
  "mode": "remote"
  "port": 2345,
  "host": "127.0.0.1"

polinasok (Contributor) commented

Please also note that the problem does NOT occur if the editor has a breakpoint set before the debug session is started. In that case, the initialization sequence includes a SetBreakpointsRequest which, unlike ThreadsRequest, correctly checks both this.continueRequestRunning (tracking whether the adapter asked to continue; false in this case) AND this.debugState.Running (tracking the state obtained by querying dlv; true in this case). If the target is running, a halt request is issued to dlv before the breakpoint is set, followed by a call to continue(), which flips this.continueRequestRunning, so the subsequent ThreadsRequest does not get stuck.

This might explain why we are not seeing more user complaints for this issue. Users are likely to set all their breakpoints before starting a debug session that will not stop on entry.
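
In sketch form, that breakpoint path looks roughly like this (a hypothetical TypeScript sketch; only the two flags and the halt/continue behavior come from the description above):

    // Sketch of why the breakpoint path does not get stuck (hypothetical helpers).
    let continueRequestRunning = false;  // did the adapter itself ask to continue?
    let debugStateRunning = true;        // state from querying dlv; true under --continue

    // Stand-ins for the adapter's calls into Delve.
    async function halt(): Promise<void> { /* ask Delve to stop the target */ }
    async function applyBreakpoints(): Promise<void> { /* send the breakpoints to Delve */ }
    async function resume(): Promise<void> { /* ask Delve to continue the target */ }

    async function onSetBreakpointsRequest(): Promise<void> {
        // Unlike threadsRequest, this checks BOTH the local flag and the queried state.
        const targetRunning = continueRequestRunning || debugStateRunning;
        if (targetRunning) {
            await halt(); // a running target must be stopped before breakpoints are set
        }
        await applyBreakpoints();
        if (targetRunning) {
            await resume();                // continue() ...
            continueRequestRunning = true; // ... which flips the flag, so the later
                                           // threadsRequest takes the non-blocking path
        }
    }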

polinasok (Contributor) commented

About explicit calls to getDebugState. Funny, but I was about to file a separate issue for us to revise existing code and issue this call only when needed and not just always by default :)

If the debugger is not running in multi-client mode, we don't need to call this, because we always have accurate state from the context and from the various calls we issue to dlv that return state. We can check whether we are in this position with RPCServer.IsMulticlient.

In multi-client mode, things get more complicated. We might be connecting to a halted or a running dlv and must check which one with an explicit call to getDebugState (as in this issue). But we also need to review/test how we rely on state in other cases. What if our client requests to continue, but another client then halts delve? Do we get correctly notified? I think so; that would be just like hitting any other breakpoint. But what if we are in the middle of processing a breakpoint and another client issues a request to continue? Would we detect that before issuing the many blocking requests?

Adding to what was mentioned earlier, upon a stopped event, a Threads request is followed by a waterfall of requests (StackTrace, Scopes, Variables, ...) that are translated into blocking dlv RPCs. We need to make sure our code can detect that the target has started running while we are processing any of these, or before any other blocking call to delve.
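
One possible shape for that check, as a hypothetical TypeScript sketch (IsMulticlient is the Delve RPC mentioned above; the helper names are made up):

    // Sketch: only pay for an explicit state query when our local state can be stale.
    interface DebuggerState { Running: boolean; }

    let lastKnownState: DebuggerState = { Running: false }; // tracked from our own commands

    // Stand-ins for the adapter's RPC calls into Delve.
    async function isMulticlient(): Promise<boolean> {
        // Would wrap Delve's IsMulticlient RPC.
        return true;
    }
    async function queryStateNonBlocking(): Promise<DebuggerState> {
        // Would wrap a non-blocking state query (getDebugState).
        return { Running: true };
    }

    async function currentState(): Promise<DebuggerState> {
        if (await isMulticlient()) {
            // Another client (or dlv --continue) may have changed the target's state,
            // so ask Delve directly.
            return queryStateNonBlocking();
        }
        // Single-client mode: our own bookkeeping is authoritative, no extra RPC needed.
        return lastKnownState;
    }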

quoctruong (Contributor, Author) commented

@polinasok Thank you for the follow-up. About the getDebugState calls, I think we should address that separately, as it will require more in-depth investigation, as you stated. I also feel that the priority of that issue is lower than this current one.

polinasok (Contributor) commented Oct 7, 2020

@quoctruong Agree that optimizing getDebugState is its own issue. But adding more such calls to guard against more instances of getting stuck might be relevant for this issue, depending on its scope. It started as just getting stuck due to a blocking threads request, right? But CL 257337 with the fix evolved into fixing more issues that cause additional instances of getting stuck. We seem to be running into more and more of these as we investigate. First it was only threads in remote-attach mode, then stackTrace in remote-attach mode. Now I was also able to get the session stuck on a blocking dlv call with an accidentally issued double continue (F5 followed by a pause), and also with an evaluate request while the process is running. Is this issue evolving to track all of these, or should we be filing separate issues with steps to reproduce and logs for each?

polinasok reopened this on Oct 7, 2020
polinasok (Contributor) commented

Want to add that stackTraceRequest should return an error when it cannot be sent to a running delve, so that the next waterfall request doesn't kick in. Otherwise, we would need to put another guard for this in ScopesRequest.
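
For example, something along these lines could stop the waterfall early (a hypothetical TypeScript sketch, not the actual change):

    // Sketch: fail the stackTrace request instead of blocking on a running target.
    async function onStackTraceRequest(
        targetRunning: boolean,
        sendError: (message: string) => void,
        sendFrames: () => Promise<void>
    ): Promise<void> {
        if (targetRunning) {
            // Returning an error (rather than silently returning nothing) means the
            // client stops the request waterfall here, so no extra guard is needed
            // in ScopesRequest or the requests after it.
            sendError('cannot produce a stack trace while the program is running');
            return;
        }
        await sendFrames(); // target is halted, so blocking Delve calls will return
    }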

polinasok reopened this on Oct 12, 2020
gopherbot (Collaborator) commented

Change https://golang.org/cl/257337 mentions this issue: debugAdapter: Fix bugs that cause remote debug to hang
