
Debug Session stuck for attach remote with Go Debugging on Theia #740

Closed · quoctruong opened this issue Oct 4, 2020 · 9 comments
Labels: Debug (issues related to the debugging functionality of the extension), FrozenDueToAge

quoctruong (Contributor) commented Oct 4, 2020

Repro:

  1. Launch a program in headless mode with Delve in --continue and --accept-multiclient mode (for example, the looping program shown later in this thread).
  2. Attach to it from Theia with the VS Code Go extension.
  3. Set a breakpoint and notice that it does not go through.
  4. Disconnecting also does not work.

Why this happens

Important Caveats

  • Internally we are using a field called continueRequestRunning to keep track of whether Delve is in a running state or not.
  • If Delve is in a running state and we issue a synchronous (blocking) call to it, any subsequent calls (both synchronous and asynchronous) will not get through until Delve changes to a halted state.

Theia's and Delve's Sequence of Events

To understand the root cause, we need to go through the sequence of events from both Theia's and Delve's perspectives. Note the step where the blocking call is issued, as that is crucial to understanding the problem. I also skipped some initial calls made when initializing Delve (getVersion). Special thanks to @polinasok for helping debug the issue and for the very helpful and detailed discussion of different scenarios.

Initializing Sequence (Not important)

Theia -> DAP: Sends InitializeRequest to the VS Code Go extension's debug adapter (DAP) to initialize.
Theia <- DAP: Receives a successful response from DAP.
Theia -> DAP: Sends SetBreakpointsRequest to DAP to set breakpoints if there are any (also sends setFunctionBreakpoints and setExceptionBreakpoints if applicable). // <==== this does make a difference - please see explanation below
Theia <- DAP: Receives successful response(s) for those events from DAP.

Post-Initializing Sequence

Theia -> DAP: Sends configurationDoneRequest to DAP to indicate the end of the configuration.
DAP -> Delve: Sends asynchronous call to get Delve's state.
DAP <- Delve: Receives Delve's state and sees that it is running because we started Delve with the --continue switch. Since this is the case, the code doesn't call the this.continue() function, which means this.continueRequestRunning is still set to false.
Theia <- DAP: Receives response from DAP for the configurationDoneRequest.
Theia -> DAP: Sends ThreadsRequest to DAP.
DAP -> Delve: Since DAP sees that this.continueRequestRunning is still false, DAP thinks that Delve is not in a running state and so it sends a non-asynchronous (BLOCKING) call ListGoroutines to Delve. Since Delve is in a running state, it won't return this call until it changes its state to Halted. However, since ListGoroutines is a BLOCKING call, no other requests will get through, including the request to Halt Delve.
Theia -> DAP: Sends setBreakpointsRequest to DAP, which will not go through.

As soon as we fix ThreadsRequest() to bypass the blocking call, the same issue occurs with the next call in the request waterfall, StackTraceRequest().
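
In code terms, the stuck path looks roughly like the following. This is a simplified, self-contained TypeScript sketch with hypothetical helper names (stateNonBlocking, listGoroutinesBlocking, continueTarget); it mirrors the sequence above rather than quoting the extension's actual source.

    // Simplified sketch of the deadlock described above (hypothetical helpers).
    interface DebuggerState { Running: boolean; }

    let continueRequestRunning = false; // only set when the adapter itself continues

    // Stand-ins for the adapter's RPC plumbing to Delve.
    async function stateNonBlocking(): Promise<DebuggerState> {
        return { Running: true }; // dlv was started with --continue, so the target runs
    }
    function listGoroutinesBlocking(): Promise<unknown[]> {
        return new Promise<unknown[]>(() => { /* never resolves while the target runs */ });
    }
    function continueTarget(): void { /* would ask Delve to continue the target */ }

    async function onConfigurationDone(): Promise<void> {
        const state = await stateNonBlocking();
        if (!state.Running) {
            // Normal launch: the adapter continues the target itself and records that.
            continueTarget();
            continueRequestRunning = true;
        }
        // With `dlv --continue` the target is ALREADY running, so the branch above is
        // skipped and continueRequestRunning stays false: the adapter now wrongly
        // believes the target is halted.
    }

    async function onThreadsRequest(): Promise<void> {
        if (!continueRequestRunning) {
            // Wrong branch: the target is actually running, so this blocking call never
            // returns, and every request queued behind it (halt, setBreakpoints, ...) hangs.
            await listGoroutinesBlocking();
        }
        // (running case: reply with a dummy thread instead of calling Delve)
    }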

Remedy

  • Instead of relying on the internal tracking field this.continueRequestRunning, we should issue a non-blocking async call to get Delve's state, and only fall back to this.continueRequestRunning if that call fails.
  • For synchronous (blocking) calls in ThreadsRequest and StacktraceRequest, we need to make sure Delve is in a halted state. Otherwise, we are not stopped at a breakpoint and should send back the dummy thread or an empty response. (A sketch of this approach follows the list.)
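
A minimal sketch of that approach, again in TypeScript with hypothetical helper names (only continueRequestRunning and the idea of a non-blocking state query come from the discussion above):

    // Sketch of the proposed remedy (hypothetical names, not the final fix).
    interface DebuggerState { Running: boolean; }

    let continueRequestRunning = false;

    async function stateNonBlocking(): Promise<DebuggerState> {
        // Stand-in for a non-blocking state query to Delve (getDebugState).
        return { Running: true };
    }

    // Ask Delve directly; fall back to the local flag only if the query fails.
    async function isTargetRunning(): Promise<boolean> {
        try {
            return (await stateNonBlocking()).Running;
        } catch {
            return continueRequestRunning;
        }
    }

    async function onThreadsRequest(reply: (threadNames: string[]) => void): Promise<void> {
        if (await isTargetRunning()) {
            // Not stopped at a breakpoint: answer with a dummy thread rather than
            // issuing a blocking goroutine listing against a running target.
            reply(['Dummy']);
            return;
        }
        // Target is halted: blocking Delve calls are safe, so list the real
        // goroutines and report them as threads here.
    }
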
hyangah (Contributor) commented Oct 5, 2020

@quoctruong @polinasok Thanks for inspecting the case thoroughly.

> DAP <- Delve: Receives Delve's state and sees that it is running because we started Delve with --continue switch. Since this is the case, the code doesn't call this.continue() function and this means this.continueRequestRunning is still set to false.

Shouldn't we just fix this part - i.e., if we see that Delve is in a running state, set this.continueRequestRunning to true? I wonder what the impact on performance/correctness would be if we completely switched to issuing a non-blocking async call to get Delve's state instead of locally tracking this.continueRequestRunning.

hyangah added the Debug label on Oct 5, 2020
quoctruong (Contributor, Author) commented Oct 5, 2020

@hyangah @polinasok Correctness-wise, I think getDebugState will always be the most accurate, since we are querying Delve directly at that moment. There could be more bugs in how we are keeping track of continueRequestRunning.

Performance-wise, getDebugState is supposed to return immediately (see https://github.com/go-delve/delve/blob/master/service/rpc2/server.go#L106), so I don't see this being a big issue. And of course, in the long run, when we switch to DAP in Delve, I don't think we will have to do this anymore.

polinasok (Contributor) commented

This issue is not unique to Theia. I am able to reproduce this problem locally with just VS Code on my Mac.

package main

import (
    "fmt"
    "time"
)

// Print and sleep forever so the target keeps running under `dlv --continue`.
func main() {
    for i := 0; true; i++ {
        fmt.Println("================== i", i)
        time.Sleep(2 * time.Second)
    }
}

Launch with

    dlv debug foo.go --headless --listen=:2345 --api-version=2 --accept-multiclient --continue

You can also start the program separately and use dlv attach with the same flags.

Debug configuration:

  "request": "attach"
  "mode": "remote"
  "port": 2345,
  "host": "127.0.0.1"

polinasok (Contributor) commented

Please also note that the problem does NOT occur if the editor has a breakpoint set before the debug session is started. In that case, the initialization sequence includes a SetBreakpointsRequest which, unlike ThreadsRequest, correctly checks both this.continueRequestRunning (tracking whether the adapter asked to continue; false in this case) AND this.debugState.Running (tracking the state obtained by querying dlv; true in this case). If the target is running, a halt request is issued to dlv before the breakpoint is set, followed by a call to continue(), which flips this.continueRequestRunning, so the subsequent ThreadsRequest does not get stuck.

This might explain why we are not seeing more user complaints for this issue. Users are likely to set all their breakpoints before starting a debug session that will not stop on entry.
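
In sketch form, that breakpoint path looks roughly like this (a hypothetical TypeScript sketch; only the two flags and the halt/continue behavior come from the description above):

    // Sketch of why the breakpoint path does not get stuck (hypothetical helpers).
    let continueRequestRunning = false;  // did the adapter itself ask to continue?
    let debugStateRunning = true;        // state from querying dlv; true under --continue

    // Stand-ins for the adapter's calls into Delve.
    async function halt(): Promise<void> { /* ask Delve to stop the target */ }
    async function applyBreakpoints(): Promise<void> { /* send the breakpoints to Delve */ }
    async function resume(): Promise<void> { /* ask Delve to continue the target */ }

    async function onSetBreakpointsRequest(): Promise<void> {
        // Unlike threadsRequest, this checks BOTH the local flag and the queried state.
        const targetRunning = continueRequestRunning || debugStateRunning;
        if (targetRunning) {
            await halt(); // a running target must be stopped before breakpoints are set
        }
        await applyBreakpoints();
        if (targetRunning) {
            await resume();                // continue() ...
            continueRequestRunning = true; // ... which flips the flag, so the later
                                           // threadsRequest takes the non-blocking path
        }
    }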

polinasok (Contributor) commented

About explicit calls to getDebugState. Funny, but I was about to file a separate issue for us to revise existing code and issue this call only when needed and not just always by default :)

If the debugger is not running in multi-client mode, we don't need to call this, because we always have accurate state from the context and from the various calls we issue to dlv that return state. We can check whether we are in this position with RPCServer.IsMulticlient.

In multi-client mode, things get more complicated. We might be connecting to a halted or a running dlv and must check which one with an explicit call to getDebugState (as in this issue). But we also need to review/test how we rely on state in other cases. What if our client requests to continue, but another client then halts delve? Do we get correctly notified? I think so; that would be just like hitting any other breakpoint. But what if we are in the middle of processing a breakpoint and another client issues a request to continue? Would we detect that before issuing the many blocking requests?

Adding to what was mentioned earlier, upon a stopped event, a Threads request is followed by a waterfall of requests (StackTrace, Scopes, Variables, ...) that are translated into blocking dlv RPCs. We need to make sure our code can detect that the target has started running while we are processing any of these, or before any other blocking call to delve.
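
One possible shape for that check, as a hypothetical TypeScript sketch (IsMulticlient is the Delve RPC mentioned above; the helper names are made up):

    // Sketch: only pay for an explicit state query when our local state can be stale.
    interface DebuggerState { Running: boolean; }

    let lastKnownState: DebuggerState = { Running: false }; // tracked from our own commands

    // Stand-ins for the adapter's RPC calls into Delve.
    async function isMulticlient(): Promise<boolean> {
        // Would wrap Delve's IsMulticlient RPC.
        return true;
    }
    async function queryStateNonBlocking(): Promise<DebuggerState> {
        // Would wrap a non-blocking state query (getDebugState).
        return { Running: true };
    }

    async function currentState(): Promise<DebuggerState> {
        if (await isMulticlient()) {
            // Another client (or dlv --continue) may have changed the target's state,
            // so ask Delve directly.
            return queryStateNonBlocking();
        }
        // Single-client mode: our own bookkeeping is authoritative, no extra RPC needed.
        return lastKnownState;
    }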

quoctruong (Contributor, Author) commented

@polinasok Thank you for the follow-up. About the getDebugState calls, I think we should address that separately, as it will require more in-depth investigation, as you stated. I also feel that the priority of that issue is lower than this current one.

polinasok (Contributor) commented Oct 7, 2020

@quoctruong Agree that optimizing getDebugState is its own issue. But adding more such calls to guard against more instances of getting stuck might be relevant for this issue, depending on its scope. It started as just getting stuck due to a blocking threads request, right? But CL 257337 with the fix evolved into fixing more issues that cause additional instances of getting stuck. We seem to be running into more and more of these as we investigate. First it was only threads in remote-attach mode, then stackTrace in remote-attach mode. Now I was also able to get the session stuck on a blocking dlv call with an accidentally issued double continue (F5 followed by a pause), and also with an evaluate request while the process is running. Is this issue evolving to track all of these, or should we be filing separate issues with steps to reproduce and logs for each?

polinasok reopened this on Oct 7, 2020
polinasok (Contributor) commented

Want to add that stackTraceRequest should return an error when it cannot be sent to a running delve, so that the next waterfall request doesn't kick in. Otherwise, we would need to put another guard for this in ScopesRequest.
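
For example, something along these lines could stop the waterfall early (a hypothetical TypeScript sketch, not the actual change):

    // Sketch: fail the stackTrace request instead of blocking on a running target.
    async function onStackTraceRequest(
        targetRunning: boolean,
        sendError: (message: string) => void,
        sendFrames: () => Promise<void>
    ): Promise<void> {
        if (targetRunning) {
            // Returning an error (rather than silently returning nothing) means the
            // client stops the request waterfall here, so no extra guard is needed
            // in ScopesRequest or the requests after it.
            sendError('cannot produce a stack trace while the program is running');
            return;
        }
        await sendFrames(); // target is halted, so blocking Delve calls will return
    }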

polinasok reopened this on Oct 12, 2020
gopherbot (Collaborator) commented

Change https://golang.org/cl/257337 mentions this issue: debugAdapter: Fix bugs that cause remote debug to hang
