Add opentracing support #322
The plan for adding opentracing support to the agent is currently not clear: nominally, a "host system" will run a Jaeger agent (which the application will send spans to via UDP). However, we really do not want to have to add another component to the guest images. Further, adding an extra component to the image when the agent is running as PID 1 will require custom code to launch the Jaeger agent at VM startup... by the Kata agent. Clearly, that isn't going to work unless the Jaeger client could "buffer" the spans in readiness for sending to the Jaeger agent when it eventually started.

The current thinking is to mandate a Jaeger agent in the host environment only (the host the runtime runs on). We could use a virtio-serial port or vsock to allow the Kata agent to send spans out of the VM back to the Jaeger agent running outside the VM. However, vsock can't really be the only solution given some users run old host kernels that do not have vsock support (and they cannot change the host kernel).

The current thinking is that we either offer multiple approaches to extracting the spans from the agent inside the VM (yuck), or we:
/cc @sameo, @mcastelino, @sboeuf, @bergwolf, @grahamwhaley. |
Just a brief (slightly devils advocate) thought.... given:
then I'd not be unhappy if we said, at least initially, that we would implement agent tracing via VSOCK, and if you don't have VSOCK available on your system, then, well, sorry, but you still have 90+% of the useful information just by looking at the runtime traces. We could always put non-VSOCK as a secondary item, and if a member really really needed it, they are free to work with us to PR it ;-) |
I guess that could work, but it would be good to get input from some power users about whether this approach would work for them. P.S. You missed an opportunity to use 👿 😄 |
Add OpenTracing [1] support using Jaeger [2].

Full details are provided in `TRACING.md`.

Fixes kata-containers#322.

[1] - https://opentracing.io
[2] - https://jaegertracing.io

Signed-off-by: James O. D. Hunt <[email protected]>
We're long overdue for an update on this issue, so brace yourself as I attempt to summarise the current situation...

Current serial behaviour

Startup

The default configuration causes the agent to connect to a virtio serial channel. That serial channel is wrapped in a yamux connection. The server uses this channel to create a gRPC server. The server serves "forever" - that is to say, the agent never exits the

Shutdown

The agent "never exits"; once the workload has finished, the runtime sends a QMP

However, iff a QMP

Current vsock behaviour

Basically the same as serial but with one important exception - the gRPC server cannot exit since the vsock code path does not use yamux. This means the server has no way of detecting the client has disconnected and just blocks forever waiting for a new client to connect.

Ideal behaviour

The agent would exit the

Given the design of the grpc package, golang constraints and our requirements, the only clean way I can find to shut down the agent in a controlled fashion is to require yamux since that layer provides the agent with the ability to detect when a client (

Notes

Note that we could create two codepaths and only enable vsock with yamux if:
That would be annoying in that we'd have two paths to test, but would not impact the default no-proxy configuration.

Challenges

With vsock+yamux, the agent shuts down cleanly, meaning that a full trace is available.... theoretically :) However, the overall system does not work. The problem is that whenever a client disconnect occurs, the agent's gRPC server will exit (and restart). Forcing the existing keep-alive option in the runtime helps to stop the agent gRPC server from exiting, but is insufficient since disconnects still occur in the runtime and that leads to the runtime getting gRPC errors it does not expect (

These errors are caused by the fact that the API calls in

One solution to this would be to have a new API that stores such options in a "handle" that is passed between the API calls for options like the existing

Compromise behaviour

Adding yamux support on top of vsock should not really be necessary. What we could do is to only trace the gRPC calls themselves. This could be accomplished by:
Pros
Cons
|
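To make the "only trace the gRPC calls themselves" idea a little more concrete, here is a rough sketch in plain Go. It is not the agent's code and it does not use the real gRPC interceptor API: `Span`, `Recorder`, `Handler` and `Traced` are hypothetical names invented for illustration, and the call names in `main` are placeholders. The only point is that wrapping each call handler yields one span per call without needing to trace agent startup or shutdown.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// Span is a hypothetical, minimal record of one traced call.
type Span struct {
	Name     string
	Start    time.Time
	Duration time.Duration
	Err      error
}

// Recorder collects finished spans and is safe for concurrent use.
type Recorder struct {
	mu    sync.Mutex
	spans []Span
}

func (r *Recorder) add(s Span) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.spans = append(r.spans, s)
}

// Spans returns a copy of the spans recorded so far.
func (r *Recorder) Spans() []Span {
	r.mu.Lock()
	defer r.mu.Unlock()
	return append([]Span(nil), r.spans...)
}

// Handler stands in for a single API call handler (an assumption, not
// the signature used by gRPC).
type Handler func(req string) (string, error)

// Traced wraps h so that each invocation produces exactly one span,
// whether the call succeeds or fails.
func Traced(name string, r *Recorder, h Handler) Handler {
	return func(req string) (string, error) {
		start := time.Now()
		resp, err := h(req)
		r.add(Span{Name: name, Start: start, Duration: time.Since(start), Err: err})
		return resp, err
	}
}

func main() {
	rec := &Recorder{}
	okCall := Traced("example-call", rec, func(req string) (string, error) {
		return "handled " + req, nil
	})
	badCall := Traced("failing-call", rec, func(req string) (string, error) {
		return "", errors.New("simulated failure")
	})
	okCall("request-1")
	badCall("request-2")
	for _, s := range rec.Spans() {
		fmt.Printf("%s err=%v\n", s.Name, s.Err)
	}
}
```

With real gRPC the same shape would presumably live in a server interceptor, but that wiring is left out here on purpose.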
/cc @mcastelino. |
@jodh-intel I worked on this today, and I have been able to reproduce the weird behavior of the So to summarize, I have been testing with
If you want to make sure about the grpcServer.Serve() exiting without using any extra traces, you can use the following command:
right after you started your container. The Anyway, the point being that you should see the following traces from the socat terminal:
at the moment you exit your container. Oh yeah, one more thing, I have added a Let me know if |
@sboeuf - Mon Dieu! After spending the morning trying to recreate your findings, I think I now have. I was already using the latest vsock, but I was still using a go 1.10 compiler. Switching to 1.11 I am seeing what looks like the correct behaviour 🙏! It's like a miracle 😂 Compiler bugs suck though. More testing needed, but this is looking promising at last! |
Glad to hear that 👍 |
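For anyone following along, the behaviour being relied on here is that a blocked `Accept()` returns once the listener is closed, which is what lets an accept loop such as the one behind `grpcServer.Serve()` exit. The sketch below shows that behaviour with a plain TCP loopback listener from the Go standard library, purely as a stand-in for vsock; `serveUntilClosed` and `closeUnblocksAccept` are hypothetical helpers, not agent or vsock package code.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// serveUntilClosed accepts connections until Accept fails (for example
// because the listener was closed) and returns that error. It loosely
// mimics the accept loop inside a server's Serve() method.
func serveUntilClosed(l net.Listener) error {
	for {
		conn, err := l.Accept()
		if err != nil {
			return err
		}
		conn.Close()
	}
}

// closeUnblocksAccept reports whether closing the listener makes a
// blocked Accept return within timeoutMs milliseconds.
func closeUnblocksAccept(timeoutMs int) (bool, error) {
	l, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return false, err
	}
	done := make(chan error, 1)
	go func() { done <- serveUntilClosed(l) }()

	// Give the goroutine a moment to block in Accept, then close.
	time.Sleep(50 * time.Millisecond)
	l.Close()

	select {
	case <-done:
		return true, nil
	case <-time.After(time.Duration(timeoutMs) * time.Millisecond):
		return false, nil
	}
}

func main() {
	ok, err := closeUnblocksAccept(2000)
	if err != nil {
		fmt.Println("listen failed:", err)
		return
	}
	fmt.Println("accept loop exited after Close:", ok)
}
```

If the loop does not exit after `Close()`, the server can never shut down cleanly, which matches the hang described above.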
Add OpenTracing [1] support using Jaeger [2].

Full details are provided in `TRACING.md`.

Updated vendoring for github.com/mdlayher/vsock to resolve hangs using a vsock socket with `grpcServer.Serve()`. Changes:

    498f144 Handle return result in Accept test
    fda437e Fix unblocking after closing listener
    4b12813 Add go.mod
    ce2ff06 vsock: factor out newConn function on Linux
    d8b0f13 vsock: adjust listener test for nonblocking
    7a158c6 vsock: enable timeouts also on listener connections
    f68ad55 vsock: allow timeouts with Go 1.11
    d0067a6 vsock/conn: don't use struct embedding

Note: the agent **must** be built with golang 1.11* or newer to ensure correct behaviour.

Fixes kata-containers#322.

[1] - https://opentracing.io
[2] - https://jaegertracing.io

Signed-off-by: James O. D. Hunt <[email protected]>
@jodh-intel given that opencensus supports creating custom exporters, I attempted to create a simple PoC to allow exporting of traces over any network transport. At least conceptually, this should work. https://github.com/mcastelino/custom_exporter Here I create a custom exporter which serializes the traces into the thrift format used by Jaeger.
Basically you can chop the exporter into two parts
On the host we would need something like
|
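As a rough illustration of the "two parts" split, the sketch below has a guest-side half that serialises spans onto any byte stream and a host-side half that reads them back. Everything here is an assumption made for the example: it uses length-prefixed JSON rather than thrift, a `bytes.Buffer` stands in for the vsock or serial channel, and `Span`, `WriteSpan`, `ReadSpan` and `roundTrip` are hypothetical names, not OpenCensus or Jaeger APIs.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"encoding/json"
	"errors"
	"fmt"
	"io"
)

// maxFrame bounds a single serialised span (an arbitrary choice).
const maxFrame = 1 << 20

// Span is a hypothetical, simplified span; real tracers carry far more.
type Span struct {
	TraceID string `json:"trace_id"`
	Name    string `json:"name"`
	Micros  int64  `json:"micros"`
}

// WriteSpan is the guest-side half: it writes one span to w as a 4-byte
// big-endian length followed by the JSON payload.
func WriteSpan(w io.Writer, s Span) error {
	payload, err := json.Marshal(s)
	if err != nil {
		return err
	}
	if len(payload) > maxFrame {
		return errors.New("span too large")
	}
	var hdr [4]byte
	binary.BigEndian.PutUint32(hdr[:], uint32(len(payload)))
	if _, err := w.Write(hdr[:]); err != nil {
		return err
	}
	_, err = w.Write(payload)
	return err
}

// ReadSpan is the host-side half: it reads one framed span from r and
// returns io.EOF when the stream ends cleanly between frames.
func ReadSpan(r io.Reader) (Span, error) {
	var s Span
	var hdr [4]byte
	if _, err := io.ReadFull(r, hdr[:]); err != nil {
		return s, err
	}
	n := binary.BigEndian.Uint32(hdr[:])
	if n > maxFrame {
		return s, errors.New("frame too large")
	}
	payload := make([]byte, n)
	if _, err := io.ReadFull(r, payload); err != nil {
		return s, err
	}
	err := json.Unmarshal(payload, &s)
	return s, err
}

// roundTrip pushes spans through an in-memory transport and reads them
// back, standing in for the guest-to-host hop.
func roundTrip(in []Span) ([]Span, error) {
	var transport bytes.Buffer
	for _, s := range in {
		if err := WriteSpan(&transport, s); err != nil {
			return nil, err
		}
	}
	var out []Span
	for {
		s, err := ReadSpan(&transport)
		if err == io.EOF {
			return out, nil
		}
		if err != nil {
			return nil, err
		}
		out = append(out, s)
	}
}

func main() {
	out, err := roundTrip([]Span{{TraceID: "t1", Name: "example-call", Micros: 1200}})
	if err != nil {
		fmt.Println("round trip failed:", err)
		return
	}
	fmt.Printf("%+v\n", out)
}
```

Because both halves only depend on `io.Writer` and `io.Reader`, the transport could be swapped without touching the span-handling code, which is the attraction of the split.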
Hi @mcastelino - opencensus sounds interesting but atm Kata is solely using opentracing. As such, I'd like to focus on landing basic agent tracing support using opentracing as soon as possible. Once that is in place, we can consider alternative tracers by raising the topic at the Architecture Committee meeting (and since it sounds from what you're saying that most of the alternatives are mostly compatible with each other, it shouldn't be too disruptive to switch at a later date). |
@jodh-intel I'd argue that we should use the custom exporter as part of the initial PR for Opentracing. The reason is simple: by using this exporter, we can use some Go code to handle the vsock connection, whereas for now the PR introduces more systemd services to start scripts handling the vsock to UDP translation. The huge benefit is that on the agent side there will be no need for any vsock translation, and the agent code will only be manipulating generic Opentracing code. Last point: by removing the systemd services, we will be able to support Opentracing when the agent is running as init as well. |
@sboeuf @jodh-intel we can do this after James's initial PR lands. I just wanted to highlight that moving to opencensus allows us to address the issue once and for all. The bulk of the code should remain mostly unchanged. So the primary goal of moving to opencensus was not to support an alternate backend, but rather to get back some control in our code over how we cross the VM boundary. With opencensus we can do the following
Note: I was forced to marshal to thrift because the opencensus span had
The offending field is
|
@mcastelino but if we land @jodh-intel's PR first, we would also need to merge the rootfs part of it, including the new systemd service files, and then later, when we include a custom exporter, we'll need to update the rootfs again by removing those files. |
@sboeuf, @mcastelino - I'm definitely in favour of dropping vsock/udp scaffolding code if we can replace it with something simpler. However, opencensus is going to take (me at least) time to grok and internalise. That's not to say we shouldn't do it of course, but fwics we'd need to add support for another protocol definition language in the agent build (thrift auto-generated code to accompany the existing gRPC code) and we'll still need host-side scaffold support if I'm understanding this correctly. On the topic of thrift though @mcastelino - could you add the |
As I say, switching yet again to a different approach is going to take time. I'm happy to look at it but will need help from @mcastelino still I suspect. @egernst may have input based on the hope of getting rudimentary agent tracing support landed in the next release (which looks somewhat unlikely from where I'm sitting now). If we are indeed time-pressured, we can work collectively on this. In fact, there's no reason we couldn't land an opencensus custom exporter PR into the agent now and I could then rework this branch yet again to leverage it. |
OK. Sitting comfortably? @sboeuf, @mcastelino and myself have just had a chat about juggling eels^W^Wagent tracing. What we're planning is to take a staged approach to introducing agent tracing. The plan:

Land #415 with modifications

The current implementation works and provides full tracing (from agent start to agent shutdown). However, the current implementation will theoretically require runtime support (to disable QMP

What we plan to do is add two new gRPC APIs to #415:

The behaviour will be as follows:

With the exception of the

Create a follow-up PR to rework the agent tracing code to switch to OpenCensus

The advantages of this being:
Update the runtime to support tracing
Replace the host-side vsock->udp proxy script

#415 introduces a

What is required is a service that runs for the duration of the VM to forward traces onto the eventual trace collector. But we already have such a service - the

Further optimisations

There may be future enhancements we can make to improve the tracing even further. |
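The two new calls are named `StartTracing()` and `StopTracing()` in the later commit message on this issue. Their exact semantics are not spelled out in what survives of this comment, so the sketch below is only one plausible reading: a mutex-protected flag that the two calls toggle, with spans recorded only while it is set. The `Tracer` type, the `Record` helper, the string spans and the error-on-double-call behaviour are all assumptions for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Tracer is a hypothetical stand-in for the agent's tracing state.
type Tracer struct {
	mu      sync.Mutex
	enabled bool
	spans   []string
}

// StartTracing enables span collection. Treating a second call as an
// error is an assumption, not documented behaviour.
func (t *Tracer) StartTracing() error {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.enabled {
		return errors.New("tracing already enabled")
	}
	t.enabled = true
	return nil
}

// StopTracing disables collection and returns the spans gathered since
// the matching StartTracing call.
func (t *Tracer) StopTracing() ([]string, error) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if !t.enabled {
		return nil, errors.New("tracing not enabled")
	}
	t.enabled = false
	spans := t.spans
	t.spans = nil
	return spans, nil
}

// Record notes a span name if tracing is enabled; otherwise it is a no-op.
func (t *Tracer) Record(name string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.enabled {
		t.spans = append(t.spans, name)
	}
}

func main() {
	tr := &Tracer{}
	tr.Record("ignored: tracing is off")
	if err := tr.StartTracing(); err != nil {
		fmt.Println(err)
		return
	}
	tr.Record("example-call")
	spans, err := tr.StopTracing()
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(spans)
}
```

A design like this keeps the untraced path cheap (one lock and a boolean check), which matters if tracing is meant to be switched on for a long-running sandbox only occasionally.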
@jodh-intel Thanks for summarizing this, it is a very accurate description!
I would expect the virtcontainers API (
Having the shim handle this could work, but we need to enable it only on the shim for the container that represents the sandbox. Off the top of my head, I don't recall if this is something easy to identify. |
That makes sense, certainly initially. I was imagining it could be useful for an admin to enable tracing on a long-running container at any point, hence the CLI options. But, yes, they could be added at a future date as required.
Yep - like you, I haven't yet investigated how practical this actually is.
Right. I think we'll all agree we want to minimise such binaries, but this may indeed be a requirement. We could conceivably add a new runtime CLI to handle this such that if the admin runs something like |
Add OpenTracing [1] support using Jaeger [2].

Introduces two new gRPC API calls to enable and disable tracing dynamically: `StartTracing()` and `StopTracing()`.

Full details of this feature are provided in `TRACING.md`.

Updated vendoring for github.com/mdlayher/vsock to resolve hangs using a vsock socket with `grpcServer.Serve()`. Changes:

    498f144 Handle return result in Accept test
    fda437e Fix unblocking after closing listener
    4b12813 Add go.mod
    ce2ff06 vsock: factor out newConn function on Linux
    d8b0f13 vsock: adjust listener test for nonblocking
    7a158c6 vsock: enable timeouts also on listener connections
    f68ad55 vsock: allow timeouts with Go 1.11
    d0067a6 vsock/conn: don't use struct embedding

Note: the agent **must** be built with golang 1.11* or newer to ensure correct behaviour.

Fixes kata-containers#322.

[1] - https://opentracing.io
[2] - https://jaegertracing.io

Signed-off-by: James O. D. Hunt <[email protected]>
See kata-containers/kata-containers#27.