Let's discuss recorded flag before FPWD announcement #167
Comments
Some key points here:
More about this change in context: given that traceparent is not the gold state anyway, and position zero tells you who upstream is, the "recorded or not" signal, to whatever degree that is even possible to know, could be nested in vendor-specific state. If the two headers are used properly together, then when one isn't in the initial position, they can't rely on the upstream signal in traceparent anyway. At best they can attempt to reuse the trace ID, but definitely not the incoming span ID (which is what recorded would be talking about). Folks who sample 100% will ignore it and shouldn't interfere with the initial sampling status. Footnote: there are pages and pages of discussions that precede the change in question, with a lot more people present than those in the change applied. I would argue that for this reason alone the change should be reverted, as no one who spent the prior years on this was engaged in it.
Final point, which is very much a practical implementation point: "recording" is a local concern, a side effect of your policy, which includes both what is propagated and local information. For example, "sampled" is a part of that decision, but other sources (such as a tracer policy to record always) are certainly at play. If you have other local data or other headers, you may be recording when sampled says no, because you have multiple consumers. Also, that recording decision is very much a local (not propagated) override/overlay, as it is an implementation detail. To the degree it isn't, a different format could be used to address it, but such a format should definitely start as an open source tracestate thing so folks can see it work in practice prior to escalating it to the traceparent field.

Remembering tracestate here: if something didn't use that format, they also don't resume it (as their tracestate is a different key). They continue unaffected. That's the only compelling reason to have tracestate imho: that multiple things can coexist! I concede there are some other benefits to it, but the point is that we have to make changes to traceparent also knowing we designed tracestate as your gold data.

Let's say we want to change it anyway: use '03' or maybe '01' per node; somehow operators are fine making bitwise-aware Grok plugins etc. Before we did that, we'd want to use flows about the before and after to justify reverting more than a year of stability in the traceparent format. For example, a table is needed, but it isn't enough. A flow example of the before and after will show how it plays with tracestate.

Also, there absolutely must be a quorum for changes like this and active involvement from folks who are affected by it. tracestate entries are far more flexible with little impact vs traceparent. traceparent also has the most existing background reading, so it needs the highest level of rigor imho.
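To illustrate the point that "recording" is a local decision layered on top of the propagated "sampled" signal, here is a minimal sketch. The `LocalPolicy` fields and the `should_record` helper are hypothetical names made up for this illustration; they are not part of any spec or tracer API.

```python
from dataclasses import dataclass


@dataclass
class LocalPolicy:
    # Hypothetical local settings; neither is propagated on the wire.
    record_always: bool = False      # e.g. a tracer policy to record always
    other_consumers: bool = False    # e.g. another local consumer wants the data


def should_record(upstream_sampled: bool, policy: LocalPolicy) -> bool:
    """Local recording decision: 'sampled' is one input, not the whole answer."""
    return upstream_sampled or policy.record_always or policy.other_consumers


# Upstream said "not sampled", but a local policy still records.
print(should_record(False, LocalPolicy(record_always=True)))
```

In this sketch the upstream flag can only turn recording on; whether a local policy may also turn it off is exactly the kind of implementation detail the comment argues should stay local.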
Regarding interoperability, we should discuss how we treat these flags if they come from another, untrusted system. All implementors, please comment on what you do and which information you will need. Please also add which system you want to pass to another system, and why.
@AloisReitbauer as discussed during the last meeting - please post details on how Dynatrace plans to use these flags.
I was thinking of writing down my thoughts on this, and I think there are a few use cases that I have in mind:
Based on these use cases, I think it is very important for any service owner to know whether the previous request was sampled or not (see the case of Spanner/Bigtable). I also think it is important for the caller to know whether the service sampled or not (sending back the callee's sampling decision). One extra behavior that is not 100% covered by one bit in the trace-flags is the deferred decision that I showed in the B type of services and the second case of that. In the LB world it would be good to defer the sampling decision to the user, so maybe a three-state value is a good option: 0 - I did not record; 1 - I record and I will export traces; 2 - I record but haven't decided to sample.
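The three-state proposal above could be modeled roughly as follows. This is only a sketch: the enum and member names are invented for illustration, and only the numeric values 0, 1 and 2 come from the comment.

```python
from enum import IntEnum


class RecordState(IntEnum):
    # Values follow the comment above; the names are hypothetical.
    NOT_RECORDED = 0          # "I did not record"
    RECORDED_WILL_EXPORT = 1  # "I record and I will export traces"
    RECORDED_DEFERRED = 2     # "I record but haven't decided to sample"


def is_recorded(state: RecordState) -> bool:
    """Both non-zero states mean data was recorded locally."""
    return state is not RecordState.NOT_RECORDED


def export_decided(state: RecordState) -> bool:
    """Only the deferred state leaves the export decision open."""
    return state is not RecordState.RECORDED_DEFERRED


print(is_recorded(RecordState.RECORDED_DEFERRED))
print(export_decided(RecordState.RECORDED_DEFERRED))
```

Note that three states do not fit in a single bit, so this shape would take two bits of the flags field.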
@bogdandrutu so you suggest changing the meaning of the fields this way:
So there will be no situation in which a service will want to communicate the decision to drop the data to the callee when sampling was originally requested.
@SergeyKanzhelev a service can communicate the decision to drop the data, but that most likely happens in a response header. The flag we discuss here is propagated with the request headers, so that information is not available. I think we are making a mistake (my fault here) by trying to deal with "exporting" in this flag. Exporting is a vendor-specific property: some vendors record all data but export only some of it (delayed sampling decision); others record only a subset of the data but export all recorded data (Dapper-like systems). I think I actually prefer just 1 bit, where the callee passes only one piece of information: data recorded or not. For example, in OpenCensus we will record data when we "sample" the request, and we will export all recorded data, so essentially this flag is equivalent to our current sampled bit; others that record all data will always set this bit to true and defer the exporting decision. For the example in the previous comment where I mentioned a need for a "deferred" state, that is actually equivalent to sending recorded true and telling users that the LB will export trace data if they send back a specific header (response header); this becomes a protocol between an LB service provider and the customer.
So you ultimately will only use a single flag - "recorded" in current spec terminology? And the recommendation for vendors like Dynatrace and Lightstep which decide on recording later as well as any deferred sampling algorithms will always set it to |
And also - should the value of this bit then be settable/unsettable? In current spec you can promote
YES & YES
I think for this flag we should support changes in both directions, 0 -> 1 and 1 -> 0. If a vendor needs a bit that never changes, that can be done in the tracestate.
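A minimal sketch of what "changes in both directions" could look like on the wire, assuming the usual `version-traceid-spanid-flags` layout of traceparent with a two-hex-digit flags field. The bit position `RECORDED_MASK = 0x01` and the helper name `set_recorded` are assumptions for illustration; the thread itself was still debating which bit to use.

```python
RECORDED_MASK = 0x01  # assumed bit position, for illustration only


def set_recorded(traceparent: str, recorded: bool) -> str:
    """Return a copy of the header with the recorded bit set or cleared."""
    version, trace_id, span_id, flags = traceparent.split("-")
    value = int(flags, 16)
    if recorded:
        value |= RECORDED_MASK    # 0 -> 1
    else:
        value &= ~RECORDED_MASK   # 1 -> 0, other bits untouched
    return "-".join([version, trace_id, span_id, f"{value & 0xFF:02x}"])


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-00"
outgoing = set_recorded(incoming, True)
print(outgoing)
print(set_recorded(outgoing, False))
```

A real implementation would also generate a new span ID before propagating; that step is left out here to keep the focus on the flag.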
From a Dynatrace perspective, we will propagate the flag(s) but will not use them for internal tracing decisions. The main reasons are:
Long story short: we will observe the flag but could even live without it being part of the standard.
IMHO the "recorded" flag would have some value, but actually we could also live without it:
Concerning the "force-trace" flag: As Alois already said, Dynatrace would ignore this flag, but of course we would propagate it.
@AloisReitbauer @discostu105 I think you don't have to just propagate it, you also need to modify and set the right value there.
This will be provided by your customer. For example, if your customer uses AWS, they will give you access and all the information required to fetch data from X-Ray. Also, things like application-id, tenant-id, etc. may be considered PII, and you should probably not put them on the wire.
Everyone agreed that the tracestate is not intended to be shared between vendors. I don't think any vendor wants to make a commitment to share that.
If your customer has to pay for the requests to the foreign tracing system, I don't think they will be happy with your approach.
In Jaeger clients, there is currently no context-sensitive sampling, like sample on error, so our first implementation will either always respect the recorded (sampled) flag from upstream, or always ignore it and resample per the internal sampling strategy. For downstream calls Jaeger can pass its sampling bit as the recorded flag. Similarly, Jaeger clients can be configured to treat the requested flag as a debug flag, or to ignore it completely. In both cases there may be additional rate limiting.
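The two modes described above (always respect the upstream flag, or always ignore it and resample) can be sketched like this. This is not Jaeger client code: `decide_sampling`, its parameters, and the simple probabilistic resampling are all hypothetical stand-ins for whatever internal sampling strategy is configured.

```python
import random
from typing import Callable


def decide_sampling(
    upstream_recorded: bool,
    respect_upstream: bool,
    sample_rate: float,
    rng: Callable[[], float] = random.random,
) -> bool:
    """Pick the local sampling bit under one of the two modes."""
    if respect_upstream:
        # Mode 1: always trust the recorded (sampled) flag from upstream.
        return upstream_recorded
    # Mode 2: ignore upstream and resample; probabilistic sampling is an
    # assumption here, standing in for the internal strategy.
    return rng() < sample_rate


# The resulting bit would be passed downstream as the recorded flag.
print(decide_sampling(upstream_recorded=True, respect_upstream=True, sample_rate=0.0))
```

The `rng` parameter exists only so the resampling branch can be exercised deterministically.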
@yurishkuro, very helpful. Thanks for sharing. It seems that, like other vendors that sample, you only need a single flag. Please take a look at PR
"need" - yes, only one, but we can use both with a straightforward mapping to internal states.
We switched back to a single flag. I'm closing this discussion. Please re-open if you think we need to discuss more.
From @adriancole: