Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Let's discuss recorded flag before FPWD announcement #167

Closed
SergeyKanzhelev opened this issue Aug 24, 2018 · 18 comments
Closed

Let's discuss recorded flag before FPWD announcement #167

SergeyKanzhelev opened this issue Aug 24, 2018 · 18 comments
Labels
enhancement The specification works as-is but could be improved.
Milestone

Comments

@SergeyKanzhelev
Copy link
Member

From @adriancole:

I think this flag should be removed as it adds little value and results in a hex character "03" which is not intuitive either
this silly splitting of hairs will mean people can't rely on simple situation of sampled being 1 means sampled anymore

@SergeyKanzhelev SergeyKanzhelev added the enhancement The specification works as-is but could be improved. label Aug 24, 2018
@SergeyKanzhelev SergeyKanzhelev added this to the 3. FPWD milestone Aug 24, 2018
@codefromthecrypt
Copy link

Some key points here:

  • what you mentioned, the usability problem this would incur routinely in version 1, which is the worst time to make something complicated
  • In prior form, the w3c traceparent matched both b3 and also census. With this change it matches neither

More about this change in context: Given that traceparent is not the gold state anyway, and position zero tells you who upstream is, the "recorded or not" thing to what degree that's even possible to know could be nested in vendor specific state. If the two headers are used properly together, when one isn't in the initial position, they can't rely on the upstream signal in traceparent anyway, at best an attempt to reuse the traceid, but definitely not the incoming span ID (which is what recorded would be talking about). Folks who sample 100% will ignore and shouldn't interfere with the initial sampling status.

Footnote: there are pages and pages of discussions that precede the change in question, with a lot more people present than those in the change applied. I would argue for this reason alone, the change should be reverted as no-one was engaged in it that spent the prior years on this.

@codefromthecrypt
Copy link

final point, which is very much a practical implementation point:

"recording" is a local concern, a side effect of your policy including what is propagated and local information. For example, "sampled" is a part of that decision, but other sources (such as a tracer policy to record always) are certainly at play. For example, if you have other local data or other headers, you may be recording when sampled says no, because you have multiple consumers. Also, that recording decision is very much a local (not propagated) override/overlay as it is an implementation detail. To the degree it isn't a different format could be used to address it, but definitely such a format should start as an open source tracestate thing so folks can see it work in practice prior to escalating it to the traceparent field.

Remembering tracestate here.. if something didn't use that format, they also don't resume it (as their tracestate is a different key). They continue unaffected. That's the only compelling reason to have tracestate imho: that multiple things can coexist! Concede there are some other benefits to it, but point is that we have to make change to traceparent also knowing we designed tracestate as your gold data.

Let's say we want to change it anyway.. use '03' or maybe '01' per node.. somehow operators are fine making bitwise aware Grok plugins etc. Before we did that we'd want to use flows about the before and after to justify reverting more than a year of stability in the traceparent format. For example, a table is needed, but it isn't enough. A flow example of the before and after will show how it plays with tracestate. Also, there absolutely must be a quorum for changes like this and active involvement from folks who are affected by it. tracestate entries are far more flexible with little impact vs traceparent. traceparent also has the most existing background reading.. it needs the highest level of rigor imho.

@AloisReitbauer
Copy link
Contributor

AloisReitbauer commented Aug 28, 2018

Regarding interoperability, we should discuss how we treat these flags if they come from another untrusted system. Please, all implementors comment what you do and which information you will need. Please also add which system you want to pass to another system and why

@SergeyKanzhelev
Copy link
Member Author

@AloisReitbauer as discussed during the last meeting - please post details on how DynaTrace plan to use these flags.

@bogdandrutu
Copy link
Contributor

bogdandrutu commented Aug 31, 2018

Was thinking to write down my thoughts on this, and I think there are few use-cases that I have in my mind:

  • A Services like Google Spanner, Google Bigtable (request does not leave the current system):
    • I need to know if the user intends (records, samples) to current request (1 bit). Even though I will not 100% respect the sampling decision I will definitely sample more of the sampled requests because this will allow me to have a full trace (picture) between client and service.
  • B Service like Google Cloud Functions / AppEngine / Amazon Lambdas:
    • Here we have 2 cases (UserA -> Middleware (the service) -> UserA) or (UserA -> Middleware (the service) -> UserB) - the difference is that in the first case the middleware is between userA services and in the second case userA talks to userB via the middleware (example an Load Balancer or Appengine).
    • In first case UserA -> Middleware (the service) -> UserA if the middleware wants to participate and produce spans the only information that I need is the sampling decision (1 bit).
    • In second case UserA -> Middleware (the service) -> UserB usually the "Middleware (the service)" is a service that the UserB pays for so the traces belongs to UserB. In this case we can offer user the possibility to configure the sampling decision (similar to the A type of services that I describe Google Spanner) OR we can defer the sampling decision to the user and wait for a response header to decide to export the span data.

Based on this use-cases I think it is very important for any service owner to know the information if the previous request was sampled or not - see the case of the Spanner/Bigtable. Also I think it is important for the caller to know if the service sampled or not (sending back the callee sampling decision).

One extra behavior that is not 100% covered with one bit in the trace-flags is the deferred decision that I showed in the B type of services and the second case of that. In LB world would be good to defer the sampling decision to the user so maybe a three-state where 0 - i did not record ; 1 - I record and i will export traces ; 2 - I record but have't decide to sample is a good option.

@SergeyKanzhelev
Copy link
Member Author

SergeyKanzhelev commented Sep 4, 2018

@bogdandrutu so you suggest to change meaning of the fields this way:

options proposed in spec currently in spec
0 I did not record 0 I definitely dropped the data and no one asked for it
1 I record and i will export traces 3 Maybe I recorded this and someone asked for it
2 I record but have't decide to sample is a good option 2 Maybe I recorded this but no one asked for it yet (maybe deferred)
3 N/A 1 I definitely dropped the data but someone asked for it

@SergeyKanzhelev
Copy link
Member Author

So there will be no situation when service will want to communicate decision of dropping the data to the callee when sampling was requested originally.

@bogdandrutu
Copy link
Contributor

@SergeyKanzhelev service can communicate decision of dropping the data but that happens most likely in a response header. this flag that we discuss here is a flag that will be propagated with the request headers so that information is not available.

I think we are making a mistake (my fault here) by trying to deal with "exporting" in this flag. Exporting is vendor specific property (some vendors record all data but exports only few of them (delayed sampling decision), others record only a subset of the data but export all recorded data (dapper like systems)). I think actually I like more just 1 bit where the callee passes only one information: data recorded or not. For example in OpenCensus we will record data when we "sample" the request, and we will export all recorded data, so essentially this flag is equivalent with our current sampled bit, others that record all data will always set this bit to true and defer the exporting decision. For the example in the previous comment where I mentioned a need for a "deferred" state, actually that is equivalent to send record true and tell users that LB will export trace data if they send back a specific header (response header), and this becomes a protocol between a LB service provider and the customer.

@SergeyKanzhelev
Copy link
Member Author

So you ultimately will only use a single flag - "recorded" in current spec terminology? And the recommendation for vendors like Dynatrace and Lightstep which decide on recording later as well as any deferred sampling algorithms will always set it to 1?

@SergeyKanzhelev
Copy link
Member Author

And also - the value of this bit than should be settable/unsittable? In current spec you can promote requested from 0 to 1, but not other way. This gives some guarantee of interoperability - once set you know that nobody will unset it back

@bogdandrutu
Copy link
Contributor

bogdandrutu commented Sep 7, 2018

So you ultimately will only use a single flag - "recorded" in current spec terminology? And the recommendation for vendors like Dynatrace and Lightstep which decide on recording later as well as any deferred sampling algorithms will always set it to 1?

YES & YES

And also - the value of this bit than should be settable/unsittable?

I think for this flag we should support changes in both directions 0 -> 1 and 1 -> 0.

If one vendor needs a bit that never changes then that can be done in the trace-state.

@AloisReitbauer
Copy link
Contributor

From a Dynatrace perspective, we will propagate the flag(s) but will not use them for internal tracing decisions. The main reasons are:

  • As data collection implies costs the user might decide not to trace certain transactions
  • We will always minimize the impact on the application. In rare scenarios, we will not trace transactions to avoid impact on the application.
  • We might detect that a transaction is important - e.g. because of an error - although it is set to not be traced.
  • We simply might run at a much higher sampling rate.

Long story short. We will observe the flag but could even live without it not being part of the standard.

@discostu105
Copy link
Member

IMHO the "recorded" flag would have some value, but actually we could also live without it:

  • The idea behind it was to know if the previous tracing system recorded the trace, we could ask it for the trace-data and integrate it into our data. That is to increase
  • BUT: In order to integrate with a foreign tracing system, a lot more is needed. Most likely the foreign tracestate needs to be interpreted as well. E.g. to identify any application-id, tenant-id, cluster-id, ... information to be able to query the foreign system.
  • So, alternatively this flag could also be part of the foreign tracestate part. Also: it would not be mandatory, as one could always ask a foreign tracing-system for a trace (the "recorded" piece would just increase the hit-ratio).
  • So, all in all, this flag would have some value for us, but not in all cases, and it could also be solved without being in the standard.

Concerning the "force-trace" flag: As Alois already said, Dynatrace would ignore this flag, but of course we would propagate it.

@bogdandrutu
Copy link
Contributor

bogdandrutu commented Sep 11, 2018

@AloisReitbauer @discostu105 I think you don't have to just propagate it, you also need to modify and set the right value there.

BUT: In order to integrate with a foreign tracing system, a lot more is needed. Most likely the foreign tracestate needs to be interpreted as well. E.g. to identify any application-id, tenant-id, cluster-id, ... information to be able to query the foreign system.

This will be provided by your customer, for example if your customer uses AWS they will give you access and all required informations to fetch data from X-Ray. Also things like application-id, tenant-id, etc. may be considered PII and you should probably not put them on the wire.

So, alternatively this flag could also be part of the foreign tracestate part.

Everyone agreed that the tracestate is not intended to be shared between vendors. I don't think any vendor wants to make a commitment to share that.

Also: it would not be mandatory, as one could always ask a foreign tracing-system for a trace (the "recorded" piece would just increase the hit-ratio).

If your customer has to pay for the requests to the foreign tracing-system I don't think they will be happy with your approach.

@yurishkuro
Copy link
Member

In Jeeger clients, there is currently no context sensitive sampling, like sample on error, so our first implementation will be either always respect recorded (sampled) flag from upstream, or always ignore it and resample per internal sampling strategy. For downstream calls Jaeger can pass its sampling bit as recorded flag. Similarly, Jaeger clients can be configured to treat requested flag as debug flag, or ignore it completely. In both cases there may be additional rate limiting.

@SergeyKanzhelev
Copy link
Member Author

@yurishkuro, very helpful. Thanks for sharing. It seems that like other vendors with the sampling you only need a single flag. Please take a look at PR

@yurishkuro
Copy link
Member

"need" - yes, only one, but we can use both with a straightforward mapping to internal states.

@SergeyKanzhelev
Copy link
Member Author

We switched back to a single flag. I'm closing this discussion. Please re-open if you think we need to discuss more

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement The specification works as-is but could be improved.
Projects
None yet
Development

No branches or pull requests

6 participants