Let's discuss recorded flag before FPWD announcement #167

SergeyKanzhelev · 2018-08-24T16:12:50Z

I think this flag should be removed as it adds little value and results in a hex character "03" which is not intuitive either
this silly splitting of hairs will mean people can't rely on simple situation of sampled being 1 means sampled anymore

codefromthecrypt · 2018-08-25T00:53:00Z

Some key points here:

what you mentioned, the usability problem this would incur routinely in version 1, which is the worst time to make something complicated
In prior form, the w3c traceparent matched both b3 and also census. With this change it matches neither

More about this change in context: Given that traceparent is not the gold state anyway, and position zero tells you who upstream is, the "recorded or not" thing to what degree that's even possible to know could be nested in vendor specific state. If the two headers are used properly together, when one isn't in the initial position, they can't rely on the upstream signal in traceparent anyway, at best an attempt to reuse the traceid, but definitely not the incoming span ID (which is what recorded would be talking about). Folks who sample 100% will ignore and shouldn't interfere with the initial sampling status.

Footnote: there are pages and pages of discussions that precede the change in question, with a lot more people present than those in the change applied. I would argue for this reason alone, the change should be reverted as no-one was engaged in it that spent the prior years on this.

codefromthecrypt · 2018-08-25T01:04:49Z

final point, which is very much a practical implementation point:

"recording" is a local concern, a side effect of your policy including what is propagated and local information. For example, "sampled" is a part of that decision, but other sources (such as a tracer policy to record always) are certainly at play. For example, if you have other local data or other headers, you may be recording when sampled says no, because you have multiple consumers. Also, that recording decision is very much a local (not propagated) override/overlay as it is an implementation detail. To the degree it isn't a different format could be used to address it, but definitely such a format should start as an open source tracestate thing so folks can see it work in practice prior to escalating it to the traceparent field.

Remembering tracestate here.. if something didn't use that format, they also don't resume it (as their tracestate is a different key). They continue unaffected. That's the only compelling reason to have tracestate imho: that multiple things can coexist! Concede there are some other benefits to it, but point is that we have to make change to traceparent also knowing we designed tracestate as your gold data.

Let's say we want to change it anyway.. use '03' or maybe '01' per node.. somehow operators are fine making bitwise aware Grok plugins etc. Before we did that we'd want to use flows about the before and after to justify reverting more than a year of stability in the traceparent format. For example, a table is needed, but it isn't enough. A flow example of the before and after will show how it plays with tracestate. Also, there absolutely must be a quorum for changes like this and active involvement from folks who are affected by it. tracestate entries are far more flexible with little impact vs traceparent. traceparent also has the most existing background reading.. it needs the highest level of rigor imho.

AloisReitbauer · 2018-08-28T18:31:00Z

Regarding interoperability, we should discuss how we treat these flags if they come from another untrusted system. Please, all implementors comment what you do and which information you will need. Please also add which system you want to pass to another system and why

SergeyKanzhelev · 2018-08-30T21:33:51Z

@AloisReitbauer as discussed during the last meeting - please post details on how DynaTrace plan to use these flags.

bogdandrutu · 2018-08-31T18:28:36Z

Was thinking to write down my thoughts on this, and I think there are few use-cases that I have in my mind:

A Services like Google Spanner, Google Bigtable (request does not leave the current system):
- I need to know if the user intends (records, samples) to current request (1 bit). Even though I will not 100% respect the sampling decision I will definitely sample more of the sampled requests because this will allow me to have a full trace (picture) between client and service.
B Service like Google Cloud Functions / AppEngine / Amazon Lambdas:
- Here we have 2 cases (UserA -> Middleware (the service) -> UserA) or (UserA -> Middleware (the service) -> UserB) - the difference is that in the first case the middleware is between userA services and in the second case userA talks to userB via the middleware (example an Load Balancer or Appengine).
- In first case UserA -> Middleware (the service) -> UserA if the middleware wants to participate and produce spans the only information that I need is the sampling decision (1 bit).
- In second case UserA -> Middleware (the service) -> UserB usually the "Middleware (the service)" is a service that the UserB pays for so the traces belongs to UserB. In this case we can offer user the possibility to configure the sampling decision (similar to the A type of services that I describe Google Spanner) OR we can defer the sampling decision to the user and wait for a response header to decide to export the span data.

Based on this use-cases I think it is very important for any service owner to know the information if the previous request was sampled or not - see the case of the Spanner/Bigtable. Also I think it is important for the caller to know if the service sampled or not (sending back the callee sampling decision).

One extra behavior that is not 100% covered with one bit in the trace-flags is the deferred decision that I showed in the B type of services and the second case of that. In LB world would be good to defer the sampling decision to the user so maybe a three-state where 0 - i did not record ; 1 - I record and i will export traces ; 2 - I record but have't decide to sample is a good option.

SergeyKanzhelev · 2018-09-04T15:46:55Z

@bogdandrutu so you suggest to change meaning of the fields this way:

options	proposed	in spec	currently in spec
0	I did not record	0	I definitely dropped the data and no one asked for it
1	I record and i will export traces	3	Maybe I recorded this and someone asked for it
2	I record but have't decide to sample is a good option	2	Maybe I recorded this but no one asked for it yet (maybe deferred)
3	N/A	1	I definitely dropped the data but someone asked for it

SergeyKanzhelev · 2018-09-04T15:52:39Z

So there will be no situation when service will want to communicate decision of dropping the data to the callee when sampling was requested originally.

bogdandrutu · 2018-09-05T16:32:29Z

@SergeyKanzhelev service can communicate decision of dropping the data but that happens most likely in a response header. this flag that we discuss here is a flag that will be propagated with the request headers so that information is not available.

I think we are making a mistake (my fault here) by trying to deal with "exporting" in this flag. Exporting is vendor specific property (some vendors record all data but exports only few of them (delayed sampling decision), others record only a subset of the data but export all recorded data (dapper like systems)). I think actually I like more just 1 bit where the callee passes only one information: data recorded or not. For example in OpenCensus we will record data when we "sample" the request, and we will export all recorded data, so essentially this flag is equivalent with our current sampled bit, others that record all data will always set this bit to true and defer the exporting decision. For the example in the previous comment where I mentioned a need for a "deferred" state, actually that is equivalent to send record true and tell users that LB will export trace data if they send back a specific header (response header), and this becomes a protocol between a LB service provider and the customer.

SergeyKanzhelev · 2018-09-05T17:06:48Z

So you ultimately will only use a single flag - "recorded" in current spec terminology? And the recommendation for vendors like Dynatrace and Lightstep which decide on recording later as well as any deferred sampling algorithms will always set it to 1?

SergeyKanzhelev · 2018-09-05T17:08:32Z

And also - the value of this bit than should be settable/unsittable? In current spec you can promote requested from 0 to 1, but not other way. This gives some guarantee of interoperability - once set you know that nobody will unset it back

bogdandrutu · 2018-09-07T01:06:11Z

So you ultimately will only use a single flag - "recorded" in current spec terminology? And the recommendation for vendors like Dynatrace and Lightstep which decide on recording later as well as any deferred sampling algorithms will always set it to 1?

YES & YES

And also - the value of this bit than should be settable/unsittable?

I think for this flag we should support changes in both directions 0 -> 1 and 1 -> 0.

If one vendor needs a bit that never changes then that can be done in the trace-state.

AloisReitbauer · 2018-09-10T07:46:07Z

From a Dynatrace perspective, we will propagate the flag(s) but will not use them for internal tracing decisions. The main reasons are:

As data collection implies costs the user might decide not to trace certain transactions
We will always minimize the impact on the application. In rare scenarios, we will not trace transactions to avoid impact on the application.
We might detect that a transaction is important - e.g. because of an error - although it is set to not be traced.
We simply might run at a much higher sampling rate.

Long story short. We will observe the flag but could even live without it not being part of the standard.

discostu105 · 2018-09-10T10:52:31Z

IMHO the "recorded" flag would have some value, but actually we could also live without it:

The idea behind it was to know if the previous tracing system recorded the trace, we could ask it for the trace-data and integrate it into our data. That is to increase
BUT: In order to integrate with a foreign tracing system, a lot more is needed. Most likely the foreign tracestate needs to be interpreted as well. E.g. to identify any application-id, tenant-id, cluster-id, ... information to be able to query the foreign system.
So, alternatively this flag could also be part of the foreign tracestate part. Also: it would not be mandatory, as one could always ask a foreign tracing-system for a trace (the "recorded" piece would just increase the hit-ratio).
So, all in all, this flag would have some value for us, but not in all cases, and it could also be solved without being in the standard.

Concerning the "force-trace" flag: As Alois already said, Dynatrace would ignore this flag, but of course we would propagate it.

bogdandrutu · 2018-09-11T00:23:20Z

@AloisReitbauer @discostu105 I think you don't have to just propagate it, you also need to modify and set the right value there.

BUT: In order to integrate with a foreign tracing system, a lot more is needed. Most likely the foreign tracestate needs to be interpreted as well. E.g. to identify any application-id, tenant-id, cluster-id, ... information to be able to query the foreign system.

This will be provided by your customer, for example if your customer uses AWS they will give you access and all required informations to fetch data from X-Ray. Also things like application-id, tenant-id, etc. may be considered PII and you should probably not put them on the wire.

So, alternatively this flag could also be part of the foreign tracestate part.

Everyone agreed that the tracestate is not intended to be shared between vendors. I don't think any vendor wants to make a commitment to share that.

Also: it would not be mandatory, as one could always ask a foreign tracing-system for a trace (the "recorded" piece would just increase the hit-ratio).

If your customer has to pay for the requests to the foreign tracing-system I don't think they will be happy with your approach.

yurishkuro · 2018-09-11T00:37:18Z

In Jeeger clients, there is currently no context sensitive sampling, like sample on error, so our first implementation will be either always respect recorded (sampled) flag from upstream, or always ignore it and resample per internal sampling strategy. For downstream calls Jaeger can pass its sampling bit as recorded flag. Similarly, Jaeger clients can be configured to treat requested flag as debug flag, or ignore it completely. In both cases there may be additional rate limiting.

SergeyKanzhelev · 2018-09-11T23:34:11Z

@yurishkuro, very helpful. Thanks for sharing. It seems that like other vendors with the sampling you only need a single flag. Please take a look at PR

yurishkuro · 2018-09-12T01:16:51Z

"need" - yes, only one, but we can use both with a straightforward mapping to internal states.

SergeyKanzhelev · 2018-11-05T18:22:03Z

We switched back to a single flag. I'm closing this discussion. Please re-open if you think we need to discuss more

SergeyKanzhelev added the enhancement The specification works as-is but could be improved. label Aug 24, 2018

SergeyKanzhelev added this to the 3. FPWD milestone Aug 24, 2018

SergeyKanzhelev mentioned this issue Aug 24, 2018

Add clarification for #142 #162

Closed

AloisReitbauer added the trace-context label Aug 28, 2018

bogdandrutu closed this as completed Sep 7, 2018

bogdandrutu reopened this Sep 7, 2018

codefromthecrypt mentioned this issue Sep 10, 2018

WIP TraceContext propagation handler openzipkin/brave#693

Closed

SergeyKanzhelev modified the milestones: 3. FPWD, 4. CR Sep 11, 2018

SergeyKanzhelev mentioned this issue Sep 11, 2018

switch to a single recorded flag #180

Merged

SergeyKanzhelev closed this as completed Nov 5, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Let's discuss recorded flag before FPWD announcement #167

Let's discuss recorded flag before FPWD announcement #167

SergeyKanzhelev commented Aug 24, 2018

codefromthecrypt commented Aug 25, 2018

codefromthecrypt commented Aug 25, 2018

AloisReitbauer commented Aug 28, 2018 •

edited

Loading

SergeyKanzhelev commented Aug 30, 2018

bogdandrutu commented Aug 31, 2018 •

edited

Loading

SergeyKanzhelev commented Sep 4, 2018 •

edited

Loading

SergeyKanzhelev commented Sep 4, 2018

bogdandrutu commented Sep 5, 2018

SergeyKanzhelev commented Sep 5, 2018

SergeyKanzhelev commented Sep 5, 2018

bogdandrutu commented Sep 7, 2018 •

edited

Loading

AloisReitbauer commented Sep 10, 2018

discostu105 commented Sep 10, 2018

bogdandrutu commented Sep 11, 2018 •

edited

Loading

yurishkuro commented Sep 11, 2018

SergeyKanzhelev commented Sep 11, 2018

yurishkuro commented Sep 12, 2018

SergeyKanzhelev commented Nov 5, 2018

Let's discuss recorded flag before FPWD announcement #167

Let's discuss recorded flag before FPWD announcement #167

Comments

SergeyKanzhelev commented Aug 24, 2018

codefromthecrypt commented Aug 25, 2018

codefromthecrypt commented Aug 25, 2018

AloisReitbauer commented Aug 28, 2018 • edited Loading

SergeyKanzhelev commented Aug 30, 2018

bogdandrutu commented Aug 31, 2018 • edited Loading

SergeyKanzhelev commented Sep 4, 2018 • edited Loading

SergeyKanzhelev commented Sep 4, 2018

bogdandrutu commented Sep 5, 2018

SergeyKanzhelev commented Sep 5, 2018

SergeyKanzhelev commented Sep 5, 2018

bogdandrutu commented Sep 7, 2018 • edited Loading

AloisReitbauer commented Sep 10, 2018

discostu105 commented Sep 10, 2018

bogdandrutu commented Sep 11, 2018 • edited Loading

yurishkuro commented Sep 11, 2018

SergeyKanzhelev commented Sep 11, 2018

yurishkuro commented Sep 12, 2018

SergeyKanzhelev commented Nov 5, 2018

AloisReitbauer commented Aug 28, 2018 •

edited

Loading

bogdandrutu commented Aug 31, 2018 •

edited

Loading

SergeyKanzhelev commented Sep 4, 2018 •

edited

Loading

bogdandrutu commented Sep 7, 2018 •

edited

Loading

bogdandrutu commented Sep 11, 2018 •

edited

Loading