Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tracing.Builder.alwaysReportSpans() which sends locally sampled data to the normal reporter #958

Closed
wants to merge 2 commits into from

Conversation

codefromthecrypt
Copy link
Member

@codefromthecrypt codefromthecrypt commented Aug 10, 2019

Currently, we have a test SecondarySampling which is complex to
implement as it assumes conditional decisions will result in use of
different reporting libraries. Ex normal zipkin for sampled data, and
a different format for secondary sinks.

After a chat with Nara, we realized that data routing logic is cheapest
to address outside the process. In other words, if a secondary sampling
decision says "record" but only for a part of the network, this data
for that subtree will still end up posting to the same place.

This adds Tracing.Builder.alwaysReportSpans() which allows the normal span reporter to get everything recorded.

Fixes #955

@@ -330,6 +332,42 @@
.isEqualTo("12345");
}

/** Example that sets local sampled when in an experiment. */
final static class LocalSampleCustomizer extends Customizer {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here's an example but a more thorough one will come later

@codefromthecrypt
Copy link
Member Author

PS here's the before and after, porting redaction to the new api. At first glance, I'm happy with it and don't think anyone will be able to notice a difference:

before:

Benchmark                                                                                      Mode     Cnt     Score     Error   Units
ExtraFieldPropagationBenchmarks.redacted_extract                                             sample  393940     0.242 ±   0.018   us/op
ExtraFieldPropagationBenchmarks.redacted_extract:redacted_extract·p0.00                      sample             0.146             us/op
ExtraFieldPropagationBenchmarks.redacted_extract:redacted_extract·p0.50                      sample             0.184             us/op
ExtraFieldPropagationBenchmarks.redacted_extract:redacted_extract·p0.90                      sample             0.214             us/op
ExtraFieldPropagationBenchmarks.redacted_extract:redacted_extract·p0.95                      sample             0.235             us/op
ExtraFieldPropagationBenchmarks.redacted_extract:redacted_extract·p0.99                      sample             0.660             us/op
ExtraFieldPropagationBenchmarks.redacted_extract:redacted_extract·p0.999                     sample            10.576             us/op
ExtraFieldPropagationBenchmarks.redacted_extract:redacted_extract·p0.9999                    sample            29.435             us/op
ExtraFieldPropagationBenchmarks.redacted_extract:redacted_extract·p1.00                      sample          1316.864             us/op
ExtraFieldPropagationBenchmarks.redacted_extract:·gc.alloc.rate                              sample      15  1341.893 ± 149.469  MB/sec
ExtraFieldPropagationBenchmarks.redacted_extract:·gc.alloc.rate.norm                         sample      15   336.037 ±   0.008    B/op
ExtraFieldPropagationBenchmarks.redacted_extract:·gc.churn.G1_Eden_Space                     sample      15  1353.549 ± 169.610  MB/sec
ExtraFieldPropagationBenchmarks.redacted_extract:·gc.churn.G1_Eden_Space.norm                sample      15   338.931 ±  19.833    B/op
ExtraFieldPropagationBenchmarks.redacted_extract:·gc.churn.G1_Old_Gen                        sample      15     0.007 ±   0.003  MB/sec
ExtraFieldPropagationBenchmarks.redacted_extract:·gc.churn.G1_Old_Gen.norm                   sample      15     0.002 ±   0.001    B/op
ExtraFieldPropagationBenchmarks.redacted_extract:·gc.count                                   sample      15   123.000            counts
ExtraFieldPropagationBenchmarks.redacted_extract:·gc.time                                    sample      15    82.000                ms
ExtraFieldPropagationBenchmarks.redacted_extract_no_extra                                    sample  412916     0.224 ±   0.031   us/op
ExtraFieldPropagationBenchmarks.redacted_extract_no_extra:redacted_extract_no_extra·p0.00    sample             0.111             us/op
ExtraFieldPropagationBenchmarks.redacted_extract_no_extra:redacted_extract_no_extra·p0.50    sample             0.161             us/op
ExtraFieldPropagationBenchmarks.redacted_extract_no_extra:redacted_extract_no_extra·p0.90    sample             0.185             us/op
ExtraFieldPropagationBenchmarks.redacted_extract_no_extra:redacted_extract_no_extra·p0.95    sample             0.239             us/op
ExtraFieldPropagationBenchmarks.redacted_extract_no_extra:redacted_extract_no_extra·p0.99    sample             0.579             us/op
ExtraFieldPropagationBenchmarks.redacted_extract_no_extra:redacted_extract_no_extra·p0.999   sample            10.528             us/op
ExtraFieldPropagationBenchmarks.redacted_extract_no_extra:redacted_extract_no_extra·p0.9999  sample            30.436             us/op
ExtraFieldPropagationBenchmarks.redacted_extract_no_extra:redacted_extract_no_extra·p1.00    sample          3588.096             us/op
ExtraFieldPropagationBenchmarks.redacted_extract_no_extra:·gc.alloc.rate                     sample      15  1378.802 ± 135.304  MB/sec
ExtraFieldPropagationBenchmarks.redacted_extract_no_extra:·gc.alloc.rate.norm                sample      15   312.034 ±   0.011    B/op
ExtraFieldPropagationBenchmarks.redacted_extract_no_extra:·gc.churn.G1_Eden_Space            sample      15  1380.626 ± 162.728  MB/sec
ExtraFieldPropagationBenchmarks.redacted_extract_no_extra:·gc.churn.G1_Eden_Space.norm       sample      15   312.776 ±  24.377    B/op
ExtraFieldPropagationBenchmarks.redacted_extract_no_extra:·gc.churn.G1_Old_Gen               sample      15     0.005 ±   0.002  MB/sec
ExtraFieldPropagationBenchmarks.redacted_extract_no_extra:·gc.churn.G1_Old_Gen.norm          sample      15     0.001 ±   0.001    B/op
ExtraFieldPropagationBenchmarks.redacted_extract_no_extra:·gc.count                          sample      15   109.000            counts
ExtraFieldPropagationBenchmarks.redacted_extract_no_extra:·gc.time                           sample      15    76.000                ms
ExtraFieldPropagationBenchmarks.redacted_extract_nothing                                     sample  530013     0.101 ±   0.005   us/op
ExtraFieldPropagationBenchmarks.redacted_extract_nothing:redacted_extract_nothing·p0.00      sample             0.040             us/op
ExtraFieldPropagationBenchmarks.redacted_extract_nothing:redacted_extract_nothing·p0.50      sample             0.080             us/op
ExtraFieldPropagationBenchmarks.redacted_extract_nothing:redacted_extract_nothing·p0.90      sample             0.093             us/op
ExtraFieldPropagationBenchmarks.redacted_extract_nothing:redacted_extract_nothing·p0.95      sample             0.119             us/op
ExtraFieldPropagationBenchmarks.redacted_extract_nothing:redacted_extract_nothing·p0.99      sample             0.226             us/op
ExtraFieldPropagationBenchmarks.redacted_extract_nothing:redacted_extract_nothing·p0.999     sample             6.456             us/op
ExtraFieldPropagationBenchmarks.redacted_extract_nothing:redacted_extract_nothing·p0.9999    sample            16.176             us/op
ExtraFieldPropagationBenchmarks.redacted_extract_nothing:redacted_extract_nothing·p1.00      sample           788.480             us/op
ExtraFieldPropagationBenchmarks.redacted_extract_nothing:·gc.alloc.rate                      sample      15  2098.407 ± 165.449  MB/sec
ExtraFieldPropagationBenchmarks.redacted_extract_nothing:·gc.alloc.rate.norm                 sample      15   184.011 ±   0.002    B/op
ExtraFieldPropagationBenchmarks.redacted_extract_nothing:·gc.churn.G1_Eden_Space             sample      15  2095.401 ± 218.316  MB/sec
ExtraFieldPropagationBenchmarks.redacted_extract_nothing:·gc.churn.G1_Eden_Space.norm        sample      15   183.568 ±   9.072    B/op
ExtraFieldPropagationBenchmarks.redacted_extract_nothing:·gc.churn.G1_Old_Gen                sample      15     0.004 ±   0.002  MB/sec
ExtraFieldPropagationBenchmarks.redacted_extract_nothing:·gc.churn.G1_Old_Gen.norm           sample      15    ≈ 10⁻³              B/op
ExtraFieldPropagationBenchmarks.redacted_extract_nothing:·gc.count                           sample      15   126.000            counts
ExtraFieldPropagationBenchmarks.redacted_extract_nothing:·gc.time                            sample      15    88.000                ms
ExtraFieldPropagationBenchmarks.redacted_inject                                              sample  388674     0.133 ±   0.013   us/op
ExtraFieldPropagationBenchmarks.redacted_inject:redacted_inject·p0.00                        sample             0.070             us/op
ExtraFieldPropagationBenchmarks.redacted_inject:redacted_inject·p0.50                        sample             0.100             us/op
ExtraFieldPropagationBenchmarks.redacted_inject:redacted_inject·p0.90                        sample             0.113             us/op
ExtraFieldPropagationBenchmarks.redacted_inject:redacted_inject·p0.95                        sample             0.125             us/op
ExtraFieldPropagationBenchmarks.redacted_inject:redacted_inject·p0.99                        sample             0.327             us/op
ExtraFieldPropagationBenchmarks.redacted_inject:redacted_inject·p0.999                       sample             9.296             us/op
ExtraFieldPropagationBenchmarks.redacted_inject:redacted_inject·p0.9999                      sample            15.762             us/op
ExtraFieldPropagationBenchmarks.redacted_inject:redacted_inject·p1.00                        sample           830.464             us/op
ExtraFieldPropagationBenchmarks.redacted_inject:·gc.alloc.rate                               sample      15  2136.652 ±  30.371  MB/sec
ExtraFieldPropagationBenchmarks.redacted_inject:·gc.alloc.rate.norm                          sample      15   256.016 ±   0.002    B/op
ExtraFieldPropagationBenchmarks.redacted_inject:·gc.churn.G1_Eden_Space                      sample      15  2155.550 ± 127.850  MB/sec
ExtraFieldPropagationBenchmarks.redacted_inject:·gc.churn.G1_Eden_Space.norm                 sample      15   258.291 ±  15.070    B/op
ExtraFieldPropagationBenchmarks.redacted_inject:·gc.churn.G1_Old_Gen                         sample      15     0.005 ±   0.003  MB/sec
ExtraFieldPropagationBenchmarks.redacted_inject:·gc.churn.G1_Old_Gen.norm                    sample      15     0.001 ±   0.001    B/op
ExtraFieldPropagationBenchmarks.redacted_inject:·gc.count                                    sample      15   136.000            counts
ExtraFieldPropagationBenchmarks.redacted_inject:·gc.time                                     sample      15    90.000                ms

after:

Benchmark                                                                                      Mode     Cnt     Score     Error   Units
ExtraFieldPropagationBenchmarks.redacted_extract                                             sample  387011     0.224 ±   0.018   us/op
ExtraFieldPropagationBenchmarks.redacted_extract:redacted_extract·p0.00                      sample             0.143             us/op
ExtraFieldPropagationBenchmarks.redacted_extract:redacted_extract·p0.50                      sample             0.179             us/op
ExtraFieldPropagationBenchmarks.redacted_extract:redacted_extract·p0.90                      sample             0.197             us/op
ExtraFieldPropagationBenchmarks.redacted_extract:redacted_extract·p0.95                      sample             0.212             us/op
ExtraFieldPropagationBenchmarks.redacted_extract:redacted_extract·p0.99                      sample             0.496             us/op
ExtraFieldPropagationBenchmarks.redacted_extract:redacted_extract·p0.999                     sample             9.552             us/op
ExtraFieldPropagationBenchmarks.redacted_extract:redacted_extract·p0.9999                    sample            16.183             us/op
ExtraFieldPropagationBenchmarks.redacted_extract:redacted_extract·p1.00                      sample           901.120             us/op
ExtraFieldPropagationBenchmarks.redacted_extract:·gc.alloc.rate                              sample      15  1462.564 ±  74.632  MB/sec
ExtraFieldPropagationBenchmarks.redacted_extract:·gc.alloc.rate.norm                         sample      15   354.698 ±  29.213    B/op
ExtraFieldPropagationBenchmarks.redacted_extract:·gc.churn.G1_Eden_Space                     sample      15  1460.072 ± 119.612  MB/sec
ExtraFieldPropagationBenchmarks.redacted_extract:·gc.churn.G1_Eden_Space.norm                sample      15   354.005 ±  35.566    B/op
ExtraFieldPropagationBenchmarks.redacted_extract:·gc.churn.G1_Old_Gen                        sample      15     0.004 ±   0.002  MB/sec
ExtraFieldPropagationBenchmarks.redacted_extract:·gc.churn.G1_Old_Gen.norm                   sample      15     0.001 ±   0.001    B/op
ExtraFieldPropagationBenchmarks.redacted_extract:·gc.count                                   sample      15   115.000            counts
ExtraFieldPropagationBenchmarks.redacted_extract:·gc.time                                    sample      15    77.000                ms
ExtraFieldPropagationBenchmarks.redacted_extract_no_extra                                    sample  457201     0.178 ±   0.011   us/op
ExtraFieldPropagationBenchmarks.redacted_extract_no_extra:redacted_extract_no_extra·p0.00    sample             0.136             us/op
ExtraFieldPropagationBenchmarks.redacted_extract_no_extra:redacted_extract_no_extra·p0.50    sample             0.151             us/op
ExtraFieldPropagationBenchmarks.redacted_extract_no_extra:redacted_extract_no_extra·p0.90    sample             0.162             us/op
ExtraFieldPropagationBenchmarks.redacted_extract_no_extra:redacted_extract_no_extra·p0.95    sample             0.176             us/op
ExtraFieldPropagationBenchmarks.redacted_extract_no_extra:redacted_extract_no_extra·p0.99    sample             0.361             us/op
ExtraFieldPropagationBenchmarks.redacted_extract_no_extra:redacted_extract_no_extra·p0.999   sample             9.088             us/op
ExtraFieldPropagationBenchmarks.redacted_extract_no_extra:redacted_extract_no_extra·p0.9999  sample            13.430             us/op
ExtraFieldPropagationBenchmarks.redacted_extract_no_extra:redacted_extract_no_extra·p1.00    sample           893.952             us/op
ExtraFieldPropagationBenchmarks.redacted_extract_no_extra:·gc.alloc.rate                     sample      15  1529.872 ±   8.649  MB/sec
ExtraFieldPropagationBenchmarks.redacted_extract_no_extra:·gc.alloc.rate.norm                sample      15   312.026 ±   0.003    B/op
ExtraFieldPropagationBenchmarks.redacted_extract_no_extra:·gc.churn.G1_Eden_Space            sample      15  1539.687 ± 100.857  MB/sec
ExtraFieldPropagationBenchmarks.redacted_extract_no_extra:·gc.churn.G1_Eden_Space.norm       sample      15   314.011 ±  20.124    B/op
ExtraFieldPropagationBenchmarks.redacted_extract_no_extra:·gc.churn.G1_Old_Gen               sample      15     0.004 ±   0.003  MB/sec
ExtraFieldPropagationBenchmarks.redacted_extract_no_extra:·gc.churn.G1_Old_Gen.norm          sample      15     0.001 ±   0.001    B/op
ExtraFieldPropagationBenchmarks.redacted_extract_no_extra:·gc.count                          sample      15   113.000            counts
ExtraFieldPropagationBenchmarks.redacted_extract_no_extra:·gc.time                           sample      15    83.000                ms
ExtraFieldPropagationBenchmarks.redacted_extract_nothing                                     sample  581570     0.089 ±   0.001   us/op
ExtraFieldPropagationBenchmarks.redacted_extract_nothing:redacted_extract_nothing·p0.00      sample             0.057             us/op
ExtraFieldPropagationBenchmarks.redacted_extract_nothing:redacted_extract_nothing·p0.50      sample             0.076             us/op
ExtraFieldPropagationBenchmarks.redacted_extract_nothing:redacted_extract_nothing·p0.90      sample             0.081             us/op
ExtraFieldPropagationBenchmarks.redacted_extract_nothing:redacted_extract_nothing·p0.95      sample             0.093             us/op
ExtraFieldPropagationBenchmarks.redacted_extract_nothing:redacted_extract_nothing·p0.99      sample             0.193             us/op
ExtraFieldPropagationBenchmarks.redacted_extract_nothing:redacted_extract_nothing·p0.999     sample             2.032             us/op
ExtraFieldPropagationBenchmarks.redacted_extract_nothing:redacted_extract_nothing·p0.9999    sample            11.555             us/op
ExtraFieldPropagationBenchmarks.redacted_extract_nothing:redacted_extract_nothing·p1.00      sample            41.024             us/op
ExtraFieldPropagationBenchmarks.redacted_extract_nothing:·gc.alloc.rate                      sample      15  2303.184 ±  22.966  MB/sec
ExtraFieldPropagationBenchmarks.redacted_extract_nothing:·gc.alloc.rate.norm                 sample      15   184.010 ±   0.001    B/op
ExtraFieldPropagationBenchmarks.redacted_extract_nothing:·gc.churn.G1_Eden_Space             sample      15  2300.130 ±  93.815  MB/sec
ExtraFieldPropagationBenchmarks.redacted_extract_nothing:·gc.churn.G1_Eden_Space.norm        sample      15   183.779 ±   7.630    B/op
ExtraFieldPropagationBenchmarks.redacted_extract_nothing:·gc.churn.G1_Old_Gen                sample      15     0.005 ±   0.002  MB/sec
ExtraFieldPropagationBenchmarks.redacted_extract_nothing:·gc.churn.G1_Old_Gen.norm           sample      15    ≈ 10⁻³              B/op
ExtraFieldPropagationBenchmarks.redacted_extract_nothing:·gc.count                           sample      15   135.000            counts
ExtraFieldPropagationBenchmarks.redacted_extract_nothing:·gc.time                            sample      15    96.000                ms
ExtraFieldPropagationBenchmarks.redacted_inject                                              sample  412862     0.120 ±   0.010   us/op
ExtraFieldPropagationBenchmarks.redacted_inject:redacted_inject·p0.00                        sample             0.078             us/op
ExtraFieldPropagationBenchmarks.redacted_inject:redacted_inject·p0.50                        sample             0.094             us/op
ExtraFieldPropagationBenchmarks.redacted_inject:redacted_inject·p0.90                        sample             0.109             us/op
ExtraFieldPropagationBenchmarks.redacted_inject:redacted_inject·p0.95                        sample             0.122             us/op
ExtraFieldPropagationBenchmarks.redacted_inject:redacted_inject·p0.99                        sample             0.311             us/op
ExtraFieldPropagationBenchmarks.redacted_inject:redacted_inject·p0.999                       sample             7.083             us/op
ExtraFieldPropagationBenchmarks.redacted_inject:redacted_inject·p0.9999                      sample            14.761             us/op
ExtraFieldPropagationBenchmarks.redacted_inject:redacted_inject·p1.00                        sample           912.384             us/op
ExtraFieldPropagationBenchmarks.redacted_inject:·gc.alloc.rate                               sample      15  2269.710 ± 114.544  MB/sec
ExtraFieldPropagationBenchmarks.redacted_inject:·gc.alloc.rate.norm                          sample      15   256.014 ±   0.002    B/op
ExtraFieldPropagationBenchmarks.redacted_inject:·gc.churn.G1_Eden_Space                      sample      15  2289.061 ± 186.305  MB/sec
ExtraFieldPropagationBenchmarks.redacted_inject:·gc.churn.G1_Eden_Space.norm                 sample      15   258.044 ±  13.759    B/op
ExtraFieldPropagationBenchmarks.redacted_inject:·gc.churn.G1_Old_Gen                         sample      15     0.005 ±   0.002  MB/sec
ExtraFieldPropagationBenchmarks.redacted_inject:·gc.churn.G1_Old_Gen.norm                    sample      15     0.001 ±   0.001    B/op
ExtraFieldPropagationBenchmarks.redacted_inject:·gc.count                                    sample      15   134.000            counts
ExtraFieldPropagationBenchmarks.redacted_inject:·gc.time                                     sample      15   100.000                ms

* contain data the backend needs to route to appropriate data stores, you are done. If it needs a
* hint, you can add a {@link FinishedSpanHandler} to add a tag which ensures the backend consistently
* and statelessly honors the sampling intent.
*/
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

haven't changed the example code, yet, but this is the design. I still need to run it through @narayaruna to ensure I am following things properly

@codefromthecrypt
Copy link
Member Author

chatting with nara again, seems eventually they will need edge (root but ttl=1) sampling so probably best to make the example sophisticated enough to show how to do this.

* small subset of the architecture starting at an arbitrary place. As such, triage is more advanced
* that the other two systems.
*
* <h2>Trace data forwarder</h2>
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cwensel @drolando @worldtiki @openzipkin/core this example will introduce a familiar concept to you, trace data forwarder. In this case, it is different than "firehose handler" as the app is not responsible for deciding to send to two places. Instead, we push that responsibility into the zipkin-compatible endpoint at the cost of adding a tag to help that endpoint know where to divert data to (in some cases sending the same json to multiple places)

@codefromthecrypt
Copy link
Member Author

I think the ExtraFieldPropagation.Customizer should be self-contained which would allow it to define the fields it needs also. probably this hints at a name Plugin instead. Basically, this will allow the state management code from edge and triage to be in different files, allowing folks to progressively add new systems in a decoupled fashion.

* will be called for each field added to {@link FactoryBuilder}, in the same order they were,
* added and regardless of whether the field is present or not.
*/
public interface FieldCustomizer {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am curious about the contracts on this interface. I see this method doing two things (setting a value or dropping a value) and I would like to understand the reason for not having two methods.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because it is easier this way, also you can implement single-method types in a lambda.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

by easier I mean there's a very direct relationship without ordering concerns about what we are doing here. it is also easier on the code to not have to enter two methods for the same outcome (ex the callsite is a loop of only one method call vs two calls one to change one to drop)

@codefromthecrypt codefromthecrypt changed the title Adds ExtraFieldPropagation.Customizer to implement secondary sampling Adds ExtraFieldPropagation.Plugin to implement secondary trace state Aug 13, 2019
@codefromthecrypt
Copy link
Member Author

need to rethink this if we are intentionally allowing duplicate span handlers. My hope was that someone can drop in a jar into something like spring boot that implements like this:

@Configuration
class AddContextPlugins {
   @Bean Plugin triageSampler(){...}
   @Bean Plugin customCorrelationField(){...}
   @Bean Plugin w3cTraceContext(){...}
}

The implementation would implicitly register a FinishedSpanHandler provided by that plugin. We have to design very carefully if not deduplicating in Tracing.Builder that someone doesn't accidentally also register the handler used in the plugin manually. Regardless, we would need to ensure there's no hierarchy shared... the lack of shared hierarchy is probably a good thing anyway, and we can take measures to make it harder to accidentally register FinishedSpanHandler (by making the plugin's provider marked protected. Anyway, it is something to think about.

cc @anuraaga

@codefromthecrypt codefromthecrypt force-pushed the extra-customizer branch 2 times, most recently from 44e4bb7 to b5847eb Compare August 14, 2019 06:41
Factory(Propagation.Factory delegate, String[] fieldNames, String[] keyNames, BitSet redacted) {
this(delegate, fieldNames, keyNames, keyToField(keyNames), redacted);
@SuppressWarnings("ThreadLocalUsage") // intentional: instances may have different plugin counts
final ThreadLocal<Object[]> FIELD_UPDATERS = new ThreadLocal<>();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, can you explain this further (even in just a reply here). I don't follow why a thread local is necessary here, I must be missing some context

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

otherwise we allocate one throw-away array for each incoming request and each outgoing one, on top of the other overhead added by this feature. This is an attempt to reduce the garbage impact.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anyway good question. basically I used to only put in features with one thing like one Reporter<Span>, which you wouldn't need to loop over. However in plugins, this was designed for multiple and we can be a bit more efficient than end users are likely to be when trying to compose them.

I will cleanup this anyway, for example only using it when there are in fact multiple plugins present. that will help isolate the benchmarking concern, too, and protect the unsuspecting reviewer from seeing too much in one file, too!

@codefromthecrypt
Copy link
Member Author

ok I rebased and polished up some refactorings..

current known TODOs:

  • SecondarySampling/SecondarySamplingTest need rewrite
  • consider how to collaborate on things besides headers
    • ex Netflix need access to path during secondary sampling.
    • Current hack is to put a filter in front of the tracing filter which converts path expressions to pseudo headers, which can then be read as extra-fields
  • consider adding the ability to perform manipulations based on extracted state, probably only tags.
    • ex TraceContextOrSamplingFlags has a function for adding tags which are added directly after span creation
    • this could remove a lot of tension.. I think jaeger and kamon already do this, jaeger for "debug header" iirc and kamon more flexibly.

@codefromthecrypt
Copy link
Member Author

owe some changes here, notably the tag used for sampling should be split into separate routing tags. ex "sampled.zipkin" ->"", "sampled.edge" ->"" vs a shared tag "sampled" -> "zipkin,edge". this decouples more and is easier on the back end cc @narayaruna

@codefromthecrypt
Copy link
Member Author

@openzipkin/core For those interested, I had to make a traffic simulator to verify things. Mainly, it is hard to understand where sampling starts and stops if you only have 2 services. You really need at least 10 I think to "get" the value of subtree sampling.

https://github.com/openzipkin/brave/blob/extra-customizer/brave/src/test/java/brave/features/propagation/TracedNode.java creates a service graph based on a made-up movie service with a static route through 2 paths intentionally overlapping some services.

    Map<String, String> initialContext = new LinkedHashMap<>();
    initialContext.put("b3", "1");
    gateway.execute("/recommend", initialContext);
    gateway.execute("/play", initialContext);

Unit tests make graph assertions, but if you were to look at the json, they are somewhat realistic, even if simplified. You can load it into lens for example:

Screenshot 2019-08-31 at 8 55 31 AM
Screenshot 2019-08-31 at 8 55 54 AM
Screenshot 2019-08-31 at 8 56 10 AM
Screenshot 2019-08-31 at 8 56 29 AM

Here are the json if you want to just look for yourself

/recommend

[
  {
    "traceId": "a468337851abb70b",
    "parentId": "31ba4774a2efd405",
    "id": "df3db528d93870ba",
    "kind": "SERVER",
    "name": "/recommend",
    "timestamp": 1567212713691469,
    "duration": 16,
    "localEndpoint": {
      "serviceName": "authdb",
      "ipv4": "192.168.0.101"
    },
    "tags": {
      "systems": "zipkin"
    },
    "shared": true
  },
  {
    "traceId": "a468337851abb70b",
    "parentId": "31ba4774a2efd405",
    "id": "df3db528d93870ba",
    "kind": "CLIENT",
    "name": "/recommend",
    "timestamp": 1567212713691442,
    "duration": 11749,
    "localEndpoint": {
      "serviceName": "cache",
      "ipv4": "192.168.0.101"
    },
    "tags": {
      "systems": "zipkin"
    }
  },
  {
    "traceId": "a468337851abb70b",
    "parentId": "2adc9b0a7b3f60ce",
    "id": "31ba4774a2efd405",
    "kind": "SERVER",
    "name": "/recommend",
    "timestamp": 1567212713691428,
    "duration": 12161,
    "localEndpoint": {
      "serviceName": "cache",
      "ipv4": "192.168.0.101"
    },
    "tags": {
      "systems": "zipkin"
    },
    "shared": true
  },
  {
    "traceId": "a468337851abb70b",
    "parentId": "2adc9b0a7b3f60ce",
    "id": "31ba4774a2efd405",
    "kind": "CLIENT",
    "name": "/recommend",
    "timestamp": 1567212713691402,
    "duration": 12349,
    "localEndpoint": {
      "serviceName": "auth",
      "ipv4": "192.168.0.101"
    },
    "tags": {
      "systems": "zipkin"
    }
  },
  {
    "traceId": "a468337851abb70b",
    "parentId": "a7d2afa4a0361d6e",
    "id": "2adc9b0a7b3f60ce",
    "kind": "SERVER",
    "name": "/recommend",
    "timestamp": 1567212713691385,
    "duration": 12610,
    "localEndpoint": {
      "serviceName": "auth",
      "ipv4": "192.168.0.101"
    },
    "tags": {
      "systems": "zipkin"
    },
    "shared": true
  },
  {
    "traceId": "a468337851abb70b",
    "parentId": "a7d2afa4a0361d6e",
    "id": "2adc9b0a7b3f60ce",
    "kind": "CLIENT",
    "name": "/recommend",
    "timestamp": 1567212713691352,
    "duration": 12759,
    "localEndpoint": {
      "serviceName": "api",
      "ipv4": "192.168.0.101"
    },
    "tags": {
      "systems": "zipkin"
    }
  },
  {
    "traceId": "a468337851abb70b",
    "parentId": "7a2719293544afc3",
    "id": "092412b2a141d7cf",
    "kind": "SERVER",
    "name": "/recommend",
    "timestamp": 1567212713704567,
    "duration": 4,
    "localEndpoint": {
      "serviceName": "recodb",
      "ipv4": "192.168.0.101"
    },
    "tags": {
      "systems": "zipkin"
    },
    "shared": true
  },
  {
    "traceId": "a468337851abb70b",
    "parentId": "7a2719293544afc3",
    "id": "092412b2a141d7cf",
    "kind": "CLIENT",
    "name": "/recommend",
    "timestamp": 1567212713704537,
    "duration": 1175,
    "localEndpoint": {
      "serviceName": "cache",
      "ipv4": "192.168.0.101"
    },
    "tags": {
      "systems": "zipkin"
    }
  },
  {
    "traceId": "a468337851abb70b",
    "parentId": "76e442e0b621230e",
    "id": "7a2719293544afc3",
    "kind": "SERVER",
    "name": "/recommend",
    "timestamp": 1567212713704520,
    "duration": 1478,
    "localEndpoint": {
      "serviceName": "cache",
      "ipv4": "192.168.0.101"
    },
    "tags": {
      "systems": "zipkin"
    },
    "shared": true
  },
  {
    "traceId": "a468337851abb70b",
    "parentId": "76e442e0b621230e",
    "id": "7a2719293544afc3",
    "kind": "CLIENT",
    "name": "/recommend",
    "timestamp": 1567212713704491,
    "duration": 1596,
    "localEndpoint": {
      "serviceName": "recommendations",
      "ipv4": "192.168.0.101"
    },
    "tags": {
      "systems": "zipkin"
    }
  },
  {
    "traceId": "a468337851abb70b",
    "parentId": "a7d2afa4a0361d6e",
    "id": "76e442e0b621230e",
    "kind": "SERVER",
    "name": "/recommend",
    "timestamp": 1567212713704470,
    "duration": 1808,
    "localEndpoint": {
      "serviceName": "recommendations",
      "ipv4": "192.168.0.101"
    },
    "tags": {
      "systems": "zipkin"
    },
    "shared": true
  },
  {
    "traceId": "a468337851abb70b",
    "parentId": "a7d2afa4a0361d6e",
    "id": "76e442e0b621230e",
    "kind": "CLIENT",
    "name": "/recommend",
    "timestamp": 1567212713704423,
    "duration": 1939,
    "localEndpoint": {
      "serviceName": "api",
      "ipv4": "192.168.0.101"
    },
    "tags": {
      "systems": "zipkin"
    }
  },
  {
    "traceId": "a468337851abb70b",
    "parentId": "a468337851abb70b",
    "id": "a7d2afa4a0361d6e",
    "kind": "SERVER",
    "name": "/recommend",
    "timestamp": 1567212713691320,
    "duration": 15237,
    "localEndpoint": {
      "serviceName": "api",
      "ipv4": "192.168.0.101"
    },
    "tags": {
      "systems": "zipkin"
    },
    "shared": true
  },
  {
    "traceId": "a468337851abb70b",
    "parentId": "a468337851abb70b",
    "id": "a7d2afa4a0361d6e",
    "kind": "CLIENT",
    "name": "/recommend",
    "timestamp": 1567212713689598,
    "duration": 16945,
    "localEndpoint": {
      "serviceName": "gateway",
      "ipv4": "192.168.0.101"
    },
    "tags": {
      "systems": "zipkin"
    }
  },
  {
    "traceId": "a468337851abb70b",
    "id": "a468337851abb70b",
    "kind": "SERVER",
    "name": "/recommend",
    "timestamp": 1567212713688911,
    "duration": 17716,
    "localEndpoint": {
      "serviceName": "gateway",
      "ipv4": "192.168.0.101"
    },
    "tags": {
      "systems": "zipkin"
    }
  }
]

/play

[
  {
    "traceId": "df6e654722df5dc3",
    "parentId": "6d9412fcae7ca484",
    "id": "ae953dc1896a55f4",
    "kind": "SERVER",
    "name": "/play",
    "timestamp": 1567212812427713,
    "duration": 17,
    "localEndpoint": {
      "serviceName": "authdb",
      "ipv4": "192.168.0.101"
    },
    "tags": {
      "systems": "zipkin"
    },
    "shared": true
  },
  {
    "traceId": "df6e654722df5dc3",
    "parentId": "6d9412fcae7ca484",
    "id": "ae953dc1896a55f4",
    "kind": "CLIENT",
    "name": "/play",
    "timestamp": 1567212812427690,
    "duration": 14235,
    "localEndpoint": {
      "serviceName": "cache",
      "ipv4": "192.168.0.101"
    },
    "tags": {
      "systems": "zipkin"
    }
  },
  {
    "traceId": "df6e654722df5dc3",
    "parentId": "0d44c4974c51f286",
    "id": "6d9412fcae7ca484",
    "kind": "SERVER",
    "name": "/play",
    "timestamp": 1567212812427676,
    "duration": 14670,
    "localEndpoint": {
      "serviceName": "cache",
      "ipv4": "192.168.0.101"
    },
    "tags": {
      "systems": "zipkin"
    },
    "shared": true
  },
  {
    "traceId": "df6e654722df5dc3",
    "parentId": "0d44c4974c51f286",
    "id": "6d9412fcae7ca484",
    "kind": "CLIENT",
    "name": "/play",
    "timestamp": 1567212812427649,
    "duration": 14785,
    "localEndpoint": {
      "serviceName": "auth",
      "ipv4": "192.168.0.101"
    },
    "tags": {
      "systems": "zipkin"
    }
  },
  {
    "traceId": "df6e654722df5dc3",
    "parentId": "69f5a557048760da",
    "id": "0d44c4974c51f286",
    "kind": "SERVER",
    "name": "/play",
    "timestamp": 1567212812427633,
    "duration": 14993,
    "localEndpoint": {
      "serviceName": "auth",
      "ipv4": "192.168.0.101"
    },
    "tags": {
      "systems": "zipkin"
    },
    "shared": true
  },
  {
    "traceId": "df6e654722df5dc3",
    "parentId": "69f5a557048760da",
    "id": "0d44c4974c51f286",
    "kind": "CLIENT",
    "name": "/play",
    "timestamp": 1567212812427606,
    "duration": 15099,
    "localEndpoint": {
      "serviceName": "api",
      "ipv4": "192.168.0.101"
    },
    "tags": {
      "systems": "zipkin"
    }
  },
  {
    "traceId": "df6e654722df5dc3",
    "parentId": "ac9175fb1a4720d7",
    "id": "3f16f8d5292690e4",
    "kind": "SERVER",
    "name": "/play",
    "timestamp": 1567212812444761,
    "duration": 9,
    "localEndpoint": {
      "serviceName": "licensedb",
      "ipv4": "192.168.0.101"
    },
    "tags": {
      "systems": "zipkin"
    },
    "shared": true
  },
  {
    "traceId": "df6e654722df5dc3",
    "parentId": "ac9175fb1a4720d7",
    "id": "3f16f8d5292690e4",
    "kind": "CLIENT",
    "name": "/play",
    "timestamp": 1567212812444675,
    "duration": 275,
    "localEndpoint": {
      "serviceName": "cache",
      "ipv4": "192.168.0.101"
    },
    "tags": {
      "systems": "zipkin"
    }
  },
  {
    "traceId": "df6e654722df5dc3",
    "parentId": "fa11333db32d3089",
    "id": "ac9175fb1a4720d7",
    "kind": "SERVER",
    "name": "/play",
    "timestamp": 1567212812443161,
    "duration": 2109,
    "localEndpoint": {
      "serviceName": "cache",
      "ipv4": "192.168.0.101"
    },
    "tags": {
      "systems": "zipkin"
    },
    "shared": true
  },
  {
    "traceId": "df6e654722df5dc3",
    "parentId": "fa11333db32d3089",
    "id": "ac9175fb1a4720d7",
    "kind": "CLIENT",
    "name": "/play",
    "timestamp": 1567212812443119,
    "duration": 2269,
    "localEndpoint": {
      "serviceName": "license",
      "ipv4": "192.168.0.101"
    },
    "tags": {
      "systems": "zipkin"
    }
  },
  {
    "traceId": "df6e654722df5dc3",
    "parentId": "c44e913507e718fc",
    "id": "fa11333db32d3089",
    "kind": "SERVER",
    "name": "/play",
    "timestamp": 1567212812443096,
    "duration": 2569,
    "localEndpoint": {
      "serviceName": "license",
      "ipv4": "192.168.0.101"
    },
    "tags": {
      "systems": "zipkin"
    },
    "shared": true
  },
  {
    "traceId": "df6e654722df5dc3",
    "parentId": "c44e913507e718fc",
    "id": "fa11333db32d3089",
    "kind": "CLIENT",
    "name": "/play",
    "timestamp": 1567212812443059,
    "duration": 2724,
    "localEndpoint": {
      "serviceName": "playback",
      "ipv4": "192.168.0.101"
    },
    "tags": {
      "systems": "zipkin"
    }
  },
  {
    "traceId": "df6e654722df5dc3",
    "parentId": "c44e913507e718fc",
    "id": "15467d016028a262",
    "kind": "SERVER",
    "name": "/play",
    "timestamp": 1567212812446190,
    "duration": 7,
    "localEndpoint": {
      "serviceName": "moviemetadata",
      "ipv4": "192.168.0.101"
    },
    "tags": {
      "systems": "zipkin"
    },
    "shared": true
  },
  {
    "traceId": "df6e654722df5dc3",
    "parentId": "c44e913507e718fc",
    "id": "15467d016028a262",
    "kind": "CLIENT",
    "name": "/play",
    "timestamp": 1567212812446144,
    "duration": 208,
    "localEndpoint": {
      "serviceName": "playback",
      "ipv4": "192.168.0.101"
    },
    "tags": {
      "systems": "zipkin"
    }
  },
  {
    "traceId": "df6e654722df5dc3",
    "parentId": "c44e913507e718fc",
    "id": "50812c6770bde0af",
    "kind": "SERVER",
    "name": "/play",
    "timestamp": 1567212812446793,
    "duration": 7,
    "localEndpoint": {
      "serviceName": "streams",
      "ipv4": "192.168.0.101"
    },
    "tags": {
      "systems": "zipkin"
    },
    "shared": true
  },
  {
    "traceId": "df6e654722df5dc3",
    "parentId": "c44e913507e718fc",
    "id": "50812c6770bde0af",
    "kind": "CLIENT",
    "name": "/play",
    "timestamp": 1567212812446742,
    "duration": 198,
    "localEndpoint": {
      "serviceName": "playback",
      "ipv4": "192.168.0.101"
    },
    "tags": {
      "systems": "zipkin"
    }
  },
  {
    "traceId": "df6e654722df5dc3",
    "parentId": "69f5a557048760da",
    "id": "c44e913507e718fc",
    "kind": "SERVER",
    "name": "/play",
    "timestamp": 1567212812443034,
    "duration": 4228,
    "localEndpoint": {
      "serviceName": "playback",
      "ipv4": "192.168.0.101"
    },
    "tags": {
      "systems": "zipkin"
    },
    "shared": true
  },
  {
    "traceId": "df6e654722df5dc3",
    "parentId": "69f5a557048760da",
    "id": "c44e913507e718fc",
    "kind": "CLIENT",
    "name": "/play",
    "timestamp": 1567212812442981,
    "duration": 4409,
    "localEndpoint": {
      "serviceName": "api",
      "ipv4": "192.168.0.101"
    },
    "tags": {
      "systems": "zipkin"
    }
  },
  {
    "traceId": "df6e654722df5dc3",
    "parentId": "df6e654722df5dc3",
    "id": "69f5a557048760da",
    "kind": "SERVER",
    "name": "/play",
    "timestamp": 1567212812427578,
    "duration": 20095,
    "localEndpoint": {
      "serviceName": "api",
      "ipv4": "192.168.0.101"
    },
    "tags": {
      "systems": "zipkin"
    },
    "shared": true
  },
  {
    "traceId": "df6e654722df5dc3",
    "parentId": "df6e654722df5dc3",
    "id": "69f5a557048760da",
    "kind": "CLIENT",
    "name": "/play",
    "timestamp": 1567212812426373,
    "duration": 21359,
    "localEndpoint": {
      "serviceName": "gateway",
      "ipv4": "192.168.0.101"
    },
    "tags": {
      "systems": "zipkin"
    }
  },
  {
    "traceId": "df6e654722df5dc3",
    "id": "df6e654722df5dc3",
    "kind": "SERVER",
    "name": "/play",
    "timestamp": 1567212812425805,
    "duration": 22057,
    "localEndpoint": {
      "serviceName": "gateway",
      "ipv4": "192.168.0.101"
    },
    "tags": {
      "systems": "zipkin"
    }
  }
]

* <p>Secondary fields and state management about them is the responsibility of {@link
* ExtraFieldPropagation}. Here, you can add an extra field that corresponds to a header sent out of
* process. You can also add an {@link ExtraFieldPlugin} to ensure a relevant tag gets to the trace
* forwarder.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is awesome! Probably worth it to have a copy on Zipkin's wiki

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good idea I didn't even consider that. this is a massive javadoc :)

@codefromthecrypt
Copy link
Member Author

I will leave docs here until more stable (tomorrow?) then move to wiki. One thing I notice is poor choice of jargon:

  • s/edge/gateway -> the '1 hop down' use case is less about edge like "Edge computing" more like gateway like "api gateway". Makes a lot more sense I think to use the older and more accessible term. Ex gateway is a link between 2 things. that's why 1 hop down is more intuitive for this.

  • s/triage/customer support/ -> looking all our other docs, we talk about customer support sampling. the fact that this does conditional based on a subtree of the service graph is an implementation detail. Considering we discuss use of "customer-id" is also telling. triage is a what not why. Customer support makes more immediate sense.

  • s/zipkin/b3/ -> since the others are about the sampling decision a and they all go to zipkin, b3 is more descriptive about the base case which is using b3 protocol. Ex up-front sampling. It isn't even important if the primary is probabilistic or not.

anyway I think the above help clarify a bit.

@codefromthecrypt
Copy link
Member Author

took out using a different propagated field per sampler as it makes integration unnecessarily complicated. This buys back time and energy to focus on more important things, such as http request parsing.

codefromthecrypt pushed a commit that referenced this pull request Sep 2, 2019
Before, we allowed the "carrier" of trace context to be different than
the request object. This is more helpful on the inject side than the
extract side. If we have only one parameter extract side, we can do more
for example, secondary sampling which considers the http path using
`brave.http.HttpServerParser`

See #958
@codefromthecrypt
Copy link
Member Author

the solution to the http-layer sampling problem is to introduce a new type HttpServerRequest/Response which is similar to zipkin-go. This allows us to inspect the type of the carrier and then consider other parts than just http headers when deciding a secondary decision. I have a little refactoring homework before this can complete (especially keeping api compatible)

codefromthecrypt pushed a commit that referenced this pull request Sep 3, 2019
Initially, we had a partial function of (adapter, request) to parse attributes
like the HTTP path from a request. This made sense in the http abstraction, as
all types that needed these attributes, such as parsers and samplers, were in
the same package and library. It also saved an object allocation otherwise
needed to wrap a native type, and the indirection of unwrapping as needed.

This worked, but especially on the response path this broke down. We found that
attributes in the request, such as the method and matched route, are needed at
response time, but aren't themselves a part of the response. Over time, we
accumulated response wrappers or thread locals to pass these attributes to the
response as many response objects lacked a property bag.

Then, we noticed people wanting to do more during extraction than look at
headers. For example, Netflix wanted to inspect the http path to make a
secondary sampling decision. Without a known http request type, this couldn't
be accomplished portably.

Finally, we noticed zipkin-go side-step this problem by defining a top-level
type for http requests and responses. This allows the same object to be used
regardless of purpose, whether that is primary or secondary sampling, or tag
parsing.

All of this led to the introduction of `HttpServerRequest` and
`HttpServerResponse` types in Brave 5.7, retrofitted into the original
`HttpServerAdapter` so as to not break API.

See #958
* @see ExtraFieldPropagation.FactoryBuilder#addPlugin(ExtraFieldPlugin)
* @since 5.7
*/
public Builder alwaysReportSpans() {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ps to implement this in brave 5.7 we can use a finished span handler that wraps normal reporter and then sets tracing.builder.spanReporter to noop

@codefromthecrypt
Copy link
Member Author

After various discussions, it seems like the actual use case is not required to have multiple components collaborating. For example, the intent is to use the same propagated field for everything mildly renamed to something like sampling-keys

This means the original design is closer to the mark. I'm going to bring that back as this code wandered a lot based on some false assumption that the customer support style sampling would be firewalled off (ex using different state header) than the edge sampling

https://github.com/openzipkin/brave/blob/master/brave/src/test/java/brave/features/propagation/SecondarySampling.java

@codefromthecrypt codefromthecrypt force-pushed the extra-customizer branch 2 times, most recently from 178ed5c to 4a132a1 Compare September 10, 2019 06:40
@codefromthecrypt codefromthecrypt changed the title Adds ExtraFieldPropagation.Plugin to implement secondary trace state Tracing.Builder.alwaysReportSpans() which sends locally sampled data to the normal reporter Sep 11, 2019
* This is a <a href="https://github.com/openzipkin/openzipkin.github.io/wiki/secondary-sampling">Secondary
* Sampling</a> proof of concept.
*/
public class SecondarySamplingIntegratedTest {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is now converged, one scenario (skipped hops in gatewayplay) doesn't work yet

@codefromthecrypt
Copy link
Member Author

everything works now as far as I can tell. I'll retrofit the wiki next.

@codefromthecrypt codefromthecrypt marked this pull request as ready for review September 12, 2019 10:06
@codefromthecrypt
Copy link
Member Author

closed via 17fe478

@codefromthecrypt codefromthecrypt deleted the extra-customizer branch September 12, 2019 10:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Consider alwaysSampleLocal helper when using extra fields propagation
4 participants