2018 09 20 Filtering, Aggregations and Proxies at Salesforce
Created by Adrian Cole, last modified on Sep 21, 2018
Our goal is to share knowledge on approaches that yield higher-quality data and analysis from Zipkin data pipelines. The main topics of interest are firehose mode (100% sampling), service aggregations, and data transformations such as filtering.
Sep 20, 2018, 1-6pm PDT (UTC-7); some of us will have lunch together beforehand.
At Salesforce West: 50 Fremont Street, San Francisco, CA 94105
Folks signed up to attend will meet at 12:45pm in the Lobby and will have to complete a standard NDA
We will add notes at the bottom of this document including links to things discussed and any takeaways.
The scope of this workshop assumes attendees are designers or implementers of Zipkin pipelines. In-person attendance is welcome but constrained by space at the venue. Remote folks can join via Gitter and attend the video call.
- Chris Wensel, Salesforce (host)
- Adrian Cole, Pivotal
- Andrey Falko, Lyft
- Daniele Rolando, Yelp
- Suman Karumuri, Pinterest
Total space is 8! Let’s see how this goes!
Please review the following before the meeting
SoundCloud Kubernetes Metadata
Aggregation and Analysis at Netflix
If any segment is highlighted in blue, it will be firmly coordinated for remote folks. Other segments may be lax or completely open ended.
| 1:00pm | Introductions |
| 1:30pm | Collaborate on trace data pipeline diagram to set context for future talks |
| 2:30pm | Break out on data collection and quality (ex. SoundCloud Zipkin service cleaner, Salesforce filtering design, Netflix conversion pipeline) |
| 3:30pm | Quick one on "firehose mode" as it applies to instrumentation overhead and collection |
| 4:00pm | Collaborate on aggregation aspects of trace data pipeline diagram |
| 4:30pm | Break out on service graph aggregation approaches (ex. big shared graph with streaming updates and corresponding propagation features) |
| 5:30pm | Round-table on takeaways |
Notes will run below here as the day progresses. Please don't write anything that shouldn't be publicly visible!
Daniele from Yelp
- Consolidation of tracer arch for firehose
- automatic detection
Ke from Salesforce
- Improve UI
- Best practice for Big Enterprise
Chris from Salesforce
- Ditto Above
- Health of Community
Ryan at Salesforce
- Smart Collection pipeline
- improve Stock UI
Andrey at Lyft
- ease of collection and transport
- correlated presentation of trace data with metrics, etc.
Suman at Slack
- More value from Trace Data
- Automatic Analysis
- Reduce MTTR MTTD
Adrian at Pivotal
- Zipkin pipeline topology
- Instrumentation w/ multiple tracing systems
José Carlos, TypeForm
- interested in learning and applying what we discuss
py_zipkin-instrumented applications send 0.15% of spans to a global Kafka topic, which feeds a global Cassandra cluster.
100% tracing is too costly for Kafka, so instead applications encode spans into a list and ship them over UDP to a local proxy. This proxy writes to a per-AZ Zipkin cluster via HTTP.
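For illustration, a minimal sketch of that UDP hop, assuming py_zipkin's transport_handler hook (newer py_zipkin versions may prefer a BaseTransportHandler subclass); the proxy address and port are hypothetical, and a real proxy would batch and handle payloads larger than a single UDP datagram:

```python
# Minimal sketch (not Yelp's code): a py_zipkin transport handler that ships
# encoded spans over UDP to a local proxy, which then batches them to the
# per-AZ Zipkin cluster over HTTP. Proxy address/port are hypothetical.
import socket

from py_zipkin.zipkin import zipkin_span

PROXY_ADDR = ("127.0.0.1", 9410)  # hypothetical local span proxy
_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)


def udp_transport_handler(encoded_span):
    # py_zipkin hands us an already-encoded span payload; fire-and-forget it.
    _sock.sendto(encoded_span, PROXY_ADDR)


def handle_request():
    # sample_rate is a percentage; 100.0 approximates firehose collection.
    with zipkin_span(
        service_name="example_service",
        span_name="handle_request",
        transport_handler=udp_transport_handler,
        sample_rate=100.0,
    ):
        pass  # application work goes here
```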
The sampled span data are also in Splunk. One problem with aggregating metrics from sampled spans is skew due to the sample rate, which is partly addressed with a sample-rate tag.
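A minimal sketch of the inverse-weighting idea for correcting that skew; the "sample_rate" tag name and the span shape are assumptions, not a confirmed Yelp convention:

```python
# Sketch: weight each sampled span by the inverse of its sample rate so
# counts and latency sums approximate the unsampled population.
def weighted_totals(spans):
    total_calls = 0.0
    total_duration_us = 0.0
    for span in spans:
        rate = float(span["tags"].get("sample_rate", 1.0))  # e.g. 0.0015 for 0.15%
        weight = 1.0 / rate if rate > 0 else 0.0
        total_calls += weight
        total_duration_us += weight * span["duration"]
    return total_calls, total_duration_us
```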
Sidebar: Suman mentioned they used a transport other than Kafka, mainly due to dependency-management complications, especially in Python.
Q: Ryan: are you propagating another decision?
A: Daniele: no we just ignore the sample bit.
Q: Suman: Why are you doing both simultaneously?
A: Daniele: we disable indexing on the 100% store, and use log-based search instead. Also, requests jump across regions, so we have a fan-out query across AZ for the 100% one.
Q: Andrey: Would zipkin mux work if you turn back on search?
A: Daniele: Not currently, because it only works with get-trace-by-ID.
A: Adrian: Query features will be a little tricky due to sharding the query, limits, AND operators.
Q: Suman: Why cassandra, how much data?
A: Daniele: Optimized for write-heavy workloads; about 1M spans/second in aggregate globally on the firehose.
Q: Andrey: Any auditing of the final data
A: Daniele: No checks in the pipeline at the moment
Q: Ryan: is zipkin mux public
A: Daniele: POC is public, but service discovery will be site specific
Q: Ke: are you measuring readback latency
A: Daniele: not measured specifically; possibly ~50ms.
Q: Chris: are you using client side libraries for javascript?
A: Daniele: custom, concerns about how to ship the data
A: Suman: custom for javascript, android, ios
A: Chris: we are still making decisions about this
Each application has a trace recorder which writes spans to a log file. A component named Singer follows the log files and writes them to Kafka. Some transformation steps happen before storage to address things like data quality from old clients (zipkin-sparkstreaming). These spans end up in hosted Elasticsearch and are analyzed in Kibana.
We now trace through CDNs from web, Android and iOS clients. We generate trace IDs, then reverse-engineer front-end spans by stream-processing the CDN logs. Client-originated data is also collected via a diagnostic pipeline. We have an experiment framework which allows you to set flags for any purpose; this is re-used for tracing to create B3 headers. Readback is usually within seconds (to materialize the different parts).
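A sketch of minting B3 headers when a tracing flag is set; is_tracing_enabled stands in for the internal experiment-flags framework (not public), while the header names are the standard B3 ones:

```python
# Sketch: mint B3 headers for a client request when an experiment flag turns
# tracing on. The flag plumbing is hypothetical; only the headers are standard.
import random


def b3_headers(is_tracing_enabled):
    if not is_tracing_enabled:
        return {}
    trace_id = format(random.getrandbits(128), "032x")  # 128-bit trace id
    span_id = format(random.getrandbits(64), "016x")
    return {
        "X-B3-TraceId": trace_id,
        "X-B3-SpanId": span_id,
        "X-B3-Sampled": "1",
    }
```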
To reduce MTTR/MTTD, we added more clients and took canary traces before and after deployment, then compared the timings and network-call details (ex. how many calls per endpoint, spans per service, average latency). We usually found that additional calls increase end-user latency. The SRE job is easier with these summaries, and it is easier to know which contacts to ping. Before, we couldn't tell which team to ping to resolve an issue.
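A sketch of that kind of before/after comparison over two sets of canary spans, assuming Zipkin v2 JSON fields (localEndpoint.serviceName, duration); the real analysis runs as Spark jobs, so this is only illustrative:

```python
# Sketch: summarize per-service call counts and mean latency for a set of
# spans, then diff the summaries from before and after a deployment.
from collections import defaultdict


def summarize(spans):
    counts, durations = defaultdict(int), defaultdict(float)
    for span in spans:
        svc = span["localEndpoint"]["serviceName"]
        counts[svc] += 1
        durations[svc] += span["duration"]  # microseconds in Zipkin v2
    return {svc: (counts[svc], durations[svc] / counts[svc]) for svc in counts}


def compare(before_spans, after_spans):
    before, after = summarize(before_spans), summarize(after_spans)
    for svc in sorted(set(before) | set(after)):
        b_calls, b_avg = before.get(svc, (0, 0.0))
        a_calls, a_avg = after.get(svc, (0, 0.0))
        print(f"{svc}: calls {b_calls} -> {a_calls}, avg us {b_avg:.0f} -> {a_avg:.0f}")
```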
On the client side, we were already writing client events. Once we convinced them to write the events as spans, everything else became easy, notably the parent/child aspects. This made end-user integration a lot easier. The span spec made it easy.
https://www.slideshare.net/mansu/pintrace-lisa-2017-san-francisco
For us the most useful granularity is (service, operation), aka the service endpoint.
A: Ryan: at Salesforce programming is dynamic, so you can't necessarily get a useful endpoint out of it. It is especially hard with PII in request URLs.
Q: Chris: Does amazon ES work at this scale?
A: Suman: We sample to remove overhead such as header generation before the data gets to Elasticsearch. Once we had a bug where sampling was accidentally stuck on, which caused a sev 0 outage for 40 minutes. We learned that the overhead cost of tracing was 15%, which led to a low sampling rate. We write about 2-20K spans/second at 0.001% sampling.
Q: Ryan: when the sampled flag was stuck on by mistake, did your whole service graph autoscale?
A: Suman: it depends on the library, finagle does a lot less work when not sampled.
Q: Chris: Did you do any analysis of the performance problem that led to the overhead?
A: Suman: we didn't keep data long enough to do that.
Chris: there could be opportunities to analyze the hot spots as 100% could add value that could be larger than the cost if cost is managed properly.
Q: Chris: so your sampling rate is very low, what's the value?
A: Suman: The first win was latency; originally applications were very slow. Even with low rates, programming/config problems like redundant calls are visible and easy to fix.
Q: Chris: where does it start and stop?
A: Suman: from web → DB
Q: Daniele: Is 0.001% enough for P99?
A: Suman: The load is sufficiently high and requests go through similar libraries, so you can find problems such as too-large requests.
Q: Ryan: The example of tuning request size is a business decision; why was that OK?
A: Suman: No recollection of the business decision or why it was OK to make it smaller.
Q: Andrey: How does this work client side?
A: Suman: Micro-targeting flags (eventually stored in ZooKeeper) allow you to target specific segments. There's no point sampling your near data centers at 100%; we can target via technical properties like 3G, all built into the existing flags framework.
A: Ryan: over time different segments of traffic are more important.
Q: Daniele: Is the deployment analysis available for all services?
A: Suman: the analysis is per edge-router request, ex. /pins or /boards, by default. It can be redeployed to start with mid-tier entry requests. It can also launch more Spark jobs.
Q: Daniele: Do others see a big increase when sampled?
A: Ryan: We don't do in-process tracing, only ingress/egress (headers in, headers out), HTTP and gRPC.
Q: Chris: Is there a way to only get skeletal RPC spans only?
A: Adrian: Not implemented, but discussed often. Intermediate spans create a load factor that can't be directly calculated per-request.
How do we build multi-datacenter Zipkin collection? Inside the datacenter we have multiple services, and we want to make as few modifications as possible: "stock open source is good open source." We still wanted Kafka; our R&D datacenter had Elasticsearch and we needed to get data into it. We also have Envoy and linkerds, and in-process tracing using Brave. For some of these, we can't control the codebase or they don't support Kafka. This is a fairly new setup.
One problem we had with linkerd was that it uses dtabs, which explode and make the service graph less useful.
Andrey built this: inside each datacenter we have a Zipkin server with all the protocols, and underneath a "write-only" storage component which turns spans into outbound Kafka messages. All components think they are using normal HTTP, but this forwards messages via Kafka onto the giant shared Kafka bus. We like that we can have all the features on the collectors themselves, but still use Kafka.
On the TODO list is adding a rate-limiting filter. We also run Envoy in front of Zipkin so it can do rate limiting as well. Also, Envoy adds mutual TLS implicitly.
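The real forwarder Andrey describes is a "write-only" storage component wired into zipkin-server via Spring Boot autoconfiguration (see the Q&A below). Purely to illustrate the idea, here is a conceptual Python stand-in that accepts the standard POST /api/v2/spans endpoint and relays the payload onto a Kafka topic; the broker address and topic name are hypothetical:

```python
# Conceptual sketch only: not the Salesforce implementation. Accept the
# standard Zipkin HTTP span endpoint and relay the encoded payload,
# unchanged, onto a Kafka topic.
from http.server import BaseHTTPRequestHandler, HTTPServer

from kafka import KafkaProducer  # assumes kafka-python is installed

producer = KafkaProducer(bootstrap_servers="kafka:9092")  # hypothetical broker
TOPIC = "zipkin-spans-outbound"  # hypothetical topic name


class SpanForwarder(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/api/v2/spans":
            self.send_response(404)
            self.end_headers()
            return
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        producer.send(TOPIC, value=body)  # forward the encoded span list as-is
        self.send_response(202)
        self.end_headers()


if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 9411), SpanForwarder).serve_forever()
```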
Q: Daniele: how did you get the storage to work?
A: Andrey/Ryan: It uses spring boot autoconfiguration.
Q: Daniele: Why can't you write to kafka?
A: Ryan: linkerd 1 only works with HTTP, and Envoy only does HTTP and the v1 format. We have some requirements to splice into the model in the middle of the Kafka stream, for example to tag the datacenter. Also, we have filters for GDPR compliance and whitelists, because we can't trust developers to comply on their own.
Lightstep is used as the service provider; some instrumentation is Zipkin, with some Grafana integration. Lightstep does 100% sampling.
Q: Suman: How is Lightstep used?
A: Andrey: There are some Envoy-related questions you can ask, and there are some alarms set up.
Q: Suman: Are there client-originated traces?
A: Andrey: most heavy lifting is done in envoy
Q: Ke: Are there plans to have Lightstep elsewhere?
A: Ryan: We were given a sales demo that said it can be run anywhere.
A: Suman: We evaluated Lightstep and weren't interested in 100% sampling. Their UI of metrics graphs with traces attached was nice. We built something similar internally based on Zipkin queries + Grafana. We also added a favorites feature with separate storage and infinite retention. The dependency graph by endpoint is also a nice feature of Lightstep.
A: Daniele: We also had a demo, but they only had 100% sampling at the time (2 years ago) and it was too expensive. In our 100% pipeline we only keep data for 1 hour, so it is cheaper.
We'd have a filtering stage which would allow writing to multiple Kafka topics or storage layers at the same time. This could get complex with a bunch of filters.
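A minimal sketch of such a filtering stage, assuming Zipkin v2 JSON and hypothetical topic names: consume spans from an inbound topic and fan each one out to whichever downstream topics its predicates match:

```python
# Sketch: a filter/fan-out stage between Kafka topics. Topic names and the
# routing rules are illustrative assumptions, not a real deployment.
import json

from kafka import KafkaConsumer, KafkaProducer  # assumes kafka-python

consumer = KafkaConsumer("zipkin-inbound", bootstrap_servers="kafka:9092")
producer = KafkaProducer(bootstrap_servers="kafka:9092")

# Each route pairs a predicate over a v2 JSON span with an output topic.
ROUTES = [
    (lambda span: True, "zipkin-all"),
    (lambda span: span.get("localEndpoint", {}).get("serviceName") == "checkout",
     "zipkin-checkout"),
]

for record in consumer:
    for span in json.loads(record.value):  # Zipkin v2 JSON is a list of spans
        for predicate, topic in ROUTES:
            if predicate(span):
                producer.send(topic, json.dumps([span]).encode("utf-8"))
```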
Q: What are the use cases?
A: Andrey: filter or transform things with customer data. When we acquire other companies, they may be re-using tags others use, so this allows reserving tags (or rewriting them into a namespace). Sometimes the data you need isn't there.
A: Chris: Canopy does things like buffering until a trace is done. We could use things like this to mediate different sample rates. We can also get some leverage from Kafka to address things like burstiness (which Kafka handles well). When we acquire companies we end up crossing data centers, which complicates things. We need to make sure we handle our own datacenters first.
A: Suman: We want buffering to do things like tail buffering.
Q: Daniele: What do you mean about getting smarter
A: Chris: Right now we aggregate things into a central Kafka topic for logs; we could also have a global topic for tracing. We don't know if we should do it (100% sampling) yet, as it hasn't been analyzed yet.
It could be possible to do multiple phases of transformations, even with trace-ID partitioning, via multiple deployments of Zipkin with some filtering configuration. For example, this should reduce the amount of data to buffer when transforming traces. This is easier for SREs to reason about.
We need a parts list, including how the partitioning would work. For example, spans are mixed by default; something needs to split them out by trace ID and then rebundle them, then configure partitions and topics. There's a sweet spot: even if Kafka allows a large number of partitions, it is not efficient to use that many. It might be only a couple hundred in practice.
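A sketch of the trace-ID partitioning idea: if each span is produced with its trace ID as the Kafka message key, the default partitioner co-locates all spans of a trace on one partition, so rebundling stays local to one consumer. The topic name is hypothetical:

```python
# Sketch: repartition mixed spans by trace ID so downstream consumers can
# rebundle whole traces without a global shuffle.
import json

from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="kafka:9092")  # hypothetical broker


def repartition_by_trace(spans, topic="zipkin-by-trace"):
    for span in spans:
        producer.send(
            topic,
            key=span["traceId"].encode("utf-8"),      # hash(key) picks the partition
            value=json.dumps([span]).encode("utf-8"),
        )
```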
It could be possible to weave favorites back in with HTTP posts from the UI.
Q: Daniele: What is good about Flink?
A: Chris: It offers true streaming and coherent joins on a partition. Plain Kafka can be used besides that.
Andrey: Link-based aggregations can help resolve cases where you can't tell the operation name.
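A sketch of that link-based aggregation over a single trace, assuming Zipkin v2 JSON spans: count (parent service → child service) edges, which works even when operation names aren't useful:

```python
# Sketch: derive (caller service -> callee service) link counts from one
# trace; this is the kind of edge Zipkin's dependency graph is built from.
from collections import Counter


def link_counts(trace):
    by_id = {span["id"]: span for span in trace}
    links = Counter()
    for span in trace:
        parent = by_id.get(span.get("parentId"))
        if not parent:
            continue
        caller = parent.get("localEndpoint", {}).get("serviceName")
        callee = span.get("localEndpoint", {}).get("serviceName")
        if caller and callee and caller != callee:
            links[(caller, callee)] += 1
    return links
```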
Suman: The performance characteristics of aggregations are different from trace ingest. After-the-fact processing of the data is helpful, as the outputs are often used across different traces.
Chris: Aggregations have different SLAs, and they have to deal with the latency. It also can refine the tradeoffs.
Adrian: we need to also consider the production and ingress of the data, as 100% collection is often required for aggregation, yet collection might require customizations
Suman, Chris: We need to ensure downstream is always v2 as this assumption is helpful for the ecosystem.
Ke: newer versions of Kafka support headers.
Adrian: We also need to introduce folks to each other so that work occurs in these areas.
Suman: We need serialization formats and java, python libraries.
Chris: yes, and we need pluggability for these use cases; de-facto formats are good for long-term analysis.
Daniele: we had the same problem (wanting data to persist) moving from v1 to v2, so we have conversions.
Haystack (via the HomeAway team) is translating Zipkin on the wire and has offered to review this with us.
Daniele
- covered more stuff than expected, for example how we can create a Kafka pipeline that can address conversions and windowing
- better understand the firehose design document thanks to examples such as how sampling works
- would like to have more info on anomaly detection
Ke
- Glad to know how others use Zipkin, particularly how other sites' pipelines work
- Talked about the UI vaguely wrt Grafana, but would like to explore the UI more next time, possibly modularity
- Firehose collection discussions were helpful
Chris
- Glad to see validation of thoughts we had internally about an enrichment pipeline
- Challenge is we have immediate needs about filtering upstream vs filtering downstream
- Like the idea of collaborating on the tools, and how people can bring their own infrastructure
- It would be nice to have a reference architecture which will help things from appearing sparse
- Want Salesforce to help keep the community healthy, including a place to discuss (some policy change is needed)
Ryan
- Quite satisfied by the talks about pipelines and trace aggregations
- Didn't talk much about the Zipkin UI; would be nice to know about features like sort by trace complexity
- Would be nice to have some pre-canned UI cases.
Andrey
- Love what Suman brought to the table about non-Spark pipeline possibilities and the tie-in to the storage-of-storage discussion
- Liked how aggregation related to correlated presentation (for example, Grafana with traces co-plotted)
Suman
- Knowing about Haystack was valuable
- Was nice to understand where everyone is, especially the firehose
- Netflix materials were new and interesting
- Filters, storage of storage, and packaging of a reference pipeline are exciting, ideally toward better support or adoption of tools
- Prefer the site-owners model vs vendors, as it helps focus influence on the community
- Reducing MTTR/MTTD seems ahead of the pack with using traces
- One thing I would like to discuss is how to address long-running ops like Slack sessions
Adrian
- Love seeing people talk about things with a lot of common ground
- Let's try to get a reference implementation together for the pipeline
Attachments: 20180920_134124.jpg, 20180920_142356.jpg, 20180920_145141.jpg, 20180920_152008.jpg, 20180920_160853.jpg (image/jpeg)