LINE
At LINE, we have a few thousand engineers on hundreds of teams managing thousands of services. The equivalent of 1.5 engineers works on distributed tracing. Almost all services are in Java; some are in Erlang.
At LINE, we spend less than 5% of infrastructure cost and less than 1% of engineering time maintaining our observability stack, which includes:
- Metrics (~200 million metrics/minute), stored in an in-house TSDB (previously OpenTSDB/MySQL)
- Logging (in-house, built on Elasticsearch)
- Tracing (using OpenZipkin)
The primary integration point with Zipkin is Armeria, which has native support for instrumenting server and client requests.
- Some users use Spring Sleuth, and some use Envoy to send spans.
- Use Brave for instrumentation.
- Most instrumentation happens within Armeria, which instruments server and client requests.
- Other standard Brave adapters in use include MySQL.
- Custom Brave adapters have been implemented for data stores such as Redis and MongoDB.
- Custom reporter that uses Armeria RPC-over-HTTP/2 (Thrift) to send spans.
The primary transport is plain HTTP/2, as used by Armeria for RPC.
A monitoring API server sits between all instrumented servers and the data store.
- Implements zipkin-api as well.
- Exposes zipkin-ui as well
- Fully asynchronous, non-blocking
The API is a simple Thrift service with a union of two fields:
- binary encoded_spans - spans that have already been serialized as a list of Zipkin Thrift spans, e.g. the output of zipkin-java's ThriftCodec. This field is preferred for languages that already have good Zipkin implementations.
- list<zipkin2.Span> spans - spans filled in via generated language code. This should be useful for languages that don't have a Zipkin implementation, as they can fill in the generated code without duplicating models and business logic. Not used yet, but would probably be used from Erlang.
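Sketched in Thrift IDL, that union might look roughly like this (names and field IDs are illustrative, not the actual internal definition; `Span` stands for the Zipkin Thrift span struct):

```thrift
// Illustrative sketch of the ingest API: a union of two span encodings.
union SpanPayload {
  // Pre-serialized list of Zipkin Thrift spans (e.g. from zipkin-java's
  // ThriftCodec). Preferred when the language has a good Zipkin library.
  1: binary encoded_spans;
  // Structured spans filled in via generated code, for languages without
  // a Zipkin implementation (e.g. Erlang).
  2: list<Span> spans;
}

service SpanCollector {
  void report(1: SpanPayload payload);
}
```

The union means a client only ever sets one of the two fields, so the server can dispatch on which encoding it received.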
Elasticsearch cluster - 10 nodes on physical machines with Xeon CPUs, 64 GB RAM, and large NVMe SSDs.
- Use Elasticsearch's Curator tool with cron to clean up indices that are more than a week old.
- The monitoring API server uses Zipkin's elasticsearch-storage to write spans into Elasticsearch with no extra processing.
- Best effort: spans dropped along the way are lost (no separate buffering storage such as Kafka).
- In practice, we don't see many failures.
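A Curator action file along these lines would implement the week-old cleanup described above; the index prefix, date pattern, and file paths are illustrative and depend on how elasticsearch-storage is configured:

```yaml
# delete_old_indices.yml -- illustrative sketch, not LINE's actual config.
actions:
  1:
    action: delete_indices
    description: Drop zipkin indices older than 7 days
    options:
      ignore_empty_list: True
    filters:
      - filtertype: pattern
        kind: prefix
        value: zipkin-
      - filtertype: age
        source: name
        direction: older
        timestring: '%Y-%m-%d'
        unit: days
        unit_count: 7
```

Run daily from cron with something like `curator --config curator.yml delete_old_indices.yml`.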
UI part: we created https://github.com/line/zipkin-lens to fit our usage and moved it to OpenZipkin.
Best effort - as long as latency investigation can happen, occasional broken traces aren't a big deal.
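The best-effort delivery described above can be sketched with a bounded in-memory buffer that drops spans on overflow instead of blocking the request thread or spilling to a separate store like Kafka. This is a stdlib-only illustration of the idea, not LINE's actual reporter; the class and method names are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical best-effort span buffer: bounded, never blocks callers,
 * and silently drops spans when full.
 */
public class BestEffortBuffer {
    private final BlockingQueue<String> queue;
    private final AtomicLong dropped = new AtomicLong();

    public BestEffortBuffer(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    /** Never blocks the instrumented request thread; drops on overflow. */
    public boolean report(String span) {
        boolean accepted = queue.offer(span);
        if (!accepted) {
            dropped.incrementAndGet();
        }
        return accepted;
    }

    /** Drain up to {@code max} spans for the next batch to the monitoring API. */
    public List<String> drainBatch(int max) {
        List<String> batch = new ArrayList<>(max);
        queue.drainTo(batch, max);
        return batch;
    }

    public long droppedCount() {
        return dropped.get();
    }

    public static void main(String[] args) {
        BestEffortBuffer buffer = new BestEffortBuffer(2);
        buffer.report("span-1");
        buffer.report("span-2");
        buffer.report("span-3"); // buffer full: dropped, not blocked
        System.out.println(buffer.drainBatch(10)); // [span-1, span-2]
        System.out.println(buffer.droppedCount()); // 1
    }
}
```

The design choice mirrors the text: losing an occasional span only breaks an occasional trace, which is acceptable as long as latency investigation can still happen.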
Eventually we want to instrument all servers - currently we instrument only one team's servers, which comprise several services, each with dozens of serving machines.
Will need Erlang instrumentation.
- Ingest rate is around 50,000 spans per second.
At LINE, service names are created freely by our users. Users mostly choose names that represent their cluster's purpose, like "bot-frontend-service" or "shop-ownership-service".
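Since Zipkin treats service names as lowercase, a hypothetical normalizer for such freely chosen names might look like this (the normalization rule is illustrative, not LINE's actual policy):

```java
public class ServiceNames {
    /**
     * Illustrative only: lowercase the name and collapse any run of
     * non-alphanumeric characters into a single hyphen, yielding names
     * like "bot-frontend-service".
     */
    public static String normalize(String raw) {
        return raw.trim().toLowerCase().replaceAll("[^a-z0-9]+", "-");
    }

    public static void main(String[] args) {
        System.out.println(normalize("Bot Frontend Service")); // bot-frontend-service
    }
}
```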
The following are span tags we frequently use in indexing or aggregation:

Tag | Description | Usage
---|---|---
instance-id | Our company-wide naming for a project |
instance-phase | Our company-wide naming for an environment |