Skip to content

Commit

Permalink
add standards for log correlation
Browse files Browse the repository at this point in the history
  • Loading branch information
zenmoto committed Jun 8, 2020
1 parent c14bf4c commit f8cc2cf
Showing 1 changed file with 137 additions and 0 deletions.
137 changes: 137 additions & 0 deletions text/logs/0000-log-correlation.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,137 @@
# Conventions for Trace and Resource Association in Logs

To set out standards for correlating traces and resources in existing logs

## Motivation

In order to correlate log entries with the tracing spans that were occurring
when the log entry was created, we need to emit certain information in the logs
to correlate the data entries. This document attempts to lay out the information
that needs to be emitted, and sets forth several approaches for doing so.

## Explanation

Two types of correlation need to happen to tie a log record into its full
execution context. The first is Request Correlation, which ties the record to
operations that were occurring when the record was created. The second is
Resource Correlation, which associates the entry with where the event occurred
such as a host, a pod, or a virtual machine.

### Request Correlation

Trace correlation is achieved primarily with two values- a trace identifier and
a span identifier. A trace may contain multiple spans, arranged as a tree, and
may also contain links to other related spans. The combination of a trace
identifier and a span identifier corresponds to a specific scope of work. In the
tracing API, that scope can also contain attributes and events that describe
what work the program being traced was currently performing.

In most cases when traces and logs are used in tandem, the attributes of the
current scope do not need to be added to the log entry- doing so would duplicate
transmission of those values. As one span is likely associated with several (or
many) log entries, it is more efficient to transmit those attributes with the
span once rather than many times with each log entry. As a result, for most
purposes the goal is to set the three values of traceid, spanid, and traceflags
into each log entry that should be associated with that span.

| Field | Required | Format
| :--------- | :------- | :--------------------------------------------------
| traceid | Yes | 16-byte numeric value as Base16-encoded string
| spanid | Yes | 8-byte value represented as Base16-encoded string
| traceflags | No | 8-bit numeric value as a Base16-encoded string

### Resource Correlation
As important as what was occurring in a program’s execution is, where it was
occurring is just as important. Resource correlation allows a log entry to be
associated with an infrastructure resource, and in turn system and program
metrics that describe the wider program state. The form that the resource takes
is also more diverse than the tracing scope- the resource may be a pod running
in a Kubernetes cluster, a virtual machine running in a cloud, a serverless
lambda, or an old-school server sitting in a data center. An application
environment may be an orchestration of multiple types of resource working
together.

From a logging standpoint, a resource is also almost always a constant within an
application process- a container may not have the same identifier on every run,
but it does keep that identifier while it’s running, and that resource
identifier is constant for every log entry created on that resource. As a
result, resource correlation may not happen at the log entry level, so we may or
may not put the resource correlation information in the log entry itself.

Resource information may be managed as part of the log ingestion process- for
example a Docker logging driver will know which container logs came from.
Full resource information may also not be available, as when logs are
aggregated by syslog or a similar system that has less context available to it.
As a result resource correlation information in the logs entries themselves
should be considered optional. If included, the resource information should
follow the semantic conventions for resources.

If the log entry is expressed in key-value pairs, any resource keys should be
prepended with ‘resource’- for example, `resource.service.name=”shoppingcart”`.
If the entry is expressed in JSON, the resource key-values should be placed in
an object named “resource” at the top level of the object.

### Correlation Context

A Correlation Context is a set of key-value pairs that is shared amongst the
spans of a distributed trace. Like span attributes, the correlation context may
already be carried with span information, so duplicating this information may be
redundant. In certain cases it may be important to associate this context with
log entries. When the context is embedded in a log entry, the key-value
pairs should be placed in a ‘correlation’ namespace. Where key-value pairs are
supported, embed the correlation key as “correlation.key_name”. In JSON or
other formats that allow nested structures, the key-value pairs should be
placed in an object named ‘correlation’ at the top of the object.


## Internal details

### Encoding
#### Key-Value Pairs
2020-05-20 20:13:31 INFO Message logged. resource.hostname=myhost correlation.user=djones traceid=0354af75138b12921 spanid=14c902d73a traceflags=01
#### JSON
{
"time": "2020-05-20 20:13:31",
"msg": "Message logged.",
"level": "INFO",
"correlation": {
“user”: “djones”
},
"resource": {
"hostname":"myhost"
}
"traceid": "0354af75138b12921",
"spanid": "14c902d73a",
"traceflags": "01"
}

### Custom Format

Custom formats that don’t allow for automatic parsing of key-value pairs can be
used, but they will require synchronization between the output format and the
extraction mechanism. These types of extractions may have the advantage of being
less verbose, but they also have the disadvantage of requiring setting up a
custom extraction process, and may be more fragile. Since this approach is
vendor-dependent, there is little guidance that can be provided by
OpenTelemetry.

## Trade-offs and mitigations

As a convention, adherence to these standards may have some variation, and some
of this advice may be difficult to implement in certain circumstances.

## Prior art and alternatives

Elastic Common Schema [has standards](https://www.elastic.co/guide/en/ecs/current/ecs-tracing.html#ecs-tracing)
for adding trace information to JSON log formats, but they do not support the
full OT correlation model

## Open questions


## Future possibilities

A stronger specification should be created for logs that are generated by
OpenTelemetry instrumentation and adapters that support conversion from and
to OpenTelemetry's internal logging models. As that specification is created,
they should be kept in sync with these conventions.

0 comments on commit f8cc2cf

Please sign in to comment.