
Tail sampling (#407)
Co-authored-by: Marcelo Trylesinski <[email protected]>
alexmojaki and Kludex authored Sep 17, 2024
1 parent fb0a162 commit 63a0950
Showing 22 changed files with 965 additions and 291 deletions.
1 change: 1 addition & 0 deletions docs/api/sampling.md
@@ -0,0 +1 @@
::: logfire.sampling
312 changes: 293 additions & 19 deletions docs/guides/advanced/sampling.md
@@ -1,38 +1,312 @@
# Sampling

Sampling is the practice of discarding some traces or spans in order to reduce the amount of data that needs to be
stored and analyzed. Sampling is a trade-off between cost and completeness of data.

_Head sampling_ means the decision to sample is made at the beginning of a trace. This is simpler and more common.

_Tail sampling_ means the decision to sample is delayed, possibly until the end of a trace. This means there is more
information available to make the decision, but this adds complexity.

Sampling usually happens at the trace level, meaning entire traces are kept or discarded. This way the remaining traces
are generally complete.

## Random head sampling

Here's an example of randomly sampling 50% of traces:

```python
import logfire

logfire.configure(sampling=logfire.SamplingOptions(head=0.5))

for x in range(10):
    with logfire.span(f'span {x}'):
        logfire.info(f'log {x}')
```

This outputs something like:

```
11:09:29.041 span 0
11:09:29.041 log 0
11:09:29.041 span 1
11:09:29.042 log 1
11:09:29.042 span 4
11:09:29.042 log 4
11:09:29.042 span 5
11:09:29.042 log 5
11:09:29.042 span 7
11:09:29.042 log 7
```

Note that 5 out of 10 traces are kept, and that the child log is kept if and only if the parent span is kept.

## Tail sampling by level and duration

Random head sampling often works well, but you may not want to lose any traces which indicate problems. In this case,
you can use tail sampling. Here's a simple example:

```python
import time

import logfire

logfire.configure(sampling=logfire.SamplingOptions.level_or_duration())

for x in range(3):
    # None of these are logged
    with logfire.span('excluded span'):
        logfire.info(f'info {x}')

    # All of these are logged
    with logfire.span('included span'):
        logfire.error(f'error {x}')

for t in range(1, 10, 2):
    with logfire.span(f'span with duration {t}'):
        time.sleep(t)
```

This outputs something like:

```
11:37:45.484 included span
11:37:45.484 error 0
11:37:45.485 included span
11:37:45.485 error 1
11:37:45.485 included span
11:37:45.485 error 2
11:37:49.493 span with duration 5
11:37:54.499 span with duration 7
11:38:01.505 span with duration 9
```

[`logfire.SamplingOptions.level_or_duration()`][logfire.sampling.SamplingOptions.level_or_duration] creates an instance
of [`logfire.SamplingOptions`][logfire.sampling.SamplingOptions] with simple tail sampling. With no arguments,
it means that a trace will be included if and only if it has at least one span/log that:

1. has a log level greater than `info` (the default of any span), or
2. has a duration greater than 5 seconds.

This way you won't lose information about warnings/errors or long-running operations. You can customize what to keep
with the `level_threshold` and `duration_threshold` arguments.
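
The rule can be pictured in plain Python. This is a hypothetical sketch of the decision described above, not logfire's implementation: the numeric level values and the `should_keep_trace` helper are invented for illustration, with parameter names borrowed from `level_or_duration`:

```python
# Hypothetical sketch of the "level or duration" tail-sampling rule.
# The numeric level values are illustrative, not logfire's internals.
LEVEL_NUMBERS = {'debug': 5, 'info': 9, 'warn': 13, 'error': 17}


def should_keep_trace(spans, level_threshold='info', duration_threshold=5.0):
    """Keep a trace iff any span exceeds the level or duration threshold."""
    return any(
        LEVEL_NUMBERS[span['level']] > LEVEL_NUMBERS[level_threshold]
        or span['duration'] > duration_threshold
        for span in spans
    )


# A trace of ordinary info spans is discarded:
print(should_keep_trace([{'level': 'info', 'duration': 0.1}]))  # False
# A single error (or a single slow span) keeps the whole trace:
print(should_keep_trace([
    {'level': 'info', 'duration': 0.1},
    {'level': 'error', 'duration': 0.0},
]))  # True
```

Raising either threshold shrinks the set of "notable" traces, which is exactly the trade-off the arguments control.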

## Combining head and tail sampling

You can combine head and tail sampling. For example:

```python
import logfire

logfire.configure(sampling=logfire.SamplingOptions.level_or_duration(head=0.1))
```

This will only keep 10% of traces, even if they have a high log level or duration. Traces that don't meet the tail
sampling criteria will be discarded every time.

## Keeping a fraction of all traces

To keep some traces even if they don't meet the tail sampling criteria, you can use the `background_rate` argument. For
example, this script:

```python
import logfire

logfire.configure(sampling=logfire.SamplingOptions.level_or_duration(background_rate=0.3))

for x in range(10):
    logfire.info(f'info {x}')
for x in range(5):
    logfire.error(f'error {x}')
```

will output something like:

```
12:24:40.293 info 2
12:24:40.293 info 3
12:24:40.293 info 7
12:24:40.294 error 0
12:24:40.294 error 1
12:24:40.294 error 2
12:24:40.294 error 3
12:24:40.295 error 4
```

i.e. about 30% of the info logs and 100% of the error logs are kept.

(Technical note: the trace ID is compared against the head and background rates to determine inclusion, so the
probabilities don't depend on the number of spans in the trace, and the rates give the probabilities directly without
needing any further calculations. For example, with a head sample rate of `0.6` and a background rate of `0.3`, the
chance of a non-notable trace being included is `0.3`, not `0.6 * 0.3`.)
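
This can be demonstrated with a toy simulation. It is a sketch of the idea only: it assumes the SDK derives a fixed uniform number from each trace ID and compares that same number against every rate, which is the property the note above relies on.

```python
import random


def uniform_from_trace_id(trace_id: int) -> float:
    # Map a 128-bit trace ID to a fixed number in [0, 1).
    # The key property: the same trace always yields the same number.
    return trace_id / 2**128


def keeps(trace_id: int, rate: float) -> bool:
    return uniform_from_trace_id(trace_id) < rate


# With head=0.6 and background=0.3, a non-notable trace passes both
# checks iff its number is below min(0.6, 0.3), so the inclusion
# probability is 0.3, not 0.6 * 0.3 = 0.18.
random.seed(0)
trials = 100_000
kept = 0
for _ in range(trials):
    tid = random.getrandbits(128)
    if keeps(tid, 0.6) and keeps(tid, 0.3):
        kept += 1
print(kept / trials)  # close to 0.3
```

If the two decisions used independent random numbers instead, the surviving fraction would be the product of the rates.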

## Caveats of tail sampling

### Memory usage

For tail sampling to work, all the spans in a trace must be kept in memory until either the trace is included by
sampling or the trace is completed and discarded. In the above example, the spans named `included span` don't have a
high enough level to be included, so they are kept in memory until the error logs cause the entire trace to be included.
This means that traces with a large number of spans can consume a lot of memory, whereas without tail sampling the spans
would be regularly exported and freed from memory without waiting for the rest of the trace.

In practice this is usually OK, because such large traces will usually exceed the duration threshold, at which point the
trace will be included and the spans will be exported and freed. This works because the duration is measured as the time
between the start of the trace and the start/end of the most recent span, so the tail sampler can know that a span will
exceed the duration threshold even before it's complete. For example, running this script:

```python
import time

import logfire

logfire.configure(sampling=logfire.SamplingOptions.level_or_duration())

with logfire.span('span'):
    for x in range(1, 10):
        time.sleep(1)
        logfire.info(f'info {x}')
```

will do nothing for the first 5 seconds, before suddenly logging all this at once:

```
12:29:43.063 span
12:29:44.065 info 1
12:29:45.066 info 2
12:29:46.072 info 3
12:29:47.076 info 4
12:29:48.082 info 5
```

followed by additional logs once per second. This is despite the fact that at this stage the outer span hasn't completed
yet and the inner logs each have 0 duration.

However, memory usage can still be a problem in any of the following cases:

- The duration threshold is set to a high value
- Spans are produced extremely rapidly
- Spans contain large attributes

### Distributed tracing

Logfire's tail sampling is implemented in the SDK and only works for traces within one process. If you need tail
sampling with distributed tracing, consider deploying
the [Tail Sampling Processor in the OpenTelemetry Collector](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/processor/tailsamplingprocessor/README.md).

If a trace was started on another process and its context was propagated to the process using the Logfire SDK tail
sampling, the whole trace will be included.

If you start a trace with the Logfire SDK with tail sampling, and then propagate the context to another process, the
spans generated by the SDK may be discarded, while the spans generated by the other process may be included, leading to
an incomplete trace.

## Custom head sampling

If you need more control than random sampling, you can pass an [OpenTelemetry
`Sampler`](https://opentelemetry-python.readthedocs.io/en/latest/sdk/trace.sampling.html). For example:

```python
from opentelemetry.sdk.trace.sampling import (
    ALWAYS_OFF,
    ALWAYS_ON,
    ParentBased,
    Sampler,
)

import logfire

class MySampler(Sampler):
    def should_sample(
        self,
        parent_context,
        trace_id,
        name,
        *args,
        **kwargs,
    ):
        if name == 'exclude me':
            sampler = ALWAYS_OFF
        else:
            sampler = ALWAYS_ON
        return sampler.should_sample(
            parent_context,
            trace_id,
            name,
            *args,
            **kwargs,
        )

    def get_description(self):
        return 'MySampler'


logfire.configure(
    sampling=logfire.SamplingOptions(
        head=ParentBased(
            MySampler(),
        )
    )
)

with logfire.span('keep me'):
    logfire.info('kept child')

with logfire.span('exclude me'):
    logfire.info('excluded child')
```

This will output something like:

```
10:37:30.897 keep me
10:37:30.898 kept child
```

Note that the sampler explicitly excluded only the span named `exclude me`. The reason that the `excluded child` log is
not included is that `MySampler` was wrapped in a `ParentBased` sampler, which excludes spans whose parents are
excluded. If you remove that and simply pass `head=MySampler()`, the `excluded child` log will be included, resulting in
an incomplete trace.
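
The `ParentBased` behavior can be sketched as a tiny decision function (an illustration of the concept only, not OpenTelemetry's actual interface):

```python
def parent_based(parent_sampled, root_sampler):
    """Inherit the parent's sampling decision when there is one;
    only root spans consult the wrapped sampler."""
    if parent_sampled is not None:
        return parent_sampled
    return root_sampler()


# The root span 'exclude me' is dropped by the wrapped sampler...
root = parent_based(None, lambda: False)
# ...and its child inherits that decision instead of being re-sampled,
# so the trace stays consistent.
child = parent_based(root, lambda: True)
print(root, child)  # False False
```

Without the inheritance step, each span would be sampled independently, which is how the orphaned `excluded child` log arises.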

You can also pass a `Sampler` to the `head` argument of `SamplingOptions.level_or_duration` to combine tail sampling
with custom head sampling.

## Custom tail sampling

If you want tail sampling with more control than `level_or_duration`, you can pass a function to
[`tail`][logfire.sampling.SamplingOptions.tail] which will accept an instance of
[`TailSamplingSpanInfo`][logfire.sampling.TailSamplingSpanInfo] and return a float between 0 and 1 representing the
probability that the trace should be included. For example:

```python
import logfire


def get_tail_sample_rate(span_info):
    if span_info.duration >= 1:
        return 0.5  # (1)!

    if span_info.level > 'warn':  # (2)!
        return 0.3  # (3)!

    return 0.1  # (4)!


logfire.configure(
    sampling=logfire.SamplingOptions(
        head=0.5,  # (5)!
        tail=get_tail_sample_rate,
    ),
)
```

1. Keep 50% of traces with duration >= 1 second
2. `span_info.level` is a [special object][logfire.sampling.SpanLevel] that can be compared to log level names
3. Keep 30% of traces with a warning or error and with duration < 1 second
4. Keep 10% of other traces
5. Discard 50% of traces at the beginning to reduce the overhead of generating spans. This is optional, but improves
performance, and we know that `get_tail_sample_rate` will always return at most 0.5 so the other 50% of traces will
be discarded anyway. The probabilities are not independent - this will not discard traces that would otherwise have
been kept by tail sampling.
2 changes: 1 addition & 1 deletion logfire-api/logfire_api/__init__.py
@@ -173,7 +173,7 @@ def suppress_instrumentation():
class ConsoleOptions:
def __init__(self, *args, **kwargs) -> None: ...

class TailSamplingOptions:
class SamplingOptions:
def __init__(self, *args, **kwargs) -> None: ...

class ScrubbingOptions:
4 changes: 2 additions & 2 deletions logfire-api/logfire_api/__init__.pyi
@@ -3,15 +3,15 @@ from ._internal.auto_trace.rewrite_ast import no_auto_trace as no_auto_trace
from ._internal.config import ConsoleOptions as ConsoleOptions, METRICS_PREFERRED_TEMPORALITY as METRICS_PREFERRED_TEMPORALITY, PydanticPlugin as PydanticPlugin, configure as configure
from ._internal.constants import LevelName as LevelName
from ._internal.exporters.file import load_file as load_spans_from_file
from ._internal.exporters.tail_sampling import TailSamplingOptions as TailSamplingOptions
from ._internal.main import Logfire as Logfire, LogfireSpan as LogfireSpan
from ._internal.scrubbing import ScrubMatch as ScrubMatch, ScrubbingOptions as ScrubbingOptions
from ._internal.utils import suppress_instrumentation as suppress_instrumentation
from .integrations.logging import LogfireLoggingHandler as LogfireLoggingHandler
from .integrations.structlog import LogfireProcessor as StructlogProcessor
from .version import VERSION as VERSION
from logfire.sampling import SamplingOptions as SamplingOptions

__all__ = ['Logfire', 'LogfireSpan', 'LevelName', 'ConsoleOptions', 'PydanticPlugin', 'configure', 'span', 'instrument', 'log', 'trace', 'debug', 'notice', 'info', 'warn', 'error', 'exception', 'fatal', 'force_flush', 'log_slow_async_callbacks', 'install_auto_tracing', 'instrument_fastapi', 'instrument_openai', 'instrument_anthropic', 'instrument_asyncpg', 'instrument_httpx', 'instrument_celery', 'instrument_requests', 'instrument_psycopg', 'instrument_django', 'instrument_flask', 'instrument_starlette', 'instrument_aiohttp_client', 'instrument_sqlalchemy', 'instrument_redis', 'instrument_pymongo', 'instrument_mysql', 'instrument_system_metrics', 'AutoTraceModule', 'with_tags', 'with_settings', 'shutdown', 'load_spans_from_file', 'no_auto_trace', 'METRICS_PREFERRED_TEMPORALITY', 'ScrubMatch', 'ScrubbingOptions', 'VERSION', 'suppress_instrumentation', 'StructlogProcessor', 'LogfireLoggingHandler', 'TailSamplingOptions']
__all__ = ['Logfire', 'LogfireSpan', 'LevelName', 'ConsoleOptions', 'PydanticPlugin', 'configure', 'span', 'instrument', 'log', 'trace', 'debug', 'notice', 'info', 'warn', 'error', 'exception', 'fatal', 'force_flush', 'log_slow_async_callbacks', 'install_auto_tracing', 'instrument_fastapi', 'instrument_openai', 'instrument_anthropic', 'instrument_asyncpg', 'instrument_httpx', 'instrument_celery', 'instrument_requests', 'instrument_psycopg', 'instrument_django', 'instrument_flask', 'instrument_starlette', 'instrument_aiohttp_client', 'instrument_sqlalchemy', 'instrument_redis', 'instrument_pymongo', 'instrument_mysql', 'instrument_system_metrics', 'AutoTraceModule', 'with_tags', 'with_settings', 'shutdown', 'load_spans_from_file', 'no_auto_trace', 'METRICS_PREFERRED_TEMPORALITY', 'ScrubMatch', 'ScrubbingOptions', 'VERSION', 'suppress_instrumentation', 'StructlogProcessor', 'LogfireLoggingHandler', 'SamplingOptions']

DEFAULT_LOGFIRE_INSTANCE = Logfire()
span = DEFAULT_LOGFIRE_INSTANCE.span
