Tail sampling #263

Merged
merged 15 commits into from
Jun 21, 2024


alexmojaki
Contributor

@alexmojaki alexmojaki commented Jun 14, 2024

There are some details I'm unsure about, such as defaults, but I'd like to get this through so we can test it out for ourselves. It could be quite useful in our backend, especially in places where we're using random trace sampling. Documentation and maybe some tweaking (including env var configuration) can come after we've used it.

Usage in a nutshell:

import logfire

logfire.configure(
    tail_sampling=logfire.TailSamplingOptions(
        # These are the defaults of TailSamplingOptions, tail_sampling is None by default.
        level='notice',  # include traces with at least one span/log at this level or higher
        duration=1.0,  # include traces with at least this duration
    ),
    # Also include 10% of traces randomly from the beginning, before checking the other conditions.
    trace_sample_rate=0.1,
)

Improper usage will eat up memory, because spans are buffered in memory while a trace is waiting for a sampling decision.
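To illustrate why buffering costs memory, here is a minimal, self-contained sketch of the idea in plain Python. It is not logfire's implementation: `Span`, `TailSamplerSketch`, and the numbers in `LEVELS` are hypothetical names and values made up for this example, and the sketch only looks at finished spans.

```python
from __future__ import annotations

import random
from dataclasses import dataclass, field

# Hypothetical level ordering for this sketch only; not taken from logfire.
LEVELS = {'debug': 5, 'info': 9, 'notice': 10, 'warn': 13, 'error': 17}


@dataclass
class Span:
    trace_id: int
    level: str
    start: float  # seconds
    end: float  # seconds
    is_root: bool = False


@dataclass
class TailSamplerSketch:
    level: str = 'notice'
    duration: float = 1.0
    trace_sample_rate: float = 0.0
    exported: list[Span] = field(default_factory=list)
    buffers: dict[int, list[Span]] = field(default_factory=dict)
    included: set[int] = field(default_factory=set)

    def on_end(self, span: Span) -> None:
        if span.trace_id in self.included:
            # The trace already qualified, so pass the span straight through.
            self.exported.append(span)
        else:
            is_new = span.trace_id not in self.buffers
            buffer = self.buffers.setdefault(span.trace_id, [])
            buffer.append(span)  # this is where memory is spent
            lucky = is_new and random.random() < self.trace_sample_rate
            notable = LEVELS[span.level] >= LEVELS[self.level]
            slow = span.end - min(s.start for s in buffer) >= self.duration
            if lucky or notable or slow:
                self.included.add(span.trace_id)
                self.exported.extend(self.buffers.pop(span.trace_id))
        if span.is_root:
            # The trace is finished: drop whatever state is left for it.
            self.buffers.pop(span.trace_id, None)
            self.included.discard(span.trace_id)
```

In this sketch a trace that never ends its root span and never qualifies stays in `buffers` indefinitely, which is the kind of memory growth the warning above is about.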

cloudflare-workers-and-pages bot commented Jun 14, 2024

Deploying logfire-docs with Cloudflare Pages

Latest commit: 80e8f4e
Status: ✅  Deploy successful!
Preview URL: https://c7bf0f28.logfire-docs.pages.dev
Branch Preview URL: https://alex-tail-sampling.logfire-docs.pages.dev

codecov bot commented Jun 14, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅


@alexmojaki alexmojaki marked this pull request as ready for review June 18, 2024 18:56
    def shutdown(self) -> None:
        self.processor.shutdown()

    def force_flush(self, timeout_millis: int = 30000) -> bool:
Member

Is this the method we were using to get logfire to work with AWS Lambda? If so, I guess we need it.

Contributor Author

It's moved to the base class.
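For readers unfamiliar with the pattern, the following is a rough sketch of what "moved to the base class" can look like. It assumes a simplified processor interface defined inline; `Processor`, `WrapperSketch`, `DropDebugSketch`, and `ListExporter` are hypothetical names for this example and are not the classes in this PR.

```python
from __future__ import annotations

from typing import Protocol


class Processor(Protocol):
    def on_end(self, span: str) -> None: ...
    def shutdown(self) -> None: ...
    def force_flush(self, timeout_millis: int = 30000) -> bool: ...


class WrapperSketch:
    """Base wrapper: forwards every call to the wrapped processor."""

    def __init__(self, processor: Processor) -> None:
        self.processor = processor

    def on_end(self, span: str) -> None:
        self.processor.on_end(span)

    def shutdown(self) -> None:
        self.processor.shutdown()

    def force_flush(self, timeout_millis: int = 30000) -> bool:
        return self.processor.force_flush(timeout_millis)


class DropDebugSketch(WrapperSketch):
    """Overrides only on_end; shutdown and force_flush are inherited."""

    def on_end(self, span: str) -> None:
        if not span.startswith('debug'):
            super().on_end(span)


class ListExporter:
    """Stand-in for a real exporter: records spans in a list."""

    def __init__(self) -> None:
        self.spans: list[str] = []
        self.flushed = False

    def on_end(self, span: str) -> None:
        self.spans.append(span)

    def shutdown(self) -> None:
        self.flushed = True

    def force_flush(self, timeout_millis: int = 30000) -> bool:
        self.flushed = True
        return True
```

With this shape, a subclass does not need to define `force_flush` itself for callers (for example, code that flushes before a serverless function exits) to keep working.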

@alexmojaki alexmojaki enabled auto-merge (squash) June 21, 2024 14:38
@alexmojaki alexmojaki merged commit e811c48 into main Jun 21, 2024
11 checks passed
@alexmojaki alexmojaki deleted the alex/tail-sampling branch June 21, 2024 14:40
Member

@adriangb adriangb left a comment

Welp, I guess you merged this already, but please address the comments.

Comment on lines +342 to +343
tail_sampling: TailSamplingOptions | None
"""Tail sampling options"""
Member

Could this just be a TypedDict?

Contributor Author

I tried that. I've also wanted ConsoleOptions and PydanticPlugin to be typed dicts, but they seemed worse: they can't define defaults in the class, and passing in a plain dict is actually not that user-friendly.

Member

Makes sense
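The trade-off in this thread can be seen in a small sketch. The names `OptionsDict`, `OptionsClass`, and `resolve` are hypothetical and only mirror the two fields shown in the PR body; this is not the actual `TailSamplingOptions` definition.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import TypedDict


class OptionsDict(TypedDict, total=False):
    # A TypedDict can declare keys and types, but not default values.
    level: str
    duration: float


@dataclass
class OptionsClass:
    # A dataclass keeps the defaults next to the field definitions.
    level: str = 'notice'
    duration: float | None = 1.0


def resolve(options: OptionsDict) -> OptionsClass:
    # With a TypedDict, the defaults have to be applied somewhere else.
    return OptionsClass(
        level=options.get('level', 'notice'),
        duration=options.get('duration', 1.0),
    )
```

Here `OptionsClass()` already carries its defaults, while the dict variant needs a helper such as `resolve` that repeats them.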

Comment on lines +616 to +622
# Avoid using the usual sampler if we're using tail-based sampling.
# The TailSamplingProcessor will handle the random sampling part as well.
sampler = (
    ParentBasedTraceIdRatio(self.trace_sample_rate)
    if self.trace_sample_rate < 1 and self.tail_sampling is None
    else None
)
Member

How do we document / tell users they can't combine these? Is there a world where I want to head sample down to 10% (to reduce overhead in the SDK) and then tail sample down to 1%? There are still advantages to head sampling.

Contributor Author

They can combine them, see the PR body or the new test_random_sampling.

> Is there a world where I want to head sample down to 10% (to reduce overhead in the SDK) and then tail sample down to 1%?

I don't know what this means. Tail sample down to a percentage?

It sounds like you want to be able to discard most spans up front randomly regardless of whether tail-sampling would include them, so that a span only gets through if it's 'notable' AND 'lucky'.

Member

Yes exactly
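The difference between the two behaviours discussed in this thread is 'notable' OR 'lucky' versus 'notable' AND 'lucky'. As a back-of-the-envelope sketch, assuming the random draw is independent of whether a trace is notable (an assumption of this example, with hypothetical function names), the expected fraction of traces kept is:

```python
def fraction_kept_either(p_notable: float, rate: float) -> float:
    """Keep a trace if it is notable OR lucky (inclusion-exclusion)."""
    return p_notable + rate - p_notable * rate


def fraction_kept_both(p_notable: float, rate: float) -> float:
    """Keep a trace only if it is notable AND lucky."""
    return p_notable * rate
```

For example, if 10% of traces are notable and the random rate is 0.1, the OR form keeps about 19% of traces, while the AND form keeps about 1%.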

return self.started[0][0]


class TailSamplingProcessor(WrapperSpanProcessor):
Member

I'm really starting to feel like we should stop adding layers / wrapping processors like this. I would prefer to have a single processor that handles everything (sampling, batching, retries, etc.). I feel like it could be optimized more and would be easier to understand. It can still have pluggable bits, just more explicit and less abstract.

Contributor Author

This layer wraps all processors, including the console and user-defined processors.

Member

I get that. But I feel like we should just make a LogfireSpanProcessor that does all of those things in one place. In particular, we would avoid double buffering.

Comment on lines +52 to +54
self.duration: float = (
    float('inf') if options.duration is None else options.duration * ONE_SECOND_IN_NANOSECONDS
)
Member

I'd like these durations to have the unit in their name. So self.duration -> self.duration_ns and TailSamplingOptions.duration -> TailSamplingOptions.duration_sec or something like that.
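A sketch of what the suggested naming could look like, written as a standalone function. The function name `duration_threshold_ns` and the parameter name `duration_sec` are hypothetical; only the conversion itself comes from the snippet under review.

```python
from __future__ import annotations

import math

ONE_SECOND_IN_NANOSECONDS = 1_000_000_000


def duration_threshold_ns(duration_sec: float | None) -> float:
    """Convert a threshold in seconds to nanoseconds.

    None means "no duration check", represented here as infinity.
    """
    if duration_sec is None:
        return math.inf
    return duration_sec * ONE_SECOND_IN_NANOSECONDS
```

With the unit in both names, a call such as `duration_threshold_ns(duration_sec=1.0)` makes the conversion visible at the call site.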

3 participants