
Log record mutations are visible in next registered processors #4065

Closed
pellared opened this issue May 31, 2024 · 17 comments · Fixed by #4067
Assignees
Labels: spec:logs (Related to the specification/logs directory), triage:accepted:ready-with-sponsor (Ready to be implemented and has a specification sponsor assigned)

Comments

@pellared
Member

pellared commented May 31, 2024

What are you trying to achieve?

Define whether log record mutations done in a log record processor are visible to the next registered processors.

From #4010 (comment):

I believe the spec is not clear enough and can be interpreted in different ways, leading to very different implementations (e.g. chained processors that can mutate data and see the mutations of previous processors, or parallel processors that do not see the mutations of other processors).

We likely need to fix the spec. It is in a stable doc but is vague enough that I think it is a bug and should be fixed.

Since the spec is vague, I am also worried that existing implementations in different language SDKs may have interpreted it differently and made different choices. IMO, we need to start by auditing existing implementations to see where we stand, and then decide how exactly we want to fix the spec (one of the options is that both "chained-allow-mutation" and "fanout-copy-to-parallel" should be possible, like they are in the Collector pipelines).

Additional context.

The undocumented assumption was that log processors would follow the same pattern as span processors: mutations are visible to the next registered processor. However, the C++ Logs SDK is not designed this way (log records are passed as copies to subsequent registered processors). Other Logs SDKs follow the pattern or are not stable.
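
To make the ambiguity concrete, here is a minimal Go sketch of the two interpretations. The `Record` and `Processor` types are simplified, hypothetical stand-ins, not the actual SDK API.

```go
package main

import (
	"context"
	"fmt"
)

// Record is a simplified stand-in for an SDK log record.
type Record struct {
	Body       string
	Attributes map[string]string
}

// Processor is a simplified stand-in for a LogRecordProcessor.
type Processor interface {
	OnEmit(ctx context.Context, r *Record)
}

// addAttrProcessor mutates the record it receives.
type addAttrProcessor struct{ key, value string }

func (p addAttrProcessor) OnEmit(_ context.Context, r *Record) {
	r.Attributes[p.key] = p.value
}

// printProcessor only reads the record.
type printProcessor struct{}

func (printProcessor) OnEmit(_ context.Context, r *Record) {
	fmt.Println(r.Body, r.Attributes)
}

func main() {
	ctx := context.Background()
	mutator := addAttrProcessor{key: "env", value: "prod"}
	printer := printProcessor{}

	// Interpretation A ("chained-allow-mutation"): both processors receive the
	// same record, so the printer observes the attribute added by the mutator.
	recA := &Record{Body: "hello", Attributes: map[string]string{}}
	mutator.OnEmit(ctx, recA)
	printer.OnEmit(ctx, recA) // prints: hello map[env:prod]

	// Interpretation B ("fanout-copy-to-parallel"): each processor receives its
	// own copy, so the printer does not observe the mutation.
	recB := &Record{Body: "hello", Attributes: map[string]string{}}
	mutatorCopy := &Record{Body: recB.Body, Attributes: map[string]string{}}
	mutator.OnEmit(ctx, mutatorCopy)
	printer.OnEmit(ctx, recB) // prints: hello map[]
}
```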

@pellared pellared added the spec:logs Related to the specification/logs directory label May 31, 2024
@pellared
Member Author

pellared commented May 31, 2024

I will check if it is possible to follow the pattern in Go despite the pitfalls I outlined in #4010 (comment). My main goal is to keep the possibility of zero allocations when the log records are not modified (as this is the usual use case).

If we succeed in finding a satisfying design for Go (and Rust), we could update the specification to make this requirement explicit.
However, what are we going to do with the fact that the stable C++ Logs SDK has already taken a different route? Would it have to be a "SHOULD" instead of a "MUST"?

CC @open-telemetry/rust-maintainers @open-telemetry/cpp-maintainers @open-telemetry/technical-committee

@pellared pellared changed the title Define if log record mutations done in record processor SHOULD be visible to the next registered processors Define if log record mutations done in record processor MUST (or SHOULD) be visible to the next registered processors May 31, 2024
@pellared pellared changed the title Define if log record mutations done in record processor MUST (or SHOULD) be visible to the next registered processors Define if log record mutations done in record processor are visible to the next registered processors May 31, 2024
@pellared pellared changed the title Define if log record mutations done in record processor are visible to the next registered processors Log record mutations are visible in next registered processors May 31, 2024
@jack-berg
Member

If a log processor's mutations were not visible to the next processor, then there would be no (or very little) point in onEmit receiving a ReadWriteLogRecord. If mutations are only visible to the processor being called, then the SDK might as well call onEmit with a ReadableLogRecord, since that processor can easily mutate with copy-on-write semantics.

Other evidence that "mutations are not visible to subsequent processors" is the wrong interpretation:

  • The log SDK was modeled after the trace SDK, and SpanProcessor#onStart modifications are visible to subsequent processors. It would be a very strange design if they weren't.
  • "LogRecordProcessors are invoked in the same order they were registered" (see spec). There's no point to this if mutations aren't visible to subsequent processors.
  • "A LogRecordProcessor may freely modify logRecord for the duration of the OnEmit call. If logRecord is needed after OnEmit returns (i.e. for asynchronous processing) only reads are permitted" (see spec). The intent of this line is to clarify that log processors can only mutate log records synchronously, since if they hold onto the record and mutate asynchronously, the mutation may conflict with one from a subsequent processor. See Resolve lingering question of LogRecord mutability in LogRecordProcessor (#2955) for more context, which was resolved by this statement.

However, what are we going to do with the fact that the stable C++ Logs SDK already has taken a different route?

Wondering if you can elaborate on this. I can't seem to find the LogRecordProcessor interface definition in C++ (I don't know C++ so I'm very naive navigating the code). Do log record processors call subsequent processors in C++? Is it possible to write a log processor which extracts and includes baggage attributes on log records in C++?
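
For reference, under the "mutations are visible to subsequent processors" interpretation, such a baggage-enriching processor could look roughly like the Go sketch below. The `Record` and `Processor` types are simplified, hypothetical stand-ins; only the baggage package (go.opentelemetry.io/otel/baggage) is the real API.

```go
package sketch

import (
	"context"

	"go.opentelemetry.io/otel/baggage"
)

// Record and Processor are simplified stand-ins for the SDK types.
type Record struct {
	Body       string
	Attributes map[string]string
}

type Processor interface {
	OnEmit(ctx context.Context, r *Record)
}

// baggageProcessor copies baggage entries from the context onto the record.
// If mutations are visible to subsequent processors (and ultimately to the
// exporter), registering this processor first is all that is needed to
// enrich every log record in the pipeline.
type baggageProcessor struct{}

func (baggageProcessor) OnEmit(ctx context.Context, r *Record) {
	for _, member := range baggage.FromContext(ctx).Members() {
		r.Attributes[member.Key()] = member.Value()
	}
}
```

If each processor instead received an isolated copy, this processor would have no effect on what later processors or the exporter see, which is the crux of this issue.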

@pellared
Member Author

pellared commented May 31, 2024

If a log processor's mutations were not visible to the next processor, then there would be no (or very little) point of onEmit receiving a ReadWriteLogRecord.

The modified record can be passed to a decorated processor. This is how mutations can currently be done in C++, Rust, and Go; e.g. the mutatingProcessor found here: #4010 (comment)

Notice that currently there is nothing in the spec that allows creating a new/custom ReadableLogRecord.
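
In Go terms, such a decorating processor looks roughly like the sketch below (simplified, hypothetical types; the real example is the mutatingProcessor linked above). The mutation is visible only to the wrapped processor, not to any processor registered in parallel.

```go
package sketch

import "context"

// Record and Processor are simplified stand-ins for the SDK types; the record
// is passed by value, as in the current C++, Rust, and Go designs.
type Record struct {
	Body string
}

type Processor interface {
	OnEmit(ctx context.Context, r Record)
}

// mutatingProcessor decorates another processor: it modifies its local copy of
// the record and forwards that copy to the wrapped ("next") processor.
type mutatingProcessor struct {
	next Processor
}

func (p mutatingProcessor) OnEmit(ctx context.Context, r Record) {
	r.Body = "[redacted]"
	p.next.OnEmit(ctx, r)
}
```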

Wondering if you can elaborate on this.

Sure. Regarding C++ my information is based on #4010 (comment).

@jack-berg
Member

The modified record can be passed to a decorated processor. This is how mutations can be currently done in C++, Rust, and Go.

There's no mention of a decorated processor in the spec, so it sounds like C++, Rust, and Go invented new concepts. To be clear, I support an effort to evolve the spec to include accommodations which allow for higher performance implementations. However, I feel strongly that the current language of the spec does imply that log record mutations are visible to subsequent processors. The design juggles several priorities, including performance and conceptual symmetry with prior art in otel (the trace signal in particular). A design featuring chaining or decorating processors might pave a clear path for a high-performance implementation, but it does so at the expense of conceptual symmetry.

My Rust, C++, and Go skills are pretty poor, but there may be creative language-specific ways to keep the door open to zero allocations while also staying within the spirit of the spec. For example, a processor might have a method indicating whether or not it mutates data. If none of the processors mutate data, or if there is only one processor registered, then the SDK could use an alternative implementation of ReadWriteLogRecord which avoids allocations.
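
A rough Go sketch of that idea, using hypothetical names (`MutatingProcessor`, `Mutates`); nothing like this exists in the spec today:

```go
package sketch

import "context"

type Record struct{ Body string }

// Processor is a simplified stand-in for the LogRecordProcessor interface.
type Processor interface {
	OnEmit(ctx context.Context, r *Record)
}

// MutatingProcessor is a hypothetical optional interface: a processor that
// intends to modify records implements it and returns true from Mutates.
type MutatingProcessor interface {
	Processor
	Mutates() bool
}

// anyMutates reports whether any registered processor declares that it
// mutates records. If none do, or only one processor is registered, the SDK
// could switch to a cheaper record representation that avoids allocations.
func anyMutates(procs []Processor) bool {
	for _, p := range procs {
		if mp, ok := p.(MutatingProcessor); ok && mp.Mutates() {
			return true
		}
	}
	return false
}
```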

@pellared
Member Author

pellared commented May 31, 2024

Actually, it may turn out not to be a problem for Go; I am in the process of evaluating it. I remember from my previous prototypes that in most cases the Go escape analysis was doing a good job. Reference: https://github.com/open-telemetry/opentelemetry-go/blob/main/log/DESIGN.md#passing-record-as-pointer-to-loggeremit

The Rust Logs SDK is also not stable, and there was a discussion about whether to use the same pattern as for span processors. See #4010 (comment)

It may be a good compromise to follow the span processor pattern to achieve better consistency across languages as well as with tracing.

@pellared
Member Author

pellared commented Jun 3, 2024

For the sake of transparency, I want to share some thoughts based on feedback I received from @MrAlias.

The specification does not require log records to be concurrent-safe (notice that spans are required to be concurrent-safe by the spec here). What happens when a subsequent processor modifies a log record that was passed to a batch processor? Does the batch processor need to make a deep copy of the processed records?

How will we make sure that users do not modify the log record asynchronously? In Go, I thought about adding a comment. However, a comment like this is not a safe solution: there are going to be many user use cases that violate it, and a comment is not an adequate design choice. We could make the log record concurrent-safe, but this would decrease the performance even further.

@jack-berg
Member

A LogRecordProcessor may freely modify logRecord for the duration of the OnEmit call. If logRecord is needed after OnEmit returns (i.e. for asynchronous processing) only reads are permitted

From the spec. A processor which asynchronously modifies the ReadWriteLogRecord passed to LogRecordProcessor#onEmit is violating the spec and the behavior is undefined. In practice, some processors will need to make changes asynchronously. These implementations should make a local copy of the log record. I imagine an implementation could take stronger measures to prevent this by doing something to lock the log record after a log record processor has returned, which may come at the expense of performance (not sure because I haven't tried).

This appears to be one of those areas of the spec where we have a possible footgun. And personally, I'm fine with it. We should have a well-paved path that works out of the box for most of the things that users want to do. But there will be advanced users who want to step off that path, and they should have the ability to do things and take responsibility for those things. It's the "we're all adults here" argument.
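
A minimal sketch of the "make a local copy" advice, with simplified hypothetical Go types:

```go
package sketch

import "context"

type Record struct {
	Body       string
	Attributes map[string]string
}

type Processor interface {
	OnEmit(ctx context.Context, r *Record)
}

// asyncProcessor needs the record after OnEmit returns, so it clones it and
// only ever touches the clone from its background goroutine.
type asyncProcessor struct {
	queue chan Record
}

func (p *asyncProcessor) OnEmit(_ context.Context, r *Record) {
	clone := Record{
		Body:       r.Body,
		Attributes: make(map[string]string, len(r.Attributes)),
	}
	for k, v := range r.Attributes {
		clone.Attributes[k] = v // deep-copy mutable fields
	}
	select {
	case p.queue <- clone: // hand off the private copy, not the shared record
	default: // drop rather than block the caller's hot path when the queue is full
	}
}
```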

@pellared
Member Author

pellared commented Jun 3, 2024

A processor which asynchronously modifies the ReadWriteLogRecord passed to LogRecordProcessor#onEmit is violating the spec and the behavior is undefined.

Consider a processor that synchronously modifies a record which is retained by another processor that only makes asynchronous reads of it. This could happen if, for example, one registers (perhaps by mistake) a redacting processor after a batch processor.

I think that the specification should say something like:

If the log record is not concurrent safe then OnEmit implementations:

  • must not modify the record asynchronously
  • must not retain the record
  • must copy the record if it is needed after OnEmit returns

Currently the spec makes it easy to have implementations with race conditions.
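
To illustrate with simplified, hypothetical Go types: a batch processor that retains the record for asynchronous export, followed by a redacting processor that mutates the same record synchronously, produces exactly this kind of race.

```go
package sketch

import "context"

type Record struct {
	Attributes map[string]string
}

type Processor interface {
	OnEmit(ctx context.Context, r *Record)
}

// batchProcessor retains the record; an export goroutine reads it later.
type batchProcessor struct {
	queue chan *Record
}

func (p *batchProcessor) OnEmit(_ context.Context, r *Record) {
	p.queue <- r // asynchronous reads of r happen during export
}

// redactingProcessor synchronously mutates the record during OnEmit.
type redactingProcessor struct{}

func (redactingProcessor) OnEmit(_ context.Context, r *Record) {
	delete(r.Attributes, "password")
}

// If the SDK registers [batchProcessor, redactingProcessor] in that order and
// passes the same *Record to both, the redacting processor's synchronous write
// races with the batch processor's asynchronous reads, even though neither
// processor modified the record asynchronously. The proposed "must copy the
// record if it is needed after OnEmit returns" rule closes this gap.
```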

@jack-berg
Member

I'm supportive of clarifying the language. But it wouldn't effectively change any implementation, would it? At the end of the day, the user is still responsible for not making asynchronous changes to the record. This is just providing better language explaining "why".

@pellared
Member Author

pellared commented Jun 3, 2024

the user is still responsible for not making asynchronous changes to the record.

The user can make a synchronous change while another processor makes an asynchronous read. This leads to a race condition.

But it wouldn't effectively change any implementation would it?

E.g. you are not making a clone of the record here: https://github.com/open-telemetry/opentelemetry-java/blob/d9cef81e298f93b2379aa87d96edacf834fd7fcc/sdk/logs/src/main/java/io/opentelemetry/sdk/logs/export/BatchLogRecordProcessor.java#L92 and the next registered processor can make log record modifications that would lead to a race condition.

@MrAlias
Contributor

MrAlias commented Jun 3, 2024

Keep in mind that a user-defined processor is not constrained by this specification. Saying they should or shouldn't do something is a meaningless statement. We can only make prescriptive statements about OTel processor behaviors.

Having designs that syntactically restrict or promote good use and experience is the way you influence 3rd-party use cases.

@jack-berg
Member

E.g. you are not making a clone of record here:

Nah, we're always going to make that clone in otel java. We have different representations for SDK processors vs. exporters. But it could save a non-stable implementation from needing to do something like that. Although I would argue that an implementation doesn't need to make that distinction today: they can safely add the record to the batch processor queue. Any issue with the record being mutated at the same time as the exporter is serializing it is the fault of the processor which violated the spec.

@pellared
Member Author

pellared commented Jun 3, 2024

If we want consistency with tracing (over performance) for log processors, then we should consider requiring the SDK's log record to be concurrent-safe.

EDIT: For reference open-telemetry/opentelemetry-go#5478

@pellared
Member Author

pellared commented Jun 4, 2024

Leaving some thoughts on why I still think that requiring log record mutations to be visible in the next registered processors may be a performance and design footgun, and why it may be undesired in some languages.

The design juggles several priorities, including performance and conceptual symmetry with prior art in otel (the trace signal in particular).

We should answer what is more important:

Performance, or symmetry with tracing?
Symmetry with OTel tracing, or with the language's standard logging library?

A design featuring chaining or decorating processors might pave a clear path for a high-performance implementation, but it does so at the expense of conceptual symmetry.

In my opinion, there can be applications which emit "tons" of logs and for which performance is more important. Especially since having a design similar to tracing involves overhead that would not be possible to overcome. Some users are already complaining that tracing and metrics involve too many heap allocations (e.g. here), and adding allocations to logs would take the problem to the next level. Reference: open-telemetry/opentelemetry-go#5478

But there will be advanced users who want to step off that path, and they should have the ability to do things and take responsibility for those things.

More advanced users may want to squeeze out as much performance as possible.

From https://opentelemetry.io/community/mission/#we-value-_performance_:

OpenTelemetry users should not have to choose between high-quality telemetry and a performant application. High performance is a requirement for OpenTelemetry, and unexpected interference effects in the host application are unacceptable.

I think that in Go ecosystem having a design which allows zero-heap allocation logging (especially in the hot path) is important. Reference: https://go.dev/blog/slog#performance

A "decorator-based" design would be Go idiomatic as it would follow the design of the standard Go structures logging library: https://pkg.go.dev/log/slog#Handler. Creating pipelines for OTel Go Logs SDK and slog (Go standard library) would be very similar. Notice that it is possible to make a processor which affect others by making a "fan-out" processor.

My rust, c++, and go skills are pretty poor, but there may be creative language-specific ways to keep the door open to zero allocations while also staying within the spirit of the spec.

This is how I believe it can be done in Go: open-telemetry/opentelemetry-go#5470
However, I feel that this design is ugly. It gives the user several options for how to propagate changes to other processors (as well as how to isolate the changes).

For example, a processor might have a method indicating whether or not it mutates data. If none of the processors mutate data, or if there is only one processor registered, then the SDK could use an alternative implementation of ReadWriteLogRecord which avoids allocations.

In Go, representing a record as an interface (to allow several implementations) would already decrease the performance (slower calls and more heap allocations; more; I did some benchmarks months ago, but I hope you will just trust me).

To sum up, I lean towards favoring logging performance and symmetry with the structured logging package from the Go standard library over symmetry with the OTel tracing design.

@pellared pellared changed the title Log record mutations are visible in next registered processors Proposal: Log record mutations are visible in next registered processors Jun 4, 2024
@pellared pellared changed the title Proposal: Log record mutations are visible in next registered processors Log record mutations are visible in next registered processors Jun 4, 2024
@pellared
Member Author

pellared commented Jun 4, 2024

SIG meeting summary

We have a common agreement that for logs performance is extremely important, as we might expect more load than for e.g. tracing.

It looks like we are leaning towards changing the spec so that SDKs can require chaining processors for making mutations. Processors are responsible for calling the next processor in the chain, which allows the records to stay on the stack. This is also how the C++, Rust, and Go Logs SDKs are currently designed.
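
A sketch of how such a chain could be assembled in Go, with simplified hypothetical types: each processor receives the record by value, may modify its copy, and is responsible for invoking the next processor it was constructed with, so mutations flow downstream without the record needing to escape to the heap when nothing retains it.

```go
package sketch

import (
	"context"
	"fmt"
)

type Record struct{ Body string }

// Processor receives the record by value; there is no SDK-driven fan-out,
// each processor calls the next one in the chain itself.
type Processor interface {
	OnEmit(ctx context.Context, r Record)
}

type processorFunc func(ctx context.Context, r Record)

func (f processorFunc) OnEmit(ctx context.Context, r Record) { f(ctx, r) }

// middleware wraps a processor with an extra stage.
type middleware func(next Processor) Processor

// chain composes stages around a final processor (e.g. an exporter), with
// stages[0] outermost: chain(exp, a, b) yields a -> b -> exp.
func chain(final Processor, stages ...middleware) Processor {
	p := final
	for i := len(stages) - 1; i >= 0; i-- {
		p = stages[i](p)
	}
	return p
}

// withPrefix is an example mutating stage: it modifies its copy of the record
// and forwards it, so the change is visible to every later stage.
func withPrefix(prefix string) middleware {
	return func(next Processor) Processor {
		return processorFunc(func(ctx context.Context, r Record) {
			r.Body = prefix + r.Body
			next.OnEmit(ctx, r)
		})
	}
}

// printExporter stands in for an exporting processor at the end of the chain.
type printExporter struct{}

func (printExporter) OnEmit(_ context.Context, r Record) { fmt.Println(r.Body) }
```

Usage: `pipeline := chain(printExporter{}, withPrefix("svc: "))`; calling `pipeline.OnEmit(ctx, Record{Body: "hi"})` prints "svc: hi".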

@pellared
Member Author

pellared commented Jun 4, 2024

changing the spec so that SDKs can require chaining processors for making mutations.

I will be unavailable for about the next 3 weeks. I will try to make a PR once I am back.

Thanks a lot for your feedback and support.

@tigrannajaryan tigrannajaryan added triage:accepted:ready-with-sponsor Ready to be implemented and has a specification sponsor assigned and removed triage:deciding:tc-inbox Needs attention from the TC in order to move forward labels Jun 5, 2024
carlosalberto pushed a commit that referenced this issue Jun 27, 2024
## Why

Fixes #4010 (read the issue for more details)

The SDK SHOULD offer a way to do independent log processing. I find it
a missing feature that should be covered by the specification.

## What

This PR tries to achieve it in the following way:
- SDK SHOULD provide an isolating processor allowing independent log
processing pipelines. Source:
#4010 (comment)

This approach should solve the issue at least for Java, .NET,
Python, JavaScript, and Go. This proposal should not make any existing log
SDKs non-compliant with the specification.

I think that the same approach could also be adopted in tracing for span
processors (but I have not yet investigated it in depth).

## Remarks

I created a separate issue for specifying whether (by default) a change
to a record made in a processor is propagated to the next registered
processor. See
#4065
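
A rough Go sketch of the isolating-processor idea described above, using simplified hypothetical types (the PR text is the authoritative wording):

```go
package sketch

import "context"

type Record struct {
	Body       string
	Attributes map[string]string
}

type Processor interface {
	OnEmit(ctx context.Context, r *Record)
}

// isolatingProcessor wraps another processor and hands it a private deep copy
// of each record, so mutations made inside the wrapped pipeline are not
// visible to any other registered processor.
type isolatingProcessor struct {
	wrapped Processor
}

func (p isolatingProcessor) OnEmit(ctx context.Context, r *Record) {
	clone := Record{
		Body:       r.Body,
		Attributes: make(map[string]string, len(r.Attributes)),
	}
	for k, v := range r.Attributes {
		clone.Attributes[k] = v
	}
	p.wrapped.OnEmit(ctx, &clone)
}
```
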
@austinlparker austinlparker moved this to Spec - Accepted in 🔭 Main Backlog Jul 16, 2024
@austinlparker austinlparker moved this from Spec - Accepted to Spec - In Progress in 🔭 Main Backlog Jul 16, 2024
@MrAlias MrAlias removed this from Go: Logs (GA) Aug 1, 2024
carlosalberto pushed a commit to carlosalberto/opentelemetry-specification that referenced this issue Oct 31, 2024
carlosalberto pushed a commit to carlosalberto/opentelemetry-specification that referenced this issue Oct 31, 2024