Support byte stream log collection #3267

Closed
tigrannajaryan opened this issue Apr 28, 2021 · 7 comments

tigrannajaryan (Member) commented Apr 28, 2021

Logs are currently assumed to be UTF-8 text, which is split into records using some delimiter pattern. In some cases it is useful to treat log files simply as byte streams with unspecified encoding. It would be useful to add this capability.

To enable it we need the following:

  1. Support binary data type (byte slice). It needs to be added to OTLP as one of the data types for AnyValue.
  2. For filelog, when encoding=nop, populate the Body with the binary data type instead of the string data type (which, per OTLP's Protobuf requirements, is limited to valid UTF-8 strings). A sketch follows after this list.
  3. Allow disabling the splitting of log entries (via the multiline setting?). Split into chunks honouring the max_log_size setting to limit the size of each individual record.
  4. Optionally: store the starting position of the chunk in the stream so that the stream can be reconstructed correctly if log records are delivered to the final destination out of order.
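
For illustration only, a minimal sketch of what requirements 2 and 4 could look like. This assumes the bytes-valued Body from requirement 1 exists and uses the go.opentelemetry.io/collector/pdata API as it exists today (it did not at the time of this issue); the log.file.offset attribute name is hypothetical:

```go
package main

import (
	"fmt"

	"go.opentelemetry.io/collector/pdata/plog"
)

// buildChunkRecord builds a log record whose Body is a raw byte slice
// (requirement 2) and which carries the chunk's starting position in the
// stream as an attribute (requirement 4).
func buildChunkRecord(chunk []byte, offset int64) plog.Logs {
	logs := plog.NewLogs()
	lr := logs.ResourceLogs().AppendEmpty().ScopeLogs().AppendEmpty().LogRecords().AppendEmpty()
	lr.Body().SetEmptyBytes().FromRaw(chunk)          // bytes, not a UTF-8 string
	lr.Attributes().PutInt("log.file.offset", offset) // hypothetical attribute name
	return logs
}

func main() {
	logs := buildChunkRecord([]byte{0xFF, 0xFE, 0x00, 0x01}, 4096)
	fmt.Println(logs.LogRecordCount()) // 1
}
```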
tigrannajaryan added this to the Logs Support, Phase 2 milestone Apr 28, 2021
tigrannajaryan (Member, Author) commented

@djaglowski what do you think, are requirements 2 and 3 possible to implement?

tigrannajaryan added a commit to tigrannajaryan/opentelemetry-proto that referenced this issue Apr 28, 2021
Contributes to:
open-telemetry/opentelemetry-specification#780
open-telemetry/opentelemetry-collector-contrib#3267

Issue open-telemetry/opentelemetry-specification#780
nicely describes why the data type is needed. There are several use cases
for binary data, both for trace and log attributes and for log record Body.

This is a backward compatible addition. After this change is merged no senders
will initially exist that emit binary data. Nevertheless, if such data is received
by the Collector it will correctly pass such data intact through the pipeline
when receiving/sending OTLP (no Collector code changes are needed for this).

We do not yet have binary data type in the OpenTelemetry API, so no existing
sources can emit it yet.

The receivers that do not understand the binary data type should also continue
functioning normally. Collector's current implementation treats any unknown
data type as NULL (and this would apply to binary data type until we teach
the Collector to understand binary specifically). I checked the Collector source
code and this should not result in crashes or overly surprising behavior
(NULL is either ignored or treated as an "unknown" branch in the code which
does not care about it).

We will add full support for binary data to the Collector, particularly to
support translating it correctly to other formats (e.g. Jaeger, which supports
binary type natively).

Note: the addition of this data type to the protocol is not an obligation to
expose the data type in the Attributes API. OTLP has always been a superset of
what is possible to express via the API. The addition of the data type in the
Attributes API should be discussed separately in the specification repo.
djaglowski (Member) commented

@tigrannajaryan This should be possible.

  1. Support binary data type (byte slice). It needs to be added to OTLP as one of the data types for AnyValue.

👍 entry.Entry.Body is of type interface{}, so it will be ready once OTLP is updated.

  2. For filelog, when encoding=nop, populate the Body with the binary data type instead of the string data type (which, per OTLP's Protobuf requirements, is limited to valid UTF-8 strings).

encoding=nop is currently the default, so we need to be clear on whether this will be an additive change, a change to default behavior, or even a removal of current behavior. I think the current default behavior (including newline splitting) is the best for new users because it yields easily understandable results with minimum configuration. That said, it all depends on how they are likely to view the logs downstream, and whether those logs are rendered in a human readable fashion.

My initial thought here is that we should leave encoding=nop alone, and add a new setting (e.g. encoding=raw or encoding=bytes) for this use case.

  3. Allow disabling the splitting of log entries (via the multiline setting?). Split into chunks honouring the max_log_size setting to limit the size of each individual record.

I like the max_log_size idea. There is at least one additional consideration here, which is what the expected behavior would be if the final bytes of a file do not reach max_log_size. As a starting point, I think we should just consume any available bytes. However, this may be a little too greedy in the common case where the log is still active.

In any case, we'll need to implement a new bufio.SplitFunc, but this should not be too difficult. Several custom implementations are already present in line_splitter.go. A sketch follows at the end of this comment.

  4. Optionally: store the starting position of the chunk in the stream so that the stream can be reconstructed correctly if log records are delivered to the final destination out of order.

Makes sense. Should be easy enough - probably just add an offset attribute.
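
For what it's worth, here is a self-contained sketch of such a SplitFunc, including the offset tracking just mentioned. This is stdlib-only and is not the log-collection library's implementation; the greedy consume-remaining-bytes-at-EOF behavior is just the starting point discussed above:

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// chunkSplitFunc returns a bufio.SplitFunc that performs no delimiter-based
// splitting: it emits fixed-size chunks of at most maxLogSize bytes, and at
// EOF greedily consumes whatever bytes remain.
func chunkSplitFunc(maxLogSize int) bufio.SplitFunc {
	return func(data []byte, atEOF bool) (advance int, token []byte, err error) {
		if len(data) >= maxLogSize {
			return maxLogSize, data[:maxLogSize], nil
		}
		if atEOF && len(data) > 0 {
			return len(data), data, nil
		}
		return 0, nil, nil // wait for more data
	}
}

func main() {
	scanner := bufio.NewScanner(strings.NewReader("0123456789abcdef!"))
	scanner.Split(chunkSplitFunc(8))

	var offset int64 // starting position of the next chunk, per the offset attribute idea
	for scanner.Scan() {
		chunk := scanner.Bytes()
		fmt.Printf("offset=%d size=%d %q\n", offset, len(chunk), chunk)
		offset += int64(len(chunk))
	}
	// Output:
	// offset=0 size=8 "01234567"
	// offset=8 size=8 "89abcdef"
	// offset=16 size=1 "!"
}
```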

tigrannajaryan (Member, Author) commented

@djaglowski Great, thanks for validating the idea.

encoding=nop is currently the default, so we need to be clear on whether this will be an additive change, a change to default behavior, or even a removal of current behavior.

What happens currently when encoding=nop and the file contains a byte sequence that is not valid UTF-8? Do we just put the bytes as-is in the Body's StringValue? If so, I think that is not a correct approach, since Protobuf requires strings to be UTF-8. So we may need to rethink this anyway. Perhaps make UTF-8 the default? What does encoding=utf-8 do? Does it attempt any validation?

tigrannajaryan added a commit to open-telemetry/opentelemetry-proto that referenced this issue May 5, 2021
maxgolov commented May 6, 2021

I like the proposal of encoding=raw or encoding=bytes. Would there be a practical limit on how big the data can be, i.e. does max_log_size impose an upper boundary? Can this boundary be at least 64KB, and preferably in the megabytes range? For bigger blobs, clients should probably use some form of alternate blob storage / blob collector...

tigrannajaryan (Member, Author) commented

@maxgolov the current default for max_log_size is 1MiB for filelog: https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/filelogreceiver

The user can set a different value. For byte stream collection this will essentially define the maximum size of each chunk. The aim should be a size that is not so small that there is significant per-chunk overhead, and not so big that it consumes lots of memory. 64KiB sounds reasonable; 1MiB may be too much and consume too much memory if a large number of incoming streams are handled simultaneously.
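
For a rough sense of scale (illustrative arithmetic, not a measurement): with 10,000 concurrently tailed streams and one 1MiB chunk buffered per stream, that is on the order of 10GiB of buffer memory, versus roughly 640MiB at 64KiB per chunk.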

djaglowski (Member) commented

What happens currently when encoding=nop and the file contains a byte sequence that is not valid UTF-8? Do we just put the bytes as-is in the Body's StringValue? If so, I think that is not a correct approach, since Protobuf requires strings to be UTF-8.

@tigrannajaryan, it appears that you are exactly correct on this, and that the current behavior of encoding=nop is invalid. From source:

// Nop is the nop encoding. Its transformed bytes are the same as the source
// bytes; it does not replace invalid UTF-8 sequences.

Unsurprisingly, encoding=utf8 replaces invalid bytes with the Unicode Replacement Character (\uFFFD). Since this is the minimal possible alteration of the byte sequence that makes it a valid string, I agree this should be the default. (In fact, I would like to change this as soon as I can run some validation experiments.)
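
For reference, that replacement behavior can be approximated with the standard library alone. This is a sketch, not what the library does internally; note also that bytes.ToValidUTF8 replaces each run of invalid bytes with a single replacement character, whereas a decoder may replace byte by byte:

```go
package main

import (
	"bytes"
	"fmt"
)

func main() {
	// 0xFF can never appear in valid UTF-8.
	raw := []byte("valid\xffbytes")

	// Replace invalid sequences with the Unicode Replacement Character
	// (\uFFFD), the minimal alteration that yields a valid string.
	clean := bytes.ToValidUTF8(raw, []byte("\uFFFD"))

	fmt.Println(string(clean)) // valid�bytes
}
```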

With that, I think your original proposal details are just about spot on. A couple more points of clarification:

Allow disabling the splitting of log entries (via the multiline setting?)

I believe the multiline setting should just be ignored altogether when encoding=[nop|raw|bytes]. We can document this expectation, but I think it's intuitive as long as the user understands the purpose is to read raw bytes.

the current default for max_log_size is 1MiB ... The aim should be a size that is not so small that there is significant per-chunk overhead, and not so big that it consumes lots of memory. 64KiB sounds reasonable; 1MiB may be too much and consume too much memory if a large number of incoming streams are handled simultaneously.

If the nop / !nop use cases are different enough to warrant it, we can have the default max_log_size depend on the encoding:

  • !nop => 1MiB
  • nop => 64KiB
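
If it does come to that, the selection logic is trivial. A hypothetical sketch; the constant names, values, and the exact encoding spellings are placeholders, not the receiver's actual configuration code:

```go
package main

import "fmt"

// defaultMaxLogSize picks a default chunk size from the encoding, mirroring
// the table above: 1MiB for delimited text, 64KiB for raw byte chunks.
func defaultMaxLogSize(encoding string) int {
	const (
		defaultTextSize = 1024 * 1024 // !nop => 1MiB
		defaultByteSize = 64 * 1024   // nop  => 64KiB
	)
	if encoding == "nop" {
		return defaultByteSize
	}
	return defaultTextSize
}

func main() {
	fmt.Println(defaultMaxLogSize("nop"), defaultMaxLogSize("utf8"))
}
```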

djaglowski (Member) commented

This functionality has now been added to the log-collection library, and will be included in the next release.
