Support byte stream log collection #3267

Closed
tigrannajaryan opened this issue Apr 28, 2021 · 7 comments

tigrannajaryan (Member) commented Apr 28, 2021

Logs are currently assumed to be UTF-8 text, which is split into records using some delimiter pattern. In some cases it is useful to treat log files simply as byte streams with unspecified encoding. It would be useful to add this capability.

To enable it we need the following:

  1. Support binary data type (byte slice). It needs to be added to OTLP as one of the data types for AnyValue.
  2. For filelog, when encoding=nop, populate the Body with the binary data type instead of the string data type (which, per OTLP's Protobuf requirements, is limited to valid UTF-8 strings). A sketch follows after this list.
  3. Allow disabling the splitting of log entries (via the multiline setting?). Split into chunks honouring the max_log_size setting to limit the size of each individual record.
  4. Optionally: store the starting position of the chunk in the stream so that the stream can be reconstructed correctly if log records are delivered to the final destination out of order.
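
For illustration only, a minimal sketch of what requirements 2 and 4 could look like. This assumes the bytes-valued Body from requirement 1 exists and uses the go.opentelemetry.io/collector/pdata API as it exists today (it did not at the time of this issue); the log.file.offset attribute name is hypothetical:

```go
package main

import (
	"fmt"

	"go.opentelemetry.io/collector/pdata/plog"
)

// buildChunkRecord builds a log record whose Body is a raw byte slice
// (requirement 2) and which carries the chunk's starting position in the
// stream as an attribute (requirement 4).
func buildChunkRecord(chunk []byte, offset int64) plog.Logs {
	logs := plog.NewLogs()
	lr := logs.ResourceLogs().AppendEmpty().ScopeLogs().AppendEmpty().LogRecords().AppendEmpty()
	lr.Body().SetEmptyBytes().FromRaw(chunk)          // bytes, not a UTF-8 string
	lr.Attributes().PutInt("log.file.offset", offset) // hypothetical attribute name
	return logs
}

func main() {
	logs := buildChunkRecord([]byte{0xFF, 0xFE, 0x00, 0x01}, 4096)
	fmt.Println(logs.LogRecordCount()) // 1
}
```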
tigrannajaryan added this to the Logs Support, Phase 2 milestone Apr 28, 2021
tigrannajaryan (Member, Author) commented

@djaglowski what do you think, are requirements 2 and 3 possible to implement?

tigrannajaryan added a commit to tigrannajaryan/opentelemetry-proto that referenced this issue Apr 28, 2021
Contributes to:
open-telemetry/opentelemetry-specification#780
open-telemetry/opentelemetry-collector-contrib#3267

Issue open-telemetry/opentelemetry-specification#780
nicely describes why the data type is needed. There are several use cases
for binary data, both for trace and log attributes and for log record Body.

This is a backward compatible addition. After this change is merged no senders
will initially exist that emit binary data. Nevertheless, if such data is received
by the Collector it will correctly pass such data intact through the pipeline
when receiving/sending OTLP (no Collector code changes are needed for this).

We do not yet have binary data type in the OpenTelemetry API, so no existing
sources can emit it yet.

The receivers that do not understand the binary data type should also continue
functioning normally. Collector's current implementation treats any unknown
data type as NULL (and this would apply to binary data type until we teach
the Collector to understand binary specifically). I checked the Collector source
code and this should not result in crashes or overly surprising behavior
(NULL is either ignored or treated as an "unknown" branch in the code which
does not care about it).

We will add full support for binary data to the Collector, particularly to
support translating it correctly to other formats (e.g. Jaeger, which supports
binary type natively).

Note: the addition of this data type to the protocol is not an obligation to
expose the data type in the Attributes API. OTLP has always been a superset of
what is possible to express via the API. The addition of the data type in the
Attributes API should be discussed separately in the specification repo.
djaglowski (Member) commented

@tigrannajaryan This should be possible.

  1. Support binary data type (byte slice). It needs to be added to OTLP as one of the data types for AnyValue.

👍 entry.Entry.Body is of type interface{}, so it will be ready once OTLP is updated.

  2. For filelog, when encoding=nop, populate the Body with the binary data type instead of the string data type (which, per OTLP's Protobuf requirements, is limited to valid UTF-8 strings).

encoding=nop is currently the default, so we need to be clear on whether this will be an additive change, a change to default behavior, or even a removal of current behavior. I think the current default behavior (including newline splitting) is the best for new users because it yields easily understandable results with minimum configuration. That said, it all depends on how they are likely to view the logs downstream, and whether those logs are rendered in a human readable fashion.

My initial thought here is that we should leave encoding=nop alone, and add a new setting (e.g. encoding=raw or encoding=bytes) for this use case.

  3. Allow disabling the splitting of log entries (via the multiline setting?). Split into chunks honouring the max_log_size setting to limit the size of each individual record.

I like the max_log_size idea. There is at least one additional consideration here, which is what the expected behavior would be if the final bytes of a file do not reach max_log_size. As a starting point, I think we should just consume any available bytes. However, this may be a little too greedy in the common case where the log is still active.

In any case, we'll need to implement a new bufio.SplitFunc, but this should not be too difficult. Several custom implementations are already present in line_splitter.go. A sketch follows at the end of this comment.

  4. Optionally: store the starting position of the chunk in the stream so that the stream can be reconstructed correctly if log records are delivered to the final destination out of order.

Makes sense. Should be easy enough - probably just add an offset attribute.
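
For what it's worth, here is a self-contained sketch of such a SplitFunc, including the offset tracking just mentioned. This is stdlib-only and is not the log-collection library's implementation; the greedy consume-remaining-bytes-at-EOF behavior is just the starting point discussed above:

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// chunkSplitFunc returns a bufio.SplitFunc that performs no delimiter-based
// splitting: it emits fixed-size chunks of at most maxLogSize bytes, and at
// EOF greedily consumes whatever bytes remain.
func chunkSplitFunc(maxLogSize int) bufio.SplitFunc {
	return func(data []byte, atEOF bool) (advance int, token []byte, err error) {
		if len(data) >= maxLogSize {
			return maxLogSize, data[:maxLogSize], nil
		}
		if atEOF && len(data) > 0 {
			return len(data), data, nil
		}
		return 0, nil, nil // wait for more data
	}
}

func main() {
	scanner := bufio.NewScanner(strings.NewReader("0123456789abcdef!"))
	scanner.Split(chunkSplitFunc(8))

	var offset int64 // starting position of the next chunk, per the offset attribute idea
	for scanner.Scan() {
		chunk := scanner.Bytes()
		fmt.Printf("offset=%d size=%d %q\n", offset, len(chunk), chunk)
		offset += int64(len(chunk))
	}
	// Output:
	// offset=0 size=8 "01234567"
	// offset=8 size=8 "89abcdef"
	// offset=16 size=1 "!"
}
```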

tigrannajaryan (Member, Author) commented

@djaglowski Great, thanks for validating the idea.

encoding=nop is currently the default, so we need to be clear on whether this will be an additive change, a change to default behavior, or even a removal of current behavior.

What happens currently when encoding=nop and the file contains a byte sequence that is not valid UTF-8? Do we just put the bytes as-is in the Body's StringValue? If so, I think that is not a correct approach, since Protobuf requires strings to be UTF-8. So we may need to rethink this anyway. Perhaps make UTF-8 the default? What does encoding=utf-8 do? Does it attempt any validation?

tigrannajaryan added a commit to open-telemetry/opentelemetry-proto that referenced this issue May 5, 2021
maxgolov commented May 6, 2021

I like the proposal of encoding=raw or encoding=bytes. Would there be a practical limit on how big the data can be, i.e. does max_log_size impose an upper boundary? Can this boundary be at least 64KB, and preferably in the megabytes range? For bigger blobs, clients should probably use some form of alternate blob storage / blob collector...

tigrannajaryan (Member, Author) commented

@maxgolov the current default for max_log_size is 1MiB for filelog: https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/filelogreceiver

The user can set a different value. For byte stream collection this will essentially define the maximum size of each chunk. The aim should be a size that is not so small that there is significant per-chunk overhead, and not so big that it consumes lots of memory. 64KiB sounds reasonable; 1MiB may be too much and consume too much memory if a large number of incoming streams are handled simultaneously.
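
For a rough sense of scale (illustrative arithmetic, not a measurement): with 10,000 concurrently tailed streams and one 1MiB chunk buffered per stream, that is on the order of 10GiB of buffer memory, versus roughly 640MiB at 64KiB per chunk.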

djaglowski (Member) commented

What happens currently when encoding=nop and the file contains a byte sequence that is not valid UTF-8? Do we just put the bytes as-is in the Body's StringValue? If so, I think that is not a correct approach, since Protobuf requires strings to be UTF-8.

@tigrannajaryan, it appears that you are exactly correct on this, and that the current behavior of encoding=nop is invalid. From source:

// Nop is the nop encoding. Its transformed bytes are the same as the source
// bytes; it does not replace invalid UTF-8 sequences.

Unsurprisingly, encoding=utf8 replaces invalid bytes with the Unicode Replacement Character (\uFFFD). Since this is the minimal possible alteration of the byte sequence that makes it a valid string, I agree this should be the default. (In fact, I would like to change this as soon as I can run some validation experiments.)
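
For reference, that replacement behavior can be approximated with the standard library alone. This is a sketch, not what the library does internally; note also that bytes.ToValidUTF8 replaces each run of invalid bytes with a single replacement character, whereas a decoder may replace byte by byte:

```go
package main

import (
	"bytes"
	"fmt"
)

func main() {
	// 0xFF can never appear in valid UTF-8.
	raw := []byte("valid\xffbytes")

	// Replace invalid sequences with the Unicode Replacement Character
	// (\uFFFD), the minimal alteration that yields a valid string.
	clean := bytes.ToValidUTF8(raw, []byte("\uFFFD"))

	fmt.Println(string(clean)) // valid�bytes
}
```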

With that, I think your original proposal details are just about spot on. A couple more points of clarification:

Allow disabling the splitting of log entries (via the multiline setting?)

I believe the multiline setting should just be ignored altogether when encoding=[nop|raw|bytes]. We can document this expectation, but I think it's intuitive as long as the user understands the purpose is to read raw bytes.

the current default for max_log_size is 1MiB ... The aim should be a size that is not so small that there is significant per-chunk overhead, and not so big that it consumes lots of memory. 64KiB sounds reasonable; 1MiB may be too much and consume too much memory if a large number of incoming streams are handled simultaneously.

If the nop / !nop use cases are different enough to warrant it, we can have the default max_log_size depend on the encoding:

  • !nop => 1MiB
  • nop => 64KiB
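
If it does come to that, the selection logic is trivial. A hypothetical sketch; the constant names, values, and the exact encoding spellings are placeholders, not the receiver's actual configuration code:

```go
package main

import "fmt"

// defaultMaxLogSize picks a default chunk size from the encoding, mirroring
// the table above: 1MiB for delimited text, 64KiB for raw byte chunks.
func defaultMaxLogSize(encoding string) int {
	const (
		defaultTextSize = 1024 * 1024 // !nop => 1MiB
		defaultByteSize = 64 * 1024   // nop  => 64KiB
	)
	if encoding == "nop" {
		return defaultByteSize
	}
	return defaultTextSize
}

func main() {
	fmt.Println(defaultMaxLogSize("nop"), defaultMaxLogSize("utf8"))
}
```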

djaglowski (Member) commented

This functionality has now been added to the log-collection library, and will be included in the next release.
