Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bytes / Byte (uint8_t) arrays support #780

Closed
maxgolov opened this issue Jul 15, 2020 · 11 comments
Closed

bytes / Byte (uint8_t) arrays support #780

maxgolov opened this issue Jul 15, 2020 · 11 comments
Labels
area:api Cross language API specification issue release:after-ga Not required before GA release, and not going to work on before GA spec:protocol Related to the specification/protocol directory spec:resource Related to the specification/resource directory spec:trace Related to the specification/trace directory triage:rejected:declined

Comments

@maxgolov
Copy link
Contributor

Current spec / protocol allows the following types:

bool
int32_t
int64_t
uint32_t
uint64_t
double
string (UTF-8, no UTF-16)

and

array of above primitive types.

Proposal is to add support for '8-bit byte array' event properties for Traces and Logs:

Practical reasons

  • existing / legacy telemetry flows could already allow support for byte arrays.

  • 8-bit byte array or bin type is available in MsgPack protocol.

  • same type known in protobuf as bytes

  • Most IoT protocols usually operate on 8-bit byte arrays or octet streams, allowing fields be either uint8_t or array / sequence of uint8_t, or the entire payload be 8-bit / octet stream.

  • UUID or GUID type is best represented on SDK API surface and on wire as 16-bytes rather than 36+-character string, resulting in at least x2 times better performance for native (C/C++) code, as well as more compact net-bytes on wire. Although it may be possible to represent it as array of int32_t - it'd be a bit unnatural to do it.

  • although it may be possible to convert byte array to base64-encoded string representation, this results in an extra encoding/decoding in SDK, elevated memory and CPU pressure, as well as type information (such as it's a binary blob, not a string) - is inherently lost.

Is it possible to add an optional provision to the protocol, that Platform SDKs that may require bin or octet stream or 8-bit byte array - may implement it on API surface?

Use-cases

Examples of exporters that would benefit from having byte array in the protocol spec:

  • ETW - Event Tracing for Windows supports both GUID (16-bytes) and other arbitrary Byte arrays.
  • Other telemetry systems previously using byte arrays flow for more compact binary data representation.

Not having this type natively supported by OpenTelemetry would impede the adoption of OT API / SDK by customers currently relying on octet / binary streams in their data flows. A reference implementation can be provided based on Open Telemetry C++ SDK to illustrate the use-cases, as well as perf implications of not having native support for byte arrays in API/SDK/proto/specification.

@maxgolov maxgolov changed the title Byte (uint8_t) arrays support bytes / Byte (uint8_t) arrays support Jul 16, 2020
@Oberon00
Copy link
Member

I also think that byte array support could be useful. One can use Base64 or hex encoded strings instead, but it does come with a performance cost.

@bogdandrutu bogdandrutu transferred this issue from open-telemetry/opentelemetry-proto Aug 11, 2020
@bogdandrutu bogdandrutu added area:api Cross language API specification issue spec:protocol Related to the specification/protocol directory spec:resource Related to the specification/resource directory spec:trace Related to the specification/trace directory release:after-ga Not required before GA release, and not going to work on before GA labels Aug 11, 2020
@maxgolov
Copy link
Contributor Author

maxgolov commented Nov 19, 2020

I would like to have conditional provisions for this type implemented in OpenTelemetry C++ SDK. Disabled by default. But still there. Have ability to experiment with it and enable it later on.

Practical reason: if we allow bytes support on API surface, then we can later on determine whether concrete exporter supports bytes or not. In those scenarios where exporter cannot handle byte arrays, the exporter itself may convert the byte array to base64-encoded string..

But if exporter supports byte arrays, e.g.

Then it can forward the binary blob using the most compact representation of it. This may become relevant for scenarios where app or service needs to emit a small memory dump, IoT memory dump, or maybe its own representation of encrypted or serialized data structure.

Example code how the value attribute could be defined in C++, where the need for binary data type is common:

using AttributeValue =
    nostd::variant<bool,
                   int32_t,
                   int64_t,
                   uint32_t,
                   uint64_t,
                   double,
                   nostd::string_view,
#ifdef HAVE_SPAN_BYTE
                   // TODO: 8-bit byte arrays / binary blobs are not part of OT spec yet!
                   // Ref: https://github.com/open-telemetry/opentelemetry-specification/issues/780
                   nostd::span<const uint8_t>,
#endif
                   nostd::span<const bool>,
                   nostd::span<const int32_t>,
                   nostd::span<const int64_t>,
                   nostd::span<const uint32_t>,
                   nostd::span<const uint64_t>,
                   nostd::span<const double>,
                   nostd::span<const nostd::string_view>>;

It is ironic that we have string_view today - which carries string type semantics. It is structured about the same, but carries string. And we have arrays of 32-bit and above... Okay. But no 8-bit byte arrays? :-/ Even Javascript technically has 8-bit byte arrays, as in Uint8Array.

@maxgolov
Copy link
Contributor Author

maxgolov commented Jan 21, 2021

@Oberon00 @bogdandrutu @jsuereth @pyohannes

One concern here - since we got this tagged after-ga : adding new type to C++ API surface is going to be ABI-breaking change. This will be significant hassle for us, in terms of backward- / forward- compat.

What I am proposing right now:

  • allow C++ SDK implement the incoming array of uint8_t bytes on API surface
  • for the current implementation of OTLP exporter - put bytes into array of ints (already supported)
  • for exporters that may care about byte arrays - allow them sending bytes as bytes, in the most compact binary representation. That'd be the case for MsgPack, fluentd-forward protocol, ETW.

What do you guys think?

@tigrannajaryan
Copy link
Member

I agree this can be useful for attributes and also for the Log Record Body.

A prior art exists too: binary is a supported data type in Jaeger: https://github.com/jaegertracing/jaeger-idl/blob/34396033ff11c60fced342ab2858ace278fedaa8/proto/api_v2/model.proto#L57 for Span, Log and Process.

This has come up a few times in the past and recently here.

@tigrannajaryan
Copy link
Member

Byte arrays (binary data) is also useful for log files where we don't want to interpret / don't know the encoding. See this feature request: open-telemetry/opentelemetry-collector-contrib#3267

tigrannajaryan added a commit to tigrannajaryan/opentelemetry-proto that referenced this issue Apr 28, 2021
Contributes to:
open-telemetry/opentelemetry-specification#780
open-telemetry/opentelemetry-collector-contrib#3267

Issue open-telemetry/opentelemetry-specification#780
nicely describes why the data type is needed. There are several use cases
for binary data, both for trace and log attributes and for log record Body.

This is a backward compatible addition. After this change is merged no senders
will initially exist that emit binary data. Nevertheless, if such data is received
by the Collector it will correctly pass such data intact through the pipeline
when receiving/sending OTLP (no Collector code changes are needed for this).

We do not yet have binary data type in the OpenTelemetry API, so no existing
sources can emit it yet.

The receivers that do not understand the binary data type should also continue
functioning normally. Collector's current implementation treats any unknown
data type as NULL (and this would apply to binary data type until we teach
the Collector to understand binary specifically). I checked the Collector source
code and this should not result in crashes or overly surprising behavior
(NULL is either ignored or treated as an "unknown" branch in the code which
does not care about it).

We will add full support for binary data to the Collector, particularly to
support translating it correctly to other formats (e.g. Jaeger, which supports
binary type natively).

Note: the addition of this data type to the protocol is not an obligation to
expose the data type in the Attributes API. OTLP has always been a superset of
what is possible to express via the API. The addition of the data type in the
Attributes API should be discussed separately in the specification repo.
tigrannajaryan added a commit to open-telemetry/opentelemetry-proto that referenced this issue May 5, 2021
Contributes to:
open-telemetry/opentelemetry-specification#780
open-telemetry/opentelemetry-collector-contrib#3267

Issue open-telemetry/opentelemetry-specification#780
nicely describes why the data type is needed. There are several use cases
for binary data, both for trace and log attributes and for log record Body.

This is a backward compatible addition. After this change is merged no senders
will initially exist that emit binary data. Nevertheless, if such data is received
by the Collector it will correctly pass such data intact through the pipeline
when receiving/sending OTLP (no Collector code changes are needed for this).

We do not yet have binary data type in the OpenTelemetry API, so no existing
sources can emit it yet.

The receivers that do not understand the binary data type should also continue
functioning normally. Collector's current implementation treats any unknown
data type as NULL (and this would apply to binary data type until we teach
the Collector to understand binary specifically). I checked the Collector source
code and this should not result in crashes or overly surprising behavior
(NULL is either ignored or treated as an "unknown" branch in the code which
does not care about it).

We will add full support for binary data to the Collector, particularly to
support translating it correctly to other formats (e.g. Jaeger, which supports
binary type natively).

Note: the addition of this data type to the protocol is not an obligation to
expose the data type in the Attributes API. OTLP has always been a superset of
what is possible to express via the API. The addition of the data type in the
Attributes API should be discussed separately in the specification repo.
@maxgolov
Copy link
Contributor Author

maxgolov commented May 6, 2021

@tigrannajaryan - thank you for your work! We already have a rough-in implementation in C++ SDK.
@open-telemetry/cpp-approvers

This feature also potentially enables us to reuse some of the existing APIs, or at least existing typing system for a feature like small (reasonably) -sized blob upload, expressing these as byte buffers.

tigrannajaryan added a commit to tigrannajaryan/opentelemetry-specification that referenced this issue May 7, 2021
OTLP now supports bytes as a data type for Any value.
Byte arrays are an important case for log data.

Contributes to open-telemetry#780

TODO: consider adding bytes to Trace API too.
tigrannajaryan added a commit to tigrannajaryan/opentelemetry-specification that referenced this issue May 7, 2021
Contributes to open-telemetry#780

OTLP already supports bytes as a data type for Any value, see:
https://github.com/open-telemetry/opentelemetry-proto/blob/de4fc37940d39370194fb774e634ca408dacd865/opentelemetry/proto/common/v1/common.proto#L37

Byte arrays are an important case for unparsed, unstrusctured log data, so we are
formally adding them as a supported data type to the Log Data Model.

TODO: consider adding bytes to Trace API too.
tigrannajaryan added a commit that referenced this issue May 11, 2021
Contributes to #780

OTLP already supports bytes as a data type for Any value, see:
https://github.com/open-telemetry/opentelemetry-proto/blob/de4fc37940d39370194fb774e634ca408dacd865/opentelemetry/proto/common/v1/common.proto#L37

Byte arrays are an important case for unparsed, unstrusctured log data, so we are
formally adding them as a supported data type to the Log Data Model.

TODO: consider adding bytes to Trace API too.
@maxgolov
Copy link
Contributor Author

@tigrannajaryan - My understanding is we should be able to close this issue, since the protocol work for that has been complete?

@cwildman
Copy link

@tigrannajaryan can we add the bytes type to the list of primitive types supported in the spec here?

@tigrannajaryan
Copy link
Member

@tigrannajaryan can we add the bytes type to the list of primitive types supported in the spec here?

bytes data type was added to logs data model only. The attributes in the spec apply to all signals.

We have 2 possibilities:

  1. Add an exception to that common attributes page to tell bytes are supported for logs but not for traces and metrics. This is probably the easiest short-term.
  2. Decide that bytes are also valid type for traces/metrics. This may be the right long-term approach.

@bogdandrutu
Copy link
Member

Would be also useful to determine how to convert this to "string", should we use base64?

@austinlparker
Copy link
Member

Per https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/common/README.md#standard-attribute, we are not planning on extending the standard attribute types.

Our suggestion would be to use the forthcoming Events API/Data Model as a way to emit this information.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:api Cross language API specification issue release:after-ga Not required before GA release, and not going to work on before GA spec:protocol Related to the specification/protocol directory spec:resource Related to the specification/resource directory spec:trace Related to the specification/trace directory triage:rejected:declined
Projects
None yet
Development

No branches or pull requests

6 participants