Skip to content
This repository has been archived by the owner on May 25, 2022. It is now read-only.

Default encoding should be utf8. #133

Closed
djaglowski opened this issue May 11, 2021 · 3 comments · Fixed by #147
Closed

Default encoding should be utf8. #133

djaglowski opened this issue May 11, 2021 · 3 comments · Fixed by #147
Assignees

Comments

@djaglowski
Copy link
Member

What happens currently when encoding=nop and if the file contains a byte sequence that is not a valid UTF8? Do we just put the bytes as is in the Body's StringValue? If so I think it is not a correct approach since Protobuf requires strings to be UTF8.

@tigrannajaryan, it appears that you are exactly correct on this, and that the current behavior of encoding=nop is invalid. From source:

// Nop is the nop encoding. Its transformed bytes are the same as the source
// bytes; it does not replace invalid UTF-8 sequences.

Unsurprisingly, encoding=utf8 replaces invalid bytes with the Unicode Replacement Character (\uFFFD). Since this is the minimal possible alteration of the byte sequence that makes it a valid string, I agree this should be the default.

Originally posted by @djaglowski in open-telemetry/opentelemetry-collector-contrib#3267 (comment)

@tigrannajaryan
Copy link
Member

@djaglowski do you expect this to have any measurable performance implications (slowdown) compared to the current state? It would be good to benchmark if we expect any change.

@djaglowski
Copy link
Member Author

@tigrannajaryan, it's possible we could see a performance impact.

As a close approximation, I've opened a PR against the original stanza repo, which triggers end-to-end performance benchmarks. This runs a fleet of VMs, each with a different throughput scenario, but all using file_input with the default encoding setting.

The resulting performance diff (vs master) is reported here. The results appear to indicate a negligible change in performance.

We should also be able to add code level Benchmarks to this repo. @Mrod1598, let's discuss adding benchmarks when you are ready to work on this issue.

@djaglowski djaglowski self-assigned this May 17, 2021
@tigrannajaryan
Copy link
Member

The resulting performance diff (vs master) is reported here. The results appear to indicate a negligible change in performance.

This is good to know, thanks for testing.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants