
Improved and batched logs translation #2694

Conversation

@pmalek-sumo (Contributor)

Description: Introduce an aggregation layer to internal/stanza that translates entry.Entry into pdata.Logs aggregating logs coming from the same Resource into one entry.

Link to tracking Issue: #2330

Testing: unit tests added
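For context, here is a minimal, self-contained sketch of the grouping idea described above. It does not use the actual internal/stanza or pdata APIs; the Entry type, resourceKey, and groupByResource below are simplified stand-ins for illustration only.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Entry is a simplified stand-in for entry.Entry: a log record plus the
// Resource (e.g. file name, host) it originated from.
type Entry struct {
	Resource map[string]string
	Body     string
}

// resourceKey builds a deterministic key from the Resource attributes so
// that entries coming from the same Resource land in the same group.
func resourceKey(res map[string]string) string {
	kv := make([]string, 0, len(res))
	for k, v := range res {
		kv = append(kv, k+"="+v)
	}
	sort.Strings(kv)
	return strings.Join(kv, ",")
}

// groupByResource aggregates entries from the same Resource into one group,
// mirroring in spirit the aggregation layer this PR introduces.
func groupByResource(entries []Entry) map[string][]Entry {
	groups := make(map[string][]Entry)
	for _, e := range entries {
		k := resourceKey(e.Resource)
		groups[k] = append(groups[k], e)
	}
	return groups
}

func main() {
	entries := []Entry{
		{Resource: map[string]string{"file.name": "a.log"}, Body: "line 1"},
		{Resource: map[string]string{"file.name": "b.log"}, Body: "line 2"},
		{Resource: map[string]string{"file.name": "a.log"}, Body: "line 3"},
	}
	for key, group := range groupByResource(entries) {
		fmt.Printf("%s -> %d entries\n", key, len(group))
	}
}
```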

@pmalek-sumo pmalek-sumo force-pushed the issue-2330-improved-logs-translation branch 3 times, most recently from 815cbed to 7f2e977 Compare March 15, 2021 14:51
@pmalek-sumo pmalek-sumo force-pushed the issue-2330-improved-logs-translation branch from 7f2e977 to 1397787 Compare March 15, 2021 16:55
@djaglowski (Member) left a comment


Overall this looks great - a few comments & questions though

internal/stanza/converter.go (resolved)
internal/stanza/config.go (outdated, resolved)
internal/stanza/converter.go (outdated, resolved)
internal/stanza/converter_test.go (resolved)
receiver/filelogreceiver/filelog_test.go (outdated, resolved)
@tigrannajaryan (Member)

@pmalek-sumo @djaglowski I haven't looked into this PR's details, so my comment may be wrong, but if I understand correctly, this PR attempts to group logs after they are produced by github.com/open-telemetry/opentelemetry-log-collection.

Can we instead change how they are produced? Presumably we collect a batch of log lines at once from each file that we tail. Can we send the batch as it is collected from the file instead of sending one log line at a time and then trying to batch later?

The benefits of the suggested approach would be:

  • No extra delays introduced due to the new batching logic.
  • Hopefully less performance impact.

@djaglowski (Member)

djaglowski commented Mar 17, 2021

@tigrannajaryan I think your suggestion may be a good idea to explore, but it would be a fairly substantial change.

Presumably we collect a batch of log lines at once from each file that we tail.

This presumption is not accurate. Log "lines" are processed one at a time, the exception being multiline logs, which are treated as a single log entry.

When it comes down to it, most operations supported by opentelemetry-log-collection must be performed on each log entry individually, and each subsequent operation is basically just a function on a call stack, so batching doesn't provide a lot of additional performance in many cases.

You certainly have a good point regarding the value of keeping related entries together to avoid having to re-aggregate them later. We can probably do this, but there are a lot of implications to think through, and I don't believe it is a good next step.

@pmalek-sumo's change does appear to have performance benefits in its own right, so I suggest we should try to adopt the changes at this level and remove them later if we can make them redundant.

@pmalek-sumo (Contributor, Author)

Can we send the batch as it is collected from the file instead of sending one log line at a time and then trying to batch later?

That could be an improvement (taking @djaglowski's comments into account), but this PR tries to batch logs per Resource, which, as I understand the concepts used in OTC, doesn't translate 1:1 to a tailed file, correct?

Hence the batching done with this approach is a bit different than what you're proposing @tigrannajaryan. Please correct me if I'm wrong.

@tigrannajaryan (Member)

This presumption is not accurate. Log "lines" are processed one at a time, the exception being multiline logs, which are treated as a single log entry.

Got it.

Hence the batching done with this approach is a bit different than what you're proposing @tigrannajaryan. Please correct me if I'm wrong.

Correct. I was suggesting that a batch of log lines read from a file at once be split, parsed, and delivered by the log-collection library as a slice of entry.LogEntry at once, instead of delivering them one by one.

Such a slice of LogEntrys would still be one pdata.Logs and would be emitted by the receiver using one ConsumeLogs call. However, you are right that these LogEntrys do not necessarily have the same Resource, so that's a different way to group them.

Grouping by the Resource is a different approach, and it is more expensive. I would want to know what the performance impact of this approach is: the serialization of the Resource and the lookup in the map have costs, and I am curious how big they are. Alternatively, we could simply group subsequent LogEntrys that have the same Resource, without any reordering. That's less expensive to do and may yield similar final results when reading at high speed from a small set of files.
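To make the trade-off concrete, here is a rough sketch of the cheaper alternative described above: merging only consecutive entries that share the same Resource, with the Resource already reduced to a string key. The record type and groupConsecutive function are hypothetical illustrations, not code from the library or this PR.

```go
package main

import "fmt"

// record pairs a log body with a precomputed Resource key (for example a
// serialized set of Resource attributes). How the key is derived is out of
// scope here.
type record struct {
	resourceKey string
	body        string
}

// groupConsecutive splits the input into runs of records that share the same
// Resource key, preserving the original order. Unlike a map-based grouping,
// it never reorders entries and only needs a string comparison with the
// previous key for each entry.
func groupConsecutive(in []record) [][]record {
	var out [][]record
	for i, r := range in {
		if i == 0 || r.resourceKey != in[i-1].resourceKey {
			out = append(out, []record{r})
			continue
		}
		out[len(out)-1] = append(out[len(out)-1], r)
	}
	return out
}

func main() {
	in := []record{
		{"file=a.log", "1"}, {"file=a.log", "2"},
		{"file=b.log", "3"},
		{"file=a.log", "4"}, // starts a new group: only consecutive entries are merged
	}
	for _, g := range groupConsecutive(in) {
		fmt.Println(g[0].resourceKey, "->", len(g), "entries")
	}
}
```

Because it only compares each entry's key to the previous one, this variant preserves input order and avoids per-entry map lookups, at the cost of producing multiple groups for the same Resource when entries interleave.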

@pmalek-sumo (Contributor, Author)

One more thing: what is the benefit of receiving logs grouped by Resource downstream? Are there particular performance optimizations that can be (or are) done in downstream consumers in such a scenario?

@pmalek-sumo pmalek-sumo force-pushed the issue-2330-improved-logs-translation branch 2 times, most recently from 1386330 to d813232 Compare March 18, 2021 17:49
@pmalek-sumo pmalek-sumo marked this pull request as ready for review March 18, 2021 17:51
@pmalek-sumo pmalek-sumo requested a review from a team March 18, 2021 17:51
@pmalek-sumo pmalek-sumo force-pushed the issue-2330-improved-logs-translation branch 4 times, most recently from 4053455 to e59915f Compare March 19, 2021 09:06
@djaglowski (Member)

@pmalek-sumo Any idea why CI is hanging on this PR?

Since the main purpose of this PR is to increase performance, can you please post before/after results for related testbed scenarios?

@pmalek-sumo (Contributor, Author)

Not sure what's up with that. The workflow seems to be failing, not hanging: https://app.circleci.com/pipelines/github/SumoLogic/opentelemetry-collector-contrib/1582/workflows/376f9b5f-6477-4f6e-bc77-2547379d66f3

The tests that seem to cause issues are the Windows tests: https://app.circleci.com/pipelines/github/SumoLogic/opentelemetry-collector-contrib/1582/workflows/376f9b5f-6477-4f6e-bc77-2547379d66f3/jobs/13227

--- FAIL: TestIntDataPointSliceAt (3.03s)
    --- FAIL: TestIntDataPointSliceAt/w/_2nd_rate_calculation (1.01s)
        datapoint_test.go:310: 
            	Error Trace:	datapoint_test.go:310
            	Error:      	Max difference between 1 and 0.9891196834817013 allowed is 0.01, but difference was 0.010880316518298683
            	Test:       	TestIntDataPointSliceAt/w/_2nd_rate_calculation
logEvents[0].Timestamp=5577006791947779410.
logEvents[1].Timestamp=8674665223082153551.
logEvents[2].Timestamp=6129484611666145821.
logEvents[3].Timestamp=4037200794235010051.
logEvents[4].Timestamp=3916589616287113937.
logEvents[5].Timestamp=6334824724549167320.
logEvents[6].Timestamp=605394647632969758.
logEvents[7].Timestamp=1443635317331776148.
logEvents[8].Timestamp=894385949183117216.
logEvents[9].Timestamp=2775422040480279449.
logEvents[0].Timestamp=605394647632969758, logEvents[1].Timestamp=894385949183117216.
logEvents[1].Timestamp=894385949183117216, logEvents[2].Timestamp=1443635317331776148.
logEvents[2].Timestamp=1443635317331776148, logEvents[3].Timestamp=2775422040480279449.
logEvents[3].Timestamp=2775422040480279449, logEvents[4].Timestamp=3916589616287113937.
logEvents[4].Timestamp=3916589616287113937, logEvents[5].Timestamp=4037200794235010051.
logEvents[5].Timestamp=4037200794235010051, logEvents[6].Timestamp=5577006791947779410.
logEvents[6].Timestamp=5577006791947779410, logEvents[7].Timestamp=6129484611666145821.
logEvents[7].Timestamp=6129484611666145821, logEvents[8].Timestamp=6334824724549167320.
logEvents[8].Timestamp=6334824724549167320, logEvents[9].Timestamp=8674665223082153551.
FAIL
FAIL	github.com/open-telemetry/opentelemetry-collector-contrib/exporter/awsemfexporter	16.711s
ok  	github.com/open-telemetry/opentelemetry-collector-contrib/exporter/awsemfexporter/handler	0.038s
FAIL
Exited with code exit status 1

Any guidelines on how to run the tests? Is there a particular config that should be used?

@pmalek-sumo (Contributor, Author)

Ok I guess I've figured it out.


baseline: 3bdf190

# Test PerformanceResults
Started: Fri, 19 Mar 2021 15:08:19 +0100

Test                                    |Result|Duration|CPU Avg%|CPU Max%|RAM Avg MiB|RAM Max MiB|Sent Items|Received Items|
----------------------------------------|------|-------:|-------:|-------:|----------:|----------:|---------:|-------------:|
Log10kDPS/OTLP                          |PASS  |     17s|    18.6|    16.3|         38|         49|    150000|        150000|
Log10kDPS/filelog                       |PASS  |     15s|    22.5|    23.0|         41|         49|    150000|        150000|
Log10kDPS/kubernetes_containers         |PASS  |     15s|    49.1|    50.3|         45|         54|    150000|        150000|
Log10kDPS/k8s_CRI-Containerd            |PASS  |     15s|    44.5|    44.7|         46|         55|    150000|        150000|
Log10kDPS/k8s_CRI-Containerd_no_attr_ops|PASS  |     15s|    39.6|    40.7|         45|         53|    150000|        150000|
Log10kDPS/CRI-Containerd                |PASS  |     15s|    24.6|    25.0|         44|         52|    150000|        150000|

Total duration: 94s

this PR at e59915f

first run:

# Test PerformanceResults
Started: Fri, 19 Mar 2021 14:49:30 +0100

Test                                    |Result|Duration|CPU Avg%|CPU Max%|RAM Avg MiB|RAM Max MiB|Sent Items|Received Items|
----------------------------------------|------|-------:|-------:|-------:|----------:|----------:|---------:|-------------:|
Log10kDPS/OTLP                          |PASS  |     17s|    17.3|    14.7|         39|         49|    149900|        149900|
Log10kDPS/filelog                       |PASS  |     15s|    17.6|    20.3|         40|         49|    150000|        150000|
Log10kDPS/kubernetes_containers         |PASS  |     15s|    46.5|    47.0|         44|         53|    150000|        150000|
Log10kDPS/k8s_CRI-Containerd            |PASS  |     15s|    41.7|    42.6|         44|         54|    150000|        150000|
Log10kDPS/k8s_CRI-Containerd_no_attr_ops|PASS  |     15s|    34.0|    34.3|         44|         54|    150000|        150000|
Log10kDPS/CRI-Containerd                |PASS  |     15s|    20.1|    20.7|         42|         51|    150000|        150000|

Total duration: 94s

second run:

# Test PerformanceResults
Started: Fri, 19 Mar 2021 15:14:46 +0100

Test                                    |Result|Duration|CPU Avg%|CPU Max%|RAM Avg MiB|RAM Max MiB|Sent Items|Received Items|
----------------------------------------|------|-------:|-------:|-------:|----------:|----------:|---------:|-------------:|
Log10kDPS/OTLP                          |PASS  |     17s|    19.5|    18.3|         40|         49|    149900|        149900|
Log10kDPS/filelog                       |PASS  |     15s|    18.6|    19.9|         40|         49|    150000|        150000|
Log10kDPS/kubernetes_containers         |PASS  |     15s|    47.5|    49.1|         44|         54|    150000|        150000|
Log10kDPS/k8s_CRI-Containerd            |PASS  |     15s|    42.2|    43.0|         45|         54|    150000|        150000|
Log10kDPS/k8s_CRI-Containerd_no_attr_ops|PASS  |     15s|    34.8|    36.0|         43|         52|    150000|        150000|
Log10kDPS/CRI-Containerd                |PASS  |     15s|    21.7|    22.6|         42|         51|    150000|        150000|

Total duration: 94s

@djaglowski (Member)

@pmalek-sumo Thanks for posting testbed results.

I've captured the flaky test issue here. I suppose we need to wait until it's resolved or reverted: #2792

@pmalek-sumo pmalek-sumo force-pushed the issue-2330-improved-logs-translation branch 2 times, most recently from 497c19a to 6e806f3 Compare March 24, 2021 16:04
@pmalek-sumo (Contributor, Author)

@djaglowski I've rebased this PR on top of main, as the fix for the flaky test (#2792) has already been merged.

Please take a look.

@djaglowski (Member) left a comment


LGTM

Not sure why the check isn't coming through, but it is passing:
[screenshot of the passing check]

Comment on lines 38 to 45
MaxFlushCount uint `mapstructure:"max_flush_count"`
FlushInterval time.Duration `mapstructure:"flush_interval"`
Member:

Please add comments to explain these settings and/or add description in README.md.
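For illustration, the requested documentation might take the shape sketched below. Only the two fields and their mapstructure tags come from the PR; the struct name and the comment wording are assumptions, not the PR's final text.

```go
package stanza

import "time"

// ConverterConfig is an illustrative struct name; in the PR these fields are
// part of the stanza receiver configuration.
type ConverterConfig struct {
	// MaxFlushCount is the maximum number of buffered entries; reaching it
	// triggers a flush regardless of the flush interval.
	MaxFlushCount uint `mapstructure:"max_flush_count"`

	// FlushInterval is how often buffered entries are flushed even when
	// MaxFlushCount has not been reached.
	FlushInterval time.Duration `mapstructure:"flush_interval"`
}
```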

internal/stanza/converter.go (resolved)
internal/stanza/converter.go (outdated, resolved)
c.dataMutex.RUnlock()
if count > 0 {
pLogs := c.convertBuffer()
go c.queueForFlush(pLogs)
Member:

It is not clear why queueForFlush needs to be in a separate goroutine.

@pmalek-sumo (Contributor, Author):

This was my attempt to not introduce backpressure for the caller of Batch().

We might take a different approach, introducing a configurable number of workers that would consume the logs queued for flushing in a round-robin fashion, but this would introduce backpressure to the caller whenever flushing stalls.
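As a rough illustration of that alternative, the sketch below uses a fixed number of workers draining a shared bounded queue (close in spirit to, though not strictly, round-robin). All names here (flushQueue, Enqueue, and so on) are hypothetical and not part of the converter; the point is that Enqueue blocks once the queue fills up, which is the backpressure mentioned above.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// flushQueue holds batches waiting to be flushed and the workers draining it.
type flushQueue struct {
	jobs chan string // stands in for pdata.Logs ready to be flushed
	wg   sync.WaitGroup
}

// newFlushQueue starts `workers` goroutines that drain a queue of the given
// capacity, calling flush for each batch.
func newFlushQueue(workers, capacity int, flush func(string)) *flushQueue {
	q := &flushQueue{jobs: make(chan string, capacity)}
	for i := 0; i < workers; i++ {
		q.wg.Add(1)
		go func() {
			defer q.wg.Done()
			for batch := range q.jobs {
				flush(batch)
			}
		}()
	}
	return q
}

// Enqueue blocks when the buffered channel is full (backpressure on the caller).
func (q *flushQueue) Enqueue(batch string) { q.jobs <- batch }

// Close stops accepting work and waits for in-flight flushes to finish.
func (q *flushQueue) Close() {
	close(q.jobs)
	q.wg.Wait()
}

func main() {
	q := newFlushQueue(2, 4, func(batch string) {
		time.Sleep(10 * time.Millisecond) // simulate a slow downstream consumer
		fmt.Println("flushed", batch)
	})
	for i := 0; i < 8; i++ {
		q.Enqueue(fmt.Sprintf("batch-%d", i)) // blocks once the queue fills up
	}
	q.Close()
}
```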

c.stopOnce.Do(func() {
close(c.stopChan)
c.wg.Wait()
close(c.pLogsChan)
Member:

Do we need to close this channel? How are external readers expected to drain any items pending in pLogsChan? You can't drain until you call Stop because more data may be arriving and you can't drain after you call Stop since the channel is now closed. Is this intentional?

@pmalek-sumo (Contributor, Author):

In the current implementation pLogsChan is unbuffered, so there won't be any logs left to drain from it.

My reasoning here was as follows (see the sketch below):

  • we close the stopChan, which signals the internal goroutines to return
  • we wait for them to return (to prevent producing dangling goroutines)
  • we close the pLogsChan to signal the consumers that they can stop consuming (gracefully stopping downstream consumers/pipelines)

If I missed anything or there are reasons to introduce buffering to the channel, please let me know.
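For readers following along, here is a minimal, self-contained sketch of that shutdown ordering. The converter type, its field names, and flushLoop below are simplified stand-ins, not the PR's actual implementation.

```go
package main

import (
	"fmt"
	"sync"
)

// converter is a stripped-down illustration of the shutdown sequence
// described above; it is not the PR's actual Converter type.
type converter struct {
	stopChan  chan struct{}
	pLogsChan chan string // stands in for a channel of pdata.Logs
	wg        sync.WaitGroup
	stopOnce  sync.Once
}

func newConverter() *converter {
	c := &converter{
		stopChan:  make(chan struct{}),
		pLogsChan: make(chan string), // unbuffered, as in the current implementation
	}
	c.wg.Add(1)
	go c.flushLoop()
	return c
}

// flushLoop emits batches to consumers until it is told to stop.
func (c *converter) flushLoop() {
	defer c.wg.Done()
	for {
		select {
		case <-c.stopChan:
			return // stopChan closed: the internal goroutine returns
		case c.pLogsChan <- "batch":
		}
	}
}

// Stop applies the ordering from the list above.
func (c *converter) Stop() {
	c.stopOnce.Do(func() {
		close(c.stopChan)  // 1. signal internal goroutines to return
		c.wg.Wait()        // 2. wait for them, so no goroutines are leaked
		close(c.pLogsChan) // 3. tell consumers they can stop reading
	})
}

func main() {
	c := newConverter()
	fmt.Println(<-c.pLogsChan) // a consumer receives a batch
	c.Stop()
	_, ok := <-c.pLogsChan // the closed channel signals graceful shutdown downstream
	fmt.Println("channel open:", ok)
}
```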

internal/stanza/converter.go (outdated, resolved)
internal/stanza/converter.go (outdated, resolved)
internal/stanza/converter.go (resolved)
internal/stanza/receiver.go (resolved)
internal/stanza/converter.go (outdated, resolved)
@pmalek-sumo pmalek-sumo force-pushed the issue-2330-improved-logs-translation branch from 6e806f3 to 45cfdf4 Compare March 25, 2021 20:13
@pmalek-sumo (Contributor, Author)

Thanks for your comments @tigrannajaryan !

I believe I've addressed most of them. I can add more comments on the design if needed.

There's one more thing I believe I'll need to address: the locking around checking whether a flush is needed (because the max count has been reached) and around the conversion and flush itself. With the current implementation, because of goroutine scheduling, the buffer might end up with a couple of log entries more than the configured max count.
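To illustrate the concern, the sketch below holds a single lock across the size check, the buffer swap, and the flush, so the buffer can never exceed the configured maximum. The buffer type and its field names are hypothetical, not the converter's actual code; flushing while holding the lock is one possible fix and trades the overshoot for backpressure on callers.

```go
package main

import (
	"fmt"
	"sync"
)

// buffer illustrates the race: if the size check and the flush happen under
// separate lock acquisitions, concurrent callers can push the buffer past
// maxFlushCount before anyone flushes.
type buffer struct {
	mu            sync.Mutex
	entries       []string // stands in for buffered log entries
	maxFlushCount int
	flush         func([]string)
}

// Add appends an entry and, while still holding the lock, flushes when the
// limit is reached, so the buffer never exceeds maxFlushCount.
func (b *buffer) Add(e string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.entries = append(b.entries, e)
	if len(b.entries) >= b.maxFlushCount {
		out := b.entries
		b.entries = nil
		b.flush(out) // flushing under the lock blocks other callers until it completes
	}
}

func main() {
	b := &buffer{
		maxFlushCount: 2,
		flush:         func(out []string) { fmt.Println("flushed", len(out), "entries") },
	}
	b.Add("a")
	b.Add("b") // triggers a flush of exactly maxFlushCount entries
}
```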

@pmalek-sumo pmalek-sumo force-pushed the issue-2330-improved-logs-translation branch from 45cfdf4 to 9e0bb57 Compare March 25, 2021 21:40
@bogdandrutu (Member)

@pmalek-sumo at some point you may want to move this to your repo :))) Probably when all comments are closed

@pmalek-sumo (Contributor, Author)

pmalek-sumo commented Mar 26, 2021

@pmalek-sumo at some point you may want to move this to your repo :))) Probably when all comments are closed

Will do as soon as all the comments are resolved, unless there's a suggestion to do it before that.

I've pushed this to my personal fork and created a PR that supersedes this: #2892

CC: @tigrannajaryan

@pmalek-sumo pmalek-sumo deleted the issue-2330-improved-logs-translation branch March 26, 2021 18:11
alexperez52 referenced this pull request in open-o11y/opentelemetry-collector-contrib Aug 18, 2021