[processor/probabilisticsampler] fix panic when sampling on non-Bytes log record attribute #18223

e-dard · 2023-02-01T10:53:33Z

This PR fixes a panic in the processor that occurs when it is configured to sample on a log record attribute whose type is not Bytes.

The processor now accepts String attributes as well as Bytes attributes. If the sampled attribute has any other type, the processor skips the log record and logs a warning instead of panicking.

Description:
Prevented panic by checking type of attribute before making a sampling decision.

Link to tracking Issue: #18222

Testing:
Added a test that previously would panic. Added a test that shows String attributes will be sampled.

Documentation:
n/a

This commit fixes a panic in the processor that occurs when it is configured to sample on a log record attribute whose type is not `Bytes`. The processor now accepts `String` attributes as well as `Bytes` attributes. If the sampled attribute has any other type, the processor skips the log record and logs a warning instead of panicking.
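The type check described above can be sketched without the collector dependency. This is a minimal illustration, not the code from the PR: `valueType`, `attrValue`, and `samplingKey` are hypothetical stand-ins for the collector's attribute value API, and only the warning text is taken from the diff.

```go
package main

import (
	"fmt"
	"log"
)

// valueType and attrValue are hypothetical stand-ins for the collector's
// attribute value type; they exist only to keep this sketch self-contained.
type valueType int

const (
	typeStr valueType = iota
	typeBytes
	typeInt
)

type attrValue struct {
	typ   valueType
	str   string
	bytes []byte
	i     int64
}

// samplingKey returns the bytes to hash for a sampling decision, or false
// when the attribute type is unsupported and the record should be skipped.
func samplingKey(v attrValue) ([]byte, bool) {
	switch v.typ {
	case typeStr:
		return []byte(v.str), true
	case typeBytes:
		return v.bytes, true
	default:
		// Log and skip instead of panicking on an unexpected type.
		log.Println("incompatible log record attribute, only String or Bytes supported; skipping log record")
		return nil, false
	}
}

func main() {
	for _, v := range []attrValue{
		{typ: typeStr, str: "abc"},
		{typ: typeBytes, bytes: []byte{0xde, 0xad}},
		{typ: typeInt, i: 7},
	} {
		key, ok := samplingKey(v)
		fmt.Println(key, ok)
	}
}
```

Returning a boolean alongside the key keeps the caller in charge of what "skip" means, which matters here because the surrounding removal callbacks cannot return an error.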

linux-foundation-easycla · 2023-02-01T10:53:41Z

The committers listed above are authorized under a signed CLA.

✅ login: e-dard / name: Edd Robinson (95b6348, ef9b02f, a6d440d, b94921e, 5634904, 9374049)
✅ login: atoulme / name: Antoine Toulme (a6d440d, 1b2e1e5)


e-dard · 2023-02-01T10:54:48Z

processor/probabilisticsamplerprocessor/logsprocessor.go

-						lidBytes = value.Bytes().AsRaw()
+
+						switch value.Type() {
+						case pcommon.ValueTypeStr:


I could do with some input here. I'm not sure whether we should be handling string types without understanding the encoding. See the PR discussion for more context.


e-dard · 2023-02-01T10:56:28Z

processor/probabilisticsamplerprocessor/logsprocessor.go

+						case pcommon.ValueTypeBytes:
+							lidBytes = value.Bytes().AsRaw()
+						default:
+							lsp.logger.Warn("incompatible log record attribute, only String or Bytes supported; skipping log record",


In this case the user has configured a log record attribute that is not a supported type. Unfortunately, the RemoveIf functions used by this processor don't return errors, so I figure the only sane thing here without a bigger refactor is to log something. Is WARN the appropriate severity for a misconfigured processor?


processor/probabilisticsamplerprocessor/logsprocessor_test.go

e-dard · 2023-02-01T11:06:24Z

Hey folks, I came across this one whilst playing with the probabilistic sampler as part of some benchmarking I was doing.

Initially I figured that it would make as much sense to sample on a String attribute as on a Bytes attribute, so I added support for that in the code change.

However, now that I have hunted around the original issues and PRs for this sampler, it looks like the rationale for sampling on log record attributes is to ensure that you sample all the log records associated with any spans that have been sampled. The underlying idea (I assume) is to put the same trace IDs into your log signals and your spans.

I'm now wondering if I should change this PR to not accept String attributes and instead lump them in with the default case of the switch: "we can't sample based on this type".

Suppose you have a trace ID ([16]byte) on some spans that get sampled, and you want to use this log processor to ensure related log records are also sampled. If the attribute type is not Bytes, then without knowing how the ID has been encoded into the alternative type, it seems hard to ensure it will receive the same sampling decision.

Let me know if I should just remove the String case and only allow Bytes attributes to be sampled.

If that's the case, then as an end user that has decided to put the trace IDs on an attribute (rather than on the log record TraceID)

github-actions · 2023-02-21T05:17:32Z

This PR was marked stale due to lack of activity. It will be closed in 14 days.

runforesight · 2023-03-02T13:45:25Z

Foresight Summary

Major Impacts

build-and-test duration(26 minutes 19 seconds) has decreased 37 minutes 19 seconds compared to main branch avg(1 hour 3 minutes 38 seconds).


⭕ build-and-test-windows workflow has finished in 9 seconds (41 minutes 22 seconds less than `main` branch avg.) and finished at 15th Mar, 2023.

Job	Failed Steps	Tests
windows-unittest-matrix	- 🔗	N/A	See Details
windows-unittest	- 🔗	N/A	See Details

✅ check-links workflow has finished in 47 seconds (40 seconds less than `main` branch avg.) and finished at 15th Mar, 2023.

Job	Failed Steps	Tests
changed files	- 🔗	N/A	See Details
check-links	- 🔗	N/A	See Details

✅ changelog workflow has finished in 2 minutes 25 seconds and finished at 15th Mar, 2023.

Job	Failed Steps	Tests
changelog	- 🔗	N/A	See Details

✅ telemetrygen workflow has finished in 1 minute 3 seconds (1 minute 2 seconds less than `main` branch avg.) and finished at 15th Mar, 2023.

Job	Failed Steps	Tests
build-dev	- 🔗	N/A	See Details
publish-latest	- 🔗	N/A	See Details
publish-stable	- 🔗	N/A	See Details

❌ build-and-test workflow has finished in 26 minutes 19 seconds (37 minutes 19 seconds less than `main` branch avg.) and finished at 15th Mar, 2023. 3 jobs failed. There are 2 test failures.

Job	Failed Steps	Tests
unittest-matrix (1.19, connector)	- 🔗	✅ 113 ❌ 0 ⏭ 0 🔗	See Details
unittest-matrix (1.20, connector)	- 🔗	✅ 113 ❌ 0 ⏭ 0 🔗	See Details
unittest-matrix (1.19, processor)	Run Unit Tests 🔗	✅ 822 ❌ 1 ⏭ 0 🔗	See Details
unittest-matrix (1.20, processor)	Run Unit Tests 🔗	✅ 822 ❌ 1 ⏭ 0 🔗	See Details
unittest-matrix (1.20, extension)	- 🔗	✅ 467 ❌ 0 ⏭ 0 🔗	See Details
unittest-matrix (1.20, receiver-0)	- 🔗	✅ 564 ❌ 0 ⏭ 0 🔗	See Details
unittest-matrix (1.19, receiver-0)	- 🔗	✅ 454 ❌ 0 ⏭ 0 🔗	See Details
unittest-matrix (1.19, other)	- 🔗	✅ 0 ❌ 0 ⏭ 0 🔗	See Details
unittest-matrix (1.19, internal)	- 🔗	✅ 551 ❌ 0 ⏭ 0 🔗	See Details
unittest-matrix (1.20, internal)	- 🔗	✅ 316 ❌ 0 ⏭ 0 🔗	See Details
unittest-matrix (1.20, exporter)	- 🔗	✅ 615 ❌ 0 ⏭ 0 🔗	See Details
unittest-matrix (1.20, other)	- 🔗	✅ 0 ❌ 0 ⏭ 0 🔗	See Details
unittest-matrix (1.19, exporter)	- 🔗	✅ 615 ❌ 0 ⏭ 0 🔗	See Details
unittest-matrix (1.19, extension)	- 🔗	✅ 467 ❌ 0 ⏭ 0 🔗	See Details
unittest-matrix (1.19, receiver-1)	- 🔗	✅ 301 ❌ 0 ⏭ 0 🔗	See Details
unittest-matrix (1.20, receiver-1)	- 🔗	✅ 301 ❌ 0 ⏭ 0 🔗	See Details
correctness-traces	- 🔗	✅ 17 ❌ 0 ⏭ 0 🔗	See Details
correctness-metrics	- 🔗	✅ 2 ❌ 0 ⏭ 0 🔗	See Details
integration-tests	- 🔗	✅ 55 ❌ 0 ⏭ 0 🔗	See Details
setup-environment	- 🔗	N/A	See Details
checks	- 🔗	N/A	See Details
check-codeowners	- 🔗	N/A	See Details
build-examples	- 🔗	N/A	See Details
check-collector-module-version	- 🔗	N/A	See Details
lint-matrix (receiver-0)	- 🔗	N/A	See Details
lint-matrix (receiver-1)	- 🔗	N/A	See Details
lint-matrix (processor)	- 🔗	N/A	See Details
lint-matrix (exporter)	- 🔗	N/A	See Details
lint-matrix (extension)	- 🔗	N/A	See Details
lint-matrix (connector)	- 🔗	N/A	See Details
lint-matrix (internal)	- 🔗	N/A	See Details
lint-matrix (other)	- 🔗	N/A	See Details
unittest (1.20)	Interpret result 🔗	N/A	See Details
unittest (1.19)	- 🔗	N/A	See Details
lint	- 🔗	N/A	See Details
cross-compile	- 🔗	N/A	See Details
build-package	- 🔗	N/A	See Details
windows-msi	- 🔗	N/A	See Details
publish-check	- 🔗	N/A	See Details
publish-stable	- 🔗	N/A	See Details
publish-dev	- 🔗	N/A	See Details

✅ prometheus-compliance-tests workflow has finished in 10 minutes 43 seconds (⚠️ 3 minutes 17 seconds more than `main` branch avg.) and finished at 15th Mar, 2023.

Job	Failed Steps	Tests
prometheus-compliance-tests	- 🔗	✅ 21 ❌ 0 ⏭ 0 🔗	See Details

✅ load-tests workflow has finished in 16 minutes 46 seconds (⚠️ 3 minutes 42 seconds more than `main` branch avg.) and finished at 15th Mar, 2023.

Job	Failed Steps	Tests
loadtest (TestTraceAttributesProcessor)	- 🔗	✅ 3 ❌ 0 ⏭ 0 🔗	See Details
loadtest (TestIdleMode)	- 🔗	✅ 1 ❌ 0 ⏭ 0 🔗	See Details
loadtest (TestMetric10kDPS\|TestMetricsFromFile)	- 🔗	✅ 6 ❌ 0 ⏭ 0 🔗	See Details
loadtest (TestTraceNoBackend10kSPS\|TestTrace1kSPSWithAttrs)	- 🔗	✅ 8 ❌ 0 ⏭ 0 🔗	See Details
loadtest (TestTraceBallast1kSPSWithAttrs\|TestTraceBallast1kSPSAddAttrs)	- 🔗	✅ 10 ❌ 0 ⏭ 0 🔗	See Details
loadtest (TestMetricResourceProcessor\|TestTrace10kSPS)	- 🔗	✅ 12 ❌ 0 ⏭ 0 🔗	See Details
loadtest (TestBallastMemory\|TestLog10kDPS)	- 🔗	✅ 18 ❌ 0 ⏭ 0 🔗	See Details
setup-environment	- 🔗	N/A	See Details

✅ e2e-tests workflow has finished in 13 minutes 50 seconds and finished at 15th Mar, 2023.

Job	Failed Steps	Tests
kubernetes-test (v1.26.0)	- 🔗	N/A	See Details
kubernetes-test (v1.24.7)	- 🔗	N/A	See Details
kubernetes-test (v1.23.13)	- 🔗	N/A	See Details
kubernetes-test (v1.25.3)	- 🔗	N/A	See Details


jpkrohling

Sorry for the delay, I was out for a few weeks. I left a few comments, let me know what you think.

jpkrohling · 2023-03-02T13:58:21Z

processor/probabilisticsamplerprocessor/logsprocessor.go

-						lidBytes = value.Bytes().AsRaw()
+
+						switch value.Type() {
+						case pcommon.ValueTypeStr:


I would say that this doesn't matter much, as a hash will be calculated based on the value. If it's string or int, it doesn't matter, as long as they are all converted to bytes in the end. Additionally, it doesn't matter if we are using the right encoding for the string, as long as all values are byte encoded in the same way. If we were to use this information for other than just calculating a hash, it would certainly be important to properly handle the encoding.
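A hash-based decision of the kind described here might look like the following sketch. The FNV-1a hash, the bucket count, and the `sampled` helper are assumptions chosen for illustration; they are not claimed to match the processor's actual hashing.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// numBuckets is an illustrative bucket count, not necessarily the processor's.
const numBuckets = 1 << 14

// sampled hashes the key and keeps the record when the hash falls into the
// first `percent` share of the buckets.
func sampled(key []byte, percent float64) bool {
	h := fnv.New32a()
	h.Write(key) // Write on a hash.Hash never returns an error
	threshold := uint32(percent / 100 * numBuckets)
	return h.Sum32()%numBuckets < threshold
}

func main() {
	asString := "4bf92f3577b34da6"
	asBytes := []byte(asString)

	// The decision depends only on the bytes fed to the hash, so the same
	// byte sequence gives the same answer whatever type it arrived as.
	fmt.Println(sampled([]byte(asString), 50) == sampled(asBytes, 50))
}
```

The point of the sketch is consistency: as long as every record's value is turned into bytes the same way, records with equal values always land on the same side of the threshold.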

jpkrohling · 2023-03-02T13:59:31Z

processor/probabilisticsamplerprocessor/logsprocessor.go

+						case pcommon.ValueTypeBytes:
+							lidBytes = value.Bytes().AsRaw()
+						default:
+							lsp.logger.Warn("incompatible log record attribute, only String or Bytes supported; skipping log record",


I guess the only value that should not be supported is boolean. Otherwise, the probability provided by the user will be skewed by the fact that booleans can only have two values: there won't be a good distribution of values to make a proper probability.

.chloggen/fix_probabilisticsamplerprocessor-panic.yaml

atoulme

LGTM - with a slight reduction of scope, just adding string support for now. We can add additional types in subsequent PRs.


evantorrie · 2023-03-15T15:47:00Z

Re: addition of string support. This is actually something I wanted yesterday for a use-case (sampling logs probabilistically without having any TraceID associated with a log), so thanks for adding it. 🙏

When I dug a little deeper into this, stanza's log parsers, and OTTL, it was unclear to me how one could actually get a pcommon.ValueTypeBytes into a log entry field. There's no casting function in OTTL to cast from a string to a []byte. It appears the only way to create a ValueTypeBytes field is via a byte literal (i.e. 0xdeadbeef)?

atoulme · 2023-03-18T02:21:01Z

Please attend to the failing test:

    logsprocessor_test.go:215: 
        	Error Trace:	/home/runner/work/opentelemetry-collector-contrib/opentelemetry-collector-contrib/processor/probabilisticsamplerprocessor/logsprocessor_test.go:215
        	Error:      	Not equal: 
        	            	expected: 0
        	            	actual  : 1
        	Test:       	TestLogsSamplingInvalidType

jpkrohling · 2023-03-27T18:23:02Z

Please, ping me once the test failures are fixed so that I can review this again.

github-actions · 2023-05-26T22:02:29Z

This PR was marked stale due to lack of activity. It will be closed in 14 days.

github-actions · 2023-06-10T05:19:41Z

Closed as inactive. Feel free to reopen if this PR is still being worked on.

jpkrohling · 2023-07-11T18:08:21Z

@e-dard, are you still interested in this PR? If not, would you mind if someone else continues your work? @daianmartinho, would you be interested in picking this up if @e-dard isn't interested anymore?

atoulme · 2023-07-11T21:46:13Z

We have to get this in. Reopening and will follow up.

daianmartinho · 2023-07-17T18:41:13Z

@jpkrohling,
We are working on the architecture definitions of our telemetry platform right now. We will probably hit this bug in the near future, as we saw in some tests.
I think we can handle that as soon as we get to the collector configuration.
I will keep you updated if anything changes.

github-actions · 2023-08-06T05:18:26Z

This PR was marked stale due to lack of activity. It will be closed in 14 days.

MovieStoreGuy

I don't see anything that concerns me, just needs to be rebased with main and it should be good to go :)

MovieStoreGuy · 2023-08-09T13:33:36Z

processor/probabilisticsamplerprocessor/logsprocessor.go

+						case pcommon.ValueTypeBytes:
+							lidBytes = value.Bytes().AsRaw()
+						default:
+							lsp.logger.Warn("incompatible log record attribute, only String or Bytes supported; skipping log record",


I believe there is a method named AsString, which works regardless of the type since it just calls fmt.Sprint?

Would that work here instead?
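The idea of accepting every type by rendering the value as text first can be sketched as follows. `keyFor` is a hypothetical helper, and `fmt.Sprint` on plain Go values is used as a stand-in; the collector's own AsString may format some types differently.

```go
package main

import "fmt"

// keyFor renders any value as text and returns its bytes, so that every
// attribute type yields something hashable. fmt.Sprint is a stand-in here,
// not the collector's conversion.
func keyFor(v any) []byte {
	return []byte(fmt.Sprint(v))
}

func main() {
	for _, v := range []any{"abc", int64(42), 3.5, true} {
		fmt.Printf("%T -> %q\n", v, keyFor(v))
	}
}
```

This removes the need for a default case that skips records, at the cost of making the hashed bytes depend on how each type is formatted.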

github-actions · 2023-08-24T05:19:05Z

This PR was marked stale due to lack of activity. It will be closed in 14 days.

github-actions · 2023-09-08T05:19:10Z

Closed as inactive. Feel free to reopen if this PR is still being worked on.

jpkrohling · 2023-09-08T09:16:45Z

@daianmartinho, would you be interested in working on @MovieStoreGuy's last comment? I think this is very close to being merged, only needs that clarification.

e-dard · 2023-09-15T13:07:46Z

@jpkrohling @MovieStoreGuy apologies for this falling off my plate for such a long time. I have fixed the test case and rebased.

jpkrohling · 2023-09-20T09:14:27Z

@e-dard, I ended up opening a PR for this last week as well: #26564. The advantage of that other PR is that it accepts all records instead of rejecting those with non-string/non-byte attributes.

github-actions · 2023-10-05T05:20:25Z

This PR was marked stale due to lack of activity. It will be closed in 14 days.

github-actions · 2023-10-20T05:19:00Z

Closed as inactive. Feel free to reopen if this PR is still being worked on.

e-dard added 2 commits February 1, 2023 10:11

chore: add changelog entry

ef9b02f

e-dard requested a review from a team February 1, 2023 10:53

e-dard requested a review from jpkrohling as a code owner February 1, 2023 10:53

github-actions bot assigned codeboten Feb 1, 2023

github-actions bot added the processor/probabilisticsampler Probabilistic Sampler processor label Feb 1, 2023


github-actions bot added the Stale label Feb 21, 2023

jpkrohling removed the Stale label Mar 2, 2023

jpkrohling reviewed Mar 2, 2023


atoulme reviewed Mar 10, 2023


atoulme approved these changes Mar 10, 2023


Update .chloggen/fix_probabilisticsamplerprocessor-panic.yaml

a6d440d

Co-authored-by: Antoine Toulme <[email protected]>

github-actions bot added the Stale label May 26, 2023

github-actions bot closed this Jun 10, 2023

atoulme reopened this Jul 11, 2023

github-actions bot removed the Stale label Jul 12, 2023

Merge branch 'main' into er/fix/processor/sample_types

1b2e1e5

github-actions bot added the Stale label Aug 6, 2023

MovieStoreGuy removed the Stale label Aug 9, 2023

MovieStoreGuy approved these changes Aug 9, 2023


github-actions bot added the Stale label Aug 24, 2023

github-actions bot closed this Sep 8, 2023

jpkrohling reopened this Sep 8, 2023

github-actions bot removed the Stale label Sep 9, 2023

e-dard added 2 commits September 15, 2023 14:03

fix test case

b94921e

Merge branch 'main' into er/fix/processor/sample_types

5634904

Merge branch 'main' into er/fix/processor/sample_types

9374049

github-actions bot added the Stale label Oct 5, 2023

github-actions bot closed this Oct 20, 2023
