Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[processor/probabilisticsampler] fix panic when sampling on non-Bytes log record attribute #18223

Closed

Conversation

e-dard
Copy link

@e-dard e-dard commented Feb 1, 2023

This PR fixes a panic from the processor, which occurs when the processor is configured to sample on any log attribute that does not have a Bytes attribute type.

Now, the processor will accept String attributes as well as Byte attributes. If a sampled attribute has a different type then rather than panicking the processor will skip the log record and log a warning.

Description:
Prevented panic by checking type of attribute before making a sampling decision.

Link to tracking Issue: #18222

Testing:
Added a test that previously would panic. Added a test that shows String attributes will be sampled.

Documentation:
n/a

This commit fixes a panic from the processor, which occurs when
the processor is configured to sample on any log attribute that
does not have a `Bytes` attribute type.

Now, the processor will accept `String` attributes as well as
`Byte` attributes. If a sampled attribute has a different type
then rather than panicking the processor will skip the log record
and log a warning.
@e-dard e-dard requested a review from a team February 1, 2023 10:53
@e-dard e-dard requested a review from jpkrohling as a code owner February 1, 2023 10:53
@linux-foundation-easycla
Copy link

linux-foundation-easycla bot commented Feb 1, 2023

CLA Signed

The committers listed above are authorized under a signed CLA.

@github-actions github-actions bot added the processor/probabilisticsampler Probabilistic Sampler processor label Feb 1, 2023
lidBytes = value.Bytes().AsRaw()

switch value.Type() {
case pcommon.ValueTypeStr:
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could do with some input here.. I'm not sure if we should be handling string types without understanding the encoding. See PR discussion for more context.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say that this doesn't matter much, as a hash will be calculated based on the value. If it's string or int, it doesn't matter, as long as they are all converted to bytes in the end. Additionally, it doesn't matter if we are using the right encoding for the string, as long as all values are byte encoded in the same way. If we were to use this information for other than just calculating a hash, it would certainly be important to properly handle the encoding.

case pcommon.ValueTypeBytes:
lidBytes = value.Bytes().AsRaw()
default:
lsp.logger.Warn("incompatible log record attribute, only String or Bytes supported; skipping log record",
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case the user has configured a log record attribute that is not a supported type. Unfortunately, the RemoveIf functions used by this processor don't return errors, so I figure the only sane thing here without a bigger refactor is to log something. Is WARN the appropriate severity for a misconfigured processor?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the only value that should not be supported is boolean. Otherwise, the probability provided by the user will be skewed by the fact that booleans can only have two values: there won't be a good distribution of values to make a proper probability.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe there is a method named, AsString which works regardless the type since it just calls fmt.Sprint ?

Would that work here instead?

@e-dard
Copy link
Author

e-dard commented Feb 1, 2023

Hey folks, I came across this one whilst playing with the probabilistic sampler as part of some benchmarking I was doing.

Initially I figured that it would make as much sense to sample on a String attribute as it would a Bytes attribute, so I added support for that in the code-change.

However.. Now I have hunted around the original issues and PRs for this sampler it looks like the rationale behind being able to sample on log record attributes is to ensure that you sample all the log records associated with any spans that have been sampled. The underlying idea being (I assume) to stuff the same trace IDs into your log signals and your spans.

I'm now wondering if I should just change this PR to not accept String attributes and to lump them in with the default case of the switch "we can't sample based on this type".

If you have a trace ID ([16]byte) on some spans that get sampled, and you want to use this log processor to ensure related log records are also sampled, then if the attribute type is not Bytes then without knowing how it's been encoded into an alternative type it seems hard to ensure it will also have the same sampling decision.

Let me know if I should just remove the String case and only allow Bytes attributes to be sampled.

If that's the case, then as a end user that has decided to put the trace IDs on an attribute (rather than on the log record TraceID)

@github-actions
Copy link
Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Feb 21, 2023
@jpkrohling jpkrohling removed the Stale label Mar 2, 2023
@runforesight
Copy link

runforesight bot commented Mar 2, 2023

Foresight Summary

    
Major Impacts

build-and-test duration(26 minutes 19 seconds) has decreased 37 minutes 19 seconds compared to main branch avg(1 hour 3 minutes 38 seconds).
View More Details

⭕  build-and-test-windows workflow has finished in 9 seconds (41 minutes 22 seconds less than main branch avg.) and finished at 15th Mar, 2023.


Job Failed Steps Tests
windows-unittest-matrix -     🔗  N/A See Details
windows-unittest -     🔗  N/A See Details

✅  check-links workflow has finished in 47 seconds (40 seconds less than main branch avg.) and finished at 15th Mar, 2023.


Job Failed Steps Tests
changed files -     🔗  N/A See Details
check-links -     🔗  N/A See Details

✅  changelog workflow has finished in 2 minutes 25 seconds and finished at 15th Mar, 2023.


Job Failed Steps Tests
changelog -     🔗  N/A See Details

✅  telemetrygen workflow has finished in 1 minute 3 seconds (1 minute 2 seconds less than main branch avg.) and finished at 15th Mar, 2023.


Job Failed Steps Tests
build-dev -     🔗  N/A See Details
publish-latest -     🔗  N/A See Details
publish-stable -     🔗  N/A See Details

❌  build-and-test workflow has finished in 26 minutes 19 seconds (37 minutes 19 seconds less than main branch avg.) and finished at 15th Mar, 2023. 3 jobs failed. There are 2 test failures.


Job Failed Steps Tests
unittest-matrix (1.19, connector) -     🔗  ✅ 113  ❌ 0  ⏭ 0    🔗 See Details
unittest-matrix (1.20, connector) -     🔗  ✅ 113  ❌ 0  ⏭ 0    🔗 See Details
unittest-matrix (1.19, processor) Run Unit Tests     🔗  ✅ 822  ❌ 1  ⏭ 0    🔗 See Details
unittest-matrix (1.20, processor) Run Unit Tests     🔗  ✅ 822  ❌ 1  ⏭ 0    🔗 See Details
unittest-matrix (1.20, extension) -     🔗  ✅ 467  ❌ 0  ⏭ 0    🔗 See Details
unittest-matrix (1.20, receiver-0) -     🔗  ✅ 564  ❌ 0  ⏭ 0    🔗 See Details
unittest-matrix (1.19, receiver-0) -     🔗  ✅ 454  ❌ 0  ⏭ 0    🔗 See Details
unittest-matrix (1.19, other) -     🔗  ✅ 0  ❌ 0  ⏭ 0    🔗 See Details
unittest-matrix (1.19, internal) -     🔗  ✅ 551  ❌ 0  ⏭ 0    🔗 See Details
unittest-matrix (1.20, internal) -     🔗  ✅ 316  ❌ 0  ⏭ 0    🔗 See Details
unittest-matrix (1.20, exporter) -     🔗  ✅ 615  ❌ 0  ⏭ 0    🔗 See Details
unittest-matrix (1.20, other) -     🔗  ✅ 0  ❌ 0  ⏭ 0    🔗 See Details
unittest-matrix (1.19, exporter) -     🔗  ✅ 615  ❌ 0  ⏭ 0    🔗 See Details
unittest-matrix (1.19, extension) -     🔗  ✅ 467  ❌ 0  ⏭ 0    🔗 See Details
unittest-matrix (1.19, receiver-1) -     🔗  ✅ 301  ❌ 0  ⏭ 0    🔗 See Details
unittest-matrix (1.20, receiver-1) -     🔗  ✅ 301  ❌ 0  ⏭ 0    🔗 See Details
correctness-traces -     🔗  ✅ 17  ❌ 0  ⏭ 0    🔗 See Details
correctness-metrics -     🔗  ✅ 2  ❌ 0  ⏭ 0    🔗 See Details
integration-tests -     🔗  ✅ 55  ❌ 0  ⏭ 0    🔗 See Details
setup-environment -     🔗  N/A See Details
checks -     🔗  N/A See Details
check-codeowners -     🔗  N/A See Details
build-examples -     🔗  N/A See Details
check-collector-module-version -     🔗  N/A See Details
lint-matrix (receiver-0) -     🔗  N/A See Details
lint-matrix (receiver-1) -     🔗  N/A See Details
lint-matrix (processor) -     🔗  N/A See Details
lint-matrix (exporter) -     🔗  N/A See Details
lint-matrix (extension) -     🔗  N/A See Details
lint-matrix (connector) -     🔗  N/A See Details
lint-matrix (internal) -     🔗  N/A See Details
lint-matrix (other) -     🔗  N/A See Details
unittest (1.20) Interpret result     🔗  N/A See Details
unittest (1.19) -     🔗  N/A See Details
lint -     🔗  N/A See Details
cross-compile -     🔗  N/A See Details
build-package -     🔗  N/A See Details
windows-msi -     🔗  N/A See Details
publish-check -     🔗  N/A See Details
publish-stable -     🔗  N/A See Details
publish-dev -     🔗  N/A See Details

✅  prometheus-compliance-tests workflow has finished in 10 minutes 43 seconds (⚠️ 3 minutes 17 seconds more than main branch avg.) and finished at 15th Mar, 2023.


Job Failed Steps Tests
prometheus-compliance-tests -     🔗  ✅ 21  ❌ 0  ⏭ 0    🔗 See Details

✅  load-tests workflow has finished in 16 minutes 46 seconds (⚠️ 3 minutes 42 seconds more than main branch avg.) and finished at 15th Mar, 2023.


Job Failed Steps Tests
loadtest (TestTraceAttributesProcessor) -     🔗  ✅ 3  ❌ 0  ⏭ 0    🔗 See Details
loadtest (TestIdleMode) -     🔗  ✅ 1  ❌ 0  ⏭ 0    🔗 See Details
loadtest (TestMetric10kDPS|TestMetricsFromFile) -     🔗  ✅ 6  ❌ 0  ⏭ 0    🔗 See Details
loadtest (TestTraceNoBackend10kSPS|TestTrace1kSPSWithAttrs) -     🔗  ✅ 8  ❌ 0  ⏭ 0    🔗 See Details
loadtest (TestTraceBallast1kSPSWithAttrs|TestTraceBallast1kSPSAddAttrs) -     🔗  ✅ 10  ❌ 0  ⏭ 0    🔗 See Details
loadtest (TestMetricResourceProcessor|TestTrace10kSPS) -     🔗  ✅ 12  ❌ 0  ⏭ 0    🔗 See Details
loadtest (TestBallastMemory|TestLog10kDPS) -     🔗  ✅ 18  ❌ 0  ⏭ 0    🔗 See Details
setup-environment -     🔗  N/A See Details

✅  e2e-tests workflow has finished in 13 minutes 50 seconds and finished at 15th Mar, 2023.


Job Failed Steps Tests
kubernetes-test (v1.26.0) -     🔗  N/A See Details
kubernetes-test (v1.24.7) -     🔗  N/A See Details
kubernetes-test (v1.23.13) -     🔗  N/A See Details
kubernetes-test (v1.25.3) -     🔗  N/A See Details

🔎 See details on Foresight

*You can configure Foresight comments in your organization settings page.

Copy link
Member

@jpkrohling jpkrohling left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay, I was out for a few weeks. I left a few comments, let me know what you think.

lidBytes = value.Bytes().AsRaw()

switch value.Type() {
case pcommon.ValueTypeStr:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say that this doesn't matter much, as a hash will be calculated based on the value. If it's string or int, it doesn't matter, as long as they are all converted to bytes in the end. Additionally, it doesn't matter if we are using the right encoding for the string, as long as all values are byte encoded in the same way. If we were to use this information for other than just calculating a hash, it would certainly be important to properly handle the encoding.

case pcommon.ValueTypeBytes:
lidBytes = value.Bytes().AsRaw()
default:
lsp.logger.Warn("incompatible log record attribute, only String or Bytes supported; skipping log record",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the only value that should not be supported is boolean. Otherwise, the probability provided by the user will be skewed by the fact that booleans can only have two values: there won't be a good distribution of values to make a proper probability.

Copy link
Contributor

@atoulme atoulme left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - with a slight reduction of scope, just adding string support for now. We can add additional types in subsequent PRs.

@evantorrie
Copy link
Contributor

Re: addition of string support. This is actually something I wanted yesterday for a use-case (sampling logs probabilistically without having any TraceID associated with a log), so thanks for adding it. 🙏

When I dived a little deeper into this, stanza's log parsers and OTTL, it was unclear to me how in fact one could get a pcommon.ValueTypeBytes into a log entry field. There's no casting function in OTTL to cast from a string to a []bytes. It appears the only way to create a ValueTypeBytes field is via a byte literal (i.e. 0xdeadbeef)?

@atoulme
Copy link
Contributor

atoulme commented Mar 18, 2023

Please attend to the failing test:

    logsprocessor_test.go:215: 
        	Error Trace:	/home/runner/work/opentelemetry-collector-contrib/opentelemetry-collector-contrib/processor/probabilisticsamplerprocessor/logsprocessor_test.go:215
        	Error:      	Not equal: 
        	            	expected: 0
        	            	actual  : 1
        	Test:       	TestLogsSamplingInvalidType

@jpkrohling
Copy link
Member

Please, ping me once the test failures are fixed so that I can review this again.

@github-actions
Copy link
Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label May 26, 2023
@github-actions
Copy link
Contributor

Closed as inactive. Feel free to reopen if this PR is still being worked on.

@github-actions github-actions bot closed this Jun 10, 2023
@jpkrohling
Copy link
Member

@e-dard, are you still interested in this PR? If not, would you mind if someone else continues your work? @daianmartinho, would you be interested in picking this up if @e-dard isn't interested anymore?

@atoulme atoulme reopened this Jul 11, 2023
@atoulme
Copy link
Contributor

atoulme commented Jul 11, 2023

We have to get this in. Reopening and will follow up.

@github-actions github-actions bot removed the Stale label Jul 12, 2023
@daianmartinho
Copy link

@jpkrohling,
we are working on architecture definitions of our telemetry platform right now.. probably will hit this bug in near future, as we saw in some tests.
i think we can handle that as soon as we get in the collector configuration.
i will keep you guys updated if anything changes

@github-actions
Copy link
Contributor

github-actions bot commented Aug 6, 2023

This PR was marked stale due to lack of activity. It will be closed in 14 days.

Copy link
Contributor

@MovieStoreGuy MovieStoreGuy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see anything that concerns me, just needs to be rebased with main and it should be good to go :)

case pcommon.ValueTypeBytes:
lidBytes = value.Bytes().AsRaw()
default:
lsp.logger.Warn("incompatible log record attribute, only String or Bytes supported; skipping log record",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe there is a method named, AsString which works regardless the type since it just calls fmt.Sprint ?

Would that work here instead?

@github-actions
Copy link
Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Aug 24, 2023
@github-actions
Copy link
Contributor

github-actions bot commented Sep 8, 2023

Closed as inactive. Feel free to reopen if this PR is still being worked on.

@github-actions github-actions bot closed this Sep 8, 2023
@jpkrohling jpkrohling reopened this Sep 8, 2023
@jpkrohling
Copy link
Member

@daianmartinho, would you be interested in working on @MovieStoreGuy's last comment? I think this is very close to being merged, only needs that clarification.

@github-actions github-actions bot removed the Stale label Sep 9, 2023
@e-dard
Copy link
Author

e-dard commented Sep 15, 2023

@jpkrohling @MovieStoreGuy apologies for this falling off my plate for such a long time. I have fixed the test case and rebased.

@jpkrohling
Copy link
Member

@e-dard, I ended up opening a PR for this last week as well: #26564 . The advantage of that other PR is that it makes all records to be acceptable instead of rejecting non-string/non-byte attributes.

@github-actions
Copy link
Contributor

github-actions bot commented Oct 5, 2023

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Oct 5, 2023
@github-actions
Copy link
Contributor

Closed as inactive. Feel free to reopen if this PR is still being worked on.

@github-actions github-actions bot closed this Oct 20, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
processor/probabilisticsampler Probabilistic Sampler processor Stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants