Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SplunkHecExporter] Don't send 'zero' timestamps to Splunk HEC. #1157

Merged

Conversation

nebffa
Copy link
Contributor

@nebffa nebffa commented Sep 30, 2020

Don't send 'zero' timestamps to Splunk HEC. - see #965

@nebffa nebffa requested a review from a team September 30, 2020 05:34
@linux-foundation-easycla
Copy link

linux-foundation-easycla bot commented Sep 30, 2020

CLA Check
The committers are authorized under a signed CLA.

  • ✅ Ben Lucato (3bb14190b08a0133ca73285904910d2959014619)

@tigrannajaryan
Copy link
Member

@nebffa please sign the CLA so that we can review the PR.

@nebffa nebffa force-pushed the signalfx-receiver-default-timestamps branch from 3d3f0e9 to cff3191 Compare September 30, 2020 21:58
@nebffa
Copy link
Contributor Author

nebffa commented Sep 30, 2020

Sorry @tigrannajaryan , I had committed under the wrong account - I've fixed that now.

@tigrannajaryan
Copy link
Member

Normally Collector's philosophy is that we don’t alter the data unless explicitly configured to do so by the user. This change does not look right to me. If we received data with a missing timestamp we need to forward it with a missing timestamp.

If missing timestamp is considered an invalid situation then it either needs to be fixed at the source that emits the metric or if interpretation of "missing means now" is desirable then it needs to be a backend behavior. It is not Collector's business to make decisions like this.

Also in SignalFx missing timestamp just means "this is the latest datapoint" where the exact time is kind of irrelevant.

I wonder what exactly is the problem with #965
Why is the current behavior not good enough?

@nebffa
Copy link
Contributor Author

nebffa commented Oct 1, 2020

@tigrannajaryan that philosophy sounds fine. In my case I am sending data to a Splunk HEC endpoint - given that, perhaps it's better to configure default timestamps for the Splunk HEC exporter?

@tigrannajaryan
Copy link
Member

@tigrannajaryan that philosophy sounds fine. In my case I am sending data to a Splunk HEC endpoint - given that, perhaps it's better to configure default timestamps for the Splunk HEC exporter?

Yes, it would make sense to do so if missing timestampt is considered an invalid input for Splunk HEC endpoint. I do not know if it is the case. Do you know?

@atoulme perhaps you can help with this if you know?

@tigrannajaryan tigrannajaryan self-assigned this Oct 2, 2020
@atoulme
Copy link
Contributor

atoulme commented Oct 6, 2020

I checked with my team ; Splunk will add an index timestamp if no timestamp is present. So there is no absolute need for a timestamp to be present. Of course it's better to have a timestamp on the record.

@atoulme
Copy link
Contributor

atoulme commented Oct 6, 2020

I would offer that maybe a processor step could enrich the data explicitly to add a timestamp to the records being processed. That sounds at least like something that folks can opt in into.

@nebffa
Copy link
Contributor Author

nebffa commented Oct 6, 2020

I checked with my team ; Splunk will add an index timestamp if no timestamp is present. So there is no absolute need for a timestamp to be present. Of course it's better to have a timestamp on the record.

Hey @atoulme , by default Splunk doesn't do this. I've tested it thoroughly locally. There are ways to have Splunk add custom/default timestamps to incoming events/metrics but it appears to be more complicated with the Splunk HTTP Event Collector. I could not get it working even after a few hours of trying. Perhaps I was doing something wrong, but at the very least it wasn't 'dead-easy' to configure.

As for adding the timestamps - I am happy to do it with a processor or in the Splunk HEC exporter. Is there a preference?

@tigrannajaryan
Copy link
Member

@nebffa which Splunk backend are you sending to? I believe @atoulme is right and that is also what HEC description says: https://docs.splunk.com/Documentation/Splunk/8.0.6/Data/FormateventsforHTTPEventCollector It says "These keys are all optional", so it should be OK not to have the timestamp.

It may be worth to try to clarify this with whatever Splunk backend you are sending to. If you don't know who to ask I can try to find the right person at Splunk, but I need to know which backend.

@nebffa
Copy link
Contributor Author

nebffa commented Oct 9, 2020

@tigrannajaryan @atoulme I did some more reviewing of the actual data that's being sent. Some metrics from the SignalFx agent are sent with an explicit timestamp of 0 (i.e. not nil). These events are sent from the otel-contrib-collector to Splunk, which accepts this '0' timestamp without modification, so the events are all sitting at the beginning of epoch time. So I think there should be some way for the otel-contrib-collector to modify that timestamp value in some way.

To clarify, this doesn't happen for all SignalFx metrics - some monitors emit metrics with timestamps - e.g.https://docs.signalfx.com/en/latest/integrations/agent/monitors/collectd-disk.html, and some monitors emit emtrics without timestamps - e.g. https://docs.signalfx.com/en/latest/integrations/agent/monitors/cpu.html.

As for the Splunk backend @tigrannajaryan . I'm not exactly sure what you mean - but we are sending data to a Splunk Enterprise instance, version 8.0.5.1, and we are using the services/collector Splunk HEC rest endpoint.

@nebffa
Copy link
Contributor Author

nebffa commented Oct 12, 2020

Hey @tigrannajaryan @atoulme - any thoughts on the above?

@tigrannajaryan
Copy link
Member

@nebffa sorry for replying late, I was away for a few days.

I did some more reviewing of the actual data that's being sent. Some metrics from the SignalFx agent are sent with an explicit timestamp of 0 (i.e. not nil). These events are sent from the otel-contrib-collector to Splunk, which accepts this '0' timestamp without modification, so the events are all sitting at the beginning of epoch time. So I think there should be some way for the otel-contrib-collector to modify that timestamp value in some way.

Just to confirm: are you using SignalFx SmartAgent to collect metrics and then send them to OpenTelemetry Collector, which then sends them to Splunk Enterprise?

@keitwb @jrcamp can you perhaps help with this? Is SmartAgent expected to send metrics with 0 (not nil) timestamps?

@nebffa a different question: do you need to use SignalFx SmartAgent? Can you collect the metrics using OpenTelemetry Collector?

@keitwb
Copy link
Contributor

keitwb commented Oct 14, 2020

@tigrannajaryan Yes, the Smart Agent generally does not set timestamps on metrics as every datapoint represents the "latest" data available and the exact moment it was sampled is not particularly important. In the Metrics protobuf, there is no way to send nil timestamps as far as I know (nullable is false in the proto definition), so 0 just means "no timestamp" and can be interpreted that way by the Collector if need be.

@tigrannajaryan
Copy link
Member

OK, so from what I understand the problem is not in SignalFx receiver. The problem is that Splunk HEC exporter converts 0 timestamp, which represent missing value to an actual 0 Unix epoch value here:

Time: timestampToEpochMilliseconds(tsPoint.GetTimestamp()),

I believe this is where a fix would be appropriate. When the input timestamp is "zero", i.e. represents a missing timestamp then the outgoing JSON messsage must not have the "time" field. I believe this would be the correct fix, so that Splunk backend then can correctly populate the current timestamp.

@atoulme do you agree with this?

@nebffa
Copy link
Contributor Author

nebffa commented Oct 14, 2020

@nebffa sorry for replying late, I was away for a few days.

I did some more reviewing of the actual data that's being sent. Some metrics from the SignalFx agent are sent with an explicit timestamp of 0 (i.e. not nil). These events are sent from the otel-contrib-collector to Splunk, which accepts this '0' timestamp without modification, so the events are all sitting at the beginning of epoch time. So I think there should be some way for the otel-contrib-collector to modify that timestamp value in some way.

Just to confirm: are you using SignalFx SmartAgent to collect metrics and then send them to OpenTelemetry Collector, which then sends them to Splunk Enterprise?

@keitwb @jrcamp can you perhaps help with this? Is SmartAgent expected to send metrics with 0 (not nil) timestamps?

@nebffa a different question: do you need to use SignalFx SmartAgent? Can you collect the metrics using OpenTelemetry Collector?

It's fine @tigrannajaryan . I should actually apologise for being impatient. I got caught up subconsciously thinking you were a colleague and decided to 'poke' you after not getting an immediate response - that's my bad!

Yes, we are using the SignalFx SmartAgent to do exactly as you say. We don't need the SmartAgent, and we could achieve similar monitoring using other tools that replace SmartAgent, but we found that it offers us the best immediate coverage across our stack (we are just starting on our metrics journey).

@atoulme
Copy link
Contributor

atoulme commented Oct 16, 2020

@tigrannajaryan yes, now I understand, that's the right fix. That might require us to change the Time field to use a pointer float instead of a float in the event that events have a timestamp of 0.

@tigrannajaryan
Copy link
Member

@tigrannajaryan yes, now I understand, that's the right fix. That might require us to change the Time field to use a pointer float instead of a float in the event that events have a timestamp of 0.

@atoulme yes, correct. @nebffa would you be able to change this PR to fix the Splunk HEC exporter instead?

@nebffa
Copy link
Contributor Author

nebffa commented Oct 19, 2020

@tigrannajaryan yes, now I understand, that's the right fix. That might require us to change the Time field to use a pointer float instead of a float in the event that events have a timestamp of 0.

@atoulme yes, correct. @nebffa would you be able to change this PR to fix the Splunk HEC exporter instead?

Yes, happy to do so - I'll have it up for review probably by around the end of this week.

@nebffa nebffa force-pushed the signalfx-receiver-default-timestamps branch from cff3191 to 0b14e86 Compare October 20, 2020 02:50
if event.Time == nil {
logger.Debug("Unsupported event time", zap.Any("value", event.Event))
return ld, errors.New(nilTimestamp)
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The splunkhecreceiver needs to be changed as it uses the same model. I wasn't sure of what validation you guys would want so I've done something basic here. Previously the timestamps coming into the receiver presumably all had a value so perhaps this is not the best thing to do?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From examples here (https://docs.splunk.com/Documentation/Splunk/8.0.6/Data/HECExamples) looks like time is an optional field, always.
We should accept events with missing time field and just pass them as best we can, without setting the timestamp in that case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok thanks - I'll fix that when I get sometime this week

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So based on what @atoulme says, I assume this if is not needed? If it is a valid situation for the time to be missing why are we erroring on it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks - that was an oversight by me. I've now adjusted this code. Now the behaviour is that in the event of event.Time == nil, we won't set a timestamp on the log records. That means they'll get their default value (0). I don't think there's a way to set the log record timestamp to be nil/empty

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good to me, and if there was any confusion here, that's on me for doing a bad job the first time around.

output: func() pdata.ResourceLogsSlice {
return createLogsSlice("value")
}(),
wantErr: errors.New(nilTimestamp),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Corresponding test for that primitive validation - again, I'm not sure if this is the way you'd like to go with the splunkhecreceiver

@nebffa nebffa force-pushed the signalfx-receiver-default-timestamps branch 2 times, most recently from 3bb1419 to 29df8cb Compare October 20, 2020 03:22
@nebffa
Copy link
Contributor Author

nebffa commented Oct 20, 2020

@tigrannajaryan @atoulme I've now pushed up the proper implementation as per our discussion. There's a particular point that I'd like your input on in the splunkhecreceiver - I've made some comments on the changes which highlight it.

A unit test within splunkhecexporter fails due to the 30s timeout - I don't think I've changed anything that impacts performance, and the tests run fine with this time limit on my machine (insert joke here). Is this a common issue with the build pipeline, or have I done something unintentional?

@atoulme
Copy link
Contributor

atoulme commented Oct 20, 2020

Thank you for the patch. I replied to your changes. Regarding tests, not quite sure. I have seen it happen, maybe a rerun can help.

@nebffa nebffa force-pushed the signalfx-receiver-default-timestamps branch from 29df8cb to c841ef3 Compare October 22, 2020 03:51
@nebffa
Copy link
Contributor Author

nebffa commented Oct 22, 2020

@nebffa please fix commit message / PR description / issue description to match what the change actually is doing.

For the failing test, if it fails randomly file an issue/bug to get it fixed and assign to the component contributor (use git blame to find who wrote the test). I will restart the test if it fails after your final changes to this PR.

Hey @tigrannajaryan , I've gone and made those adjustments to the commit message / PR description / issue description - I hope that's what you were looking for?

The test run succeeded now. As for what you said, it's no test in particular, it's just that all the splunkhecexporter tests together take longer than the 30s limit that's set in the Makefile.Common. I've created this item for that behaviour here - #1346 - but I don't think there's anyone clear to assign it to since there's no specific test to point to

@tigrannajaryan
Copy link
Member

Hey @tigrannajaryan , I've gone and made those adjustments to the commit message / PR description / issue description - I hope that's what you were looking for?

The PR description is

[SignalFxReceiver] Don't send 'zero' timestamps to Splunk HEC

Why does it say "SignalFxReceiver"? This PR does not touch anything in signalfx receiver.

Comment on lines 145 to 151
if val == 0 {
return nil
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a comment to explain why this is important.
Also, I think it is better to check for 0 before doing the conversion to seconds. Technically it may be 0.1 seconds which is not a missing timestamp but rounds to 0 seconds.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I've done this now

Comment on lines 123 to 130
if val == 0 {
return nil
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a comment to explain what this does. I am not sure this is needed. What case does this handle?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a comment here. If I'm not mistaken, it's to handle the case when the event has a timestamp, but it's 0 (as opposed to having no timestamp at all)

@@ -36,7 +36,7 @@ type AccessTokenPassthroughConfig struct {

// Event represents a metric in Splunk HEC format
type Event struct {
Time float64 `json:"time"` // epoch time
Time *float64 `json:"time,omitempty"` // optional epoch time - set to nil if the event has a timestamp set to the start of epoch time
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it not more correct to say this?

Suggested change
Time *float64 `json:"time,omitempty"` // optional epoch time - set to nil if the event has a timestamp set to the start of epoch time
Time *float64 `json:"time,omitempty"` // optional epoch time - set to nil if the event timestamp is missing or unknown

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It depends how we want to frame it. This change is intended to fix the case where metrics are sent to the Open Telemetry collector with a timestamp set to 0. So the timestamp is not technically 'missing or unknown', although in practice it actually is. So I think what you say makes sense - I'll include it in this change

if event.Time == nil {
logger.Debug("Unsupported event time", zap.Any("value", event.Event))
return ld, errors.New(nilTimestamp)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So based on what @atoulme says, I assume this if is not needed? If it is a valid situation for the time to be missing why are we erroring on it?

@nebffa nebffa changed the title [SignalFxReceiver] Don't send 'zero' timestamps to Splunk HEC. [SplunkHecExporter] Don't send 'zero' timestamps to Splunk HEC. Oct 25, 2020
@nebffa nebffa force-pushed the signalfx-receiver-default-timestamps branch from c841ef3 to f8ba329 Compare October 26, 2020 00:25
doubleVal := 1234.5678
var metrics []*metricspb.Metric
for i := 0; i < numberOfDataPoints; i++ {
tsUnix := time.Unix(unixSecs+int64(i), unixNSecs)
tsUnix := time.Unix(int64(i), int64(i)*time.Millisecond.Nanoseconds())
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I adjusted these tests so that we could test logs/metrics for 0-value timestamp event behaviour. Is this okay?

@@ -280,23 +280,23 @@ func TestReceiveTraces(t *testing.T) {
func TestReceiveLogs(t *testing.T) {
actual, err := runLogExport(true, 3, t)
assert.NoError(t, err)
expected := `{"time":0,"host":"myhost","source":"myapp","sourcetype":"myapp-type","event":"mylog","fields":{"custom":"custom"}}`
expected := `{"host":"myhost","source":"myapp","sourcetype":"myapp-type","event":"mylog","fields":{"custom":"custom"}}`
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As per the above comment, these tests are now testing the 0-value timestamp behaviour

},
},
output: func() pdata.ResourceLogsSlice {
return createLogsSlice("value", 0)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nebffa
Copy link
Contributor Author

nebffa commented Oct 26, 2020

Hey @tigrannajaryan , I've gone and made those adjustments to the commit message / PR description / issue description - I hope that's what you were looking for?

The PR description is

[SignalFxReceiver] Don't send 'zero' timestamps to Splunk HEC

Why does it say "SignalFxReceiver"? This PR does not touch anything in signalfx receiver.

Sorry @tigrannajaryan , that was an oversight by me - I've fixed that now.

@nebffa
Copy link
Contributor Author

nebffa commented Oct 26, 2020

The build failed again, this time because of load test memory consumption (81MB over the limit of 80MB). I think that this is unrelated to my change once again, so would you like me to create another item for this?

Copy link
Member

@tigrannajaryan tigrannajaryan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but would like @atoulme or @keitwb to also have a look.

Comment on lines +146 to +149
// some telemetry sources send data with timestamps set to 0 by design, as their original target destinations
// (i.e. before Open Telemetry) are setup with the know-how on how to consume them. In this case,
// we want to omit the time field when sending data to the Splunk HEC so that the HEC adds a timestamp
// at indexing time, which will be much more useful than a 0-epoch-time value.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice comment.

@tigrannajaryan
Copy link
Member

The build failed again, this time because of load test memory consumption (81MB over the limit of 80MB). I think that this is unrelated to my change once again, so would you like me to create another item for this?

@nebffa This should be already fixed and must pass if you rebase from master.

Copy link
Contributor

@atoulme atoulme left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@tigrannajaryan
Copy link
Member

@nebffa please rebase.

@nebffa nebffa force-pushed the signalfx-receiver-default-timestamps branch from f8ba329 to b6e83d3 Compare October 28, 2020 05:35
@codecov
Copy link

codecov bot commented Oct 28, 2020

Codecov Report

Merging #1157 into master will increase coverage by 17.90%.
The diff coverage is 93.75%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master    #1157       +/-   ##
===========================================
+ Coverage   70.87%   88.78%   +17.90%     
===========================================
  Files          29      341      +312     
  Lines        1298    16659    +15361     
===========================================
+ Hits          920    14790    +13870     
- Misses        321     1403     +1082     
- Partials       57      466      +409     
Flag Coverage Δ
#integration 70.87% <ø> (ø)
#unit 88.05% <93.75%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
internal/splunk/common.go 100.00% <ø> (ø)
...eiver/splunkhecreceiver/splunkhec_to_metricdata.go 93.10% <50.00%> (ø)
exporter/splunkhecexporter/logdata_to_splunk.go 93.97% <100.00%> (ø)
exporter/splunkhecexporter/metricdata_to_splunk.go 61.79% <100.00%> (ø)
receiver/splunkhecreceiver/splunk_to_logdata.go 93.67% <100.00%> (ø)
receiver/carbonreceiver/transport/mock_reporter.go 100.00% <0.00%> (ø)
exporter/elasticexporter/exporter.go 75.51% <0.00%> (ø)
exporter/datadogexporter/model.go 70.00% <0.00%> (ø)
exporter/splunkhecexporter/client.go 90.00% <0.00%> (ø)
receiver/statsdreceiver/protocol/statsd_parser.go 97.52% <0.00%> (ø)
... and 324 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 68d65d7...a65f3f8. Read the comment docs.

@nebffa
Copy link
Contributor Author

nebffa commented Oct 28, 2020

@nebffa please rebase.

Hey @tigrannajaryan , I've rebased now but there's a failure due to code coverage. It must be a new requirement since I don't remember it being there two days ago. I'll write another unit test tomorrow to cover the missed code.

@bogdandrutu
Copy link
Member

Try one more time the rebase :)

@nebffa nebffa force-pushed the signalfx-receiver-default-timestamps branch from b6e83d3 to a65f3f8 Compare October 29, 2020 04:02
@nebffa
Copy link
Contributor Author

nebffa commented Oct 29, 2020

Try one more time the rebase :)

I tried this - it's definitely a code coverage issue. I didn't get a chance to write that test today. I probably won't have time for it tomorrow either, so I'll get on this early next week.

@bogdandrutu bogdandrutu merged commit f305e19 into open-telemetry:master Oct 29, 2020
@nebffa
Copy link
Contributor Author

nebffa commented Oct 29, 2020

Oh - that's one way to do it @bogdandrutu ! XD

Thanks for your patience and help everyone @atoulme @tigrannajaryan @bogdandrutu

dyladan referenced this pull request in dynatrace-oss-contrib/opentelemetry-collector-contrib Jan 29, 2021
This makes it a bit more convenient to assemble data structures.

It is also congruent with the existing Go slice functionality with
the same name.
ljmsc referenced this pull request in ljmsc/opentelemetry-collector-contrib Feb 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants