-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SplunkHecExporter] Don't send 'zero' timestamps to Splunk HEC. #1157
[SplunkHecExporter] Don't send 'zero' timestamps to Splunk HEC. #1157
Conversation
@nebffa please sign the CLA so that we can review the PR. |
3d3f0e9
to
cff3191
Compare
Sorry @tigrannajaryan , I had committed under the wrong account - I've fixed that now. |
Normally Collector's philosophy is that we don’t alter the data unless explicitly configured to do so by the user. This change does not look right to me. If we received data with a missing timestamp we need to forward it with a missing timestamp. If missing timestamp is considered an invalid situation then it either needs to be fixed at the source that emits the metric or if interpretation of "missing means now" is desirable then it needs to be a backend behavior. It is not Collector's business to make decisions like this. Also in SignalFx missing timestamp just means "this is the latest datapoint" where the exact time is kind of irrelevant. I wonder what exactly is the problem with #965 |
@tigrannajaryan that philosophy sounds fine. In my case I am sending data to a Splunk HEC endpoint - given that, perhaps it's better to configure default timestamps for the Splunk HEC exporter? |
Yes, it would make sense to do so if missing timestampt is considered an invalid input for Splunk HEC endpoint. I do not know if it is the case. Do you know? @atoulme perhaps you can help with this if you know? |
I checked with my team ; Splunk will add an index timestamp if no timestamp is present. So there is no absolute need for a timestamp to be present. Of course it's better to have a timestamp on the record. |
I would offer that maybe a processor step could enrich the data explicitly to add a timestamp to the records being processed. That sounds at least like something that folks can opt in into. |
Hey @atoulme , by default Splunk doesn't do this. I've tested it thoroughly locally. There are ways to have Splunk add custom/default timestamps to incoming events/metrics but it appears to be more complicated with the Splunk HTTP Event Collector. I could not get it working even after a few hours of trying. Perhaps I was doing something wrong, but at the very least it wasn't 'dead-easy' to configure. As for adding the timestamps - I am happy to do it with a processor or in the Splunk HEC exporter. Is there a preference? |
@nebffa which Splunk backend are you sending to? I believe @atoulme is right and that is also what HEC description says: https://docs.splunk.com/Documentation/Splunk/8.0.6/Data/FormateventsforHTTPEventCollector It says "These keys are all optional", so it should be OK not to have the timestamp. It may be worth to try to clarify this with whatever Splunk backend you are sending to. If you don't know who to ask I can try to find the right person at Splunk, but I need to know which backend. |
@tigrannajaryan @atoulme I did some more reviewing of the actual data that's being sent. Some metrics from the SignalFx agent are sent with an explicit timestamp of 0 (i.e. not To clarify, this doesn't happen for all SignalFx metrics - some monitors emit metrics with timestamps - e.g.https://docs.signalfx.com/en/latest/integrations/agent/monitors/collectd-disk.html, and some monitors emit emtrics without timestamps - e.g. https://docs.signalfx.com/en/latest/integrations/agent/monitors/cpu.html. As for the Splunk backend @tigrannajaryan . I'm not exactly sure what you mean - but we are sending data to a Splunk Enterprise instance, version 8.0.5.1, and we are using the |
Hey @tigrannajaryan @atoulme - any thoughts on the above? |
@nebffa sorry for replying late, I was away for a few days.
Just to confirm: are you using SignalFx SmartAgent to collect metrics and then send them to OpenTelemetry Collector, which then sends them to Splunk Enterprise? @keitwb @jrcamp can you perhaps help with this? Is SmartAgent expected to send metrics with 0 (not nil) timestamps? @nebffa a different question: do you need to use SignalFx SmartAgent? Can you collect the metrics using OpenTelemetry Collector? |
@tigrannajaryan Yes, the Smart Agent generally does not set timestamps on metrics as every datapoint represents the "latest" data available and the exact moment it was sampled is not particularly important. In the Metrics protobuf, there is no way to send |
OK, so from what I understand the problem is not in SignalFx receiver. The problem is that Splunk HEC exporter converts 0 timestamp, which represent missing value to an actual 0 Unix epoch value here: opentelemetry-collector-contrib/exporter/splunkhecexporter/metricdata_to_splunk.go Line 103 in 8d6f04c
I believe this is where a fix would be appropriate. When the input timestamp is "zero", i.e. represents a missing timestamp then the outgoing JSON messsage must not have the "time" field. I believe this would be the correct fix, so that Splunk backend then can correctly populate the current timestamp. @atoulme do you agree with this? |
It's fine @tigrannajaryan . I should actually apologise for being impatient. I got caught up subconsciously thinking you were a colleague and decided to 'poke' you after not getting an immediate response - that's my bad! Yes, we are using the SignalFx SmartAgent to do exactly as you say. We don't need the SmartAgent, and we could achieve similar monitoring using other tools that replace SmartAgent, but we found that it offers us the best immediate coverage across our stack (we are just starting on our metrics journey). |
@tigrannajaryan yes, now I understand, that's the right fix. That might require us to change the Time field to use a pointer float instead of a float in the event that events have a timestamp of 0. |
@atoulme yes, correct. @nebffa would you be able to change this PR to fix the Splunk HEC exporter instead? |
Yes, happy to do so - I'll have it up for review probably by around the end of this week. |
cff3191
to
0b14e86
Compare
if event.Time == nil { | ||
logger.Debug("Unsupported event time", zap.Any("value", event.Event)) | ||
return ld, errors.New(nilTimestamp) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The splunkhecreceiver
needs to be changed as it uses the same model. I wasn't sure of what validation you guys would want so I've done something basic here. Previously the timestamps coming into the receiver presumably all had a value so perhaps this is not the best thing to do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From examples here (https://docs.splunk.com/Documentation/Splunk/8.0.6/Data/HECExamples) looks like time is an optional field, always.
We should accept events with missing time field and just pass them as best we can, without setting the timestamp in that case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok thanks - I'll fix that when I get sometime this week
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So based on what @atoulme says, I assume this if
is not needed? If it is a valid situation for the time to be missing why are we erroring on it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks - that was an oversight by me. I've now adjusted this code. Now the behaviour is that in the event of event.Time == nil
, we won't set a timestamp on the log records. That means they'll get their default value (0). I don't think there's a way to set the log record timestamp to be nil/empty
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good to me, and if there was any confusion here, that's on me for doing a bad job the first time around.
output: func() pdata.ResourceLogsSlice { | ||
return createLogsSlice("value") | ||
}(), | ||
wantErr: errors.New(nilTimestamp), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Corresponding test for that primitive validation - again, I'm not sure if this is the way you'd like to go with the splunkhecreceiver
3bb1419
to
29df8cb
Compare
@tigrannajaryan @atoulme I've now pushed up the proper implementation as per our discussion. There's a particular point that I'd like your input on in the A unit test within |
Thank you for the patch. I replied to your changes. Regarding tests, not quite sure. I have seen it happen, maybe a rerun can help. |
29df8cb
to
c841ef3
Compare
Hey @tigrannajaryan , I've gone and made those adjustments to the commit message / PR description / issue description - I hope that's what you were looking for? The test run succeeded now. As for what you said, it's no test in particular, it's just that all the |
The PR description is
Why does it say "SignalFxReceiver"? This PR does not touch anything in signalfx receiver. |
if val == 0 { | ||
return nil | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a comment to explain why this is important.
Also, I think it is better to check for 0 before doing the conversion to seconds. Technically it may be 0.1 seconds which is not a missing timestamp but rounds to 0 seconds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I've done this now
if val == 0 { | ||
return nil | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a comment to explain what this does. I am not sure this is needed. What case does this handle?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added a comment here. If I'm not mistaken, it's to handle the case when the event has a timestamp, but it's 0 (as opposed to having no timestamp at all)
internal/common/splunk/common.go
Outdated
@@ -36,7 +36,7 @@ type AccessTokenPassthroughConfig struct { | |||
|
|||
// Event represents a metric in Splunk HEC format | |||
type Event struct { | |||
Time float64 `json:"time"` // epoch time | |||
Time *float64 `json:"time,omitempty"` // optional epoch time - set to nil if the event has a timestamp set to the start of epoch time |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it not more correct to say this?
Time *float64 `json:"time,omitempty"` // optional epoch time - set to nil if the event has a timestamp set to the start of epoch time | |
Time *float64 `json:"time,omitempty"` // optional epoch time - set to nil if the event timestamp is missing or unknown |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It depends how we want to frame it. This change is intended to fix the case where metrics are sent to the Open Telemetry collector with a timestamp set to 0. So the timestamp is not technically 'missing or unknown', although in practice it actually is. So I think what you say makes sense - I'll include it in this change
if event.Time == nil { | ||
logger.Debug("Unsupported event time", zap.Any("value", event.Event)) | ||
return ld, errors.New(nilTimestamp) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So based on what @atoulme says, I assume this if
is not needed? If it is a valid situation for the time to be missing why are we erroring on it?
c841ef3
to
f8ba329
Compare
doubleVal := 1234.5678 | ||
var metrics []*metricspb.Metric | ||
for i := 0; i < numberOfDataPoints; i++ { | ||
tsUnix := time.Unix(unixSecs+int64(i), unixNSecs) | ||
tsUnix := time.Unix(int64(i), int64(i)*time.Millisecond.Nanoseconds()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I adjusted these tests so that we could test logs/metrics for 0-value timestamp event behaviour. Is this okay?
@@ -280,23 +280,23 @@ func TestReceiveTraces(t *testing.T) { | |||
func TestReceiveLogs(t *testing.T) { | |||
actual, err := runLogExport(true, 3, t) | |||
assert.NoError(t, err) | |||
expected := `{"time":0,"host":"myhost","source":"myapp","sourcetype":"myapp-type","event":"mylog","fields":{"custom":"custom"}}` | |||
expected := `{"host":"myhost","source":"myapp","sourcetype":"myapp-type","event":"mylog","fields":{"custom":"custom"}}` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As per the above comment, these tests are now testing the 0-value timestamp behaviour
}, | ||
}, | ||
output: func() pdata.ResourceLogsSlice { | ||
return createLogsSlice("value", 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Timestamp is set to 0 on the log record, as mentioned in https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/1157/files#r511671475
Sorry @tigrannajaryan , that was an oversight by me - I've fixed that now. |
The build failed again, this time because of load test memory consumption (81MB over the limit of 80MB). I think that this is unrelated to my change once again, so would you like me to create another item for this? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// some telemetry sources send data with timestamps set to 0 by design, as their original target destinations | ||
// (i.e. before Open Telemetry) are setup with the know-how on how to consume them. In this case, | ||
// we want to omit the time field when sending data to the Splunk HEC so that the HEC adds a timestamp | ||
// at indexing time, which will be much more useful than a 0-epoch-time value. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice comment.
@nebffa This should be already fixed and must pass if you rebase from master. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@nebffa please rebase. |
f8ba329
to
b6e83d3
Compare
Codecov Report
@@ Coverage Diff @@
## master #1157 +/- ##
===========================================
+ Coverage 70.87% 88.78% +17.90%
===========================================
Files 29 341 +312
Lines 1298 16659 +15361
===========================================
+ Hits 920 14790 +13870
- Misses 321 1403 +1082
- Partials 57 466 +409
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
Hey @tigrannajaryan , I've rebased now but there's a failure due to code coverage. It must be a new requirement since I don't remember it being there two days ago. I'll write another unit test tomorrow to cover the missed code. |
Try one more time the rebase :) |
b6e83d3
to
a65f3f8
Compare
I tried this - it's definitely a code coverage issue. I didn't get a chance to write that test today. I probably won't have time for it tomorrow either, so I'll get on this early next week. |
Oh - that's one way to do it @bogdandrutu ! XD Thanks for your patience and help everyone @atoulme @tigrannajaryan @bogdandrutu |
This makes it a bit more convenient to assemble data structures. It is also congruent with the existing Go slice functionality with the same name.
Don't send 'zero' timestamps to Splunk HEC. - see #965