[exporter/prometheusremotewrite] hookup WAL #7304
Conversation
This change completes the WAL capability for the Prometheus Remote-Write exporter, allowing recovery of data if prior write requests were not yet exported. It wires up the Write-Ahead Log (WAL) implemented in PR open-telemetry/opentelemetry-collector#3597, which was split off from PR open-telemetry/opentelemetry-collector#3017.
Note: there is a very rare condition under which the same data could be sent more than once. It can occur if the RemoteWrite endpoint has already received the data but the WAL has not yet been truncated when the process receives a Ctrl+C or kill signal. The timing window for this is extremely narrow.
Replaces PR open-telemetry/opentelemetry-collector#3017
Updates PR open-telemetry/opentelemetry-collector#3597
Fixes open-telemetry#4751
Fixes open-telemetry/prometheus-interoperability-spec#9
lgtm overall
go func() {
	prwe.continuallyPopWALThenExport(ctx, func() { close(waitUntilStartedCh) })
}()
<-waitUntilStartedCh
Does the waitUntilStartedCh actually do anything here? It looks like signalStart is called after just creating a timer.
Not sure about the purpose of this. It feels like an attempt to ensure that the scheduler has started work on the just-started goroutine, even when GOMAXPROCS == 1, while avoiding using a sync.Once inside the for loop.
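For context, here is a minimal sketch of the start-signal pattern being discussed; runWALProcessor and the one-minute timer are illustrative stand-ins, not the exporter's actual code. The point is that Start() only returns once the processing goroutine is running and has created its timer.

package main

import (
	"context"
	"fmt"
	"time"
)

// runWALProcessor is a hypothetical stand-in for continuallyPopWALThenExport.
func runWALProcessor(ctx context.Context, signalStart func()) {
	timer := time.NewTimer(time.Minute)
	defer timer.Stop()
	signalStart() // tell the caller this goroutine is running and its timer exists

	for {
		select {
		case <-ctx.Done():
			return
		case <-timer.C:
			// export the accumulated WAL entries here
			timer.Reset(time.Minute)
		}
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	waitUntilStartedCh := make(chan struct{})
	go runWALProcessor(ctx, func() { close(waitUntilStartedCh) })
	<-waitUntilStartedCh // blocks until the processor goroutine has started
	fmt.Println("WAL processor started")
}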
RemoteWriteQueue: RemoteWriteQueue{NumConsumers: 1},
WAL: &WALConfig{
	Directory:         tempDir,
	TruncateFrequency: 60 * time.Microsecond,
What are the implications of this? Wouldn't it be too hard on the CPU of the CI machines? Perhaps 1ms wouldn't be too bad in this case?
Or will it just truncate once?
I've removed this as it doesn't seem to affect the tests. I expect this is gated on the batch size instead of frequency.
case <-timer.C:
	shouldExport = true
default:
	shouldExport = len(reqL) >= maxCountPerUpload
When will this be called? Won't this select always fall into timer.C?
Most often I'd expect this to hit the default case, actually. Unless the timer is very short, I would expect multiple metrics to be processed from the WAL between timer expirations. The default timer length is 1min, and readPrompbFromWAL() looks like it shouldn't block for more than about 8s at the outside.
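To make the flush condition concrete, here is a hedged sketch of the count-or-timer logic described above; maxCountPerUpload, the 50ms timer, and the string "requests" are illustrative values standing in for WAL reads, not the exporter's defaults.

package main

import (
	"fmt"
	"time"
)

func main() {
	const maxCountPerUpload = 3 // illustrative threshold, not the exporter's default
	timer := time.NewTimer(50 * time.Millisecond)
	defer timer.Stop()

	var reqL []string
	for i := 0; i < 10; i++ {
		reqL = append(reqL, fmt.Sprintf("req-%d", i)) // stands in for a request read from the WAL

		shouldExport := false
		select {
		case <-timer.C:
			// The timer fired while reading from the WAL: flush whatever has accumulated.
			shouldExport = true
			timer.Reset(50 * time.Millisecond)
		default:
			// Timer not expired yet: flush only once enough requests have accumulated.
			shouldExport = len(reqL) >= maxCountPerUpload
		}

		if shouldExport {
			fmt.Printf("exporting %d requests\n", len(reqL))
			reqL = reqL[:0]
		}
		time.Sleep(10 * time.Millisecond) // simulate time spent reading the next WAL entry
	}
}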
	return nil
}

if errL := prwe.exportSink(ctx, reqL); len(errL) != 0 {
Shouldn't the lock include the export sink? A concurrent call to this would potentially send the same "front" twice.
I don't think locking is a concern here, but walking back from this point does concern me: an error returned here would stop the goroutine that reads from the WAL, but not the exporter, and there doesn't appear to be a mechanism to restart reading from the WAL.
I need to spend some more time with this.
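One way the restart gap could be closed (not what this PR does, just a sketch to illustrate the concern): wrap the read-and-export loop in a small supervisor that logs the error, backs off, and retries until the exporter shuts down. popWALThenExport here is a hypothetical stand-in for the real loop.

package main

import (
	"context"
	"errors"
	"log"
	"time"
)

// popWALThenExport stands in for the exporter's read-from-WAL-and-export loop.
// Here it always fails, to show the supervisor restarting it.
func popWALThenExport(ctx context.Context) error {
	time.Sleep(100 * time.Millisecond)
	return errors.New("remote write failed")
}

// superviseWALReader keeps the reader loop alive until the context is cancelled,
// restarting it with a short backoff whenever it returns an error.
func superviseWALReader(ctx context.Context) {
	for {
		if err := popWALThenExport(ctx); err != nil {
			log.Printf("WAL reader stopped: %v; restarting", err)
		}
		select {
		case <-ctx.Done():
			return
		case <-time.After(200 * time.Millisecond): // backoff before retrying
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	superviseWALReader(ctx)
}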
There are a few go.mod/sum files that were changed in this PR that don't appear to be related to the change; please revert those.
Merged as: exporter/prometheusremotewrite: glue up and use Write-Ahead-Log (see PR description above), squashing the follow-up commits:
* fix go.mod
* exporter/prw: WALConfig should be exported to allow programmatic manipulation
* Add CHANGELOG entry
* fix lint error
* mod tidy
* tidy up WAL logic and comments
* Handle error from closing WAL
* Ensure WAL processor keeps running after error
* prwe/WAL: refactor WAL run loop
* make gotidy
* lint
* fix data races
* Ensure locking around entire WAL persistence routine
* Address PR feedback
* fix lint issues
* update read index inside WAL reader routine
* Update CHANGELOG.md
* Undo unrelated go.mod changes
Signed-off-by: Anthony J Mirabella <[email protected]>
Co-authored-by: Emmanuel T Odeke <[email protected]>
Co-authored-by: Alex Boten <[email protected]>
Co-authored-by: Juraci Paixão Kröhling <[email protected]>
Description: Enables a write-ahead log for the PRW exporter to provide durable temporary storage of metric data in flight.
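For readers who want to construct the WAL configuration programmatically, here is a fragment based only on the fields visible in the test snippet quoted above (RemoteWriteQueue, Directory, TruncateFrequency); it is assumed to live inside the exporter package, omits the rest of Config, requires the "time" import, and uses illustrative values rather than recommended defaults.

// Assumed to sit inside the prometheusremotewrite exporter package, like the test above.
func exampleWALConfig(walDir string) *Config {
	return &Config{
		RemoteWriteQueue: RemoteWriteQueue{NumConsumers: 1},
		WAL: &WALConfig{
			Directory:         walDir,          // where WAL segments are persisted until exported
			TruncateFrequency: 1 * time.Minute, // how often already-exported entries are pruned
		},
	}
}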