sdk: fix force_flush in batch span processor #397
Conversation
open-telemetry#389 implemented force_flush() for the span processor. For BatchSpanProcessor it was implemented by exposing the already existing _flush() method, which created a race condition: _flush() was intended to be called only from the worker thread, because it uses the export() method, which is not thread safe. As a result, some tests failed randomly after that PR because export() was being executed in two different threads: the worker thread and the user thread calling force_flush().

This commit fixes it by implementing a more sophisticated flush mechanism. When a flush is requested, a special span token is inserted into the spans queue, a flag indicating that a flush operation is in progress is set, and the worker thread is woken up; a condition variable is then monitored, waiting for the worker thread to indicate that the token has been processed. The worker thread has new logic to avoid sleeping (waiting on the condition variable) while a flush operation is in progress, and it notifies the caller (through another condition variable) once the token has been processed.

Fixes: #396
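As a rough illustration of the mechanism described above, the sketch below shows the token/condition-variable handshake. All names (BatchProcessorSketch, _FLUSH_TOKEN, on_end, _worker) are illustrative and simplified, not the actual SDK internals; shutdown and concurrent flush callers are not handled.

```python
import threading
from collections import deque

_FLUSH_TOKEN = object()  # sentinel enqueued to mark a flush request


class BatchProcessorSketch:
    """Illustrative only: spans are queued and exported by a single worker
    thread; force_flush() synchronizes with that worker through a token."""

    def __init__(self, exporter, schedule_delay_seconds=5.0):
        self._exporter = exporter
        self._queue = deque()
        self._condition = threading.Condition()        # wakes the worker
        self._flush_condition = threading.Condition()  # signals the flush caller
        self._flush_requested = False
        self._flush_done = False
        self._delay = schedule_delay_seconds
        self._worker_thread = threading.Thread(target=self._worker, daemon=True)
        self._worker_thread.start()

    def on_end(self, span):
        with self._condition:
            self._queue.append(span)

    def force_flush(self, timeout_millis=30000):
        with self._condition:
            self._queue.append(_FLUSH_TOKEN)  # token marks the flush point
            self._flush_requested = True
            self._flush_done = False
            self._condition.notify_all()      # wake the worker immediately
        with self._flush_condition:
            # Wait for the worker to report that it consumed the token;
            # returns False if the timeout is exceeded, True otherwise.
            return self._flush_condition.wait_for(
                lambda: self._flush_done, timeout=timeout_millis / 1e3
            )

    def _worker(self):
        while True:
            with self._condition:
                # Skip the scheduled sleep while a flush is pending.
                if not self._flush_requested:
                    self._condition.wait(self._delay)
                batch = list(self._queue)
                self._queue.clear()
            spans = [item for item in batch if item is not _FLUSH_TOKEN]
            if spans:
                # export() only ever runs on this thread, so it is never
                # called concurrently (the bug this PR fixes).
                self._exporter.export(spans)
            if any(item is _FLUSH_TOKEN for item in batch):
                with self._flush_condition:
                    self._flush_requested = False
                    self._flush_done = True
                    self._flush_condition.notify_all()
```

The important property is that export() only ever runs on the worker thread; force_flush() just enqueues the token and waits until the worker reports that it has drained past it.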
Codecov Report
```diff
@@            Coverage Diff             @@
##           master     #397      +/-   ##
==========================================
+ Coverage   85.29%   85.32%   +0.03%
==========================================
  Files          38       38
  Lines        1931     1956      +25
  Branches      227      231       +4
==========================================
+ Hits         1647     1669      +22
- Misses        219      221       +2
- Partials       65       66       +1
```
Continue to review full report at Codecov.
Change overall looks good, just a couple of questions in the comments.
timeout_millis: The maximum amount of time to wait for spans to be exported.

Returns:
    False if the timeout is exceeded, True otherwise.
Any thoughts around returning a boolean here rather than raising an exception if the timeout is exceeded?
Right, I updated it to raise a RuntimeError exception. I'm not sure about the type of this exception; there is a timeout exception in Python, but its description doesn't fully align with our case: "Raised when a system function timed out at the system level. Corresponds to errno ETIMEDOUT."
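As a rough sketch, the raise-on-timeout variant described here would be a method fragment along these lines (not the actual SDK code; _do_flush is a hypothetical helper that returns False on timeout):

```python
def force_flush(self, timeout_millis: int = 30000) -> None:
    # Hypothetical variant: raise instead of returning a bool on timeout.
    if not self._do_flush(timeout_millis):  # assumed helper; False means timed out
        # RuntimeError rather than the built-in TimeoutError, whose documentation
        # describes a system-level timeout (errno ETIMEDOUT).
        raise RuntimeError("force_flush() timed out")
```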
(see #397 (comment))
"""Export all ended spans to the configured Exporter that have not | ||
yet been exported. | ||
|
||
An exception is raised if the timeout is exceeeded. |
(continuing from #397 (comment))
I think an exception is kinda OK if we use a default timeout, but I would prefer a boolean if we had a default timeout of infinite. I am slightly in favor of making infinite the default.
But consider https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/error-handling.md#basic-error-handling-principles, which says that we should rather not throw exceptions.
In order to unblock this PR and move ahead, I went back to returning a boolean and implemented a warning as discussed in the SIG meeting.
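The boolean-plus-warning approach mentioned here would look roughly like the following method fragment (again a sketch with the hypothetical _do_flush helper, not the exact SDK code):

```python
import logging

logger = logging.getLogger(__name__)


def force_flush(self, timeout_millis: int = 30000) -> bool:
    # Warn and return False on timeout instead of raising, following the
    # error-handling principles linked above.
    if self._do_flush(timeout_millis):  # assumed helper; False means timed out
        return True
    logger.warning("Timeout was exceeded in force_flush().")
    return False
```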
Quick cosmetic patch to wrap some docstrings:
```diff
diff --git a/opentelemetry-sdk/src/opentelemetry/sdk/trace/__init__.py b/opentelemetry-sdk/src/opentelemetry/sdk/trace/__init__.py
index 7690e319..e4294670 100644
--- a/opentelemetry-sdk/src/opentelemetry/sdk/trace/__init__.py
+++ b/opentelemetry-sdk/src/opentelemetry/sdk/trace/__init__.py
@@ -71,11 +71,12 @@ class SpanProcessor:
         """

     def force_flush(self, timeout_millis: int = 30000) -> bool:
-        """Export all ended spans to the configured Exporter that have not
-        yet been exported.
+        """Export all ended spans to the configured Exporter that have not yet
+        been exported.

         Args:
-            timeout_millis: The maximum amount of time to wait for spans to be exported.
+            timeout_millis: The maximum amount of time to wait for spans to be
+                exported.

         Returns:
             False if the timeout is exceeded, True otherwise.
```
LGTM otherwise!
I would have pushed it myself but it looks like I don't have write access(?) to the kinvolk fork.