sdk: fix force_flush in batch span processor #397

Merged
6 commits merged from mauricio/fix-force-flush into open-telemetry:master on Feb 7, 2020

Conversation

mauriciovasquezbernal (Member)

#389 implemented force_flush() for the span processor. For BatchSpanProcessor it
was implemented by exposing the already existing _flush() method. This created a
race condition, because _flush() was intended to be called only from the worker
thread: it relies on the export() method, which is not thread safe.

As a result, some tests were failing randomly because export() was being executed
in two different threads, the worker thread and the user thread calling
force_flush().

This commit fixes the problem by implementing a more sophisticated flush
mechanism. When a flush is requested, a special token is inserted into the spans
queue, a flag indicating that a flush operation is in progress is set, and the
worker thread is woken up. The caller then waits on a condition variable until
the worker thread signals that the token has been processed.

The worker thread gets new logic to avoid sleeping (waiting on its condition
variable) while a flush operation is in progress, and it notifies the caller
(through another condition variable) once the token has been processed.

Fixes: #396
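
Below is a minimal, self-contained sketch of that flush-token mechanism. It is
not the actual SDK code: the class name _FlushTokenProcessor, the export
callback, and the deque-based queue are illustrative assumptions, but the
token/flag/two-condition-variable flow follows the description above.

import collections
import threading


class _FlushTokenProcessor:
    """Toy batch processor illustrating the flush-token idea: force_flush()
    enqueues a sentinel, wakes the worker, and waits until the worker reports
    that the sentinel has been consumed."""

    def __init__(self, export):
        self._export = export                  # not thread safe: worker-only
        self._queue = collections.deque()
        self._condition = threading.Condition()        # protects the queue
        self._flush_condition = threading.Condition()  # signals "flush done"
        self._flush_requested = False
        self._flush_token = object()           # sentinel inserted by force_flush
        self._worker = threading.Thread(target=self._worker_loop, daemon=True)
        self._worker.start()

    def on_end(self, span):
        with self._condition:
            self._queue.appendleft(span)
            self._condition.notify()

    def force_flush(self, timeout_millis=30000):
        with self._flush_condition:
            with self._condition:
                self._queue.appendleft(self._flush_token)
                self._flush_requested = True
                self._condition.notify()       # wake the worker immediately
            # Wait for the worker to signal that it processed the token;
            # returns False if the timeout expires first.
            return self._flush_condition.wait(timeout_millis / 1000.0)

    def _worker_loop(self):
        while True:
            with self._condition:
                # Do not go to sleep while a flush is in progress.
                if not self._queue and not self._flush_requested:
                    self._condition.wait()
            self._drain()

    def _drain(self):
        batch, flush_done = [], False
        with self._condition:
            while self._queue:
                item = self._queue.pop()
                if item is self._flush_token:
                    flush_done = True
                    self._flush_requested = False
                else:
                    batch.append(item)
        if batch:
            self._export(batch)                # export() runs only on the worker
        if flush_done:
            with self._flush_condition:
                self._flush_condition.notify_all()

The real processor additionally batches spans and applies a scheduling delay
between exports; those details are left out to keep the sketch focused on the
race fix.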

mauriciovasquezbernal requested a review from a team on February 6, 2020 at 00:16
codecov-io commented on Feb 6, 2020

Codecov Report

Merging #397 into master will increase coverage by 0.03%.
The diff coverage is 87.87%.

@@            Coverage Diff             @@
##           master     #397      +/-   ##
==========================================
+ Coverage   85.29%   85.32%   +0.03%     
==========================================
  Files          38       38              
  Lines        1931     1956      +25     
  Branches      227      231       +4     
==========================================
+ Hits         1647     1669      +22     
- Misses        219      221       +2     
- Partials       65       66       +1
Impacted Files                                          Coverage Δ
...emetry-sdk/src/opentelemetry/sdk/trace/__init__.py   90.74% <100%> (ø) ⬆️
...sdk/src/opentelemetry/sdk/trace/export/__init__.py   86.46% <87.5%> (+0.35%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ae484cb...fcccf2c.

codeboten (Contributor) left a comment:

Change overall looks good, just a couple of questions in the comments.

timeout_millis: The maximum amount of time to wait for spans to be exported.

Returns:
False if the timeout is exceeded, True otherwise.
Contributor:

Any thoughts around returning a boolean here rather than raising an exception if the timeout is exceeded?

Member Author (mauriciovasquezbernal):

Right, I updated it to raise a RuntimeError. I'm not sure about the type of this exception; there is a timeout exception in Python, but its description doesn't fully align with our case: "Raised when a system function timed out at the system level. Corresponds to errno ETIMEDOUT."

Member:

"""Export all ended spans to the configured Exporter that have not
yet been exported.

An exception is raised if the timeout is exceeded.
Member:

(continuing from #397 (comment))

I think an exception is kinda OK if we use a default timeout, but I would prefer a boolean if we had a default timeout of infinite. I am slightly in favor of making Infinite the default.

But consider https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/error-handling.md#basic-error-handling-principles, which says that we should rather not throw exceptions.

mauriciovasquezbernal (Member Author)

In order to unblock this PR and move ahead, I went back to returning a boolean and implemented a warning as discussed in the SIG meeting.
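
A short caller-side sketch of the behaviour settled on here (boolean return plus
a warning, rather than an exception). span_processor stands for any
SpanProcessor implementation, and flush_before_shutdown is a hypothetical
helper, not SDK API.

import logging

logger = logging.getLogger(__name__)

def flush_before_shutdown(span_processor, timeout_millis=30000):
    # force_flush() returns False when the timeout is exceeded instead of raising.
    if not span_processor.force_flush(timeout_millis):
        # Some spans may not have been exported; handle per application policy.
        logger.warning("force_flush timed out after %d ms", timeout_millis)
        return False
    return True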

codeboten self-requested a review on February 7, 2020 at 16:38
c24t (Member) left a comment:

Quick cosmetic patch to wrap some docstrings:

diff --git a/opentelemetry-sdk/src/opentelemetry/sdk/trace/__init__.py b/opentelemetry-sdk/src/opentelemetry/sdk/trace/__init__.py
index 7690e319..e4294670 100644
--- a/opentelemetry-sdk/src/opentelemetry/sdk/trace/__init__.py
+++ b/opentelemetry-sdk/src/opentelemetry/sdk/trace/__init__.py
@@ -71,11 +71,12 @@ class SpanProcessor:
         """

     def force_flush(self, timeout_millis: int = 30000) -> bool:
-        """Export all ended spans to the configured Exporter that have not
-        yet been exported.
+        """Export all ended spans to the configured Exporter that have not yet
+        been exported.

         Args:
-            timeout_millis: The maximum amount of time to wait for spans to be exported.
+            timeout_millis: The maximum amount of time to wait for spans to be
+                exported.

         Returns:
             False if the timeout is exceeded, True otherwise.

LGTM otherwise!

I would have pushed it myself but it looks like I don't have write access(?) to the kinvolk fork.

c24t merged commit ad7a809 into open-telemetry:master on Feb 7, 2020
toumorokoshi pushed a commit to toumorokoshi/opentelemetry-python that referenced this pull request on Feb 17, 2020.
mauriciovasquezbernal deleted the mauricio/fix-force-flush branch on April 14, 2020 at 21:51
srikanthccv pushed a commit to srikanthccv/opentelemetry-python that referenced this pull request on Nov 1, 2020: "fix(plugin-http): ensure no leaks" (closes open-telemetry#397), with @Flarna's recommendations applied.
Development

Successfully merging this pull request may close these issues:

Race condition in export test (#396)