sdk: fix force_flush in batch span processor #397

Merged
6 commits merged from mauricio/fix-force-flush into open-telemetry:master on Feb 7, 2020

Conversation

mauriciovasquezbernal (Member)

#389 implemented force_flush() for the span processor. For BatchSpanProcessor it
was implemented by exposing the already existing _flush() method. This created a
race condition, because _flush() was intended to be called only from the worker
thread: it relies on the export() method, which is not thread safe.

As a result, some tests were failing randomly because export() was being executed
in two different threads, the worker thread and the user thread calling
force_flush().

This commit fixes the problem by implementing a more sophisticated flush
mechanism. When a flush is requested, a special token is inserted into the spans
queue, a flag indicating that a flush operation is in progress is set, and the
worker thread is woken up. The caller then waits on a condition variable until
the worker thread signals that the token has been processed.

The worker thread gets new logic to avoid sleeping (waiting on its condition
variable) while a flush operation is in progress, and it notifies the caller
(through another condition variable) once the token has been processed.

Fixes: #396
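
Below is a minimal, self-contained sketch of that flush-token mechanism. It is
not the actual SDK code: the class name _FlushTokenProcessor, the export
callback, and the deque-based queue are illustrative assumptions, but the
token/flag/two-condition-variable flow follows the description above.

import collections
import threading


class _FlushTokenProcessor:
    """Toy batch processor illustrating the flush-token idea: force_flush()
    enqueues a sentinel, wakes the worker, and waits until the worker reports
    that the sentinel has been consumed."""

    def __init__(self, export):
        self._export = export                  # not thread safe: worker-only
        self._queue = collections.deque()
        self._condition = threading.Condition()        # protects the queue
        self._flush_condition = threading.Condition()  # signals "flush done"
        self._flush_requested = False
        self._flush_token = object()           # sentinel inserted by force_flush
        self._worker = threading.Thread(target=self._worker_loop, daemon=True)
        self._worker.start()

    def on_end(self, span):
        with self._condition:
            self._queue.appendleft(span)
            self._condition.notify()

    def force_flush(self, timeout_millis=30000):
        with self._flush_condition:
            with self._condition:
                self._queue.appendleft(self._flush_token)
                self._flush_requested = True
                self._condition.notify()       # wake the worker immediately
            # Wait for the worker to signal that it processed the token;
            # returns False if the timeout expires first.
            return self._flush_condition.wait(timeout_millis / 1000.0)

    def _worker_loop(self):
        while True:
            with self._condition:
                # Do not go to sleep while a flush is in progress.
                if not self._queue and not self._flush_requested:
                    self._condition.wait()
            self._drain()

    def _drain(self):
        batch, flush_done = [], False
        with self._condition:
            while self._queue:
                item = self._queue.pop()
                if item is self._flush_token:
                    flush_done = True
                    self._flush_requested = False
                else:
                    batch.append(item)
        if batch:
            self._export(batch)                # export() runs only on the worker
        if flush_done:
            with self._flush_condition:
                self._flush_condition.notify_all()

The real processor additionally batches spans and applies a scheduling delay
between exports; those details are left out to keep the sketch focused on the
race fix.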

mauriciovasquezbernal requested a review from a team on February 6, 2020 at 00:16
codecov-io commented on Feb 6, 2020

Codecov Report

Merging #397 into master will increase coverage by 0.03%.
The diff coverage is 87.87%.

@@            Coverage Diff             @@
##           master     #397      +/-   ##
==========================================
+ Coverage   85.29%   85.32%   +0.03%     
==========================================
  Files          38       38              
  Lines        1931     1956      +25     
  Branches      227      231       +4     
==========================================
+ Hits         1647     1669      +22     
- Misses        219      221       +2     
- Partials       65       66       +1
Impacted Files                                          Coverage Δ
...emetry-sdk/src/opentelemetry/sdk/trace/__init__.py   90.74% <100%> (ø) ⬆️
...sdk/src/opentelemetry/sdk/trace/export/__init__.py   86.46% <87.5%> (+0.35%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ae484cb...fcccf2c.

codeboten (Contributor) left a comment:

Change overall looks good, just a couple of questions in the comments.

timeout_millis: The maximum amount of time to wait for spans to be exported.

Returns:
False if the timeout is exceeded, True otherwise.
Contributor:

Any thoughts around returning a boolean here rather than raising an exception if the timeout is exceeded?

Member Author (mauriciovasquezbernal):

Right, I updated it to raise a RuntimeError. I'm not sure about the type of this exception; there is a timeout exception in Python, but its description doesn't fully align with our case: "Raised when a system function timed out at the system level. Corresponds to errno ETIMEDOUT."

Member:

"""Export all ended spans to the configured Exporter that have not
yet been exported.

An exception is raised if the timeout is exceeded.
Member:

(continuing from #397 (comment))

I think an exception is kinda OK if we use a default timeout, but I would prefer a boolean if we had a default timeout of infinite. I am slightly in favor of making Infinite the default.

But consider https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/error-handling.md#basic-error-handling-principles, which says that we should rather not throw exceptions.

mauriciovasquezbernal (Member Author)

In order to unblock this PR and move ahead, I went back to returning a boolean and implemented a warning as discussed in the SIG meeting.
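
A short caller-side sketch of the behaviour settled on here (boolean return plus
a warning, rather than an exception). span_processor stands for any
SpanProcessor implementation, and flush_before_shutdown is a hypothetical
helper, not SDK API.

import logging

logger = logging.getLogger(__name__)

def flush_before_shutdown(span_processor, timeout_millis=30000):
    # force_flush() returns False when the timeout is exceeded instead of raising.
    if not span_processor.force_flush(timeout_millis):
        # Some spans may not have been exported; handle per application policy.
        logger.warning("force_flush timed out after %d ms", timeout_millis)
        return False
    return True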

codeboten self-requested a review on February 7, 2020 at 16:38
c24t (Member) left a comment:

Quick cosmetic patch to wrap some docstrings:

diff --git a/opentelemetry-sdk/src/opentelemetry/sdk/trace/__init__.py b/opentelemetry-sdk/src/opentelemetry/sdk/trace/__init__.py
index 7690e319..e4294670 100644
--- a/opentelemetry-sdk/src/opentelemetry/sdk/trace/__init__.py
+++ b/opentelemetry-sdk/src/opentelemetry/sdk/trace/__init__.py
@@ -71,11 +71,12 @@ class SpanProcessor:
         """

     def force_flush(self, timeout_millis: int = 30000) -> bool:
-        """Export all ended spans to the configured Exporter that have not
-        yet been exported.
+        """Export all ended spans to the configured Exporter that have not yet
+        been exported.

         Args:
-            timeout_millis: The maximum amount of time to wait for spans to be exported.
+            timeout_millis: The maximum amount of time to wait for spans to be
+                exported.

         Returns:
             False if the timeout is exceeded, True otherwise.

LGTM otherwise!

I would have pushed it myself but it looks like I don't have write access(?) to the kinvolk fork.

c24t merged commit ad7a809 into open-telemetry:master on Feb 7, 2020
toumorokoshi pushed a commit to toumorokoshi/opentelemetry-python that referenced this pull request on Feb 17, 2020.
mauriciovasquezbernal deleted the mauricio/fix-force-flush branch on April 14, 2020 at 21:51
srikanthccv pushed a commit to srikanthccv/opentelemetry-python that referenced this pull request on Nov 1, 2020: "fix(plugin-http): ensure no leaks" (closes open-telemetry#397), with @Flarna's recommendations applied.
Development

Successfully merging this pull request may close these issues:

Race condition in export test (#396)