Bug 1679949 - Hacky bug fix: Lower sleep time to one second #1349

badboy · 2020-12-01T15:06:38Z

Partial revert of 2261845

The default-increased sleep time causes issues for short-lived
applications, such as tests (burnham) or command-line tools (if they are
really quick):

Glean initializes asynchronously, so by the time the first ping is
submitted it is not yet done with initialization and scanning the
pending pings directory and therefore asking the uploader to sleep for a
moment and return when the scan finished.
However right now it has no mechanism to communicate a short wait, so
consumers default to the throttle backoff time (which is 60s).

So if a fast client:

Calls Glean.initialize
Records some data and submits a ping
Stops

the uploader will start, get told to wait, sleep for 60s.
Meanwhile Glean-py's shutdown routine asks the upload process to come to
an end within 30s or kills it, which is subsequently does (a sleeping
process will not notice).

By dropping back to a 1s sleep we avoid that situation for now.
A proper fix will allow glean-core to communicate the sleep time correctly.

The downside is that for long-lived Python processes, once we get
throttled it will quickly ask glean-core for tasks again 3 times in a
row, then gets told it's done.
Pending pings will then not be picked up until a new ping is submitted
again (or the process is restarted completely).
That's an acceptable risk given that we don't really have long-running
clients right now.

mdboom · 2020-12-01T15:41:22Z

Meanwhile Glean-py's shutdown routine asks the upload process to come to
an end within 30s or kills it, which is subsequently does (a sleeping
process will not notice).

This isn't by design (or at least isn't by design unless some other bug is causing the design to fail).

The thread dispatcher joins on close, with a timeout of 30s, but the subprocess dispatcher (where the ping uploader runs) has no timeout.

If Configuration.allow_multiprocessing is False, the uploading does happen on a thread. I don't see Burnham setting that, though.

All that is to say, this quick fix is fine for the reasons outlined above -- but perhaps there is a bug that is causing the work in the subprocess to block the thread...

badboy · 2020-12-01T15:47:05Z

Ok, then I am confused.

I ran this minimal script:

from glean import load_metrics, load_pings, Glean
import logging

# logging.getLogger("glean.rust_core").setLevel(logging.DEBUG)
logging.basicConfig(level=logging.DEBUG)

Glean.initialize(
    application_id="jer-py-test",
    application_version="0.1.0",
    upload_enabled=True,
    data_dir="./data"
)

metrics = load_metrics("metrics.yaml")
pings = load_pings("pings.yaml")

metrics.test_only.bad_code.add(1)
pings.test_ping.submit()
metrics.test_only.bad_code.add(2)
pings.test_ping.submit()

print("Submitted!")

and it hangs on 33.4.0 because the uploader got the wait, sleeps for 60s and 30s later I see the Timeout sending Glean telemetry log message on stdout.

mdboom · 2020-12-01T16:13:32Z

Here's the bug: we want to only run one ping uploading subprocess at a time, so we wait for one to finish if one is already running. If we submit two pings in quick succession, this causes the first ping uploading process to block the dispatcher thread. I think there's probably a better, non-blocking way to wait on the subprocess to finish.

Dexterp37 · 2020-12-01T18:04:19Z

Here's the bug: we want to only run one ping uploading subprocess at a time, so we wait for one to finish if one is already running. If we submit two pings in quick succession, this causes the first ping uploading process to block the dispatcher thread. I think there's probably a better, non-blocking way to wait on the subprocess to finish.

@badboy given the above, should this be closed?

badboy · 2020-12-02T10:39:25Z

Here's the bug: we want to only run one ping uploading subprocess at a time, so we wait for one to finish if one is already running. If we submit two pings in quick succession, this causes the first ping uploading process to block the dispatcher thread. I think there's probably a better, non-blocking way to wait on the subprocess to finish.

@badboy given the above, should this be closed?

Nope, we still delay the initial sending by up to 60s if it's too fast. It continues to run in the background but IMO it should be done quickly and not keep that process around sleeping.
We need to land both, then tackle this properly by letting glean-core tell bindings how long to wait.

Partial revert of 2261845 The default-increased sleep time causes issues for short-lived applications, such as tests (burnham) or command-line tools (if they are _really_ quick): Glean initializes asynchronously, so by the time the first ping is submitted it is not yet done with initialization and scanning the pending pings directory and therefore asking the uploader to sleep for a moment and return when the scan finished. However right now it has no mechanism to communicate a short wait, so consumers default to the throttle backoff time (which is 60s). So if a fast client: 1. Calls `Glean.initialize` 2. Records some data and submits a ping 3. Stops the uploader will start, get told to wait, sleep for 60s. Meanwhile Glean-py's shutdown routine asks the upload process to come to an end within 30s or kills it, which is subsequently does (a sleeping process will not notice). By dropping back to a 1s sleep we avoid that situation for now. A proper fix will allow glean-core to communicate the sleep time correctly. The downside is that for long-lived Python processes, once we get throttled it will quickly ask glean-core for tasks again 3 times in a row, then gets told it's done. Pending pings will then not be picked up until a new ping is submitted again (or the process is restarted completely). That's an acceptable risk given that we don't really have long-running clients right now.

badboy requested review from Dexterp37 and brizental December 1, 2020 15:06

badboy force-pushed the revert-extended-sleep branch from 900261d to ec26275 Compare December 1, 2020 15:13

brizental approved these changes Dec 1, 2020

View reviewed changes

badboy force-pushed the revert-extended-sleep branch from ec26275 to ab5efb6 Compare December 2, 2020 10:50

badboy merged commit ccb4c1f into main Dec 2, 2020

badboy deleted the revert-extended-sleep branch December 2, 2020 10:58

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bug 1679949 - Hacky bug fix: Lower sleep time to one second #1349

Bug 1679949 - Hacky bug fix: Lower sleep time to one second #1349

badboy commented Dec 1, 2020

mdboom commented Dec 1, 2020

badboy commented Dec 1, 2020

mdboom commented Dec 1, 2020 •

edited

Loading

Dexterp37 commented Dec 1, 2020

badboy commented Dec 2, 2020

Bug 1679949 - Hacky bug fix: Lower sleep time to one second #1349

Bug 1679949 - Hacky bug fix: Lower sleep time to one second #1349

Conversation

badboy commented Dec 1, 2020

mdboom commented Dec 1, 2020

badboy commented Dec 1, 2020

mdboom commented Dec 1, 2020 • edited Loading

Dexterp37 commented Dec 1, 2020

badboy commented Dec 2, 2020

mdboom commented Dec 1, 2020 •

edited

Loading