Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug 1679949 - Hacky bug fix: Lower sleep time to one second #1349

Merged
merged 1 commit into from
Dec 2, 2020

Conversation

badboy
Copy link
Member

@badboy badboy commented Dec 1, 2020

Partial revert of 2261845

The default-increased sleep time causes issues for short-lived
applications, such as tests (burnham) or command-line tools (if they are
really quick):

Glean initializes asynchronously, so by the time the first ping is
submitted it is not yet done with initialization and scanning the
pending pings directory and therefore asking the uploader to sleep for a
moment and return when the scan finished.
However right now it has no mechanism to communicate a short wait, so
consumers default to the throttle backoff time (which is 60s).

So if a fast client:

  1. Calls Glean.initialize
  2. Records some data and submits a ping
  3. Stops

the uploader will start, get told to wait, sleep for 60s.
Meanwhile Glean-py's shutdown routine asks the upload process to come to
an end within 30s or kills it, which is subsequently does (a sleeping
process will not notice).

By dropping back to a 1s sleep we avoid that situation for now.
A proper fix will allow glean-core to communicate the sleep time correctly.

The downside is that for long-lived Python processes, once we get
throttled it will quickly ask glean-core for tasks again 3 times in a
row, then gets told it's done.
Pending pings will then not be picked up until a new ping is submitted
again (or the process is restarted completely).
That's an acceptable risk given that we don't really have long-running
clients right now.

@badboy badboy requested review from Dexterp37 and brizental December 1, 2020 15:06
@badboy badboy force-pushed the revert-extended-sleep branch from 900261d to ec26275 Compare December 1, 2020 15:13
@mdboom
Copy link
Contributor

mdboom commented Dec 1, 2020

Meanwhile Glean-py's shutdown routine asks the upload process to come to
an end within 30s or kills it, which is subsequently does (a sleeping
process will not notice).

This isn't by design (or at least isn't by design unless some other bug is causing the design to fail).

The thread dispatcher joins on close, with a timeout of 30s, but the subprocess dispatcher (where the ping uploader runs) has no timeout.

If Configuration.allow_multiprocessing is False, the uploading does happen on a thread. I don't see Burnham setting that, though.

All that is to say, this quick fix is fine for the reasons outlined above -- but perhaps there is a bug that is causing the work in the subprocess to block the thread...

@badboy
Copy link
Member Author

badboy commented Dec 1, 2020

Ok, then I am confused.

I ran this minimal script:

from glean import load_metrics, load_pings, Glean
import logging

# logging.getLogger("glean.rust_core").setLevel(logging.DEBUG)
logging.basicConfig(level=logging.DEBUG)

Glean.initialize(
    application_id="jer-py-test",
    application_version="0.1.0",
    upload_enabled=True,
    data_dir="./data"
)

metrics = load_metrics("metrics.yaml")
pings = load_pings("pings.yaml")

metrics.test_only.bad_code.add(1)
pings.test_ping.submit()
metrics.test_only.bad_code.add(2)
pings.test_ping.submit()

print("Submitted!")

and it hangs on 33.4.0 because the uploader got the wait, sleeps for 60s and 30s later I see the Timeout sending Glean telemetry log message on stdout.

@mdboom
Copy link
Contributor

mdboom commented Dec 1, 2020

Here's the bug: we want to only run one ping uploading subprocess at a time, so we wait for one to finish if one is already running. If we submit two pings in quick succession, this causes the first ping uploading process to block the dispatcher thread. I think there's probably a better, non-blocking way to wait on the subprocess to finish.

@Dexterp37
Copy link
Contributor

Here's the bug: we want to only run one ping uploading subprocess at a time, so we wait for one to finish if one is already running. If we submit two pings in quick succession, this causes the first ping uploading process to block the dispatcher thread. I think there's probably a better, non-blocking way to wait on the subprocess to finish.

@badboy given the above, should this be closed?

@badboy
Copy link
Member Author

badboy commented Dec 2, 2020

Here's the bug: we want to only run one ping uploading subprocess at a time, so we wait for one to finish if one is already running. If we submit two pings in quick succession, this causes the first ping uploading process to block the dispatcher thread. I think there's probably a better, non-blocking way to wait on the subprocess to finish.

@badboy given the above, should this be closed?

Nope, we still delay the initial sending by up to 60s if it's too fast. It continues to run in the background but IMO it should be done quickly and not keep that process around sleeping.
We need to land both, then tackle this properly by letting glean-core tell bindings how long to wait.

Partial revert of 2261845

The default-increased sleep time causes issues for short-lived
applications, such as tests (burnham) or command-line tools (if they are
_really_ quick):

Glean initializes asynchronously, so by the time the first ping is
submitted it is not yet done with initialization and scanning the
pending pings directory and therefore asking the uploader to sleep for a
moment and return when the scan finished.
However right now it has no mechanism to communicate a short wait, so
consumers default to the throttle backoff time (which is 60s).

So if a fast client:
1. Calls `Glean.initialize`
2. Records some data and submits a ping
3. Stops

the uploader will start, get told to wait, sleep for 60s.
Meanwhile Glean-py's shutdown routine asks the upload process to come to
an end within 30s or kills it, which is subsequently does (a sleeping
process will not notice).

By dropping back to a 1s sleep we avoid that situation for now.
A proper fix will allow glean-core to communicate the sleep time correctly.

The downside is that for long-lived Python processes, once we get
throttled it will quickly ask glean-core for tasks again 3 times in a
row, then gets told it's done.
Pending pings will then not be picked up until a new ping is submitted
again (or the process is restarted completely).
That's an acceptable risk given that we don't really have long-running
clients right now.
@badboy badboy force-pushed the revert-extended-sleep branch from ec26275 to ab5efb6 Compare December 2, 2020 10:50
@badboy badboy merged commit ccb4c1f into main Dec 2, 2020
@badboy badboy deleted the revert-extended-sleep branch December 2, 2020 10:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants