Backoff not sufficient #425
@leplatrem Can you sanity check this issue? It seems we could simply increase the backoff retries and wait seconds to higher numbers, since it only matters for the lambda function.
Here's another example. At the time of writing this, https://archive.mozilla.org/pub/firefox/nightly/2018/04/2018-04-25-10-01-22-mozilla-central/firefox-61.0a1.en-US.linux-i686.json is a perfectly fine 200 OK. But in https://sentry.prod.mozaws.net/operations/buildhub-stage/issues/4284392/ it failed with a ClientResponseError.
Indeed, we could retry more times, or switch to some exponential interval (https://github.com/litl/backoff/blob/master/backoff/_wait_gen.py); 5 min would give us a pretty good margin (see the sketch after this comment). However, we have to be prepared for the fact that sometimes the …
buildhub/jobs/buildhub/lambda_s3_event.py, lines 223 to 224 in 51d9563
Which is different from the case in https://sentry.prod.mozaws.net/operations/buildhub-stage/issues/4284392/ where we can see in the logs:
Honestly, in this case I find it very weird that it takes so much time from the S3 event to the file's appearance on the JSON API via the bucket lister. Maybe oremj has ideas... Otherwise, using an AWS client we may have more of a chance to fetch it immediately...
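A minimal sketch of the exponential-interval idea above, using the litl/backoff library linked in that comment. The `fetch_json` name, the aiohttp session handling, and the 5-minute budget are illustrative assumptions, not buildhub's actual code.

```python
# A minimal sketch, assuming the litl/backoff library linked above; the
# fetch_json name and the 5-minute budget are illustrative, not buildhub's
# actual code.
import aiohttp
import backoff


@backoff.on_exception(
    backoff.expo,                    # exponential waits (up to 1s, 2s, 4s, ... with default jitter)
    aiohttp.ClientResponseError,     # the error reported in Sentry
    max_time=300,                    # stop retrying after ~5 minutes in total
)
async def fetch_json(session: aiohttp.ClientSession, url: str):
    """Fetch a URL and return its JSON body, retrying on HTTP errors."""
    async with session.get(url) as response:
        response.raise_for_status()  # raises ClientResponseError on 4xx/5xx
        return await response.json()
```

With `backoff.expo` the waits roughly double on each attempt, so a `max_time` of 300 seconds comfortably covers many more attempts than the current three.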
In the Sentry entry it said it first failed, then backed off for 0.4s, then backed off a second time for 1.2s, and ultimately gave up when the 3rd attempt failed. I honestly don't know where that 0.4 comes from. Why 0.4 and not 0.5 or 0.1 or 123.456?
Meaning, if we change from 3 max. retries to 5 max. retries we'll sleep 0.4 + 1.2 + 3.6 + 10.8 seconds. Total of 16 seconds. Change it to 6 and you get a total max. sleep of 48.4 seconds. Surely that should be enough. That's almost a whole minute.
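For reference, here is the arithmetic behind those totals, assuming each wait is 3x the previous one starting at 0.4s (the pattern the 0.4s/1.2s Sentry numbers suggest; the actual wait generator may differ):

```python
# Back-of-the-envelope check of the totals above, assuming waits of
# 0.4s, 1.2s, 3.6s, ... (each 3x the previous); this is an inference from
# the Sentry numbers, not a confirmed buildhub setting.
def total_sleep(max_tries: int, base: float = 0.4, ratio: float = 3.0) -> float:
    """N attempts means N - 1 waits between them; return their sum in seconds."""
    return sum(base * ratio ** i for i in range(max_tries - 1))


for tries in (3, 5, 6):
    print(f"{tries} tries -> {total_sleep(tries):.1f}s total sleep")
# 3 tries -> 1.6s, 5 tries -> 16.0s, 6 tries -> 48.4s
```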
#432 just sets it to 6. I'll check with Wei that this isn't overwritten in the env of Stage or Prod. I'm not excited to dwell on this much more. 48 seconds is well under 5 minutes (the AWS Lambda max) and if, God forbid, it takes longer than 48 seconds, the scraper will have to fix it later. Also, when we get the new …
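A hedged illustration of the "overwritten in the env" concern: if the retry count is read from an environment variable, a Stage or Prod setting could silently shadow the new default from #432. `BACKOFF_MAX_TRIES` is a hypothetical name, not a known buildhub setting.

```python
# Hypothetical illustration only; BACKOFF_MAX_TRIES is an assumed variable
# name, not confirmed buildhub configuration.
import os

# The code default of 6 (from #432) only wins if Stage/Prod don't set the
# environment variable to something else.
MAX_TRIES = int(os.environ.get("BACKOFF_MAX_TRIES", "6"))
print("Effective max retries:", MAX_TRIES)
```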
It happened again :(
See https://sentry.prod.mozaws.net/operations/buildhub-stage/issues/1594248/
I wrote in my comment:
This is a lambda event that depends on fetching https://archive.mozilla.org/pub/firefox/nightly/2018/04/2018-04-24-01-36-04-mozilla-central-l10n/ and it tried 3 times and eventually had to give up. Since the URL now 200 OKs, it means it didn't wait long enough.
We recently made it so that `latest-inventory-to-kinto` doesn't do backoff in `fetch_json`; it only does that in the lambda function. We can now increase the backoff configuration to either try more times or use longer pauses.
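A sketch of what "increase the backoff configuration" could look like now that only the lambda path retries: the retry policy is built separately and applied around the plain fetch, so the lambda can choose more tries or longer pauses. Names, parameters, and defaults here are illustrative assumptions, not buildhub's actual signatures.

```python
# Illustrative sketch only: fetch_json, retrying, max_tries, and max_time are
# assumed names/parameters, not buildhub's actual API.
import aiohttp
import backoff


def retrying(max_tries: int = 6, max_time: float = 300.0):
    """Build a retry decorator the lambda can apply around fetch_json."""
    return backoff.on_exception(
        backoff.expo,
        aiohttp.ClientResponseError,
        max_tries=max_tries,
        max_time=max_time,
    )


async def fetch_json(session: aiohttp.ClientSession, url: str):
    """Plain fetch with no retries, as used outside the lambda."""
    async with session.get(url) as response:
        response.raise_for_status()
        return await response.json()


# Only the lambda wraps it with backoff; bump max_tries or max_time here.
fetch_json_with_retries = retrying(max_tries=6)(fetch_json)
```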