Backoff not sufficient #425
@leplatrem Can you sanity check this issue? It seems we could simply increase the backoff retries and wait seconds to higher numbers, since it only matters for the lambda function.
Here's another example. At the time of writing this, https://archive.mozilla.org/pub/firefox/nightly/2018/04/2018-04-25-10-01-22-mozilla-central/firefox-61.0a1.en-US.linux-i686.json is a perfectly fine 200 OK. But in https://sentry.prod.mozaws.net/operations/buildhub-stage/issues/4284392/ it failed with a ClientResponseError.
Indeed, we could retry more times, or switch to some exponential interval (https://github.com/litl/backoff/blob/master/backoff/_wait_gen.py); 5 min would give us a pretty good margin (see the sketch after this comment). However, we have to be prepared for the fact that sometimes the …
buildhub/jobs/buildhub/lambda_s3_event.py, lines 223 to 224 in 51d9563
Which is different from the case in https://sentry.prod.mozaws.net/operations/buildhub-stage/issues/4284392/ where we can see in the logs:
Honestly, in this case I find it very weird that it takes so much time from the S3 event to the file's appearance on the JSON API via the bucket lister. Maybe oremj has ideas... Otherwise, using an AWS client we may have more of a chance to fetch it immediately...
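A minimal sketch of the exponential-interval idea above, using the litl/backoff library linked in that comment. The `fetch_json` name, the aiohttp session handling, and the 5-minute budget are illustrative assumptions, not buildhub's actual code.

```python
# A minimal sketch, assuming the litl/backoff library linked above; the
# fetch_json name and the 5-minute budget are illustrative, not buildhub's
# actual code.
import aiohttp
import backoff


@backoff.on_exception(
    backoff.expo,                    # exponential waits (up to 1s, 2s, 4s, ... with default jitter)
    aiohttp.ClientResponseError,     # the error reported in Sentry
    max_time=300,                    # stop retrying after ~5 minutes in total
)
async def fetch_json(session: aiohttp.ClientSession, url: str):
    """Fetch a URL and return its JSON body, retrying on HTTP errors."""
    async with session.get(url) as response:
        response.raise_for_status()  # raises ClientResponseError on 4xx/5xx
        return await response.json()
```

With `backoff.expo` the waits roughly double on each attempt, so a `max_time` of 300 seconds comfortably covers many more attempts than the current three.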
In the Sentry entry it said it first failed, then backed off for 0.4s, then backed off a second time for 1.2s, and ultimately gave up when the 3rd attempt failed. I honestly don't know where that 0.4 comes from. Why 0.4 and not 0.5 or 0.1 or 123.456?
Meaning, if we change from 3 max. retries to 5 max. retries we'll sleep 0.4 + 1.2 + 3.6 + 10.8 seconds. Total of 16 seconds. Change it to 6 and you get a total max. sleep of 48.4 seconds. Surely that should be enough. That's almost a whole minute.
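For reference, here is the arithmetic behind those totals, assuming each wait is 3x the previous one starting at 0.4s (the pattern the 0.4s/1.2s Sentry numbers suggest; the actual wait generator may differ):

```python
# Back-of-the-envelope check of the totals above, assuming waits of
# 0.4s, 1.2s, 3.6s, ... (each 3x the previous); this is an inference from
# the Sentry numbers, not a confirmed buildhub setting.
def total_sleep(max_tries: int, base: float = 0.4, ratio: float = 3.0) -> float:
    """N attempts means N - 1 waits between them; return their sum in seconds."""
    return sum(base * ratio ** i for i in range(max_tries - 1))


for tries in (3, 5, 6):
    print(f"{tries} tries -> {total_sleep(tries):.1f}s total sleep")
# 3 tries -> 1.6s, 5 tries -> 16.0s, 6 tries -> 48.4s
```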
#432 just sets it to 6. I'll check with Wei that this isn't overwritten in the env of Stage or Prod. I'm not excited to dwell on this much more. 48 seconds is well under 5 minutes (the AWS Lambda max) and if, God forbid, it takes longer than 48 seconds, the scraper will have to fix it later. Also, when we get the new …
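A hedged illustration of the "overwritten in the env" concern: if the retry count is read from an environment variable, a Stage or Prod setting could silently shadow the new default from #432. `BACKOFF_MAX_TRIES` is a hypothetical name, not a known buildhub setting.

```python
# Hypothetical illustration only; BACKOFF_MAX_TRIES is an assumed variable
# name, not confirmed buildhub configuration.
import os

# The code default of 6 (from #432) only wins if Stage/Prod don't set the
# environment variable to something else.
MAX_TRIES = int(os.environ.get("BACKOFF_MAX_TRIES", "6"))
print("Effective max retries:", MAX_TRIES)
```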
It happened again :(
See https://sentry.prod.mozaws.net/operations/buildhub-stage/issues/1594248/
I wrote in my comment:
This is a lambda event that depends on fetching https://archive.mozilla.org/pub/firefox/nightly/2018/04/2018-04-24-01-36-04-mozilla-central-l10n/ and it tried 3 times and eventually had to give up. Since the URL now 200 OKs, it means it didn't wait long enough.
We recently made it so that `latest-inventory-to-kinto` doesn't do backoff in `fetch_json`; it only does that in the lambda function. We can now increase the backoff configuration to either try more times or use longer pauses.
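A sketch of what "increase the backoff configuration" could look like now that only the lambda path retries: the retry policy is built separately and applied around the plain fetch, so the lambda can choose more tries or longer pauses. Names, parameters, and defaults here are illustrative assumptions, not buildhub's actual signatures.

```python
# Illustrative sketch only: fetch_json, retrying, max_tries, and max_time are
# assumed names/parameters, not buildhub's actual API.
import aiohttp
import backoff


def retrying(max_tries: int = 6, max_time: float = 300.0):
    """Build a retry decorator the lambda can apply around fetch_json."""
    return backoff.on_exception(
        backoff.expo,
        aiohttp.ClientResponseError,
        max_tries=max_tries,
        max_time=max_time,
    )


async def fetch_json(session: aiohttp.ClientSession, url: str):
    """Plain fetch with no retries, as used outside the lambda."""
    async with session.get(url) as response:
        response.raise_for_status()
        return await response.json()


# Only the lambda wraps it with backoff; bump max_tries or max_time here.
fetch_json_with_retries = retrying(max_tries=6)(fetch_json)
```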