
Architecture for buildhub.json #437

Closed
mostlygeek opened this issue Apr 27, 2018 · 7 comments

Comments

@mostlygeek
Contributor

There are three things about Buildhub that are most important:

  1. Completeness of data
  2. Freshness of data
  3. Reliability of system

These three define the quality of the product. This is a proposal for how we should approach each of them once buildhub.json is available.

1. Completeness of data

First, we are not aiming for a complete index of every Firefox build. We have decided that 6 months of previous builds is enough to have in Buildhub. The way we backfill this data is with our scraper. The scraper should have whatever logic it needs to backfill 6 months of releases. This should be our approach:

  • The scraper is only guaranteed to handle backfilling 6 months of builds from the day buildhub.json exists
  • for anything further back in time, we will take requests (or PRs) to add logic to backfill more
  • the scraper should only need to be run manually
  • the scraper should be idempotent (see the sketch after this list)
  • the scraper should understand how to use buildhub.json files
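
For illustration, a rough sketch of one way the idempotency requirement could be met, assuming the kinto-http client; the bucket/collection names and the helper are made up:

import json
import uuid

import kinto_http


def store_build(client: kinto_http.Client, build_info: dict):
    # Deterministic id: the same buildhub.json content always maps to the same
    # record, so re-running the scraper over the same builds is a no-op.
    record_id = str(uuid.uuid5(uuid.NAMESPACE_URL, json.dumps(build_info, sort_keys=True)))
    client.create_record(
        id=record_id,
        data=build_info,
        bucket="build-hub",        # assumed bucket name
        collection="releases",     # assumed collection name
        if_not_exists=True,        # makes re-runs safe
    )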

It's also important to note that missing data (or the perception of it) directly undermines the value of Buildhub. We should make sure Buildhub can be trusted to have all the data it promises to have.

2. Freshness of data

Freshness means how long it takes for metadata to appear in Buildhub after a build is uploaded to archive.mozilla.org. Today we do this with S3 events triggering Lambda functions. I think this is a better flow: S3 event => SQS <=> Python processor.

This design replaces the Lambda function. It is just as fast and has better integrity characteristics. Tokenserver has been using this flow for years and it has proven to be very reliable.

As long as we don't mistakenly delete unfinished SQS messages, failed jobs will be retried.

The processor's logic will be very simple:

while True:  # forever, long polling...
    message = sqs.get(...)
    try:
        if message.filename == "buildhub.json":
            # process it into buildhub
            # raise any errors we encounter
            ...
        sqs.delete(message.id)
    except Exception:
        # ... log exception; the message stays queued and will be retried
        ...
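
For a more concrete picture, here is a rough sketch of the same loop using boto3's SQS resource API; the queue name and the process_buildhub_json() helper are made up:

import json
import logging

import boto3

sqs = boto3.resource("sqs")
queue = sqs.get_queue_by_name(QueueName="buildhub-s3-events")  # hypothetical name

while True:
    # Long polling: wait up to 20 seconds for up to 10 messages.
    for message in queue.receive_messages(WaitTimeSeconds=20, MaxNumberOfMessages=10):
        try:
            event = json.loads(message.body)  # S3 event notification payload
            key = event["Records"][0]["s3"]["object"]["key"]
            if key.endswith("buildhub.json"):
                process_buildhub_json(key)  # hypothetical: fetch the file, write to Kinto
            # Delete only after success (or after deciding the message is irrelevant),
            # so failed jobs stay in the queue and are retried.
            message.delete()
        except Exception:
            logging.exception("Failed to process message %s", message.message_id)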

3. Reliability of system

The approaches we've tried (scraper + Lambda) have taught us that it's really hard to maintain completeness and freshness of data. The evidence is the amount of workaround code (i.e. #425) we have accumulated around things.

With the introduction of buildhub.json and the above architecture, I think the system will be much more reliable and almost maintenance-free.

@mostlygeek
Contributor Author

@peterbe and @leplatrem, feedback please on this proposal.

@peterbe
Contributor

peterbe commented Apr 27, 2018

Side-note: I didn't know about this 6-month retention policy. If it's true, we could change the default for MIN_AGE_LAST_MODIFIED_HOURS from 0 to 24 * 30 * 6 = 4320 instead.
It would allow us to quickly ignore about 95% of the rows in the CSV files.

@peterbe
Contributor

peterbe commented Apr 30, 2018

Notes:

  • http://boto3.readthedocs.io/en/latest/guide/sqs.html#processing-messages is an example of using boto3 to consume an SQS queue.

  • We could introduce this technique AND keep Lambda, to soft-land it. However, since Kinto isn't great with concurrent writes, we might get crashes that make us think it failed to store the information when it actually did.

  • There appear to be two different ways of using SQS. Either you start a Python program that consumes (up to) 10 messages, then quits; I guess you'd have to start this like a regular cron job every couple of minutes. The other way is to start it as a daemon with a while True loop wrapping the queue.receive_messages() iterator. Daemons might die, just like Node, so you need a system that makes sure it gets restarted if it can't stay up by itself.

  • All the examples I find use boto3, which is not compatible with asyncio. Right? So we'd have to use aiobotocore ("other users report that SQS and Dynamo services work also").

  • asyncio is not needed at all for any of this. It is only useful in the scraper, where it gives us a performance boost by doing multiple things concurrently whilst waiting on network IO. In our current stack, the scraper main function and the lambda_handler function both feed into to_kinto.py, which is all asyncio. Figuring out how to do this right will require some experimentation.

  • Instead of running the SQS consumer as a daemon, we could use cron to start a script that exits early if another script is currently running (see the sketch after this list).

  • One major thing that sucks about Lambda is that developers can't run it on their laptops. SQS is better in that anybody could run it locally. All they need is AWS credentials (emailed from Cloud Ops) and access to the name of the SQS queue. Question: If I locally run a consumer instance, and delete messages I have read, it won't affect Stage or Prod, right?

  • Generally the best practice is to always delete successfully read messages, independent of being able to successfully deal with the message or not. This is to avoid rogue messages that are haunted and keep blocking up the queue. This is how Celery works by default (you have to manually change it to put failed messages back into the queue). If you go for this, how is it any better than Lambda?

  • Why not use Pulse? Every TaskCluster execution is broadcast on Pulse already. We can relatively easily filter it by namespace (taskcluster) and a pattern matcher (thing.url.endswith('/buildhub.json')). Then we don't have to set up an SQS producer at all.

  • Pretty sure with Pulse you can create a consumer instance that deletes messages without affecting other consumer instances. That makes it easy for Stage, Prod, localdevA, localdevB to run the queue consumption in the exact same way. Perhaps SQS has this too.

  • Pulse vs. SQS is probably personal taste. One is industry-tested (but vendor lock-in), the other is dog-fooding (forget Stack Overflow as a form of support). Both need credentials. Both have the daemon vs. non-daemon challenge.
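
For the cron-based variant mentioned above (a script that exits early if another run is in progress), an advisory file lock is one simple approach; the lock path and the consume_batch() helper are made up:

import fcntl
import sys

LOCK_PATH = "/tmp/buildhub-sqs-consumer.lock"  # hypothetical

with open(LOCK_PATH, "w") as lock_file:
    try:
        # Non-blocking exclusive lock: fails immediately if another run holds it.
        fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        sys.exit(0)  # another instance is already consuming; let it finish
    consume_batch()  # hypothetical: receive up to 10 messages, process, delete them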

LASTLY

I'd rather see how things work out (with the existing stack) once buildhub.json is in place. Perhaps the reliability problems we've had with Lambda go away once the code can be greatly simplified (e.g. no more race-condition problems looking up adjacent files). If today we're 90% reliable, with buildhub.json we might go up to 98%. With SQS we might get to 99%, but there'll be teething pains and we might drop in reliability till we iron out the kinks.

@peterbe
Contributor

peterbe commented Apr 30, 2018

Crazy thought...

Imagine if we had one system that is similar to Buildhub (scraper && (lambda || sqs)) but, instead of worrying itself with software builds, it just writes down every single file written to/updated in S3.
The inventory CSV files (*.csv.gz) are a monster list of ~50M filenames, their sizes and their LastModifiedDates. If you write down every single file in a relational database or Elasticsearch, with some form of index or partition on LastModified, you could super easily send it something like:

SELECT url FROM all_s3_files_ever
WHERE last_modified >= ${max(last_modified) from kinto}
  AND url LIKE '%buildhub.json'

I guess you'd do it as a REST interface and not SQL, so it's easy to control the security. So more like this:

for record in requests.get('https://s3-as-a-service.com', params={
    'last_modified__gte': max_last_modified_in_kinto,  # max(last_modified) from Kinto
    'url__endswith': 'buildhub.json',
}).json():
    build_info = requests.get(record['url']).json()
    kinto_client.get_or_create_record(build_info)

And for backfill/scraping you just do:

for day in range(6 * 31):
    date = now - datetime.timedelta(days=day)
    batch = []
    for record in requests.get('https://s3-as-a-service.com', params={
        'last_modified__date': date,
        'url__endswith': 'buildhub.json',
    }).json():
        build_info = requests.get(record['url']).json()
        batch.append(build_info)

If we had something like that, building services like Buildhub would be vastly simplified. It'd just be a glorified proxy for this database specifically about real builds. Buildhub would just be one example app you could build with this. I honestly don't know what else you'd build. Like I said, a crazy thought.

@mostlygeek
Contributor Author

We could introduce this technique AND keep Lambda, to soft-land it. However, since Kinto isn't great with concurrent writes we might get crashes that make us think it failed to store the information when it actually did.

Delete both the cron and the Lambda. Use what's proposed in this issue instead. We don't have a lot of users currently, so there are not a lot of workflows we will break.

There appear to be two different ways of using SQS. ... I guess you'd have to start this like a regular cron job every couple of minutes. The other way is to start it as a daemon with a while True loop

Run it like a daemon. Ops will monitor it using systemd, which will restart it if/when it fails. See the Tokenserver code, which runs as a daemon and is synchronous.

asyncio is not needed at all for any of this.

Yes. Make the processor synchronous and procedural. Keep it simple and obvious.

Instead of running the SQS consumer as a daemon, we could use cron to start a script that exits early if another script is currently running.

No cron. We want the ability to run the SQS processors in parallel. We should have Ops use systemd to spin up multiple of them (4?) on each Kinto node.

Generally the best practice is to always delete successfully read messages. ...independent of being able to successfully deal with the message or not. This to avoid rogue messages that are haunted and keep blocking up the queue. ... If you go for this, how is it any better than Lambda?

Yes, delete all messages that are not about buildhub.json. Crashes should log a Sentry error. It's better than Lambda because messages that trigger crashes are retried and not lost. We should fix bugs that cause crashes.
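
A minimal sketch of that error handling inside the consumer loop, assuming a Raven/Sentry client instance named sentry and a hypothetical process() helper:

try:
    if key.endswith("buildhub.json"):
        process(key)  # hypothetical: download the file and write it to Kinto
    # Irrelevant messages and successfully processed ones get deleted.
    message.delete()
except Exception:
    # Crashes are reported to Sentry; the message is NOT deleted, so it becomes
    # visible again after the SQS visibility timeout and is retried.
    sentry.captureException()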

@peterbe
Contributor

peterbe commented May 14, 2018

Another option is to change our Lambda from being an on-S3-file-creation event handler to being a web handler. Its URL would be hardcoded into the TaskCluster code, somewhere near where the buildhub.json file is made, and it would HTTP POST the contents of that file.

E.g.

# TaskCluster stuff...

@backoff.on_exception(backoff.expo, requests.RequestException)  # ...lots of patience...
def post_build_metadata():
    response = requests.post(
        'https://buildhub.lambda.amazonaws.com/lambdabuild',
        files={'file': open('buildhub.json', 'rb')}
    )
    if response.status_code >= 400:
        raise BreakTheBuild(response.status_code)

post_build_metadata()

# Carry on as normal

The advantage of this is that the JSON Schema could live nearer the Buildhub code (i.e. the Lambda code). If the TaskCluster code gets a 400, it's potentially because the buildhub.json's content is busted and needs attention.

Basically, this pattern is closely related to how Symbols works, except that instead of a Django web server (i.e. https://symbols.mozilla.org) we use Lambda web handlers. All they need is the credentials and URL for Kinto.

This is close to what was originally discussed about the creation and effects of the buildhub.json file.
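
For concreteness, here's a rough sketch of what such a Lambda web handler could look like behind API Gateway, assuming (for simplicity) that the buildhub.json content is POSTed as the raw request body rather than multipart; the schema path, environment variables, and bucket/collection names are made up:

import json
import os

import jsonschema
import kinto_http

SCHEMA = json.load(open("schema.json"))  # assumed: the JSON Schema shipped with Buildhub

def lambda_handler(event, context):
    build_info = json.loads(event["body"])  # the POSTed buildhub.json content
    try:
        jsonschema.validate(build_info, SCHEMA)
    except jsonschema.ValidationError as exc:
        # A 400 tells the TaskCluster side its buildhub.json content needs attention.
        return {"statusCode": 400, "body": str(exc)}
    client = kinto_http.Client(
        server_url=os.environ["KINTO_URL"],
        auth=tuple(os.environ["KINTO_AUTH"].split(":")),
    )
    client.create_record(
        data=build_info, bucket="build-hub", collection="releases", if_not_exists=True
    )
    return {"statusCode": 201, "body": "created"}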

One subtle advantage: our scraper code could potentially be an abstraction of this. It could work like this:

# 1. Download the IDs and hashes from the existing Kinto server
existing = {...}  # every ID and its hash

# 2. Scrape away...
for file in every_single_s3_file_ever:
    if os.path.basename(file) == 'buildhub.json':
        content = json.loads(file.read())  # you get the idea
        if (
            content['id'] not in existing or
            hashed(content) != existing[content['id']]
        ):
            from .lambda_web_handler import handle
            handle(content)

Advantages:

  1. JSON Schema stays with the Buildhub project.
  2. TaskCluster side will immediately be alerted to build failures if it can't post the build metadata.
  3. No delay between build and it appearing in Kinto.
  4. The backfill/scraper code will be closer to each other to pick up potential slack as we iron out bugs.
  5. No S3 at all. Especially attractive if the bucket they use is eventually consistent (region standard?)
  6. If the Lambda web handler code crashes, the JSON content would be captured in Sentry for easier debugging.

Disadvantages:

  1. We'd need to put some sort of authentication token into TaskCluster (similar to Symbols) so that we can trust the HTTP POSTers of this Lambda web handler URL.
  2. With something like SQS it's easier for the likes of Stage, and a dev's laptop, to get a stream of what's happening on Prod.
  3. Lambda is annoying to deploy since it means we need to package it and send off the code.

@peterbe
Contributor

peterbe commented May 14, 2018

We're going to build an SQS consumer that looks only for buildhub.json files.
Starting with: #465

We'll have our production S3 bucket send to two SQS queues:

  1. One for Prod Buildhub SQS
  2. One for Stage Buildhub SQS

For local development, we'll use a different S3 bucket that has a "Dev Buildhub SQS" queue. So to test the SQS consumer daemon locally, you use your Mozilla AWS Dev credentials and simply PUT a buildhub.json file into the dev S3 bucket.
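
A minimal sketch of that local test, assuming your Mozilla AWS Dev credentials are configured; the bucket name and key are made up:

import boto3

s3 = boto3.client("s3")
s3.upload_file(
    "buildhub.json",                                  # local file to upload
    "buildhub-dev-bucket",                            # hypothetical dev bucket name
    "pub/firefox/candidates/test/buildhub.json",      # any key ending in buildhub.json
)
# The resulting S3 PUT event should land on the "Dev Buildhub SQS" queue,
# where the locally running consumer daemon can pick it up.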

Our goal is also to NOT run the scraper any more. At all. Once the buildhub.json files are reliably created and uploaded by RelEng in TaskCluster, we should capture ALL of them. If we find that we acknowledge queue messages but fail to write them down properly, we'll reevaluate from there. Perhaps revive the scraper cron job but simplify it by only looking for the buildhub.json file name pattern.
