Architecture for buildhub.json #437
@peterbe and @leplatrem, feedback please on this proposal.
Side-note: I didn't know about this 6-month retention policy. If it's true, we could change the default for
Notes:
LASTLY I'd rather see how things work out (with the existing stack), once buildhub.json is available.
Crazy thought... Imagine if we had 1 system that is similar to Buildhub (scraper && (lambda || sqs)) but instead of worrying itself with software builds, it just writes down every single file written to/updated in S3.

```sql
SELECT url FROM all_s3_files_ever
WHERE last_modified >= ${max(last_modified) from kinto}
AND url.endswith('buildhub.json')
```

I guess you'd do it as a REST interface and not SQL so it's easy to control the security. So more like this:

```python
for record in requests.get('https://s3-as-a-service.com', params={
    'last_modified__gte': max_last_modified_from_kinto,  # max(last_modified) from Kinto
    'url__endswith': 'buildhub.json',
}).json():
    build_info = requests.get(record['url']).json()
    kinto_client.get_or_create_record(build_info)
```

And for backfill/scraping you just do:

```python
for day in range(6 * 31):
    date = now - datetime.timedelta(days=day)
    batch = []
    for record in requests.get('https://s3-as-a-service.com', params={
        'last_modified__date': date,
        'url__endswith': 'buildhub.json',
    }).json():
        build_info = requests.get(record['url']).json()
        batch.append(build_info)
```

If we had something like that, building services like Buildhub would be vastly simplified. It'd just be a glorified proxy for this database, specifically about real builds. Buildhub would just be one example app you could build with this. I honestly don't know what else you'd build. Like I said, a crazy thought.
Delete both the cron and the lambda. Use what's proposed in this issue instead. We don't have a lot of users currently so there are not a lot of workflows we will break.
Run it like a daemon. Ops will monitor it using systemd, which will restart it if/when it fails. See the tokenserver code, which runs as a daemon and is synchronous.
Yes. Make the processor synchronous and procedural. Keep it simple and obvious.
No cron. We want the ability to run the SQS processors in parallel. We should have ops use systemd to spin up multiple of them (4?) on each kinto node.
Yes, delete all messages that are not about buildhub.json.
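For illustration, a minimal sketch of how such a synchronous daemon could be shaped so that systemd can supervise and restart it; the queue name and the process_message stub are assumptions, not the real implementation:

```python
# Sketch of a synchronous, procedural SQS-polling daemon. Run several copies
# in parallel under systemd; systemd restarts the process whenever it exits.
# The queue name and process_message() are illustrative assumptions.
import boto3


def process_message(message):
    # placeholder: parse the S3 event, fetch the buildhub.json, write to Kinto
    pass


def main():
    queue = boto3.resource('sqs').get_queue_by_name(QueueName='buildhub-s3-events')
    while True:
        # long-poll so an idle loop doesn't hammer the API
        for message in queue.receive_messages(WaitTimeSeconds=20, MaxNumberOfMessages=10):
            process_message(message)
            # delete only after processing succeeded, so failed jobs are retried
            message.delete()


if __name__ == '__main__':
    main()
```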
Another option is to change our Lambda from being an on-S3-file-creation event to being a web handler. Its URL would be hardcoded into the TaskCluster code, somewhere near where the buildhub.json file gets uploaded. E.g.

```python
# TaskCluster stuff...
with backoff.retry(...lots of patience...):
    response = requests.post(
        'https://buildhub.lambda.amazonaws.com/lambdabuild',
        files={'file': open('buildhub.json', 'rb')}
    )
    if response.status_code >= 400:
        raise BreakTheBuild(response.status_code)
# Carry on as normal
```

Advantage of this is that the JSON Schema could live nearer the Buildhub code (i.e. the Lambda code). If the TaskCluster code gets a 400, it's potentially because the buildhub.json doesn't validate against that schema. Basically, this pattern is closely related to how Symbols works, except that instead of a Django web server (i.e. Tecken) the receiving end would be a Lambda web handler.

One subtle advantage: our scraper code could potentially be an abstraction of this. It could work like this:

```python
# 1. Download the IDs and hashes from the existing Kinto server
existing = { ...every ID and its hash... }

# 2. Scrape away...
for file in every_single_s3_file_ever:
    if os.path.basename(file) == 'buildhub.json':
        content = json.load(file.read())  # you get the idea
        if (
            content['id'] not in existing or
            hashed(content) != existing[content['id']]
        ):
            from .lambda_web_handler import handle
            handle(content)
```

Advantages:
Disadvantages:
We're going to build an SQS consumer that looks only for buildhub.json files. We'll have our production S3 bucket send to two SQS queues:
For local development, we'll use a different S3 bucket that has a "Dev Buildhub SQS" queue. So to test the SQS consumer daemon locally, you use your Mozilla AWS Dev credentials and simply PUT a buildhub.json file into that bucket (see the sketch below).

Our goal is also to NOT run the scraper any more. At all. Once the buildhub.json files are flowing in reliably, the scraper can be retired.
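A rough sketch of that local test, assuming your Mozilla AWS Dev credentials are in the environment; the bucket name, key, and payload below are made up:

```python
# Sketch: PUT a buildhub.json into the dev bucket so the "Dev Buildhub SQS"
# queue gets an S3 event for the locally running consumer daemon to pick up.
# Bucket name, key and payload are illustrative, not the real ones.
import json
import boto3

s3 = boto3.client('s3')  # uses your Mozilla AWS Dev credentials
s3.put_object(
    Bucket='buildhub-dev-example-bucket',              # assumed dev bucket name
    Key='pub/firefox/candidates/test/buildhub.json',   # assumed key; the filename is what matters
    Body=json.dumps({'build': {'id': '20180101000000'}}).encode('utf-8'),
    ContentType='application/json',
)
```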
There are three things about buildhub that are most important:

- Completeness of data
- Freshness of data
- Reliability of the system

These three make up the quality of the product. This is a proposal for how we should approach each of them when buildhub.json is available.

1. Completeness of data
First, we are not aiming for a complete index of every Firefox build. We have decided that 6 months of previous builds is enough to have in Buildhub. The way we backfill this data is with our Scraper. The Scraper should have whatever logic it needs to backfill 6 months of releases. This should be our approach:
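For illustration only (the bucket name, prefix, and final ingest step are assumptions rather than the actual plan), a six-month backfill pass could be shaped roughly like this:

```python
# Sketch: walk the archive bucket and collect every buildhub.json modified in
# the last ~6 months; each collected key would then be fetched and written to
# Kinto. Bucket name and prefix are placeholders.
import datetime
import boto3

SIX_MONTHS_AGO = datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(days=6 * 31)

s3 = boto3.client('s3')
keys_to_ingest = []
for page in s3.get_paginator('list_objects_v2').paginate(
    Bucket='archive-example-bucket', Prefix='pub/firefox/'
):
    for obj in page.get('Contents', []):
        if obj['Key'].endswith('buildhub.json') and obj['LastModified'] >= SIX_MONTHS_AGO:
            keys_to_ingest.append(obj['Key'])
```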
It's also important to note that missing data (or the perception of it) directly undermines the value of Buildhub. We should make sure Buildhub can be trusted to have all the data it promises to have.
2. Freshness of data
Freshness means how long it takes for metadata to be in buildhub after a build is uploaded to archive.mozilla.org. Today we do this with S3 events triggering lambda functions. I think this is a better flow:
s3 event => sqs <=> python processor
This design replaces the lambda function. It is just as fast and has better integrity characteristics. Tokenserver has been using this flow for years and it has proven to be very reliable.
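As a sketch of that wiring (in practice this lives in ops/Terraform configuration; the bucket name and queue ARN are placeholders), pointing the bucket's object-created events at an SQS queue could look like:

```python
# Sketch: have the S3 bucket publish "object created" events to an SQS queue.
# Bucket name and queue ARN are placeholders, not the real resources.
import boto3

s3 = boto3.client('s3')
s3.put_bucket_notification_configuration(
    Bucket='archive-example-bucket',
    NotificationConfiguration={
        'QueueConfigurations': [{
            'QueueArn': 'arn:aws:sqs:us-east-1:123456789012:buildhub-s3-events',
            'Events': ['s3:ObjectCreated:*'],
        }],
    },
)
```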
As long as we don't mistakenly delete unfinished SQS messages, failed jobs will be retried.
The processor's logic will be very simple:
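A minimal sketch of what that per-message logic could look like; the S3 event parsing follows the standard S3 notification format, while the Kinto server URL, credentials, and bucket/collection names are assumptions:

```python
# Sketch: handle one SQS message that wraps an S3 "object created" event.
# Only keys ending in buildhub.json are acted on; the caller deletes the
# message afterwards either way. Kinto details below are placeholders.
import json

import requests
import kinto_http  # the kinto.py client

client = kinto_http.Client(
    server_url='https://buildhub-kinto.example.com/v1',  # placeholder
    auth=('user', 'password'),                           # placeholder
)


def process_message(message):
    event = json.loads(message.body)
    for record in event.get('Records', []):
        bucket = record['s3']['bucket']['name']
        key = record['s3']['object']['key']
        if not key.endswith('buildhub.json'):
            continue  # not interesting
        # fetch the freshly uploaded buildhub.json
        build_info = requests.get(f'https://{bucket}.s3.amazonaws.com/{key}').json()
        # upsert the record; bucket/collection names are assumptions
        client.update_record(data=build_info, bucket='build-hub', collection='releases')
```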
3. Reliability of the system
The ideas we've tried so far (scraper + lambda) have taught us that it's really hard to maintain completeness and freshness of data. Evidence is the amount of workaround code we have accumulated (i.e. #425).
With the introduction of buildhub.json and the above architecture, I think the system will be much more reliable and almost maintenance-free.