move remaining state to db, allow multiple build servers #1785
Conversation
At first skim it all looks great to me. I'll take a deeper look again later, and I'll have to read the postgres docs around the locking.
This is awesome ❤️ thank you!!
src/build_queue.rs (outdated)
```rust
Err(err) => {
    log::error!("queue locked because of invalid last_seen_index_reference \"{}\" in database: {}", value, err);
    self.lock()?;
    return Ok(None);
```
Shouldn't this return an error instead of discarding it?
I was kind of coming from the `watch_registry` method, where I want to keep the registry watcher running, but locked, after this happened. So we would then manually remove / fix the ref and unlock the queue. But I can see that this could be confusing behavior for `last_seen_reference`, so I moved the logic into `watch_registry`.
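As a rough sketch of the behavior described here (not the exact PR code: `is_locked` and the polling interval are assumptions, while `get_new_crates` and `lock` are the methods discussed in this thread), on a broken reference the queue gets locked, but the watcher keeps running, so the ref can be fixed and the queue unlocked without restarting the process:

```rust
// Rough sketch of the watcher behavior described above; not the actual PR code.
fn watch_registry(build_queue: &BuildQueue, index: &Index) -> anyhow::Result<()> {
    loop {
        // `is_locked` is an assumed helper: a locked queue pauses queueing,
        // but the watcher process itself stays alive.
        if !build_queue.is_locked()? {
            if let Err(err) = build_queue.get_new_crates(index) {
                log::error!("failed to get new crates, locking queue: {:?}", err);
                build_queue.lock()?;
            }
        }
        std::thread::sleep(std::time::Duration::from_secs(60));
    }
}
```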
```rust
// additionally set the reference in the database
// so this survives recreating the registry watcher
// server.
self.set_last_seen_reference(oid)?;
```
It worries me that this isn't atomic. I worry that we'll:
- update this in the local git repo
- fail to update it in the database
- end up building the same list of crates twice
I guess that's not the end of the world. But it would be nice to add some more error logging, and to update it in the database before updating the git repo.
I changed the order of these, first setting the ref in the db, then in the repo.

What additional logging do you imagine? At the call-site of `get_new_crates` (now in `watch_registry`) all errors will be logged already.
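A fragment-level sketch of the agreed ordering (the repo-side setter below is a hypothetical stand-in; `set_last_seen_reference` is the database-backed setter from this PR): writing the database first means a failure between the two steps can only cause crates to be queued twice, never skipped.

```rust
// Sketch only: database first, local git repo second.
self.set_last_seen_reference(oid)?;   // persist in the database (this PR's setter)
index.set_last_seen_reference(oid)?;  // hypothetical repo-side setter, shown for ordering only
```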
debug!("Checking new crates"); | ||
match build_queue | ||
.get_new_crates(&index) | ||
.context("Failed to get new crates") |
Can you add a comment that there should only be one registry watcher running at a time? It confused me for a bit why there wasn't any locking.
In practice I think it should be ok - the worst that happens is we'll try and insert the same crates twice, and the second insert will be ignored by the database.
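For readers wondering why the duplicate insert is harmless, here is a hedged illustration (table, column, and function names are assumptions, not the project's actual schema): a unique constraint plus `ON CONFLICT DO NOTHING` turns the second insert into a no-op.

```rust
use postgres::Client;

// Hypothetical sketch: inserting an already-queued crate is silently ignored
// because of the unique constraint on (name, version).
fn add_crate_to_queue(
    conn: &mut Client,
    name: &str,
    version: &str,
    priority: i32,
) -> Result<(), postgres::Error> {
    conn.execute(
        "INSERT INTO queue (name, version, priority)
         VALUES ($1, $2, $3)
         ON CONFLICT (name, version) DO NOTHING",
        &[&name, &version, &priority],
    )?;
    Ok(())
}
```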
I added a note.
```rust
        }
        Err(e) => {
            report_error(&e.context("Failed to build crate from queue"));
        }
    }
}));
```
There's a comment below that says "If we panic here something is really truly wrong and trying to handle the error won't help". That no longer seems true to me, since the database could be down or whatever. But I don't know how to handle the error. Maybe we should treat it the same as the rest of the errors, exit the build loop and report it?
For a moment I thought you meant exiting = `break`, but I think you mean `continue`, like with the other errors.

I don't know which cases can lead to panics in the build. I remember we still had some places sprinkled with `.unwrap` or `.expect` around database methods, where trying again could be valid.

Checking this I also saw that the `?` when checking the queue lock should be removed to match the error-handling in other cases in that loop.
I updated the logic, now only reporting and continuing, also for panics.
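For context, a rough sketch of this "report and continue" behavior (not the actual build loop: `build_next_queued_crate` is a stand-in for the real build step, `report_error` mirrors the helper seen in the snippet above, and the 60-second pause matches the retry interval mentioned below):

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};
use std::thread::sleep;
use std::time::Duration;

fn build_loop() {
    loop {
        match catch_unwind(AssertUnwindSafe(build_next_queued_crate)) {
            Ok(Ok(())) => {}
            Ok(Err(err)) => report_error(&err.context("Failed to build crate from queue")),
            // a panic is reported like any other error; the loop keeps running
            // and retries on the next iteration instead of tearing down the server
            Err(_panic) => log::error!("build thread panicked, continuing with the next crate"),
        }
        sleep(Duration::from_secs(60));
    }
}

// stand-ins so the sketch is self-contained
fn build_next_queued_crate() -> anyhow::Result<()> {
    Ok(())
}

fn report_error(err: &anyhow::Error) {
    log::error!("{:?}", err);
}
```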
IIUC, the failure case here, if it's the crate causing the failure, will just be that we retry building the top crate from the queue continuously every 60 seconds (because the panic bypasses incrementing the attempt counter), until someone manually removes that crate from the queue (which would have been the same resolution before). If it's something transient like the database, then we should eventually recover when it's working again and we next attempt to build the crate.
On top of the docs, this article is also helpful. I ran some local tests (replacing the build with a …).

I threw 3 builders into the docker-compose setup and tested building some 20 crates locally; the queueing handled them all fine. I wonder if we should record the hostname or something into …
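The multi-builder behavior leans on row locking in Postgres. As a hedged sketch (schema, column names, and the function itself are assumptions, not the exact queries in this PR), each build server claims the next queued crate inside a transaction with `FOR UPDATE SKIP LOCKED`, so a row being processed by one server is invisible to the others until that transaction ends:

```rust
use postgres::Client;

// Hypothetical sketch of claiming one queued crate per builder.
fn claim_next_queued_crate(
    conn: &mut Client,
    max_attempts: i32,
) -> Result<Option<(String, String)>, postgres::Error> {
    let mut tx = conn.transaction()?;
    let row = tx.query_opt(
        "SELECT name, version
         FROM queue
         WHERE attempt < $1
         ORDER BY priority, id
         LIMIT 1
         FOR UPDATE SKIP LOCKED",
        &[&max_attempts],
    )?;
    let claimed = row.map(|r| (r.get("name"), r.get("version")));
    // in a real implementation the build would run while the transaction
    // (and therefore the row lock) is still held
    tx.commit()?;
    Ok(claimed)
}
```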
LGTM, and worked well in local testing (I just retested with the hostname changes).
I'll merge & deploy this separately after the current merges are live & safe, so I can watch & eventually revert more easily.
This is a possible approach for #795, without #1011, but including the possibility to run multiple build-servers.
It contains:

- a registry watcher process (`start-registry-watcher`). With multiple build-servers we still should only run one registry watcher.
- the repository stats updater, which can run as a separate process (`database update-repository-fields`), or optionally in the registry watcher process (via `--repository-stats-updater=enabled`).
- a build-server process (`start-build-server`) which can be run multiple times. A queued crate will only be picked up by one of the servers.

We already can run a separate webserver as often as we want with `start-web-server`, of course using the proper connection pooling limits.

testing
Since test coverage for the build-process and queue is limited I might miss edge-cases. I'm happy to fix any issues that might come up.