move remaining state to db, allow multiple build servers #1785
Conversation
At first skim it all looks great to me. I'll take a deeper look again later, and I'll have to read the postgres docs around the locking.
This is awesome ❤️ thank you!!
src/build_queue.rs (outdated)
```rust
Err(err) => {
    log::error!("queue locked because of invalid last_seen_index_reference \"{}\" in database: {}", value, err);
    self.lock()?;
    return Ok(None);
```
Shouldn't this return an error instead of discarding it?
I was kind of coming from the `watch_registry` method, where I want to keep the registry watcher running, but locked, after this happened. So we would then manually remove / fix the ref and unlock the queue. But I can see that this could be confusing behavior for `last_seen_reference`, so I moved the logic into `watch_registry`.
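As a rough sketch of the behavior described here (not the exact PR code: `is_locked` and the polling interval are assumptions, while `get_new_crates` and `lock` are the methods discussed in this thread), on a broken reference the queue gets locked, but the watcher keeps running, so the ref can be fixed and the queue unlocked without restarting the process:

```rust
// Rough sketch of the watcher behavior described above; not the actual PR code.
fn watch_registry(build_queue: &BuildQueue, index: &Index) -> anyhow::Result<()> {
    loop {
        // `is_locked` is an assumed helper: a locked queue pauses queueing,
        // but the watcher process itself stays alive.
        if !build_queue.is_locked()? {
            if let Err(err) = build_queue.get_new_crates(index) {
                log::error!("failed to get new crates, locking queue: {:?}", err);
                build_queue.lock()?;
            }
        }
        std::thread::sleep(std::time::Duration::from_secs(60));
    }
}
```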
```rust
// additionally set the reference in the database
// so this survives recreating the registry watcher
// server.
self.set_last_seen_reference(oid)?;
```
It worries me that this isn't atomic. I worry that we'll:
- update this in the local git repo
- fail to update it in the database
- end up building the same list of crates twice
I guess that's not the end of the world. But it would be nice to add some more error logging, and to update it in the database before updating the git repo.
I changed the order of these, first setting the ref in the db, then in the repo.

What additional logging do you imagine? At the call-site of `get_new_crates` (now in `watch_registry`) all errors will be logged already.
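A fragment-level sketch of the agreed ordering (the repo-side setter below is a hypothetical stand-in; `set_last_seen_reference` is the database-backed setter from this PR): writing the database first means a failure between the two steps can only cause crates to be queued twice, never skipped.

```rust
// Sketch only: database first, local git repo second.
self.set_last_seen_reference(oid)?;   // persist in the database (this PR's setter)
index.set_last_seen_reference(oid)?;  // hypothetical repo-side setter, shown for ordering only
```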
debug!("Checking new crates"); | ||
match build_queue | ||
.get_new_crates(&index) | ||
.context("Failed to get new crates") |
Can you add a comment that there should only be one registry watcher running at a time? It confused me for a bit why there wasn't any locking.
In practice I think it should be ok - the worst that happens is we'll try and insert the same crates twice, and the second insert will be ignored by the database.
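For readers wondering why the duplicate insert is harmless, here is a hedged illustration (table, column, and function names are assumptions, not the project's actual schema): a unique constraint plus `ON CONFLICT DO NOTHING` turns the second insert into a no-op.

```rust
use postgres::Client;

// Hypothetical sketch: inserting an already-queued crate is silently ignored
// because of the unique constraint on (name, version).
fn add_crate_to_queue(
    conn: &mut Client,
    name: &str,
    version: &str,
    priority: i32,
) -> Result<(), postgres::Error> {
    conn.execute(
        "INSERT INTO queue (name, version, priority)
         VALUES ($1, $2, $3)
         ON CONFLICT (name, version) DO NOTHING",
        &[&name, &version, &priority],
    )?;
    Ok(())
}
```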
I added a note.
```rust
        }
        Err(e) => {
            report_error(&e.context("Failed to build crate from queue"));
        }
    }
}));
```
There's a comment below that says "If we panic here something is really truly wrong and trying to handle the error won't help". That no longer seems true to me, since the database could be down or whatever. But I don't know how to handle the error. Maybe we should treat it the same as the rest of the errors, exit the build loop and report it?
For a moment I thought you meant exiting = `break`, but I think you mean `continue`, like with the other errors.

I don't know which cases can lead to panics in the build. I remember we still had some places sprinkled with `.unwrap` or `.expect` around database methods, where trying again could be valid.

Checking this I also saw that the `?` when checking the queue lock should be removed to match the error-handling in other cases in that loop.
I updated the logic, now only reporting and continuing, also for panics.
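For context, a rough sketch of this "report and continue" behavior (not the actual build loop: `build_next_queued_crate` is a stand-in for the real build step, `report_error` mirrors the helper seen in the snippet above, and the 60-second pause matches the retry interval mentioned below):

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};
use std::thread::sleep;
use std::time::Duration;

fn build_loop() {
    loop {
        match catch_unwind(AssertUnwindSafe(build_next_queued_crate)) {
            Ok(Ok(())) => {}
            Ok(Err(err)) => report_error(&err.context("Failed to build crate from queue")),
            // a panic is reported like any other error; the loop keeps running
            // and retries on the next iteration instead of tearing down the server
            Err(_panic) => log::error!("build thread panicked, continuing with the next crate"),
        }
        sleep(Duration::from_secs(60));
    }
}

// stand-ins so the sketch is self-contained
fn build_next_queued_crate() -> anyhow::Result<()> {
    Ok(())
}

fn report_error(err: &anyhow::Error) {
    log::error!("{:?}", err);
}
```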
IIUC, the failure case here, if it's the crate causing the failure, will just be that we retry building the top crate from the queue continuously every 60 seconds (because the panic bypasses incrementing the attempt counter), until someone manually removes that crate from the queue (which would have been the same resolution before). If it's something transient like the database, then we should eventually recover when it's working again and we next attempt to build the crate.
On top of the docs, this article is also helpful. I ran some local tests (replacing the build with a …).

I threw 3 builders into the docker-compose setup and tested building some 20 crates locally; the queueing handled them all fine. I wonder if we should record the hostname or something into …
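The multi-builder behavior leans on row locking in Postgres. As a hedged sketch (schema, column names, and the function itself are assumptions, not the exact queries in this PR), each build server claims the next queued crate inside a transaction with `FOR UPDATE SKIP LOCKED`, so a row being processed by one server is invisible to the others until that transaction ends:

```rust
use postgres::Client;

// Hypothetical sketch of claiming one queued crate per builder.
fn claim_next_queued_crate(
    conn: &mut Client,
    max_attempts: i32,
) -> Result<Option<(String, String)>, postgres::Error> {
    let mut tx = conn.transaction()?;
    let row = tx.query_opt(
        "SELECT name, version
         FROM queue
         WHERE attempt < $1
         ORDER BY priority, id
         LIMIT 1
         FOR UPDATE SKIP LOCKED",
        &[&max_attempts],
    )?;
    let claimed = row.map(|r| (r.get("name"), r.get("version")));
    // in a real implementation the build would run while the transaction
    // (and therefore the row lock) is still held
    tx.commit()?;
    Ok(claimed)
}
```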
LGTM, and worked well in local testing (I just retested with the hostname changes).
I'll merge & deploy this separately after the current merges are live & safe, so I can watch & eventually revert more easily.
This is a possible approach for #795, without #1011, but including the possibility to run multiple build-servers.
It contains:

- a registry watcher process (`start-registry-watcher`). With multiple build-servers we still should only run one registry watcher.
- the repository stats updater, which can run as a separate process (`database update-repository-fields`), or optionally in the registry watcher process (via `--repository-stats-updater=enabled`).
- a build-server process (`start-build-server`) which can be run multiple times. A queued crate will only be picked up by one of the servers.

We already can run a separate webserver as often as we want with `start-web-server`, of course using the proper connection pooling limits.

testing
Since test coverage for the build-process and queue is limited I might miss edge-cases. I'm happy to fix any issues that might come up.