
Remove nexus-test-utils build.rs #4056

Closed
wants to merge 2 commits

Conversation

@smklein (Collaborator) commented Sep 8, 2023

We previously used a build.rs script to create a "seed" directory for CRDB, so that tests wouldn't need to duplicate the work of initializing the DB for all our tests.

This worked, but it unfortunately meant that CRDB construction happened at build time, which (1) could be unnecessary work in cases where we aren't running tests, and (2) required a build.rs script to do this work, which causes problems for rust-analyzer.

This PR replaces that build.rs script with a more dynamic approach: we still construct the seed directory, but we do so "at test time", letting the first few test invocations race with each other to create a directory whose name is a digest of the CRDB schema and version. Whoever wins the race populates the seed directory, and all subsequent tests rely on that version.

@smklein smklein requested a review from steveklabnik September 8, 2023 01:52
@smklein (Collaborator, Author) commented Sep 8, 2023

As an example of what I see when I run cargo test -p omicron-nexus now:

# A bunch of tests all trying to populate CRDB concurrently...
🏠 /tmp/crdb-base                                                                                                      
 $ ls -la
total 292                                                                                                              
drwxrwxr-x  18 smklein smklein   4096 Sep  7 18:54 .                                                                   
drwxrwxrwt 895 root    root    225280 Sep  7 18:54 ..                                                                  
drwxrwxr-x   5 smklein smklein   4096 Sep  7 18:54 .tmpabJZ2u
drwxrwxr-x   5 smklein smklein   4096 Sep  7 18:54 .tmpAqAaQZ
drwxrwxr-x   5 smklein smklein   4096 Sep  7 18:54 .tmpBDZD3U
drwxrwxr-x   5 smklein smklein   4096 Sep  7 18:54 .tmpdI7OzM
drwxrwxr-x   5 smklein smklein   4096 Sep  7 18:54 .tmpFE1NhU
drwxrwxr-x   5 smklein smklein   4096 Sep  7 18:54 .tmpIVTLfK
drwxrwxr-x   5 smklein smklein   4096 Sep  7 18:54 .tmpjC1uks
drwxrwxr-x   5 smklein smklein   4096 Sep  7 18:54 .tmpjWGSEw
drwxrwxr-x   5 smklein smklein   4096 Sep  7 18:54 .tmpLT6IgR
drwxrwxr-x   5 smklein smklein   4096 Sep  7 18:54 .tmpMC92vU
drwxrwxr-x   5 smklein smklein   4096 Sep  7 18:54 .tmpP8rpzF
drwxrwxr-x   5 smklein smklein   4096 Sep  7 18:54 .tmprPeh8E
drwxrwxr-x   5 smklein smklein   4096 Sep  7 18:54 .tmpVVZlZy
drwxrwxr-x   5 smklein smklein   4096 Sep  7 18:54 .tmpw0uDZF
drwxrwxr-x   5 smklein smklein   4096 Sep  7 18:54 .tmpxbqg8i
drwxrwxr-x   5 smklein smklein   4096 Sep  7 18:54 .tmpy3Eibf

# One of them wins, and renames their .tmp directory to the digest here...
🏠 /tmp/crdb-base 
 $ ls -la
total 288
drwxrwxr-x  17 smklein smklein   4096 Sep  7 18:54 .
drwxrwxrwt 895 root    root    225280 Sep  7 18:54 ..
drwxrwxr-x   4 smklein smklein   4096 Sep  7 18:54 586b14361321b0e0e48755d42040c5e5ed6b18302bdf2b2e356dec5cf466a294
drwxrwxr-x   5 smklein smklein   4096 Sep  7 18:54 .tmpabJZ2u
drwxrwxr-x   5 smklein smklein   4096 Sep  7 18:54 .tmpAqAaQZ
drwxrwxr-x   5 smklein smklein   4096 Sep  7 18:54 .tmpBDZD3U
drwxrwxr-x   5 smklein smklein   4096 Sep  7 18:54 .tmpFE1NhU
drwxrwxr-x   5 smklein smklein   4096 Sep  7 18:54 .tmpIVTLfK
drwxrwxr-x   5 smklein smklein   4096 Sep  7 18:54 .tmpjC1uks
drwxrwxr-x   5 smklein smklein   4096 Sep  7 18:54 .tmpjWGSEw
drwxrwxr-x   5 smklein smklein   4096 Sep  7 18:54 .tmpMC92vU
drwxrwxr-x   5 smklein smklein   4096 Sep  7 18:54 .tmpP8rpzF
drwxrwxr-x   5 smklein smklein   4096 Sep  7 18:54 .tmprPeh8E
drwxrwxr-x   5 smklein smklein   4096 Sep  7 18:54 .tmpVVZlZy
drwxrwxr-x   5 smklein smklein   4096 Sep  7 18:54 .tmpw0uDZF
drwxrwxr-x   5 smklein smklein   4096 Sep  7 18:54 .tmpxbqg8i
drwxrwxr-x   5 smklein smklein   4096 Sep  7 18:54 .tmpy3Eibf

# All others finish, throw away their .tmp directories, and use the same seed.
# All subsequent tests don't bother creating .tmp directories, they just use the seed.
🏠 /tmp/crdb-base 
 $ ls -la
total 288
drwxrwxr-x  17 smklein smklein   4096 Sep  7 18:54 .
drwxrwxrwt 895 root    root    225280 Sep  7 18:54 ..
drwxrwxr-x   4 smklein smklein   4096 Sep  7 18:54 586b14361321b0e0e48755d42040c5e5ed6b18302bdf2b2e356dec5cf466a294

@smklein smklein requested a review from davepacheco September 8, 2023 02:42

# We expect the seed CRDB to be placed here, so we explicitly remove it. There
# is no "final test", so cargo test wouldn't really know when to remove it.
rm -rf "$TEST_TMPDIR/crdb-base"
@smklein (Collaborator, Author):

This data used to be placed in:

OUT_DIR — the folder in which all output and intermediate artifacts should be placed. This folder is inside the build directory for the package being built, and it is unique for the package in question.

(Which is an environment variable cargo sets for build scripts)

Well, this isn't run by a build script any longer, so now it's going in $TEST_TMPDIR, which is why I'm removing it explicitly now.

@jordanhendricks (Contributor) commented Sep 8, 2023

I have in the past found requiring a cockroach binary as part of the omicron build to be a bit surprising (though it makes sense to me why things ended up that way). I also think an important (and perhaps also surprising) consequence of relying on PATH as part of the build environment is that it can interact badly with IDE-like tools that build the source (namely, rust-analyzer).

I am still ramping up on omicron tests (and don't have any experience with cargo-nextest at all yet), but I am curious if there's a way the creation of the directory could be done before the invocation of any tests. It seems to me that introducing an intentional race to create a shared resource for the tests could introduce bugs in the tests themselves. I also wonder about how much work the creation of each of these seed directories is doing -- is it going to consume more time / CPU to have an arbitrary number of tests attempt to create the directory?

Comment on lines +29 to +34
// However, the process for creating these seed directories is not atomic.
// We create a temporary directory within:
//
// - /tmp/crdb-base/...
//
// And rename it to the final "digest" location once it has been fully created.
@jordanhendricks (Contributor) commented Sep 8, 2023

Thinking about this a bit more (and assuming it's hard to serialize the creation of this directory ahead of test invocations; I assumed so, but as I said, I don't have a lot of experience with nextest yet and am less sure of what features it has), one way to make things more explicit could be to use a file as a lock to take ownership of setting up the directory.

So the winning test path would:

  • create the directory
  • create a lockfile in the directory
  • lock the file
  • create the database

To protect against races, I think an approach that might work is:

  • mkdir the directory; if the test gets EEXIST, skip to opening the lockfile (without O_CREAT | O_EXCL) and trying to lock it with flock()
  • open the lockfile with O_CREAT | O_EXCL; if the test gets EEXIST, skip to flock
  • flock the lockfile, then check if the database exists
  • if the database doesn't exist, create it

What do you think?

@smklein (Collaborator, Author):

I thought about this approach, but it falls over if any test crashes (or is ^C-ed, or fails for any other reason) during DB setup.

I do understand that the flock portion of this could help mitigate that concern, but flock is also pretty filesystem- and platform-specific: it doesn't work on NFS and silently fails in some situations (see https://apenwarr.ca/log/20101213). Even https://docs.rs/fs2/latest/fs2/trait.FileExt.html, which appears to be a fairly popular filesystem-locking crate, warns that "File locks are a cross-platform hazard since the file lock APIs exposed by operating system kernels vary in subtle and not-so-subtle ways", and it is not regularly tested on illumos.

Also, ultimately, if some process failed partway through creation, and we needed to hand-off creation of the "seed directory" to a different test process, we'd need additional logic to distinguish between "this directory exists because I should use it" vs "this directory exists and I should try to salvage it, or destroy it and start over".

I thought the approach of "do an atomic rename" seemed simpler -- the directory can only (atomically) exist after it has been successfully created, and if any test failed during setup, their output would be isolated. Besides, the work of "spin up a CRDB node and populate it" was the default behavior for all tests using the DB before I tried adding this "seed" optimization. At worst, it's only a small performance hit for the first couple tests that are run (and if the schema changes, that seed directory still exists, so it can be re-used for subsequent test runs).

Contributor:

I'd have to second @smklein's concern about using file locks anywhere for tests.

Collaborator:

First, a few things about file locking:

  • The guidance in the article you link is not unfair from a historical perspective, but it is also more than a decade old, and describes issues with versions of Linux and Mac OS X that were old at that time
  • We would use whole-file OFD-style exclusive locks; this avoids any of the issues one might bump into with shared locks and upgrades
  • flock(3C) is robust and well-tested on illumos, FWIW. We built it to work the way it does on Linux (at the time, for our LX brand work) where it is also well-tested enough at this stage as far as I can tell. Our flock(3C) creates OFD-style locks, which we also support (again for compatibility with Linux) with F_OFD_* commands via fcntl(2).
  • We're talking about files in /tmp, which should never be NFS-backed; the era of NFS root for diskless workstations has since passed
  • As I recall the fs2 crate is long-since abandoned; we forked it for a while as fs3, but I believe the community has since moved on to fs4

if some process failed partway through creation, and we needed to hand-off creation of the "seed directory" to a different test process, we'd need additional logic to distinguish between "this directory exists because I should use it" vs "this directory exists and I should try to salvage it, or destroy it and start over".

I think this logic is relatively simple: put a marker file in the directory, or write marker data into the lock file, once the setup is complete. If the marker is not present and correct when you get to lock the file, remove the database files and start again. Otherwise, release the lock and use the database seed.

You could also use a combination of the atomic directory rename you're talking about, and the lock file: lock the lock file, and:

  • if you see the final expected directory name, say, database, you know you can use it; drop the lock and do that
  • otherwise:
    • if you see a temporary name, like, say, notyet, remove it now
    • create the database, rename the directory to database

Either way, once you drop the lock, there will be a correctly provisioned seed in database.

Here is a simple example of using flock(3C) through the fs4 crate: https://github.com/jclulow/junk-locker

@smklein (Collaborator, Author):

If it's okay with y'all, I'd like to punt on this. I'm not necessarily saying we shouldn't do it, but maybe not within this PR?

  • This locking proposal is an optimization to avoid a minor amount of wasted work in tests
  • If we introduce this mechanism, I do think it's more complicated, and would want to spend more time validating and testing it...
  • ... and the benefits of that testing and optimization seem like they'd both be tossed out the window if we implement setup and teardown script support (nextest-rs/nextest#683) anyway?

Contributor:

... and the benefits of that testing and optimization seem like they'd both be tossed out the window if we implement setup and teardown script support (nextest-rs/nextest#683) anyway?

If we're going to toss it out anyway, could we just wait until we've properly assessed and prioritized whether we should do the nextest feature?

Contributor:

Another option I've heard people use is a Mutex<()>, almost like a barrier. See rust-lang/rust#43155 for some discussion on that. Generally this is for running tests themselves serially, but I could imagine adapting this purely for the setup instead.

The cleanest solution would be the upstream feature. If that were coming soon, then that would be ideal.

If not though, basically this is a game of "which way of emulating this feature sucks the least." Build scripts have their own annoying downsides. But I also agree with @smklein that this discussion feels like premature optimization. Is that actually the case? Does this patch slow down the test suite, and if so, by how much?

Contributor:

FWIW @steveklabnik I was more worried about intentionally racing the tests against each other. Maybe that's not a fair concern. I also agree that the upstream feature is cleanest, and hence was curious about that.

@davepacheco (Collaborator) commented Sep 8, 2023

It seems potentially counterproductive to the goal of reducing iteration time to do a bunch of work in parallel many times over and throw most of it away. (Maybe that's the wrong way to look at it though -- it's just a cache, and we were doing this even more before the build script.)

But I don't think I understand the problem this is trying to solve. It's true that if you're just building with no intent to run tests, this is extra work. How long does it actually take? Would it make sense to use cargo check? Or avoid building nexus-test-utils in that case? (That may be impractical -- I'm not sure.)

Is this just about rust-analyzer? Is it about the PATH issue involving cockroach or also that it takes a long time?

@smklein (Collaborator, Author) commented Sep 8, 2023

It seems potentially counterproductive to the goal of reducing iteration time to do a bunch of work in parallel many times over and throw most of it away. (Maybe that's the wrong way to look at it though -- it's just a cache, and we were doing this even more before the build script.)

Yes, it's just a fraction of "what we used to do before" from the testing point-of-view.

But I don't think I understand the problem this is trying to solve. It's true that if you're just building with no intent to run tests, this is extra work. How long does it actually take?

In a build, the old build script added a few seconds, depending on your machine.

Would it make sense to use cargo check? Or avoid building nexus-test-utils in that case? (That may be impractical -- I'm not sure.)

This impacts building the Nexus integration tests, or just cargo build, and gets re-run if any of the following change, as they are all dependencies of nexus/test-utils:

  • nexus-db-queries
  • nexus-types
  • omicron-common
  • omicron-sled-agent
  • oximeter
  • etc

Is this just about rust-analyzer? Is it about the PATH issue involving cockroach or also that it takes a long time?

This was a major part of it too. When I added this code to the build.rs script, I noted in a comment that "this doesn't need to happen at build time", and build.rs scripts are their own bag of complexity. Yes, the rust-analyzer issue can be mitigated, but it's "just another pain point" when building Nexus.

@smklein (Collaborator, Author) commented Sep 27, 2023

Closing in favor of #4150

@smklein smklein closed this Sep 27, 2023
@smklein smklein deleted the less-buildrs branch September 27, 2023 19:26