Remove nexus-test-utils build.rs #4056
Conversation
As an example of what I see when I run
```sh
# We expect the seed CRDB to be placed here, so we explicitly remove it. There
# is no "final test", so cargo test wouldn't really know when to remove it.
rm -rf "$TEST_TMPDIR/crdb-base"
```
This data used to be placed in:

> OUT_DIR — the folder in which all output and intermediate artifacts should be placed. This folder is inside the build directory for the package being built, and it is unique for the package in question.

(Which is an environment variable cargo sets for build scripts.)

Well, this isn't run by a build script any longer, so now it's going in `$TEST_TMPDIR`, which is why I'm removing it explicitly now.
I have in the past found requiring a cockroach binary as a part of the omicron build to be a bit surprising (though it makes sense to me why things ended up that way). I also think an important (and perhaps also surprising) consequence of relying on PATH as a part of the build environment for omicron could have unexpected impacts with IDE-like tools that rely on building source (namely, rust-analyzer).

I am still ramping up on omicron tests (and don't have any experience with cargo-nextest at all yet), but I am curious if there's a way the creation of the directory could be done before the invocation of any tests. It seems to me that introducing an intentional race to create a shared resource for the tests could introduce bugs in the tests themselves. I also wonder about how much work the creation of each of these seed directories is doing -- is it going to consume more time / CPU to have an arbitrary number of tests attempt to create the directory?
```rust
// However, the process for creating these seed directories is not atomic.
// We create a temporary directory within:
//
// - /tmp/crdb-base/...
//
// And rename it to the final "digest" location once it has been fully created.
```
Thinking about this a bit more (and assuming it's hard to serialize the creation of this directory ahead of test invocations, which I assumed, but as I said, don't have a lot of experience with nextest yet and am less sure of what features it has), one way to make things more explicit could be to use a file as a lock to take ownership of setting up the directory.
So the winning test path would:
- create the directory
- create a lockfile in the directory
- lock the file
- create the database
To protect against races, I think an approach that might work is:
- mkdir the directory; if the test gets `EEXIST`, then skip to trying to open the lockfile (without `O_CREAT | O_EXCL`) and try to lock the lockfile with `flock()`
- open the lockfile with `O_CREAT | O_EXCL`; if the test gets `EEXIST`, skip to `flock`
- `flock` the lockfile, then check if the database exists
- if the database doesn't exist, create it
What do you think?
I thought about this approach, but it falls over if any test crashes (or is `^C`-ed, or fails for any other reason) during DB setup.

I do understand that the `flock` portion of this could help mitigate that concern, but that's also pretty filesystem / platform specific (see https://apenwarr.ca/log/20101213: it doesn't work on NFS, and silently fails in some situations. Even https://docs.rs/fs2/latest/fs2/trait.FileExt.html, which appears to be a fairly popular filesystem locking crate, says "File locks are a cross-platform hazard since the file lock APIs exposed by operating system kernels vary in subtle and not-so-subtle ways.", and is not regularly tested on illumos).
Also, ultimately, if some process failed partway through creation, and we needed to hand-off creation of the "seed directory" to a different test process, we'd need additional logic to distinguish between "this directory exists because I should use it" vs "this directory exists and I should try to salvage it, or destroy it and start over".
I thought the approach of "do an atomic rename" seemed simpler -- the directory can only (atomically) exist after it has been successfully created, and if any test failed during setup, their output would be isolated. Besides, the work of "spin up a CRDB node and populate it" was the default behavior for all tests using the DB before I tried adding this "seed" optimization. At worst, it's only a small performance hit for the first couple tests that are run (and if the schema changes, that seed directory still exists, so it can be re-used for subsequent test runs).
I'd have to second @smklein's concern about using file locks anywhere for tests.
First, a few things about file locking:
- The guidance in the article you link is not unfair from a historical perspective, but it is also more than a decade old, and describes issues with versions of Linux and Mac OS X that were old at that time
- We would use whole-file OFD-style exclusive locks; this avoids any of the issues one might bump into with shared locks and upgrades
- flock(3C) is robust and well-tested on illumos, FWIW. We built it to work the way it does on Linux (at the time, for our LX brand work), where it is also well-tested enough at this stage as far as I can tell. Our flock(3C) creates OFD-style locks, which we also support (again for compatibility with Linux) with `F_OFD_*` commands via fcntl(2).
- We're talking about files in `/tmp`, which should never be NFS-backed; the era of NFS root for diskless workstations has since passed
- As I recall, the fs2 crate is long-since abandoned; we forked it for a while as fs3, but I believe the community has since moved on to fs4
> if some process failed partway through creation, and we needed to hand-off creation of the "seed directory" to a different test process, we'd need additional logic to distinguish between "this directory exists because I should use it" vs "this directory exists and I should try to salvage it, or destroy it and start over".
I think this logic is relatively simple: put a marker file in the directory, or write marker data into the lock file, once the setup is complete. If the marker is not present and correct when you get to lock the file, remove the database files and start again. Otherwise, release the lock and use the database seed.
You could also use a combination of the atomic directory rename you're talking about, and the lock file: lock the lock file, and:
- if you see the final expected directory name, say, `database`, you know you can use it; drop the lock and do that
- otherwise:
  - if you see a temporary name, like, say, `notyet`, remove it now
  - create the database, then rename the directory to `database`

Either way, once you drop the lock, there will be a correctly provisioned seed in `database`.
Here is a simple example of using flock(3C) through the fs4 crate: https://github.com/jclulow/junk-locker
If it's okay with y'all, I'd like to punt on this. I'm not necessarily saying we shouldn't do it, but maybe not within this PR?
- This locking proposal is an optimization to avoid a minor amount of wasted work in tests
- If we introduce this mechanism, I do think it's more complicated, and would want to spend more time validating and testing it...
- ... and the benefits of that testing and optimization seem like they'd both be tossed out the window if we implement Support for setup and teardown scripts nextest-rs/nextest#683 anyway?
> ... and the benefits of that testing and optimization seem like they'd both be tossed out the window if we implement Support for setup and teardown scripts nextest-rs/nextest#683 anyway?
If we're going to toss it out anyway, could we just wait until we've properly assessed and prioritized whether we should do the nextest feature?
Another option I've heard people use is a `Mutex<()>`, almost like a barrier. See rust-lang/rust#43155 for some discussion on that. Generally this is for running tests themselves serially, but I could imagine adapting this purely for the setup instead.
The cleanest solution would be the upstream feature. If that were coming soon, then that would be ideal.
If not though, basically this is a game of "which way of emulating this feature sucks the least." Build scripts have their own annoying downsides. But I also agree with @smklein that this discussion feels like premature optimization. Is that actually the case? Does this patch slow down the test suite, and if so, by how much?
FWIW @steveklabnik I was more worried about intentionally racing the tests against each other. Maybe that's not a fair concern. I also agree that the upstream feature is cleanest, and hence was curious about that.
It seems potentially counterproductive to the goal of reducing iteration time to do a bunch of work in parallel many times over and throw most of it away. (Maybe that's the wrong way to look at it, though; it's just a cache, and we were doing this even more before the build script.) But I don't think I understand the problem this is trying to solve. It's true that if you're just building with no intent to run tests, this is extra work. How long does it actually take? Would it make sense to use

Is this just about rust-analyzer? Is it about the PATH issue involving cockroach, or also that it takes a long time?
Yes, it's just a fraction of "what we used to do before" from the testing point-of-view.
In a build, the old build script added a few seconds, depending on your machine.
This impacts building the Nexus integration tests, or just
This was a major part of it too. When I added the code to the build.rs script, I mentioned "this doesn't need to happen at build time" in a comment, and build.rs scripts are their own bag of complexity. Yes, this issue can be mitigated, but it's "just another pain point" building Nexus.
Closing in favor of #4150
We previously used a `build.rs` script to create a "seed" directory for CRDB, so that tests wouldn't need to duplicate the work of initializing the DB for all our tests.

This worked, but unfortunately meant that the work of CRDB construction happened at build-time, which (1) could be unnecessary, in cases where we aren't running tests, and (2) meant that we used a build.rs script to do this work, which would cause problems for rust-analyzer.
This PR replaces that build.rs script with a more dynamic approach: We still construct the seed directory, but we do so "at test time", letting the first few invocations race with each other to create a directory with a digest name based on the CRDB schema and version. "Whoever wins first" gets to populate the seed directory, and all subsequent tests rely on that version.