Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[nix] tests are failing/flakey #2103

Open
colemickens opened this issue Aug 17, 2023 · 11 comments
Open

[nix] tests are failing/flakey #2103

colemickens opened this issue Aug 17, 2023 · 11 comments

Comments

@colemickens
Copy link
Contributor

Description

Hi! Hope all is well.

jj is not building with nix right now, not sure if its a sandbox thing, guessing from the test names?

Steps to Reproduce the Problem

  1. nix build github:martinvonz/jj (or nix build github:martinvonz/jj/7ad2270c0)

Expected Behavior

It builds.

Actual Behavior

It fails when testing:

        FAIL [   0.116s] jj-lib::test_bad_locking test_bad_locking_children::git_backend
        FAIL [   0.120s] jj-lib::test_bad_locking test_bad_locking_children::local_backend

fuller log:

     Running [ 00:00:04] [===> ] 619/952: 14 running, 619 passed, 0 skipped     ^M     Running [ 00:00:04] [===> ] 620/952: 13 running, 620 passed, 0 skipped     ^M     Running [ 00:00:04] [===> ] 620/952: 14 running, 620 passed, 0 skipped     ^M     Running [ 00:00:04] [===> ] 620/952: 15 running, 620 passed, 0 skipp>
     Running [ 00:00:04] [===> ] 620/952: 16 running, 620 passed, 0 skipped     ^M        PASS [   0.018s] jj-lib::test_diff_summary test_matcher_dir_file_transition::local_backend
     Running [ 00:00:04] [===> ] 621/952: 15 running, 621 passed, 0 skipped     ^M     Running [ 00:00:04] [===> ] 622/952: 14 running, 622 passed, 0 skipped     ^M     Running [ 00:00:04] [===> ] 622/952: 15 running, 622 passed, 0 skipped     ^M        PASS [   0.049s] jj-lib::test_default_revset_graph_iterator test_>
     Running [ 00:00:04] [===> ] 622/952: 16 running, 622 passed, 0 skipped     ^M     Running [ 00:00:04] [===> ] 623/952: 15 running, 623 passed, 0 skipped     ^M        FAIL [   0.116s] jj-lib::test_bad_locking test_bad_locking_children::git_backend

--- STDOUT:              jj-lib::test_bad_locking test_bad_locking_children::git_backend ---

running 1 test
test test_bad_locking_children::git_backend ... FAILED

failures:

failures:
    test_bad_locking_children::git_backend

test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 3 filtered out; finished in 0.11s


--- STDERR:              jj-lib::test_bad_locking test_bad_locking_children::git_backend ---
thread 'test_bad_locking_children::git_backend' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 5, kind: Uncategorized, message: "Input/output error" }', lib/tests/test_bad_locking.rs:78:55
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

     Running [ 00:00:04] [===> ] 623/952: 16 running, 623 passed, 0 skipped     ^M   Canceling due to test failure: 15 tests still running
     Running [ 00:00:04] [] 624/952: 15 running, 623 passed, 1 failed, 0 skipped     ^M^M        PASS [   0.016s] jj-lib::test_diff_summary test_matcher_normal_cases::local_backend
   Canceling [ 00:00:04] [] 624/952: 15 running, 623 passed, 1 failed, 0 skipped     ^M^M   Canceling [ 00:00:04] [] 625/952: 14 running, 624 passed, 1 failed, 0 skipped     ^M^M   Canceling [ 00:00:04] [] 625/952: 15 running, 624 passed, 1 failed, 0 skipped     ^M^M        PASS [   0.022s] jj-lib::test_diff_summary t>
   Canceling [ 00:00:04] [] 625/952: 16 running, 624 passed, 1 failed, 0 skipped     ^M^M        FAIL [   0.120s] jj-lib::test_bad_locking test_bad_locking_children::local_backend

--- STDOUT:              jj-lib::test_bad_locking test_bad_locking_children::local_backend ---

running 1 test
test test_bad_locking_children::local_backend ... FAILED

failures:

failures:
    test_bad_locking_children::local_backend

test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 3 filtered out; finished in 0.11s


--- STDERR:              jj-lib::test_bad_locking test_bad_locking_children::local_backend ---
thread 'test_bad_locking_children::local_backend' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 5, kind: Uncategorized, message: "Input/output error" }', lib/tests/test_bad_locking.rs:48:54
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

   Canceling [ 00:00:04] [] 626/952: 15 running, 625 passed, 1 failed, 0 skipped     ^M^M        PASS [   0.020s] jj-lib::test_diff_summary test_sorting::git_backend
   Canceling [ 00:00:04] [] 627/952: 14 running, 625 passed, 2 failed, 0 skipped     ^M^M        PASS [   0.020s] jj-lib::test_diff_summary test_sorting::local_backend
   Canceling [ 00:00:04] [] 628/952: 13 running, 626 passed, 2 failed, 0 skipped     ^M^M        PASS [   0.014s] jj-lib::test_diff_summary test_tree_file_transition::local_backend
   Canceling [ 00:00:04] [] 629/952: 12 running, 627 passed, 2 failed, 0 skipped     ^M^M        PASS [   0.045s] jj-lib::test_default_revset_graph_iterator test_graph_iterator_virtual_octopus::skip_transitive_edges
   Canceling [ 00:00:04] [] 630/952: 11 running, 628 passed, 2 failed, 0 skipped     ^M^M        PASS [   0.016s] jj-lib::test_diff_summary test_types::git_backend
   Canceling [ 00:00:04] [] 631/952: 10 running, 629 passed, 2 failed, 0 skipped     ^M^M        PASS [   0.017s] jj-lib::test_diff_summary test_tree_file_transition::git_backend
   Canceling [ 00:00:04] [] 632/952: 9 running, 630 passed, 2 failed, 0 skipped     ^M^M        PASS [   0.018s] jj-lib::test_diff_summary test_types::local_backend
   Canceling [ 00:00:04] [] 633/952: 8 running, 631 passed, 2 failed, 0 skipped     ^M^M        PASS [   0.019s] jj-lib::test_git test_export_partial_failure
   Canceling [ 00:00:04] [] 634/952: 7 running, 632 passed, 2 failed, 0 skipped     ^M^M   Canceling [ 00:00:04] [] 634/952: 7 running, 632 passed, 2 failed, 0 skipped     ^M^M        PASS [   0.024s] jj-lib::test_git test_export_conflicts
   Canceling [ 00:00:04] [] 635/952: 6 running, 633 passed, 2 failed, 0 skipped     ^M^M        PASS [   0.023s] jj-lib::test_git test_export_import_sequence
   Canceling [ 00:00:04] [] 636/952: 5 running, 634 passed, 2 failed, 0 skipped     ^M^M        PASS [   0.032s] jj-lib::test_git test_bulk_update_extra_on_import_refs
   Canceling [ 00:00:04] [] 637/952: 4 running, 635 passed, 2 failed, 0 skipped     ^M^M        PASS [   0.035s] jj-lib::test_git test_concurrent_write_commit
   Canceling [ 00:00:04] [] 638/952: 3 running, 636 passed, 2 failed, 0 skipped     ^M^M        PASS [   0.046s] jj-lib::test_git test_concurrent_read_write_commit
   Canceling [ 00:00:04] [] 639/952: 2 running, 637 passed, 2 failed, 0 skipped     ^M^M   Canceling [ 00:00:04] [] 639/952: 2 running, 637 passed, 2 failed, 0 skipped     ^M^M   Canceling [ 00:00:05] [] 640/952: 1 running, 638 passed, 2 failed, 0 skipped     ^M^M        PASS [   2.715s] jj-cli::test_log_command test_>
   Canceling [ 00:00:05] [] 640/952: 1 running, 638 passed, 2 failed, 0 skipped     ^M^M   Canceling [ 00:00:05] [] 640/952: 1 running, 638 passed, 2 failed, 0 skipped     ^M^M   Canceling [ 00:00:05] [] 640/952: 0 running, 639 passed, 2 failed, 0 skipped     ^M^M------------
     Summary [   5.053s] 641/952 tests run: 639 passed, 2 failed, 0 skipped
        FAIL [   0.116s] jj-lib::test_bad_locking test_bad_locking_children::git_backend
        FAIL [   0.120s] jj-lib::test_bad_locking test_bad_locking_children::local_backend
   Canceling [ 00:00:05] [] 641/952: 0 running, 639 passed, 2 failed, 0 skipped     ^M^Merror: test run failed

Specifications

╭ zeph  ~/code/nixcfg 11sec 2ms
╰─▶ cat /etc/os-release
BUG_REPORT_URL="https://github.com/NixOS/nixpkgs/issues"
BUILD_ID="23.11.20230816.91ee80b"
DOCUMENTATION_URL="https://nixos.org/learn.html"
HOME_URL="https://nixos.org/"
ID=nixos
LOGO="nix-snowflake"
NAME=NixOS
PRETTY_NAME="NixOS 23.11 (Tapir)"
SUPPORT_URL="https://nixos.org/community.html"
VERSION="23.11 (Tapir)"
VERSION_CODENAME=tapir
VERSION_ID="23.11"

(Though this shouldn't matter, I don't override jj's nixpkgs)

@colemickens
Copy link
Contributor Author

note, while I was leaving open the chance for flakiness, I haven't actually seen these be flakey, just consistently failing.

@colemickens
Copy link
Contributor Author

Odd... pinning a few older revisions isn't helping, and then the blame via GitHub UI doesn't indicate changes to that test function recently.

@necauqua
Copy link
Contributor

necauqua commented Aug 17, 2023

This is odd, I do nix profile upgrade every few days and it's fine, and right now for me it succeeded too (and then nix build just immediately links it to result)

Looking at it being the classic generic "Input/output error" from the OS, maybe you have some exotic/broken filesystem or something?.

Basically the sandbox didn't work 😄

@colemickens
Copy link
Contributor Author

Yeah, it's bizarre. I've tried it on two different machines now, albeit they're practically identical.

My root is zfs, my tmp is tmpfs. Not sure... I might file a nixpkgs bug.

@colemickens
Copy link
Contributor Author

In the issue I filed on Nix, someone else said they were able to build. I'm building it in a loop.... sometimes only one test fails, but I haven't had a pass yet (even though it should be building on a single core with the settings I'm using).

@colemickens
Copy link
Contributor Author

I think the tests are flakey. I finally got it to build:

nu─▶  $env.NIX_BUILD_CORES = 1
nu─▶  nix build -L github:martinvonz/jj --option 'cores' '1' -j1 

Finally built it.

@martinvonz
Copy link
Member

The tests do a recursive file system copy of some repos they create. They do that by walking the file system. Could it be that you have something running in the background that notices these temporary directories and adds a temporary file in them, which the test might then fail to copy if the background process had already deleted it.

@colemickens
Copy link
Contributor Author

  1. I'm not sure why changing it to be single-threaded would be the issue, if the implication is it's something outside the build itself.

  2. Nix should be sandboxing this, such that it's being built in its own private /tmp/ directory.

  3. I do use this, which uses FB's watchman: https://github.com/jgavris/rs-git-fsmonitor but I'm not really sure that it writes files into its watched dirs/repos... or again, why it would magically go away as soon as I run a single-threaded build.

Not to hammer the point too hard, but I had dozens of failures before setting it to a single core, and then it immediately built.

@martinvonz
Copy link
Member

martinvonz commented Aug 20, 2023

Since you're able to reproduce it quite reliably, could you try applying this and see what it reports? I'm curious to see if it's a particular path every time.

diff --git a/lib/tests/test_bad_locking.rs b/lib/tests/test_bad_locking.rs
index 68ac86ebc9...0f31ee743a 100644
--- a/lib/tests/test_bad_locking.rs
+++ b/lib/tests/test_bad_locking.rs
@@ -45,7 +45,9 @@
         if child_left.is_dir() {
             sub_dirs.push(base_name.to_os_string());
         } else {
-            std::fs::copy(&child_left, child_output).unwrap();
+            if let Some(err) = std::fs::copy(&child_left, &child_output).err() {
+                panic!("failed to copy {child_left:?} to {child_output:?}: {err}");
+            }
         }
     }
     // Walk the base and find files removed in the right side, then remove them in
@@ -75,7 +77,9 @@
         } else if !child_base.exists() {
             // This overwrites the left side if that's been written. That's fine, since the
             // point of the test is that it should be okay for either side to win.
-            std::fs::copy(&child_right, child_output).unwrap();
+            if let Some(err) = std::fs::copy(&child_right, &child_output).err() {
+                panic!("failed to copy {child_right:?} to {child_output:?}: {err}");
+            }
         }
     }
     // Do the merge in subdirectories

@colemickens
Copy link
Contributor Author

Here's what I'm testing:
colemickens@3f732df

nix build github:colemickens/jj/3f732df20a76946b55f8bd5480923e598f314a7a -L [--rebuild]

(so default multi-core-ish behavior)

  • run 1
jujutsu-unstable> --- STDERR:              jj-lib::test_bad_locking test_bad_locking_children::git_backend ---                                                              
jujutsu-unstable> thread 'test_bad_locking_children::git_backend' panicked at 'failed to copy "/build/jj-test-4LYqZV/.jj/repo/index/operations/8165661ec2f17a967a621aa4b465a
fda6bf5335ba44624094ee9d71ee84e07fb587a62b343de6cd1ed7636b1de73e23ae7e2f78f796914b35cc1444f3d3fea80" to "/build/jj-test-V0CQnr/.jj/repo/index/operations/8165661ec2f17a967a6
21aa4b465afda6bf5335ba44624094ee9d71ee84e07fb587a62b343de6cd1ed7636b1de73e23ae7e2f78f796914b35cc1444f3d3fea80": Input/output error (os error 5)', lib/tests/test_bad_locking
.rs:49:17

jujutsu-unstable> --- STDERR:              jj-lib::test_bad_locking test_bad_locking_children::local_backend ---                                                                                                                       
jujutsu-unstable> thread 'test_bad_locking_children::local_backend' panicked at 'failed to copy "/build/jj-test-CVjNHq/.jj/repo/store/files/9063990e5c5b2184877f92adace7c801a549b00c39cd7549877f06d5dd0d3a6ca6eee42d5896bdac64831c8114c
55cee664078bd105dc691270c92644ccb2ce7" to "/build/jj-test-Qp4aFE/.jj/repo/store/files/9063990e5c5b2184877f92adace7c801a549b00c39cd7549877f06d5dd0d3a6ca6eee42d5896bdac64831c8114c55cee664078bd105dc691270c92644ccb2ce7": Input/output e
rror (os error 5)', lib/tests/test_bad_locking.rs:49:17 
  • run 2: ... of course it passes....

  • run 3: with --rebuild...

jujutsu-unstable> --- STDERR:              jj-lib::test_bad_locking test_bad_locking_children::local_backend ---
jujutsu-unstable> thread 'test_bad_locking_children::local_backend' panicked at 'failed to copy "/build/jj-test-VjtkfG/.jj/repo/store/trees/ec27fb5fc356eeeeb77ed78bc8e425ce0648796fb272b83aee088e143d60fdc077bfc884c1119ef69e90822f9b6d7f8203a2142c04b3af2549c0636e7d0d621d" to "/build/jj-test-5xomD8/.jj/repo/store/trees/ec27fb5fc356eeeeb77ed78bc8e425ce0648796fb272b83aee088e143d60fdc077bfc884c1119ef69e90822f9b6d7f8203a2142c04b3af2549c0636e7d0d621d": Input/output error (os error 5)', lib/tests/test_bad_locking.rs:49:17


jujutsu-unstable> --- STDERR:              jj-lib::test_bad_locking test_bad_locking_children::git_backend ---
jujutsu-unstable> thread 'test_bad_locking_children::git_backend' panicked at 'failed to copy "/build/jj-test-QlkzzQ/.jj/repo/op_store/views/ce408fc4a25805be3351cf04e48117d4b7549355d10d3d310cf0452142b2770cf2170fcbbd6419e6b92726369e87b9e0a59c1e506eaa1375e20f8259813849ab" to "/build/jj-test-btsajD/.jj/repo/op_store/views/ce408fc4a25805be3351cf04e48117d4b7549355d10d3d310cf0452142b2770cf2170fcbbd6419e6b92726369e87b9e0a59c1e506eaa1375e20f8259813849ab": Input/output error (os error 5)', lib/tests/test_bad_locking.rs:81:17

Let me know if you want me to script this to get pass/fail rate, or make a log of the fail cases, or if this is sufficient.

@martinvonz
Copy link
Member

I think that's sufficient, thanks. I still have no idea why it would sometimes fail to copy some files, however. If you're curious enough, I suppose you could reduce the test case to just manually build up some random directories and see if running the same merge_directories() function on them also fails.

martinvonz added a commit that referenced this issue Sep 17, 2023
I ran into some issues here when switching our tests to use
`.jj`-internal git repos. For example, the `std::fs::copy()` calls
started failing, which may be related to #2103. I think one problem is
that we could end up calling `merge_directories()` twice for the same
directory. This patch fixes that by deduping the paths we call with,
and makes the function assume that the output directory doesn't exist.
martinvonz added a commit that referenced this issue Sep 17, 2023
I ran into some issues here when switching our tests to use
`.jj`-internal git repos. For example, the `std::fs::copy()` calls
started failing, which may be related to #2103. I think one problem is
that we could end up calling `merge_directories()` twice for the same
directory. This patch fixes that by deduping the paths we call with,
and makes the function assume that the output directory doesn't exist.
martinvonz added a commit that referenced this issue Sep 18, 2023
I ran into some issues here when switching our tests to use
`.jj`-internal git repos. For example, the `std::fs::copy()` calls
started failing, which may be related to #2103. I think one problem is
that we could end up calling `merge_directories()` twice for the same
directory. This patch fixes that by deduping the paths we call with,
and makes the function assume that the output directory doesn't exist.
martinvonz added a commit that referenced this issue Sep 18, 2023
I ran into some issues here when switching our tests to use
`.jj`-internal git repos. For example, the `std::fs::copy()` calls
started failing, which may be related to #2103. I think one problem is
that we could end up calling `merge_directories()` twice for the same
directory. This patch fixes that by deduping the paths we call with,
and makes the function assume that the output directory doesn't exist.
Dr-Emann pushed a commit to Dr-Emann/jj that referenced this issue Sep 21, 2023
I ran into some issues here when switching our tests to use
`.jj`-internal git repos. For example, the `std::fs::copy()` calls
started failing, which may be related to jj-vcs#2103. I think one problem is
that we could end up calling `merge_directories()` twice for the same
directory. This patch fixes that by deduping the paths we call with,
and makes the function assume that the output directory doesn't exist.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants