Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Cache] Cache daemon is not Windows compatible #4167

Closed
smorimoto opened this issue Jan 28, 2021 · 33 comments · Fixed by ocaml/opam-repository#18405
Closed

[Cache] Cache daemon is not Windows compatible #4167

smorimoto opened this issue Jan 28, 2021 · 33 comments · Fixed by ocaml/opam-repository#18405
Assignees
Milestone

Comments

@smorimoto
Copy link
Member

The cache daemon uses Unix.fork and is not Windows compatible. It should be easy to fix, but to make sure I don't forget that, I write it down as a quick issue.

@rgrinberg
Copy link
Member

The direct mode should work on Windows however.

@smorimoto
Copy link
Member Author

Yeah, I just skimmed the code, but the direct mode should work.

@smorimoto
Copy link
Member Author

Oh, no, the direct mode doesn't seem to work on Windows too.

#=== ERROR while compiling re.1.9.0 ===========================================#
# context              2.0.8 | win32/x86_64 | ocaml-variants.4.10.2+mingw64c | git+https://github.com/fdopen/opam-repository-mingw.git#opam2
# path                 ~/.opam/ocaml-variants.4.10.2+mingw64c/.opam-switch/build/re.1.9.0
# command              C:\Users\runneradmin\.opam\ocaml-variants.4.10.2+mingw64c\bin\dune.exe build -p re -j 4
# exit-code            1
# env-file             ~/.opam/log/re-6536-a5b00a.env
# output-file          ~/.opam/log/re-6536-a5b00a.out
### output ###
# Error: utimes: C:\Users\runneradmin\Local Settings\Cache\dune\db\beacon: The
# operation completed successfully.
# 

@rgrinberg
Copy link
Member

rgrinberg commented Feb 13, 2021

# Error: utimes: C:\Users\runneradmin\Local Settings\Cache\dune\db\beacon: The
# operation completed successfully.```

Most bizarre error message. There's an error, but the message indicates success?!

cc @dra27 @tmattio who perhaps have more experience Unix.utimes on Windows.

@dra27
Copy link
Member

dra27 commented Feb 13, 2021

The error message comes from Unix.(error_message (EUNKNOWNERR 0)) and is the string representation of ERROR_SUCCESS. I couldn't see a single point in the Dune sources to improve the error message, but what's happened is equivalent to displaying errno but finding it's 0.

@smorimoto
Copy link
Member Author

@dra27 -sensei always works incredibly efficiently...

@rgrinberg
Copy link
Member

Fixed in 2.8.3

@smorimoto
Copy link
Member Author

I don't think this should be closed because daemon is still not working. Only direct mode now works.

@rgrinberg
Copy link
Member

Indeed. Let’s keep it open until we fix the other modes as well.

@rgrinberg rgrinberg reopened this Mar 8, 2021
@smorimoto
Copy link
Member Author

smorimoto commented Mar 21, 2021

We've made some progress in the direct mode, but I'm seeing another strange problem that only occurs on Windows.

https://github.com/ocsigen/js_of_ocaml/runs/2159269998?check_suite_focus=true#step:8:321

#=== ERROR while compiling uri.4.1.0 ==========================================#
# context     2.0.8 | win32/x86_64 | ocaml-variants.4.12.0+mingw64c | git+https://github.com/fdopen/opam-repository-mingw.git#opam2
# path        ~/.opam/ocaml-variants.4.12.0+mingw64c/.opam-switch/build/uri.4.1.0
# command     C:\Users\runneradmin\.opam\ocaml-variants.4.12.0+mingw64c\bin\dune.exe build -p uri -j 2
# exit-code   1
# env-file    ~/.opam/log/uri-4800-0543f2.env
# output-file ~/.opam/log/uri-4800-0543f2.out
### output ###
# Error: unlink: C:\Users\runneradmin\Local
# Settings\Cache\dune\db\promoting.641381.188\data: Permission denied


#=== ERROR while compiling ppxlib.0.22.0 ======================================#
# context     2.0.8 | win32/x86_64 | ocaml-variants.4.12.0+mingw64c | git+https://github.com/fdopen/opam-repository-mingw.git#opam2
# path        ~/.opam/ocaml-variants.4.12.0+mingw64c/.opam-switch/build/ppxlib.0.22.0
# command     C:\Users\runneradmin\.opam\ocaml-variants.4.12.0+mingw64c\bin\dune.exe build -p ppxlib -j 2 @install
# exit-code   1
# env-file    ~/.opam/log/ppxlib-4800-4c6492.env
# output-file ~/.opam/log/ppxlib-4800-4c6492.out
### output ###
# Error: unlink: C:\Users\runneradmin\Local
# Settings\Cache\dune\db\promoting.ae8efc.180\data: Permission denied

@rgrinberg
Copy link
Member

What about other packages? Say angstrom for example, did it build successfully?

I'm not sure why there's no permissions. The directory is being created by dune inside a location which should be writeable.

        Temp.temp_in_dir ~perms:0o700 Temp.Dir ~dir:root ~prefix:"promoting."
          ~suffix:("." ^ string_of_int (Unix.getpid ()))

Perhaps @snowleopard or @dra27 might know?

@smorimoto
Copy link
Member Author

smorimoto commented Mar 22, 2021

Yeah, this is strange. To avoid such problems, actions-ml automatically overrides the some permissions every time on Windows like this.

C:\windows\system32\fsutil.exe behavior query SymlinkEvaluation
  Local to local symbolic links are enabled.
  Local to remote symbolic links are enabled.
  Remote to local symbolic links are disabled.
  Remote to remote symbolic links are disabled.

C:\windows\system32\fsutil.exe behavior set symlinkEvaluation R2L:1 R2R:1

C:\windows\system32\fsutil.exe behavior query SymlinkEvaluation
  Local to local symbolic links are enabled.
  Local to remote symbolic links are enabled.
  Remote to local symbolic links are enabled.
  Remote to remote symbolic links are enabled.

@smorimoto
Copy link
Member Author

I just tried to install angstrom, and it works as expected!

@dra27
Copy link
Member

dra27 commented Mar 22, 2021

I have a suspicion about what this might be, but I'm a bit pressed for time (as ever!). Please could I have a clear repro case - ideally a Dune invocation, but an opam one will do which repeatably shows this failure?

@smorimoto
Copy link
Member Author

I will create a repro case after I get home.

@smorimoto
Copy link
Member Author

@dra27
Copy link
Member

dra27 commented Mar 23, 2021

Thanks, @smorimoto! I was able to take that and reduce it to a direct build of ocaml-uri with Dune. It's repeatable which is handy. I'm very puzzled as to why you're messing around with fsutil at all - what's that meant to be fixing? (in particular, I think the workflow you point to includes that change, but still breaks).

The issue is a race condition - I've had a brief exchange with @snowleopard and we're suspicious of:

dune/src/cache/local.ml

Lines 238 to 278 in 5326730

let tmp =
let dst = Path.relative cache.temp_dir "data" in
if Path.exists dst then Path.unlink dst;
duplicate ~duplication cache ~src:abs_path ~dst;
dst
in
let effective_digest = Digest.file_with_stats tmp (Path.stat tmp) in
if Digest.compare effective_digest expected_digest != Ordering.Eq then (
let message =
Printf.sprintf "digest mismatch: %s != %s"
(Digest.to_string effective_digest)
(Digest.to_string expected_digest)
in
cache.info [ Pp.text message ];
Result.Error message
) else
let in_the_cache = file_path cache effective_digest in
(* CR-someday: we assume that if the file with [effective_digest] exists
in the file storage, then its content matches the digest, i.e. the user
never modifies it. In principle, we could add a consistency check but
this would have a non-negligible performance cost. A good compromise
seems to be to add a "paranoid" mode to Dune cache where we always
check file contents for consistency with the expected digest, so one
could enable it when needed. In the paranoid mode, we could furthermore
check for a digest collision via [Io.compare_files in_the_cache tmp]. *)
match Path.exists in_the_cache with
| true ->
(* We no longer need the temporary file. *)
Path.unlink tmp;
(* Update the timestamp of the existing cache entry, moving it to the
back of the trimming queue. *)
Path.touch in_the_cache;
Result.Ok (Already_promoted { path; digest = effective_digest })
| false ->
Path.mkdir_p (Path.parent_exn in_the_cache);
(* Move the temporary file to the cache. *)
Path.rename tmp in_the_cache;
(* Remove write permissions, making the cache entry immutable. We assume
that users do not modify the files in the cache. *)
Path.chmod in_the_cache ~mode:(stat.st_perm land 0o555);
Result.Ok (Promoted { path; digest = effective_digest })

The file data is created by multiple processes. Using a process-specific value (or, as I did, "data" ^ string_of_int (Random.int 65536)!) eliminates the problem, but a better fix should be sought.

The unlink exception is almost certainly a consistent attempt to delete a file which is being written - in the ocaml-uri case, it's an executable being cached so on Windows it will be big and probably triggers Defender so there's a big enough delay to see the race all the time. On Unix, I expect it's causing cache misses?

@snowleopard snowleopard self-assigned this Mar 23, 2021
@snowleopard
Copy link
Collaborator

Thanks for the investigation @dra27! I self-assigned to fix the race, hopefully, I'll get to it soon.

@smorimoto Could you confirm if using the Random.int trick solves the issue for you?

@smorimoto
Copy link
Member Author

@smorimoto
Copy link
Member Author

@dra27 Although fsutil is mainly set for Cygwin, it has nothing to do with this caching problem!

@snowleopard
Copy link
Collaborator

@snowleopard That works:
https://github.com/smorimoto/ocaml-dune-repro-4167/runs/2174690105

@smorimoto That's great, thank you for confirming.

@dra27
Copy link
Member

dra27 commented Mar 23, 2021

@dra27 Although fsutil is mainly set for Cygwin, it has nothing to do with this caching problem!

What's it being used for on Cygwin?

@smorimoto
Copy link
Member Author

IIRC, it was necessary for winsymlinks:native.
https://sourceware.org/legacy-ml/cygwin/2013-05/msg00405.html

snowleopard added a commit that referenced this issue Mar 26, 2021
Fix a race in Dune cache. It was particularly easy to hit this race when using
the cache on Windows, see #4167.

Signed-off-by: Andrey Mokhov <[email protected]>
@snowleopard
Copy link
Collaborator

@smorimoto I've fixed the race in #4406. It would be very helpful if you could confirm if the main branch works for you.

@smorimoto
Copy link
Member Author

@snowleopard That was very fast work! and the PR fixed the bug as expected: https://github.com/smorimoto/ocaml-dune-repro-4167/runs/2209405229

@snowleopard
Copy link
Collaborator

@smorimoto Thank you, happy to see the fix worked!

Do we need to fix anything else in the cache daemon on Windows, or can this issue be now closed?

@rgrinberg rgrinberg modified the milestones: 2.8.4, 2.8.5 Mar 29, 2021
rgrinberg pushed a commit that referenced this issue Mar 29, 2021
Fix a race in Dune cache. It was particularly easy to hit this race when using
the cache on Windows, see #4167.

Signed-off-by: Andrey Mokhov <[email protected]>
Signed-off-by: Rudi Grinberg <[email protected]>
@dra27
Copy link
Member

dra27 commented May 17, 2021

I'm not sure that @fpottier's was intended to close this?!

@smorimoto
Copy link
Member Author

Definitely not! Due to strange specs of GitHub, an incorrect close occurs! 🤣

@dra27 dra27 reopened this May 17, 2021
@snowleopard
Copy link
Collaborator

By the way, we've recently completely rewritten the implementation of the shared cache in Dune. In particular, there is no cache daemon anymore. It is still worth testing the new implementation on Windows, of course.

The docs for the new shared cache can be found here: https://github.com/ocaml/dune/blob/main/doc/caching.rst.

(For some reason, the docs here haven't been regenerated yet.)

@smorimoto
Copy link
Member Author

Oh, then we can close this. Are changes such as cache going to be released as dune 3?

@snowleopard
Copy link
Collaborator

Are changes such as cache going to be released as dune 3?

Yes, that's right.

@smorimoto
Copy link
Member Author

I can't wait for dune 3! Thank you for a lot of your efforts on this.

@snowleopard
Copy link
Collaborator

@smorimoto We're looking forward to Dune 3 too! Thanks for your help with testing Dune cache on Windows.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants