Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Notes from a new user #17

Closed
Screwtapello opened this issue May 24, 2018 · 8 comments
Closed

Notes from a new user #17

Screwtapello opened this issue May 24, 2018 · 8 comments

Comments

@Screwtapello
Copy link
Collaborator

Screwtapello commented May 24, 2018

I was pointed to this crate from the comments on a Reddit post I made, and thought I'd take notes about things that confused me while reading the documentation.

Errors always include filenames? That's awesome!

In the list of types introduced in this crate, PathArc is first and therefore presumably most important, but it's not actually used anywhere in the examples? So maybe it's just an implementation detail I should ignore?

In the crate-level docs, PathAbs says "An absolute (not necessarily canonicalized) path that may or may not exist" but the documentation for PathAbs::new() says "The path must exist or io::Error will be returned."? Must the path exist, or not?

I'm interested in working with paths that may or may not exist, but the example at the top of the file only creates these structs with ::create(), creating them on disk at the same time, so it doesn't particularly help me.

I see PathArc has an absolute() method, which does the broken thing of resolving /../ segments semantically, which means it's almost never safe to use. It's a shame; the other parts of its functionality are pretty useful, and there's other useful cleanups it could perform (like removing empty // path segments, and normalizing delimiters to / or \ depending on platform), but while there's a chance that it will corrupt a path I'd rather not use it.

The specific functionality I was looking for in my Reddit post, and which I can't seem to find in your crate, was partial canonicalization. Instead of Rust's canonicalize which requires the the entire path exist, a method that requires all-but-the-last-segment exists, or even a method that silently switches from real canonicalization to semantic canonicalization at the first segment that does not exist.

@vitiral
Copy link
Owner

vitiral commented May 24, 2018

Thanks for opening this issue!

Errors always include filenames? That's awesome!

😄

In the list of types introduced in this crate, PathArc is first and therefore presumably most important, but it's not actually used anywhere in the examples? So maybe it's just an implementation detail I should ignore?

I opened #18. We should probably move it to the bottom and make the docs clear that it is a low-cost implementation detail.

In the crate-level docs, PathAbs says "An absolute (not necessarily canonicalized) path that may or may not exist" but the documentation for PathAbs::new() says "The path must exist or io::Error will be returned."? Must the path exist, or not?

I opened #19. It used to do this but was changed in 0.4.0. This is a bug (thanks!)

I'm interested in working with paths that may or may not exist, but the example at the top of the file only creates these structs with ::create(), creating them on disk at the same time, so it doesn't particularly help me.

I made a comment to this effect in #19. Thanks!

I see PathArc has an absolute() method, which does the broken thing of resolving /../ segments semantically, which means it's almost never safe to use. It's a shame; the other parts of its functionality are pretty useful, and there's other useful cleanups it could perform (like removing empty // path segments, and normalizing delimiters to / or \ depending on platform), but while there's a chance that it will corrupt a path I'd rather not use it.

I think the docs could be improved here, but basically this is doing this for two specific reasons:

User Surprise: paths are semantically resolved rather than physically:

  • Usecase: user's $PWD is /a/b/c but /a/b/c is a symlink for /a/x/y/z. Following .. will therefore get them to /a/x/y which I would argue most users wouldn't expect in a "absolute path" library. They would expect that for the canonicalized path!
  • Paths like /a/b/c/../d/e/f/.. are rarely created to cleverly resolve symlinks. Very few users would expect that kind of path could point to /g/h/z but in fact it could since who knows where the .. goes. /a/b/d/e (i.e. semantically resolved) is pretty reasonable.
  • If you want all symlinks to resolve, just use absolute.
  • Note: it mimicks python's normpath

Performance: the method you specified (silently switching from "real" canonicalization to semantic") would require that we make a syscall for each path segment (except the last) every time PathAbs is created. Compare that to the current scheme where I make only a single syscall a few cases on Windows. (I also make a syscall to get the current directory if the path is relative, but that is unavoidable).

See #6 for more discussion of why we made the decisions we did. I don't think you are wrong, but I think the performance tradeoffs are too high to make a method like the one you propose be the default.

I would be very open to a pull request to "normalize" a path by following all symlinks (including ..) and make sure they exist. I think such a method would have to raise IOError if any segment didn't exist (including the last one). It would be extremely important to note the performance penalty for such a method, but I can definitely see why you would find it useful!

@Screwtapello
Copy link
Collaborator Author

Like I said, my use-case is paths that may or may not exist. More specifically, I'm trying to work with something like a cache: I absolutely know (and ensure) that the path to the cache exists, but then I construct a bunch of paths inside the cache that may or may not exist, and I might create them if they don't.

In this case, the path-to-the-cache probably isn't supplied directly by the user; it may come from an environment variable, or a hard-coded convention that may involve absolutifying (there needs to be a better verb for that) path-fragments gathered from other sources, and so it could include any number of /../ path segments. I find it hard to reason about such paths when I find them in other programs' logs and errors, so I try to avoid them in my programs by canonicalizing paths wherever possible. Since semantic canonicalization can change the meaning of a path, that could result in my program putting its cache files in the wrong place, a problem I wouldn't want to be woken up at 3AM to fix.

As for performance, like I said, I expect all my canonicalization to be done once at program startup and then re-used to create new paths. Paths my program constructs don't have any /../ in them so they don't need to be re-canonicalized. Maybe I want a Path struct that remembers how much has already been canonicalized so it doesn't need to repeat that work later.

To be honest, I'm not exactly sure what I want, I'm just trying to come up with something more robust and ergonomic than the Python APIs I'm used to. Perhaps I should play around with my own implementation to form a more educated opinion.

@Screwtapello
Copy link
Collaborator Author

I've spent some more time thinking about exactly the behaviour that I want, and I've decided to call it "monotonic paths". I describe monotonic paths in (bewildering) detail in the cli-wg issue, but the basic invariant that describes a monotonic path is "after the path's head (i.e. the / or C:\ components), all following components are names, not /./ or /../". Those named components might or might not be symlinks, they might or might not exist on disk, it doesn't matter.

When it comes to actually creating monotonic paths, plain old normalisation (as currently used by PathArc) fulfils the invariant, but (as described) it can change the meaning of a path. I've got an algorithm here that only tries to dereference the component immediately preceding a /../ segment; if it's not a symlink or it doesn't exist on disk, the algorithm uses plain old normalisation.

In comparison with other approaches:

  • vs. full canonicalisation (std::fs::canonicalize): my algorithm doesn't require paths to exist, and it can safely be used to talk about symlinks directly (re: Moving/copying/removing a symbolic link #6).
  • vs. partial canonicalisation (Python''s pathlib.Path.resolve(strict=False)): my algorithm issues fewer syscalls, since it does not try to stat every path component
  • vs. normalisation (path_abs, Python's normpath()): should be equivalently fast in the common case (no /../ components), but in the uncommon case gives the same results as the kernel itself
  • vs. just joining relative paths to the current working directory: my algorithm produces cleaner, easier-to-understand paths in the face of /../ segments, and also my algorithm works on Windows.

While figuring this out, I wrote an implementation of these ideas, complete with a test suite that passes on Linux and Windows 10 (and mostly passes on Wine). If you're interested in taking a look, I'll polish up the docstrings and stick it in a gist or something; I'd love for path_abs to use it.

@vitiral
Copy link
Owner

vitiral commented Jun 2, 2018 via email

@Screwtapello
Copy link
Collaborator Author

I polished up my implementation, and stuck it in a gist for you to look at.

  • Yes, it's long, but fully half of it is docstrings and comments to explain what's going on and why things are the way they are. I may have gone overboard. 😳
  • There's an Absolute struct, which represents a fully monotonic, absolute path, with a new() method that validates/converts an arbitrary path.
  • There's a Relative struct, which represents a fully monotonic, relative path, likewise.
  • If you have an Absolute, it's very cheap to join a Relative on the end because we can assume Relative::new() already did the validation. Otherwise, there's not much use for Relative.
  • Unlike PathAbs, I just call clone everywhere because it was simpler, and because i didn't think adding reference-counting would make my idea clearer. I can't see any reason why reference-counting couldn't be added.
  • I haven't added any of the useful methods that PathArc adds, again for simplicity.
  • The only non-stdlib dependency in the code is log. The tests also use env_logger (to help debugging test failures) and tempfile (to make a temporary directories to construct test-cases in).

Have a great vacation, and I'd love to hear what you think when you get back.

@vitiral
Copy link
Owner

vitiral commented Jun 15, 2018

Hey, thanks for doing all this work!

Is it possible you can open a PR? Even just copy/pasting the file into something like path_abs/notes/absolute.rs and opening a PR will let me leave comments/review inline. It's so much code that I think I need that.

Screwtapello added a commit to Screwtapello/path_abs that referenced this issue Jun 18, 2018
Adds a "monotonic_path_preview" module containing my monotonic path handling
code, for ease of review, as requested in vitiral#17. The module is made public, so the
documentation can be viewed with `cargo doc`.
@Screwtapello
Copy link
Collaborator Author

I've created PR #20 as a very quick-and-dirty inclusion of my code into path_abs, including making the module public (so you can check the docs with cargo doc) and adding the test- and logging-dependencies so you can run the tests.

It turns out that while the tests themselves pass, the doctests do not. I prototyped this code in a binary crate rather than a library, and it turns out that not only does cargo test not execute doctests in a binary crate, apparently it doesn't even syntax-check them.

@vitiral
Copy link
Owner

vitiral commented Feb 14, 2019

I think you are no longer a new user, and we can now close this 😄

@vitiral vitiral closed this as completed Feb 14, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants