Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

glob() should take a Path not str #78

Open
remram44 opened this issue Apr 16, 2019 · 6 comments
Open

glob() should take a Path not str #78

remram44 opened this issue Apr 16, 2019 · 6 comments

Comments

@remram44
Copy link

Paths on UNIX are bytes, not unicode. Using str means that all of them can't be represented.

In particular, on systems set with a different locale, or when using disks that were created in a different locale, or simply because of some glitch somewhere, files using non-UTF-8 byte sequence can totally exist.

Not only is it not that much of a corner-case, but the standard library already dealt with that by providing the Path type. AsRef<Path> can also be used to allow users to still pass in string slices and string literals.

Related to #23 which is not about the interface but a bug in the internals.

@BurntSushi
Copy link
Member

I'm not sure there is anyone actively working on this crate, and fixing this bug probably requires non-trivial changes. You might consider using globset instead.

@remram44
Copy link
Author

Oh, sorry about the noise 😅 It showed up in the "GitHub Explore" mailing-list "based on your public repository contributions".

Thanks for the pointer.

@remram44
Copy link
Author

remram44 commented Apr 16, 2019

globset seems to have the same problem, Glob::new() takes a string. This works fine for Americans and on UTF-8 machines but make it impossible to pass valid Path values to it.

[edit: and globset doesn't have a dedicated issue tracker]

@BurntSushi
Copy link
Member

BurntSushi commented Apr 16, 2019

No, globset does not suffer from the same problem. globset does require valid UTF-8 for the pattern, but it's matching function can accept any path.

[edit: and globset doesn't have a dedicated issue tracker]

It might some day, but you can just file bugs against ripgrep with globset mentioned in the issue.

If you have a use case for specifying a pattern that isn't valid UTF-8, then I would certainly be happy to have a bug for that, but please provide a real example, because it's not actually clear what the intended semantics should be.

Also, the focus on "Americans" is completely unnecessary. UTF-8 isn't specific to Americans.

@remram44
Copy link
Author

On a ISO-8859-15 system, I might want to match Glob::new(b"d\xE9cembre 2018 - *.jpg"). Or I might want to pass arguments from args_os() (without panicking on non-UTF-8 filenames) or clap.

I might also just want to build a pattern from an existing Path, for example to find rsync-style temp files (.<filename>.randomstring e.g. .cv_rémi.pdf.GV4H3). Path.file_name() correctly returns &OsStr which I can easily manipulate, but I can't feed it back to Glob::new().

@BurntSushi
Copy link
Member

Thanks for the example. I filed an issue and turned your example into code: BurntSushi/ripgrep#1250

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants