Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

'*' matches entire path in fnmatch #72904

Closed
decibel mannequin opened this issue Nov 16, 2016 · 16 comments
Closed

'*' matches entire path in fnmatch #72904

decibel mannequin opened this issue Nov 16, 2016 · 16 comments
Labels
stdlib Python modules in the Lib dir type-feature A feature request or enhancement

Comments

@decibel
Copy link
Mannequin

decibel mannequin commented Nov 16, 2016

BPO 28718
Nosy @serhiy-storchaka, @MojoVampire, @decibel, @Hooloovoo, @tovrstra

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

Show more details

GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2016-11-16.20:12:07.121>
labels = ['library']
title = "'*' matches entire path in fnmatch"
updated_at = <Date 2019-03-31.12:59:09.371>
user = 'https://github.com/decibel'

bugs.python.org fields:

activity = <Date 2019-03-31.12:59:09.371>
actor = 'Toon Verstraelen'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)']
creation = <Date 2016-11-16.20:12:07.121>
creator = 'Jim Nasby'
dependencies = []
files = []
hgrepos = []
issue_num = 28718
keywords = []
message_count = 8.0
messages = ['280985', '281017', '281018', '288608', '290624', '307867', '339054', '339256']
nosy_count = 6.0
nosy_names = ['serhiy.storchaka', 'josh.r', 'Jim Nasby', 'Alberto Galera', 'aaron-whitehouse', 'Toon Verstraelen']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = None
url = 'https://bugs.python.org/issue28718'
versions = []

Linked PRs

@decibel
Copy link
Mannequin Author

decibel mannequin commented Nov 16, 2016

A '*' in fnmatch.translate is converted into '.*', which will greedily match directory separators. This doesn't match shell behavior, which is that * will only match file names:

decibel@decina:[14:07]~$ls \~/tmp/*/1|head
ls: /Users/decibel/tmp/*/1: No such file or directory
decibel@decina:[14:07]~$ls \~/tmp/d*/base/1|head
112

From a posix standpoint, this would easily be fixed by using '[^/]*' instead of '.*'. I'm not sure how to make this work cross-platform though.

It's worth noting that some programs (rsync, git) support **, which would correctly translate to '.*'.

@decibel decibel mannequin added the stdlib Python modules in the Lib dir label Nov 16, 2016
@MojoVampire
Copy link
Mannequin

MojoVampire mannequin commented Nov 17, 2016

Presumably something like:

r'(?:' + r'|'.join({re.escape(os.path.sep), re.escape(os.path.altsep)}) + r')'

would cover it completely. I switched to using non-capturing groups over a character class both to deal with the fact that escaping doesn't work the same way for character classes and to cover the possibility (no idea here) that some terrible OS might have a multicharacter path separator.

@MojoVampire
Copy link
Mannequin

MojoVampire mannequin commented Nov 17, 2016

Oops, altsep is None, not the empty string when there is only one separator. And I didn't handle inverting the match. Sigh. You get the idea.

@Hooloovoo
Copy link
Mannequin

Hooloovoo mannequin commented Feb 26, 2017

Note that somebody has forked the standard library to implement this:
https://github.com/kianxineki/python-wildcard
This shows that the actual changes would be pretty small (though pywildcard is based on 2.x code and does not handle the cross-platform slashes you have been discussing).

It is also worth noting that the glob standard library:
https://docs.python.org/3.7/library/glob.html
implements a "recursive" option that has similar behaviour (* does not span path separators whereas ** does) and essentially builds this on top of fnmatch for the actual filename matching.

I do not think we can change the default behaviour of fnmatch at this point, but I would like to see this behaviour triggered by an optional argument to the various functions, e.g.:
fnmatch.fnmatch(filename, pattern, glob_asterisks=False)
fnmatch.fnmatchcase(filename, pattern, glob_asterisks=False)
fnmatch.filter(names, pattern, glob_asterisks=False)
fnmatch.translate(pattern, glob_asterisks=False)

In each case, if glob_asterisks (or whatever other name we came up with) is true, the behaviour would match the pywildcard behaviour, i.e.:
** matches everything
* matches in one path level

I look after the glob matching code in duplicity and would like to start using the standard library to do filename matching for us, but we need the above behaviour. I am happy to do the patching if there is a realistic chance of it being accepted.

@Hooloovoo Hooloovoo mannequin changed the title '*' matches entire path in fnmatch.translate '*' matches entire path in fnmatch Feb 26, 2017
@Hooloovoo
Copy link
Mannequin

Hooloovoo mannequin commented Mar 27, 2017

Posted to the [Python-ideas] mailing list, as it is proposing a change to a standard library:
https://mail.python.org/pipermail/python-ideas/2017-February/044880.html

Nobody has responded so far, however. I take this as at least no vehement objection to the idea.

@AlbertoGalera
Copy link
Mannequin

AlbertoGalera mannequin commented Dec 8, 2017

I see that they have commented on the lib that I made a few years ago (python-wildcard).

The reason for the creation of that little fork started in this issue:

https://bugs.python.org/issue25734

@tovrstra
Copy link
Mannequin

tovrstra mannequin commented Mar 28, 2019

For consistency with the corresponding feature in the glob function since Python 3.5, I would suggest to add an extra optional argument 'recursive' instead of 'glob_asterisks'. With the default recursive=False, one gets the old behavior, with recursive=True, it can handle the '**' and '*' as in pywildcard.

I realize that with recursive=False, the behavior is not exactly consistent with glob, but I'd still prefer the same name for the optional argument. It is the common terminology for this type of feature. See https://en.wikipedia.org/wiki/Matching_wildcards

@tovrstra
Copy link
Mannequin

tovrstra mannequin commented Mar 31, 2019

Just for reference, here are a few more implementations of the same idea, next to pywildcard, sometimes combined with other useful features:

The last one is rather active, with regular releases, last one on March 24, 2019.

@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
@barneygale
Copy link
Contributor

I have an implementation of this for pathlib:

It exploits a simple trick: swapping path separators and newlines, and then matching without setting re.DOTALL. This causes * (and other wildcards) to not match directory separators.

If folks thought it was a good idea, we could instead put the implementation in an fnmatch.globmatch() or glob.match() function.

@barneygale
Copy link
Contributor

Some other differences between fnmatch and glob:

  • **/ has a special meaning in glob.glob(). You might think it means "any number of characters, followed by one slash". In fact it means "any number of characters, as long as we terminate after a slash". Importantly, this means it can match nothing at all, and so a pattern like /etc/**/**/**/hosts will match /etc/hosts.
  • * as a standalone path segment also has a special meaning. You might think it means "any number of any non-slash characters". In fact it means "at least one non-slash character", i.e. it always consumes precisely one path component.

barneygale added a commit to barneygale/cpython that referenced this issue Jul 12, 2023
If a sequence of path separators is given to the new argument,
`translate()` produces a pattern that matches similarly to
`pathlib.Path.glob()`. Specifically:

- A `*` pattern segment matches precisely one path segment.
- A `**` pattern segment matches any number of path segments
- If `**` appears in any other position within the pattern, `ValueError` is
  raised.
- `*` and `?` wildcards in other positions don't match path separators.

This change allows us to factor out a lot of complex code in pathlib.
@barneygale
Copy link
Contributor

PR available: #106703

@barneygale
Copy link
Contributor

The PR above adds an argument to fnmatch.translate(), but doesn't change fnmatchcase(). It think it would better to add a function to the glob module:

def match(path: os.PathLike, pattern: os.PathLike) -> bool:
    ...

This would take care of supplying os.path.sep (and altsep if applicable) to fnmatch.translate(pattern, seps=...).

Questions:

  • Does this sound OK?
  • Should glob.match() accept a recursive argument?
  • Should glob.match() accept an include_hidden argument?

@barneygale barneygale added the type-feature A feature request or enhancement label Jul 13, 2023
barneygale added a commit to barneygale/cpython that referenced this issue Jul 19, 2023
@barneygale
Copy link
Contributor

barneygale added a commit to barneygale/cpython that referenced this issue Sep 23, 2023
@wimglenn
Copy link
Contributor

Adding to the helpful list from tovrstra, I can recommend pathspec.

barneygale added a commit to barneygale/cpython that referenced this issue Sep 26, 2023
Use `re.Scanner` to scan shell-style patterns, rather than parsing them
by hand in a fat loop. This makes the code slower (!) but more obvious, and
lays some groundwork for a future `glob.translate()` function.
barneygale added a commit to barneygale/cpython that referenced this issue Sep 26, 2023
barneygale added a commit to barneygale/cpython that referenced this issue Sep 30, 2023
barneygale added a commit to barneygale/cpython that referenced this issue Oct 28, 2023
barneygale added a commit that referenced this issue Nov 13, 2023
Add `glob.translate()` function that converts a pathname with shell wildcards to a regular expression. The regular expression is used by pathlib to implement `match()` and `glob()`.

This function differs from `fnmatch.translate()` in that wildcards do not match path separators by default, and that a `*` pattern segment matches precisely one path segment. When *recursive* is set to true, `**` pattern segments match any number of path segments, and `**` cannot appear outside its own segment.

In pathlib, this change speeds up directory walking (because `_make_child_relpath()` does less work), makes path objects smaller (they don't need a `_lines` slot), and removes the need for some gnarly code.

Co-authored-by: Jason R. Coombs <[email protected]>
Co-authored-by: Adam Turner <[email protected]>
@barneygale
Copy link
Contributor

Addressed in Python 3.13 / cf67ebf / #106703 -- we've added a new glob.translate() function that converts a glob expression into a regular expression.

aisk pushed a commit to aisk/cpython that referenced this issue Feb 11, 2024
Add `glob.translate()` function that converts a pathname with shell wildcards to a regular expression. The regular expression is used by pathlib to implement `match()` and `glob()`.

This function differs from `fnmatch.translate()` in that wildcards do not match path separators by default, and that a `*` pattern segment matches precisely one path segment. When *recursive* is set to true, `**` pattern segments match any number of path segments, and `**` cannot appear outside its own segment.

In pathlib, this change speeds up directory walking (because `_make_child_relpath()` does less work), makes path objects smaller (they don't need a `_lines` slot), and removes the need for some gnarly code.

Co-authored-by: Jason R. Coombs <[email protected]>
Co-authored-by: Adam Turner <[email protected]>
@hauntsaninja
Copy link
Contributor

@barneygale minor thing, I noticed that the pat argument is documented as pathname. Should we make the arg positional-only or rename the arg to match the documentation?

Glyphack pushed a commit to Glyphack/cpython that referenced this issue Sep 2, 2024
Add `glob.translate()` function that converts a pathname with shell wildcards to a regular expression. The regular expression is used by pathlib to implement `match()` and `glob()`.

This function differs from `fnmatch.translate()` in that wildcards do not match path separators by default, and that a `*` pattern segment matches precisely one path segment. When *recursive* is set to true, `**` pattern segments match any number of path segments, and `**` cannot appear outside its own segment.

In pathlib, this change speeds up directory walking (because `_make_child_relpath()` does less work), makes path objects smaller (they don't need a `_lines` slot), and removes the need for some gnarly code.

Co-authored-by: Jason R. Coombs <[email protected]>
Co-authored-by: Adam Turner <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stdlib Python modules in the Lib dir type-feature A feature request or enhancement
Projects
Development

No branches or pull requests

3 participants