Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add case_sensitive argument to pathlib.PurePath.match() #104484

Closed
barneygale opened this issue May 14, 2023 · 11 comments
Closed

Add case_sensitive argument to pathlib.PurePath.match() #104484

barneygale opened this issue May 14, 2023 · 11 comments
Labels
topic-pathlib type-feature A feature request or enhancement

Comments

@barneygale
Copy link
Contributor

barneygale commented May 14, 2023

In #81079 we added a case_sensitive argument to pathlib.Path.glob() and rglob(). It would be good to have this functionality available in the closely-related PurePath.match() method. The rationale is much the same: this argument is useful when dealing with case-sensitive filesystems on Windows, and case-insensitive filesystems on Posix. Also, it would allow this method to be used from glob() to implement recursive globbing efficiently, which is important for #102613.

Linked PRs

@barneygale
Copy link
Contributor Author

I'd like to land #101398 before I start work on this.

@thirumurugan-git
Copy link
Contributor

Hi @barneygale,
I believe it's a relatively simple problem that I can help fix. May I have the opportunity to work on it?

@thirumurugan-git
Copy link
Contributor

@barneygale kindly review this PR

@barneygale
Copy link
Contributor Author

Added in #104565 / dcdc90d. Thank you, @thirumurugan-git!

carljm added a commit to carljm/cpython that referenced this issue May 18, 2023
* main:
  pythongh-74690: Don't set special protocol attributes on non-protocol subclasses of protocols (python#104622)
  pythongh-104623: Update Windows installer to use SQLite 3.42.0 (python#104625)
  pythongh-104050: Add more type annotations to Argument Clinic (python#104628)
  pythongh-104629: Don't skip test_clinic if _testclinic is missing (python#104630)
  pythongh-104549: Set __module__ on TypeAliasType (python#104550)
  pythongh-104050: Improve some typing around `default`s and sentinel values (python#104626)
  pythongh-104146: Remove unused vars from Argument Clinic (python#104627)
  pythongh-104615: don't make unsafe swaps in apply_static_swaps (python#104620)
  pythonGH-104484: Add case_sensitive argument to `pathlib.PurePath.match()` (pythonGH-104565)
  pythonGH-96803: Document and test new unstable internal frame API functions (pythonGH-104211)
  pythonGH-104580: Don't cache eval breaker in interpreter (pythonGH-104581)
@eryksun
Copy link
Contributor

eryksun commented May 18, 2023

It should be noted in the docs that Path.match() has changed in 3.12 to use re.IGNORECASE instead of normalizing the path and pattern with os.path.normcase(). On Windows, os.path.normcase() is based on the system's lowercase mapping, while re.IGNORECASE is based on Python's lowercase mapping. The two differ for 200 BMP characters and 260 non-BMP characters. Thus some patterns may match a path in 3.11 but not in 3.12, and vice versa.

@eryksun eryksun reopened this May 18, 2023
@barneygale
Copy link
Contributor Author

It used str.lower() in 3.11:

cpython/Lib/pathlib.py

Lines 189 to 190 in d1bfefd

def casefold_parts(self, parts):
return [p.lower() for p in parts]

@eryksun
Copy link
Contributor

eryksun commented May 19, 2023

I lost track of changes in 3.12. I thought it always used normcase() for case folding in match() as well as the hash and comparison methods. Instead, this issue restores the use of str.lower() in match(). However, the hash and comparison methods will be inconsistent on Windows because they still use _str_normcase based on os.path.normcase(). Thus match() could be true while __eq__() is false, or vice versa, given component names that contain any of the 460 characters for which str.lower() and ntpath.normcase() are different.

@barneygale
Copy link
Contributor Author

Here's a table showing how __eq__(), match() and glob() are implemented in 3.11 and 3.12:

3.11 3.12
__eq__() str.lower() os.path.normcase()
match() str.lower() re.IGNORECASE
glob() re.IGNORECASE re.IGNORECASE

You can see that things were already inconsistent in 3.11. For 3.12, we've:

  1. Brought __eq__() more in line with Windows' filename comparisons
  2. Brought match() in line with its sister method glob()

I wouldn't mind changing __eq__() and friends back to using lower() if you thought that best, @eryksun!

@eryksun
Copy link
Contributor

eryksun commented May 24, 2023

@barneygale

AFAIK, str.lower() and re.IGNORECASE are consistent. The inconsistency is with ntpath.normcase() on Windows. It's probably for the best if the pure comparison methods revert to using str.lower() for the sake of consistency, not only with glob() and match(), but also with using PureWindowsPath on POSIX. Maybe platform-dependent comparisons could be implemented on Path.

On Windows, ntpath.normcase() is based on LCMapStringEx(). It turns out that this function implements a case mapping for some non-BMP characters. WinAPI CompareStringOrdinal(), on the other hand, has no case mapping for non-BMP characters, which is consistent with Microsoft's filesystems. Thus I'd prefer for a platform-dependent comparison to use CompareStringOrdinal() instead of LCMapStringEx().

Given that any tree in a Linux ext4 filesystem can be made case insensitive, and any tree in a Windows NTFS filesystem can be made case sensitive (even a non-empty directory if the registry setting allows it), I'd prefer for a platform-dependent comparison to support partially realized behavior in this regard. Maybe the Path constructor could allow overriding the default case sensitivity for all of a path or a subtree of a path.

@barneygale
Copy link
Contributor Author

Thank you Eryk. I've logged a new issue here: #104947. Will get it fixed pronto!

AFAIK, str.lower() and re.IGNORECASE are consistent.

Seemingly not in all circumstances:

>>> s = 'ı'
>>> t = 'i'
>>> s.lower() == t.lower()
False
>>> import re
>>> re.match(s, t, flags=re.IGNORECASE)
<re.Match object; span=(0, 1), match='i'>

Given that any tree in a Linux ext4 filesystem can be made case insensitive, and any tree in a Windows NTFS filesystem can be made case sensitive (even a non-empty directory if the registry setting allows it), I'd prefer for a platform-dependent comparison to support partially realized behavior in this regard. Maybe the Path constructor could allow overriding the default case sensitivity for all of a path or a subtree of a path.

This is on my to-do list :)

@barneygale
Copy link
Contributor Author

Re-resolving this ticket per the above.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic-pathlib type-feature A feature request or enhancement
Projects
None yet
Development

No branches or pull requests

3 participants