Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hh globstar #38

Closed
wants to merge 9 commits into from
Closed

Hh globstar #38

wants to merge 9 commits into from

Conversation

hhaensel
Copy link

@hhaensel hhaensel commented Jun 5, 2024

This PR addresses #19

@vtjnash
Copy link
Owner

vtjnash commented Jun 5, 2024

Thank you for the PR. Unfortunately, #19 is about adding support for ** to glob, not to fnmatch, so the PR here will not address that any more than the last time this implementation was attempted: #21 (comment)

If you want to fix #19, you need to write an updated algorithm for the glob readdir functions in this file that can successfully traverse extra directory levels when encountering a **, and then filter out any duplicates from the resulting list

@hhaensel
Copy link
Author

hhaensel commented Jun 5, 2024

Oh, I hadn't seen that implementation of @oxinabox. My implementation is indeed almost identical.

One way of supporting globstar syntax for glob() would be to define it completely differently:

function globstar(g::Union{AbstractString,Glob.GlobMatch}, prefix::AbstractString = "";
    relative::Union{Bool, Nothing} = nothing,
    topdown::Bool = true,
    follow_symlinks::Bool = true,
    onerror::Union{Function, Nothing} = nothing
)
    g = Glob.GlobMatch(g)
    relative === nothing && (relative = isempty(prefix))
    isempty(prefix) && (prefix = pwd())

    fn = FilenameMatch(join([fn isa AbstractString ? fn : fn.pattern for fn in g.pattern], "/"), PATHNAME)
    matches = String[]
    for (root, dirs, files) in walkdir(prefix; topdown, follow_symlinks, onerror)
        for file in files
            file = joinpath(root, file)
            relfile = relpath(file, prefix)
            relpattern = Sys.iswindows() ? replace(relfile, '\\' => '/') : relfile
           
            occursin(fn, relpattern) && push!(matches, relative ? relfile : file)
        end
    end
    matches    
end

With this definition you get

julia> globstar(glob"**/*.jl")
9-element Vector{String}:
 "atest.jl"
 "btest.jl"
 "ctest.jl"
 "noskip\\test1.jl"
 "noskip\\test2.jl"
 "noskip\\test3.jl"
 "noskip\\hh\\test3.jl"
 "skip\\test.jl"
 "skip\\test2.jl"

julia> globstar(glob"**/*.jl", ".")
9-element Vector{String}:
 ".\\atest.jl"
 ".\\btest.jl"
 ".\\ctest.jl"
 ".\\noskip\\test1.jl"
 ".\\noskip\\test2.jl"
 ".\\noskip\\test3.jl"
 ".\\noskip\\hh\\test3.jl"
 ".\\skip\\test.jl"
 ".\\skip\\test2.jl"

julia> globstar(glob"**/*.jl", pwd())
9-element Vector{String}:
 "C:\\Users\\hh\\test\\atest.jl"
 "C:\\Users\\hh\\test\\btest.jl"
 "C:\\Users\\hh\\test\\ctest.jl"
 "C:\\Users\\hh\\test\\noskip\\test1.jl"
 "C:\\Users\\hh\\test\\noskip\\test2.jl"
 "C:\\Users\\hh\\test\\noskip\\test3.jl"
 "C:\\Users\\hh\\test\\noskip\\hh\\test3.jl"
 "C:\\Users\\hh\\test\\skip\\test.jl"
 "C:\\Users\\hh\\test\\skip\\test2.jl"

Edit: fixed rootdir
Edit 2: refactored to support prefix and other options for walkdir.

@hhaensel
Copy link
Author

hhaensel commented Jun 5, 2024

The option relative is also interesting in cases where you want relative paths with respect to a rootdir, e.g. for additional filtering.

@hhaensel
Copy link
Author

hhaensel commented Jun 5, 2024

P.S.: the above implementation only works with the current PR in place.

@vtjnash
Copy link
Owner

vtjnash commented Jun 5, 2024

Yes, that seems somewhat right. It is less flexible than the current implementation (which avoids making any assumptions about the contents of what makes up a path and permits arbitrary matches in the sequence including regexes), but otherwise somewhat accurate to it

@coveralls
Copy link

Coverage Status

coverage: 93.559% (+1.9%) from 91.634%
when pulling ba9dc2b on hhaensel:hh-globstar
into 5956425 on vtjnash:master.

@hhaensel
Copy link
Author

hhaensel commented Jun 6, 2024

One could define this function for handling string args and otherwise test whether the pattern vector contains "**" and doesn't contain regexes and only then call this function, otherwise keep the old version in place for compatibility.
Tell me how to proceed.

@hhaensel
Copy link
Author

hhaensel commented Jun 6, 2024

There's a severe bug in the approach of my occursin(). Needs some rethinking.

@vtjnash
Copy link
Owner

vtjnash commented Jun 6, 2024

Your code seems actually pretty close to right, since you do need to first generate a list of all candidates and then filter out the ones that don't match. To maintain the rest of the matching ability, you just need to replace occursin(fn, relpattern) with a call to splitpath(fn) and then write a matcher function to see if each path vector matches with all of the tail patterns, and if not, then to chop off one prefix from the path vector and try matching the tail pattern again, until either the path vector is exhausted or a match succeeds

@hhaensel
Copy link
Author

hhaensel commented Jun 6, 2024

It seemed very close, but it actually wasn't, because latest with multiple globstars the function would always fail.

So I reverted the changes and instead wrapped the main part of the function in a while loop which restarts the matching process with a new directory. Mutliple levels of globstars are supported by using an array of globstarmatches.

I also added directory matching and a fileonly mode.

Finally, I added a bunch of tests and hope that all edge cases are contained.
Don't hesitate to add more if it seems useful.

There is one point I observed when matching the old way:

julia> glob(["a", r".", "c"])
4-element Vector{String}:
 "a\\.b\\c"
 "a\\.c\\c"
 "a\\b\\c"
 "a\\c\\c"

So it seems that one would have to set start and end markers in order to only match a single character for the middle directory.

@coveralls
Copy link

Coverage Status

coverage: 92.877% (+1.2%) from 91.634%
when pulling fa30a05 on hhaensel:hh-globstar
into 5956425 on vtjnash:master.

@hhaensel
Copy link
Author

hhaensel commented Jun 7, 2024

I think I got everything sorted out by now.
Concerning the partial regex match, I think the best way is to adapt the docs and replace r"." by r"^.$"

@hhaensel
Copy link
Author

hhaensel commented Jun 7, 2024

Some final thought perhaps:

  • Currently the glob"" macro does not support options although the constructor converts parts of the input string to FilenameMatch objects. Shouldn't we allow to supply options and pass them to the generated fn"" objects?
    The only difficulty I see with this is that currently you default to PATHNAME|PERIOD. Switching off PATHNAME doesn't really makes sense, but we'd need to either change the default setting to no-PERIOD or offer a switching off parameter, e.g. with a capital 'P'.
  • Does the pathname option still makes sense, once globstar is working?

@hhaensel
Copy link
Author

@vtjnash I continued thinking about my statement about multiple globstars. I thought that I had found an example that would not pass without my array queue but I couldn't repeat it. Mathematically I now think, it is not necessary to retry with a new position of the first wildcard as soon as the second one is reached. So not queuing saves a lot of computing. Therefore, I went back to simple indexing as you did with star and all tests pass.
During that scan I found one error in my code with globstar_period which should be solved now.

@coveralls
Copy link

Coverage Status

coverage: 93.277% (+1.6%) from 91.634%
when pulling 783e3d9 on hhaensel:hh-globstar
into 5956425 on vtjnash:master.

@vtjnash
Copy link
Owner

vtjnash commented Jun 13, 2024

I just want to apologize that I have barely any time for development for the next couple weeks, but I should be fully back in July. Originally my expectation was that if someone wanted fancier options, they could construct the object manually, instead of using the macro to split it.

@hhaensel
Copy link
Author

hhaensel commented Jun 13, 2024

@vtjnash understood, thanks for the feedback.

I'm still not completely satisfied with the current status for various reasons

  • globstar matching I

    • you implemented glob exactly according to the linked IEEE standard, which is great, but it does not support globstar.
    • standard glob-like implementations, e.g. gitignore, vscode search, accept globstar but don't restrict '*' to not match '/'.

    We could easily support gitignore-patterns, but we either need another flag to switch on/off globstar matching or we always turn on globstar and break with the IEEE standard

  • globstar matching II
    Currently, I went for detecting "**/", but it would be correct to rather detect "/**/"

  • negation pattern
    gitignore implements a negation pattern by a leading '!', should we support that syntax?

  • glob / readdir
    My current implementation uses walkdir() which always lists all files in all sub directories even when it is clear that the pattern does not match. A better implementation would iterate the tree explicitly and not travel through non-matching subdirs.

If ever you find the time for thinking about these points, just drop a note here

@0x0f0f0f
Copy link

0x0f0f0f commented Aug 2, 2024

Any news on this one?

@vtjnash
Copy link
Owner

vtjnash commented Aug 9, 2024

Yeah, I am finally able to get back to this. Thank you for your patience and persistence!

I think we should break this down into a couple different PRs for easier discussion and review:

  1. adding globstar-like matching to fnmatch
  2. implementing globstar in glob
  3. considering adding extensions such as ! as an additional feature

Would you mind making a new PR with just the fnmatch improvements? We can keep this open until everything is merged, but it would help me with thinking through reviewing that it handles ** correctly. Secondly though, I see there are a couple of different, distinct interpretations of what globstar means. We will need to document what exactly we intend for this repository to implement. The zsh code is MIT licensed, so we can copy this text directly:

https://man.freebsd.org/cgi/man.cgi?query=zshexpn&sektion=1

Looks like the patterns to detect are **/, /**/, /**, and ** alone, or more concisely, it can be treated as being equivalent to r"(^/)**(/$)" for pattern detection. I appreciate that zsh then treats **/ as matching zero or more directories (instead of one or more as bash does), so I think that is worth copying from them.

some accept globstar but don't restrict '*'

This sounds like the expected behavior of using fnmatch without the FNM_PATHNAME flag. Is there an additional distinction?

defaulting to PATHNAME

I don't entirely know why I did that. Sounds like a bug actually, as I meant for it to generate separate fnmatch calls for each directory, such that each component of the glob was split on / but would match any filename it found in there, even if the filenames themselves could end up being arbitrary strings (e.g. if the content being matched was from something filesystem-like, but not actually a posix filesystem)

@hhaensel
Copy link
Author

hhaensel commented Aug 13, 2024

That sounds like a good proposal. I will come up with a globstar PR for fnmatch asap.

@hhaensel
Copy link
Author

hhaensel commented Oct 4, 2024

Close in favour of #39

@hhaensel hhaensel closed this Oct 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants