Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] support multiple s=>r Pairs in strings replace method #35414

Closed
wants to merge 17 commits into from
Closed

[RFC] support multiple s=>r Pairs in strings replace method #35414

wants to merge 17 commits into from

Conversation

francescoalemanno
Copy link
Contributor

@francescoalemanno francescoalemanno commented Apr 9, 2020

Stab @ addressing #35327

semantics respected:

    @test replace("foobarbaz","oo"=>"zz","ar"=>"zz","z"=>"m") == "fzzbzzbam"
    @test replace("foobarbaz","z"=>"m","oo"=>"zz","ar"=>"zz") == "fzzbzzbam"
    @test replace("foobarbaz","z"=>"m","oo"=>"zz","ar"=>"zz",count=2) == "fzzbzzbaz"
    @test replace("foobarbaz","z"=>"m",r"a.*a"=>uppercase) == "foobARBAm"
    @test replace("foobarbaz",'o'=>'z','a'=>'q','z'=>'m') == "fzzbqrbqm"

edit 9 apr: added fast specialization for Char=>Char mappings
edit 10 apr: generalized fast specialization for Char=>? mappings using IOBuffer

@francescoalemanno francescoalemanno marked this pull request as draft April 9, 2020 08:24
@francescoalemanno francescoalemanno marked this pull request as ready for review April 9, 2020 10:34
@fredrikekre fredrikekre added the strings "Strings!" label Apr 9, 2020
@francescoalemanno francescoalemanno changed the title support multiple s=>r Pairs in strings replace method [RFC] support multiple s=>r Pairs in strings replace method Apr 10, 2020
francescoalemanno and others added 17 commits April 10, 2020 17:24
removed tuple non sense
Update util.jl

tests for replace

cleanup

removed tuple non sense

Update util.jl

fix

remove useless type specialization

fast method for multiple Char => Char

costruct Dict in one go

support count also in Char mapping replace

relax constraint on Char=>Char into Char=>?

added correct type constraining

fixed bug in counting

much better testing on count semantic

full permutation set
@stevengj
Copy link
Member

If the string has length n and there are N replacements, it seems that this implementation has O(n²N) complexity in both time and space? (Even for N=1, it calls itself recursively with a new string slice for each replacement, which is O(n²).)

As I also discussed in #30457 and #25396, my basic issue with this functionality is the difficulty of getting good performance with this API. We might as well tell people to use

foldl(replace, patterns, init=s)

which at least has O(nN) complexity.

You can get better complexity by building up a regex and using a dictionary of replacments, but you still deny the caller the possibility of precomputing the regex and dictionary for the common case of calling this function for the same patterns on many strings.

@stevengj
Copy link
Member

I suppose that one option would be to define a function stringreplacer(subs::Pair…) that returns a "replacer" function via a regex and a dictionary of replacements, then then define replace(s::AbstractString, subs::Pair...) = replacer(subs...)(s).

@stevengj
Copy link
Member

For example:

const regex_metachars = ('\\','^','$','.','|','?','*','(',')','[','{')
function backslash_metachars(s::AbstractString)
    buf = IOBuffer()
    for c in s
        if c in regex_metachars
            print(buf, '\\', c)
        else
            print(buf, c)
        end
    end
    return String(take!(buf))
end

function stringreplacer(patdict::AbstractDict{<:AbstractString,<:AbstractString})
    r = Regex(join((backslash_metachars(s) for s in keys(patdict)), '|'))
    return s::AbstractString -> replace(s, r => x -> patdict[x])
end
    
stringreplacer(pats::Pair...) = stringreplacer(Dict(pats...))

at which point you can do

julia> stringreplacer("oo"=>"zz","ar"=>"zz","z"=>"m")("foobarbaz")
"fzzbzzbam"

However, rather than having stringreplacer return a function, it would be even nicer to return type, so that (for example) it can be displayed nicely.

But, as has been suggested multiple times every time this has come up, it would probably make more sense to try putting together a StringReplacer.jl package before we consider merging this into Base.

@francescoalemanno
Copy link
Contributor Author

Thank you Steven for your feedback :)

If the string has length n and there are N replacements, it seems that this implementation has O(n²N) complexity in both time and space?
this general method i wrote is indeed not ideal under any perspective performance-wise

As I also discussed in #30457 and #25396, my basic issue with this functionality is the difficulty of getting good performance with this API.
We might as well tell people to use

foldl(replace, patterns, init=s)

very true, altough i do not believe the foldl approach is of any use...
with this code i wanted to match the behaviour of replace on Vectors for example, where substitutions are by definition non overlapping. for example replace('ab','a'=>'b','b'=>'a') should not be a no-op imho. i also wanted to allow the usage of any of the Pair{A,B} types that are allowed for Strings

You can get better complexity by building up a regex and using a dictionary of replacments, but you still deny the caller the possibility of precomputing the regex and dictionary for the common case of calling this function for the same patterns on many strings.

that would indeed perform much better, the only problem with that approach is that it is restricted to pairs of Stringable=>Stringable, e.g a replace with replace(s,'a'=>'b',mycustom_Char_filter => mymapping) will definitely not be possible.

as a very first step i was aiming at a slow but interface-wise complete implementation.

I suppose that one option would be to define a function stringreplacer(subs::Pair…) that returns a "replacer" function via a regex and a dictionary of replacements, then then define replace(s::AbstractString, subs::Pair...) = replacer(subs...)(s).

this sounds very nice, with a "replacer" it would be possible to construct case specific optimized code for cases where the user is not using String-like pairs.

I understand now that this code is not by far Base material as there are many better options.

Do you think it is a viable option (in order to get it merged eventually) turning this PR, into a replacer + helper method, which will work only for String-like pairs (ATM)?

by the way, happy easter :)

@stevengj
Copy link
Member

To quantify my point about scaling, compare the stringreplacer above that uses a single precomputed regex with this PR's implementation:

replacer = stringreplacer("oo"=>"zz","ar"=>"zz","z"=>"m")
@btime $replacer("foobarbaz")
@btime $replacer("foobarbaz"^10000)

@btime replace_thisPR("foobarbaz", "oo"=>"zz","ar"=>"zz","z"=>"m")
@btime replace_thisPR($("foobarbaz"^10000), "oo"=>"zz","ar"=>"zz","z"=>"m");

which gives:

  773.821 ns (17 allocations: 720 bytes)
  4.629 ms (90009 allocations: 3.85 MiB)
  2.169 μs (80 allocations: 3.88 KiB)
  1.066 s (680050 allocations: 977.45 MiB)

In other words, for a short string the regex is 3x faster, and for a long string (90kB) it is 200x faster and allocates 200x less memory.

@stevengj
Copy link
Member

Do you think it is a viable option (in order to get it merged eventually) turning this PR, into a replacer + helper method, which will work only for String-like pairs (ATM)?

I think a better approach would be to create a package.

@francescoalemanno
Copy link
Contributor Author

francescoalemanno commented Apr 12, 2020

However, rather than having stringreplacer return a function, it would be even nicer to return type, so that (for example) it can be displayed nicely.

But, as has been suggested multiple times every time this has come up, it would probably make more sense to try putting together a StringReplacer.jl package before we consider merging this into Base.

agreed a type would be nicer, as for making a package (unless crazy strong functionality hard to reach by Base is implemented) a very simple code similar to what you just posted, would slowly bitrot and die never tried by any user (in my opinion).

In any case, thank you for the detailed feedback 👍

@francescoalemanno francescoalemanno deleted the patch-1 branch April 12, 2020 18:58
@rapus95
Copy link
Contributor

rapus95 commented Apr 13, 2020

sorry for coming back at this but @stevengj why do you always propose to not to add that functionality to Base at all? If people seriously care about performance they will look up their bottlenecks and solve them. Base is mostly for very generic convenience. Like replace.
You keep saying the API leads to hard to optimize situations and will only be picked when people look for performance. I'd like to question that argument since Base shouldn't be following the performance over generality mentality. I'm quite sure that people will mostly try to call that variant to replace a certain set of symbols by another set of fill ins. w/o collisions or multiple occurences. For that, replace is the first function to look up. So there definitively should be a generic implementation in Base. If you then see that it becomes a bottleneck, just plug in some (by then maybe available) StringReplacer.jl package. As that latter package will hold specializations for Base.replace it becomes a true drop-in replacement. IMO that's the way Base should work. Like it's done for Matrix multiplication
Base: Make common idioms work generically (though they might be slower than a specialized version)
Package: Make specific common idioms faster/specialize them
In my case I was looking for that:
replace(samplepeer, "<id>"=> id, "<Remote public key>" => pubkey, "<Remote IP>" => theip)
sure, I could easily solve it via reduce. But that definitively is not what you look up when you just want to replace a few tags. Also it'd behave differently than what is expected from replace and wouldn't be faster than the naive approach.

@stevengj
Copy link
Member

stevengj commented Apr 13, 2020

My concern is that people keep proposing APIs in Base that are nearly impossible to optimize, so you can't drop in a faster implementation.

That means that we are stuck forever with "use the Base API only if you don't care about performance," which is not generally what we aspire to for Base functions.

vtjnash added a commit to vtjnash/julia that referenced this pull request Apr 14, 2021
This has been attempted before, sometimes fairly similar to this, but
the attempts seemed to be either too simple or too complicated. This
aims to be simple, and even beats one of the "handwritten" benchmark
cases.

Past issues (e.g. JuliaLang#25396) have proposed that using Regex may be faster,
but in my tests, this handily bests even simplified regexes. There can
be slow Regexes patterns that can cause this to exhibit O(n^2) behavior,
but only if the one of the earlier patterns is a partial match for a
later pattern Regex and that Regex always matches O(n) of the input
stream. This is a case that is hopefully usually avoidable in practice.

fixes JuliaLang#35327
fixes JuliaLang#39061
fixes JuliaLang#35414
fixes JuliaLang#29849
fixes JuliaLang#30457
fixes JuliaLang#25396
JeffBezanson pushed a commit that referenced this pull request Jun 7, 2021
This has been attempted before, sometimes fairly similar to this, but
the attempts seemed to be either too simple or too complicated. This
aims to be simple, and even beats one of the "handwritten" benchmark
cases.

Past issues (e.g. #25396) have proposed that using Regex may be faster,
but in my tests, this handily bests even simplified regexes. There can
be slow Regexes patterns that can cause this to exhibit O(n^2) behavior,
but only if the one of the earlier patterns is a partial match for a
later pattern Regex and that Regex always matches O(n) of the input
stream. This is a case that is hopefully usually avoidable in practice.

fixes #35327
fixes #39061
fixes #35414
fixes #29849
fixes #30457
fixes #25396
JeffBezanson pushed a commit that referenced this pull request Jun 7, 2021
This has been attempted before, sometimes fairly similar to this, but
the attempts seemed to be either too simple or too complicated. This
aims to be simple, and even beats one of the "handwritten" benchmark
cases.

Past issues (e.g. #25396) have proposed that using Regex may be faster,
but in my tests, this handily bests even simplified regexes. There can
be slow Regexes patterns that can cause this to exhibit O(n^2) behavior,
but only if the one of the earlier patterns is a partial match for a
later pattern Regex and that Regex always matches O(n) of the input
stream. This is a case that is hopefully usually avoidable in practice.

fixes #35327
fixes #39061
fixes #35414
fixes #29849
fixes #30457
fixes #25396
shirodkara pushed a commit to shirodkara/julia that referenced this pull request Jun 9, 2021
This has been attempted before, sometimes fairly similar to this, but
the attempts seemed to be either too simple or too complicated. This
aims to be simple, and even beats one of the "handwritten" benchmark
cases.

Past issues (e.g. JuliaLang#25396) have proposed that using Regex may be faster,
but in my tests, this handily bests even simplified regexes. There can
be slow Regexes patterns that can cause this to exhibit O(n^2) behavior,
but only if the one of the earlier patterns is a partial match for a
later pattern Regex and that Regex always matches O(n) of the input
stream. This is a case that is hopefully usually avoidable in practice.

fixes JuliaLang#35327
fixes JuliaLang#39061
fixes JuliaLang#35414
fixes JuliaLang#29849
fixes JuliaLang#30457
fixes JuliaLang#25396
johanmon pushed a commit to johanmon/julia that referenced this pull request Jul 5, 2021
This has been attempted before, sometimes fairly similar to this, but
the attempts seemed to be either too simple or too complicated. This
aims to be simple, and even beats one of the "handwritten" benchmark
cases.

Past issues (e.g. JuliaLang#25396) have proposed that using Regex may be faster,
but in my tests, this handily bests even simplified regexes. There can
be slow Regexes patterns that can cause this to exhibit O(n^2) behavior,
but only if the one of the earlier patterns is a partial match for a
later pattern Regex and that Regex always matches O(n) of the input
stream. This is a case that is hopefully usually avoidable in practice.

fixes JuliaLang#35327
fixes JuliaLang#39061
fixes JuliaLang#35414
fixes JuliaLang#29849
fixes JuliaLang#30457
fixes JuliaLang#25396
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
strings "Strings!"
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants