[RFC] support multiple s=>r Pairs in strings replace method #35414
Update util.jl tests for replace cleanup removed tuple non sense Update util.jl fix remove useless type specialization fast method for multiple Char => Char costruct Dict in one go support count also in Char mapping replace relax constraint on Char=>Char into Char=>? added correct type constraining fixed bug in counting much better testing on count semantic full permutation set
If the string has length As I also discussed in #30457 and #25396, my basic issue with this functionality is the difficulty of getting good performance with this API. We might as well tell people to use
which at least has O(nN) complexity. You can get better complexity by building up a regex and using a dictionary of replacements, but you still deny the caller the possibility of precomputing the regex and dictionary for the common case of calling this function for the same patterns on many strings.
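The inline snippet is missing from the comment above, but a later reply calls it "the foldl approach", so it was presumably a fold over the pairs. A minimal sketch under that assumption (`replace_sequential` is a hypothetical name, not something from the PR):

```julia
# Sequential replacement: each pair is applied to the output of the previous
# one, so a string of length n is rescanned once for each of the N patterns.
replace_sequential(s::AbstractString, pats::Pair...) =
    foldl((acc, p) -> replace(acc, p), pats; init=s)

result = replace_sequential("foobarbaz", "oo"=>"zz", "ar"=>"zz", "z"=>"m")
println(result)  # "fmmbmmbam": the "zz" inserted earlier is rewritten by "z"=>"m"
```

Note that this is not the same semantics as a simultaneous replacement: later patterns also see the text produced by earlier ones.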
I suppose that one option would be to define a function

For example:

    # Characters with special meaning in a regex; each is backslash-escaped below.
    const regex_metachars = ('\\','^','$','.','|','?','*','+','(',')','[','{')

    # Escape every regex metacharacter in `s` so that it matches literally.
    function backslash_metachars(s::AbstractString)
        buf = IOBuffer()
        for c in s
            if c in regex_metachars
                print(buf, '\\', c)
            else
                print(buf, c)
            end
        end
        return String(take!(buf))
    end

    # Build the alternation regex once and return a closure that reuses it.
    function stringreplacer(patdict::AbstractDict{<:AbstractString,<:AbstractString})
        r = Regex(join((backslash_metachars(s) for s in keys(patdict)), '|'))
        return s::AbstractString -> replace(s, r => x -> patdict[x])
    end
    stringreplacer(pats::Pair...) = stringreplacer(Dict(pats...))

at which point you can do

    julia> stringreplacer("oo"=>"zz","ar"=>"zz","z"=>"m")("foobarbaz")
    "fzzbzzbam"

However, rather than having

But, as has been suggested multiple times every time this has come up, it would probably make more sense to try putting together a
Thank you Steven for your feedback :)
Very true, although I do not believe the foldl approach is of any use...
That would indeed perform much better. The only problem with that approach is that it is restricted to pairs of Stringable=>Stringable, e.g. a replace with As a very first step, I was aiming at a slow but interface-wise complete implementation.
This sounds very nice: with a "replacer" it would be possible to construct case-specific optimized code for cases where the user is not using String-like pairs. I understand now that this code is far from Base material, as there are many better options. Do you think turning this PR into a replacer + helper method, which will work only for String-like pairs (ATM), is a viable option (in order to get it merged eventually)? By the way, happy Easter :)
To quantify my point about scaling, compare the

    using BenchmarkTools  # provides @btime

    replacer = stringreplacer("oo"=>"zz","ar"=>"zz","z"=>"m")
    @btime $replacer("foobarbaz")
    @btime $replacer("foobarbaz"^10000)
    @btime replace_thisPR("foobarbaz", "oo"=>"zz","ar"=>"zz","z"=>"m")
    @btime replace_thisPR($("foobarbaz"^10000), "oo"=>"zz","ar"=>"zz","z"=>"m");

which gives:
In other words, for a short string the regex is 3x faster, and for a long string (90kB) it is 200x faster and allocates 200x less memory.
I think a better approach would be to create a package.
Agreed, a type would be nicer. As for making a package: unless it implements some very strong functionality that is hard to reach from Base, very simple code similar to what you just posted would slowly bitrot and die, never tried by any user (in my opinion). In any case, thank you for the detailed feedback 👍
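To make the "type" idea concrete, here is a rough sketch of a callable struct that precomputes the regex and the lookup table once and can then be applied to many strings. `StringReplacer`, `METACHARS` and `escape_meta` are hypothetical names chosen for illustration; nothing here is an existing Base or package API, and alternatives are tried in argument order (an assumption, since the thread does not settle this).

```julia
# Hypothetical replacer type: holds the precomputed regex and dictionary.
struct StringReplacer{D<:AbstractDict{String,String}}
    regex::Regex
    table::D
end

# Characters escaped so that each pattern is matched literally.
const METACHARS = ('\\','^','$','.','|','?','*','+','(',')','[','{')

escape_meta(s::AbstractString) = sprint() do io
    for c in s
        c in METACHARS && print(io, '\\')
        print(io, c)
    end
end

function StringReplacer(pats::Pair{<:AbstractString,<:AbstractString}...)
    table = Dict{String,String}(String(k) => String(v) for (k, v) in pats)
    # One alternation over all escaped patterns, in the order they were given.
    regex = Regex(join((escape_meta(first(p)) for p in pats), '|'))
    return StringReplacer(regex, table)
end

# Applying the replacer reuses the stored regex and table.
(r::StringReplacer)(s::AbstractString) = replace(s, r.regex => m -> r.table[m])

r = StringReplacer("oo"=>"zz", "ar"=>"zz", "z"=>"m")
println(map(r, ["foobarbaz", "zoo"]))
```

The construction cost is paid once, which is the point raised earlier about calling the same replacement on many strings.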
Sorry for coming back to this, but @stevengj, why do you always propose not adding that functionality to Base at all? If people seriously care about performance they will look up their bottlenecks and solve them. Base is mostly for very generic convenience. Like replace.
My concern is that people keep proposing APIs in Base that are nearly impossible to optimize, so you can't drop in a faster implementation. That means that we are stuck forever with "use the Base API only if you don't care about performance," which is not generally what we aspire to for Base functions.
This has been attempted before, sometimes fairly similar to this, but the attempts seemed to be either too simple or too complicated. This aims to be simple, and even beats one of the "handwritten" benchmark cases. Past issues (e.g. JuliaLang#25396) have proposed that using Regex may be faster, but in my tests, this handily bests even simplified regexes. There can be slow Regex patterns that cause this to exhibit O(n^2) behavior, but only if one of the earlier patterns is a partial match for a later Regex pattern and that Regex always matches O(n) of the input stream. This is a case that is hopefully usually avoidable in practice. fixes JuliaLang#35327 fixes JuliaLang#39061 fixes JuliaLang#35414 fixes JuliaLang#29849 fixes JuliaLang#30457 fixes JuliaLang#25396
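For readers on a Julia version that includes a multi-pattern `replace` for strings (1.7 or later, as far as I know; this is an assumption about your installation), a quick sketch of the simultaneous semantics, using the same pairs as the earlier comments in this thread:

```julia
# Requires a Julia version whose `replace` accepts several pattern pairs.
s = "foobarbaz"
simultaneous = replace(s, "oo"=>"zz", "ar"=>"zz", "z"=>"m")
println(simultaneous)

# Replacement text is not rescanned: the "b" produced by "a"=>"b"
# is not turned into "c" by the second pair.
no_rescan = replace("abc", "a"=>"b", "b"=>"c")
println(no_rescan)
```

Each character of the input is matched by at most one pattern, which is what distinguishes this from applying the pairs one after another.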
Stab @ addressing #35327
semantics respected:
edit 9 apr: added fast specialization for Char=>Char mappings
edit 10 apr: generalized fast specialization for Char=>? mappings using IOBuffer
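The PR's actual specialization is not shown on this page. As a rough illustration of what a `Char=>?` fast path built on an `IOBuffer` could look like, here is a sketch; `replace_chars` is a hypothetical helper, and treating `count` as a cap on the total number of replacements is my assumption, not necessarily the semantics the PR tested.

```julia
# Hypothetical helper, not the PR's code: single pass over the characters,
# one Dict lookup per character, output accumulated in an IOBuffer.
function replace_chars(s::AbstractString, pats::Pair{Char}...;
                       count::Integer=typemax(Int))
    table = Dict{Char,Any}(pats...)   # the Dict is constructed in one go
    buf = IOBuffer()
    done = 0                          # replacements performed so far
    for c in s
        if done < count && haskey(table, c)
            print(buf, table[c])      # replacement may be a Char, String, ...
            done += 1
        else
            print(buf, c)
        end
    end
    return String(take!(buf))
end

println(replace_chars("banana", 'a'=>'o', 'n'=>"NN"))
println(replace_chars("banana", 'a'=>'o', 'n'=>"NN"; count=2))
```

Because each key is a single `Char`, no two patterns can overlap, so one left-to-right pass is enough.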