Optimized repeated string search API #44794
Comments
I'd argue about this. Huge text could be structured or tokenized. Have you examined compiled regex for this case? This is a subset of regex. |
Regular expressions are a logical suggestion, but:
|
Some of these should presumably be I'd like |
@KalleOlaviNiemitalo Correct, all of them should be abstract. I fixed it. You are also right about the |
RegexCompiler / source generator just use IndexOf now, e.g. for:

    [RegexGenerator(@"abc")]
    private static partial Regex SearchAbc();

you get:

    private sealed class Runner : RegexRunner
    {
        // Description:
        // ○ Match the string "abc".

        /// <summary>Scan the <paramref name="inputSpan"/> starting from base.runtextstart for the next match.</summary>
        /// <param name="inputSpan">The text being scanned by the regular expression.</param>
        protected override void Scan(ReadOnlySpan<char> inputSpan)
        {
            if (TryFindNextPossibleStartingPosition(inputSpan))
            {
                // The search in TryFindNextPossibleStartingPosition performed the entire match.
                int start = base.runtextpos;
                int end = base.runtextpos = start + 3;
                base.Capture(0, start, end);
            }
        }

        /// <summary>Search <paramref name="inputSpan"/> starting from base.runtextpos for the next location a match could possibly start.</summary>
        /// <param name="inputSpan">The text being scanned by the regular expression.</param>
        /// <returns>true if a possible match was found; false if no more matches are possible.</returns>
        private bool TryFindNextPossibleStartingPosition(ReadOnlySpan<char> inputSpan)
        {
            int pos = base.runtextpos;

            // Validate that enough room remains in the input to match.
            // Any possible match is at least 3 characters.
            if (pos <= inputSpan.Length - 3)
            {
                // The pattern begins with a literal "abc". Find the next occurrence.
                // If it can't be found, there's no match.
                int i = inputSpan.Slice(pos).IndexOf("abc");
                if (i >= 0)
                {
                    base.runtextpos = pos + i;
                    return true;
                }
            }

            // No match found.
            NoStartingPositionFound:
            base.runtextpos = inputSpan.Length;
            return false;
        }
    }

and RegexCompiler spits out almost an identical implementation, just with IL instead of C#.
It does not create and load a new assembly. It uses DynamicMethod. I don't currently see enough value in this proposal over what's already possible. |
Also the regex generator is free to take advantage in future of string searching algorithms that have a significant set up cost, since it can pay that cost at compilation time. I think this use case is satisfied by the new generator. I will close. |
For fairness I should have said: I'm assuming a constant string, otherwise the generator of course won't help. @GSPP is that a fair assumption? |
But RegexCompiler would, as would RegexInterpreter, which also uses IndexOf. |
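For the non-constant case mentioned above, a rough sketch of the "create once, use often" pattern with the existing Regex API might look like the following. The class and method names (RegexLiteralSearch, Prepare, IndexOf) are made up for illustration; only Regex, Regex.Escape, RegexOptions and Match are real .NET APIs. Note that this gives regex (culture-invariant, ordinal-like) matching, not culture-sensitive IndexOf semantics.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of .NET: wraps a runtime string in a reusable Regex.
public static class RegexLiteralSearch
{
    // "Prepare" phase: escape the literal so regex metacharacters are matched
    // verbatim, then compile once. The returned Regex can be reused for many inputs.
    public static Regex Prepare(string literal) =>
        new Regex(Regex.Escape(literal), RegexOptions.Compiled | RegexOptions.CultureInvariant);

    // "Search" phase: index of the first match, or -1 if there is none.
    public static int IndexOf(Regex prepared, string text)
    {
        Match m = prepared.Match(text);
        return m.Success ? m.Index : -1;
    }
}
```

The Regex returned by Prepare would be stored in a field and passed to IndexOf for each document, so the construction cost is paid once per search string rather than once per search.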
@danmoseley It could be a constant, a once-initialized string, or a rarely changing string. Of course, it's also possible to use this "create once use often" idea to search for multiple strings (this was not proposed here). If the team does not want to pursue this I'm perfectly fine with closing this issue. I have noticed many closures over the past few days and it seems like a cleanup is underway. Although I have to say, I found this to be a pretty useful idea. I have so often needed to repeatedly search for the same string (or the same set of strings). I've done lots of text analysis over the years. |
@GSPP thanks. I believe we should wait until we get clear use cases where regex is not sufficient. |
Problem:
It is a fairly common operation to search for the same string many times. For example, you might want to search 1 million text documents for the same substring.
Currently, this is easy to do by simply calling IndexOf repeatedly. But there is a way to do this much faster: There are string search algorithms which are super fast but which require a preprocessing step specific for the string to be found. Some examples:
This is especially relevant since .NET Core switched to ICU for string processing. ICU has a high per-search cost with reduced per-character cost. This has led to performance regressions (#40942). With the right API, it could lead to performance gains instead.
Proposal:
I propose adding a new API that acknowledges this two phase prepare/search model:
There would be no functional change. This is purely a performance benefit. This benefit can be large when the same string is to be found many times. In particular, the performance loss from switching to ICU can be alleviated in certain cases.
I made the StringSearcher class abstract so that user code can provide its own algorithms. There could be rich community innovation.
Possible extension: It could be useful to be able to search for many strings at once. Certain algorithms can do that efficiently (e.g. trie based ones).
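To illustrate the prepare/search split, here is a minimal sketch of an abstract searcher with one user-supplied algorithm. Only the name StringSearcher comes from the proposal; the member signature, the HorspoolSearcher class, and the choice of Boyer-Moore-Horspool are assumptions made for this example, not the proposed API. The comparison is ordinal only, so it does not reproduce culture-sensitive ICU matching.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical shape: only the class name is taken from the proposal.
public abstract class StringSearcher
{
    // Returns the index of the first match at or after startIndex, or -1.
    public abstract int IndexOf(ReadOnlySpan<char> text, int startIndex = 0);
}

// Ordinal Boyer-Moore-Horspool. The constructor is the "prepare" phase.
public sealed class HorspoolSearcher : StringSearcher
{
    private readonly string _pattern;
    private readonly Dictionary<char, int> _shift = new Dictionary<char, int>();

    public HorspoolSearcher(string pattern)
    {
        if (string.IsNullOrEmpty(pattern))
            throw new ArgumentException("Pattern must be non-empty.", nameof(pattern));
        _pattern = pattern;

        // For every pattern character except the last, record its distance
        // from the end of the pattern; later occurrences overwrite earlier ones.
        for (int i = 0; i < pattern.Length - 1; i++)
            _shift[pattern[i]] = pattern.Length - 1 - i;
    }

    public override int IndexOf(ReadOnlySpan<char> text, int startIndex = 0)
    {
        if (startIndex < 0) throw new ArgumentOutOfRangeException(nameof(startIndex));

        int m = _pattern.Length;
        int pos = startIndex;
        while (pos <= text.Length - m)
        {
            if (text.Slice(pos, m).SequenceEqual(_pattern.AsSpan()))
                return pos;

            // Shift by the table entry for the text character aligned with the
            // pattern's last position, or by the full pattern length if absent.
            char last = text[pos + m - 1];
            pos += _shift.TryGetValue(last, out int s) ? s : m;
        }
        return -1;
    }
}
```

A caller would construct the searcher once and reuse it across documents, for example `var searcher = new HorspoolSearcher("abc");` followed by `searcher.IndexOf(document)` for each document.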
Finding all matches in a document is a fairly common task. This can be made fast and convenient:
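For comparison, a baseline for finding all matches with today's API could be sketched as below. This uses the existing String.IndexOf with ordinal comparison, not the proposed searcher; MatchEnumerator and FindAll are hypothetical names, and the choice to report overlapping matches is an assumption.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of .NET: collects every ordinal match position.
public static class MatchEnumerator
{
    public static List<int> FindAll(string text, string value)
    {
        if (string.IsNullOrEmpty(value))
            throw new ArgumentException("Value must be non-empty.", nameof(value));

        var result = new List<int>();
        int pos = 0;
        while (pos <= text.Length - value.Length)
        {
            int i = text.IndexOf(value, pos, StringComparison.Ordinal);
            if (i < 0) break;
            result.Add(i);
            pos = i + 1; // advance by one so overlapping matches are reported
        }
        return result;
    }
}
```

Each call to IndexOf here starts from scratch, which is the per-search cost a prepared searcher would aim to pay only once.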
I believe that string searching is such a common and performance-sensitive task that it would warrant a new optimized API.