Optimized repeated string search API #44794

GSPP · 2020-11-17T10:43:48Z

Problem:

It is a fairly common operation to search for the same string many times. For example, you might want so search 1 million text documents for the same substring.

Currently, this is easy to do by simply calling IndexOf repeatedly. But there is a way to do this much faster: There are string search algorithms which are super fast but which require a preprocessing step specific for the string to be found. Some examples:

This is especially relevant since .NET Core switched to ICU for string processing. ICU has a high per-search cost with reduced per-character cost. This has led to performance regressions (#40942). With the right API, it could lead to performance gains instead.

Proposal:

I propose adding a new API that acknowledges this two phase prepare/search model:

public abstract class StringSearcher
{
    //Prepare search algorithm:
    public static StringSearcher Create(ReadOnlySpan<char> stringToFind, StringComparison stringComparison);
    
    //Execute search:
    public abstract int IndexOf(ReadOnlySpan<char> stringToSearch);
    public abstract int LastIndexOf(ReadOnlySpan<char> stringToSearch);
    public abstract bool Contains(ReadOnlySpan<char> stringToSearch);
    public abstract bool StartsWith(ReadOnlySpan<char> stringToSearch);
    public abstract bool EndsWith(ReadOnlySpan<char> stringToSearch);
}

There would be no functional change. This is purely a performance benefit. This benefit can be large when the same string is to be found many times. In particular, the performance loss from switching to ICU can be alleviated in certain cases.

I made the StringSearcher class abstract so that user code can provide its own algorithms. There could be rich community innovation.

Possible extension: It could be useful to be able to search for many strings at once. Certain algorithms can do that efficiently (e.g. trie based ones).

Finding all matches in a document is a fairly common task. This can be made fast and convenient:

//Framework code:
public abstract class StringSearcher {
    //OccurrenceEnumerator equivalent to IEnumerable<(int Index, int Length)>
    public OccurrenceEnumerator EnumerateOccurrences(ReadOnlySpan<char> stringToSearch);
}


//Initialize in application code:
static readonly StringSearcher searcher = ...; 


//Run search algorithm:
foreach ((int index, int length) in searcher.EnumerateOccurrences("abc"))
    ...

I believe that string searching is such a common and performance-sensitive task that it would warrant a new optimized API.

The text was updated successfully, but these errors were encountered:

Dotnet-GitSync-Bot · 2020-11-17T10:43:51Z

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

huoyaoyuan · 2020-11-17T10:56:09Z

It is a fairly common operation to search for the same string many times.

I'd argue about this. Huge text could be structured or tokenized.

Have you examined compiled regex for this case? This is a subset of regex.

GSPP · 2020-11-17T11:55:36Z

Regular expressions are a logical suggestion, but:

I believe that compiled regular expressions are not as fast as IndexOf by any means.
There are escaping issues.
Constructing a compiled regex is extremely expensive (does it not create and load a fresh assembly?). I imagine StringSearcher to be far cheaper to instantiate.
Operations such as StartsWith (and the others I listed) are not directly supported. They require generating regex syntax.

KalleOlaviNiemitalo · 2020-11-17T12:14:37Z

public int IndexOf(ReadOnlySpan<char> stringToSearch);
public int LastIndexOf(ReadOnlySpan<char> stringToSearch);
public bool Contains(ReadOnlySpan<char> stringToSearch);
public bool StartsWith(ReadOnlySpan<char> stringToSearch);
public bool EndsWith(ReadOnlySpan<char> stringToSearch);

Some of these should presumably be abstract.

I'd like out int matchLength or equivalent, as in CompareInfo.IndexOf. That is not needed for ordinal search but can be useful for linguistic search.

GSPP · 2020-11-17T14:03:56Z

@KalleOlaviNiemitalo Correct, all of them should be abstract. I fixed it.

You are also right about the out int matchLength. There might well be more methods and overloads needed. For now, I'll keep this proposal simple so that it's easy to understand. If the team signals interest, there can be a formal API proposal.

stephentoub · 2022-03-14T14:52:09Z

I believe that compiled regular expressions are not as fast as IndexOf by any means.

RegexCompiler / source generator just use IndexOf now, e.g. for:

[RegexGenerator(@"abc")]
private static partial Regex SearchAbc();

you get:

private sealed class Runner : RegexRunner
{
    // Description:
    // ○ Match the string "abc".

    /// <summary>Scan the <paramref name="inputSpan"/> starting from base.runtextstart for the next match.</summary>
    /// <param name="inputSpan">The text being scanned by the regular expression.</param>
    protected override void Scan(ReadOnlySpan<char> inputSpan)
    {
        if (TryFindNextPossibleStartingPosition(inputSpan))
        {
            // The search in TryFindNextPossibleStartingPosition performed the entire match.
            int start = base.runtextpos;
            int end = base.runtextpos = start + 3;
            base.Capture(0, start, end);
        }
    }

    /// <summary>Search <paramref name="inputSpan"/> starting from base.runtextpos for the next location a match could possibly start.</summary>
    /// <param name="inputSpan">The text being scanned by the regular expression.</param>
    /// <returns>true if a possible match was found; false if no more matches are possible.</returns>
    private bool TryFindNextPossibleStartingPosition(ReadOnlySpan<char> inputSpan)
    {
        int pos = base.runtextpos;

        // Validate that enough room remains in the input to match.
        // Any possible match is at least 3 characters.
        if (pos <= inputSpan.Length - 3)
        {
            // The pattern begins with a literal "abc". Find the next occurrence.
            // If it can't be found, there's no match.
            int i = inputSpan.Slice(pos).IndexOf("abc");
            if (i >= 0)
            {
                base.runtextpos = pos + i;
                return true;
            }
        }

        // No match found.
        NoStartingPositionFound:
        base.runtextpos = inputSpan.Length;
        return false;
    }
}

and RegexCompiler spits out almost an identical implementation, just with IL instead of C#.

Constructing a compiled regex is extremely expensive (does it not create and load a fresh assembly?).

It does not create and load a new assembly. It uses DynamicMethod.

I don't currently see enough value in this proposal over what's already possible.

danmoseley · 2022-03-14T15:47:59Z

Also the regex generator is free to take advantage in future of string searching algorithms that have a significant set up cost, since it can pay that cost at compilation time.

I think this use case is satisfied by the new generator. I will close.

danmoseley · 2022-03-14T19:14:10Z

For fairness I should have said: I'm assuming a constant string, otherwise the generator of course won't help. @GSPP is that a fair assumption?

stephentoub · 2022-03-14T19:23:51Z

assuming a constant string, otherwise the generator of course won't help

But RegexCompiler would, as would RegexInterpreter, which also uses IndexOf.

GSPP · 2022-03-16T14:29:09Z

@danmoseley It could be a constant, a once-initialized string, or a rarely changing string. Of course, it's also possible to use this "create once use often" idea to search for multiple strings (this was not proposed here).

If the team does not want to pursue this I'm perfectly fine with closing this issue. I have noticed many closures over the past few days and it seems like a cleanup is underway.

Although I have to say, I found this to be a pretty useful idea. I have so often needed to repeatedly search for the same string (or the same set of strings). I've done lots of text analysis over the years.

danmoseley · 2022-03-16T14:35:54Z

@GSPP thanks. I believe we should wait until we get clear use cases where regex is not sufficient.

danmoseley · 2022-03-16T14:52:11Z

@GSPP it doesn't fit your scenario here but you may be interested in #62447.

GSPP added the api-suggestion Early API idea and discussion, it is NOT ready for implementation label Nov 17, 2020

Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label Nov 17, 2020

jeffschwMSFT added the area-System.Runtime label Nov 19, 2020

tannergooding removed the untriaged New issue has not been triaged by the area owner label Jul 12, 2021

tannergooding added this to the Future milestone Jul 12, 2021

danmoseley closed this as completed Mar 14, 2022

ghost locked as resolved and limited conversation to collaborators Apr 15, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimized repeated string search API #44794

Optimized repeated string search API #44794

GSPP commented Nov 17, 2020 •

edited

Loading

Dotnet-GitSync-Bot commented Nov 17, 2020

huoyaoyuan commented Nov 17, 2020

GSPP commented Nov 17, 2020 •

edited

Loading

KalleOlaviNiemitalo commented Nov 17, 2020

GSPP commented Nov 17, 2020

stephentoub commented Mar 14, 2022 •

edited

Loading

danmoseley commented Mar 14, 2022

danmoseley commented Mar 14, 2022

stephentoub commented Mar 14, 2022

GSPP commented Mar 16, 2022 •

edited

Loading

danmoseley commented Mar 16, 2022

danmoseley commented Mar 16, 2022

Optimized repeated string search API #44794

Optimized repeated string search API #44794

Comments

GSPP commented Nov 17, 2020 • edited Loading

Dotnet-GitSync-Bot commented Nov 17, 2020

huoyaoyuan commented Nov 17, 2020

GSPP commented Nov 17, 2020 • edited Loading

KalleOlaviNiemitalo commented Nov 17, 2020

GSPP commented Nov 17, 2020

stephentoub commented Mar 14, 2022 • edited Loading

danmoseley commented Mar 14, 2022

danmoseley commented Mar 14, 2022

stephentoub commented Mar 14, 2022

GSPP commented Mar 16, 2022 • edited Loading

danmoseley commented Mar 16, 2022

danmoseley commented Mar 16, 2022

GSPP commented Nov 17, 2020 •

edited

Loading

GSPP commented Nov 17, 2020 •

edited

Loading

stephentoub commented Mar 14, 2022 •

edited

Loading

GSPP commented Mar 16, 2022 •

edited

Loading