[API Proposal]: An API to optimally encapsulate several searching algorithms and techniques, using one simple class #59649
Comments
I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label. |
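Tagging subscribers to this area: @dotnet/area-system-runtime
Issue Details
Background and motivation
Too many overloads of searching in ReadOnlySpan<T> and yet they cover only very simple search cases. In addition, on each invocation those methods calculate or examine several cases before actually finding the best way to perform the search. We need a class where not only all those cases will be calculated or examined once and used thousands or even millions of times, but also a class that will support more complex searching scenarios.
API Proposal
A. Base class
public abstract class Scanner<T> {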
public abstract bool Contain( T value );
// Actually a placeholder. This should be overridden by all implementations
public virtual int IndexOf( ReadOnlySpan<T> span ) {
for ( int i = 0; i < span.Length; i++ ) {
if ( Contain( span[ i ] ) )
return i;
}
return -1;
}
public virtual int LastIndexOf( ReadOnlySpan<T> span ) {
for ( int i = span.Length-1; i >=0 ; i-- ) {
if ( Contain( span[ i ] ) )
return i;
}
return -1;
}
public int IndexOf( ReadOnlySpan<T> span, int index ) {
int idx = IndexOf( span.Slice( index ) );
if ( idx >= 0 )
return idx + index;
return -1;
}
//..More IndexOf, LastIndexOf and others follow, that actually call the virtual methods above
protected virtual Scanner<T> GetInverse(){
return new ScannerInverse( this );
}
protected Scanner<T> _inverse;
/// <summary>
/// Get the scanner that performs the inverse search
/// </summary>
public Scanner<T> Inverse {
get => _inverse ??= GetInverse();
protected internal set => _inverse = value;
}
}
B. An example of a Scanner that is searching for a simple value. The example is incomplete
public class ScanOne<T> : Scanner<T> where T : struct, IEquatable<T> {
private readonly T _value;
public ScanOne( T value ) {
_value = value;
}
public override bool Contain( T value ) {
return _value.Equals( value );
}
public override int IndexOf( ReadOnlySpan<T> span ) {
return span.IndexOf( _value );
}
public override int LastIndexOf( ReadOnlySpan<T> span ) {
return span.LastIndexOf( _value );
}
protected override Scanner<T> GetInverse() {
return new Inv<T>(this);
}
class Inv<T> : Scanner<T> where T : struct, IEquatable<T> {
private readonly T _value;
public override bool Contain( T value ) {
return !_value.Equals( value );
}
public Inv( ScanOne<T> parent ) {
this.Inverse = parent;
_value = parent._value;
}
public override int IndexOf( ReadOnlySpan<T> span ) {
T value=_value;
for ( int i = 0; i < span.Length; i++ ) {
if ( !span[ i ].Equals( value ) )
return i;
}
return -1;
}
}
}
C. An example of a Scanner that is using an ASCII lookup table. The example is incomplete
public class Table16CharScanner : Scanner<char> {
private readonly ushort _mask;
private ushort[] _table;
public Table16CharScanner( ushort mask, ushort[] table ) {
_mask = mask;
_table = table;
}
/// <summary>
/// Check for characters that are above 0x7F
/// </summary>
/// <param name="ch">character to check</param>
/// <returns>true if is found;false otherwise</returns>
protected internal virtual bool ContainExt( char ch ) {
return false;
}
protected override Scanner<char> GetInverse() {
return new Table16CharScannerInverse( this, _mask, _table );
}
public sealed override bool Contain( char value ) {
if ( value < _table.Length ) {
return ( _table[ value ] & _mask ) != 0;
}
return ContainExt( value );
}
public override int IndexOf( ReadOnlySpan<char> span ) {
var mask = _mask;
var tableLength = _table.Length;
var table = _table;
for ( int i = 0; i < span.Length; i++ ) {
var ch = span[i];
if ( ch < tableLength ) {
if ( ( table[ ch ] & mask ) != 0 ) {
return i;
}
} else if ( ContainExt( ch ) ) {
return i;
}
}
return -1;
}
}
D. ReadOnlySpan extensions
public static class ScannerExtensions {
public static int IndexOf<T>( this ReadOnlySpan<T> span, Scanner<T> scanner );
public static int LastIndexOf<T>( this ReadOnlySpan<T> span, Scanner<T> scanner );
public static ReadOnlySpan<T> Trim<T>( this ReadOnlySpan<T> span, Scanner<T> trim );
public static ReadOnlySpan<T> Skip<T>( this ReadOnlySpan<T> span, Scanner<T> scanner );
public static ReadOnlySpan<T> Remain<T>( this ReadOnlySpan<T> span, Scanner<T> scanner );
public static ReadOnlySpan<T> Trim<T>( this ReadOnlySpan<T> span, Scanner<T> trimStart, Scanner<T> trimEnd );
public static ReadOnlySpan<T> TrimStart<T>( this ReadOnlySpan<T> span, Scanner<T> trimStart );
public static ReadOnlySpan<T> TrimEnd<T>( this ReadOnlySpan<T> span, Scanner<T> trimEnd );
public static SplitIterator<T> Split<T>( this ReadOnlySpan<T> span, Scanner<T> search );
public static SplitIterator<T> Split<T>( this ReadOnlySpan<T> span, Scanner<T> search, Scanner<T> trim, bool skipEmpty );
public static int Count<T>( this ReadOnlySpan<T> span, Scanner<T> search );
}
API Usage
ReadOnlySpan<char> span="..";
// trim Unicode whitespace
span = span.Trim( WhiteSpaceScanner );
// trim what JS spec considers white spaces from the start and WhiteSpace or ';' or ',' from the end
span = span.Trim( JavascriptWhitespace, WhiteSpaceSemicolonOrComma );
// skip invalid id chars, find id until first invalid
span = span.Skip(HtmlIdScanner ).Remain( HtmlIdScanner.Inverse );
// split on ';' or ',' and trim whitespaces for each entry
var iterator = span.Split(SemicolonOrCommaScanner,WhiteSpaceScanner);
Risks
The current API of Regular expressions can't be optimally encapsulated by a Scanner because it requires a string as input
|
Clearly you missed the efforts of span-ifying the Regex API. |
What sort of sets of values were you hoping to see optimized? The
Can you elaborate? The issue links to a dozen PRs doing just that |
A lexer has just found a letter and wants to execute an IndexOf for the first invalid identifier character, to locate the end of the identifier. In this case, invalid identifier characters are all except A-Z, a-z, _ in the ASCII range; in the Unicode range, all except letters and digits. If the parser is SQL, then use a table that includes '$' as a valid identifier character; if it is CSS, use a lookup table that includes '-'. Which PR covers such a usage? |
You could do something like
private static readonly IndexOfAnyValues<char> s_validJsIdentifierChars = IndexOfAnyValues.Create("abc");
private static readonly IndexOfAnyValues<char> s_validSQLIdentifierChars = IndexOfAnyValues.Create("def");
private static readonly IndexOfAnyValues<char> s_validCSSIdentifierChars = IndexOfAnyValues.Create("ghi");
private static int JsIdentifierLength(ReadOnlySpan<char> text)
{
int i = text.IndexOfAnyExcept(s_validJsIdentifierChars);
return i < 0 ? text.Length : i;
}
private static int SQLIdentifierLength(ReadOnlySpan<char> text)
{
int i = text.IndexOfAnyExcept(s_validSQLIdentifierChars);
return i < 0 ? text.Length : i;
}
// or a shared thing like
private static int IdentifierLength(ReadOnlySpan<char> text, IndexOfAnyValues<char> identifierChars)
{
int i = text.IndexOfAnyExcept(identifierChars);
return i < 0 ? text.Length : i;
}
You can essentially treat the |
|
We're open to adding more specialized implementations as scenarios come up, within reason. It may make sense to add another implementation that really is just a lookup table to handle the
If you have over 60k characters, you can invert the condition and create the lookup with 65536 - 60000 characters to make it a bit more manageable. Keep in mind that you don't have to use a constant there, you can generate the set at runtime. cc: @stephentoub |
It seems the feature being requested already exists in the form of the new IndexOfAnyValues. What exactly is the feature missing from an API perspective? As Miha cites, we can add more implementations under the covers as need demonstrates. If the remaining concern from an API perspective is simply lack of extensibility, we've discussed that and opted to keep it closed for now because the extensibility point you'd need to implement buys you very little: you'd need to implement effectively all of {Last}IndexOfAny{Except}, at which point a consumer can just use your APIs directly... the only thing the abstraction would buy you is if you'd expect to e.g. publish a library containing your custom implementations for others to consume generically, but a key aspect of the design here isn't just the implementations but also the logic for choosing the optimal implementation to use. If in the future we saw it would really be beneficial to allow IndexOfAnyValues to be publicly extended, we could do so then, but we don't want to do so now. |
My first objection is: If I have to write some generic method (or class) that involves searching, for example splitting at some separators but also trimming each segment, I have to use two instances of scanners, one that finds the splitting characters and generates string segments and a second one that trims each segment before returning it. What are my options? My second objection is the |
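As an illustration of the split-then-trim scenario raised above, here is a minimal sketch, assuming .NET 8 or later, where the type discussed in this thread as IndexOfAnyValues<T> shipped under the name SearchValues<T>. The class name SplitTrimSketch, the method SplitAndTrim, and the two character sets are hypothetical and chosen only for the example; it is one possible way to combine two sets, not a recommendation from the thread's participants.

```csharp
using System;
using System.Buffers;
using System.Collections.Generic;

public static class SplitTrimSketch
{
    // Hypothetical sets for illustration only; a real lexer would define its own.
    private static readonly SearchValues<char> s_separators = SearchValues.Create(";,");
    private static readonly SearchValues<char> s_whitespace = SearchValues.Create(" \t\r\n");

    // Splits 'text' at any character in 'separators' and trims every
    // character in 'trimChars' from both ends of each segment.
    public static List<string> SplitAndTrim(
        ReadOnlySpan<char> text, SearchValues<char> separators, SearchValues<char> trimChars)
    {
        var result = new List<string>();
        while (true)
        {
            int sep = text.IndexOfAny(separators);
            ReadOnlySpan<char> segment = sep < 0 ? text : text.Slice(0, sep);

            // First character that is NOT a trim character; -1 means the
            // segment consists only of trim characters (or is empty).
            int start = segment.IndexOfAnyExcept(trimChars);
            if (start < 0)
            {
                result.Add(string.Empty);
            }
            else
            {
                int end = segment.LastIndexOfAnyExcept(trimChars);
                result.Add(segment.Slice(start, end - start + 1).ToString());
            }

            if (sep < 0)
            {
                break; // no more separators: this was the last segment
            }
            text = text.Slice(sep + 1);
        }
        return result;
    }

    // Small demonstration using the hypothetical sets above.
    public static List<string> Demo() => SplitAndTrim(" a ; b,c ", s_separators, s_whitespace);
}
```

Both sets are created once and reused, so the caller passes two instances, one for splitting and one for trimming, which is the shape of usage the objection describes.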
Why? Do you mean because of X and InverseX? That's why there are both IndexOfAny and IndexOfAnyExcept methods, and there's nothing that requires them to use the same data. The IndexOfAnyValues.Create method can generate whatever state it needs to generate to make all of the exposed operations as fast as possible. For example, the IndexOfAnyAsciiCharValues implementation computes both a vectorization bitmap as well as a scalar lookup table, where the former is the main workhorse behind APIs like IndexOfAny and the latter is the main workhorse behind Contains. If there were a more efficient representation that could be created to support the other operations, it could do that, too. Currently there isn't and so it hasn't.
I already addressed that. If we allowed for external extensibility, we'd just be making the handful of entry point methods abstract/virtual. There's no other boilerplate that's saved; the public
Why do you want to restrict it in that manner? Are you concerned about construction overhead necessary to create state to optimize IndexOfAny but not IndexOfAnyExcept? |
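To make the "scalar lookup table" idea mentioned above concrete, here is a minimal sketch of an ASCII-only table that is built once and then answers both Contains and an IndexOfAnyExcept-style query. The type AsciiLookupSketch and its members are hypothetical names for illustration; this is not the runtime's IndexOfAnyAsciiCharValues implementation, and it omits the vectorized bitmap entirely.

```csharp
using System;

public sealed class AsciiLookupSketch
{
    // One entry per ASCII code point; true means the character is in the set.
    private readonly bool[] _table = new bool[128];

    public AsciiLookupSketch(ReadOnlySpan<char> values)
    {
        foreach (char c in values)
        {
            if (c >= 128)
            {
                throw new ArgumentException("Only ASCII characters are supported in this sketch.", nameof(values));
            }
            _table[c] = true;
        }
    }

    // Single bounds check plus one table load per character.
    public bool Contains(char c) => c < 128 && _table[c];

    // Index of the first character that is NOT in the set, or -1 if all are.
    public int IndexOfAnyExcept(ReadOnlySpan<char> span)
    {
        for (int i = 0; i < span.Length; i++)
        {
            if (!Contains(span[i]))
            {
                return i;
            }
        }
        return -1;
    }
}
```

The same precomputed table serves both the positive and the negated query, which is the point made above about one instance backing IndexOfAny and IndexOfAnyExcept.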
Because those are the requirements of this method: it requires one instance of IndexOfAnyValues, for example, to split the string at every ',' or ';' character, but before returning each segment it must trim the white-space using another instance of IndexOfAnyValues. Let's say that white-space here is what JavaScript considers white-space (which is different from what the filesystem, XML, C# or HTML considers white-space)
Ok, I have to find and read those suggestions and I'll be back
I wasn't clear, let's have a method that accepts one IndexOfAnyValues instance. This method makes repeatable calls to I have an instance of IndexOfAnyValues named UriSafeScanner that finds Uri-Safe-characters, how can I call this method to find me the inverse, Non-Uri-Safe characters? I can't. I have to create a new instance named UriNonSafeScanner that does the inverse search and provide some logic to synchronize those two, to avoid overlapping . In my proposal I can call the method, easily just using UriSafeScanner.Inverse property. |
@MihaZupan I somehow missed reading your answer
The algorithm above is :
So IndexOfAnyAsciiSearcher will not return the correct value, because in the 0.1% of the cases will return false |
Currently, the implementation used for mixed sets of ascii and non-ascii values will use the probabilistic map. If mixed inputs are common enough, we can consider adding an internal implementation that does something like what you describe but vectorized. But to Stephen's point, that's an implementation-only change that doesn't impact the public API.
Just to confirm, the only thing you wish to see from an API perspective is
public class IndexOfAnyValues<T>
{
public IndexOfAnyValues<T> Inverse { get; }
}
? I am not sure how common that would be, but if you believe it's important feel free to open a new API proposal issue to discuss just that part (it wasn't obvious to me that's a significant part of the current proposal). If it's common in your scenario to reuse the same helper for searching for the inverse, why not add that as an option that you use to decide which method to use?
public static int MyComplexIndexOf(ReadOnlySpan<char> data, IndexOfAnyValues<char> values, bool inverse)
{
// ...
int i = inverse ? data.IndexOfAnyExcept(values) : data.IndexOfAny(values);
// ...
}
Alternatively, the pattern we sometimes use in the runtime is to make use of static abstract methods to deduplicate implementations.
public static int MyComplexIndexOf(ReadOnlySpan<char> data, IndexOfAnyValues<char> values) =>
MyComplexIndexOf<DontNegate>(data, values);
public static int MyComplexInverseIndexOf(ReadOnlySpan<char> data, IndexOfAnyValues<char> values) =>
MyComplexIndexOf<Negate>(data, values);
private static int MyComplexIndexOf<TNegator>(ReadOnlySpan<char> data, IndexOfAnyValues<char> values)
where TNegator : struct, INegator
{
// ...
int i = TNegator.IndexOfAny(data, values);
// ...
}
internal interface INegator
{
static abstract int IndexOfAny(ReadOnlySpan<char> source, IndexOfAnyValues<char> values);
}
internal readonly struct DontNegate : INegator
{
public static int IndexOfAny(ReadOnlySpan<char> source, IndexOfAnyValues<char> values) =>
source.IndexOfAny(values);
}
internal readonly struct Negate : INegator
{
public static int IndexOfAny(ReadOnlySpan<char> source, IndexOfAnyValues<char> values) =>
source.IndexOfAnyExcept(values);
} |
I find this pattern confusing and non-optimal, and it misses many opportunities for pre-calculation. For example, if I read correctly here in the main loop of IndexOfAny4Values at SpanHelpers.LastIndexOfAnyValueType.
equals = TNegator.NegateIfNeeded(Vector128.Equals(current, values0) | Vector128.Equals(current, values1) | Vector128.Equals(current, values2)
| Vector128.Equals(current, values3) | Vector128.Equals(current, values4));
It combines results with binary OR, which is ok for the Negate case, but is not an optimal approach for the normal case |
What do you mean by that?
It is not optimal in what way? |
I mean there are algorithms that can benefit from computed lookups, a computation that is performed once in the constructor. The exact algorithm for searching needs some research. An example of precomputed lookups is here in Base64 Vectorized decoding here. Another is this search (old, non-vectorized) that computes some search masks each time it is called
internal static unsafe char* IndexOfAny( char* scan, int length, char c1, char c2 ) {
// char* scan = basePtr + index;
char ch;
if ( length > 12 ) {
int idx = ( 8 - ( (int)scan & 7 ) ) / 2;
for ( ; idx > 0; length--, idx--, scan++ ) {
ch = *scan;
if ( ch == c1 || ch == c2 ) {
return scan;
}
}
ulong controlValue1 = ( (uint)c1 << 16 ) | c1;
controlValue1 |= controlValue1 << 32;
ulong controlValue2 = ( (uint)c2 << 16 ) | c2;
controlValue2 |= controlValue2 << 32;
const ulong Magic = 0x7FFEFFFEFFFEFFFF;
const ulong Mask = 0x8001000100010000;
for ( ; length > 3; length -= 4, scan += 4 ) {
ulong next4Chars = *( (ulong*)scan );
ulong temp = next4Chars ^ controlValue1;
temp = ~temp ^ ( Magic + temp );
if ( ( temp & Mask ) != 0 ) {
//idx = (int)( scan - basePtr );
ch = *scan;
if ( ch == c1 || ch == c2 ) {
return scan;
}
ch = *( ++scan );
if ( ch == c1 || ch == c2 ) {
return scan;
}
ch = *( ++scan );
if ( ch == c1 || ch == c2 ) {
return scan;
}
ch = *( ++scan );
if ( ch == c1 || ch == c2 ) {
return scan;
}
scan -= 3;
}
temp = next4Chars ^ controlValue2;
temp = ~temp ^ ( Magic + temp );
if ( ( temp & Mask ) != 0 ) {
//idx = (int)( scan - basePtr );
if ( *scan == c2 ) {
return scan;
}
if ( *( ++scan ) == c2 ) {
return scan;
}
if ( *( ++scan ) == c2 ) {
return scan;
}
if ( *( ++scan ) == c2 ) {
return scan;
}
scan -= 3;
}
}
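}
// Scalar tail for the remaining characters (and for short inputs)
for ( ; length > 0; length--, scan++ ) {
ch = *scan;
if ( ch == c1 || ch == c2 ) {
return scan;
}
}
return null;
}
Perhaps this can be vectorized also, keeping the vectorized lookup values in a private field in the class. This point is one of the three reasons to justify, "why we need a second instance of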
That if the first value matches, there is no need to check the others. Something like short circuit OR (boolean OR). Here it does a binary OR, which is needed only in the Negate searching |
I don't understand this. IndexOfAnyValues already precomputes the data it needs once. What do you think is missing? You are expected to keep and reuse the
Helpers like this are already vectorized in the runtime. You can use them by calling
That is not how SIMD works - it would be inefficient to check each value on its own compared to just doing 4 checks upfront (you would need more instructions, more branching ...). |
Yes, my point is that the IndexOfAnyValues stores precalculated values in its fields. The Inverse class of this also needs its own precalculated fields. Sometimes the Inverse precalculation may be radically different, with a different algorithm and fields. But the main reason why an Inverse (or Negate) class is needed is not that, but how it is used by the consuming methods that accept an IndexOfAnyValues argument. Passing an additional argument to the method to specify whether it should call IndexOf or IndexOfExcept is counter-intuitive and of course slower. If you pass the consumer method just the Inverse instance of IndexOfAnyValues, it is easier and faster.
It was an old example (is not used anymore) of an algorithm, to demonstrate what I mean by precalculation and why you need to store it in fields
You imply that this statement is one simple SIMD instruction? Because it is not that simple |
If it is beneficial to store different data or use a different approach for
Yes, it's not that simple :) You're looking at debug codegen for a method where the vectors are parameters. If you instead look at the actual codegen for
M01_L08:
vmovups ymm4,[rdx]
vpcmpeqw ymm5,ymm0,ymm4
vpcmpeqw ymm6,ymm1,ymm4
vpor ymm5,ymm5,ymm6
vpcmpeqw ymm6,ymm2,ymm4
vpor ymm5,ymm5,ymm6
vpcmpeqw ymm4,ymm3,ymm4
vpor ymm4,ymm5,ymm4
vptest ymm4,ymm4
jne short M01_L09
add rdx,20
cmp rdx,rax
jbe short M01_L08
Each comparison is one instruction and each OR is one instruction. Checking each individual comparison result to return early would require a bunch of extra code and branches. |
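The compare-and-OR loop in the disassembly above can be sketched in C# with the cross-platform Vector128 APIs, assuming .NET 7 or later. The class VectorOrSketch and the method IndexOfAny4 are hypothetical names for this example; it shows the general shape of the technique (four equality compares combined with OR, then a single test per block), not the runtime's actual SpanHelpers code.

```csharp
using System;
using System.Numerics;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

public static class VectorOrSketch
{
    // Returns the index of the first occurrence of c0, c1, c2 or c3, or -1.
    public static int IndexOfAny4(ReadOnlySpan<char> span, char c0, char c1, char c2, char c3)
    {
        // Reinterpret chars as ushorts so they can be loaded into vectors.
        ReadOnlySpan<ushort> data = MemoryMarshal.Cast<char, ushort>(span);
        int i = 0;

        if (Vector128.IsHardwareAccelerated)
        {
            // Broadcast each searched value once, outside the loop.
            Vector128<ushort> v0 = Vector128.Create((ushort)c0);
            Vector128<ushort> v1 = Vector128.Create((ushort)c1);
            Vector128<ushort> v2 = Vector128.Create((ushort)c2);
            Vector128<ushort> v3 = Vector128.Create((ushort)c3);

            for (; i <= data.Length - Vector128<ushort>.Count; i += Vector128<ushort>.Count)
            {
                Vector128<ushort> current = Vector128.Create(data.Slice(i, Vector128<ushort>.Count));

                // Four compares combined with bitwise OR; no branch per value.
                Vector128<ushort> equals =
                    Vector128.Equals(current, v0) | Vector128.Equals(current, v1) |
                    Vector128.Equals(current, v2) | Vector128.Equals(current, v3);

                if (equals != Vector128<ushort>.Zero)
                {
                    // One bit per element; the lowest set bit is the first match.
                    uint mask = equals.ExtractMostSignificantBits();
                    return i + BitOperations.TrailingZeroCount(mask);
                }
            }
        }

        // Scalar tail (and full fallback when vectors are not accelerated).
        for (; i < data.Length; i++)
        {
            ushort c = data[i];
            if (c == c0 || c == c1 || c == c2 || c == c3)
            {
                return i;
            }
        }
        return -1;
    }
}
```

There is a single branch per block of eight characters; returning early after each individual compare would add a test and a branch per searched value, which is the trade-off described above.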
Yes, good point on Debug codegen :) Since you consider the use of an Inverse class for IndexOfAnyValues redundant and IndexOfAnyExcept an intuitive approach, for all usages and consumers, I am closing this. |
@panost just noting that your sharplab link is pointing to x86 debug assembly code. |
@En3Tho Yes, I copied the link before selecting Release & x64. Sorry. The correct link is here. It is still not a representative use case, because I pass the vectors as arguments, missing the opportunity for further optimizations. The code in @MihaZupan's comment above shows what actually happens |
Background and motivation
Too many overloads of searching in ReadOnlySpan<T> and yet they cover only very simple search cases. In addition to those methods, in each invocation, they calculate or examine several cases before actually find the best course of how to optimally perform the search. See for example SpanHelpers.IndexOfAny or in the same class what IndexOf does when searching for a sequence.
We need a class where not only all those cases will be calculated or examined once and used thousand or even million of times, but also a class that will support more complex searching scenarios.
API Proposal
A. Base class
B. An example of a Scanner that is searching for a simple value. The example is incomplete
C. An example of a Scanner that is using an ASCII lookup table. The example is incomplete
D. ReadOnlySpan extensions
API Usage
Risks
The current API of Regular expressions can't be optimally encapsulated by a Scanner because it requires a string as input