[Azure Search] Improve tokenization of initialisms #628
base: dev
@@ -12,6 +12,6 @@ public static class PackageIdCustomTokenizer

         public static readonly PatternTokenizer Instance = new PatternTokenizer(
             Name,
-            @"[.\-_,;:'*#!~+()\[\]{}\s]");
+            @"((?<=[A-Z])(?=[A-Z][a-z]))|([.\-_,;:'*#!~+()\[\]{}\s])");
This splits on:
- Whitespace
- The characters `.`, `\`, `-`, `_`, `,`, `;`, `:`, `'`, `*`, `#`, `!`, `~`, `+`, `(`, `)`, `[`, `]`, `{`, `}`
- After the first character on patterns like `ABc`. For example, `FOOBar` becomes `FOO` and `Bar`
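As a quick sanity check of the split rules above, here is a minimal sketch in Python. Python's `re` module is only a stand-in for the regex engine the tokenizer actually uses, the helper name `split_package_id` is hypothetical, and dropping empty pieces is an assumption. The pattern is the one from this PR with its capturing groups made non-capturing, because `re.split` would otherwise return the group contents alongside the pieces.

```python
import re

# Non-capturing rewrite of the PR's pattern: with capturing groups,
# Python's re.split() would also return what each group matched.
SPLIT_PATTERN = re.compile(
    r"(?:(?<=[A-Z])(?=[A-Z][a-z]))|(?:[.\-_,;:'*#!~+()\[\]{}\s])"
)


def split_package_id(text):
    """Split text on the pattern, dropping empty pieces (an assumption)."""
    return [piece for piece in SPLIT_PATTERN.split(text) if piece]


# Print the pieces for a few sample inputs.
for sample in ["FOOBar", "FooBAR", "FOOBarBuzz", "FooBARBuzz", "FOOBar.Baz Qux"]:
    print(sample, "->", split_package_id(sample))
```

The pieces keep their original casing here. The lowercasing and the extra camel-case tokens in the test data (for example `FooBAR` also producing `foo` and `bar`) presumably come from later stages of the analyzer, which this sketch does not model.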
{ "FOOBar", new[] { "foo", "bar" } },
{ "FooBAR", new[] { "foobar", "foo", "bar" } },
{ "FOOBarBuzz", new[] { "foo", "barbuzz", "bar", "buzz" } },
{ "FooBARBuzz", new[] { "foobar", "foo", "bar", "buzz" } },
Maybe also include a case that combines separator characters, spaces, and casing together? Something like the highlighted one: `FOOBar.Baz Qux`
This is covered by `SplitsTokensOnSpecialCharactersAndLowercases`.
For more context: each data set in `TokenizedData` tests a single tokenization behavior. This helps us dedupe test data across many different fields in the index, each of which may have different tokenization behaviors.
Addresses NuGet/NuGetGallery#6964
Binary build: https://devdiv.visualstudio.com/DevDiv/_build/results?buildId=2948640
Config build: https://devdiv.visualstudio.com/DevDiv/_build/results?buildId=2936947
Release: https://devdiv.visualstudio.com/DevDiv/_releaseProgress?_a=release-pipeline-progress&releaseId=419179