This repository has been archived by the owner on Mar 16, 2021. It is now read-only.

[Azure Search] Improve tokenization of initialisms #628

Draft
wants to merge 3 commits into
base: dev
from


loic-sharma
Contributor

@loic-sharma loic-sharma commented Aug 14, 2019

@@ -12,6 +12,6 @@ public static class PackageIdCustomTokenizer

public static readonly PatternTokenizer Instance = new PatternTokenizer(
Name,
-            @"[.\-_,;:'*#!~+()\[\]{}\s]");
+            @"((?<=[A-Z])(?=[A-Z][a-z]))|([.\-_,;:'*#!~+()\[\]{}\s])");
Contributor Author

@loic-sharma loic-sharma Aug 14, 2019


This splits on:

  1. Whitespace
  2. The characters ., -, _, ,, ;, :, ', *, #, !, ~, +, (, ), [, ], {, }
  3. Between two uppercase letters when the second is followed by a lowercase letter, that is, after the first character of patterns like ABc. For example, FOOBar becomes FOO and Bar
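
As a rough illustration of the three rules above, the following sketch applies the new pattern with .NET's `Regex` class. It is only an approximation: the real tokenization happens inside Azure Search's `PatternTokenizer`, and the later analysis steps that lowercase tokens or emit extra variants are not reproduced here. The `SplitDemo` class and `Tokenize` helper are hypothetical names introduced for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SplitDemo
{
    // Pattern copied from the added line in the diff.
    // First alternative: zero-width position between an uppercase letter and
    // an uppercase letter that is followed by a lowercase letter.
    // Second alternative: a single separator character or whitespace.
    private static readonly Regex Separator = new Regex(
        @"((?<=[A-Z])(?=[A-Z][a-z]))|([.\-_,;:'*#!~+()\[\]{}\s])");

    // Hypothetical helper: returns the non-empty text between separator matches.
    public static List<string> Tokenize(string input)
    {
        var tokens = new List<string>();
        int start = 0;

        foreach (Match match in Separator.Matches(input))
        {
            // Skip empty tokens produced by adjacent separators.
            if (match.Index > start)
            {
                tokens.Add(input.Substring(start, match.Index - start));
            }

            // For zero-width matches Length is 0, so no character is consumed.
            start = match.Index + match.Length;
        }

        if (start < input.Length)
        {
            tokens.Add(input.Substring(start));
        }

        return tokens;
    }

    public static void Main()
    {
        var ids = new[] { "FOOBar", "FooBAR", "FOOBarBuzz", "FooBARBuzz", "FOOBar.Baz Qux" };
        foreach (var id in ids)
        {
            Console.WriteLine($"{id} -> {string.Join(", ", Tokenize(id))}");
        }
    }
}
```

Under these assumptions, `FOOBar` yields `FOO` and `Bar`, `FooBAR` stays a single token because no uppercase letter in it is followed by an uppercase-then-lowercase pair, and `FooBARBuzz` yields `FooBAR` and `Buzz`. Matches are walked manually rather than with `Regex.Split` because the pattern contains capturing groups, and `Regex.Split` would include the captured separators in its result.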

{ "FOOBar", new[] { "foo", "bar" } },
{ "FooBAR", new[] { "foobar", "foo", "bar" } },
{ "FOOBarBuzz", new[] { "foo", "barbuzz", "bar", "buzz" } },
{ "FooBARBuzz", new[] { "foobar", "foo", "bar", "buzz" } },
Contributor


Maybe also include a test case that combines split characters, spaces, and casing together? For example, something like the highlighted one: FOOBar.Baz Qux

Contributor Author


This is covered by SplitsTokensOnSpecialCharactersAndLowercases.

For more context: each data set in TokenizedData tests a single tokenization behavior. This helps us dedupe test data across many different fields in the index, each of which may have different tokenization behaviors.


2 participants