
[Feature] Adding a char_group tokenizer #24186

Merged
merged 5 commits into elastic:master on May 22, 2018

Conversation

synhershko
Contributor

@synhershko synhershko commented Apr 19, 2017

=== Char Group Tokenizer

The `char_group` tokenizer breaks text into terms whenever it encounters a
character which is in a defined set. It is mostly useful for cases where a
simple custom tokenization is desired, and the overhead of use of the
<<analysis-pattern-tokenizer, `pattern` tokenizer>> is not acceptable.

=== Configuration

The `char_group` tokenizer accepts one parameter:

`tokenize_on_chars`::
    A string containing a list of characters to tokenize the string on.
    Whenever a character from this list is encountered, a new token is
    started. Also supports escaped values like `\\n` and `\\f`, and in
    addition `\\s` to represent whitespace, `\\d` to represent digits and
    `\\w` to represent letters. Defaults to an empty list.

=== Example output

```The 2 QUICK Brown-Foxes jumped over the lazy dog's bone for $2```

When the configuration `\\s-:<>` is used for `tokenize_on_chars`, the above
sentence would produce the following terms:

```[ The, 2, QUICK, Brown, Foxes, jumped, over, the, lazy, dog's, bone, for, $2 ]```
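
For reference, a minimal sketch of trying this through the `_analyze` API, assuming the string-valued `tokenize_on_chars` form described above (the request shape is illustrative; the option is changed to an array notation later in this thread):

```
POST _analyze
{
  "tokenizer": {
    "type": "char_group",
    "tokenize_on_chars": "\\s-:<>"
  },
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone for $2"
}
```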
@elasticmachine
Collaborator

Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually?

@jimczi
Contributor

jimczi commented Apr 19, 2017

It is mostly useful for cases where a simple custom tokenization is desired, and the overhead of use of the <<analysis-pattern-tokenizer, pattern tokenizer>> is not acceptable.

Are you sure that there is an overhead when using a pattern tokenizer for single-char tokenization?
Is this the only reason why you need this specific tokenizer? I guess that if the list of characters is big then this tokenizer is easier to use, but I am not sure that it should be in the list of default tokenizers. A plugin seems a better fit because it is intended for a very specific usage?

@synhershko
Contributor Author

@jimczi yes, this was measured at a client's site and the overhead is measurable even for a simple regex pattern.

From experience, maintaining a plugin is a considerable overhead, and this seems like a legitimate and useful addition, considering the existence of the letter tokenizer.

@jimczi
Contributor

jimczi commented Apr 19, 2017

Fair enough. I marked this PR with the discuss label. It is also unclear to me if this tokenizer should be added to Lucene first. @jpountz may have a better answer regarding this. And thanks @synhershko for submitting your PR.

@jpountz
Contributor

jpountz commented May 12, 2017

Sorry for the late reply, I had totally missed the ping. @synhershko I think this new tokenizer could help. The main question to me is how we should allow the list of token chars to be configured. We already have a token_chars option with the ngram tokenizer, I'm wondering whether we should try to be consistent. Also I'm curious what your actual use-case was with this customer? For instance do you think we need to expose the ability to tokenize on specific characters or would it be enough to just expose character classes like the ngram tokenizer does?

@jpountz
Contributor

jpountz commented May 12, 2017

To @jimczi's question, I think Elasticsearch is the right place to put this tokenizer, given that Lucene does not need more than its existing CharTokenizer?

@dakrone
Member

dakrone commented Aug 15, 2017

Ping @synhershko is this still something you're interested in pursuing?

@synhershko
Contributor Author

@dakrone sure, the PR with all required changes is here - not sure what you are asking? :)

@synhershko
Contributor Author

Ping @jpountz @dakrone - would love to have this in for 6.0, anything else you need from me here?

@dakrone
Member

dakrone commented Sep 11, 2017

@synhershko I think @jpountz's questions from his last comment are still unanswered. Other than that, I'm not sure what else needs to happen; @jpountz will likely have a better idea.

@synhershko
Contributor Author

Aye, responding below

The main question to me is how we should allow the list of token chars to be configured. We already have a token_chars option with the ngram tokenizer, I'm wondering whether we should try to be consistent.

That's just a naming question - fine by me to change that.

Also I'm curious what your actual use-case was with this customer?

To be honest I don't remember the actual use-case, but I do remember it was important enough to call for this PR :) I assume this was an attempt to efficiently parse incoming user data with custom separators. We all know those types of systems...

The pattern replace tokenizer would most likely help that cause, but it does come with a performance penalty.

For instance do you think we need to expose the ability to tokenize on specific characters or would it be enough to just expose character classes like the ngram tokenizer does?

Using character classes does look tempting for simplicity reasons, but in all likelihood this tokenizer will mostly be useful for its granularity. Do note `\s`, `\w`, `\d` are supported out of the box and can also be extended with more categories if needed.

@jpountz
Contributor

jpountz commented Sep 12, 2017

Maybe we could make it take e.g. `token_chars: [ "whitespace", "-", "_" ]` as a way to say it should split on whitespace, hyphens and underscores? It is not ambiguous since all character classes use more than one char, and it is consistent with the current configuration of the `ngram` and `edge_ngram` tokenizers (which could later be improved to allow splitting on specific chars too)?
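
To make the suggestion concrete, a minimal sketch of what such a definition might look like in index settings (the index, tokenizer and analyzer names are illustrative, and `token_chars` here mirrors the suggestion above rather than a shipped option):

```
PUT my_index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_tokenizer": {
          "type": "char_group",
          "token_chars": [ "whitespace", "-", "_" ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "my_tokenizer"
        }
      }
    }
  }
}
```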

@synhershko
Contributor Author

Sounds good. You want me to go ahead and apply this change?

If so, what should we do if `len(token_chars[i]) > 1` and `token_chars[i]` is not in `["whitespace", "digit", "letter"]` for some `i`? This subtlety is probably the only downside of this approach.

@jpountz
Contributor

jpountz commented Sep 13, 2017

@synhershko Yes, something like that. I suspect your example didn't mean to be all-inclusive, but just in case, supporting symbol and punctuation would be nice too.

@jpountz
Contributor

jpountz commented Sep 13, 2017

Actually I just realized that reusing `token_chars` is a bad idea, since this option of the `ngram` tokenizer is about characters to keep, while here we are dealing with characters to split on. So let's keep the `tokenize_on_chars` name that you initially suggested? However I'd still like to go with an array notation, which I find less ambiguous than a string, especially if there are supplementary chars in the string that can't be represented by a single Java char.
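
For illustration, a sketch of the array notation applied to the example sentence from the description (the inline `_analyze` request shape and the specific character list are just examples):

```
POST _analyze
{
  "tokenizer": {
    "type": "char_group",
    "tokenize_on_chars": [ "whitespace", "-", ":", "<", ">" ]
  },
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone for $2"
}
```

Here `whitespace` names a character class while `-`, `:`, `<` and `>` are literal characters to split on, which would yield the same terms as the earlier example.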

@synhershko
Contributor Author

Oh right, that might be the original reason why I chose that naming ;) it's been too long.

I pushed an update to the code - can you confirm the structure and config look good? If so, I will update the tests.

@romseygeek
Contributor

/cc @elastic/es-search-aggs

@elasticmachine
Collaborator

Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually?

@jpountz jpountz self-assigned this May 18, 2018
@jpountz
Contributor

jpountz commented May 18, 2018

@elasticmachine test this please

@jpountz jpountz removed the discuss label May 18, 2018
@jpountz jpountz merged commit 5f172b6 into elastic:master May 22, 2018
@jpountz
Contributor

jpountz commented May 22, 2018

@synhershko I took the liberty of making some minor changes, like moving the code to analysis-common, before merging this PR. Sorry for the long delay.

jpountz pushed a commit that referenced this pull request May 22, 2018
dnhatn added a commit that referenced this pull request May 22, 2018
* master:
  QA: Add xpack tests to rolling upgrade (#30795)
  Modify state of VerifyRepositoryResponse for bwc (#30762)
  Reduce CLI scripts to one-liners on Windows (#30772)
  Simplify number of shards setting (#30783)
  Replace Request#setHeaders with addHeader (#30588)
  [TEST] remove endless wait in RestClientTests (#30776)
  [Docs] Fix script-fields snippet execution (#30693)
  Upgrade to Lucene-7.4.0-snapshot-cc2ee23050 (#30778)
  [DOCS] Add SAML configuration information (#30548)
  [DOCS] Remove X-Pack references from SQL CLI (#30694)
  Make http pipelining support mandatory (#30695)
  [Docs] Fix typo in circuit breaker docs (#29659)
  [Feature] Adding a char_group tokenizer (#24186)
  [Docs] Fix broken cross link in documentation
  Test: wait for netty threads in a JUnit ClassRule (#30763)
  Increase the maximum number of filters that may be in the cache. (#30655)
  [Security] Include an empty json object in an json array when FLS filters out all fields (#30709)
  [TEST] Wait for CS to be fully applied in testDeleteCreateInOneBulk
  Add more yaml tests for get alias API (#29513)
  Ignore empty completion input (#30713)
  [DOCS] fixed incorrect default
  [ML] Filter undefined job groups from update calendar actions (#30757)
  Fix docs failure on language analyzers (#30722)
  [Docs] Fix inconsistencies in snapshot/restore doc (#30480)
  Enable installing plugins from snapshots.elastic.co (#30765)
  Remove fedora 26, add 28 (#30683)
  Accept Gradle build scan agreement (#30645)
  Remove logging from elasticsearch-nio jar (#30761)
  Add Delete Repository High Level REST API (#30666)
dnhatn added a commit that referenced this pull request May 24, 2018
* 6.x:
  [DOCS] Fixes typos in security settings
  Add support for indexed shape routing in geo_shape query (#30760)
  [DOCS] Splits auditing.asciidoc into smaller files
  Painless: Types Section Clean Up (#30283)
  [test] java tests for archive packaging (#30734)
  Deprecate http.pipelining setting (#30786)
  [DOCS] Fix more edit URLs in Stack Overview (#30704)
  Use correct cluster state version for node fault detection (#30810)
  [DOCS] Fixes broken link for native realm
  [DOCS] Clarified audit.index.client.hosts (#30797)
  Change serialization version of doc-value fields.
  Add a `format` option to `docvalue_fields`. (#29639)
  [TEST] Don't expect acks when isolating nodes
  Fixes UpdateSettingsRequestStreamableTests mutate bug
  Revert "Add more yaml tests for get alias API (#29513)"
  Revert "Mutes MachineLearningTests.testNoAttributes_givenSameAndMlEnabled"
  Only allow x-pack metadata if all nodes are ready (#30743)
  Mutes MachineLearningTests.testNoAttributes_givenSameAndMlEnabled
  Use original settings on full-cluster restart (#30780)
  Only ack cluster state updates successfully applied on all nodes (#30672)
  Replace Request#setHeaders with addHeader (#30588)
  [TEST] remove endless wait in RestClientTests (#30776)
  QA: Add xpack tests to rolling upgrade (#30795)
  Add support for search templates to the high-level REST client. (#30473)
  Reduce CLI scripts to one-liners on Windows (#30772)
  Fold RestGetAllSettingsAction in RestGetSettingsAction (#30561)
  Add more yaml tests for get alias API (#29513)
  [Docs] Fix script-fields snippet execution (#30693)
  Convert FieldCapabilitiesResponse to a ToXContentObject. (#30182)
  Remove assert statements from field caps documentation. (#30601)
  Fix a bug in FieldCapabilitiesRequest#equals and hashCode. (#30181)
  Add support for field capabilities to the high-level REST client. (#29664)
  [DOCS] Add SAML configuration information (#30548)
  [DOCS] Remove X-Pack references from SQL CLI (#30694)
  [Docs] Fix typo in circuit breaker docs (#29659)
  [Feature] Adding a char_group tokenizer (#24186)
  Increase the maximum number of filters that may be in the cache. (#30655)
  [Docs] Fix broken cross link in documentation
  Test: wait for netty threads in a JUnit ClassRule (#30763)
  [Security] Include an empty json object in an json array when FLS filters out all fields (#30709)
  [DOCS] fixed incorrect default
  [TEST] Wait for CS to be fully applied in testDeleteCreateInOneBulk
  Enable installing plugins from snapshots.elastic.co (#30765)
  Ignore empty completion input (#30713)
  Fix docs failure on language analyzers (#30722)
  [Docs] Fix inconsistencies in snapshot/restore doc (#30480)
  Add Delete Repository High Level REST API (#30666)
  Reduce CLI scripts to one-liners (#30759)
@jimczi jimczi added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019