
[Feature] Adding a char_group tokenizer #24186

Merged
merged 5 commits into elastic:master on May 22, 2018

Conversation

synhershko
Contributor

@synhershko synhershko commented Apr 19, 2017

=== Char Group Tokenizer

The `char_group` tokenizer breaks text into terms whenever it encounters a
character which is in a defined set. It is mostly useful for cases where a
simple custom tokenization is desired, and the overhead of use of the
<<analysis-pattern-tokenizer, `pattern` tokenizer>> is not acceptable.

=== Configuration

The `char_group` tokenizer accepts one parameter:

`tokenize_on_chars`::
    A string containing a list of characters to tokenize the string on.
    Whenever a character from this list is encountered, a new token is
    started. Also supports escaped values like `\\n` and `\\f`, and in
    addition `\\s` to represent whitespace, `\\d` to represent digits and
    `\\w` to represent letters. Defaults to an empty list.

=== Example output

```The 2 QUICK Brown-Foxes jumped over the lazy dog's bone for $2```

When the configuration `\\s-:<>` is used for `tokenize_on_chars`, the above
sentence would produce the following terms:

```[ The, 2, QUICK, Brown, Foxes, jumped, over, the, lazy, dog's, bone, for, $2 ]```
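
For reference, a minimal sketch of trying this through the `_analyze` API, assuming the string-valued `tokenize_on_chars` form described above (the request shape is illustrative; the option is changed to an array notation later in this thread):

```
POST _analyze
{
  "tokenizer": {
    "type": "char_group",
    "tokenize_on_chars": "\\s-:<>"
  },
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone for $2"
}
```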
@elasticmachine
Collaborator

Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually?

@jimczi
Contributor

jimczi commented Apr 19, 2017

It is mostly useful for cases where a simple custom tokenization is desired, and the overhead of use of the <<analysis-pattern-tokenizer, pattern tokenizer>> is not acceptable.

Are you sure that there is an overhead when using a pattern tokenizer for single-char tokenization?
Is this the only reason why you need this specific tokenizer? I guess that if the list of characters is big then this tokenizer is easier to use, but I am not sure that it should be in the list of default tokenizers. A plugin seems a better fit because it is intended for a very specific usage?

@synhershko
Contributor Author

@jimczi yes, this was measured at a client's site and the overhead is measurable even for a simple regex pattern.

From experience, maintaining a plugin is a considerable overhead, and this seems like a legitimate and useful addition, considering the existence of the letter tokenizer.

@jimczi
Contributor

jimczi commented Apr 19, 2017

Fair enough. I marked this PR with the discuss label. It is also unclear to me if this tokenizer should be added to Lucene first. @jpountz may have a better answer regarding this. And thanks @synhershko for submitting your PR.

@jpountz
Contributor

jpountz commented May 12, 2017

Sorry for the late reply, I had totally missed the ping. @synhershko I think this new tokenizer could help. The main question to me is how we should allow the list of token chars to be configured. We already have a token_chars option with the ngram tokenizer, I'm wondering whether we should try to be consistent. Also I'm curious what your actual use-case was with this customer? For instance do you think we need to expose the ability to tokenize on specific characters or would it be enough to just expose character classes like the ngram tokenizer does?

@jpountz
Contributor

jpountz commented May 12, 2017

To @jimczi's question, I think Elasticsearch is the right place to put this tokenizer, given that Lucene does not need more than its existing CharTokenizer?

@dakrone
Member

dakrone commented Aug 15, 2017

Ping @synhershko is this still something you're interested in pursuing?

@synhershko
Contributor Author

@dakrone sure, the PR with all required changes is here - not sure what you are asking? :)

@synhershko
Contributor Author

Ping @jpountz @dakrone - would love to have this in for 6.0, anything else you need from me here?

@dakrone
Member

dakrone commented Sep 11, 2017

@synhershko I think @jpountz's questions from his last comment are still unanswered. Other than that, I'm not sure what else needs to happen; @jpountz will likely have a better idea.

@synhershko
Contributor Author

Aye, responding below

The main question to me is how we should allow the list of token chars to be configured. We already have a token_chars option with the ngram tokenizer, I'm wondering whether we should try to be consistent.

That's just a naming question - fine by me to change that.

Also I'm curious what your actual use-case was with this customer?

To be honest I don't remember the actual use-case, but I do remember it was important enough to call for this PR :) I assume this was an attempt to efficiently parse incoming user data with custom separators. We all know those types of systems...

The pattern replace tokenizer would most likely help that cause, but it does come with a performance penalty.

For instance do you think we need to expose the ability to tokenize on specific characters or would it be enough to just expose character classes like the ngram tokenizer does?

Using character classes does look tempting for simplicity reasons, but in all likelihood this tokenizer will mostly be useful for its granularity. Do note `\s`, `\w`, `\d` are supported out of the box and can also be extended with more categories if needed.

@jpountz
Contributor

jpountz commented Sep 12, 2017

Maybe we could make it take e.g. `token_chars: [ "whitespace", "-", "_" ]` as a way to say it should split on whitespace, hyphens and underscores? It is not ambiguous since all character classes use more than one char, and it is consistent with the current configuration of the `ngram` and `edge_ngram` tokenizers (which could later be improved to allow splitting on specific chars too)?
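
To make the suggestion concrete, a minimal sketch of what such a definition might look like in index settings (the index, tokenizer and analyzer names are illustrative, and `token_chars` here mirrors the suggestion above rather than a shipped option):

```
PUT my_index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_tokenizer": {
          "type": "char_group",
          "token_chars": [ "whitespace", "-", "_" ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "my_tokenizer"
        }
      }
    }
  }
}
```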

@synhershko
Contributor Author

Sounds good. You want me to go ahead and apply this change?

If so, what should we do if `len(token_chars[i]) > 1` and `token_chars[i]` is not in `["whitespace", "digit", "letter"]` for some `i`? This subtlety is probably the only downside of this approach.

@jpountz
Contributor

jpountz commented Sep 13, 2017

@synhershko Yes, something like that. I suspect your example didn't mean to be all-inclusive, but just in case, supporting symbol and punctuation would be nice too.

@jpountz
Contributor

jpountz commented Sep 13, 2017

Actually I just realized that reusing `token_chars` is a bad idea, since this option of the `ngram` tokenizer is about characters to keep, while here we are dealing with characters to split on. So let's keep the `tokenize_on_chars` name that you initially suggested? However I'd still like to go with an array notation, which I find less ambiguous than a string, especially if there are supplementary chars in the string that can't be represented by a single Java char.
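
For illustration, a sketch of the array notation applied to the example sentence from the description (the inline `_analyze` request shape and the specific character list are just examples):

```
POST _analyze
{
  "tokenizer": {
    "type": "char_group",
    "tokenize_on_chars": [ "whitespace", "-", ":", "<", ">" ]
  },
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone for $2"
}
```

Here `whitespace` names a character class while `-`, `:`, `<` and `>` are literal characters to split on, which would yield the same terms as the earlier example.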

@synhershko
Contributor Author

Oh right, that might be the original reason why I chose that naming ;) it's been too long.

I pushed an update to the code - can you confirm the structure and config look good? If so, I will update the tests.

@romseygeek
Contributor

/cc @elastic/es-search-aggs

@elasticmachine
Collaborator

Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually?

@jpountz jpountz self-assigned this May 18, 2018
@jpountz
Contributor

jpountz commented May 18, 2018

@elasticmachine test this please

@jpountz jpountz removed the discuss label May 18, 2018
@jpountz jpountz merged commit 5f172b6 into elastic:master May 22, 2018
@jpountz
Contributor

jpountz commented May 22, 2018

@synhershko I took the liberty of making some minor changes, like moving the code to analysis-common, before merging this PR. Sorry for the long delay.

jpountz pushed a commit that referenced this pull request May 22, 2018
dnhatn added a commit that referenced this pull request May 22, 2018
* master:
  QA: Add xpack tests to rolling upgrade (#30795)
  Modify state of VerifyRepositoryResponse for bwc (#30762)
  Reduce CLI scripts to one-liners on Windows (#30772)
  Simplify number of shards setting (#30783)
  Replace Request#setHeaders with addHeader (#30588)
  [TEST] remove endless wait in RestClientTests (#30776)
  [Docs] Fix script-fields snippet execution (#30693)
  Upgrade to Lucene-7.4.0-snapshot-cc2ee23050 (#30778)
  [DOCS] Add SAML configuration information (#30548)
  [DOCS] Remove X-Pack references from SQL CLI (#30694)
  Make http pipelining support mandatory (#30695)
  [Docs] Fix typo in circuit breaker docs (#29659)
  [Feature] Adding a char_group tokenizer (#24186)
  [Docs] Fix broken cross link in documentation
  Test: wait for netty threads in a JUnit ClassRule (#30763)
  Increase the maximum number of filters that may be in the cache. (#30655)
  [Security] Include an empty json object in an json array when FLS filters out all fields (#30709)
  [TEST] Wait for CS to be fully applied in testDeleteCreateInOneBulk
  Add more yaml tests for get alias API (#29513)
  Ignore empty completion input (#30713)
  [DOCS] fixed incorrect default
  [ML] Filter undefined job groups from update calendar actions (#30757)
  Fix docs failure on language analyzers (#30722)
  [Docs] Fix inconsistencies in snapshot/restore doc (#30480)
  Enable installing plugins from snapshots.elastic.co (#30765)
  Remove fedora 26, add 28 (#30683)
  Accept Gradle build scan agreement (#30645)
  Remove logging from elasticsearch-nio jar (#30761)
  Add Delete Repository High Level REST API (#30666)
dnhatn added a commit that referenced this pull request May 24, 2018
* 6.x:
  [DOCS] Fixes typos in security settings
  Add support for indexed shape routing in geo_shape query (#30760)
  [DOCS] Splits auditing.asciidoc into smaller files
  Painless: Types Section Clean Up (#30283)
  [test] java tests for archive packaging (#30734)
  Deprecate http.pipelining setting (#30786)
  [DOCS] Fix more edit URLs in Stack Overview (#30704)
  Use correct cluster state version for node fault detection (#30810)
  [DOCS] Fixes broken link for native realm
  [DOCS] Clarified audit.index.client.hosts (#30797)
  Change serialization version of doc-value fields.
  Add a `format` option to `docvalue_fields`. (#29639)
  [TEST] Don't expect acks when isolating nodes
  Fixes UpdateSettingsRequestStreamableTests mutate bug
  Revert "Add more yaml tests for get alias API (#29513)"
  Revert "Mutes MachineLearningTests.testNoAttributes_givenSameAndMlEnabled"
  Only allow x-pack metadata if all nodes are ready (#30743)
  Mutes MachineLearningTests.testNoAttributes_givenSameAndMlEnabled
  Use original settings on full-cluster restart (#30780)
  Only ack cluster state updates successfully applied on all nodes (#30672)
  Replace Request#setHeaders with addHeader (#30588)
  [TEST] remove endless wait in RestClientTests (#30776)
  QA: Add xpack tests to rolling upgrade (#30795)
  Add support for search templates to the high-level REST client. (#30473)
  Reduce CLI scripts to one-liners on Windows (#30772)
  Fold RestGetAllSettingsAction in RestGetSettingsAction (#30561)
  Add more yaml tests for get alias API (#29513)
  [Docs] Fix script-fields snippet execution (#30693)
  Convert FieldCapabilitiesResponse to a ToXContentObject. (#30182)
  Remove assert statements from field caps documentation. (#30601)
  Fix a bug in FieldCapabilitiesRequest#equals and hashCode. (#30181)
  Add support for field capabilities to the high-level REST client. (#29664)
  [DOCS] Add SAML configuration information (#30548)
  [DOCS] Remove X-Pack references from SQL CLI (#30694)
  [Docs] Fix typo in circuit breaker docs (#29659)
  [Feature] Adding a char_group tokenizer (#24186)
  Increase the maximum number of filters that may be in the cache. (#30655)
  [Docs] Fix broken cross link in documentation
  Test: wait for netty threads in a JUnit ClassRule (#30763)
  [Security] Include an empty json object in an json array when FLS filters out all fields (#30709)
  [DOCS] fixed incorrect default
  [TEST] Wait for CS to be fully applied in testDeleteCreateInOneBulk
  Enable installing plugins from snapshots.elastic.co (#30765)
  Ignore empty completion input (#30713)
  Fix docs failure on language analyzers (#30722)
  [Docs] Fix inconsistencies in snapshot/restore doc (#30480)
  Add Delete Repository High Level REST API (#30666)
  Reduce CLI scripts to one-liners (#30759)
@jimczi jimczi added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019