Add remove types token filter (as opposite to keep_types token filter) #29277

edovac · 2018-03-28T08:28:39Z

Describe the feature:
Hi, Elasticsearch provides the keep_types token filter, but does not provide a token filter to exclude specific token types from the token stream.

As I understand it, the keep_types token filter is implemented using Lucene's org.apache.lucene.analysis.core.TypeTokenFilter, whose constructor TypeTokenFilter(TokenStream, Set<String>, boolean) supports both behaviours through its boolean argument.
It would be nice to have the remove behaviour exposed too.

Elasticsearch version: 6.2

elasticmachine · 2018-03-28T09:33:51Z

Pinging @elastic/es-search-aggs

mayya-sharipova · 2018-03-29T20:39:35Z

@edovac Can you please provide your use-case? How are you going to use keep_types token filter to exclude specific token types?

edovac · 2018-04-04T06:36:30Z

Hi, sorry for the delay.
The use case is a bit involved, so I'll try to keep it simple.

The company I work for is developing an analyzer that handles (among other things) named entity extraction and structured text.
By structured text I mean text for which I provide an explicit section structure, like:

<document>
<title>John Smith hired by Acme Inc.</title>
<body>Last monday the company announced the new hire.</body>
</document>

Section names are not fixed: they can repeat within the same text, and they change on a per-project basis.
Named entities can also vary by project.

We need to support queries per extraction type, and also to constrain the search to a specific section.
e.g.: John Smith as person in section title

We also need to support near queries between different extraction types, like:
John Smith as person near Acme Inc. as organization
This could be done either by sentence (using slop) or by named section.

Here is a mapping example:

{
	"settings": {
		"analysis": {
			"analyzer": {
				"field_analyzer": {
					"type": "custom",
					"tokenizer": "structure_tokenizer"
				},
				"sentence_analyzer": {
					"type": "custom",
					"tokenizer": "structure_tokenizer",
					"filter": [
						"not_section"
					]
				},
				"organizations_analyzer": {
					"type": "custom",
					"tokenizer": "structure_tokenizer",
					"filter": [
						"organizations"
					]
				},
				"people_analyzer": {
					"type": "custom",
					"tokenizer": "structure_tokenizer",
					"filter": [
						"people"
					]
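				},
				"section_analyzer": {
					"type": "custom",
					"tokenizer": "structure_tokenizer",
					"filter": [
						"section"
					]
				}
			},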
			"tokenizer": {
				"structure_tokenizer": {
					"type": "structure-tokenizer"
				}
			},
			"filter": {
				"organizations": {
					"type": "keep_types",
					"types": [
						"organization"
					]
				},
				"people": {
					"type": "keep_types",
					"types": [
						"people"
					]
				},
				"section": {
					"type": "keep_types",
					"types": [
						"section"
					]
				},
				"not_section": {
					"type": "remove_types",
					"types": [
						"section"
					]
				}
			}
		}
	},
	"mappings": {
		"type": {
			"properties": {
				"text": {
					"type": "text",
					"fielddata": true,
					"analyzer": "sentence_analyzer",
					"term_vector": "with_positions_offsets",
					"fields": {
						"organizations": {
							"type": "text",
							"fielddata": true,
							"analyzer": "organizations_analyzer",
							"term_vector": "with_positions_offsets"
						},
						"people": {
							"type": "text",
							"fielddata": true,
							"analyzer": "people_analyzer",
							"term_vector": "with_positions_offsets"
						},
						"section": {
							"type": "text",
							"analyzer": "section_analyzer",
							"term_vector": "with_positions_offsets"
						}
					}
				}
			}
		}
	}
}

structure-tokenizer is developed by us.
Our analysis engine is able to generate a record like this:

value: John Smith
offset: [0, 10]
token pos: 0
type: people
section: title

From these records we generate index tokens:

example for text:

first token

value: John Smith
offset: [0, 10]
token pos: 0
type: people

second token

value: hired
offset: [11, 16]
token pos: 1
type: keyword

example for people:

value: John Smith
offset: [0, 10]
token pos: 0
type: people

example for section:

first token

value: title
offset: [0, 10]
token pos: 0
type: section

second token

value: title
offset: [11, 16]
token pos: 1
type: section

For each token there is an overlapping token in the section field.

The "text" field contains all tokens and will be used for match and phrase queries.

To support constraints by section, we store the section information for each token in a dedicated "section" field and query it with field_masking_span and span_containing.
e.g.:

GET /_search
{
	"query": {
		"span_containing": {
			"big": {
				"span_term": {
					"people": "John Smith"
				}
			},
			"little": {
				"field_masking_span": {
					"query": {
						"span_term": {
							"section": "title"
						}
					},
					"field": "people"
				}
			}
		}
	}
}
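
The near query between extraction types mentioned earlier could probably be written in the same style. The following is only a sketch of a search request body: it assumes that the "people" and "organizations" fields from the mapping above share token positions, it uses the same short field names as the query above, and the slop of 5 is an arbitrary value chosen for illustration.

```json
{
	"query": {
		"span_near": {
			"clauses": [
				{
					"span_term": {
						"people": "John Smith"
					}
				},
				{
					"field_masking_span": {
						"query": {
							"span_term": {
								"organizations": "Acme Inc."
							}
						},
						"field": "people"
					}
				}
			],
			"slop": 5,
			"in_order": false
		}
	}
}
```

Here field_masking_span makes the "organizations" clause report itself as belonging to "people", since span_near requires all of its clauses to be on the same field.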

Coming back to the main point: the "text" field requires all tokens except those related to "section", hence the request for a "remove_types" token filter.
I'm aware that I can achieve the same result by using "keep_types" to include only the useful tokens, but that seems overly verbose, error-prone and less readable.

e.g.:

"not_section": {
	"type": "remove_types",
	"types": [
		"section"
	]
}

vs.

"not_section": {
	"type": "keep_types",
	"types": [
		"keyword",
		"people",
		"organization"
	]
}

We often have tens of different named extractions.

I hope this clarifies my request :)

jpountz · 2018-04-06T13:27:27Z

Discussed in FixitFriday: we agreed to do it. Here is the plan we discussed:

- update the existing filter so that it supports includes and excludes
- not support wildcards, only exact strings
- fail if both includes and excludes are specified

edovac · 2018-04-06T13:42:48Z

Thanks :)

Currently the `keep_types` token filter includes all token types specified using its `types` parameter. Lucene's TypeTokenFilter also provides a second mode where, instead of keeping the specified token types (include), it filters them out (exclude). This change exposes this option as a new `mode` parameter that can take either the value `include` (the default, if not specified) or `exclude`. Closes elastic#29277
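
With the `mode` parameter described in this change, the `not_section` filter requested above could presumably be declared as follows. This is a sketch of the filter definition only, wrapped in braces so it is valid JSON on its own; the filter name and the `section` token type are taken from the example mapping earlier in this thread.

```json
{
	"not_section": {
		"type": "keep_types",
		"types": [
			"section"
		],
		"mode": "exclude"
	}
}
```

Omitting `mode` keeps the existing include behaviour, so filter definitions that only set `types` should be unaffected.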

bleskes added the :Search Relevance/Analysis label Mar 28, 2018

bleskes added the >enhancement label Mar 28, 2018

jimczi added the discuss label Mar 29, 2018

mayya-sharipova added feedback_needed and removed discuss labels Mar 29, 2018

colings86 added discuss and removed feedback_needed labels Apr 4, 2018

pcsanwald removed the discuss label Apr 6, 2018

pcsanwald assigned jpountz and unassigned jpountz Apr 6, 2018

jpountz added the help wanted adoptme label Apr 6, 2018

cbuescher self-assigned this Jul 12, 2018

cbuescher mentioned this issue Jul 12, 2018

Add exclusion option to keep_types token filter #32012

Merged

cbuescher closed this as completed in #32012 Jul 17, 2018

javanna added the Team:Search Relevance label Jul 16, 2024
