Add remove types token filter (as opposite to keep_types token filter) #29277

Closed
edovac opened this issue Mar 28, 2018 · 5 comments
Labels: >enhancement, help wanted adoptme, :Search Relevance/Analysis (How text is split into tokens), Team:Search Relevance (Meta label for the Search Relevance team in Elasticsearch)

edovac commented Mar 28, 2018

Describe the feature:
Hi, Elasticsearch provides the keep_types token filter, but does not provide a token filter to exclude specific token types from the token stream.

As I understand it, the keep_types token filter is implemented using Lucene's org.apache.lucene.analysis.core.TypeTokenFilter.TypeTokenFilter(TokenStream, Set<String>, boolean), which implements both behaviours.
It would be nice to have the remove filter too.
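
To illustrate the two behaviours, here is a minimal Python sketch of type-based filtering. It is not the Lucene code: `Token`, `filter_by_type` and the `use_whitelist` flag are illustrative names, and the flag is assumed to play the role of the boolean constructor argument mentioned above. The token types are only sample values.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class Token:
    value: str
    type: str


def filter_by_type(tokens: Iterable[Token], types: Set[str], use_whitelist: bool) -> List[Token]:
    """Keep tokens whose type is in `types` (use_whitelist=True) or drop them (False)."""
    return [t for t in tokens if (t.type in types) == use_whitelist]


# Sample token stream with illustrative type names.
stream = [
    Token("quick", "<ALPHANUM>"),
    Token("42", "<NUM>"),
    Token("fox", "<ALPHANUM>"),
]

kept = filter_by_type(stream, {"<NUM>"}, use_whitelist=True)      # "keep types" behaviour
removed = filter_by_type(stream, {"<NUM>"}, use_whitelist=False)  # "remove types" behaviour
```

The same set of types drives both modes; only the boolean decides whether matching tokens survive or are dropped.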

Elasticsearch version: 6.2

bleskes added the :Search Relevance/Analysis label Mar 28, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-search-aggs

@mayya-sharipova
Contributor

@edovac Can you please describe your use case? How are you going to use the keep_types token filter to exclude specific token types?

edovac commented Apr 4, 2018

Hi, sorry for the delay.
The use case is a bit involved, so I'll try to keep it simple.

The company I work for is developing an analyzer that can handle (among other things) named entity extraction and structured text.
By structured text I mean text for which I provide an explicit section structure, like:

<document>
<title>John Smith hired by Acme Inc.</title>
<body>Last Monday the company announced the new hire.</body>
</document>

Section names are not fixed: they can repeat within the same text, and they change on a per-project basis.
Named entities can also vary by project.

We need to support queries per extraction type, and also to constrain a search to a specific section,
e.g.: John Smith as person in section title

We also need to support near queries between different extraction types, like:
John Smith as person near Acme Inc. as organization
This could be done either by sentence (using slop) or by named section.

Here is a mapping example:

{
	"settings": {
		"analysis": {
			"analyzer": {
				"field_analyzer": {
					"type": "custom",
					"tokenizer": "structure_tokenizer"
				},
				"sentence_analyzer": {
					"type": "custom",
					"tokenizer": "structure_tokenizer",
					"filter": [
						"not_section"
					]
				},
				"organizations_analyzer": {
					"type": "custom",
					"tokenizer": "structure_tokenizer",
					"filter": [
						"organizations"
					]
				},
				"people_analyzer": {
					"type": "custom",
					"tokenizer": "structure_tokenizer",
					"filter": [
						"people"
					]
				}
			},
			"tokenizer": {
				"structure_tokenizer": {
					"type": "structure-tokenizer"
				}
			},
			"filter": {
				"organizations": {
					"type": "keep_types",
					"types": [
						"organization"
					]
				},
				"people": {
					"type": "keep_types",
					"types": [
						"people"
					]
				},
				"section": {
					"type": "keep_types",
					"types": [
						"section"
					]
				},
				"not_section": {
					"type": "remove_types",
					"types": [
						"section"
					]
				}
			}
		}
	},
	"mappings": {
		"type": {
			"properties": {
				"text": {
					"type": "text",
					"fielddata": true,
					"analyzer": "sentence_analyzer",
					"term_vector": "with_positions_offsets",
					"fields": {
						"organizations": {
							"type": "text",
							"fielddata": true,
							"analyzer": "organizations_analyzer",
							"term_vector": "with_positions_offsets"
						},
						"people": {
							"type": "text",
							"fielddata": true,
							"analyzer": "people_analyzer",
							"term_vector": "with_positions_offsets"
						},
						"section": {
							"type": "text",
							"analyzer": "section_analyzer",
							"term_vector": "with_positions_offsets"
						}
					}
				}
			}
		}
	}
}

structure-tokenizer is developed by us.
Our analysis engine is able to generate a record like this:

  • value: John Smith
  • offset: [0, 10]
  • token pos: 0
  • type: people
  • section: title

From these records we generate index tokens:

example for text:

first token

  • value: John Smith
  • offset: [0, 10]
  • token pos: 0
  • type: people

second token

  • value: hired
  • offset: [11, 16]
  • token pos: 1
  • type: keyword

example for people:

  • value: John Smith
  • offset: [0, 10]
  • token pos: 0
  • type: people

example for section:

first token

  • value: title
  • offset: [0, 10]
  • token pos: 0
  • type: section

second token

  • value: title
  • offset: [11, 16]
  • token pos: 0
  • type: section

For each token we have an overlapping one in the section field.

"text" field contains all tokens and will be used for match and phrase queries.

To support constraints by section, we store the section information for each token in a dedicated "section" field and query it with field_masking_span and span_containing,
e.g.:

GET /_search
{
	"query": {
		"span_containing": {
			"big": {
				"span_term": {
					"people": "John Smith"
				}
			},
			"little": {
				"field_masking_span": {
					"query": {
						"span_term": {
							"section": "title"
						}
					},
					"field": "people"
				}
			}
		}
	}
}

Coming back to the main point: the "text" field requires all tokens except those related to "section", hence the request for a "remove_types" token filter.
I'm aware that I can achieve the same result by using "keep_types" to include only the useful tokens, but that seems overly verbose, error-prone and less readable.

e.g.:

"not_section": {
	"type": "remove_types",
	"types": [
		"section"
	]
}

vs.

"not_section": {
	"type": "keep_types",
	"types": [
		"keyword",
		"people",
		"organization"
	]
}

We often have dozens of different named extraction types.
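
The maintenance cost of the keep_types variant can be sketched in Python: the keep list is the complement of the removed types, so it has to be recomputed whenever a project adds an extraction type. The helper `keep_list_for` and the extra `location` type are hypothetical, used only for illustration.

```python
from typing import List, Set


def keep_list_for(all_types: Set[str], removed: Set[str]) -> List[str]:
    """Return the types a keep_types filter must list to emulate removing `removed`."""
    return sorted(all_types - removed)


# Token types taken from the example above.
project_types = {"keyword", "people", "organization", "section"}
keep = keep_list_for(project_types, {"section"})

# Adding one (hypothetical) extraction type changes the keep list,
# while the list of types to remove stays exactly {"section"}.
project_types_v2 = project_types | {"location"}
keep_v2 = keep_list_for(project_types_v2, {"section"})
```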

I hope this clarifies my request :)

pcsanwald removed the discuss label Apr 6, 2018
pcsanwald assigned jpountz and unassigned jpountz Apr 6, 2018

jpountz commented Apr 6, 2018

Discussed in FixitFriday: we agreed to do it. Here is the plan we discussed:

  • update the existing filter so that it supports includes and excludes
  • not support wildcards, only exact strings
  • fail if both includes and excludes are specified
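
The plan above could be sketched as follows. This is a Python illustration under stated assumptions, not Elasticsearch code: the setting names `include` and `exclude`, the function `parse_type_filter`, and the rule that at least one of the two must be present are all hypothetical. Note that the change that later closed this issue exposed a `mode` parameter instead.

```python
from typing import Dict, FrozenSet, List, Tuple


def parse_type_filter(settings: Dict[str, List[str]]) -> Tuple[FrozenSet[str], bool]:
    """Return (types, keep) from hypothetical `include` / `exclude` settings."""
    include = settings.get("include")
    exclude = settings.get("exclude")
    # Plan item: fail if both includes and excludes are specified.
    if include is not None and exclude is not None:
        raise ValueError("specify either include or exclude, not both")
    # Assumption of this sketch: an empty configuration is also rejected.
    if include is None and exclude is None:
        raise ValueError("one of include or exclude is required")
    types = include if include is not None else exclude
    # Plan item: exact strings only, so the types are used as plain set members.
    return frozenset(types), include is not None


types, keep = parse_type_filter({"exclude": ["section"]})
```

Here `keep` is `False`, meaning tokens whose type is in `types` would be dropped.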

jpountz added the help wanted adoptme label Apr 6, 2018

edovac commented Apr 6, 2018

Thanks :)

cbuescher self-assigned this Jul 12, 2018
cbuescher pushed a commit to cbuescher/elasticsearch that referenced this issue Jul 12, 2018
Currently the `keep_types` token filter includes all token types specified using
its `types` parameter. Lucene's TypeTokenFilter also provides a second mode where
instead of keeping the specified tokens (include) they are filtered out
(exclude). This change exposes this option as a new `mode` parameter that can
either take the values `include` (the default, if not specified) or `exclude`.

Closes elastic#29277
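
With this change, the `not_section` filter requested earlier in the thread could presumably be written with `keep_types` itself. The Python sketch below builds such a filter definition as a dict and prints it as JSON. The filter name and the `section` type come from the earlier example; the surrounding settings layout is an assumption, and the exact released syntax should be checked against the Elasticsearch documentation.

```python
import json

# Filter definition using the `mode` parameter described in the commit message.
not_section = {
    "type": "keep_types",
    "types": ["section"],
    "mode": "exclude",  # "include" is the default when `mode` is not specified
}

# Assumed placement inside index settings, mirroring the mapping example above.
analysis_settings = {"analysis": {"filter": {"not_section": not_section}}}

print(json.dumps(analysis_settings, indent=2))
```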
cbuescher pushed a commit that referenced this issue Jul 17, 2018
Closes #29277
javanna added the Team:Search Relevance label Jul 16, 2024