
Add 'singleton' flag to field mappers? #58523

Closed
jtibshirani opened this issue Jun 25, 2020 · 28 comments
Labels: >enhancement · :Search Foundations/Mapping · Team:Search Foundations

Comments

@jtibshirani (Contributor) commented Jun 25, 2020

Datastreams have a first-class concept of a timestamp field. Each document in the datastream must contain exactly one value for the designated timestamp field, so that we know where to route the document when partitioning by time. In #58582 we're adding index-time validation for this requirement. The current implementation is very narrow in scope and adds special document parsing logic just for datastream timestamps.

I was wondering if this validation could be useful more generally. We could add a 'singleton' flag to field mappers -- when it is set, the mapper will verify that it encounters one (and only one) value in each document:

"product_identifier": {
  "type": "keyword",
  "singleton": true
}

Such an option could be helpful for modeling fields that an application requires to have exactly one value, like identifiers, timestamps, content types, etc.
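For illustration, the intended parse-time behaviour might look like the following sketch (hypothetical Python, not Elasticsearch code; the helper name and error messages are made up for this example):

```python
def validate_singleton(doc: dict, field: str) -> None:
    # The proposed flag: each document must contain exactly one value
    # for the field -- both missing and multi-valued are rejected.
    value = doc.get(field)
    if value is None:
        raise ValueError(f"[{field}] is required but missing")
    if isinstance(value, list) and len(value) != 1:
        raise ValueError(f"[{field}] encountered multiple values")
```

So a document like `{"product_identifier": "SKU-1"}` would be accepted, while `{}` or `{"product_identifier": ["a", "b"]}` would be rejected.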

@jtibshirani jtibshirani added >enhancement :Search Foundations/Mapping Index mappings, including merging and defining field types labels Jun 25, 2020
@elasticmachine (Collaborator):

Pinging @elastic/es-search (:Search/Mapping)

@webmat commented Jun 30, 2020

I'd like to clarify one part:

Each document in the datastream must contain exactly one timestamp

I assume this sentence applies only to the field where singleton is set. In your example, @timestamp contains exactly one timestamp, and not an array of timestamps. This I'm totally on board with 👍

However the way this is currently worded makes it sound like no other date fields would be allowed (e.g. an event with @timestamp, event.start and event.end populated). I assume this is not the case here?

Stepping out of the timestamp example, I like this feature in general, not just for datastreams. Elasticsearch has long been flexible with fields containing either a single value or an array. However since ECS came out, we've gradually been clarifying which ECS fields are expected to contain an array of values, in order to make the event format more predictable for consumers of the data.

This new addition would help distinguish 3 acceptable states for a field:

  • it must be a single value
  • it must be an array of values
  • it's (still) unspecified (default Elasticsearch behaviour), but most likely a single value

@jtibshirani (Contributor, Author) commented Jun 30, 2020

However the way this is currently worded makes it sound like no other date fields would be allowed (e.g. an event with @timestamp, event.start and event.end populated). I assume this is not the case here?

Your assumption is right, I've updated the description to clarify the wording.

@ruflin (Contributor) commented Jul 28, 2020

If I understand it correctly, constant_keyword today is already a singleton field? In general, ++ on this feature.

@jtibshirani (Contributor, Author):

If I understand it correctly, constant_keyword today is already a singleton field?

Not exactly, because documents are allowed to not specify a value for constant_keyword. When performing a search, we know to fill in the missing field with the constant value from the mappings. But you're right in that they appear like singleton fields from a search API perspective.

@romseygeek (Contributor):

We discussed this in the search meeting and agree that it would be a useful addition to at least some of the field mappers. We think that allow_multi_fields:true|false might be a better parameter name to use as it's more obvious that we're not enforcing that a single value is present.

@webmat commented Aug 14, 2020

From the start I've interpreted this as a way to ensure a field would not contain an array of values. E.g. preventing the following:

{ "@timestamp": [ "1597425324", "1597425325" ] }

And enforcing that a field has a single value, e.g.:

{ "@timestamp": "1597425324" }

My understanding is that this feature is unrelated to whether or not a field has additional multi-fields.

@mayya-sharipova (Contributor):

I think @romseygeek meant allow_multiple_values:

  • when set true, we will allow multiple values
  • when set false, we will not allow multiple values. A value may be a single value, or it may not exist.

Or is the requirement with singleton that every document must have this field with exactly one value?

@jtibshirani (Contributor, Author) commented Aug 18, 2020

@romseygeek @mayya-sharipova the singleton flag would ensure there is exactly one value (as mentioned in the description), not just 'at most' one value. This is part of the usefulness of the feature -- for example with time-based data, we may want to ensure that every document has a timestamp and can be associated with a bucket based on time.

@jtibshirani (Contributor, Author):

We discussed again as a team and agreed that this could be a useful feature. But we didn't see a strong immediate need for it yet, and will leave the issue open to gather more feedback on use cases and priority.

Other topics from our discussion:

  • We would probably want to return the flag in field caps. We would merge the flag across indices, only returning singleton: true if every index had singleton: true.
  • There may be a concern around over-use. Would users see the flag and start applying it everywhere, even when the default behavior would work fine?
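The field caps merge rule in the first point can be sketched as (hypothetical helper name, not an actual Elasticsearch API):

```python
def merge_singleton_flag(per_index_flags: list) -> bool:
    # Field caps merge: report singleton: true only if every
    # matched index maps the field with singleton: true.
    return all(per_index_flags)
```

For example, one index without the flag is enough to make the merged result `False`.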

@markharwood (Contributor):

I was wondering if this validation could be useful more generally.

Absolutely. I expect that in practice the majority of Elasticsearch fields are single-valued, but always allowing for multiple values has created a poor user experience in Kibana. Kibana typically offers a search box and shows visualizations, e.g. bar charts of operating systems used by website visitors. Clicking on the bars creates "filter pills", which are always ANDed.
It never makes sense to create a search with 2 values ANDed on a single-value field, as no documents can match, yet Kibana will not stop users from trying this.

Kibana could offer users much more sensible drill-down options if it knew that a field were a single value field. E.g. allowing multiple selections from a bar/pie chart and ORing these selections from the same field rather than ANDing them. This would be a great step forward in Kibana usability.

@markharwood (Contributor):

Make it a dynamic setting?

If Kibana starts to make UX improvements based on this flag, I expect people will want to "tighten up" their mapping definitions for existing data without reindexing. This means the setting would ideally be dynamic, i.e. it could be redeclared on existing index mappings.

When changing this setting, validating all existing docs are single-valued could be too expensive/difficult to implement. If unvalidated this could leave us in a state where the mappings declare fields are single-valued (and future docs will be validated as such) but there could still be some historical docs that have multi-values.
That would be an unfortunate scenario, but I think the advantages of offering a dynamic, unvalidated setting may outweigh the disadvantages:

Advantage - the majority of existing indices can add this useful metadata without reindexing, and it would hold true for all docs. Kibana introduces better default search behaviour (term filters on this field are ORed rather than ANDed).
Disadvantage - there may be weird behaviour given this combination of rarer circumstances:

  1. An admin mistakenly declares a field as single-value where there are historical examples of multi-value, and
  2. Some future client applications are coded with strict assumptions based on this new flag, e.g. they expect single values and encounter a doc with an array.

@markharwood (Contributor):

Dynamic mappings behaviour?

Another consideration is what would happen with dynamic mappings. It would certainly be convenient if the common case (single value fields) were detected and mapped as single values automatically. However, doing so means that this sequence of docs would create an error:

POST my_new_index/_doc/1
{ "foo": "bar"}         # Detects and creates new single-valued field definition
POST my_new_index/_doc/2
{ "foo": ["bar", "baz"]}         # Throws error.

This would be a breaking change to existing behaviour.
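A simulation of that sequence (hypothetical Python, not Elasticsearch code; the in-memory mapping shape and flag name are made up):

```python
def index_doc(mapping: dict, doc: dict) -> None:
    # Dynamic mapping that locks in single- vs multi-valued from the
    # first occurrence of a field, then enforces it on later docs.
    for field, value in doc.items():
        is_array = isinstance(value, list)
        if field not in mapping:
            mapping[field] = {"allow_multiple_values": is_array}
        elif is_array and not mapping[field]["allow_multiple_values"]:
            raise ValueError(f"[{field}] encountered multiple values")
```

With this behaviour, `{"foo": "bar"}` indexes fine and creates a single-valued `foo`, but the follow-up `{"foo": ["bar", "baz"]}` raises an error, unlike today's lenient defaults.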

@markharwood (Contributor) commented Oct 28, 2021

@jimczi suggested we explicitly expand the scope of this issue to cover both the mandatory single-value case (like timestamp) and the case where multiple values are permitted.

I propose a new cardinality setting in mappings with min and max values. The logic for these min/max values could be:

  • When min is zero or unset then the field is optional
  • When min is > zero the field is mandatory
  • When max is 1 arrays are rejected
  • When max is > 1 arrays are accepted (perhaps not required?)
  • When max is unset then unlimited values are permitted in an array

So the default is our current behaviour for fields : no expected max or min.
We need to consider the logic for merging conflicting mapping definitions in field caps.
I expect reporting "min of all mins" and "max of all maxes" is the way to go.
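That merge rule could be sketched like this (hypothetical helper; `None` stands for an unset bound):

```python
def merge_cardinality(ranges: list) -> tuple:
    # Field caps merge for per-index (min, max) cardinality bounds:
    # report the loosest bounds -- min of all mins, max of all maxes.
    # None means unset: no minimum (optional) or no maximum (unlimited).
    mins = [lo for lo, _ in ranges]
    maxs = [hi for _, hi in ranges]
    merged_min = min(0 if lo is None else lo for lo in mins)
    merged_max = None if any(hi is None for hi in maxs) else max(maxs)
    return merged_min, merged_max
```

For example, merging `(1, 1)` with `(0, None)` would report `(0, None)`.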

We'd also need to pick a format for presenting the params:

  1. Object: "cardinality": { "min": 1, "max": 1 }
  2. Fields: "min_cardinality": 1, "max_cardinality": 1
  3. String syntax: "cardinality": "0,1" or "cardinality": "1,*"

Is there a precedent somewhere we should follow for consistency's sake?

@markharwood (Contributor) commented Oct 28, 2021

When we discussed the cardinality settings I outlined in my last comment, @jpountz felt that this might lead us down the slippery slope of validating all incoming data. The next asks would be support for valid enum values, positive-only numerics, custom validation scripts, etc.
This is something we have tried to avoid; we have pushed client applications to retain most of the control over validation.
That being the case, this issue should remain about "minimal validation" - only the stuff we really care about, like the single-value timestamps used for routing.
@jpountz did suggest that it would be possible to derive single- vs multi-value flags from the data held in indexed fields, and that we could return that relatively cheaply as part of the field caps API. Before I open a different issue to focus on providing just that, it would be good to agree that is what we want. Much of the "derive vs control" choice for field cardinality flags might come down to the level of reliability we need in what is reported. Enforcing cardinality at ingest time via the mapping would provide reliable field-singularity flags, whereas the derived state of an uncontrolled index can change with the very next document addition.

Any thoughts @astefan @costin ?

@astefan (Contributor) commented Nov 2, 2021

@markharwood agreed on dealing with the potential field caps API addition in a different issue. I don't think QL has any use for a mapping-level flag, as we do not look at the indices' mappings.

@markharwood (Contributor) commented Nov 2, 2021

don't think QL has any use in a mapping level flag, as we do not look at the indices mappings.

@astefan I don't expect you would need to look at mappings to benefit from this. We'd surface the mapping-level flags in field caps summaries.

When it comes to cardinalities, I'm leaning towards the policy of enforce-via-mapping rather than report-on-index-state because:

  1. Costs of deriving field-caps information are higher when you have to poke around all segments of all indices to reverse-engineer actual cardinality info (perhaps even more so on frozen tier)
  2. Field caps would require frequent calls as cardinality info is not stable - a field reported as single-valued one instant could turn to a multi valued one the next with the addition of one more document.
  3. Reverse-engineering cardinalities from poking around Lucene indices will not work with all field types. A single-valued JSON property could be indexed any number of ways by a field - as a single token or many or not at all in the case of runtime fields.

@markharwood (Contributor) commented Nov 4, 2021

When it comes to enforcement I think enforcing a maximum cardinality will be much cheaper/more viable than enforcing a minimum cardinality.
Checking the max cardinality is simple because there's one central piece of logic in the DocumentParser class that parses JSON arrays. We can put a simple field-specific limit check in there.
However, checking that fields are present in every doc i.e. when min_cardinality is 1 involves more computation for each doc:

  1. All fields in a mapping with min_cardinality >0 have to be loaded into a "required_fields" set
  2. Each field encountered in the JSON of a doc being indexed removes any corresponding name from the "required_fields" set
  3. An error is thrown if the required_fields set is not empty for a fully parsed document
  4. Extra complexity occurs within objects or nested fields that are allowed multiples but their leaf fields are not.

I'm concerned that this min_cardinality checking would add overhead to one of the hottest loops - the indexing logic.
For this reason I propose that we only offer an allow_multiple_values boolean for fields rather than offer both a min and max cardinality validation.
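The required-fields bookkeeping from steps 1-3 might look like this (a simplified sketch with hypothetical names; it ignores the nested-object complications from point 4):

```python
def check_required_fields(doc: dict, required: set) -> None:
    # 1. Start from the set of fields mapped with min_cardinality > 0.
    remaining = set(required)
    # 2. Discard each field name encountered while parsing the document.
    for field in doc:
        remaining.discard(field)
    # 3. Error out if any required field was never seen.
    if remaining:
        raise ValueError(f"missing required fields: {sorted(remaining)}")
```

Even in this flat form, the per-document set copy and membership bookkeeping illustrate the extra work in the indexing hot loop.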

@jpountz (Contributor) commented Nov 5, 2021

One benefit of enforcing in the mappings that I can think of is that we might be able to speed up indexing for single-valued fields by using NumericDocValuesField instead of SortedNumericDocValuesField and SortedDocValuesField instead of SortedSetDocValuesField. It might also make it easier to reuse Field instances, which is another factor that helps speed up indexing. I'm unsure about the sort of speedup that we should expect though.

Costs of deriving field-caps information are higher when you have to poke around all segments of all indices to reverse-engineer actual cardinality info

FWIW we already have such logic when an index_filter is provided, since we need to examine all segments to know if any of them intersects with the query.

(perhaps even more so on frozen tier)

Index statistics are loaded in memory when opening a shard, so it wouldn't be slower on the frozen tier. However, nodes on the frozen tier are denser, so they would have more shards to check than hot nodes.

The worry I have about the "enforce-via-mapping" approach is adoption:

  • it wouldn't work on indices created before this feature is introduced,
  • if you forgot to set it at index creation time, you cannot add it later without reindexing,
  • some users (e.g. Logging use-cases) might be nervous about using this feature because they'd rather index documents that are multi-valued even though they shouldn't than reject them.

Furthermore the "enforce-via-mapping" approach brings unique challenges too. For instance if you have 100 data streams and one of them doesn't set the allow_multiple_values flag on the agent.id field even though it is single-valued, then calling field caps on an index pattern that matches these 100 data streams will report agent.id as being multi-valued.

To me the strongest argument against the "report-on-index-state" approach is the performance overhead. Maybe we could build a prototype and try to evaluate the performance impact to see how bad it actually is?

@markharwood (Contributor) commented Nov 8, 2021

we might be able to speed up indexing for single-valued fields by using NumericDocValuesField instead of SortedNumericDocValuesField and SortedDocValuesField instead of SortedSetDocValuesField

I tried this out for keyword and date fields for a couple of smallish datasets but didn't see any noticeable uplift to indexing speeds. Haven't tried recycling Field instances yet.
Any disk savings from switching to single-valued for keyword and date fields were marginal.

markharwood added a commit to markharwood/elasticsearch that referenced this issue Jan 4, 2022
Inspects index contents in some cases to reveal if all docs hold single values or not for a field.

Relates to elastic#58523
@dakrone (Member) commented Apr 18, 2023

Was this ever decided on? It seems like it would be a useful feature.

@ruflin (Contributor) commented Apr 19, 2023

This popped up again in the context of the routing processor that has now landed: #76511. As soon as a value is multi-valued, we might not be able to route the document. At the same time, the fields that are likely to be used for routing should never be multi-valued in the first place. It would be nice to even enforce this.

Our need for a singleton field is not driven by performance improvements but rather by strictness on the data. With #95329 we become more lenient on what data we accept for logs-*-*. It means that even for a singleton field, offending data would still be accepted, just not indexed.

Could we start this effort by having the option on keyword fields? It seems @markharwood made some tests quite some time ago on this which means the code was changed for it. Does anyone know if this code is still around somewhere?

Playing around with data streams and trying to ingest an array for @timestamp I get the following error: "reason": "[11:1] failed to parse: data stream timestamp field [@timestamp] encountered multiple values". Seems like we already have a singleton field here (which is good).

@abatsakis (Contributor):

We can discuss the feature independently, but why don't we want a routing field to ever be multi-valued? Also, are you going to force the rerouting processor to work only on fields with the singleton entry?

@ruflin (Contributor) commented Apr 20, 2023

The discussion on the reroute processor happened here: #76511 (comment). For now, routing only works on fields with a singleton entry and otherwise throws an exception. The problem with multiple values is: which value do we pick for routing? @felixbarny

@javanna (Member) commented Apr 20, 2023

This has been getting some traction again recently; I discussed it with @nik9000 in the context of esql. We discussed it again today within the Search team and agreed it's something we would like to do once some of us have time.

@abatsakis (Contributor):

@ruflin the answer to your question is... it doesn't matter :) If a user doesn't explicitly handle multi-valued fields in their routing processors, then the data are most likely bogus, so you can do whatever you want with them, e.g. route them to the error store explicitly, drop them, or pick one value randomly.

This is something that can happen anyway because someone may try to route a field that is not designated as singleton, unless you will require routing rules to be applied only to those fields (which I don't think you want to). Same for ESQL.

In other words, I don't see good reasons for this feature. Maybe I am missing something though, so before you intend to pick it up, let's talk :)

@elasticsearchmachine (Collaborator):

Pinging @elastic/es-search (Team:Search)

@javanna (Member) commented May 1, 2023

Some clarifications:

I get the following error: "reason": "[11:1] failed to parse: data stream timestamp field [@timestamp] encountered multiple values". Seems like we already have a singleton field here (which is good).

Yes, we perform this check specifically for the @timestamp field. Quoting the description of this issue: "Datastreams have a first-class concept of a timestamp field. Each document in the datastream must contain exactly one value for the designated timestamp field, so that we know where to route the document when partitioning by time." This issue was originally about generalizing that check and potentially applying it to other fields. For a big part of the discussion, though, we actually discussed a different problem: adding a flag that specifies whether a field is allowed to have multiple values, which does not mean that each document must have a value for it.

While we are currently discussing the possibility of adding the allow_multiple_values flag, which is covered by #80825, we have no concrete plans to add the singleton flag. We acknowledged that it may help, but we don't have much data or many concrete scenarios where it would, which is pretty much why we have not implemented it so far: it has not been a priority. For this reason I am closing this issue; we may reopen it if we collect feedback on concrete scenarios where adding a singleton flag would help.

@javanna closed this as not planned, May 1, 2023
@javanna added the Team:Search Foundations label and removed the Team:Search label, Jul 16, 2024