
Add 'singleton' flag to field mappers? #58523

Closed
jtibshirani opened this issue Jun 25, 2020 · 28 comments
Labels: >enhancement · :Search Foundations/Mapping · Team:Search Foundations

Comments

@jtibshirani (Contributor) commented Jun 25, 2020

Datastreams have a first-class concept of a timestamp field. Each document in the datastream must contain exactly one value for the designated timestamp field, so that we know where to route the document when partitioning by time. In #58582 we're adding index-time validation for this requirement. The current implementation is very narrow in scope and adds special document parsing logic just for datastream timestamps.

I was wondering if this validation could be useful more generally. We could add a 'singleton' flag to field mappers -- when it is set, the mapper will verify that it encounters one (and only one) value in each document:

"product_identifier": {
  "type": "keyword",
  "singleton": true
}

Such an option could be helpful for modeling fields that an application requires to have exactly one value, like identifiers, timestamps, content types, etc.
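For illustration, the intended parse-time behaviour might look like the following sketch (hypothetical Python, not Elasticsearch code; the helper name and error messages are made up for this example):

```python
def validate_singleton(doc: dict, field: str) -> None:
    # The proposed flag: each document must contain exactly one value
    # for the field -- both missing and multi-valued are rejected.
    value = doc.get(field)
    if value is None:
        raise ValueError(f"[{field}] is required but missing")
    if isinstance(value, list) and len(value) != 1:
        raise ValueError(f"[{field}] encountered multiple values")
```

So a document like `{"product_identifier": "SKU-1"}` would be accepted, while `{}` or `{"product_identifier": ["a", "b"]}` would be rejected.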

@jtibshirani jtibshirani added >enhancement :Search Foundations/Mapping Index mappings, including merging and defining field types labels Jun 25, 2020
@elasticmachine (Collaborator):

Pinging @elastic/es-search (:Search/Mapping)

@webmat commented Jun 30, 2020

I'd like to clarify one part:

Each document in the datastream must contain exactly one timestamp

I assume this sentence applies only to the field where singleton is set. In your example, @timestamp contains exactly one timestamp, and not an array of timestamps. This I'm totally on board with 👍

However the way this is currently worded makes it sound like no other date fields would be allowed (e.g. an event with @timestamp, event.start and event.end populated). I assume this is not the case here?

Stepping out of the timestamp example, I like this feature in general, not just for datastreams. Elasticsearch has long been flexible with fields containing either a single value or an array. However since ECS came out, we've gradually been clarifying which ECS fields are expected to contain an array of values, in order to make the event format more predictable for consumers of the data.

This new addition would help distinguish 3 acceptable states for a field:

  • it must be a single value
  • it must be an array of values
  • it's (still) unspecified (default Elasticsearch behaviour), but most likely a single value

@jtibshirani (Contributor, Author) commented Jun 30, 2020

However the way this is currently worded makes it sound like no other date fields would be allowed (e.g. an event with @timestamp, event.start and event.end populated). I assume this is not the case here?

Your assumption is right, I've updated the description to clarify the wording.

@ruflin (Contributor) commented Jul 28, 2020

If I understand it correctly, constant_keyword today is already a singleton field? In general, ++ on this feature.

@jtibshirani (Contributor, Author):

If I understand it correctly, constant_keyword today is already a singleton field?

Not exactly, because documents are allowed to not specify a value for constant_keyword. When performing a search, we know to fill in the missing field with the constant value from the mappings. But you're right in that they appear like singleton fields from a search API perspective.

@romseygeek (Contributor):

We discussed this in the search meeting and agree that it would be a useful addition to at least some of the field mappers. We think that allow_multi_fields:true|false might be a better parameter name to use as it's more obvious that we're not enforcing that a single value is present.

@webmat commented Aug 14, 2020

From the start I've interpreted this as a way to ensure a field would not contain an array of values. E.g. preventing the following:

{ "@timestamp": [ "1597425324", "1597425325" ] }

And enforcing that a field has a single value, e.g.:

{ "@timestamp": "1597425324" }

My understanding is that this feature is unrelated to whether or not a field has additional multi-fields.

@mayya-sharipova (Contributor):

I think @romseygeek meant allow_multiple_values:

  • when set true, we will allow multiple values
  • when set false, we will not allow multiple values. A value may be a single value, or it may not exist.

Or is the requirement with singleton that every document must have this field with exactly one value?

@jtibshirani (Contributor, Author) commented Aug 18, 2020

@romseygeek @mayya-sharipova the singleton flag would ensure there is exactly one value (as mentioned in the description), not just 'at most' one value. This is part of the usefulness of the feature -- for example with time-based data, we may want to ensure that every document has a timestamp and can be associated with a bucket based on time.

@jtibshirani (Contributor, Author):

We discussed again as a team and agreed that this could be a useful feature. But we didn't see a strong immediate need for it yet, and will leave the issue open to gather more feedback on use cases and priority.

Other topics from our discussion:

  • We would probably want to return the flag in field caps. We would merge the flag across indices, only returning singleton: true if every index had singleton: true.
  • There may be a concern around over-use. Would users see the flag and start applying it everywhere, even when the default behavior would work fine?
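The field caps merge rule in the first point can be sketched as (hypothetical helper name, not an actual Elasticsearch API):

```python
def merge_singleton_flag(per_index_flags: list) -> bool:
    # Field caps merge: report singleton: true only if every
    # matched index maps the field with singleton: true.
    return all(per_index_flags)
```

For example, one index without the flag is enough to make the merged result `False`.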

@markharwood (Contributor):

I was wondering if this validation could be useful more generally.

Absolutely. I expect that in practice the majority of Elasticsearch fields are single-valued, but always allowing for multiple values has created a poor user experience in Kibana. Kibana typically offers a search box and shows visualizations, e.g. bar charts of operating systems used by website visitors. Clicking on the bars creates "filter pills", which are always ANDed.
It never makes sense to create a search with 2 values ANDed on a single-value field, as no documents can match, yet Kibana will not stop users from trying this.

Kibana could offer users much more sensible drill-down options if it knew that a field were a single value field. E.g. allowing multiple selections from a bar/pie chart and ORing these selections from the same field rather than ANDing them. This would be a great step forward in Kibana usability.

@markharwood (Contributor):

Make it a dynamic setting?

If Kibana starts to make UX improvements based on this flag, I expect people will want to "tighten up" their mapping definitions for existing data without reindexing. This means the setting would ideally be dynamic, i.e. it could be redeclared on existing index mappings.

When changing this setting, validating all existing docs are single-valued could be too expensive/difficult to implement. If unvalidated this could leave us in a state where the mappings declare fields are single-valued (and future docs will be validated as such) but there could still be some historical docs that have multi-values.
That would be an unfortunate scenario, but I think the advantages of offering a dynamic, unvalidated setting may outweigh the disadvantages:

Advantage - the majority of existing indices can add this useful metadata without reindexing, and it would hold true for all docs. Kibana introduces better default search behaviour (term filters on this field are ORed rather than ANDed).
Disadvantage - there may be weird behaviour given this combination of rarer circumstances:

  1. An admin mistakenly declares a field as single-value where there are historical examples of multi-value, and
  2. Some future client applications are coded with strict assumptions based on this new flag, e.g. they expect single values and encounter a doc with an array.

@markharwood (Contributor):

Dynamic mappings behaviour?

Another consideration is what would happen with dynamic mappings. It would certainly be convenient if the common case (single value fields) were detected and mapped as single values automatically. However, doing so means that this sequence of docs would create an error:

POST my_new_index/_doc/1
{ "foo": "bar"}         # Detects and creates new single-valued field definition
POST my_new_index/_doc/2
{ "foo": ["bar", "baz"]}         # Throws error.

This would be a breaking change to existing behaviour.
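A simulation of that sequence (hypothetical Python, not Elasticsearch code; the in-memory mapping shape and flag name are made up):

```python
def index_doc(mapping: dict, doc: dict) -> None:
    # Dynamic mapping that locks in single- vs multi-valued from the
    # first occurrence of a field, then enforces it on later docs.
    for field, value in doc.items():
        is_array = isinstance(value, list)
        if field not in mapping:
            mapping[field] = {"allow_multiple_values": is_array}
        elif is_array and not mapping[field]["allow_multiple_values"]:
            raise ValueError(f"[{field}] encountered multiple values")
```

With this behaviour, `{"foo": "bar"}` indexes fine and creates a single-valued `foo`, but the follow-up `{"foo": ["bar", "baz"]}` raises an error, unlike today's lenient defaults.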

@markharwood (Contributor) commented Oct 28, 2021

@jimczi suggested we explicitly expand the scope of this issue to cover both the mandatory single-value case (like timestamp) and the case where multiple values are permitted.

I propose a new cardinality setting in mappings with min and max values. The logic for these min/max values could be:

  • When min is zero or unset then the field is optional
  • When min is > zero the field is mandatory
  • When max is 1 arrays are rejected
  • When max is > 1 arrays are accepted (perhaps not required?)
  • When max is unset then unlimited values are permitted in an array

So the default is our current behaviour for fields : no expected max or min.
We need to consider the logic for merging conflicting mapping definitions in field caps.
I expect reporting "min of all mins" and "max of all maxes" is the way to go.
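That merge rule could be sketched like this (hypothetical helper; `None` stands for an unset bound):

```python
def merge_cardinality(ranges: list) -> tuple:
    # Field caps merge for per-index (min, max) cardinality bounds:
    # report the loosest bounds -- min of all mins, max of all maxes.
    # None means unset: no minimum (optional) or no maximum (unlimited).
    mins = [lo for lo, _ in ranges]
    maxs = [hi for _, hi in ranges]
    merged_min = min(0 if lo is None else lo for lo in mins)
    merged_max = None if any(hi is None for hi in maxs) else max(maxs)
    return merged_min, merged_max
```

For example, merging `(1, 1)` with `(0, None)` would report `(0, None)`.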

We'd also need to pick a format for presenting the params:

  1. Object: "cardinality": { "min": 1, "max": 1 }
  2. Fields: "min_cardinality": 1, "max_cardinality": 1
  3. String syntax: "cardinality": "0,1" or "cardinality": "1,*"

Is there a precedent somewhere we should follow for consistency's sake?

@markharwood (Contributor) commented Oct 28, 2021

When we discussed the cardinality settings I outlined in my last comment, @jpountz felt that this might lead us down the slippery slope of validating all incoming data. The next asks would be support for valid enum values, positive-only numerics, custom validation scripts, etc.
This is something we have tried to avoid; we have pushed client applications to retain most of the control over validation.
That being the case, this issue should remain about "minimal validation" - only the stuff we really care about, like the single-value timestamps used for routing.
@jpountz did suggest that it would be possible to derive single- vs multi-value flags from the data held in indexed fields, and that we could return that relatively cheaply as part of the field caps API. Before I open a different issue to focus on providing just that, it would be good to agree that is what we want. Much of the "derive vs control" choice for field cardinality flags might come down to the level of reliability we need in what is reported. Enforcing cardinality at ingest time via the mapping would provide reliable field-singularity flags, whereas the derived state of an uncontrolled index can change with the very next document addition.

Any thoughts @astefan @costin ?

@astefan (Contributor) commented Nov 2, 2021

@markharwood agreed on dealing with the potential field caps API addition in a different issue. I don't think QL has any use for a mapping-level flag, as we do not look at the indices' mappings.

@markharwood (Contributor) commented Nov 2, 2021

don't think QL has any use in a mapping level flag, as we do not look at the indices mappings.

@astefan I don't expect you would need to look at mappings to benefit from this. We'd surface the mapping-level flags in field caps summaries.

When it comes to cardinalities, I'm leaning towards the policy of enforce-via-mapping rather than report-on-index-state because:

  1. Costs of deriving field-caps information are higher when you have to poke around all segments of all indices to reverse-engineer actual cardinality info (perhaps even more so on frozen tier)
  2. Field caps would require frequent calls as cardinality info is not stable - a field reported as single-valued one instant could turn to a multi valued one the next with the addition of one more document.
  3. Reverse-engineering cardinalities from poking around Lucene indices will not work with all field types. A single-valued JSON property could be indexed any number of ways by a field - as a single token or many or not at all in the case of runtime fields.

@markharwood (Contributor) commented Nov 4, 2021

When it comes to enforcement I think enforcing a maximum cardinality will be much cheaper/more viable than enforcing a minimum cardinality.
Checking the max cardinality is simple because there's one central piece of logic in the DocumentParser class that parses JSON arrays. We can put a simple field-specific limit check in there.
However, checking that fields are present in every doc i.e. when min_cardinality is 1 involves more computation for each doc:

  1. All fields in a mapping with min_cardinality >0 have to be loaded into a "required_fields" set
  2. Each field encountered in the JSON of a doc being indexed removes any corresponding name from the "required_fields" set
  3. An error is thrown if the required_fields set is not empty for a fully parsed document
  4. Extra complexity occurs within objects or nested fields that are allowed multiples but their leaf fields are not.

I'm concerned that this min_cardinality checking would add overhead to one of the hottest loops - the indexing logic.
For this reason I propose that we only offer an allow_multiple_values boolean for fields rather than offer both a min and max cardinality validation.
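The required-fields bookkeeping from steps 1-3 might look like this (a simplified sketch with hypothetical names; it ignores the nested-object complications from point 4):

```python
def check_required_fields(doc: dict, required: set) -> None:
    # 1. Start from the set of fields mapped with min_cardinality > 0.
    remaining = set(required)
    # 2. Discard each field name encountered while parsing the document.
    for field in doc:
        remaining.discard(field)
    # 3. Error out if any required field was never seen.
    if remaining:
        raise ValueError(f"missing required fields: {sorted(remaining)}")
```

Even in this flat form, the per-document set copy and membership bookkeeping illustrate the extra work in the indexing hot loop.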

@jpountz (Contributor) commented Nov 5, 2021

One benefit of enforcing in the mappings that I can think of is that we might be able to speed up indexing for single-valued fields by using NumericDocValuesField instead of SortedNumericDocValuesField and SortedDocValuesField instead of SortedSetDocValuesField. It might also make it easier to reuse Field instances, which is another factor that helps speed up indexing. I'm unsure about the sort of speedup that we should expect though.

Costs of deriving field-caps information are higher when you have to poke around all segments of all indices to reverse-engineer actual cardinality info

FWIW we already have such logic when an index_filter is provided, since we need to examine all segments to know if any of them intersects with the query.

(perhaps even more so on frozen tier)

Index statistics are loaded in memory when opening a shard, so it wouldn't be slower on the frozen tier. However, nodes on the frozen tier are denser, so they would have more shards to check than hot nodes.

The worry I have about the "enforce-via-mapping" approach is adoption:

  • it wouldn't work on indices created before this feature is introduced,
  • if you forgot to set it at index creation time, you cannot add it later without reindexing,
  • some users (e.g. Logging use-cases) might be nervous about using this feature because they'd rather index documents that are multi-valued even though they shouldn't than reject them.

Furthermore the "enforce-via-mapping" approach brings unique challenges too. For instance if you have 100 data streams and one of them doesn't set the allow_multiple_values flag on the agent.id field even though it is single-valued, then calling field caps on an index pattern that matches these 100 data streams will report agent.id as being multi-valued.

To me the strongest argument against the "report-on-index-state" approach is the performance overhead. Maybe we could build a prototype and try to evaluate the performance impact to see how bad it actually is?

@markharwood (Contributor) commented Nov 8, 2021

we might be able to speed up indexing for single-valued fields by using NumericDocValuesField instead of SortedNumericDocValuesField and SortedDocValuesField instead of SortedSetDocValuesField

I tried this out for keyword and date fields for a couple of smallish datasets but didn't see any noticeable uplift to indexing speeds. Haven't tried recycling Field instances yet.
Any disk savings from switching to single-valued for keyword and date fields were marginal.

markharwood added a commit to markharwood/elasticsearch that referenced this issue Jan 4, 2022
Inspects index contents in some cases to reveal if all docs hold single values or not for a field.

Relates to elastic#58523
@dakrone (Member) commented Apr 18, 2023

Was this ever decided on? It seems like it would be a useful feature.

@ruflin (Contributor) commented Apr 19, 2023

This popped up again in the context of the routing processor that has now landed: #76511. As soon as a value is multi-valued, we might not be able to route the document. At the same time, the fields that are likely to be used for routing should never be multi-valued in the first place. It would be nice to even enforce this.

Our need for a singleton field is not driven by performance improvements but rather by strictness on the data. With #95329 we become more lenient on what data we accept for logs-*-*. It means that even for a singleton field, offending data would still be accepted, just not indexed.

Could we start this effort by having the option on keyword fields? It seems @markharwood made some tests quite some time ago on this which means the code was changed for it. Does anyone know if this code is still around somewhere?

Playing around with data streams and trying to ingest an array for @timestamp I get the following error: "reason": "[11:1] failed to parse: data stream timestamp field [@timestamp] encountered multiple values". Seems like we already have a singleton field here (which is good).

@abatsakis (Contributor):

We can discuss the feature independently, but why don't we want a routing field to ever be multi-valued? Also, are you going to force the rerouting processor to work only on fields with the singleton entry?

@ruflin (Contributor) commented Apr 20, 2023

The discussion on the reroute processor happened here: #76511 (comment). For now, routing only works on fields with a singleton entry and otherwise throws an exception. The problem with multiple values is: which value do we pick for routing? @felixbarny

@javanna (Member) commented Apr 20, 2023

This has been getting some traction again recently; I discussed it with @nik9000 in the context of esql. We discussed it again today within the Search team and agreed it's something we would like to do once some of us have time.

@abatsakis (Contributor):

@ruflin the answer to your question is... it doesn't matter :) If a user doesn't explicitly handle multi-valued fields in their routing processors, then the data are most likely bogus, so you can do whatever you want with them, e.g. route them to the error store explicitly, drop them, or pick one value randomly.

This is something that can happen anyway because someone may try to route a field that is not designated as singleton, unless you will require routing rules to be applied only to those fields (which I don't think you want to). Same for ESQL.

In other words, I don't see good reasons for this feature. Maybe I am missing something though, so before you intend to pick it up, let's talk :)

@elasticsearchmachine (Collaborator):

Pinging @elastic/es-search (Team:Search)

@javanna (Member) commented May 1, 2023

Some clarifications:

I get the following error: "reason": "[11:1] failed to parse: data stream timestamp field [@timestamp] encountered multiple values". Seems like we already have a singleton field here (which is good).

Yes, we perform this check specifically for the @timestamp field. Quoting the description of this issue: "Datastreams have a first-class concept of a timestamp field. Each document in the datastream must contain exactly one value for the designated timestamp field, so that we know where to route the document when partitioning by time." This issue was originally about generalizing that check and potentially applying it to other fields. For a big part of the discussion, though, we actually discussed a different problem: adding a flag that specifies whether a field is allowed to have multiple values, which does not mean that each document must have a value for it.

While we are currently discussing the possibility of adding the allow_multiple_values flag, which is covered by #80825, we have no concrete plans to add the singleton flag. We acknowledged that it may help, but we don't have much data or many concrete scenarios where it would, which is pretty much why we have not implemented it so far: it has not been a priority. For this reason I am closing this issue; we may reopen it if we collect feedback on concrete scenarios where adding a singleton flag would help.

@javanna closed this as not planned, May 1, 2023
@javanna added the Team:Search Foundations label and removed the Team:Search label, Jul 16, 2024