Add 'singleton' flag to field mappers? #58523
Pinging @elastic/es-search (:Search/Mapping) |
I'd like to clarify one part:
I assume this sentence applies only to the field where However the way this is currently worded makes it sound like no other date fields would be allowed (e.g. an event with Stepping out of the timestamp example, I like this feature in general, not just for datastreams. Elasticsearch has long been flexible with fields containing either a single value or an array. However since ECS came out, we've gradually been clarifying which ECS fields are expected to contain an array of values, in order to make the event format more predictable for consumers of the data. This new addition would help clarify 3 acceptable formats about a field:
|
Your assumption is right, I've updated the description to clarify the wording. |
If I understand it correctly, |
Not exactly, because documents are allowed to not specify a value for |
We discussed this in the search meeting and agree that it would be a useful addition to at least some of the field mappers. We think that |
From the start I've interpreted this as a way to ensure a field would not contain an array of values. E.g. preventing the following:
{ "@timestamp": [ "1597425324", "1597425325" ] }
And enforcing that a field has a single value, e.g.:
{ "@timestamp": "1597425324" }
My understanding is that this feature is unrelated to whether or not a field has additional multi-fields. |
I think @romseygeek meant
Or is the requirement with |
@romseygeek @mayya-sharipova the |
We discussed again as a team and agreed that this could be a useful feature. But we didn't see a strong immediate need for it yet, and will leave the issue open to gather more feedback on use cases and priority. Other topics from our discussion:
|
Absolutely. I expect that in practice the majority of Elasticsearch fields are single-value fields, but always allowing for multiple values has created a poor user experience in Kibana. Kibana typically offers a search box and shows visualizations, e.g. bar charts of operating systems used by website visitors. Clicking on the bars creates "filter pills" which are always ANDed. Kibana could offer users much more sensible drill-down options if it knew that a field were a single-value field, e.g. allowing multiple selections from a bar/pie chart and ORing these selections from the same field rather than ANDing them. This would be a great step forward in Kibana usability. |
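To make the ANDing vs. ORing point concrete, here is a minimal sketch of how a client might combine chart selections. It is not Kibana's actual code: the `build_filter` helper, the `single_valued` argument and the `os` field name are assumptions made up for illustration. Only the `bool`/`filter`, `term` and `terms` query shapes come from the Elasticsearch query DSL.

```python
from typing import Any


def build_filter(field: str, selected: list[str], single_valued: bool) -> dict[str, Any]:
    """Combine chart selections on one field into a bool filter (hypothetical helper)."""
    if single_valued:
        # A single-valued field holds at most one of the selected values per
        # document, so ANDing different values can never match. A `terms`
        # query ORs the selections instead.
        return {"bool": {"filter": [{"terms": {field: selected}}]}}
    # Behaviour described above: one filter pill per selection, all ANDed.
    return {"bool": {"filter": [{"term": {field: value}} for value in selected]}}


or_query = build_filter("os", ["linux", "windows"], single_valued=True)
and_query = build_filter("os", ["linux", "windows"], single_valued=False)
```

With `single_valued=True` the two selections end up in one `terms` clause; otherwise each selection becomes its own `term` clause and a document must match all of them.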
Make it a dynamic setting?
If Kibana starts to make UX improvements based on using this flag I expect people will want to "tighten-up" their mapping definitions for existing data without reindexing. This means the setting would ideally be dynamic, i.e. can be redeclared on existing index mappings. When changing this setting, validating all existing docs are single-valued could be too expensive/difficult to implement. If unvalidated this could leave us in a state where the mappings declare fields are single-valued (and future docs will be validated as such) but there could still be some historical docs that have multi-values.
|
Dynamic mappings behaviour?
Another consideration is what would happen with dynamic mappings. It would certainly be convenient if the common case (single value fields) were detected and mapped as single values automatically. However, doing so means that this sequence of docs would create an error:
This would be a breaking change to existing behaviour. |
@jimczi suggest we explicitly expand the scope of this issue to cover both the mandatory single-value case (like timestamp) and the case where multiple values are permitted. I propose a new
So the default is our current behaviour for fields : no expected max or min. We'd also need to pick a format for presenting the params:
Is there a precedent somewhere we should follow for consistency's sake? |
When we discussed the |
@markharwood agreed on dealing with the potential field caps API addition in a different issue. I don't think QL has any use for a mapping-level flag, as we do not look at the indices' mappings. |
@astefan I don't expect you would need to look at mappings to benefit from this. We'd surface the mapping-level flags in field caps summaries. When it comes to cardinalities I'm leaning towards the policy of enforce-via-mapping rather than report-on-index-state because:
|
When it comes to enforcement I think enforcing a maximum cardinality will be much cheaper/more viable than enforcing a minimum cardinality.
I'm concerned that this min_cardinality checking would add overhead to one of the hottest loops - the indexing logic. |
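The cost difference between the two checks can be sketched as follows. This is an illustration only, not Elasticsearch's parsing code: `check_document`, `CardinalityError` and the `max_card`/`min_card` dictionaries are hypothetical names. The point it shows is that a maximum can be checked against the one field just parsed, while a minimum needs an extra pass over every constrained field once the document is complete, including fields the document never mentioned.

```python
from collections import Counter


class CardinalityError(ValueError):
    """Raised when a document violates a (hypothetical) cardinality limit."""


def check_document(values, max_card, min_card):
    """values: (field, value) pairs in parse order; limits: field -> int."""
    counts = Counter()
    for field, _value in values:
        counts[field] += 1
        # Maximum check: only looks at the field that was just seen.
        limit = max_card.get(field)
        if limit is not None and counts[field] > limit:
            raise CardinalityError(f"[{field}] has more than {limit} value(s)")
    # Minimum check: must visit every constrained field after parsing,
    # even those that never appeared in the document.
    for field, minimum in min_card.items():
        if counts[field] < minimum:
            raise CardinalityError(f"[{field}] has fewer than {minimum} value(s)")
    return counts


ok = check_document([("@timestamp", "1597425324")], {"@timestamp": 1}, {"@timestamp": 1})
```

The second loop is the overhead in question: it runs once per document for all fields that declare a minimum, regardless of what the document contains.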
One benefit of enforcing in the mappings that I can think of is that we might be able to speed up indexing for single-valued fields by using
FWIW we already have such logic when an
Index statistics are loaded in memory when opening a shard, so it wouldn't be slower on the frozen tier. However nodes on the frozen tier are denser, so they would have more shards to check than hot nodes. The worry I have about the "enforce-via-mapping" approach is adoption:
Furthermore the "enforce-via-mapping" approach brings unique challenges too. For instance if you have 100 data streams and one of them doesn't set the To me the strongest argument against the "report-on-index-state" approach is the performance overhead. Maybe we could build a prototype and try to evaluate the performance impact to see how bad it actually is? |
I tried this out for |
Inspects index contents in some cases to reveal if all docs hold single values or not for a field. Relates to elastic#58523
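As a rough sketch of the "report-on-index-state" idea: if each shard can report, per field, how many documents have a value and how many values were indexed in total, then a field looks single-valued when the two counts match on every shard. The `FieldStats` class and both function names below are hypothetical, and which statistics are actually available for a given field type is an assumption not confirmed in this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldStats:
    """Hypothetical per-shard statistics for one field."""
    doc_count: int    # documents with at least one value for the field
    value_count: int  # total number of values indexed for the field


def shard_is_single_valued(stats: FieldStats) -> bool:
    # Every document with the field contributes at least one value, so equal
    # counts mean no document contributed more than one.
    return stats.value_count == stats.doc_count


def field_is_single_valued(per_shard: list[FieldStats]) -> bool:
    # One multi-valued document on any shard is enough to answer "no".
    # An empty list (no shards inspected) trivially reports True.
    return all(shard_is_single_valued(s) for s in per_shard)


single = field_is_single_valued([FieldStats(10, 10), FieldStats(5, 5)])
multi = field_is_single_valued([FieldStats(10, 10), FieldStats(5, 7)])
```

Because the answer is derived from existing data rather than declared in mappings, it can change as soon as one multi-valued document is indexed, which is the trade-off against the "enforce-via-mapping" approach discussed above.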
Was this ever decided on? It seems like it would be a useful feature. |
This popped up again in the context of the routing processor that has now landed: #76511 As soon as a field is multi-valued, we might not be able to route the document. At the same time the fields that are likely to be used for routing should never be multi-valued in the first place. It would be nice to even enforce this. Our need for a singleton field is not driven by any performance improvements but rather by strictness on the data. With #95329 we become more lenient on what data we accept for Could we start this effort by having the option on keyword fields? It seems @markharwood made some tests quite some time ago on this which means the code was changed for it. Does anyone know if this code is still around somewhere? Playing around with data streams and trying to ingest an array for |
We can discuss the feature independently, but why don't we want a routing field to ever be multi-valued? Also, are you gonna force the rerouting processor to work only on fields with the singleton entry? |
The discussion on the reroute processor happened here: #76511 (comment) For now, routing only works on fields with a singleton entry and otherwise throws an exception. The problem with multi-valued fields is: which value do we pick for routing? @felixbarny |
This has been getting some traction again recently; I discussed it with @nik9000 as well in the context of ESQL. We discussed it again today within the Search team and we said it's a thing that we would like to do once one of us has time. |
@ruflin the answer to your question is... it doesn't matter :) If a user doesn't explicitly handle multi-valued fields in their routing processors, then the data are most likely bogus so you can do whatever you want with them, e.g. route them to the error store explicitly, drop them, pick one value randomly or whatever. This is something that can happen anyway because someone may try to route a field that is not designated as singleton, unless you will require routing rules to be applied only to those fields (which I don't think you want to). Same for ESQL. In other words, I don't see good reasons for this feature. Maybe I am missing something though, so before you intend to pick it up, let's talk :) |
Pinging @elastic/es-search (Team:Search) |
Some clarifications:
Yes, we perform this check specifically for the While we are currently discussing the possibility of adding the |
Datastreams have a first-class concept of a timestamp field. Each document in the datastream must contain exactly one value for the designated timestamp field, so that we know where to route the document when partitioning by time. In #58582 we're adding index-time validation for this requirement. The current implementation is very narrow in scope and adds special document parsing logic just for datastream timestamps.
I was wondering if this validation could be useful more generally. We could add a 'singleton' flag to field mappers -- when it is set, the mapper will verify that it encounters one (and only one) value in each document:
Such an option could be helpful for modeling fields that an application requires to have exactly one value, like identifiers, timestamps, content types, etc.
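The check described above ("one (and only one) value in each document") could look roughly like the sketch below. It is written in Python for brevity and is not the mapper implementation; `count_values` and `validate_singletons` are hypothetical names. It assumes that nested arrays are flattened and that `null` counts as no value.

```python
def count_values(value) -> int:
    """Count the values a JSON field contributes to a document."""
    if value is None:
        return 0  # assumption: null is treated as "no value"
    if isinstance(value, list):
        # assumption: nested arrays are flattened before counting
        return sum(count_values(v) for v in value)
    return 1


def validate_singletons(doc: dict, singleton_fields: list[str]) -> None:
    """Reject a document unless each singleton field has exactly one value."""
    for field in singleton_fields:
        n = count_values(doc.get(field))
        if n != 1:
            raise ValueError(f"field [{field}] must have exactly one value, found {n}")


validate_singletons({"@timestamp": "1597425324"}, ["@timestamp"])  # accepted
```

A document such as `{"@timestamp": ["1597425324", "1597425325"]}` or one that omits `@timestamp` entirely would raise `ValueError` under this sketch.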