Add option to return ignored values as part of response #74121
Comments
Pinging @elastic/es-search (Team:Search) |
I started to look into this a bit and think the current behaviour is consistent with what we also do with e.g. numeric or date values that we reject because of an |
We had a first discussion today to understand the needs better. |
We've discussed the desired Kibana behavior with our product managers and came to the following conclusion (see also elastic/kibana#101232 (comment)): We want to show ignored values (e.g. because they are above ignore_above) in the Discover table in columns and expanded documents directly. We'd additionally want to add a warning icon with a tooltip beside them explaining that those values are not indexed, and thus cannot be searched through. No additional user interaction should be required to show those values (e.g. by loading them async, or referring to the _source view we have). I think the best way we can achieve that technically is if the fields API (or the high level query) would have a The only alternative solution we'd have at the moment: Request
Thus I think, given the discussion we had, adding that flag and a separated part of the response to return the ignored values might be the best (and most performant) solution we can end up with. Please let us know, if that's the way we can continue. |
@giladgal @qhoxie just chiming in here, this is an interesting case, and it wasn't immediately obvious to me when we did the fields effort that we'd "lose something" in terms of what we display in discover. If we want |
The discussion started from a situation in which the event.original field that is not supposed to be searched and is not indexed is configured with The typical case is that the field will be mapped as keyword and text (if it is a lengthy textual field) and in this case the text field will be presented. In the rare case that the field was mapped as keyword and there is a lengthy text doc, the fields API provides indication that the data was not indexed and Kibana can make a second call to present the _source. It is a change in how we handle an error in the mapping or in the data (or both) rather than a regression. We can look to include the field's value from _source in the response from the fields API under these conditions. It's not something we can do for the next minor but we can do that for a future release so that there's no need for a second call from Kibana. |
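To make the "second call" option concrete, here is a minimal sketch of what a client could do today. It assumes each hit lists ignored field names under `_ignored` and that `fetch_source` is a hypothetical callback standing in for the extra request that loads a document's _source; neither name comes from this thread, and only top-level field names are handled.

```python
from typing import Any, Callable, Dict, List

Hit = Dict[str, Any]


def ignored_requested_fields(hit: Hit, requested: List[str]) -> List[str]:
    """Return the requested field names that the hit reports as ignored."""
    ignored = set(hit.get("_ignored", []))
    return [name for name in requested if name in ignored]


def values_for_display(
    hit: Hit,
    requested: List[str],
    fetch_source: Callable[[str], Dict[str, Any]],  # hypothetical second call
) -> Dict[str, List[Any]]:
    """Use `fields` values where available and fall back to _source otherwise."""
    fields = hit.get("fields", {})
    values = {name: list(fields.get(name, [])) for name in requested}
    missing = ignored_requested_fields(hit, requested)
    if missing:
        # Extra round trip, only for hits that actually have ignored fields.
        source = fetch_source(hit["_id"])
        for name in missing:
            raw = source.get(name)  # top-level keys only in this sketch
            if raw is not None:
                values[name] = raw if isinstance(raw, list) else [raw]
    return values
```

The cost of this approach is the extra request per affected hit, which is what returning the values directly from the fields API would avoid.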
Thanks @tomcallahan and @giladgal. That sounds like a good option. |
Yeah, I should have put that more explicitly, sorry: This is for sure not a regression from the ES side, I'd say, but an enhancement to a new API. It is, however, a regression from the Discover point of view for our users. |
I would not call it a regression to be honest. Even from a Discover angle this behavior is a conscious choice that we've made to protect users from poisoned values. It is useful at indexing time, to avoid giant terms that almost certainly contain buggy values, but also at retrieval time since Kibana should not consider these buggy values as searchable or aggregatable. I'd also like to second what Gilad said. The discussion started from a mapping error in ECS. That's the main issue imo and I am happy that this behavior in Kibana helped us to discover the problem quickly. We have a planned discussion with the ECS team that I hope won't require any change in Kibana or Elasticsearch. It is a mapping issue on a managed schema. |
It seems quite clear from the SDH issues we are seeing that users see this as a regression in Discover - what used to work for them, does not work anymore - so Kibana team is treating this as a regression.
I think the issue is that App users view ES as a datastore and expect to be able to retrieve all the data they put in. I think if we want to protect users from themselves we need to find ways to do it without preventing them from retrieving the data. +1 to Gilad's proposal to include the field's value from _source in the response from the fields API under these conditions in a future release so that there's no need for a second call from Kibana. |
I'd second that from a UX perspective (as indicated by SDHs) this feels like a regression to our users. The mapping should certainly be fixed in ECS (is there an issue open for that?) - is there more than one place where we are mapping |
Ok let's not focus too much on the regression vs feature debate. I agree with you that we have an issue here and we need to resolve it.
My point is that the issue stemmed from a managed schema (ECS) so the user in this case didn't configure the schema explicitly. We had a chat with the ECS team and we are now investigating a bug in the generation of the final mapping in Beats.:
I am not against this proposal but I'd prefer to discuss it in the context of the UX that we want. Not as a workaround to restore a buggy behavior. |
Yeah I agree. That's why I tried to outline the UX that we discussed with Product and agreed we want to deliver in Discover. See this comment for more details. Basically we agreed that the UX we want to deliver to users is to show all values (also ignored ones) by default (in Discover, I don't want to speak for any other app here, since I believe this might be different on a case by case basis). And additionally show warnings when those values are actually ignored (and thus not searchable). Ideally those warnings are also as detailed as they can be, though to be honest I think for the beginning it's fine if we just have a generic "this value was not indexed" warning on those values, which would be achievable with the API discussed here, given that we'd show this warning beside every value we pull out from
++ to that. I think that
Yes, we can leverage that, but we're still lacking the part of showing the actual values of those fields (the discussed API here). Though I think if we'd have that |
This point is a good one, I wouldn't want to lose the type safety of the fields option. Another thing I'm wondering is whether we can avoid introducing a new option for this: could we always return ignored values in the response on a best-effort basis, i.e. when |
@timroes and I had a discussion on this today and we came away with the following proposal: A new boolean parameter should be introduced called something like
There are several things to note about this format:
Some (or nearly all) of the "bad data" in results might be good
With multi-value fields/docs there may be good and bad examples of values returned. The elasticsearch server won't try to reverse-engineer which of a field's values were problematic - it will just list all values as they were found in _source. The
No nested doc support
Being a flat map of field->values we cannot represent any of the structure of nested objects like we do with the main
No runtime field support
A malformed value in a runtime field is not something we can return because we have no memory or insight into what a runtime field script may have been working with when it failed. The existing The existing
Could also be written as:
This object structure also gives us a place to hold other future settings, which brings me on to the next topic:
We need to avoid surfacing abusive content
If we are not careful the |
@timroes another oddity to look out for - the field names used in the |
The proposed additions to the response body look good to me.
I would like to see if we can avoid introducing a new parameter and always return ignored field values for requested field names? |
Is that not just surfacing abusive content that stricter databases would just reject completely at ingest time? I worry about the massive-doc scenario so would prefer to see this as opt-in with some added safeguards e.g. truncate settings. |
I'm not concerned about this. |
True - I just didn't want to add to the list.
Also true. Some background that may be of interest based on an earlier discussion with Tim: This Discover problem is just a symptom of a bigger issue of bad indexing choices. They often didn't map as
People struggle with picking up parts of content like that above and knowing if they need to make it a phrase query or wildcard because they need to know how it's tokenized in the index. They didn't map as
Tim and I discussed again the long term goal of some adaptive indexing strategy which made its own decisions about keyword vs wildcard field indexing strategies as query loads and content grows. Possibly a bit too "magical" but without such a thing we're stuck in this limbo of users not being able to make great indexing decisions or search. |
I've tried to implement this with the most efficiently packaged results because I'm concerned about repetition of noisy content. There's still potential for results amplification.
and this doc:
Then a search with
We have this less-than ideal result:
The results carry a fair amount of repetition. Note, I have taken efforts to remove "OK" content from arrays of an ignored field's values. If there's an exact match between a retrievable doc value e.g. The use of two multifields (keyword5 and keyword10) is perhaps a little contrived in my example but I wanted to test the behaviour and even with one multi-field defined in mappings it's worth noting how verbose a simple |
I'm not concerned by the repetitions, this is a feature of the fields API: it removes the need for the consumer of the response to know how the field is mapped, it can look at fields in isolation. |
@markharwood I wonder if we could reconsider removing the repetition. Like @jpountz, I am not worried about repetition at either the API response or the network traffic level. The removal itself, though, might cause us problems when we then try to reconstruct the proper values. With this removal we basically need to merge the
By merging those together we would now present it as if that field had 4 values, while it only had 3. I think this is a rather nasty side-effect, and avoiding the additional "repetitive" elements we'd return otherwise would not outweigh it. |
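A small sketch of how that overcounting can happen, since the original example did not survive here. Everything in it is hypothetical: it assumes the server drops exact matches from the ignored list, and that a retrievable value can differ in form from its _source value (for example after lowercasing), so the exact match fails.

```python
from typing import List


def strip_retrievable(source_values: List[str], doc_values: List[str]) -> List[str]:
    """Drop _source values that exactly match a value already returned in `fields`."""
    retrievable = set(doc_values)
    return [value for value in source_values if value not in retrievable]


def merge_for_display(doc_values: List[str], ignored_values: List[str]) -> List[str]:
    """What a client has to do once the server has removed 'OK' values."""
    return list(doc_values) + list(ignored_values)


# Hypothetical document: three values in _source, the last one too long to index.
source_values = ["Foo", "bar", "a-value-that-is-too-long"]
# Assumed retrievable values; "Foo" comes back in a different form than in _source.
doc_values = ["foo", "bar"]

ignored_values = strip_retrievable(source_values, doc_values)
merged = merge_for_display(doc_values, ignored_values)
```

Here `merged` ends up with four entries for a field that only had three values, because "Foo" and "foo" are both kept. If the ignored section simply carried all three _source values, the client could show those as-is without merging.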
I can take out the logic that removes what it thinks are non-ignored values. It may be imperfect at spotting ignored values correctly but the two main arguments for keeping it are:
|
So we're basically in the pickle of either going for a better API or making sure the data in Discover will be more accurate. I would be in favor of the more accurate data in Discover and therefore including all of _source in |
…of search request Closes elastic#74121
Since Kibana's Discover switched to retrieving values via the fields API rather than source there have been gaps in the display caused by "ignored" fields (those that fall foul of ignore_above and ignore_malformed size and formatting rules). This PR returns ignored values from source when a user-requested field fails to be parsed for a document. In these cases the corresponding hit adds a new ignored_field_values section in the response. Closes elastic#74121
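As a rough illustration of how a client such as Discover might consume this, the sketch below combines a hit's `fields` section with its `ignored_field_values` section and flags the ignored values so a warning can be shown beside them. It assumes both sections are flat maps of field name to a list of values, as described earlier in the thread; the function name and the cell structure are made up for the example.

```python
from typing import Any, Dict, List

Cell = Dict[str, Any]


def cells_for_hit(hit: Dict[str, Any]) -> Dict[str, List[Cell]]:
    """Build per-field display cells, marking values that were not indexed."""
    cells: Dict[str, List[Cell]] = {}
    for name, values in hit.get("fields", {}).items():
        cells.setdefault(name, []).extend(
            {"value": value, "ignored": False} for value in values
        )
    # Assumed shape: {"field_name": [value, ...]} on the same hit.
    for name, values in hit.get("ignored_field_values", {}).items():
        cells.setdefault(name, []).extend(
            {"value": value, "ignored": True} for value in values
        )
    return cells
```

A renderer could then show every cell and attach the generic "this value was not indexed" warning wherever `ignored` is true, with no second request.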
BWC change following backport of PR 78697 to 7.x Closes #74121
If you create a mapping containing an ignore_above field and ingest values that are above that limit, those values are not returned via the fields API. Since we're now using the fields API in Discover, this is a rather severe regression. From my experience, people often have fields containing values above those limits, which were available and visible to users when reading from _source. With the new fields API, those values simply come back empty and no longer show in Discover.
It would be good if we could include those values in the response (whether behind a request flag, similar to include_unmapped, or by default).
Related Kibana issue: elastic/kibana#101232
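For anyone trying to reproduce this, a minimal sketch of the setup follows. The field name `message` and the limit of 5 are made up, and `simulate_fields_values` is only a local stand-in that approximates the length check; it does not call Elasticsearch.

```python
from typing import List

# Hypothetical mapping: a keyword field that skips values longer than 5 characters.
mapping = {
    "mappings": {
        "properties": {
            "message": {"type": "keyword", "ignore_above": 5},
        }
    }
}

# Search body that retrieves values through the fields API instead of _source.
search_body = {"_source": False, "fields": ["message"]}


def simulate_fields_values(values: List[str], ignore_above: int) -> List[str]:
    """Approximate which keyword values would still be returned.

    Values longer than ignore_above are skipped; this uses Python's len()
    as a rough character count.
    """
    return [value for value in values if len(value) <= ignore_above]
```

With these assumptions, a document whose `message` is longer than 5 characters would come back with no `message` entry in `fields`, which is the gap described above.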