
Add option to return ignored values as part of response #74121

Closed
timroes opened this issue Jun 15, 2021 · 28 comments · Fixed by #78697
Labels
:Search/Search Search-related issues that do not fall into other categories Team:Search Meta label for search team

Comments

@timroes
Contributor

timroes commented Jun 15, 2021

If you create a mapping containing an ignore_above setting on a field and ingest values that are above that limit, those values are not returned via the fields API. Since Discover now uses the fields API, this is a rather severe regression we found. In my experience, people often have fields containing values above those limits, which were available and visible to users when reading from _source. With the new fields API, those fields simply come back empty and no longer show in Discover.

It would be good if we could include those values in the response, whether behind a request flag (similar to include_unmapped) or by default.

Related Kibana issue: elastic/kibana#101232
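
For illustration, here is a minimal reproduction of the behaviour described above (index name, field name and values are made up for the example):

PUT my-index
{
  "mappings": {
    "properties": {
      "message": { "type": "keyword", "ignore_above": 20 }
    }
  }
}

PUT my-index/_doc/1?refresh
{
  "message": "a value that is well above the twenty character limit"
}

POST my-index/_search
{
  "_source": false,
  "fields": [ "message" ]
}

The resulting hit has no "message" entry under fields, because the value exceeded ignore_above and was never indexed; only the _ignored metadata field records that the field was dropped, which is why the column shows up empty in Discover.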

@timroes timroes added >bug needs:triage Requires assignment of a team area label labels Jun 15, 2021
@DaveCTurner DaveCTurner added :Search/Search Search-related issues that do not fall into other categories and removed needs:triage Requires assignment of a team area label labels Jun 15, 2021
@elasticmachine elasticmachine added the Team:Search Meta label for search team label Jun 15, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-search (Team:Search)

@cbuescher
Member

I started to look into this a bit and think the current behaviour is consistent with what we also do for e.g. numeric or date values that we reject because of an ignore_malformed setting, in as much as we don't return values in fields if they are not indexed and cannot be aggregated over. I think this behaviour was intended, but I'll mark this for discussion with the team to see if we can add an option to retrieve these values from source.
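
For comparison, a minimal sketch of the ignore_malformed case mentioned above (index name, field name and value are hypothetical):

PUT my-index
{
  "mappings": {
    "properties": {
      "count": { "type": "integer", "ignore_malformed": true }
    }
  }
}

PUT my-index/_doc/1?refresh
{
  "count": "not-a-number"
}

The document is accepted, but the malformed value is not indexed, cannot be aggregated over, and is omitted from the fields section of a search hit; it is only listed by name in _ignored.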

@cbuescher
Member

We had a first discussion today to understand the needs better.
One of the things we already have is the "_ignored" field in the search hit that, if present, records fields that have not been indexed due to being malformed while having the ignore_malformed option set. We currently don't include keyword fields that we don't index because of ignore_above, but we will investigate if it is possible to do that.
With that in place Kibana Discover could display whether fields are missing and offer ways of going to the "source" view in these cases.
We also discussed that mixing these "non-indexed" values into the "fields" output is something we would not like to do.
An option to consider might be to add an (optional) additional section to the fields output where we include values from fields in "_ignored" that we then get directly from source; however, we are not fully convinced that this is the best way to go.
I will open separate issues to discuss and track whether we should add keyword fields that exceed ignore_above to the _ignored output and whether we should include this metadata field in the fields API output.
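
For reference, the existing _ignored metadata field only lists the names of fields that were dropped, not their values, so a hit today might look like this (index and field names are illustrative):

{
  "_index": "my-index",
  "_id": "1",
  "_score": 1.0,
  "_ignored": [ "message" ]
}

Discover could use this to show a warning, but the actual value would still have to come from _source.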

@timroes
Contributor Author

timroes commented Jun 22, 2021

We've discussed the desired Kibana behavior with our product managers and came to the following conclusion (see also elastic/kibana#101232 (comment)): We want to show ignored values (e.g. ones above ignore_above) directly in the Discover table, both in columns and in expanded documents. We'd additionally want to add a warning icon with a tooltip beside them explaining that those values are not indexed and thus cannot be searched. There should be no additional user interaction required to show those values (e.g. loading them asynchronously, or referring to the _source view we have).

I think the best way we can achieve that technically is if the fields API (or the high-level query) had an include_ignored option that we could specify to make ES return all ignored values in the response - as a separate part of the response, not mixed in with fields and thus potentially polluting type-purity. That way we can merge them client-side and know which values were ignored (in case a field contained an array of values).

The only alternative solution we'd have at the moment is to request _source and fields at the same time, merge them client-side, and figure out which values are in _source but not in fields. This would cause multiple problems, e.g.:

  • Requesting _source and fields roughly doubles the size of our response, causing us to run into problems with overly large responses (which we know we already hit sometimes today)
  • Discover would be significantly less performant, since we'd need to compare _source and fields for each field of each document to figure out which _source values are missing from fields (and thus were ignored)
  • It would cause weird side effects, e.g. in the 64-bit integer case, where a value in fields might be rounded by JavaScript, but the same value in _source, if present as a string, might not be rounded; we then couldn't detect that the two are equal and would present both values, leading users to believe their document had multiple values in that field when it only had one.

Thus I think, given the discussion we had, adding that flag and a separate part of the response to return the ignored values might be the best (and most performant) solution we can end up with. Please let us know if that's the way we can continue.
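
To make the idea concrete, a rough sketch of how such a request and response part could look; include_ignored and ignored_values are placeholder names from this discussion, not an existing API:

POST my-index/_search
{
  "_source": false,
  "fields": [ "message" ],
  "include_ignored": true
}

"hits": [
  {
    "fields": {
      "message": [ "a short, indexed value" ]
    },
    "ignored_values": {
      "message": [ "a very long value that exceeded ignore_above" ]
    }
  }
]

The separate section keeps fields type-safe while still letting Kibana merge the two client-side and mark the ignored values with a warning.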

cc @rayafratkina @VijayDoshi

@timroes timroes changed the title Fields option in _search API does not returned fields above ignore_above Add option to return ignored values Jun 22, 2021
@timroes timroes changed the title Add option to return ignored values Add option to return ignored values as part of response Jun 22, 2021
@timroes timroes removed the >bug label Jun 22, 2021
@tomcallahan
Contributor

@giladgal @qhoxie just chiming in here, this is an interesting case, and it wasn't immediately obvious to me when we did the fields effort that we'd "lose something" in terms of what we display in Discover. If we want fields to be the "best way" for Kibana and the solutions to retrieve data from ES, it seems we should permit retrieving all the data in Elasticsearch (even if it is not indexed/searchable). Kibana going back to using _source feels like a bad outcome.

@giladgal
Contributor

The discussion started from a situation in which the event.original field, which is not supposed to be searched and is not indexed, is configured with "ignore_above": 1024 (see #101232). This seems to be a mapping error, since the correct mapping would be to have index false and doc_values false.

The typical case is that the field will be mapped as keyword and text (if it is a lengthy textual field), and in this case the text field will be presented. In the rare case that the field was mapped only as keyword and there is a lengthy text doc, the fields API provides an indication that the data was not indexed and Kibana can make a second call to present the _source. It is a change in how we handle an error in the mapping or in the data (or both) rather than a regression.

We can look to include the field's value from _source in the response from the fields API under these conditions. It's not something we can do for the next minor, but we can do it for a future release so that there's no need for a second call from Kibana.
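
For reference, a sketch of the mapping correction being described for a field like event.original that should be stored but neither searched nor aggregated (not necessarily the exact ECS definition):

"event": {
  "properties": {
    "original": {
      "type": "keyword",
      "index": false,
      "doc_values": false
    }
  }
}

With indexing and doc values disabled, the field is purely stored in _source, which is what a field like event.original is meant for.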

@qhoxie
Contributor

qhoxie commented Jun 23, 2021

Thanks @tomcallahan and @giladgal. That sounds like a good option.

@timroes
Contributor Author

timroes commented Jun 24, 2021

It is a change in how we handle an error in the mapping or in the data (or both) rather than a regression.

Yeah, I should have put that more explicitly, sorry: this is for sure not a regression on the ES side, I'd say, but an enhancement to a new API. From the Discover point of view, though, it is a regression for our users.

@jimczi
Contributor

jimczi commented Jun 28, 2021

I would not call it a regression to be honest. Even from a Discover angle this behavior is a conscious choice that we've made to protect users from poisoned values. It is useful at indexing time, to avoid giant terms that almost certainly contain buggy values, but also at retrieval time, since Kibana should not consider these buggy values as searchable or aggregatable.
In other words, values that are ignored in this context should be seen as invalid. They don't fit in the context where they are defined.
We can make these values more discoverable, but then I'd call it an improvement.

I'd also like to second what Gilad said. The discussion started from a mapping error in ECS. That's the main issue imo, and I am happy that this behavior in Kibana helped us discover the problem quickly. We have a planned discussion with the ECS team that I hope won't require any change in Kibana or Elasticsearch. It is a mapping issue on a managed schema.

@rayafratkina

rayafratkina commented Jun 28, 2021

I would not call it a regression to be honest.

It seems quite clear from the SDH issues we are seeing that users see this as a regression in Discover - what used to work for them does not work anymore - so the Kibana team is treating this as a regression.

It is useful at indexing time, to avoid giant terms that certainly contains buggy values, but also at retrieval time since Kibana should not consider these buggy values as searchable or aggregatable.

I think the issue is that App users view ES as a datastore and expect to be able to retrieve all the data they put in. I think if we want to protect users from themselves we need to find ways to do it without preventing them from retrieving the data.
Maybe we should be looking at the mapping options we provide so it's harder for users to paint themselves into a corner?
Just to reiterate - regardless of any mapping enhancements we decide on, we need to provide a path forward to the users who have these configurations today.

+1 to Gilad's proposal to include the field's value from _source in the response from the fields API under these conditions in a future release so that there's no need for a second call from Kibana.

@VijayDoshi

I'd second that from a UX perspective (as indicated by SDHs) this feels like a regression to our users.

The mapping should certainly be fixed in ECS (is there an issue open for that?) - is there more than one place where we are mapping keyword but not text in ECS?

@jimczi
Contributor

jimczi commented Jun 28, 2021

Ok let's not focus too much on the regression vs feature debate. I agree with you that we have an issue here and we need to resolve it.

I think the issue is that App users view ES as a datastore and expect to be able to retrieve all the data they put in.

My point is that the issue stemmed from a managed schema (ECS), so the user in this case didn't configure the schema explicitly. We had a chat with the ECS team and we are now investigating a bug in the generation of the final mapping in Beats:
https://github.com/elastic/ecs/blob/9f6d0ef7471415e99128d96d142185514fd54c5d/generated/beats/fields.ecs.yml#L1737-L1750
The field should be generated with doc_values set to false but it's not. Other fields may be affected, so again, thanks to this behavior we'll be able to fix a real bug in ECS.

+1 to Gilad's proposal to include the field's value from _source in the response from the fields API under these conditions in a future release so that there's no need for a second call from Kibana.

I am not against this proposal but I'd prefer to discuss it in the context of the UX that we want. Not as a workaround to restore a buggy behavior.
We should also consider the general case here. ignore_malformed, for instance, is no different: values that cannot be parsed as a number will be omitted from the fields response if the option is set.
We won't make a special case for each of these options, so we started to discuss how to expose all ignored fields more consistently in Kibana. We have a metadata field called _ignored that is always returned for each hit, so it should be possible to expose these missing fields prominently to users in Discover.
Again, these values are supposed to be invalid (not parsable, too big, ...), so I am not sure that showing them by default in every application is a good thing. They can disturb the system, so expecting a second call to get the original source sounds like a safer choice.

@timroes
Contributor Author

timroes commented Jun 29, 2021

I am not against this proposal but I'd prefer to discuss it in the context of the UX that we want. Not as a workaround to restore a buggy behavior.

Yeah I agree. That's why I tried to outline the UX we discussed with Product and agreed we want to deliver in Discover. See this comment for more details. Basically we agreed that the UX we want to deliver to users is to show all values (including ignored ones) by default (in Discover; I don't want to speak for any other app here, since I believe this might differ on a case-by-case basis), and additionally show warnings when those values are actually ignored (and thus not searchable). Ideally those warnings are as detailed as they can be, though to be honest I think for the beginning it's fine if we just have a generic "this value was not indexed" warning on those values, which would be achievable with the API discussed here, given that we'd show this warning beside every value we pull out of ignored_values instead of the fields response.

We should also consider the general case here. ignore_malformed for instance is no different, values that cannot be parsed as a number will be omitted in the fields response if they have the option set.

++ to that. I think that include_ignored should include ALL ignored values in the ignored_values (or whatever we'd name it) response, and not just the ones ignored because of ignore_above.

We have a metadata field that is always returned for each hit called _ignored so it should be possible to expose these missing fields for users prominently in Discover.

Yes, we can leverage that, but we're still lacking the part of showing the actual values of those fields (the API discussed here). Though I think if we had that include_ignored API we'd actually not need to look into _ignored anymore, since technically we can just show the warning beside those values pulled out of ignored_values.

@jpountz
Contributor

jpountz commented Sep 2, 2021

as a separate part of the response, not mixed in with fields and thus potentially polluting type-purity

This point is a good one, I wouldn't want to lose the type safety of the fields option.

Another thing I'm wondering is whether we can avoid introducing a new option for this: could we always return ignored values in the response on a best-effort basis, i.e. when _source is neither disabled nor filtered?

@markharwood
Contributor

@timroes and I had a discussion on this today and we came away with the following proposal:

A new boolean parameter should be introduced, called something like include_ignored_field_values. This would be optional with a default of false. When set to true we would return an ignored_field_values map in results which would contain values for fields where any of the field name expressions in the search request matched any of the field names recorded in the _ignored field for a document. The response might look something like this:

     "ignored_field_values" : {
          "my_string_field": [ "small value", "A quite large value that exceeded the size limit for ....." ],
          "my.obj.my_date_field": [ "31/February/2021", "date unknown" ]
     }

There are several things to note about this format:

Some (or nearly all) of the "bad data" in results might be good

With multi-value fields/docs there may be good and bad examples of values returned. The Elasticsearch server won't try to reverse-engineer which of a field's values were problematic - it will just list all values as they were found in _source. The "small value" string in the above may be an example of an OK value. Kibana should take care to tell users "some of these values can't be searched/aggregated on" rather than the more pessimistic view that they are all unsearchable. This may be annoying if a doc has an array of thousands of numbers or dates and only one is in error, but we don't retain enough failure info to remember which one was in error.

No nested doc support

Being a flat map of field->values, we cannot represent any of the structure of nested objects like we do with the main fields response - we again don't retain any data about which of many nested objects might have failed.

No runtime field support

A malformed value in a runtime field is not something we can return because we have no memory or insight into what a runtime field script may have been working with when it failed.

The existing fields request criteria is simply an array of field names/wildcards, so there is no place to nest the new include_ignored_field_values boolean setting as a property. This means that we should perhaps allow the fields API to take an object as an alternative to an array, e.g.

The existing

     "fields" : ["foo", "bar"]

Could also be written as:

   "fields": {
       "names": ["foo","bar"]
       "include_ignored_field_values": true
   }

This object structure also gives us a place to hold other future settings which brings me on to the next topic:

We need to avoid surfacing abusive content

If we are not careful the include_ignored_field_values could act as a good way to surface problematic content like the multi-megabyte base64 encoded string someone tried and failed to index. A related issue we may try to address is capping the size of huge arrays of values. In both cases we want to give people a flavour of what the problem docs are without blowing up the browser running Kibana. This suggests that we might want settings to allow for truncation of large documents. I propose we tackle that in another issue but mention it here because it hints at future settings we might want to record in the above fields object if we redesign the API that way.

@markharwood
Contributor

markharwood commented Sep 29, 2021

@timroes another oddity to look out for - the field names used in the ignored_field_values section of the results will be the requested multi-field name, e.g. foo.keyword, even though the content is obtained from the field foo in _source. It's possible that foo.keyword and foo.keyword_normalized are both multi-field mappings derived from the same source field foo. If they both fail for different reasons, the ignored_field_values map will contain the foo source contents twice - once under foo.keyword and once under foo.keyword_normalized. It's duplication, but it brings clarity about which of the potentially many fields derived from foo failed. A sketch of such a mapping is shown below.
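
A sketch of the situation being described (mapping hypothetical; my_lowercase is assumed to be a normalizer defined in the index settings):

"foo": {
  "type": "text",
  "fields": {
    "keyword": { "type": "keyword", "ignore_above": 256 },
    "keyword_normalized": { "type": "keyword", "normalizer": "my_lowercase", "ignore_above": 256 }
  }
}

If a foo value exceeds both limits, ignored_field_values would list the same _source value under both foo.keyword and foo.keyword_normalized.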

@jpountz
Contributor

jpountz commented Sep 29, 2021

The proposed additions to the response body look good to me.

A new boolean parameter should be introduced called something like include_ignored_field_values.

I would like to see if we can avoid introducing a new parameter and always return ignored field values for requested field names?

@markharwood
Contributor

I would like to see if we can avoid introducing a new parameter and always return ignored field values for requested field names?

Is that not just surfacing abusive content that stricter databases would just reject completely at ingest time? I worry about the massive-doc scenario so would prefer to see this as opt-in with some added safeguards e.g. truncate settings.

@jpountz
Contributor

jpountz commented Sep 29, 2021

I'm not concerned about this. text and binary fields don't have an option to ignore long field values, so this isn't really a protection if this only exists on keyword fields? Plus if Discover opts in for getting values anyway, then this protection doesn't apply to Discover either?

@markharwood
Contributor

markharwood commented Sep 29, 2021

text and binary fields don't have an option to ignore long field values,

True - I just didn't want to add to the list.

if Discover opts in for getting values anyway, then this protection doesn't apply to Discover either?

Also true. Some background that may be of interest based on an earlier discussion with Tim:

This Discover problem is just a symptom of a bigger issue of bad indexing choices.
Discover users are mostly upset because they're looking at log files and really want to see those stack trace messages, not blank cells. If the values are currently blank it's because they used keyword mappings which have an ignore_above. The change proposed here gives them that visibility but still does not address their searchability.

They often didn't map it as a text field because the content isn't human-authored text like this paragraph, so users have no idea about word boundaries in content like

org.gradle.process.internal.DefaultExecActionFactory.exec(DefaultExecActionFactory.java:205)

People struggle with picking out parts of content like the above and knowing whether they need to make it a phrase query or a wildcard, because they need to know how it's tokenized in the index.

They didn't map as wildcard field because:

  1. They didn't know what wildcard field was or
  2. were uncertain that wildcard would work well with their content (it is a complex decision) or
  3. didn't want to pay the extra disk or
  4. users can't query it effectively because KQL doesn't support regex or case-insensitive wildcard search options

Tim and I discussed again the long term goal of some adaptive indexing strategy which made its own decisions about keyword vs wildcard field indexing strategies as query loads and content grows. Possibly a bit too "magical" but without such a thing we're stuck in this limbo of users not being able to make great indexing decisions or search.

@markharwood
Contributor

I've tried to implement this with the most efficiently packaged results because I'm concerned about repetition of noisy content. There's still potential for results amplification.
This example is a silly scenario but we have to prepare for the worst. Given this mapping:

  "mappings": {
    "properties": {
      "foo":{
        "type": "text",
        "fields": {
          "keyword10": {"type":"keyword","ignore_above":10},
          "keyword5": { "type":"keyword","ignore_above":5 }
        }
    }}

and this doc:

{
	"foo": ["ok", "ok2", "bad456","bad45678901"]
}

Then a search with * for fields

POST http://localhost:9200/test/_search
{
  "_source": false,
  "fields": [ "*" ]
}

We get this less-than-ideal result:

{
        "_index": "test",
        "_id": "gwpPMXwBn3-sjgk-L4it",
        "_score": 1.0,
        "_ignored": [
          "foo.keyword5",
          "foo.keyword10"
        ],
        "fields": {
          "foo.keyword5": [   // Good stuff from doc values 1
            "ok",
            "ok2"
          ],
          "foo": [            // Good and bad stuff from source
            "ok",
            "ok2",
            "bad456",
            "bad45678901"
          ],
          "foo.keyword10": [  // Good stuff from doc values 2
            "ok",
            "ok2",
            "bad456"
          ]
        },
        "ignored_field_values": {
          "foo.keyword5": [   // bad stuff from source (again)
            [
              "bad456",
              "bad45678901"
            ]
          ],
          "foo.keyword10": [      //  bad stuff from source (again)
            [
              "bad45678901"
            ]
          ]
        }
      }

The results carry a fair amount of repetition.

Note, I have taken efforts to remove "OK" content from the arrays of an ignored field's values. If there's an exact match between a retrievable doc value (e.g. foo.keyword5's ok) and any value found in the ignored field's foo array retrieved from _source, then I remove it from those listed in the ignored_field_values section. This is to prevent potentially thousands of OK entries being intermingled with one bad apple. Note that this rudimentary trimming will not work if there's any normalisation on the keyword field, because source's OK will not be equal to the doc value's ok. This could lead to wholesale duplication between the fields/foo array and the ignored_field_values/foo.keywordx sections in the results.
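
As an illustration of that normalisation caveat (mapping hypothetical; my_lowercase is assumed to be a normalizer defined in the index settings):

"foo": {
  "type": "text",
  "fields": {
    "keyword5": { "type": "keyword", "ignore_above": 5, "normalizer": "my_lowercase" }
  }
}

For a document with "foo": ["OK", "bad45678901"], the retrievable value for foo.keyword5 is the normalised "ok" (as described above), which is not an exact match for the "OK" read from _source, so the trimming keeps "OK" in ignored_field_values even though it was indexed fine.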

The use of two multi-fields (keyword5 and keyword10) is perhaps a little contrived in my example, but I wanted to test the behaviour. Even with one multi-field defined in the mappings, it's worth noting how verbose a simple * fields search can become; this may be an undesirable consequence of offering this API.

@jpountz
Contributor

jpountz commented Sep 29, 2021

I'm not concerned by the repetitions, this is a feature of the fields API: it removes the need for the consumer of the response to know how the field is mapped, it can look at fields in isolation.

@timroes
Contributor Author

timroes commented Sep 29, 2021

@markharwood I wonder if we could reconsider the removal of repetition. Like @jpountz, I am not worried about repetition at either the API response or the network traffic level. The removal itself, though, might cause us problems when trying to craft the proper values.

With this removal we basically need to merge the fields response part and the ignored_field_values response part together. Since, as you mentioned, we might not detect all duplicates correctly if normalization happens, it feels a bit weird: we might end up merging the wrong number of elements in an array. E.g. a field has [ 'good1', 'GOOD2', 'bad1' ] in _source and we'd end up with a response of (pseudo response, not actual format):

{
  "fields": [ 'good1', 'good2' ],
  "ignored_values": [ 'GOOD2', 'bad1']
}

By merging those together we would now present it as if that field had 4 values, while it only had 3. I think this is a rather nasty side effect, and avoiding the additional "repetitive" elements we'd otherwise return doesn't seem worth it?

@markharwood
Contributor

By merging those together we would now present it as if that field had 4 values, while it only had 3. I think this is a rather nasty side effect, and avoiding the additional "repetitive" elements we'd otherwise return doesn't seem worth it?

I can take out the logic that removes what it thinks are non-ignored values. It may be imperfect at spotting ignored values correctly, but the two main arguments for keeping it are:

  1. Avoids misleading result labels - the JSON result is labeled "ignored_field_values" and this may be a misnomer for some values that weren't ignored at all.
  2. Avoids problems with large value sets - the repetition of many values from elsewhere in the results and the games users might have to play to "spot the odd one out" when it comes to identifying the malformed values.

@timroes
Contributor Author

timroes commented Sep 30, 2021

So we're basically in the pickle of either going for a better API or making sure the data in Discover will be more accurate. I would be in favor of the more accurate data in Discover, and therefore of including all of _source in ignored_field_values, but I am curious to hear more thoughts on that from other people, keeping the above-described issues/behavior in mind.

markharwood added a commit to markharwood/elasticsearch that referenced this issue Oct 1, 2021
markharwood added a commit to markharwood/elasticsearch that referenced this issue Oct 13, 2021
Since Kibana's Discover switched to retrieving values via the fields API rather than source there have been gaps in the display caused by "ignored" fields (those that fall foul of ignore_above and ignore_malformed size and formatting rules).

This PR returns ignored values from source when a user-requested field fails to be parsed for a document. In these cases the corresponding hit adds a new ignored_field_values section in the response.

Closes elastic#74121
markharwood added a commit that referenced this issue Oct 13, 2021
Since Kibana's Discover switched to retrieving values via the fields API rather than source there have been gaps in the display caused by "ignored" fields (those that fall foul of ignore_above and ignore_malformed size and formatting rules).

This PR returns ignored values from source when a user-requested field fails to be parsed for a document. In these cases the corresponding hit adds a new ignored_field_values section in the response.

Closes #74121
markharwood added a commit that referenced this issue Oct 13, 2021
Backport of PR 78697
Add ignored field values to returned hits
Closes #74121
markharwood added a commit that referenced this issue Oct 13, 2021
BWC change following backport of PR 78697 to 7.x
Closes #74121