Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix execution of exists query within nested queries on field with doc_values disabled #78841

Merged
merged 7 commits into from
Oct 19, 2021

Conversation

yrodiere
Copy link
Contributor

@yrodiere yrodiere commented Oct 7, 2021

Fixes #76362, which is a regression that first occurred in 7.14.0.

In order to fix the problem, we need to add a _field_names field to nested documents. However, just like root documents, this field will only be populated with fields that have no doc values nor norms. Which means that this patch won't change anything, except for users who explicitly opted out of doc values for one of their fields.

I'm sorry if this does not pass the full test suite; I tried running it locally but it fails for some unknown (but unrelated) reason. I executed a few relevant tests manually, and I'll have to wait for CI to give a full report.

If this gets merged, could we please backport this PR to 7.x? That would allow the Hibernate Search test suite to pass on something more recent than ES 7.13.

@elasticsearchmachine elasticsearchmachine added v8.0.0 external-contributor Pull request authored by a developer outside the Elasticsearch team labels Oct 7, 2021
@mark-vieira
Copy link
Contributor

@elasticmachine retest this please

@yrodiere
Copy link
Contributor Author

yrodiere commented Oct 8, 2021

Thanks. Having a look at the test failures right now.

@yrodiere yrodiere force-pushed the exists-in-nested-query branch from e12665b to 8298689 Compare October 8, 2021 07:06
@yrodiere
Copy link
Contributor Author

yrodiere commented Oct 8, 2021

It appears MixedClusterClientYamlTestSuiteIT was failing because it checks what happens when running the tests I added against a clustering comprising ES 8 nodes and ES 7.16 nodes, and obviously the bug wasn't fixed in ES 7.16 yet: we need a backport for that.

For now I added guards to prevent testing against ES 7.x, like I've seen it done in that other PR.

I also fixed the tab that was making the "part-1" build fail.

@mark-vieira mark-vieira added the :Search Foundations/Mapping Index mappings, including merging and defining field types label Oct 11, 2021
@elasticmachine elasticmachine added the Team:Search Meta label for search team label Oct 11, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search (Team:Search)

@jimczi jimczi requested a review from romseygeek October 12, 2021 15:58
@jimczi jimczi added the >bug label Oct 12, 2021
Copy link
Contributor

@romseygeek romseygeek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for opening @yrodiere!

I think we can do this more simply by changing the implementation of DocumentParserContext.addToFieldNames(); instead of collecting the field names and then adding them to the document in postParse(), we can instead add each field name individually as a new Field to the current document directly in addToFieldNames(). We don't store freqs or positions on the field names field so adding duplicate values won't matter, and then the existing logic that copies fields from nested docs to their parents will handle things naturally.

Would you be interested in trying this out?

@yrodiere
Copy link
Contributor Author

yrodiere commented Oct 14, 2021

Hey @romseygeek ,

Sure, I can do this.

However, I will ask confirmation about a few things before I start:

we can instead add each field name individually as a new Field to the current document directly in addToFieldNames()

You went out of your way to do the exact opposite, making sure the mapper itself adds the fields names: #71929 (in particular
0c509b4#diff-2996184d1f59c95c1530f925f7308296b84cb7f43fb47aac2845cde01510be5fR159-R166 ).

In fact, that's the main reason I tried to use FieldNamesFieldMapper; I thought that was a newly introduced practice that was preferred over the previous one.

Can you confirm you changed your mind and no longer want that?

the existing logic that copies fields from nested docs to their parents will handle things naturally

By "existing logic", do you mean the one I just added that keeps a reference to the parent in DocumentParserContext? Because that's the main driver for the complexity of this patch, so I'm guessing you won't like that. Before my patch, there wasn't any _field_names field in nested documents, so I doubt there was any code to copy it to the root document.

Or do you mean the include_in_parent parameter of nested field mapping? Because that's optional and disabled by default, so I believe relying on this would change the previous behavior that allowed running an exists query without wrapping it into a nested query.

Or do you mean another "existing logic" that copies _field_names from nested documents to the root document? Because I didn't see that anywhere...

To be clear I personally don't need _field_names to be copied to the parent document: I always wrap exists queries into nested queries when necessary, so I have no horse in this game. I just don't want to introduce backwards-incompatible changes that you will have to deal with later, so I thought I'd bring this to your attention.

@romseygeek
Copy link
Contributor

Can you confirm you changed your mind and no longer want that?

I can! I think we can avoid getting the FieldNamesFieldMapper itself, and just add a static helper method to it that would create the relevant IndexableField.

Or do you mean the include_in_parent parameter of nested field mapping? Because that's optional and disabled by default, so I believe relying on this would change the previous behavior that allowed running an exists query without wrapping it into a nested query.

I do indeed mean this. I would say that the current behaviour is a bug - running a field exists query against a top-level document that does not in fact include that field should not return the top-level doc.

@yrodiere yrodiere force-pushed the exists-in-nested-query branch from 8298689 to 80694d2 Compare October 14, 2021 14:15
@yrodiere
Copy link
Contributor Author

@romseygeek Done.

However, I couldn't use a static method: the _field_names feature can be disabled through a setting (_field_names.enabled?) for indexes created in Elasticsearch 7.x or before, so I need to retrieve the value of the setting to decide whether to add a field name or not, and the easiest way to do so is to just retrieve the FieldNamesFieldMapper.

Copy link
Contributor

@romseygeek romseygeek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great, thank you! I think we need one more test scenario to check that includeInParent will work as expected, but other than that this is good to go.

assertThat(doc.docs().get(4).getNumericValue("nested1.nested2.integer2"), nullValue());
assertThat(doc.docs().get(4).get("_field_names"), nullValue());
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a test that these also work for nested documents that have includeInParent set to true?

@romseygeek
Copy link
Contributor

@elasticmachine test this please

@yrodiere yrodiere force-pushed the exists-in-nested-query branch from 80694d2 to 1b395e9 Compare October 15, 2021 11:56
@yrodiere
Copy link
Contributor Author

yrodiere commented Oct 15, 2021

Thanks. I amended a commit to fix the checkstyle issue (unused import) and added a commit to test include_in_parent/include_in_root (it works as expected).

@romseygeek
Copy link
Contributor

@elasticmachine test this please

@romseygeek
Copy link
Contributor

@elasticmachine update branch

@romseygeek
Copy link
Contributor

@elasticmachine test this please

@yrodiere
Copy link
Contributor Author

yrodiere commented Oct 18, 2021

I tried to see if there was something I could do to fix the tests, but I'm not sure the failures are related to my changes. There's something about field caps, something about node discovery, ... I'm wondering if those aren't simply flaky tests?

@romseygeek
Copy link
Contributor

@elasticmachine update branch

@romseygeek
Copy link
Contributor

I tried to see if there was something I could do to fix the tests, but I'm not sure the failures I related to my changes. There's something about field caps, something about node discovery, ... I'm wondering if those aren't simply flaky tests?

We're coming up to a deadline so lots of stuff is going in which makes it difficult to keep up with backwards-compatibility checks. I'll keep updating the branch and retesting and will merge once we get a green CI. Thanks for your patience on this.

@romseygeek
Copy link
Contributor

@elasticmachine ok to test

@romseygeek
Copy link
Contributor

@elasticmachine update branch

@romseygeek
Copy link
Contributor

@elasticmachine update branch

@romseygeek romseygeek merged commit 3d35144 into elastic:master Oct 19, 2021
romseygeek pushed a commit to romseygeek/elasticsearch that referenced this pull request Oct 19, 2021
…_values disabled (elastic#78841)

The FieldNamesFieldMapper adds non-indexed fields to a special metadata
field so that exists queries can be run efficiently.  This is built in a postParse
method that is run once per document.  However, this means that nested
documents are not handled correctly - non-indexed field names are added to
the parent document's metadata field rather than to the nested document's
field.

This commit fixes things to add non-indexed field names directly to the
nested documents.
romseygeek added a commit that referenced this pull request Oct 19, 2021
…_values disabled (#78841) (#79462)

The FieldNamesFieldMapper adds non-indexed fields to a special metadata
field so that exists queries can be run efficiently.  This is built in a postParse
method that is run once per document.  However, this means that nested
documents are not handled correctly - non-indexed field names are added to
the parent document's metadata field rather than to the nested document's
field.

This commit fixes things to add non-indexed field names directly to the
nested documents.

Co-authored-by: Yoann Rodière <[email protected]>
weizijun added a commit to weizijun/elasticsearch that referenced this pull request Oct 19, 2021
* upstream/master:
  Validate tsdb's routing_path (elastic#79384)
  Adjust the BWC version for the return200ForClusterHealthTimeout field (elastic#79436)
  API for adding and removing indices from a data stream (elastic#79279)
  Exposing the ability to log deprecated settings at non-critical level (elastic#79107)
  Convert operator privilege license object to LicensedFeature (elastic#79407)
  Mute SnapshotBasedIndexRecoveryIT testSeqNoBasedRecoveryIsUsedAfterPrimaryFailOver (elastic#79456)
  Create cache files with CREATE_NEW & SPARSE options (elastic#79371)
  Revert "[ML] Use a new annotations index for future annotations (elastic#79151)"
  [ML] Use a new annotations index for future annotations (elastic#79151)
  [ML] Removing legacy code from ML/transform auditor (elastic#79434)
  Fix rate agg with custom `_doc_count` (elastic#79346)
  Optimize SLM Policy Queries (elastic#79341)
  Fix execution of exists query within nested queries on field with doc_values disabled (elastic#78841)
  Stricter UpdateSettingsRequest parsing on the REST layer (elastic#79227)
  Do not release snapshot file download permit during recovery retries (elastic#79409)
  Preserve request headers in a mixed version cluster (elastic#79412)
  Adjust versions after elastic#79044 backport to 7.x (elastic#79424)
  Mute BulkByScrollUsesAllScrollDocumentsAfterConflictsIntegTests (elastic#79429)
  Fail on SSPL licensed x-pack sources (elastic#79348)

# Conflicts:
#	server/src/test/java/org/elasticsearch/index/TimeSeriesModeTests.java
romseygeek added a commit to romseygeek/elasticsearch that referenced this pull request Nov 3, 2021
…_values disabled (elastic#78841) (elastic#79462)

The FieldNamesFieldMapper adds non-indexed fields to a special metadata
field so that exists queries can be run efficiently.  This is built in a postParse
method that is run once per document.  However, this means that nested
documents are not handled correctly - non-indexed field names are added to
the parent document's metadata field rather than to the nested document's
field.

This commit fixes things to add non-indexed field names directly to the
nested documents.

Co-authored-by: Yoann Rodière <[email protected]>
romseygeek added a commit that referenced this pull request Nov 3, 2021
…_values disabled (#78841) (#79462) (#80274)

The FieldNamesFieldMapper adds non-indexed fields to a special metadata
field so that exists queries can be run efficiently.  This is built in a postParse
method that is run once per document.  However, this means that nested
documents are not handled correctly - non-indexed field names are added to
the parent document's metadata field rather than to the nested document's
field.

This commit fixes things to add non-indexed field names directly to the
nested documents.

Co-authored-by: Yoann Rodière <[email protected]>
@stewi2
Copy link

stewi2 commented Nov 5, 2021

except for users who explicitly opted out of doc values for one of their fields

To clarify, #80309 happened on a text field, which doesn't support doc_values in the first place. I think the issue was more that we had also disabled norms, meaning that exists() queries had to fall back to using field_names, which didn't exist (as far as I understand).

"first": {
    "type": "text",
    "norms": false
},

"type": "text" basically implies "doc_values": false. Not sure if this PR would fix that case as well?

@yrodiere
Copy link
Contributor Author

yrodiere commented Nov 5, 2021

To clarify, #80309 happened on a text field, which doesn't support doc_values in the first place.

That's right, this is slightly different from #76362, which this PR is supposed to fix.

Not sure if this PR would fix that case as well?

It should.

I didn't add any code specific to doc_values. Elasticsearch already had code that detects whether a field needs an entry in the _field_names field, and that code already took into account both missing doc_values and missing norms. It worked well enough for both.

All this PR does is make sure that, when Elasticsearch detects that a field needs an entry in the _field_names field, then it adds this entry to the right (nested) document.

So, yes, I think this PR should fix #80309 as well.

If you want to be sure, you can add more tests to rest-api-spec/src/yamlRestTest/resources/rest-api-spec/test/search/161_exists_query_within_nested_query.yml. Or... just test a patched version of Elasticsearch locally.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug external-contributor Pull request authored by a developer outside the Elasticsearch team :Search Foundations/Mapping Index mappings, including merging and defining field types Team:Search Meta label for search team v7.15.2 v7.16.0 v8.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

"exists" query in "nested" query never matches for fields without doc_values within nested fields
8 participants