
Check existing data for duplicate field docs in the migration assistance APIs and Migration Assistant #36629

Closed
geekpete opened this issue Dec 14, 2018 · 6 comments


@geekpete
Member

Describe the feature:

Elasticsearch 5.x allows duplicate fields to be indexed into documents:
#19614

This was fixed in 6.x by enforcing strict duplicate validation:
#22073
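
For illustration, a minimal sketch of the kind of request a 5.x cluster accepts without complaint (index, type, and field names are placeholders); on 6.x the same request is rejected with a json_parse_exception:

```
PUT my-index/my-type/1
{
  "field": "first value",
  "field": "second value"
}
```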

A user who upgrades a cluster containing docs with duplicate fields from 5.x to 6.x currently gets no warning that their data might become unusable once on 6.x, for any operation that requires JSON parsing, including ?pretty, indexing, and scripts.

The error message example is:

"caused_by": { "type": "json_parse_exception",
"reason" : "Duplicate field 'FILECONTENT'\n at [Source: org.elasticsearch.common.bytes.BytesReference$MarkSupportingStreamInputWrapper@5d3ea628; line: 99, column: 14]"...

This scenario can occur when a custom indexing client inadvertently creates documents with duplicate fields. Judging by the few reports I've seen of users hitting duplicate-field issues, it might be considered an edge case, but when it does occur it's a bad situation.

Once upgraded to 6.x, there are limited options for repairing the problem documents, because reindex cannot be used without hitting the duplicate field error (even with the escape hatch enabled to disable validation, es.json.strict_duplicate_detection=false, as it doesn't disable validation for reindex operations).
For this reason, repairs are probably best done on the 5.x cluster before upgrading.

It seems that a fix is either to update the documents in place using the existing _source (e.g. with the update script ctx._source = ctx._source;) or to reindex them, which creates new documents without the duplicate fields. A sketch of both options is below.
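
As a minimal sketch only (index names are placeholders, and this assumes the 5.x parser's last-value-wins handling of the duplicates is acceptable), the first request rewrites each document in place from its parsed _source, while the second reindexes into a fresh index, which re-serialises each _source without the duplicate keys:

```
POST my-index/_update_by_query?conflicts=proceed
{
  "script": { "lang": "painless", "inline": "ctx._source = ctx._source;" }
}

POST _reindex
{
  "source": { "index": "my-index" },
  "dest": { "index": "my-index-repaired" }
}
```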

One other thing to consider: in any scenario where the values of the duplicated field differ, a more custom script or solution might be needed to choose which value to keep, and it gets messy at that point.

So users should at least be warned and, if possible, presented with either automatic or manual repair options.

@elasticmachine
Collaborator

Pinging @elastic/es-core-features

@ywelsch
Contributor

ywelsch commented Dec 14, 2018

Note that warnings for this would require a full scan of the data in the index, loading every source document to figure out whether it has duplicate fields.
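
For a sense of what that means in practice, a rough sketch (index name and batch size are placeholders): page through every document with the scroll API and re-check each raw _source with a strict JSON parser on the client side, repeating the _search/scroll call until no hits remain.

```
POST my-index/_search?scroll=1m
{
  "size": 1000,
  "query": { "match_all": {} }
}
```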

> Once upgraded to 6.x, there are limited options for repairing the problem documents, because reindex cannot be used without hitting the duplicate field error (even with the escape hatch enabled to disable validation, es.json.strict_duplicate_detection=false, as it doesn't disable validation for reindex operations).

Why is the escape hatch not working? What kind of errors are you getting from the reindex API?

@gwbrown
Contributor

gwbrown commented Dec 14, 2018

To add to what @ywelsch said, checking for this condition would require a significant rework of how the Migration APIs operate, as they currently use index metadata only. Even if the migration APIs were set up to run queries against indices, I'm not sure there's a way to detect this generically in a query, even a scripted one.

While I agree that we should do something to handle this case better, I'm not sure the migration API is the place to do it.

@geekpete
Member Author

The escape-hatch config setting changed name as of version 6.3, so the previous setting name was having no effect: one setting name applies to 6.0 through 6.2 and a different one to 6.3 and later.

This info has been added in the PR:
#22073
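
For reference, a rough sketch of how the pre-6.3 escape hatch is applied (this uses the 6.0 to 6.2 property name quoted above; 6.3+ uses a different name, so check the linked PR and the docs for your exact version):

```
# config/jvm.options (restart required); 6.0 to 6.2 property name only
-Des.json.strict_duplicate_detection=false
```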

I understand the reasoning: we only use metadata to keep the checks lightweight, and doing a search across every doc with a script or something that looks for dupe fields would be expensive at scale.
One way to minimise this cost might be to use a random function score query across the data. It won't guarantee that every doc is checked, but in cases where all, or a sufficient majority, of the docs are broken it will at least alert the user. In cases where the dupe fields are rare, missing them at an upgrade check step might have much less impact.
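
As a minimal sketch of that sampling idea (index name and sample size are placeholders), something like the query below would pull a random set of docs; each hit's raw _source would still need to be re-checked with a strict JSON parser on the client side, since the cluster itself can't flag the duplicates in 5.x:

```
GET my-index/_search
{
  "size": 100,
  "query": {
    "function_score": {
      "query": { "match_all": {} },
      "random_score": {}
    }
  }
}
```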

The fact that the escape hatch is working as expected means the impact of this scenario is easy to deal with if a user upgrades and finds docs are affected, so perhaps this check could instead be added to an upgrade checklist in the docs.

It's probably also a rare edge case, as it takes a custom indexing app to make the error of duplicating fields, and I think most of the mainstream indexing tools will generally avoid this.

@jakelandis
Contributor

@geekpete - Happy to discuss further, but based on the conversation so far, it seems this is a low-impact edge case with a valid workaround. Do you mind if we close this request?

@geekpete
Member Author

geekpete commented Jan 2, 2019

Agreed, closing.

@geekpete closed this as completed Jan 2, 2019