Check existing data for duplicate field docs in the migration assistance APIs and Migration Assistant #36629
Pinging @elastic/es-core-features
Note that warnings for this would require a full scan of the data in the index, loading every source document to figure out whether it has duplicate fields.
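To make that concrete, here is a minimal client-side sketch of such a scan, assuming a 5.x cluster on `localhost:9200` (the index name and the use of the `requests` library are assumptions, not anything from this thread). It relies on Elasticsearch storing `_source` as the original document bytes, and on Python's `object_pairs_hook` to flag repeated keys:

```python
import json
import requests  # assumed available; any HTTP client would do

ES = "http://localhost:9200"  # placeholder cluster address
INDEX = "my-index"            # placeholder index name

def strict_pairs(pairs):
    """Reject JSON objects that repeat a key at the same nesting level."""
    seen = set()
    for key, _ in pairs:
        if key in seen:
            raise ValueError("duplicate field: %r" % key)
        seen.add(key)
    return dict(pairs)

# Walk the whole index with the scroll API, fetching only document IDs.
page = requests.post("%s/%s/_search?scroll=1m" % (ES, INDEX),
                     json={"size": 1000, "_source": False}).json()
while page["hits"]["hits"]:
    for hit in page["hits"]["hits"]:
        # GET .../_source returns the stored source bytes, so duplicate
        # fields survive the round trip and a strict re-parse catches them.
        raw = requests.get("%s/%s/%s/%s/_source"
                           % (ES, INDEX, hit["_type"], hit["_id"])).text
        try:
            json.loads(raw, object_pairs_hook=strict_pairs)
        except ValueError as err:
            print(hit["_id"], err)
    page = requests.post("%s/_search/scroll" % ES,
                         json={"scroll": "1m",
                               "scroll_id": page["_scroll_id"]}).json()
```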
Why is the escape hatch not working? What kind of errors are you getting from the reindex API?
To add to what @ywelsch said, checking for this condition would require a significant rework of how the Migration APIs operate, as they currently use index metadata only. Even if the migration APIs were set up to run queries against indices, I'm not sure there's a way to detect this generically in a query, even a scripted one. While I agree that we should do something to handle this case better, I'm not sure the migration API is the place to do it.
The escape hatch config setting changed name in version 6.3, so the previous setting name was taking no effect: one setting name applies to 6.0 through 6.2, and a different setting name to 6.3+. This info has been added in the PR. I understand the reason we only use metadata, to keep checks lightweight, and the expense at scale of doing a search across every doc with a script or something that looks for dupe fields. The fact that the escape hatch is working as expected means the impact of this scenario is easy to deal with if a user upgrades and finds docs are affected, so this check might perhaps be added to an upgrade checklist in the docs. It's probably also a rare edge case, as it takes a custom indexing app to make the error of duplicating fields, and I think most mainstream indexing tools will generally avoid this.
@geekpete - Happy to discuss further, but based on the conversation so far, it seems this is a low-impact edge case with a valid workaround. Do you mind if we close this request?
Agreed, closing. |
Describe the feature:
Elasticsearch 5.x allows duplicate fields to be indexed into documents:
#19614
This was fixed in 6.x by enforcing strict duplicate validation:
#22073
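For illustration, a minimal sketch of the behavior difference, assuming a node on `localhost:9200` and placeholder index/type names; the body is sent as a raw string because a Python dict cannot even represent duplicate keys:

```python
import requests  # assumed available; any HTTP client would do

# Raw JSON repeating the field "user"; built as a string because a dict
# would silently collapse the duplicate before the request is sent.
doc = '{"user": "alice", "user": "bob"}'

resp = requests.put("http://localhost:9200/test/doc/1",  # placeholder names
                    data=doc,
                    headers={"Content-Type": "application/json"})
# A 5.x node accepts and stores this document verbatim; a 6.x node
# rejects it with an HTTP 400 duplicate-field parse error.
print(resp.status_code, resp.text)
```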
A user who upgrades a cluster containing docs with duplicate fields from 5.x to 6.x currently gets no warning that their data may become unusable once on 6.x, for any operation requiring JSON parsing, including `?pretty`, indexing, and scripts.
This scenario can occur when a custom indexing client is inadvertently creating documents with duplicate fields. Judging by the few reports I've seen of users hitting duplicate-field issues, it might be seen as an edge case, but when it does occur it's a bad situation.
Once upgraded to 6.x, there are limited options for repairing the problem documents, due to the inability to use reindex without hitting the duplicate-field error (even with the escape hatch enabled to disable validation, `es.json.strict_duplicate_detection=false`, as it doesn't disable validation for reindex operations). For this reason, repairs are probably best done on the 5.x cluster before upgrading.
It seems that a fix is either to update documents in place using the existing `_source` (e.g. with the update script `ctx._source = ctx._source;`) or to reindex the documents, which will create new documents without the duplicate fields (both sketched below). One other thing to consider: in any scenario where the values for the duplicated field differ, a more custom script or solution might be needed to choose which value to keep; it gets messy then.
So users should at least be warned and, if possible, presented with either automatic or manual repair options.