Check existing data for duplicate field docs in the migration assistance APIs and Migration Assistant #36629
Pinging @elastic/es-core-features
Note that warnings for this would require a full scan of the data in the index, loading every source document to figure out whether it has duplicate fields.
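To make that concrete, here is a minimal client-side sketch of such a scan, assuming a 5.x cluster on `localhost:9200` (the index name and the use of the `requests` library are assumptions, not anything from this thread). It relies on Elasticsearch storing `_source` as the original document bytes, and on Python's `object_pairs_hook` to flag repeated keys:

```python
import json
import requests  # assumed available; any HTTP client would do

ES = "http://localhost:9200"  # placeholder cluster address
INDEX = "my-index"            # placeholder index name

def strict_pairs(pairs):
    """Reject JSON objects that repeat a key at the same nesting level."""
    seen = set()
    for key, _ in pairs:
        if key in seen:
            raise ValueError("duplicate field: %r" % key)
        seen.add(key)
    return dict(pairs)

# Walk the whole index with the scroll API, fetching only document IDs.
page = requests.post("%s/%s/_search?scroll=1m" % (ES, INDEX),
                     json={"size": 1000, "_source": False}).json()
while page["hits"]["hits"]:
    for hit in page["hits"]["hits"]:
        # GET .../_source returns the stored source bytes, so duplicate
        # fields survive the round trip and a strict re-parse catches them.
        raw = requests.get("%s/%s/%s/%s/_source"
                           % (ES, INDEX, hit["_type"], hit["_id"])).text
        try:
            json.loads(raw, object_pairs_hook=strict_pairs)
        except ValueError as err:
            print(hit["_id"], err)
    page = requests.post("%s/_search/scroll" % ES,
                         json={"scroll": "1m",
                               "scroll_id": page["_scroll_id"]}).json()
```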
Why is the escape hatch not working? What kind of errors are you getting from the reindex API?
To add to what @ywelsch said, checking for this condition would require a significant rework of how the Migration APIs operate, as they currently use index metadata only. Even if the migration APIs were set up to run queries against indices, I'm not sure there's a way to detect this generically in a query, even a scripted one. While I agree that we should do something to handle this case better, I'm not sure the migration API is the place to do it.
The escape hatch config setting changed name in version 6.3, so the previous setting name was taking no effect: one setting name applies to 6.0 through 6.2, and a different setting name to 6.3+. This info has been added in the PR. I understand the reason we only use metadata, to keep checks lightweight, and the expense at scale of doing a search across every doc with a script or something that looks for dupe fields. The fact that the escape hatch is working as expected means the impact of this scenario is easy to deal with if a user upgrades and finds docs are affected, so this check might perhaps be added to an upgrade checklist in the docs. It's probably also a rare edge case, as it takes a custom indexing app to make the error of duplicating fields, and I think most mainstream indexing tools will generally avoid this.
@geekpete - Happy to discuss further, but based on the conversation so far, it seems this is a low-impact edge case with a valid workaround. Do you mind if we close this request?
Agreed, closing. |
Describe the feature:
Elasticsearch 5.x allows duplicate fields to be indexed into documents:
#19614
This was fixed in 6.x by enforcing strict duplicate validation:
#22073
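For illustration, a minimal sketch of the behavior difference, assuming a node on `localhost:9200` and placeholder index/type names; the body is sent as a raw string because a Python dict cannot even represent duplicate keys:

```python
import requests  # assumed available; any HTTP client would do

# Raw JSON repeating the field "user"; built as a string because a dict
# would silently collapse the duplicate before the request is sent.
doc = '{"user": "alice", "user": "bob"}'

resp = requests.put("http://localhost:9200/test/doc/1",  # placeholder names
                    data=doc,
                    headers={"Content-Type": "application/json"})
# A 5.x node accepts and stores this document verbatim; a 6.x node
# rejects it with an HTTP 400 duplicate-field parse error.
print(resp.status_code, resp.text)
```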
A user who upgrades a cluster containing docs with duplicate fields from 5.x to 6.x currently gets no warning that their data may become unusable once on 6.x, for any operation requiring JSON parsing, including `?pretty`, indexing, and scripts.
This scenario can occur when a custom indexing client is inadvertently creating documents with duplicate fields. Judging by the few reports I've seen of users hitting duplicate-field issues, it might be seen as an edge case, but when it does occur it's a bad situation.
Once upgraded to 6.x, there are limited options for repairing the problem documents, due to the inability to use reindex without hitting the duplicate-field error (even with the escape hatch enabled to disable validation, `es.json.strict_duplicate_detection=false`, as it doesn't disable validation for reindex operations). For this reason, repairs are probably best done on the 5.x cluster before upgrading.
It seems that a fix is either to update documents in place using the existing `_source` (e.g. with the update script `ctx._source = ctx._source;`) or to reindex the documents, which will create new documents without the duplicate fields (both sketched below). One other thing to consider: in any scenario where the values for the duplicated field differ, a more custom script or solution might be needed to choose which value to keep; it gets messy then.
So users should at least be warned and, if possible, presented with either automatic or manual repair options.