Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(parser): handle duplicate ID errors during parsing #408

Merged
merged 6 commits into from
Feb 14, 2024

Conversation

perierc
Copy link
Contributor

@perierc perierc commented Feb 14, 2024

What

  • This PR merges the duplicate nodes (having the same ID) during parsing.
  • When a node with an already existing ID is harvested, we yield it, but we don't link it to any other node (no PreviousLink is related to the duplicate)
  • Added a new step to remove duplicate child links (due to the fact that duplicate nodes are yielded)
  • Added a step to remove duplicate entry nodes: translations are merged (union), properties are overwritten with the latest values found, comments are added.

Screenshot

The new error message for duplicate IDs looks like this:

image

Part of

Copy link
Member

@alexgarel alexgarel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really cool.

Some cosmetic changes proposed.

@eric-nguyen-cs do you remember a good taxonomy to test this ?

@perierc
Copy link
Contributor Author

perierc commented Feb 14, 2024

Really cool.

Some cosmetic changes proposed.

@eric-nguyen-cs do you remember a good taxonomy to test this ?

FYI I've tried with the Ingredients taxonomy (with 39 duplicate id errors) and I've verified manually that the first 10 errors were correctly merged, comparing with the original file, and it was all good

@eric-nguyen-cs
Copy link
Contributor

A few examples of the end result:
Screenshot 2024-02-14 at 16 15 30

Screenshot 2024-02-14 at 16 19 03

Copy link
Contributor

@eric-nguyen-cs eric-nguyen-cs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@perierc perierc merged commit e5720bf into main Feb 14, 2024
7 checks passed
@perierc perierc deleted the perierc/handle-duplicate-id branch February 14, 2024 20:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
No open projects
Status: No status
Development

Successfully merging this pull request may close these issues.

handle duplicate id while parsing
3 participants