Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Improve parser performance #318

Merged
merged 25 commits into from
Jan 17, 2024
Merged

Conversation

eric-nguyen-cs
Copy link
Contributor

@eric-nguyen-cs eric-nguyen-cs commented Dec 21, 2023

What

The goal of this PR is to improve the performance of the writing of the taxonomy data to the Neo4J database, so that every taxonomy is parsable in a raisonnable amount of time

Note

This PR is quite big, so to simplify the review, you can go through commit by commit (I tried to make the commits quite small and atomic)

Changes

We can have small performance increases for small taxonomies, but the changes mainly target medium or large taxonomies, by:

  • batching requests, as to not pay the cost of the query optimisation multiple times
  • using range indexes on the node id property to drastically improve look up times in big taxonomies when creating relationships

Fixes bug(s)

#289
#183

Part of

Follow up of this parser/db writer decoupling PR

Results

We have an overall speedup of x11 (mainly 730.6s -> 25.44s for the categories taxonomy). Before, the writes were very slow, so HTTP timeouts were possible when importing a taxonomy file, but now all taxonomies are parsable in less than 30s.

Details

Before:
Additives -> 34.93s
Allergens -> 3.31s
Amino Acids -> 1.6s
Categories -> 730.6s
Countries -> 103.49s
Data Quality -> 16.08s
Food Groups -> 1.63s
Improvements -> 1.06s
Ingredients -> 255.13s
Ingredients Analysis -> 1.12s
Ingredients Processing -> 5.94s
Labels -> 56.08s
Languages -> 32.26s
Minerals -> 6.58s
Misc -> 0.64s
Nova Groups -> 0.61s
Nucleotides -> 1.64s
Nutrients -> 10.79s
Origins -> 12.40s
Other Nutritional Substances -> 2.84s
Packaging Materials -> 6.90s
Packaging Recycling -> 0.94s
Packaging Shapes -> 3.81s
Periods After Opening -> 1.34s
Preservation -> 0.57s
States -> 2.19s
Test -> 1.41s
Vitamins -> 2.94s
(Total: 1299s)

After:
All taxonomies are parsable in less than 30s
Additives -> 4.78s
Allergens -> 1.23s
Amino Acids -> 0.77s
Categories -> 25.44s
Countries -> 15.71s
Data Quality -> 1.09s
Food Groups -> 0.96s
Improvements -> 1.41s
Ingredients -> 13.76s
Ingredients Analysis -> 1.25s
Ingredients Processing -> 2.28s
Labels -> 3.95s
Languages -> 14.26s
Minerals -> 4.0s
Misc -> 0.32s
Nova Groups -> 0.71s
Nucleotides -> 1.95s
Nutrients -> 5.87s
Origins -> 5.41s
Other Nutritional Substances -> 1.5s
Packaging Materials -> 2.4s
Packaging Recycling -> 0.66s
Packaging Shapes -> 2.38s
Periods After Opening -> 0.63s
Preservation -> 0.57s
States -> 0.83s
Test -> 0.82s
Vitamins -> 2.40s
(Total: 117s)

@eric-nguyen-cs eric-nguyen-cs requested a review from a team as a code owner December 21, 2023 13:49
@eric-nguyen-cs eric-nguyen-cs changed the title [FEAT] Improve parser performance feat: Improve parser performance Dec 21, 2023
@eric-nguyen-cs
Copy link
Contributor Author

I believe that the tests are not passing because the parent order written to the exported taxonomy file is not deterministic
Maybe solving this issue first is necessary to merge this PR and the previous one

Copy link
Member

@alexgarel alexgarel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really good work.

I have some proposal though.

parser/openfoodfacts_taxonomy_parser/parser/parser.py Outdated Show resolved Hide resolved
parser/openfoodfacts_taxonomy_parser/parser/parser.py Outdated Show resolved Hide resolved
parser/openfoodfacts_taxonomy_parser/parser/parser.py Outdated Show resolved Hide resolved
parser/openfoodfacts_taxonomy_parser/parser/parser.py Outdated Show resolved Hide resolved
parser/openfoodfacts_taxonomy_parser/parser/parser.py Outdated Show resolved Hide resolved
Copy link
Member

@alexgarel alexgarel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great !

Base automatically changed from ericn/decouple-parser-and-db-writer to main January 17, 2024 11:50
@eric-nguyen-cs eric-nguyen-cs merged commit 78fdffc into main Jan 17, 2024
7 checks passed
@eric-nguyen-cs eric-nguyen-cs deleted the ericn/improve-parser-performance branch January 17, 2024 12:01
perierc pushed a commit that referenced this pull request Jan 17, 2024
* refactor: mark private function with _

* refactor(parser): add type annotations and clean up code

* chore: use context manager to close session in tests

* chore: update neo4j and Makefile

* refactor: create parser specific directory

* refactor: start taxonomy_parser by copying parser file

* refactor: move logger to separate file

* refactor: remove unnecessary code for taxonomy parser

* feat: update TaxonomyParser to return taxonomy class

* feat: update parser to use taxonomy parser

* chore: update tests for new taxonomy parser

* fix: remove multi_label for single project_label

* feat: improve node creation performance

* feat: add node id index to improve search query performance

* feat: improve previous link creation performance

* feat: improve child link creation performance

* feat: group queries into transaction

* chore: update logging info and add timing info

* fix: add db name to sessions

* refactor: move ellipsis func to logger class

* fix: stop id index creation if index exists

* fix: resolve comments

* fix: resolve comments
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

3 participants