feat: Improve parser performance #318

eric-nguyen-cs · 2023-12-21T13:49:32Z

What

The goal of this PR is to improve the performance of writing taxonomy data to the Neo4j database, so that every taxonomy can be parsed in a reasonable amount of time.

Note

This PR is quite big, so to simplify the review, you can go through it commit by commit (I tried to keep the commits small and atomic).

Changes

Small taxonomies may see a modest improvement, but the changes mainly target medium and large taxonomies, by:

* batching requests, so as not to pay the cost of query optimisation multiple times
* using range indexes on the node `id` property to drastically improve lookup times in big taxonomies when creating relationships
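
To illustrate the two changes above, here is a minimal sketch rather than the actual code of this PR. The `Entry` label, the `node_id_index` index name, the `is_child_of` relationship type, the batch size, and all helper names are assumptions made for the example. `run_statements` assumes the `execute_write` API of the Neo4j Python driver 5.x. Each batch is sent as one parameterized `UNWIND` query, so the cost of query optimisation is not paid for every node, and the range index lets the `MATCH` on `id` avoid scanning all nodes.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List, Tuple

# All names below (label, index name, relationship type) are hypothetical.
CREATE_ID_INDEX_QUERY = (
    "CREATE RANGE INDEX node_id_index IF NOT EXISTS "
    "FOR (n:Entry) ON (n.id)"
)
CREATE_NODES_QUERY = "UNWIND $rows AS row CREATE (n:Entry) SET n = row"
CREATE_CHILD_LINKS_QUERY = (
    "UNWIND $rows AS row "
    "MATCH (child:Entry {id: row.child_id}) "
    "MATCH (parent:Entry {id: row.parent_id}) "
    "CREATE (child)-[:is_child_of]->(parent)"
)

Row = Dict[str, Any]
Statement = Tuple[str, Dict[str, Any]]


def batched(rows: Iterable[Row], size: int) -> Iterator[List[Row]]:
    """Yield successive lists of at most `size` rows."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(rows)
    while batch := list(islice(iterator, size)):
        yield batch


def build_statements(
    query: str, rows: Iterable[Row], batch_size: int = 1000
) -> List[Statement]:
    """Pair one parameterized query with each batch of rows."""
    return [(query, {"rows": batch}) for batch in batched(rows, batch_size)]


def run_statements(session: Any, statements: List[Statement]) -> None:
    """Run all statements in a single write transaction.

    `session` is assumed to be a Session from the neo4j driver (5.x).
    """

    def work(tx: Any) -> None:
        for query, params in statements:
            tx.run(query, params).consume()

    session.execute_write(work)
```

With 2,500 rows and the default batch size, `build_statements` produces three statements (1000, 1000, and 500 rows). The index query would be run on its own beforehand, for example with `session.run(CREATE_ID_INDEX_QUERY)`, since Neo4j does not allow schema changes and data writes in the same transaction.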

Fixes bug(s)

#289
#183

Part of

Follow-up to the parser/DB writer decoupling PR (#317)

Results

Overall, parsing all taxonomies is about 11x faster (1299s -> 117s in total), driven mainly by the categories taxonomy (730.6s -> 25.44s). Previously, writes were slow enough that HTTP timeouts were possible when importing a taxonomy file; now every taxonomy parses in less than 30s.

Details

| Taxonomy | Before | After |
| --- | --- | --- |
| Additives | 34.93s | 4.78s |
| Allergens | 3.31s | 1.23s |
| Amino Acids | 1.6s | 0.77s |
| Categories | 730.6s | 25.44s |
| Countries | 103.49s | 15.71s |
| Data Quality | 16.08s | 1.09s |
| Food Groups | 1.63s | 0.96s |
| Improvements | 1.06s | 1.41s |
| Ingredients | 255.13s | 13.76s |
| Ingredients Analysis | 1.12s | 1.25s |
| Ingredients Processing | 5.94s | 2.28s |
| Labels | 56.08s | 3.95s |
| Languages | 32.26s | 14.26s |
| Minerals | 6.58s | 4.0s |
| Misc | 0.64s | 0.32s |
| Nova Groups | 0.61s | 0.71s |
| Nucleotides | 1.64s | 1.95s |
| Nutrients | 10.79s | 5.87s |
| Origins | 12.40s | 5.41s |
| Other Nutritional Substances | 2.84s | 1.5s |
| Packaging Materials | 6.90s | 2.4s |
| Packaging Recycling | 0.94s | 0.66s |
| Packaging Shapes | 3.81s | 2.38s |
| Periods After Opening | 1.34s | 0.63s |
| Preservation | 0.57s | 0.57s |
| States | 2.19s | 0.83s |
| Test | 1.41s | 0.82s |
| Vitamins | 2.94s | 2.40s |
| Total | 1299s | 117s |

After the changes, all taxonomies are parsable in less than 30s.
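
For reference, the speedup figures can be recomputed from the timings listed above. This is only a small arithmetic sketch; the `speedup` helper and the dictionary layout are illustrative, and only a few of the larger taxonomies are included.

```python
# Timings in seconds, copied from the before/after figures above.
before = {"Categories": 730.6, "Ingredients": 255.13, "Countries": 103.49}
after = {"Categories": 25.44, "Ingredients": 13.76, "Countries": 15.71}

# Totals over all taxonomies, as reported above.
total_before = 1299
total_after = 117


def speedup(before_s: float, after_s: float) -> float:
    """Return how many times faster the new timing is."""
    return before_s / after_s


overall = speedup(total_before, total_after)  # roughly 11x
per_taxonomy = {name: speedup(before[name], after[name]) for name in before}

for name, factor in per_taxonomy.items():
    print(f"{name}: {factor:.1f}x")
print(f"Overall: {overall:.1f}x")
```

The overall factor of about 11 is a ratio of total times, so it is dominated by the largest taxonomies; some small taxonomies are roughly unchanged or marginally slower.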

eric-nguyen-cs · 2023-12-21T14:34:21Z

I believe the tests are failing because the order in which parents are written to the exported taxonomy file is not deterministic.
We may need to solve this issue first in order to merge this PR and the previous one.
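
As a rough illustration of one possible fix, and not necessarily what ends up being done here: if the nondeterminism comes from iterating over an unordered collection of parents (an assumption, not confirmed in this thread), sorting the parent ids before writing them would give a stable output. The `format_parent_lines` helper and the `<` line prefix are hypothetical.

```python
from typing import Iterable, List


def format_parent_lines(parent_ids: Iterable[str]) -> List[str]:
    """Return one line per parent, in a stable (sorted) order.

    Hypothetical helper: the "<" prefix is an assumed output format.
    """
    return [f"<{parent_id}" for parent_id in sorted(parent_ids)]


# A set has no guaranteed iteration order, but the sorted output is stable.
lines = format_parent_lines({"en:meat", "en:fish", "en:dairy"})
```

Sorting changes the order in which parents appear in the exported file, so it would only be appropriate if the original order in the source taxonomy does not need to be preserved.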

alexgarel

Really good work.

I have some proposals, though.

parser/openfoodfacts_taxonomy_parser/parser/taxonomy_parser.py

parser/openfoodfacts_taxonomy_parser/parser/parser.py

alexgarel

Great!

parser/openfoodfacts_taxonomy_parser/parser/parser.py

* refactor: mark private function with _
* refactor(parser): add type annotations and clean up code
* chore: use context manager to close session in tests
* chore: update neo4j and Makefile
* refactor: create parser specific directory
* refactor: start taxonomy_parser by copying parser file
* refactor: move logger to separate file
* refactor: remove unnecessary code for taxonomy parser
* feat: update TaxonomyParser to return taxonomy class
* feat: update parser to use taxonomy parser
* chore: update tests for new taxonomy parser
* fix: remove multi_label for single project_label
* feat: improve node creation performance
* feat: add node id index to improve search query performance
* feat: improve previous link creation performance
* feat: improve child link creation performance
* feat: group queries into transaction
* chore: update logging info and add timing info
* fix: add db name to sessions
* refactor: move ellipsis func to logger class
* fix: stop id index creation if index exists
* fix: resolve comments
* fix: resolve comments

eric-nguyen-cs added 12 commits December 8, 2023 15:05

refactor: mark private function with _

589b6d9

refactor(parser): add type annotations and clean up code

ae5af55

chore: use context manager to close session in tests

cef659a

chore: update neo4j and Makefile

802091d

refactor: create parser specific directory

30aa918

refactor: start taxonomy_parser by copying parser file

0c9c856

refactor: move logger to separate file

f26f256

refactor: remove unnecessary code for taxonomy parser

13cfce2

feat: update TaxonomyParser to return taxonomy class

3410ad0

feat: update parser to use taxonomy parser

6674876

chore: update tests for new taxonomy parser

cd1e288

Merge branch 'main' into ericn/decouple-parser-and-db-writer

1540015

eric-nguyen-cs requested a review from a team as a code owner December 21, 2023 13:49

github-actions bot assigned eric-nguyen-cs Dec 21, 2023

github-actions bot added parser backend labels Dec 21, 2023

eric-nguyen-cs changed the title ~~[FEAT] Improve parser performance~~ feat: Improve parser performance Dec 21, 2023

eric-nguyen-cs added 10 commits December 21, 2023 15:08

fix: remove multi_label for single project_label

1098bc7

feat: improve node creation performance

3e88470

feat: add node id index to improve search query performance

6d1eda1

feat: improve previous link creation performance

faea1a1

feat: improve child link creation performance

2b4f879

feat: group queries into transaction

095b09a

chore: update logging info and add timing info

c7e03e0

fix: add db name to sessions

c0f6f50

refactor: move ellipsis func to logger class

bdfb252

fix: stop id index creation if index exists

94bfa7d

eric-nguyen-cs force-pushed the ericn/improve-parser-performance branch from 1e7f120 to 94bfa7d Compare December 21, 2023 14:08

alexgarel requested changes Jan 9, 2024

teolemon added the 🎯 P1 label Jan 12, 2024

eric-nguyen-cs force-pushed the ericn/improve-parser-performance branch from 470b04f to 99f1d5f Compare January 12, 2024 09:18

fix: resolve comments

f1e936d

eric-nguyen-cs force-pushed the ericn/improve-parser-performance branch from 99f1d5f to f1e936d Compare January 12, 2024 09:23

eric-nguyen-cs requested a review from alexgarel January 12, 2024 09:27

alexgarel approved these changes Jan 17, 2024

parser/openfoodfacts_taxonomy_parser/parser/parser.py (outdated, resolved)

parser/openfoodfacts_taxonomy_parser/parser/parser.py (resolved)

fix: resolve comments

cf3a993

Base automatically changed from ericn/decouple-parser-and-db-writer to main January 17, 2024 11:50

Merge branch 'main' into ericn/improve-parser-performance

572fc78

eric-nguyen-cs merged commit 78fdffc into main Jan 17, 2024
7 checks passed

eric-nguyen-cs deleted the ericn/improve-parser-performance branch January 17, 2024 12:01

openfoodfacts-bot mentioned this pull request Jan 17, 2024

chore(main): release 1.0.0 #136

Merged

eric-nguyen-cs mentioned this pull request Jan 17, 2024

feat: Improve node creation performance #338

Merged

eric-nguyen-cs mentioned this pull request Apr 14, 2024

feat: Decouple parsing the taxonomy and writing the taxonomy to the database #317

Merged
