Faster import 2 #171

Merged · 8 commits · Oct 3, 2024
Conversation

@laem (Contributor) commented on Oct 3, 2024

Closes #170 by dropping the first technique, which used the CLI sqlite3 .import option: it produces check errors, and the I/O of writing the CSV is too slow.

In this PR, we keep the in-JS sqlite3 import while retaining the other optimizations developed in the first PR.
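
For context, a rough sketch of the two approaches, assuming better-sqlite3 and an illustrative stop_times schema (not this repo's actual code):

```js
// Hypothetical sketch of the two import strategies, not this repo's code.
import { execFileSync } from 'node:child_process';
import Database from 'better-sqlite3';

// (a) Dropped technique: dump rows to a CSV, then shell out to the sqlite3
// CLI. The extra CSV I/O is slow, and .import caused check errors.
// Recent sqlite3 CLI builds accept multiple dot-commands as arguments.
execFileSync('sqlite3', [
  'gtfs.db',
  '.mode csv',
  '.import stop_times.csv stop_times',
]);

// (b) Kept technique: insert in-process with a prepared statement.
const db = new Database('gtfs.db');
const insert = db.prepare(
  'INSERT INTO stop_times (trip_id, stop_id, stop_sequence) VALUES (?, ?, ?)',
);
const rows = [{ trip_id: 't1', stop_id: 's1', stop_sequence: 1 }];
for (const row of rows) {
  insert.run(row.trip_id, row.stop_id, row.stop_sequence);
}
```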

@laem (Contributor, Author) commented on Oct 3, 2024

Interim result: for a rather large dataset (Bretagne), we get 40 seconds of processing compared to 1 minute 04 seconds on the master branch.

The last optimization attempt, wrapping the inserts in db.transaction(), does not appear to speed things up, but it is cleaner IMHO.
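
For reference, the db.transaction() pattern in better-sqlite3 looks roughly like this (table and column names are illustrative):

```js
import Database from 'better-sqlite3';

const db = new Database('gtfs.db');
const insert = db.prepare(
  'INSERT INTO stop_times (trip_id, stop_id, stop_sequence) VALUES (?, ?, ?)',
);

// db.transaction() wraps the callback in BEGIN/COMMIT (rolling back if it
// throws), so a whole batch commits in one fsync instead of one per row.
const insertBatch = db.transaction((rows) => {
  for (const row of rows) {
    insert.run(row.trip_id, row.stop_id, row.stop_sequence);
  }
});

insertBatch([
  { trip_id: 't1', stop_id: 's1', stop_sequence: 1 },
  { trip_id: 't1', stop_id: 's2', stop_sequence: 2 },
]);
```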

Note: some tests are failing. One is about shapes (I didn't look into it) and one, more worrisome, is about get-stops.

@laem (Contributor, Author) commented on Oct 3, 2024

@brendannee I can't figure out why this test fails: getShapesAsGeoJSON() › should return geojson with shapes if they exist.

Could you look into it?

@laem (Contributor, Author) commented on Oct 3, 2024

Interesting! I get a heap limit error when trying the big Paris region GTFS (~1 GB).
I need to check that it doesn't break the prepare + transaction approach of my PR.

[screenshot: JavaScript heap out-of-memory error]

OK, I see what's problematic: in master, lines are popped from the main lines array at each batch, so memory is reclaimed as the import progresses. Here the whole stop_times file no longer fits in the heap, because everything accumulates in the list[] array.

Running with --max-old-space-size=16384 solves the problem, but of course I guess you don't want to impose that requirement. The size of the array could be capped as it was before, just with a value higher than the old maxInsertVariables = 32_000;.
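
A minimal sketch of that capping idea, reusing the transaction wrapper from above; maxBatchRows is a hypothetical name and its value is arbitrary:

```js
import Database from 'better-sqlite3';

const db = new Database('gtfs.db');
const insert = db.prepare(
  'INSERT INTO stop_times (trip_id, stop_id, stop_sequence) VALUES (?, ?, ?)',
);
const insertBatch = db.transaction((rows) => {
  for (const row of rows) insert.run(row.trip_id, row.stop_id, row.stop_sequence);
});

// maxBatchRows is a made-up knob, analogous to the old maxInsertVariables:
// big enough to amortize transaction overhead, small enough to fit the heap.
const maxBatchRows = 200_000;
let batch = [];

// Called for each parsed CSV row: flush whenever the cap is reached, so the
// whole stop_times file never sits in memory at once.
function onCsvRow(row) {
  batch.push(row);
  if (batch.length >= maxBatchRows) {
    insertBatch(batch);
    batch = []; // drop the reference so flushed rows can be GC'd
  }
}

function onCsvEnd() {
  if (batch.length > 0) insertBatch(batch);
}
```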

For the Paris region, this lowers the import time from 4:34 to 2:50 :)

@laem (Contributor, Author) commented on Oct 3, 2024

Result for my whole GTFS collection so far (a part of France, roughly a third):

[this branch] Parse GTFS: 10:26.820 (m:ss.mmm)
[master] Parse GTFS: 16:44.851 (m:ss.mmm)

Final DB size: 9.4 GB / 11 GB.

@brendannee merged commit 59fbd75 into BlinkTagInc:master on Oct 3, 2024 (0 of 2 checks passed).
@laem (Contributor, Author) commented on Oct 3, 2024

Wait @brendannee, some tests did not pass 😅

@brendannee (Member) commented

Thanks so much for this big improvement.

I released a new version: https://github.com/BlinkTagInc/node-gtfs/releases/tag/4.15.0 (with a few other reorganizations).

Let me know what you think and if you have other ideas for improvements.

@brendannee (Member) commented

I fixed the issue with the tests that did not pass - so it should be good to go.
