[Meta-issue] Optimize pipeline (python/add_data.sh etc.) #76

anthonyfok · 2021-03-19T22:52:28Z

Goals include:

Use of e.g. /usr/bin/time -v for profiling
docker-compose logs -f -t provides log with timestamp
Some kind of DEBUG variable? E.g. pass the psql flag -a (--echo-all) only in DEBUG mode, so that the log is more concise by default.
Add option to delete downloaded *.gpkg and *.csv files as soon as they have been imported to save space
etc.
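A minimal sketch of how the DEBUG switch and the delete-after-import option from the list above could look in shell. The variable names `DEBUG` and `DELETE_AFTER_IMPORT` and both helper functions are hypothetical and do not exist in add_data.sh; only the psql flag `--echo-all` (`-a`) comes from the issue text.

```shell
# Hypothetical knobs (not existing add_data.sh options):
#   DEBUG=1                echo every SQL statement via psql --echo-all (-a)
#   DELETE_AFTER_IMPORT=1  remove a downloaded file once its import succeeded

# Print the extra psql flag for the current mode (nothing unless DEBUG=1).
psql_debug_flags() {
  if [ "${DEBUG:-0}" = "1" ]; then
    printf '%s\n' "--echo-all"
  fi
}

# import_and_cleanup FILE COMMAND...: run COMMAND with FILE appended as its
# last argument; delete FILE only if COMMAND succeeded and
# DELETE_AFTER_IMPORT=1. A failed import leaves the file in place for a retry.
import_and_cleanup() {
  local file
  file=$1
  shift
  "$@" "$file" || return
  if [ "${DELETE_AFTER_IMPORT:-0}" = "1" ]; then
    rm -f -- "$file"
  fi
}
```

Usage could then look like `psql $(psql_debug_flags) -f some_script.sql` and `import_and_cleanup some_file.csv some_import_command`, where the file and command names are placeholders.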

Maybe in Round 2 of refactoring? Or this round? Need to discuss with Drew first:

Leave the model-factory/scripts/* files where they are instead of copying them?
Use e.g. _build and _data directories to separate our code from downloaded data and temporary build files?

Random ideas, questions, etc.

Make add_data.sh capable of being run over and over again ("incremental build", build stamp, etc.)
- ogr2ogr, if run repeatedly with the same data: -append, -update, or -overwrite
Use Backblaze B2 for large file storage for speed and reduced cost? https://nickb.dev/blog/backblaze-b2-as-a-cheaper-alternative-to-githubs-git-lfs
- GitHub data packs: $0.1/GB/month for storage, $0.1/GB for download (TODO: verify)
- Amazon S3: TODO
- Backblaze B2: $0.005/GB/month for storage, $0.01/GB for download
~~Use eatmydata with PostgreSQL for speed~~ Use fsync=off, synchronous_commit=off and full_page_writes=off instead; see "Speed up database writes with synchronous_commit=off (and full_page_write=off and fsync=off?)" #77
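One way the "build stamp" idea for a re-runnable add_data.sh could be sketched: each step writes an empty stamp file on success, and later runs skip any step whose stamp already exists. The `run_once` helper, the `STAMP_DIR` variable and the `_build/stamps` path are assumptions made for illustration, not part of the current script.

```shell
# Hypothetical location for build stamps; override via the environment.
STAMP_DIR=${STAMP_DIR:-_build/stamps}

# run_once NAME COMMAND...: skip COMMAND if NAME's stamp already exists;
# otherwise run it and write the stamp only if it succeeded, so that a
# failed step is retried on the next run.
run_once() {
  local name stamp
  name=$1
  shift
  stamp="$STAMP_DIR/$name.stamp"
  if [ -e "$stamp" ]; then
    echo "skip: $name (stamp found)"
    return 0
  fi
  "$@" || return
  mkdir -p "$STAMP_DIR"
  : > "$stamp"
}
```

A step would be wrapped as `run_once import_boundaries some_import_function`, with both names being placeholders. Deleting a stamp file forces that step to run again; whether a rerun should then use ogr2ogr's -append, -update or -overwrite remains the open question above.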

anthonyfok · 2021-03-31T18:25:01Z

[Edited] See #88 (comment) for a more complete benchmark (March 19 vs April 27)

Benchmark (in progress, to be edited)

Before:

| Duration | Command |
| --- | --- |
| 2s | `git clone https://github.com/OpenDRR/model-factory.git --depth 1` |
| 4m58s | [Download] `git clone https://github.com/OpenDRR/boundaries.git --depth 1` |
| 3m08s | [Import] ogr2ogr run on the 9 .gpkg files from git clone of OpenDRR/boundaries |
| ... | ... |

After:

| Duration | Command |
| --- | --- |
| 2s | `git clone https://github.com/OpenDRR/model-factory.git --depth 1` |
| 43s to 1m20s | wget `https://opendrr.eccp.ca/file/OpenDRR/opendrr-boundaries.dump` |
| ... | ... |
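Durations like the ones in these tables could be collected with a small wrapper around each pipeline step. This is only a sketch: `format_duration` and `timed` are hypothetical helpers, and the whole-second resolution of `date +%s` is an assumption that suits steps lasting seconds to minutes.

```shell
# format_duration SECONDS: print e.g. 298 as "4m58s" and 2 as "2s".
# Durations of an hour or more are still printed in minutes.
format_duration() {
  local total
  total=$1
  if [ "$total" -ge 60 ]; then
    printf '%dm%02ds\n' "$((total / 60))" "$((total % 60))"
  else
    printf '%ds\n' "$total"
  fi
}

# timed LABEL COMMAND...: run COMMAND, write "<duration> <label>" to stderr,
# and return COMMAND's exit status unchanged.
timed() {
  local label start status
  label=$1
  shift
  start=$(date +%s)
  status=0
  "$@" || status=$?
  echo "$(format_duration "$(( $(date +%s) - start ))") $label" >&2
  return "$status"
}
```

For example, `timed "clone model-factory" git clone https://github.com/OpenDRR/model-factory.git --depth 1` would log one line in the same Duration/Command shape as the tables. For memory and I/O figures, /usr/bin/time -v (listed under the goals) gives more detail than this wrapper.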

anthonyfok mentioned this issue Mar 19, 2021

add_data.sh: Preliminary reorganization #68

Merged

8 tasks

anthonyfok pinned this issue Apr 26, 2021

anthonyfok added this to the Sprint 33 milestone Apr 26, 2021

anthonyfok self-assigned this Apr 26, 2021

anthonyfok added Priority: Should Have Severity: Major Task labels Apr 26, 2021

anthonyfok changed the title ~~[Meta-issue] Optimize python/add_data.sh~~ [Meta-issue] Optimize pipeline (python/add_data.sh etc.) Apr 28, 2021

anthonyfok mentioned this issue Apr 29, 2021

Abort and restart curl download if speed too slow and/or fails #90

Open

jvanulde removed this from the Sprint 33 milestone May 6, 2021

anthonyfok mentioned this issue Jun 10, 2021

Fetch xz-compressed PSRA CSV files instead of LFS-stored ones #117

Open

anthonyfok removed the Severity: Major label Jan 17, 2022
