Pipeline optimization (Sprint 34–36) #105
Conversation
Lintly has detected code quality issues in this pull request.
Hi @jvanulde and @drotheram, I think this PR is finally ready for merging, so please kindly review at your convenience. All the original goals are completed (please see the TODOs in the revised top post), and my latest test run got as far as "Load Social Fabric Views" before running out of memory. As that is the second-to-last step, I am super lucky and happy!
I went a bit off-topic in this PR: besides speed optimization, it includes some not-directly-related bug fixes and a fair amount of code reorganization. In the future, I'll try to keep my PRs smaller and more focused. Here are the logs for my recent run, separated by service:
Alternatively, the combined full log: 2021-06-05_full.log. Thanks again!
(With revision up to 2021-05-20)
to simplify code for PSRA CSV imports (Updated 2021-05-21)
to download from xz-compressed repos for speed and cost-saving (no LFS). See #91.
Also, upgrade git to the latest version (2.32.0.rc0 as of this writing) because "git checkout" for model-inputs got stuck with git 2.28. See #83.
for a little bit of time-saving.
Commands are prefixed with RUN or "is_dry_run || " in add_data.sh for more verbose logging and to allow dry runs. Dry-run mode may be enabled by setting ADD_DATA_DRY_RUN=true in the .env file; see sample.env for an example.
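As a rough illustration of this pattern (a sketch only; the actual add_data.sh implementation may differ):

```sh
# Return success only when dry-run mode is enabled via the environment.
is_dry_run() {
  [[ "${ADD_DATA_DRY_RUN:-false}" == "true" ]]
}

# Log the command, then execute it unless in dry-run mode.
RUN() {
  echo "RUN: $*"
  is_dry_run || "$@"
}

RUN curl -O https://example.org/data.csv   # example invocation
```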
unless their values are literally "password". Also fix a bug in LOG() where secrets were not hidden when there was only one argument.
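A hypothetical sketch of that masking behavior (the helper name and variable list are assumptions, not the actual LOG() internals):

```sh
# Replace known secret values in a message with a placeholder, unless the
# value is literally "password" (i.e. an unchanged default, not a secret).
mask_secrets() {
  local msg="$1" secret
  for secret in "${ES_PASS:-}" "${POSTGRES_PASS:-}"; do
    if [[ -n "$secret" && "$secret" != "password" ]]; then
      msg="${msg//"$secret"/<hidden>}"
    fi
  done
  echo "$msg"
}
```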
Instead of manually passing the needed variables as command-line arguments to add_data.sh in python/Dockerfile, rely on these variables already being in the environment, as defined either in the .env file for Docker Compose or in the task definition of Amazon ECS. Also add a quick environment-variable check at the beginning of add_data.sh to warn if any variable is empty. This fixes the "curl: (3) URL using bad/illegal format or missing URL" error in "Creating PSRA Kibana Index Patterns" caused by unquoted command-line arguments in python/Dockerfile, which left KIBANA_ENDPOINT empty when the optional ES_USER and ES_PASS are empty.
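A minimal sketch of what such a startup check might look like (the variable names are illustrative, not necessarily the exact list add_data.sh verifies):

```sh
# Warn early if any expected environment variable is empty or undefined.
for var in POSTGRES_USER POSTGRES_PASS KIBANA_ENDPOINT; do
  if [[ -z "${!var:-}" ]]; then
    echo "WARNING: environment variable $var is empty or undefined" >&2
  fi
done
```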
Put the main program in a function called "main" as the bottommost function from which all the major steps are called. See the Google Shell Style Guide at https://google.github.io/styleguide/shellguide.html#s7.8-main

Other changes include:
- LOG(): correct quoting for one argument containing a single quote
- ERROR(): exit the program after showing the message
- check_environment_variables(): abort if a mandatory variable is undefined
- LOG(): print line numbers and function names by default too; see ADD_DATA_PRINT_LINENO and ADD_DATA_PRINT_FUNCNAME in sample.env
Also: Rename import_data_from_postgis_to_elasticsearch to export_to_elasticsearch for shorter line lengths in the log.
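The overall structure follows the style guide's recommendation; a skeletal sketch (the step functions here are placeholders, not the actual add_data.sh steps):

```sh
main() {
  check_environment_variables   # abort early on missing mandatory variables
  download_boundaries           # placeholder for the major pipeline steps
  export_to_elasticsearch       # renamed function mentioned above
}

main "$@"
```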
Lintly flake8 issues were resolved in #110.
I originally wanted to make merge_csv handle OpenQuake CSV comment-header stripping, but I have not found a good solution yet, so that functionality remains in fetch_psra_csv_from_model. This commit fixes the error in the merge_csv function description.
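For context, OpenQuake result CSVs typically begin with a comment line starting with "#". A hypothetical one-liner for stripping it (illustration only, not the actual fetch_psra_csv_from_model logic):

```sh
# Delete the first line of the CSV, but only if it is a '#' comment line.
sed -i '1{/^#/d}' psra_result.csv
```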
```sh
INFO "Trying to download pre-generated PostGIS database dump (for speed)..."
if RUN curl -O -v --retry 999 --retry-max-time 0 https://opendrr.eccp.ca/file/OpenDRR/opendrr-boundaries.dump || \
   RUN curl -O -v --retry 999 --retry-max-time 0 https://f000.backblazeb2.com/file/OpenDRR/opendrr-boundaries.dump
```
Can we move this to the GitHub repo or an S3 bucket if needed?
Good point! Will follow up at #116, probably in a follow-up PR.
Revert my unrelated, undocumented, and buggy change to the -p port setting for pg_isready in wait_for_postgres(). See the reviews at #105 for more details. Special thanks to @drotheram for catching this bug!
Remove RUN from the awk command, as RUN with '>' ended up prepending LOG() output to the first line of the merged CSV file. Special thanks to Drew Rotheram (@drotheram) for catching this! See the reviews at #105 for more information.
This allows the script to be run without error in, e.g., the db-opendrr (postgis) service container where GNU time is not pre-installed. Note: GNU time is not strictly needed because it is used mainly for tracking memory usage (maximum resident set size, or "maxresident"). It may be installed in Debian-based containers (e.g. db-opendrr) using "apt update && apt install time".
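One way such an optional use of GNU time might look (a sketch; the function name and the timed command are assumptions):

```sh
# Run a command under GNU time (for "maxresident" memory tracking) only
# when GNU time is installed; otherwise run the command directly.
run_timed() {
  if command -v /usr/bin/time > /dev/null; then
    /usr/bin/time -v "$@"
  else
    "$@"
  fi
}

run_timed psql -h db-opendrr -c 'SELECT 1;'   # illustrative invocation
```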
Since the code using Backblaze B2 has already been merged into the main branch, I'm proposing, if there are no other issues, that we merge this into the main branch. We can treat the Backblaze-to-S3 migration as a separate issue.
Thank you Drew for your amazing review, especially how you caught and resolved the bugs in the PR. Let's "Rebase and merge" this in sync with OpenDRR/model-factory#66.
Looks good for the most part so far, but raw data ingest of PSRA results for YT into PostGIS is failing at runtime for me. Need to investigate further. Otherwise everything else looks good!
Special thanks to @drotheram for catching this glaring mistake of mine! The error was introduced in commit 50af463 (which commented out EXPECTED_PT_LIST) and then in commit 12dbd02, where I renamed it to PT_LIST to replace the originally fetched list without actually verifying it.
to avoid "curl: (3) URL using bad/illegal format or missing URL" (non-fatal error) when ES_CREDENTIALS is empty where curl would interpret the quoted empty string "" as an URL.
Thanks a million Drew! Total respect for your meticulous code review!
Dear all,
This PR represents an interim result of the ongoing add_data.sh pipeline optimization work as of Friday, May 21, 2021 (updated Friday, June 4, 2021; status update on Monday, June 7).

Remaining TODOs:
- Fetch compressed CSV for PSRA and DSRA too: deferred to future pull requests
- Actually use these checksums: deferred to a future pull request

It is a work in progress, and not quite ready for merging in light of the bugs/caveats below, but it can be used for testing. The simplest way to test is probably to check out the pipeline-optimization branch.

2021-06-07 update: I think it is finally ready for merging. Please see #105 (comment) for details.
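For example (assuming the repository is already cloned locally):

```sh
# Fetch and switch to the PR branch for testing.
git fetch origin
git checkout pipeline-optimization
```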
Previous notes
Caveats
My last run crashed here, apparently due to a missing or empty KIBANA_ENDPOINT variable, but it could be a bug in my RUN() function too. Resolved in commit 40e10fb on June 3, 2021.

Observations
OpenDRR/model-factory/DSRA_outputs2postgres_lfs.py. But yes, I am glad that Will suggested the need to refactor those older DSRA scripts (see "add_data.sh related Python scripts - flexible data loading", model-factory#53).

Sample run log
See 2021-05-22-sample-run.log.
It was generated with:
(TODO: quick-and-dirty commands; need to improve.)