
Pipeline optimization (Sprint 34–36) #105

Merged: 24 commits from pipeline-optimization into master on Jun 14, 2021
Conversation

anthonyfok (Member) commented on May 25, 2021:

Dear all,

This PR represents the interim result of the ongoing add_data.sh pipeline optimization work as of Friday, June 4, 2021 (previously May 21, 2021; status update on Monday, June 7).

  • Fetch from compressed repos (Stage 1; see the sketch after this list):
    • fetch_csv_xz function
    • Move code to fetch_psra_csv_from_model and merge_csv functions
    • Fetch compressed CSV for PSRA and DSRA too: TODO, deferred to future pull requests
  • Speed up file writes with eatmydata
  • Set synchronous_commit=off for speed
  • Add helper functions: LOG, RUN, INFO, WARN, ERROR
    • Hide secrets (GITHUB_TOKEN, POSTGRES_PASS, ES_PASS) in LOG
    • RUN(): Print line number and function too, with options to turn them off.
  • Add dry-run mode for testing and debugging
  • Make add_data.sh and postgis/*.sh ShellCheck clean
  • Read variables from environment instead of command-line arguments (Fix potential curl error due to KIBANA_ENDPOINT being empty at "Creating PSRA Kibana Index Patterns")
  • Verification of file checksums (Stage 1)
    • Fetch Git LFS pointers of the CSV files to obtain their "oid sha256" checksums
    • Actually using these checksums: TODO, deferred to a future pull request
  • Move all major steps into their own functions, and call them from a new main() function (an idea from the Google Shell Style Guide). To be used as a prerequisite for splitting off the ES section into a separate add_es_data.sh script (was: "Update add_data.sh script to allow running Postgis and ES sections independently") #99
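For illustration, a minimal sketch of what the Stage-1 pieces above might look like; the function bodies, parameters, and the verify_checksum helper are assumptions, not the PR's actual code:

    # Hedged sketch, not the actual add_data.sh implementation.

    # Download an xz-compressed CSV over plain HTTPS (no Git LFS bandwidth)
    # and decompress it on the fly.
    fetch_csv_xz() {
      local repo="$1" file="$2"
      curl -sSL "${repo}/${file}.xz" | xz -d > "${file}"
    }

    # A Git LFS pointer file contains a line "oid sha256:<hex digest>";
    # extract the digest and let sha256sum verify the downloaded file.
    verify_checksum() {
      local file="$1" pointer="$2"
      local sha
      sha=$(grep '^oid sha256:' "${pointer}" | cut -d: -f2)
      echo "${sha}  ${file}" | sha256sum -c -
    }

    # File writes elsewhere are sped up by prefixing commands with eatmydata
    # (which suppresses fsync) and by setting synchronous_commit=off in
    # PostgreSQL, e.g. (illustrative only):
    #   eatmydata psql -c 'SET synchronous_commit TO off; ...'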

It is a work-in-progress, and not quite ready for merging in light of the bugs/caveats below, but can be used for testing. The simplest way to test is probably to check out the pipeline-optimization branch.

2021-06-07 Update: I think it is finally ready for merging. Please see #105 (comment) for details.


Previous notes

Caveats

  • About 2/3 (?) of the files are still being downloaded over Git LFS.
  • I seem to have introduced at least the following new (non-aborting) error; yet to debug:
    ./add_data.sh: line 568: ((: --exposureAgg=b: attempted assignment to non-variable (error token is "=b")
    
    (Update: Not seen in my June 4-to-5 run. The problem fixed itself? Not sure...)
  • As such, no guarantee of correctness (yet).
  • My last run crashed here, apparently due to a missing or empty KIBANA_ENDPOINT variable, but it could be a bug in my RUN() function too. Resolved in commit 40e10fb on June 3, 2021.
    2021-05-22T19:27:12.273954877Z Creating PSRA Kibana Index Patterns
    2021-05-22T19:27:12.304591179Z [add_data] curl -X POST -H 'securitytenant: global' -H 'Content-Type: application/json' /api/saved_objects/index-pattern/psra*all_indicators_s -H 'kbn-xsrf: true' -d '{ "attributes": { "title":"psra*all_indicators_s"}}'
    2021-05-22T19:27:12.314475601Z curl: (3) URL using bad/illegal format or missing URL
    2021-05-22T19:27:12.315562560Z Command exited with non-zero status 3
    
    Original code:
    echo "Creating PSRA Kibana Index Patterns"
    RUN curl -X POST -H "securitytenant: global" -H "Content-Type: application/json" "${KIBANA_ENDPOINT}/api/saved_objects/index-pattern/psra*all_indicators_s" -H "kbn-xsrf: true" -d '{ "attributes": { "title":"psra*all_indicators_s"}}'
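For reference, the failure mode is easy to reproduce: with KIBANA_ENDPOINT empty, the URL passed to curl degenerates to a bare path, and curl exits with code 3 (malformed URL):

    KIBANA_ENDPOINT=""
    curl "${KIBANA_ENDPOINT}/api/saved_objects/index-pattern/test"
    # curl: (3) URL using bad/illegal format or missing URL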

Observations

Sample run log

See 2021-05-22-sample-run.log.

It was generated with:

docker-compose logs -t --tail="all" > "logs/$(date -Is).log"
egrep 'python-opendrr_1    |\[add_data\]|Z real|add_data\.sh' logs/2021-05-25T10:02:16-06:00.log | cut -b37- > 2021-05-22-sample-run.log

(TODO: quick-and-dirty commands; need to improve.)
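A possible cleanup of the above (hypothetical; assumes the same docker-compose log format and service names):

    log="logs/$(date -Is).log"
    docker-compose logs -t --tail=all > "$log"
    grep -E 'python-opendrr_1    |\[add_data\]|Z real|add_data\.sh' "$log" \
      | cut -b37- > sample-run.log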

@github-actions (bot) left a comment:
Lintly has detected code quality issues in this pull request.

@anthonyfok anthonyfok changed the title from "Pipeline optimization (stage 1)" to "Pipeline optimization (Sprint 35)" Jun 2, 2021
@anthonyfok anthonyfok added this to the Sprint 35 milestone Jun 2, 2021
@anthonyfok anthonyfok self-assigned this Jun 2, 2021
@anthonyfok anthonyfok changed the title from "Pipeline optimization (Sprint 35)" to "Pipeline optimization (Sprint 34–35)" Jun 2, 2021
@anthonyfok anthonyfok force-pushed the pipeline-optimization branch 3 times, most recently from 40e10fb to d3196e1 on June 3, 2021 17:25
@anthonyfok anthonyfok added the Bug and Enhancement labels Jun 7, 2021
anthonyfok (Member, Author) commented:

Hi @jvanulde and @drotheram,

I think this PR is finally ready for merging, so please kindly review at your convenience.

While all the original goals are completed (please see the TODOs in the revised top post), my latest test run got as far as "Load Social Fabric Views" before running out of memory. As this is the second-to-last step, I am super lucky and happy!

python-opendrr_1  | 2021-06-05T16:19:07.537772692Z [add_data:809:import_data_from_postgis_to_elasticsearch] RUN: python3 socialFabric_postgres2es.py --type=all_indicators --aggregation=sauid --geometry=geom_poly --idField=Sauid
python-opendrr_1  | 2021-06-05T17:34:21.236350433Z Command terminated by signal 9
python-opendrr_1  | 2021-06-05T17:34:25.132522098Z 162.12user 33.91system 1:15:02elapsed 4%CPU (0avgtext+0avgdata 10087920maxresident)k
python-opendrr_1  | 2021-06-05T17:34:25.132546155Z 1361040inputs+0outputs (165614major+3378895minor)pagefaults 0swaps

I went a bit off-topic in this PR: besides speed optimization, there are some not-directly-related bug fixes and quite a bit of code reorganization. In the future, I'll try to keep my PRs smaller and more focused.

Here are the logs for my recent run, separated by service:

Alternatively, the combined full log: 2021-06-05_full.log

Thanks again!
(And again, my apologies for my delay in other tasks... will be getting to them ASAP!)

Move code to fetch_psra_csv_from_model and merge_csv functions
(with revisions up to 2021-05-20)
to simplify code for PSRA CSV imports

Add fetch_csv_xz function (updated 2021-05-21)
to download from xz-compressed repos for speed and cost savings (no LFS)

See #91
Also, upgrade git to the latest version (2.32.0.rc0 as of this writing)
because "git checkout" for model-inputs got stuck with git 2.28.

See #83
Commands are prefixed with RUN or "is_dry_run || " in add_data.sh
for more verbose logging and to allow dry runs.

Dry-run mode may be enabled by using ADD_DATA_DRY_RUN=true
in the .env file; see sample.env for example.
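A minimal sketch of how the dry-run guard and the RUN wrapper might fit together (assumes bash, and that LOG is defined as elsewhere in this PR; not the actual code):

    is_dry_run() {
      [ "${ADD_DATA_DRY_RUN:-false}" = "true" ]
    }

    RUN() {
      # BASH_LINENO/FUNCNAME give the caller's line number and function name,
      # matching the "[add_data:809:...]" prefix seen in the run logs.
      LOG "[add_data:${BASH_LINENO[0]}:${FUNCNAME[1]}]" "$@"
      is_dry_run || "$@"
    }
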
Hide secrets (GITHUB_TOKEN, POSTGRES_PASS, ES_PASS) in LOG
unless their values are literally "password".

Also fix a bug in LOG() where secrets were not hidden
when there was only one argument.
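A hypothetical sketch of that secret masking; the variable list matches the commit message, but the implementation is an assumption:

    LOG() {
      local msg="$*"
      local secret
      for secret in "${GITHUB_TOKEN:-}" "${POSTGRES_PASS:-}" "${ES_PASS:-}"; do
        # Skip empty values and the literal placeholder "password"
        if [ -n "$secret" ] && [ "$secret" != "password" ]; then
          msg=${msg//"$secret"/***}
        fi
      done
      echo "$msg"
    }
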
Instead of manually passing the needed variables as command-line arguments
to add_data.sh in python/Dockerfile, rely on these variables being already
present in the environment, as defined either in the .env file for Docker
Compose or in the task definition of Amazon ECS.

Also add a quick environment variable check at the beginning of
add_data.sh to warn if any variable is empty.

This fixes the "curl: (3) URL using bad/illegal format or missing URL" error
in "Creating PSRA Kibana Index Patterns", caused by unquoted command-line
arguments in python/Dockerfile which left KIBANA_ENDPOINT empty
when the optional ES_USER and ES_PASS were empty.
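The startup check might look roughly like this (the variable list is illustrative, and everything is reported with WARN here; as noted below, mandatory variables actually abort via ERROR):

    check_environment_variables() {
      local var
      for var in POSTGRES_USER POSTGRES_PASS ES_ENDPOINT KIBANA_ENDPOINT; do
        if [ -z "${!var:-}" ]; then
          WARN "Environment variable ${var} is empty or undefined"
        fi
      done
    }
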
Put the main program in a function called "main" as the bottommost function
from which all the major steps are called.  See the Google Shell Style Guide
at https://google.github.io/styleguide/shellguide.html#s7.8-main

Other changes include:

 * LOG(): Correct quoting for one argument containing single quote
 * ERROR(): Exit the program after showing the error message
 * check_environment_variables(): Abort if a mandatory variable is undefined
 * LOG(): Print line numbers and function names by default too,
   see ADD_DATA_PRINT_LINENO and ADD_DATA_PRINT_FUNCNAME in sample.env
Also: Rename import_data_from_postgis_to_elasticsearch to
export_to_elasticsearch for shorter line lengths in the log.
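A skeleton of that structure, using function names mentioned elsewhere in this PR (the real step list is longer; this is only a sketch):

    main() {
      check_environment_variables
      wait_for_postgres
      # ... each major pipeline step is its own function ...
      export_to_elasticsearch
    }

    main "$@"
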
I originally wanted merge_csv to handle OpenQuake CSV comment-header
stripping, but have not found a good solution yet, so that
functionality remains in fetch_psra_csv_from_model.

This commit fixes the error in the merge_csv function description.
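For illustration, merging per-region CSVs while keeping a single header row might look like this (an assumption, not the merged code):

    merge_csv() {
      local output="$1"; shift
      head -n 1 "$1" > "$output"        # header row from the first file
      local f
      for f in "$@"; do
        tail -n +2 "$f" >> "$output"    # data rows from every file
      done
    }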

INFO "Trying to download pre-generated PostGIS database dump (for speed)..."
if RUN curl -O -v --retry 999 --retry-max-time 0 https://opendrr.eccp.ca/file/OpenDRR/opendrr-boundaries.dump || \
RUN curl -O -v --retry 999 --retry-max-time 0 https://f000.backblazeb2.com/file/OpenDRR/opendrr-boundaries.dump
A reviewer (Contributor) commented:
Can we move this to the github repo or s3 bucket if needed?

anthonyfok (Member, Author) replied:

Good point! Will follow up at #116, probably in a follow-up PR.

Revert my unrelated and undocumented and buggy change to the
-p port setting for pg_isready in wait_for_postgres().
See the reviews at #105 for more details.

Special thanks to @drotheram for catching this bug!
Remove RUN from the awk command, as RUN with '>' ended up prepending
LOG() output into the first line of the merged CSV file.

Special thanks to Drew Rotheram (@drotheram) for catching this!
See reviews at #105 for more information.
This allows the script to be run without error in e.g. the db-opendrr
(postgis) service container where GNU time is not pre-installed.

Note: GNU time is not strictly needed because it is used mainly for
tracking memory usage (Maximum resident set size or "maxresident") only.
It may be installed in Debian-based containers (e.g. db-opendrr) using
"apt update && apt install time".
drotheram (Contributor) commented:

Since the code using Backblaze B2 has already been merged into the main branch, I'm proposing, if there are no other issues, that we merge into the main branch. We can treat the Backblaze-to-S3 migration as a separate issue.

anthonyfok (Member, Author) commented:

Thank you Drew for your amazing review, especially in how you have caught and resolved the bugs in the PR.

Let's "Rebase and merge" this in sync with OpenDRR/model-factory#66.
(I'll wait for you to review OpenDRR/model-factory#66 first.)

@anthonyfok anthonyfok modified the milestones: Sprint 35, Sprint 36 Jun 11, 2021
@anthonyfok anthonyfok changed the title from "Pipeline optimization (Sprint 34–35)" to "Pipeline optimization (Sprint 34–36)" Jun 11, 2021
drotheram (Contributor) commented:

Looks good for the most part so far, but the raw data ingest of PSRA results for YT into PostGIS is failing at runtime for me. Need to investigate further. Otherwise everything else looks good!

Special thanks to @drotheram for catching this glaring mistake of mine!

The error was introduced in commit 50af463 (commented EXPECTED_PT_LIST)
and then in commit 12dbd02 where I renamed it to PT_LIST to replace
the originally fetched list without actually verifying it.
to avoid "curl: (3) URL using bad/illegal format or missing URL"
(a non-fatal error) when ES_CREDENTIALS is empty, where curl would
interpret the quoted empty string "" as a URL.
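One way to avoid the empty-"" problem is to build the optional arguments in a bash array, so an empty ES_CREDENTIALS contributes no arguments at all (a sketch, not necessarily the merged fix):

    curl_auth=()
    if [ -n "${ES_CREDENTIALS:-}" ]; then
      curl_auth=(-u "${ES_CREDENTIALS}")
    fi
    curl "${curl_auth[@]}" -X GET "${ES_ENDPOINT}/_cat/indices"
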
anthonyfok (Member, Author) commented:

Thanks a million Drew! Total respect for your meticulous code review!

@anthonyfok anthonyfok merged commit 784ed80 into master Jun 14, 2021
anthonyfok added two commits that referenced this pull request on Jun 14, 2021 (the pg_isready revert and the awk RUN fix quoted above).
@anthonyfok anthonyfok deleted the pipeline-optimization branch June 14, 2021 19:55