
Pipeline optimization (Sprint 34–36) #105

Merged: 24 commits from pipeline-optimization into master on Jun 14, 2021
Conversation

anthonyfok (Member) commented on May 25, 2021:

Dear all,

This PR represents the interim result of the ongoing add_data.sh pipeline optimization work as of Friday, June 4, 2021 (previously May 21, 2021; status update on Monday, June 7).

  • Fetch from compressed repos (Stage 1; see the sketch after this list):
    • fetch_csv_xz function
    • Move code to fetch_psra_csv_from_model and merge_csv functions
    • Fetch compressed CSV for PSRA and DSRA too: TODO, deferred to future pull requests
  • Speed up file writes with eatmydata
  • Set synchronous_commit=off for speed
  • Add helper functions: LOG, RUN, INFO, WARN, ERROR
    • Hide secrets (GITHUB_TOKEN, POSTGRES_PASS, ES_PASS) in LOG
    • RUN(): Print line number and function too, with options to turn them off.
  • Add dry-run mode for testing and debugging
  • Make add_data.sh and postgis/*.sh ShellCheck clean
  • Read variables from environment instead of command-line arguments (Fix potential curl error due to KIBANA_ENDPOINT being empty at "Creating PSRA Kibana Index Patterns")
  • Verification of file checksums (Stage 1)
    • Fetch Git LFS pointers of the CSV files to obtain their "oid sha256" checksums
    • Actually using these checksums: TODO, deferred to a future pull request
  • Move all major steps into their own functions, and call them from a new main() function (an idea from the Google Shell Style Guide). To be used as a prerequisite for splitting off the ES section into a separate add_es_data.sh script (was: "Update add_data.sh script to allow running Postgis and ES sections independently") #99
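For illustration, a minimal sketch of what the Stage-1 pieces above might look like; the function bodies, parameters, and the verify_checksum helper are assumptions, not the PR's actual code:

    # Hedged sketch, not the actual add_data.sh implementation.

    # Download an xz-compressed CSV over plain HTTPS (no Git LFS bandwidth)
    # and decompress it on the fly.
    fetch_csv_xz() {
      local repo="$1" file="$2"
      curl -sSL "${repo}/${file}.xz" | xz -d > "${file}"
    }

    # A Git LFS pointer file contains a line "oid sha256:<hex digest>";
    # extract the digest and let sha256sum verify the downloaded file.
    verify_checksum() {
      local file="$1" pointer="$2"
      local sha
      sha=$(grep '^oid sha256:' "${pointer}" | cut -d: -f2)
      echo "${sha}  ${file}" | sha256sum -c -
    }

    # File writes elsewhere are sped up by prefixing commands with eatmydata
    # (which suppresses fsync) and by setting synchronous_commit=off in
    # PostgreSQL, e.g. (illustrative only):
    #   eatmydata psql -c 'SET synchronous_commit TO off; ...'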

It is a work-in-progress, and not quite ready for merging in light of the bugs/caveats below, but can be used for testing. The simplest way to test is probably to check out the pipeline-optimization branch.

2021-06-07 Update: I think it is finally ready for merging. Please see #105 (comment) for details.


Previous notes

Caveats

  • About 2/3 (?) of the files are still being downloaded over Git LFS.
  • I seem to have introduced at least the following new (non-aborting) error; yet to debug:
    ./add_data.sh: line 568: ((: --exposureAgg=b: attempted assignment to non-variable (error token is "=b")
    
    (Update: Not seen in my June 4-to-5 run. The problem fixed itself? Not sure...)
  • As such, no guarantee of correctness (yet).
  • My last run crashed here, apparently due to a missing or empty KIBANA_ENDPOINT variable, but it could be a bug in my RUN() function too. Resolved in commit 40e10fb on June 3, 2021.
    2021-05-22T19:27:12.273954877Z Creating PSRA Kibana Index Patterns
    2021-05-22T19:27:12.304591179Z [add_data] curl -X POST -H 'securitytenant: global' -H 'Content-Type: application/json' /api/saved_objects/index-pattern/psra*all_indicators_s -H 'kbn-xsrf: true' -d '{ "attributes": { "title":"psra*all_indicators_s"}}'
    2021-05-22T19:27:12.314475601Z curl: (3) URL using bad/illegal format or missing URL
    2021-05-22T19:27:12.315562560Z Command exited with non-zero status 3
    
    Original code:
    echo "Creating PSRA Kibana Index Patterns"
    RUN curl -X POST -H "securitytenant: global" -H "Content-Type: application/json" "${KIBANA_ENDPOINT}/api/saved_objects/index-pattern/psra*all_indicators_s" -H "kbn-xsrf: true" -d '{ "attributes": { "title":"psra*all_indicators_s"}}'
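For reference, the failure mode is easy to reproduce: with KIBANA_ENDPOINT empty, the URL passed to curl degenerates to a bare path, and curl exits with code 3 (malformed URL):

    KIBANA_ENDPOINT=""
    curl "${KIBANA_ENDPOINT}/api/saved_objects/index-pattern/test"
    # curl: (3) URL using bad/illegal format or missing URL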

Observations

Sample run log

See 2021-05-22-sample-run.log.

It was generated with:

docker-compose logs -t --tail="all" > "logs/$(date -Is).log"
egrep 'python-opendrr_1    |\[add_data\]|Z real|add_data\.sh' logs/2021-05-25T10:02:16-06:00.log | cut -b37- > 2021-05-22-sample-run.log

(TODO: quick-and-dirty commands; need to improve.)
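A possible cleanup of the above (hypothetical; assumes the same docker-compose log format and service names):

    log="logs/$(date -Is).log"
    docker-compose logs -t --tail=all > "$log"
    grep -E 'python-opendrr_1    |\[add_data\]|Z real|add_data\.sh' "$log" \
      | cut -b37- > sample-run.log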

@github-actions (bot) left a comment:
Lintly has detected code quality issues in this pull request.

@anthonyfok anthonyfok changed the title from "Pipeline optimization (stage 1)" to "Pipeline optimization (Sprint 35)" Jun 2, 2021
@anthonyfok anthonyfok added this to the Sprint 35 milestone Jun 2, 2021
@anthonyfok anthonyfok self-assigned this Jun 2, 2021
@anthonyfok anthonyfok changed the title from "Pipeline optimization (Sprint 35)" to "Pipeline optimization (Sprint 34–35)" Jun 2, 2021
@anthonyfok anthonyfok force-pushed the pipeline-optimization branch 3 times, most recently from 40e10fb to d3196e1 on June 3, 2021 17:25
@anthonyfok anthonyfok added the Bug and Enhancement labels Jun 7, 2021
anthonyfok (Member, Author) commented:

Hi @jvanulde and @drotheram,

I think this PR is finally ready for merging, so please kindly review at your convenience.

While all the original goals are completed (please see the TODOs in the revised top post), my latest test run got as far as "Load Social Fabric Views" before running out of memory. As this is the second-to-last step, I am super lucky and happy!

python-opendrr_1  | 2021-06-05T16:19:07.537772692Z [add_data:809:import_data_from_postgis_to_elasticsearch] RUN: python3 socialFabric_postgres2es.py --type=all_indicators --aggregation=sauid --geometry=geom_poly --idField=Sauid
python-opendrr_1  | 2021-06-05T17:34:21.236350433Z Command terminated by signal 9
python-opendrr_1  | 2021-06-05T17:34:25.132522098Z 162.12user 33.91system 1:15:02elapsed 4%CPU (0avgtext+0avgdata 10087920maxresident)k
python-opendrr_1  | 2021-06-05T17:34:25.132546155Z 1361040inputs+0outputs (165614major+3378895minor)pagefaults 0swaps

I went a bit off-topic in this PR: besides speed optimization, there are some not-directly-related bug fixes and quite a bit of code reorganization. In the future, I'll try to keep my PRs smaller and more focused.

Here are the logs for my recent run, separated by service:

Alternatively, the combined full log: 2021-06-05_full.log

Thanks again!
(And again, my apologies for my delay in other tasks... will be getting to them ASAP!)

Move code to fetch_psra_csv_from_model and merge_csv functions
(with revisions up to 2021-05-20)
to simplify code for PSRA CSV imports

Add fetch_csv_xz function (updated 2021-05-21)
to download from xz-compressed repos for speed and cost savings (no LFS)

See #91
Also, upgrade git to the latest version (2.32.0.rc0 as of this writing)
because "git checkout" for model-inputs got stuck with git 2.28.

See #83
Commands are prefixed with RUN or "is_dry_run || " in add_data.sh
for more verbose logging and to allow dry runs.

Dry-run mode may be enabled by using ADD_DATA_DRY_RUN=true
in the .env file; see sample.env for example.
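A minimal sketch of how the dry-run guard and the RUN wrapper might fit together (assumes bash, and that LOG is defined as elsewhere in this PR; not the actual code):

    is_dry_run() {
      [ "${ADD_DATA_DRY_RUN:-false}" = "true" ]
    }

    RUN() {
      # BASH_LINENO/FUNCNAME give the caller's line number and function name,
      # matching the "[add_data:809:...]" prefix seen in the run logs.
      LOG "[add_data:${BASH_LINENO[0]}:${FUNCNAME[1]}]" "$@"
      is_dry_run || "$@"
    }
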
Hide secrets (GITHUB_TOKEN, POSTGRES_PASS, ES_PASS) in LOG
unless their values are literally "password".

Also fix a bug in LOG() where secrets were not hidden
when there was only one argument.
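A hypothetical sketch of that secret masking; the variable list matches the commit message, but the implementation is an assumption:

    LOG() {
      local msg="$*"
      local secret
      for secret in "${GITHUB_TOKEN:-}" "${POSTGRES_PASS:-}" "${ES_PASS:-}"; do
        # Skip empty values and the literal placeholder "password"
        if [ -n "$secret" ] && [ "$secret" != "password" ]; then
          msg=${msg//"$secret"/***}
        fi
      done
      echo "$msg"
    }
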
Instead of manually passing the needed variables as command-line arguments
to add_data.sh in python/Dockerfile, rely on these variables being already
present in the environment, as defined either in the .env file for Docker
Compose or in the task definition of Amazon ECS.

Also add a quick environment variable check at the beginning of
add_data.sh to warn if any variable is empty.

This fixes the "curl: (3) URL using bad/illegal format or missing URL" error
in "Creating PSRA Kibana Index Patterns", caused by unquoted command-line
arguments in python/Dockerfile which left KIBANA_ENDPOINT empty
when the optional ES_USER and ES_PASS were empty.
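The startup check might look roughly like this (the variable list is illustrative, and everything is reported with WARN here; as noted below, mandatory variables actually abort via ERROR):

    check_environment_variables() {
      local var
      for var in POSTGRES_USER POSTGRES_PASS ES_ENDPOINT KIBANA_ENDPOINT; do
        if [ -z "${!var:-}" ]; then
          WARN "Environment variable ${var} is empty or undefined"
        fi
      done
    }
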
Put the main program in a function called "main" as the bottommost function
from which all the major steps are called.  See the Google Shell Style Guide
at https://google.github.io/styleguide/shellguide.html#s7.8-main

Other changes include:

 * LOG(): Correct quoting for one argument containing single quote
 * ERROR(): Exit the program after showing the error message
 * check_environment_variables(): Abort if a mandatory variable is undefined
 * LOG(): Print line numbers and function names by default too,
   see ADD_DATA_PRINT_LINENO and ADD_DATA_PRINT_FUNCNAME in sample.env
Also: Rename import_data_from_postgis_to_elasticsearch to
export_to_elasticsearch for shorter line lengths in the log.
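A skeleton of that structure, using function names mentioned elsewhere in this PR (the real step list is longer; this is only a sketch):

    main() {
      check_environment_variables
      wait_for_postgres
      # ... each major pipeline step is its own function ...
      export_to_elasticsearch
    }

    main "$@"
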
I originally wanted merge_csv to handle OpenQuake CSV comment-header
stripping, but have not found a good solution yet, so that
functionality remains in fetch_psra_csv_from_model.

This commit fixes the error in the merge_csv function description.
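For illustration, merging per-region CSVs while keeping a single header row might look like this (an assumption, not the merged code):

    merge_csv() {
      local output="$1"; shift
      head -n 1 "$1" > "$output"        # header row from the first file
      local f
      for f in "$@"; do
        tail -n +2 "$f" >> "$output"    # data rows from every file
      done
    }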

INFO "Trying to download pre-generated PostGIS database dump (for speed)..."
if RUN curl -O -v --retry 999 --retry-max-time 0 https://opendrr.eccp.ca/file/OpenDRR/opendrr-boundaries.dump || \
RUN curl -O -v --retry 999 --retry-max-time 0 https://f000.backblazeb2.com/file/OpenDRR/opendrr-boundaries.dump
A reviewer (Contributor) commented:
Can we move this to the github repo or s3 bucket if needed?

anthonyfok (Member, Author) replied:

Good point! Will follow up at #116, probably in a follow-up PR.

Revert my unrelated and undocumented and buggy change to the
-p port setting for pg_isready in wait_for_postgres().
See the reviews at #105 for more details.

Special thanks to @drotheram for catching this bug!
Remove RUN from the awk command, as RUN with '>' ended up prepending
LOG() output into the first line of the merged CSV file.

Special thanks to Drew Rotheram (@drotheram) for catching this!
See reviews at #105 for more information.
This allows the script to be run without error in e.g. the db-opendrr
(postgis) service container where GNU time is not pre-installed.

Note: GNU time is not strictly needed because it is used mainly for
tracking memory usage (Maximum resident set size or "maxresident") only.
It may be installed in Debian-based containers (e.g. db-opendrr) using
"apt update && apt install time".
drotheram (Contributor) commented:

Since the code using Backblaze B2 has already been merged into the main branch, I'm proposing, if there are no other issues, that we merge into the main branch. We can treat the Backblaze-to-S3 migration as a separate issue.

anthonyfok (Member, Author) commented:

Thank you Drew for your amazing review, especially in how you have caught and resolved the bugs in the PR.

Let's "Rebase and merge" this in sync with OpenDRR/model-factory#66.
(I'll wait for you to review OpenDRR/model-factory#66 first.)

@anthonyfok anthonyfok modified the milestones: Sprint 35, Sprint 36 Jun 11, 2021
@anthonyfok anthonyfok changed the title from "Pipeline optimization (Sprint 34–35)" to "Pipeline optimization (Sprint 34–36)" Jun 11, 2021
drotheram (Contributor) commented:

Looks good for the most part so far, but the raw data ingest of PSRA results for YT into PostGIS is failing at runtime for me. Need to investigate further. Otherwise everything else looks good!

Special thanks to @drotheram for catching this glaring mistake of mine!

The error was introduced in commit 50af463 (commented EXPECTED_PT_LIST)
and then in commit 12dbd02 where I renamed it to PT_LIST to replace
the originally fetched list without actually verifying it.
to avoid "curl: (3) URL using bad/illegal format or missing URL"
(a non-fatal error) when ES_CREDENTIALS is empty, where curl would
interpret the quoted empty string "" as a URL.
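One way to avoid the empty-"" problem is to build the optional arguments in a bash array, so an empty ES_CREDENTIALS contributes no arguments at all (a sketch, not necessarily the merged fix):

    curl_auth=()
    if [ -n "${ES_CREDENTIALS:-}" ]; then
      curl_auth=(-u "${ES_CREDENTIALS}")
    fi
    curl "${curl_auth[@]}" -X GET "${ES_ENDPOINT}/_cat/indices"
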
anthonyfok (Member, Author) commented:

Thanks a million Drew! Total respect for your meticulous code review!

@anthonyfok anthonyfok merged commit 784ed80 into master Jun 14, 2021
anthonyfok added two commits that referenced this pull request on Jun 14, 2021 (the pg_isready revert and the awk RUN fix quoted above).
@anthonyfok anthonyfok deleted the pipeline-optimization branch June 14, 2021 19:55