Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add parallelism to major chunk of hooks. Check Parallelism section in README #620

Merged
merged 55 commits into from
Feb 17, 2024

Conversation

MaxymVlasov
Copy link
Collaborator

@MaxymVlasov MaxymVlasov commented Feb 9, 2024

Put an x into the box if that apply:

  • This PR introduces breaking change.
  • This PR fixes a bug.
  • This PR adds new functionality.
  • This PR enhances existing functionality.

Reasoning

We have a GH workflow that runs lockflies updates every week (implementation and reasoning here).
It usually takes from 2h 30min to 3h 15min. That was fine for us, till we found that our GH runners, based on AWS EC2s, started silently failing after 30min "without recent logs", and that was fixed by crutch which sends a dummy log every 10min.

However, during the debugging, I spent some time describing why hooks were not utilizing all the provided resources.
And that means a waste of time and money, not only for that corner case but for every huge commit, which can cause opting out by git commit -n of using hooks locally for changes that affect many directories.

Description of your changes

  • Add per-hook --parallelism-limit setting to --hook-config. Defaults to number of logical CPUs - 1
  • Main implementation uses flock, which not pre-installed in macOS and some ancient BSD/Linux systems.
    • To maintain backward compatibility, for cases where no flock - fallback to "bash-native locking mechanism" was added.
  • As quick tests show, ~5% of stacks face race condition problem, no matter if any locking mechanism exists or dirs try to init in parallel. I suppose the lock failed as it uses disk when hooks run in memory, so the creation of the lock can take some time as there bunch of caches between Mem and Disk. These milliseconds are enough to allow running a few t init in parallel.
  • Final implementation uses a retry mechanism for cases when race condition failed to t init directory.

In quick tests, I can say that on big changes:

  • With flock, there 300-500% speed increase for most of the hooks.
  • Without flock, there 250-450% speed increase for most of the hooks.
  • Up to 2000% speed increase for terraform_validate, and up to 500% - for other affected hooks.
  • When --parallelism-limit=1 I observed an insignificant increase in time (about 5-10%) compared to v1.86.0 which has no parallelism at all. This may be the cost of maintaining parallelism or the result of external factors since the tests were not conducted in a vacuum.

For small changes, improvements are less significant.

How can we test changes

There is no easy way to test it, but that could be used as a testing template
repos:
- repo: https://github.com/antonbabenko/pre-commit-terraform
  rev: f8d26f85b054e2842ef1b0c5bcef82d8c8cf1832
  hooks:
    # - id: infracost_breakdown
    #   args:
    #     # - --args=--config-file=test/infracost.yml
    #     - --args=--path=./environment/qa
    #     - --args=--terraform-plan-flags="-var-file=../../test.tfvars"t
    #     - --args=--terraform-plan-flags="-var-file=../../test2.tfvars"
    #     - --hook-config='.totalHourlyCost >= 0.1'
    #     - --hook-config=".totalHourlyCost|tonumber > 1"
    #     - -h ".projects[].diff.totalMonthlyCost|tonumber!=10000"
    #     - -h [.projects[].diff.totalMonthlyCost | select (.!=null) | tonumber] | add > 1000
    #   verbose: true # Always show costs

    # - id: terraform_checkov
    #   args:
    #   #   - --args=--quiet
    #     - --args=--skip-check CKV_SECRET_14
    #     # - --env-vars=ANSI_COLORS_DISABLED=true
    #     # - --hook-config=--parallelism-limit=1


    # - id: terraform_docs
    #   args:
    #   # - --hook-config=--path-to-file=README.md        # Valid UNIX path. I.e. ../TFDOC.md or docs/README.md etc.
    #   # - --hook-config=--add-to-existing-file=true     # Boolean. true or false
    #   - --hook-config=--create-file-if-not-exist=true # Boolean. true or false
    #   # - --args=--config=__GIT_WORKING_DIR__/.terraform-docs.yml
    #   - --args=--config=.terraform-docs.yml
    #   # - --args=--lockfile=false
    #   # - --args=--max-depth=2


    # - id: terraform_fmt
    #   exclude: environment\/.*$


    # - id: terraform_providers_lock
    #   args:
    #   # - --args=-platform=windows_amd64
    #   # - --args=-platform=darwin_amd64
    #   - --args=-platform=linux_amd64

    #   - --hook-config=--mode=always-regenerate-lockfile
    #   # - --hook-config=--parallelism-limit=CPU_NUM*2
    #   verbose: true

    # - id: terraform_tflint
    #   args:
    #     - --args=--config=__GIT_WORKING_DIR__/.config/tflint.hcl
    #     # # - '--args=--config=__GIT_WORKING_DIR__/${CONFIG_FILE}' # export CONFIG_FILE=.tflint.hcl
    #     # - --args=--config=__GIT_WORKING_DIR__/${CONFIG_NAME}.${CONFIG_EXT} # export CONFIG_NAME=.tflint; export CONFIG_EXT=hcl
    #     - --args=--module

    # - id: terraform_tfsec
    #   args:
    #       - --args=--config-file=.tfsec.json
    #       - --args=--force-all-dirs
    #       - --args=--exclude-downloaded-modules
    #       - --args=--concise-output
    #       - --args=--config-file=__GIT_WORKING_DIR__/.tfsec.json

    # - id: terraform_trivy
    #   args:
    #     - --hook-config=--parallelism-limit=CPU_NUM*4


    # - id: terraform_validate
    #   args:
    #     - --tf-init-args=-upgrade
    #     - --tf-init-args=-get=true
    #     - --env-vars=AWS_DEFAULT_REGION="us-west-2"
    #     - --env-vars=AWS_ACCESS_KEY_ID="anaccesskey"
    #     - --env-vars=AWS_SECRET_ACCESS_KEY="asecretkey"
    #     - --hook-config=--retry-once-with-cleanup=true     # Boolean. true or false
    # #    - --hook-config=--parallelism-limit=CPU_NUM*4
    #   verbose: true



    # - id: terraform_validate
    #   args:
    #     - --hook-config=--retry-once-with-cleanup=true


    # - id: terragrunt_fmt


    # - id: terragrunt_validate


    # - id: terrascan
    #   args:
    #     - --args=--non-recursive # avoids scan errors on subdirectories without Terraform config files
    #     - --args=--policy-type aws
    #     # - --hook-config=--parallelism-limit=CPU_NUM*2


    - id: tfupdate
      name: Autoupdate AWS provider versions
      args:

        - --args=provider aws
        - --args=--version '~> 4.2.0'
        - --hook-config=--parallelism-limit=1


    # - id: tfupdate
    #   name: Autoupdate Helm provider versions
    #   args:
    #     - --args=provider helm
    #     - --args=-v 2.5.0
    #     - --args=--ignore-path sdfs/sfdfs

    # - id: tfupdate
    #   name: Update Google provider to specified version
    #   args:
    #     - --args=provider google
    #     - --args=--ignore-path sdfs/sfdfs
    #     - --args=--version ">= 4.18.0, < 5"

    # - id: tfupdate
    #   name: Update Terraform provider to specified version
    #   args:
    #     - --args=terraform
    #     - --args=--version "> 1.1.8"
    #     - --env-vars=AWS_DEFAULT_REGION="us-west-2"
    #     - --env-vars=AWS_ACCESS_KEY_ID="anaccesskey"
    #     - --env-vars=AWS_SECRET_ACCESS_KEY="asecretkey"

README.md Outdated Show resolved Hide resolved
@MaxymVlasov MaxymVlasov requested review from antonbabenko and robinbowes and removed request for robinbowes February 13, 2024 13:33
README.md Show resolved Hide resolved
hooks/_common.sh Outdated Show resolved Hide resolved
…623)

So, I found that `nproc` always shows how many CPUs available is. K8s "limits" and docker `--cpus` are throttling mechanisms, which do not hide the visibility of all cores.
There are a few workarounds, but  IMO, it is better to implement checks for that than do them

>Workaround for docker - set `--cpuset-cpus`
>Workaraund for K8s - somehow deal with [kubelet static CPU management policy](https://kubernetes.io/docs/tasks/administer-cluster/cpu-management-policies/#cpu-management-policies), as [recommend in Reddit](https://news.ycombinator.com/item?id=25224714)

* Send all "colorify" logs through stderr, as make able to add user-facing-logs in functions that also need to return same value to the function-caller. Needed for `common::get_cpu_num` err_msg show up

---------

Co-authored-by: George L. Yermulnik <[email protected]>
hooks/_common.sh Outdated Show resolved Hide resolved
README.md Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
hooks/_common.sh Outdated Show resolved Hide resolved
hooks/_common.sh Show resolved Hide resolved
Copy link
Owner

@antonbabenko antonbabenko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OMG! It seems to work as expected ‼️

@yermulnik
Copy link
Collaborator

OMG! It seems to work as expected ‼️

Yeah, that's a bit of magic by Max 😄

@MaxymVlasov MaxymVlasov merged commit 6c6eca4 into master Feb 17, 2024
5 checks passed
@MaxymVlasov MaxymVlasov deleted the feat/parallelizm branch February 17, 2024 20:41
antonbabenko pushed a commit that referenced this pull request Feb 17, 2024
# [1.87.0](v1.86.1...v1.87.0) (2024-02-17)

### Features

* Add parallelism to major chunk of hooks. Check `Parallelism` section in README ([#620](#620)) ([6c6eca4](6c6eca4))
@antonbabenko
Copy link
Owner

This PR is included in version 1.87.0 🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants