From 0475eb94c58e47d42cca5a7ab1fb8d59d4707276 Mon Sep 17 00:00:00 2001 From: Geoffrey Oxberry Date: Thu, 15 Jun 2023 07:33:37 -0700 Subject: [PATCH] regression detector: change baseline variant from latest main to merge base (#17449) * regression detector: add compute merge base job We need to compute the merge base of a non-`main` branch with respect to `main` in order to establish the commit SHA of a baseline variant in regression detection. This commit adds a job to compute that merge base, echo the result to a file (sans newline), and then upload that result to the Single-Machine Performance S3 bucket for Agent team. Signed-off-by: Geoffrey M. Oxberry * regression detector: note build step we may need The existing Single-Machine Performance (SMP) regression detector setup doesn't build additional containers -- it just publishes containers already built in Agent CI to SMP's ECR for Agent images. I've written the new "merge-base" container assuming we can continue with that strategy, but if we can't continue with that strategy for some reason, then this commit includes a bunch of comments sketching out a backup plan that would build the container we would need and publish it to SMP's ECR for Agent images. Signed-off-by: Geoffrey M. Oxberry * regression detectr: sketch functional test tweaks As in the spirit of the previous commit, this commit adds some comments on how to transition the existing setup in `functional_test/regression_detector.yml` from using the latest commit on `main` to using a "merge base" baseline SHA. There are a few parts of this commit that are actually functional, but shouldn't functionally alter the existing regression detector output -- it still uses a "latest `main`" baseline SHA -- but it does introduce the idea of getting the new baseline SHA from an artifact in a previous stage, rather than uploading it to S3 (although I also upload the merge base baseline SHA to S3 as well, as a contingency plan). Signed-off-by: Geoffrey M. Oxberry * smp, docker_linux.yml: use literal for image I thought using a `!reference` tag would work for specifying an image for the merge-base-computing job, but that mental model may be wrong, so this commit replaces that tag with the literal it's supposed to (de)reference. Signed-off-by: Geoffrey M. Oxberry * smp, docker_linux.yml: comment out merge base job The merge base job isn't running because no runners are configured to run it. This issue could be one that I can't resolve, so it could be that I need to get permissions to run it. For now, I'll comment out this job and instead try and get the information necessary to compute a baseline as part of the regression detector job, but without otherwise altering the regression detector behavior. Signed-off-by: Geoffrey M. Oxberry * regression_detector.yml: compute merge base As a test of whether I can compute the merge base using a `git fetch` command, this commit adds that command (and other supporting commands) to the regression detector job to see if these commands will work. Signed-off-by: Geoffrey M. Oxberry * smp, docker_linux.yml: add container build notes I've managed to figure out how to get the most important case to work -- the one in which the base branch of the pull request (i.e., in GitLab terms, the target branch of the merge request) is `main`. Having managed to get this case to work, this commit updates my implementation notes to reflect how to implement the case when the base branch of the pull request *is not* `main`. Signed-off-by: Geoffrey M. Oxberry * regression detector: set baseline to merge base It looks like creating a separate job (at least in `.gitlab/container_build/docker_linux.yml`) will not work because GitLab won't run it in a runner. This behavior may be a configuration setting for security reasons, and may require repo admin privileges to change. Instead, this commit implements setting the baseline SHA to the merge base, along with some commands to abort the regression detector job if the base branch of the pull request (target branch of the merge request) is not `main`, and copies the merge base SHA to S3 for debugging/auditing purposes. Signed-off-by: Geoffrey M. Oxberry * regression_detector.yml: remove obsolete comments Now that I'm pretty sure I've figured out how to change the baseline SHA to the merge base of a pull request, this commit deletes most of the commented-out lines I introduced into `.gitlab/functional_test/regression_detector.yml` as notes to myself. These comments are no longer necessary for record-keeping. Signed-off-by: Geoffrey M. Oxberry * docker_linux.yml: correct line comment This commit corrects a line comment in which I write "Use this line if comparing `main` against itself makes sense" when I should have written "Use this line if comparing `main` against itself does not make sense". Signed-off-by: Geoffrey M. Oxberry * regression_detector.yml: make if block a one-liner GitLab apparently doesn't echo multiline commands by default, so this commit rewrites this multiline `if` block as a one line command for debugging purposes. Signed-off-by: Geoffrey M. Oxberry * regression_detector.yml: remove backticks I'm pretty sure that the "command `main` not found" message I saw was because the shell was likely interpreting those backticks as running a command. This commit replaces the backitcks with single quotes to avoid having the shell attempt to run a command called `main`. Signed-off-by: Geoffrey M. Oxberry * regression_detector.yml: disable base branch check This commit temporarily disables the check for the base branch because it doesn't quite seem to work at the moment, and it adds enough debugging output so I can get some idea of why that check does not work. Signed-off-by: Geoffrey M. Oxberry * regression_detector.yml: add artifacts + cleanup This commit cleans up the last bits of debugging comments and code in `.gitlab/functional_test/regression_detector.yml`, plus it adds the reporting outputs as artifacts to provide additional diagnostic output for debugging purposes. Signed-off-by: Geoffrey M. Oxberry * docker_linux.yml: remove comment cruft This commit removes a large comment block I put in `.gitlab/container_build/docker_linux.yml` because I no longer think it's a good idea to compute the merge base commit in a job separate from the regression detector job. Signed-off-by: Geoffrey M. Oxberry * regression detector: compute baseline via loop The previous regression detector baseline SHA computation assumed that an image exists in ECR for the commit returned by `$(git commit merge-base HEAD main)`, *i.e.*, the merge base with `main`. However, the container associated with that commit may not exist because that container may have failed to build, or may have failed to upload to ECR, so we need to check whether that container exists in ECR. If that container exists, then the merge base with `main` is the baseline SHA. If not, then we must iterate over predecessor commits in `main` until we find a commit in `main` for which a container exists in ECR. The first commit we find in this loop becomes the baseline SHA for the regression detector. This commit implements that check, along with a loop that iterates over predecessor commits, if necessary, as described above. The implementation is currently a rather verbose one-liner because single-line statements are easier to debug in GitLab CI. Once this implementation succeeds, a subsequent commit will clean up the implementation by changing it to a multi-line statement, after which this branch will be ready for review. Signed-off-by: Geoffrey M. Oxberry * regression detector: reduce stderr noise The initial implementation of the container existence check turns out to be pretty noisy when a container doesn't exist because `awscli` will output a bunch of error information on failure. While this information is helpful for debugging purposes, it will be annoying in CI, so this commit redirects the `stderr` of that command to `/dev/null` to reduce noise in the CI output of the regression detector CI job. Signed-off-by: Geoffrey M. Oxberry * regression detector: fix copy before write error The regression detector job in the Agent CI pipeline currently fails because it attempts to copy a file that doesn't exist to S3. This commit fixes that error by moving the copy command to a point after the file is generated. Signed-off-by: Geoffrey M. Oxberry * regression detector: tweak ecr query for debugging For some reason, the `aws ecr describe-images` command is not finding images that I know exist. To aid in debugging at the expense of noise, this commit removes redirection of `stderr` to `/dev/null`, adds the `--registry-id` flag to be more explicit about ECR repo location, and also moves the `--profile` flag and its argument to a more readable location. Signed-off-by: Geoffrey M. Oxberry * regression detector: remove "latest main" job The baseline-SHA-computing steps in the `single-machine-performance-regression_detector` job make computing the commit of "latest `main`" unnecessary, so this commit deletes the job that does that computation. Signed-off-by: Geoffrey M. Oxberry * regression detector: remove loop cleanup comment Due to falling behind on Agent performance investigations, I don't think I'm going to get to clean up the long, one line loop statement, so this commit deletes that aspirational comment -- it can be cleaned up in a subsequent pull request. Signed-off-by: Geoffrey M. Oxberry * regression detector: pin pr-commenter version This commit attempts to triage the `pr-commenter` errors by pinning the `pr-commenter` version in a fashion similar to that used by other Datadog repositories (e.g., `dogweb`). Signed-off-by: Geoffrey M. Oxberry * regression detector: delete more stale comments For now, I don't plan to move the baseline SHA computation into a separate job, although that may happen later. Given this change of plans, this commit removes the stale comment regarding refactoring the baseline SHA computation to a separate job. Signed-off-by: Geoffrey M. Oxberry --------- Signed-off-by: Geoffrey M. Oxberry --- .gitlab/container_build/docker_linux.yml | 17 --------- .../functional_test/regression_detector.yml | 35 +++++++++++++------ 2 files changed, 25 insertions(+), 27 deletions(-) diff --git a/.gitlab/container_build/docker_linux.yml b/.gitlab/container_build/docker_linux.yml index 9dfd539b4676a4..19289b03e52615 100644 --- a/.gitlab/container_build/docker_linux.yml +++ b/.gitlab/container_build/docker_linux.yml @@ -119,23 +119,6 @@ docker_build_agent7: TAG_SUFFIX: -7 BUILD_ARG: --target test --build-arg PYTHON_VERSION=3 --build-arg DD_AGENT_ARTIFACT=datadog-agent_7*_amd64.deb -single_machine_performance-push-main-sha: - stage: container_build - rules: - - !reference [.on_main_a7] - image: 486234852809.dkr.ecr.us-east-1.amazonaws.com/ci/datadog-agent-buildimages/docker_x64$DATADOG_AGENT_BUILDIMAGES_SUFFIX:$DATADOG_AGENT_BUILDIMAGES - tags: ["runner:docker"] - needs: - - docker_build_agent7 - script: - - echo -n ${CI_COMMIT_SHA} > .main_sha - - SMP_ACCOUNT_ID=$(aws ssm get-parameter --region us-east-1 --name ci.datadog-agent.single-machine-performance-account-id --with-decryption --query "Parameter.Value" --out text) - - SMP_AGENT_TEAM_ID=$(aws ssm get-parameter --region us-east-1 --name ci.datadog-agent.single-machine-performance-agent-team-id --with-decryption --query "Parameter.Value" --out text) - - aws configure set aws_access_key_id $(aws ssm get-parameter --region us-east-1 --name ci.datadog-agent.single-machine-performance-bot-access-key-id --with-decryption --query "Parameter.Value" --out text) --profile single-machine-performance - - aws configure set aws_secret_access_key $(aws ssm get-parameter --region us-east-1 --name ci.datadog-agent.single-machine-performance-bot-access-key --with-decryption --query "Parameter.Value" --out text) --profile single-machine-performance - - aws configure set region us-west-2 --profile single-machine-performance - - aws s3 cp --profile single-machine-performance --only-show-errors .main_sha s3://${SMP_AGENT_TEAM_ID}-smp-artifacts/information/latest_main_sha - single_machine_performance-amd64-a7: extends: .docker_publish_job_definition stage: container_build diff --git a/.gitlab/functional_test/regression_detector.yml b/.gitlab/functional_test/regression_detector.yml index 041f06ad696154..0d362fa22ef32b 100644 --- a/.gitlab/functional_test/regression_detector.yml +++ b/.gitlab/functional_test/regression_detector.yml @@ -8,7 +8,10 @@ single-machine-performance-regression_detector: artifacts: expire_in: 1 weeks paths: - - submission_metadata + - submission_metadata # for provenance, debugging + - ${CI_COMMIT_SHA}-baseline_sha # for provenance, debugging + - outputs/report.md # for debugging, also on S3 + - outputs/report.html # for debugging, also on S3 when: always variables: SMP_VERSION: 0.7.3 @@ -22,7 +25,7 @@ single-machine-performance-regression_detector: # 'comparison'. We are guaranteed by the structure of the pipeline that # 'comparison' exists, not so much with 'baseline' as it has to come from main # merge pipeline run. This is solved in datadog-agent by updating a file in S3 - # with the SHA of the latest passing build from main. It's solved in Vector by + # with the SHA of the merge base from main. It's solved in Vector by # building Vector twice for each Regression Detector run. # # We allow failure for now. _Unfortunately_ this also means that if the @@ -30,6 +33,14 @@ single-machine-performance-regression_detector: # flagged. allow_failure: true script: + # Ensure output files exist for artifact downloads step + - mkdir outputs # Also needed for smp job sync step + - touch outputs/report.md # Will be emitted by smp job sync + - touch outputs/report.html # Will be emitted by smp job sync + # Compute merge base of current commit and `main` + - git fetch origin + - SMP_MERGE_BASE=$(git merge-base ${CI_COMMIT_SHA} origin/main) + - echo "Merge base is ${SMP_MERGE_BASE}" # Setup AWS credentials for single-machine-performance AWS account - AWS_NAMED_PROFILE="single-machine-performance" - SMP_ACCOUNT_ID=$(aws ssm get-parameter --region us-east-1 --name ci.datadog-agent.single-machine-performance-account-id --with-decryption --query "Parameter.Value" --out text) @@ -41,11 +52,15 @@ single-machine-performance-regression_detector: # Download smp binary and prepare it for use - aws --profile single-machine-performance s3 cp s3://smp-cli-releases/v${SMP_VERSION}/x86_64-unknown-linux-musl/smp smp - chmod +x smp - # Submit job, using the SHA of the last successful main commit as baseline. - # This will have been built in a separate pipeline run. The comparison will - # have been built previously in this pipeline run. - - aws --profile ${AWS_NAMED_PROFILE} s3 cp --only-show-errors s3://${SMP_AGENT_TEAM_ID}-smp-artifacts/information/latest_main_sha .latest_main_sha - - BASELINE_SHA=$(cat .latest_main_sha) + - BASELINE_SHA="${SMP_MERGE_BASE}" + - echo "Computing baseline..." + - echo "Checking if image exists for commit ${BASELINE_SHA}..." + - while [[ ! $(aws ecr describe-images --profile single-machine-performance --registry-id "${SMP_ACCOUNT_ID}" --repository-name "${SMP_AGENT_TEAM_ID}-agent" --image-ids imageTag="${BASELINE_SHA}-7-amd64") ]]; do echo "No image exists for ${BASELINE_SHA} - checking predecessor of ${BASELINE_SHA} next"; BASELINE_SHA=$(git rev-parse ${BASELINE_SHA}^); echo "Checking if image exists for commit ${BASELINE_SHA}..."; done + - echo "Image exists for commit ${BASELINE_SHA}" + - echo "Baseline SHA is ${BASELINE_SHA}" + - echo -n "${BASELINE_SHA}" > "${CI_COMMIT_SHA}-baseline_sha" + # Copy the baseline SHA to SMP for debugging purposes later + - aws s3 cp --profile single-machine-performance --only-show-errors "${CI_COMMIT_SHA}-baseline_sha" "s3://${SMP_AGENT_TEAM_ID}-smp-artifacts/information/" - BASELINE_IMAGE=${SMP_ECR_URL}/${SMP_AGENT_TEAM_ID}-agent:${BASELINE_SHA}-7-amd64 - echo "${BASELINE_SHA} | ${BASELINE_IMAGE}" - COMPARISON_IMAGE=${SMP_ECR_URL}/${SMP_AGENT_TEAM_ID}-agent:${CI_COMMIT_SHA}-7-amd64 @@ -76,7 +91,6 @@ single-machine-performance-regression_detector: --wait-delay-seconds 60 --submission-metadata submission_metadata # Now that the job is completed pull the analysis report, output it to stdout. - - mkdir outputs - RUST_LOG="${RUST_LOG}" ./smp --team-id ${SMP_AGENT_TEAM_ID} --aws-named-profile ${AWS_NAMED_PROFILE} job sync --submission-metadata submission_metadata @@ -93,7 +107,7 @@ single-machine-performance-regression_detector: - dpkg -i dd-package.deb - rm dd-package.deb - apt-get update - - dd-package --bucket binaries.ddbuild.io --package devtools/dd-package-dev + - dd-package --bucket binaries.ddbuild.io --package devtools/dd-package-dev --distribution "20.04" # Kludge from https://gitlab.com/gitlab-org/gitlab-runner/-/issues/4645#note_287636439 to avoid # doubled output - echo "" @@ -103,7 +117,8 @@ single-machine-performance-regression_detector: - "# beneath the error message in the GitLab CI logs #" - "#################################################################################" - echo "" - - dd-package --bucket binaries.ddbuild.io --package devtools/pr-commenter + # Pinned version of pr-commenter taken from https://github.com/DataDog/dogweb/blob/b2a891c2a2186b8fecd27ca9b13ebabe5f289a4a/tasks/gitlab/k8s-diff-helper.sh#L22 + - dd-package --bucket binaries.ddbuild.io --package devtools/pr-commenter --distribution "20.04" --version "14692290-a6440fd1" # Post HTML report to GitHub - cat outputs/report.html | /usr/local/bin/pr-commenter --for-pr="$CI_COMMIT_REF_NAME" # Finally, exit 1 if the job signals a regression else 0.