Skip to content

Commit

Permalink
regression detector: change baseline variant from latest main to merg…
Browse files Browse the repository at this point in the history
…e base (#17449)

* regression detector: add compute merge base job

We need to compute the merge base of a non-`main` branch with respect
to `main` in order to establish the commit SHA of a baseline variant
in regression detection. This commit adds a job to compute that merge
base, echo the result to a file (sans newline), and then upload that
result to the Single-Machine Performance S3 bucket for Agent team.

Signed-off-by: Geoffrey M. Oxberry <[email protected]>

* regression detector: note build step we may need

The existing Single-Machine Performance (SMP) regression detector
setup doesn't build additional containers -- it just publishes
containers already built in Agent CI to SMP's ECR for Agent
images. I've written the new "merge-base" container assuming we can
continue with that strategy, but if we can't continue with that
strategy for some reason, then this commit includes a bunch of
comments sketching out a backup plan that would build the container we
would need and publish it to SMP's ECR for Agent images.

Signed-off-by: Geoffrey M. Oxberry <[email protected]>

* regression detectr: sketch functional test tweaks

As in the spirit of the previous commit, this commit adds some
comments on how to transition the existing setup in
`functional_test/regression_detector.yml` from using the latest commit
on `main` to using a "merge base" baseline SHA.

There are a few parts of this commit that are actually functional, but
shouldn't functionally alter the existing regression detector output
-- it still uses a "latest `main`" baseline SHA -- but it does
introduce the idea of getting the new baseline SHA from an artifact in
a previous stage, rather than uploading it to S3 (although I also
upload the merge base baseline SHA to S3 as well, as a contingency
plan).

Signed-off-by: Geoffrey M. Oxberry <[email protected]>

* smp, docker_linux.yml: use literal for image

I thought using a `!reference` tag would work for specifying an image
for the merge-base-computing job, but that mental model may be wrong,
so this commit replaces that tag with the literal it's supposed to
(de)reference.

Signed-off-by: Geoffrey M. Oxberry <[email protected]>

* smp, docker_linux.yml: comment out merge base job

The merge base job isn't running because no runners are configured to
run it. This issue could be one that I can't resolve, so it could be
that I need to get permissions to run it. For now, I'll comment out
this job and instead try and get the information necessary to compute
a baseline as part of the regression detector job, but without
otherwise altering the regression detector behavior.

Signed-off-by: Geoffrey M. Oxberry <[email protected]>

* regression_detector.yml: compute merge base

As a test of whether I can compute the merge base using a `git fetch`
command, this commit adds that command (and other supporting commands)
to the regression detector job to see if these commands will work.

Signed-off-by: Geoffrey M. Oxberry <[email protected]>

* smp, docker_linux.yml: add container build notes

I've managed to figure out how to get the most important case to work
-- the one in which the base branch of the pull request (i.e., in
GitLab terms, the target branch of the merge request) is
`main`. Having managed to get this case to work, this commit updates
my implementation notes to reflect how to implement the case when the
base branch of the pull request *is not* `main`.

Signed-off-by: Geoffrey M. Oxberry <[email protected]>

* regression detector: set baseline to merge base

It looks like creating a separate job (at least in
`.gitlab/container_build/docker_linux.yml`) will not work because
GitLab won't run it in a runner. This behavior may be a configuration
setting for security reasons, and may require repo admin privileges to
change. Instead, this commit implements setting the baseline SHA to
the merge base, along with some commands to abort the regression
detector job if the base branch of the pull request (target branch of
the merge request) is not `main`, and copies the merge base SHA to S3
for debugging/auditing purposes.

Signed-off-by: Geoffrey M. Oxberry <[email protected]>

* regression_detector.yml: remove obsolete comments

Now that I'm pretty sure I've figured out how to change the baseline
SHA to the merge base of a pull request, this commit deletes most of
the commented-out lines I introduced into
`.gitlab/functional_test/regression_detector.yml` as notes to
myself. These comments are no longer necessary for record-keeping.

Signed-off-by: Geoffrey M. Oxberry <[email protected]>

* docker_linux.yml: correct line comment

This commit corrects a line comment in which I write "Use this line if
comparing `main` against itself makes sense" when I should have
written "Use this line if comparing `main` against itself does not
make sense".

Signed-off-by: Geoffrey M. Oxberry <[email protected]>

* regression_detector.yml: make if block a one-liner

GitLab apparently doesn't echo multiline commands by default, so this
commit rewrites this multiline `if` block as a one line command for
debugging purposes.

Signed-off-by: Geoffrey M. Oxberry <[email protected]>

* regression_detector.yml: remove backticks

I'm pretty sure that the "command `main` not found" message I saw was
because the shell was likely interpreting those backticks as running a
command. This commit replaces the backitcks with single quotes to
avoid having the shell attempt to run a command called `main`.

Signed-off-by: Geoffrey M. Oxberry <[email protected]>

* regression_detector.yml: disable base branch check

This commit temporarily disables the check for the base branch because
it doesn't quite seem to work at the moment, and it adds enough
debugging output so I can get some idea of why that check does not
work.

Signed-off-by: Geoffrey M. Oxberry <[email protected]>

* regression_detector.yml: add artifacts + cleanup

This commit cleans up the last bits of debugging comments and code in
`.gitlab/functional_test/regression_detector.yml`, plus it adds the
reporting outputs as artifacts to provide additional diagnostic output
for debugging purposes.

Signed-off-by: Geoffrey M. Oxberry <[email protected]>

* docker_linux.yml: remove comment cruft

This commit removes a large comment block I put in
`.gitlab/container_build/docker_linux.yml` because I no longer think
it's a good idea to compute the merge base commit in a job separate
from the regression detector job.

Signed-off-by: Geoffrey M. Oxberry <[email protected]>

* regression detector: compute baseline via loop

The previous regression detector baseline SHA computation assumed that
an image exists in ECR for the commit returned by `$(git commit
merge-base HEAD main)`, *i.e.*, the merge base with `main`. However,
the container associated with that commit may not exist because that
container may have failed to build, or may have failed to upload to
ECR, so we need to check whether that container exists in ECR. If that
container exists, then the merge base with `main` is the baseline
SHA. If not, then we must iterate over predecessor commits in `main`
until we find a commit in `main` for which a container exists in
ECR. The first commit we find in this loop becomes the baseline SHA
for the regression detector.

This commit implements that check, along with a loop that iterates
over predecessor commits, if necessary, as described above. The
implementation is currently a rather verbose one-liner because
single-line statements are easier to debug in GitLab CI. Once this
implementation succeeds, a subsequent commit will clean up the
implementation by changing it to a multi-line statement, after which
this branch will be ready for review.

Signed-off-by: Geoffrey M. Oxberry <[email protected]>

* regression detector: reduce stderr noise

The initial implementation of the container existence check turns out
to be pretty noisy when a container doesn't exist because `awscli`
will output a bunch of error information on failure. While this
information is helpful for debugging purposes, it will be annoying in
CI, so this commit redirects the `stderr` of that command to
`/dev/null` to reduce noise in the CI output of the regression
detector CI job.

Signed-off-by: Geoffrey M. Oxberry <[email protected]>

* regression detector: fix copy before write error

The regression detector job in the Agent CI pipeline currently fails
because it attempts to copy a file that doesn't exist to S3. This
commit fixes that error by moving the copy command to a point after
the file is generated.

Signed-off-by: Geoffrey M. Oxberry <[email protected]>

* regression detector: tweak ecr query for debugging

For some reason, the `aws ecr describe-images` command is not finding
images that I know exist. To aid in debugging at the expense of noise,
this commit removes redirection of `stderr` to `/dev/null`, adds the
`--registry-id` flag to be more explicit about ECR repo location, and
also moves the `--profile` flag and its argument to a more readable
location.

Signed-off-by: Geoffrey M. Oxberry <[email protected]>

* regression detector: remove "latest main" job

The baseline-SHA-computing steps in the
`single-machine-performance-regression_detector` job make computing
the commit of "latest `main`" unnecessary, so this commit deletes the
job that does that computation.

Signed-off-by: Geoffrey M. Oxberry <[email protected]>

* regression detector: remove loop cleanup comment

Due to falling behind on Agent performance investigations, I don't
think I'm going to get to clean up the long, one line loop statement,
so this commit deletes that aspirational comment -- it can be cleaned
up in a subsequent pull request.

Signed-off-by: Geoffrey M. Oxberry <[email protected]>

* regression detector: pin pr-commenter version

This commit attempts to triage the `pr-commenter` errors by pinning
the `pr-commenter` version in a fashion similar to that used by
other Datadog repositories (e.g., `dogweb`).

Signed-off-by: Geoffrey M. Oxberry <[email protected]>

* regression detector: delete more stale comments

For now, I don't plan to move the baseline SHA computation into a
separate job, although that may happen later. Given this change of
plans, this commit removes the stale comment regarding refactoring the
baseline SHA computation to a separate job.

Signed-off-by: Geoffrey M. Oxberry <[email protected]>

---------

Signed-off-by: Geoffrey M. Oxberry <[email protected]>
  • Loading branch information
goxberry authored Jun 15, 2023
1 parent 4f2dd17 commit 0475eb9
Show file tree
Hide file tree
Showing 2 changed files with 25 additions and 27 deletions.
17 changes: 0 additions & 17 deletions .gitlab/container_build/docker_linux.yml
Original file line number Diff line number Diff line change
Expand Up @@ -119,23 +119,6 @@ docker_build_agent7:
TAG_SUFFIX: -7
BUILD_ARG: --target test --build-arg PYTHON_VERSION=3 --build-arg DD_AGENT_ARTIFACT=datadog-agent_7*_amd64.deb

single_machine_performance-push-main-sha:
stage: container_build
rules:
- !reference [.on_main_a7]
image: 486234852809.dkr.ecr.us-east-1.amazonaws.com/ci/datadog-agent-buildimages/docker_x64$DATADOG_AGENT_BUILDIMAGES_SUFFIX:$DATADOG_AGENT_BUILDIMAGES
tags: ["runner:docker"]
needs:
- docker_build_agent7
script:
- echo -n ${CI_COMMIT_SHA} > .main_sha
- SMP_ACCOUNT_ID=$(aws ssm get-parameter --region us-east-1 --name ci.datadog-agent.single-machine-performance-account-id --with-decryption --query "Parameter.Value" --out text)
- SMP_AGENT_TEAM_ID=$(aws ssm get-parameter --region us-east-1 --name ci.datadog-agent.single-machine-performance-agent-team-id --with-decryption --query "Parameter.Value" --out text)
- aws configure set aws_access_key_id $(aws ssm get-parameter --region us-east-1 --name ci.datadog-agent.single-machine-performance-bot-access-key-id --with-decryption --query "Parameter.Value" --out text) --profile single-machine-performance
- aws configure set aws_secret_access_key $(aws ssm get-parameter --region us-east-1 --name ci.datadog-agent.single-machine-performance-bot-access-key --with-decryption --query "Parameter.Value" --out text) --profile single-machine-performance
- aws configure set region us-west-2 --profile single-machine-performance
- aws s3 cp --profile single-machine-performance --only-show-errors .main_sha s3://${SMP_AGENT_TEAM_ID}-smp-artifacts/information/latest_main_sha

single_machine_performance-amd64-a7:
extends: .docker_publish_job_definition
stage: container_build
Expand Down
35 changes: 25 additions & 10 deletions .gitlab/functional_test/regression_detector.yml
Original file line number Diff line number Diff line change
Expand Up @@ -8,7 +8,10 @@ single-machine-performance-regression_detector:
artifacts:
expire_in: 1 weeks
paths:
- submission_metadata
- submission_metadata # for provenance, debugging
- ${CI_COMMIT_SHA}-baseline_sha # for provenance, debugging
- outputs/report.md # for debugging, also on S3
- outputs/report.html # for debugging, also on S3
when: always
variables:
SMP_VERSION: 0.7.3
Expand All @@ -22,14 +25,22 @@ single-machine-performance-regression_detector:
# 'comparison'. We are guaranteed by the structure of the pipeline that
# 'comparison' exists, not so much with 'baseline' as it has to come from main
# merge pipeline run. This is solved in datadog-agent by updating a file in S3
# with the SHA of the latest passing build from main. It's solved in Vector by
# with the SHA of the merge base from main. It's solved in Vector by
# building Vector twice for each Regression Detector run.
#
# We allow failure for now. _Unfortunately_ this also means that if the
# Regression Detector finds a performance issue with a PR it will not be
# flagged.
allow_failure: true
script:
# Ensure output files exist for artifact downloads step
- mkdir outputs # Also needed for smp job sync step
- touch outputs/report.md # Will be emitted by smp job sync
- touch outputs/report.html # Will be emitted by smp job sync
# Compute merge base of current commit and `main`
- git fetch origin
- SMP_MERGE_BASE=$(git merge-base ${CI_COMMIT_SHA} origin/main)
- echo "Merge base is ${SMP_MERGE_BASE}"
# Setup AWS credentials for single-machine-performance AWS account
- AWS_NAMED_PROFILE="single-machine-performance"
- SMP_ACCOUNT_ID=$(aws ssm get-parameter --region us-east-1 --name ci.datadog-agent.single-machine-performance-account-id --with-decryption --query "Parameter.Value" --out text)
Expand All @@ -41,11 +52,15 @@ single-machine-performance-regression_detector:
# Download smp binary and prepare it for use
- aws --profile single-machine-performance s3 cp s3://smp-cli-releases/v${SMP_VERSION}/x86_64-unknown-linux-musl/smp smp
- chmod +x smp
# Submit job, using the SHA of the last successful main commit as baseline.
# This will have been built in a separate pipeline run. The comparison will
# have been built previously in this pipeline run.
- aws --profile ${AWS_NAMED_PROFILE} s3 cp --only-show-errors s3://${SMP_AGENT_TEAM_ID}-smp-artifacts/information/latest_main_sha .latest_main_sha
- BASELINE_SHA=$(cat .latest_main_sha)
- BASELINE_SHA="${SMP_MERGE_BASE}"
- echo "Computing baseline..."
- echo "Checking if image exists for commit ${BASELINE_SHA}..."
- while [[ ! $(aws ecr describe-images --profile single-machine-performance --registry-id "${SMP_ACCOUNT_ID}" --repository-name "${SMP_AGENT_TEAM_ID}-agent" --image-ids imageTag="${BASELINE_SHA}-7-amd64") ]]; do echo "No image exists for ${BASELINE_SHA} - checking predecessor of ${BASELINE_SHA} next"; BASELINE_SHA=$(git rev-parse ${BASELINE_SHA}^); echo "Checking if image exists for commit ${BASELINE_SHA}..."; done
- echo "Image exists for commit ${BASELINE_SHA}"
- echo "Baseline SHA is ${BASELINE_SHA}"
- echo -n "${BASELINE_SHA}" > "${CI_COMMIT_SHA}-baseline_sha"
# Copy the baseline SHA to SMP for debugging purposes later
- aws s3 cp --profile single-machine-performance --only-show-errors "${CI_COMMIT_SHA}-baseline_sha" "s3://${SMP_AGENT_TEAM_ID}-smp-artifacts/information/"
- BASELINE_IMAGE=${SMP_ECR_URL}/${SMP_AGENT_TEAM_ID}-agent:${BASELINE_SHA}-7-amd64
- echo "${BASELINE_SHA} | ${BASELINE_IMAGE}"
- COMPARISON_IMAGE=${SMP_ECR_URL}/${SMP_AGENT_TEAM_ID}-agent:${CI_COMMIT_SHA}-7-amd64
Expand Down Expand Up @@ -76,7 +91,6 @@ single-machine-performance-regression_detector:
--wait-delay-seconds 60
--submission-metadata submission_metadata
# Now that the job is completed pull the analysis report, output it to stdout.
- mkdir outputs
- RUST_LOG="${RUST_LOG}" ./smp --team-id ${SMP_AGENT_TEAM_ID} --aws-named-profile ${AWS_NAMED_PROFILE}
job sync
--submission-metadata submission_metadata
Expand All @@ -93,7 +107,7 @@ single-machine-performance-regression_detector:
- dpkg -i dd-package.deb
- rm dd-package.deb
- apt-get update
- dd-package --bucket binaries.ddbuild.io --package devtools/dd-package-dev
- dd-package --bucket binaries.ddbuild.io --package devtools/dd-package-dev --distribution "20.04"
# Kludge from https://gitlab.com/gitlab-org/gitlab-runner/-/issues/4645#note_287636439 to avoid
# doubled output
- echo ""
Expand All @@ -103,7 +117,8 @@ single-machine-performance-regression_detector:
- "# beneath the error message in the GitLab CI logs #"
- "#################################################################################"
- echo ""
- dd-package --bucket binaries.ddbuild.io --package devtools/pr-commenter
# Pinned version of pr-commenter taken from https://github.com/DataDog/dogweb/blob/b2a891c2a2186b8fecd27ca9b13ebabe5f289a4a/tasks/gitlab/k8s-diff-helper.sh#L22
- dd-package --bucket binaries.ddbuild.io --package devtools/pr-commenter --distribution "20.04" --version "14692290-a6440fd1"
# Post HTML report to GitHub
- cat outputs/report.html | /usr/local/bin/pr-commenter --for-pr="$CI_COMMIT_REF_NAME"
# Finally, exit 1 if the job signals a regression else 0.
Expand Down

0 comments on commit 0475eb9

Please sign in to comment.