Enable omnibus build cache #20117

Merged
merged 51 commits into main from chouquette/omnibus_cache
Apr 17, 2024

Conversation

Contributor

@chouquette chouquette commented Oct 13, 2023

What does this PR do?

This PR enables the omnibus git cache so that CI jobs stop rebuilding all of our dependencies on every run.
The packages built before the agent are expected to change rarely and shouldn't have to be rebuilt every time.
On average, this saves about 15 to 20 minutes per job.

This also lets individual developers skip rebuilding every dependency when they want to add a new software dependency to the agent. All they need to do is set the OMNIBUS_GIT_CACHE_DIR environment variable to a directory of their choosing.
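
As a rough illustration of the opt-in behavior described above, here is a minimal sketch in which caching is active only when OMNIBUS_GIT_CACHE_DIR is set. The helper name `resolve_git_cache_dir` is hypothetical and is not taken from this PR; the actual build tasks may read the variable differently.

```python
import os
from pathlib import Path


def resolve_git_cache_dir(env=None):
    """Return the cache directory as a Path, or None when caching is not requested.

    Hypothetical helper: only the OMNIBUS_GIT_CACHE_DIR variable name comes from the PR.
    """
    env = os.environ if env is None else env
    raw = env.get('OMNIBUS_GIT_CACHE_DIR')
    if not raw:
        # Variable unset or empty: behave as before, without a git cache.
        return None
    return Path(raw)


print(resolve_git_cache_dir({}))  # None
```

Passing the environment as a parameter keeps the sketch easy to exercise without touching the real process environment.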

Motivation

This is part of the ongoing initiative to bring the median pipeline duration under 2 hours. This specific investigation is tracked in https://datadoghq.atlassian.net/browse/APL-1805

Associated RFC: https://docs.google.com/document/d/1PSGpd2ixXXMbfzC1j0o514SXypcDiVamxBL3db__Bt0/edit?usp=sharing

Additional Notes

Possible Drawbacks / Trade-offs

Describe how to test/QA your changes

Reviewer's Checklist

  • If known, an appropriate milestone has been selected; otherwise the Triage milestone is set.
  • Use the major_change label if your change has a major impact on the code base, impacts multiple teams, or changes important well-established internals of the Agent. This label will be used during QA to make sure each team pays extra attention to the changed behavior. For any customer-facing change, use a releasenote.
  • A release note has been added or the changelog/no-changelog label has been applied.
  • Changed code has automated tests for its functionality.
  • Adequate QA/testing plan information is provided if the qa/skip-qa label is not applied.
  • At least one team/.. label has been applied, indicating the team(s) that should QA this change.
  • If applicable, docs team has been notified or an issue has been opened on the documentation repo.
  • If applicable, the need-change/operator and need-change/helm labels have been applied.
  • If applicable, the k8s/<min-version> label has been applied, indicating the lowest Kubernetes version compatible with this feature.
  • If applicable, the config template has been updated.

@chouquette chouquette force-pushed the chouquette/omnibus_cache branch from 79302a8 to f8fd7e1 Compare October 13, 2023 08:12

pr-commenter bot commented Oct 13, 2023

Bloop Bleep... Dogbot Here

Regression Detector Results

Run ID: 6ad93a5c-18e7-4e1d-a6ad-f271790a3eac
Baseline: d40837d
Comparison: 9f05afb

Performance changes are noted in the perf column of each table:

  • ✅ = significantly better comparison variant performance
  • ❌ = significantly worse comparison variant performance
  • ➖ = no significant change in performance

Experiments with missing or malformed data

  • basic_py_check

Usually, this warning means that there is no usable optimization goal data for that experiment, which could be a result of misconfiguration.

No significant changes in experiment optimization goals

Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%

There were no significant changes in experiment optimization goals at this confidence level and effect size tolerance.

Experiments ignored for regressions

Regressions in experiments with settings containing erratic: true are ignored.

| perf | experiment | goal | Δ mean % | Δ mean % CI |
|------|------------|------|----------|-------------|
| | file_to_blackhole | % cpu utilization | -0.26 | [-6.80, +6.29] |

Fine details of change detection per experiment

| perf | experiment | goal | Δ mean % | Δ mean % CI |
|------|------------|------|----------|-------------|
| | uds_dogstatsd_to_api_cpu | % cpu utilization | +1.21 | [-0.25, +2.66] |
| | tcp_syslog_to_blackhole | ingress throughput | +1.17 | [+1.12, +1.22] |
| | process_agent_standard_check | memory utilization | +0.67 | [+0.63, +0.70] |
| | process_agent_real_time_mode | memory utilization | +0.25 | [+0.21, +0.28] |
| | otel_to_otel_logs | ingress throughput | +0.18 | [-0.44, +0.81] |
| | idle | memory utilization | +0.09 | [+0.05, +0.12] |
| | trace_agent_json | ingress throughput | +0.01 | [-0.03, +0.05] |
| | trace_agent_msgpack | ingress throughput | +0.01 | [-0.01, +0.02] |
| | uds_dogstatsd_to_api | ingress throughput | +0.00 | [-0.00, +0.00] |
| | tcp_dd_logs_filter_exclude | ingress throughput | -0.00 | [-0.00, +0.00] |
| | file_tree | memory utilization | -0.17 | [-0.23, -0.10] |
| | file_to_blackhole | % cpu utilization | -0.26 | [-6.80, +6.29] |
| | process_agent_standard_check_with_stats | memory utilization | -0.39 | [-0.42, -0.35] |

Explanation

A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".

For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:

  1. Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.

  2. Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.

  3. Its configuration does not mark it "erratic".
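
The three criteria above can be read as a single boolean test. The following is a minimal sketch of that reading, not the Regression Detector's actual code; the `ExperimentResult` type and `is_regression` function are hypothetical names, while the sample numbers are taken from the tables in this comment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentResult:
    name: str
    delta_mean_pct: float  # estimated "Δ mean %"
    ci_low: float          # lower bound of "Δ mean % CI"
    ci_high: float         # upper bound of "Δ mean % CI"
    erratic: bool = False  # True when the experiment is configured as erratic


def is_regression(result, tolerance_pct=5.0):
    """Apply the three criteria: effect size, CI excluding zero, not erratic."""
    big_enough = abs(result.delta_mean_pct) >= tolerance_pct
    ci_excludes_zero = result.ci_low > 0 or result.ci_high < 0
    return big_enough and ci_excludes_zero and not result.erratic


# tcp_syslog_to_blackhole: the CI excludes zero, but |+1.17| is below the 5% tolerance.
tcp_syslog = ExperimentResult("tcp_syslog_to_blackhole", 1.17, 1.12, 1.22)
print(is_regression(tcp_syslog))  # False
```

This shows why an experiment whose interval sits entirely above zero can still go unflagged: the effect size criterion has to hold as well.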

@chouquette chouquette force-pushed the chouquette/omnibus_cache branch from 4784f52 to 216f58e Compare October 18, 2023 08:38
@chouquette chouquette force-pushed the chouquette/omnibus_cache branch 2 times, most recently from a532fb1 to 9c889d9 Compare November 9, 2023 10:42
@chouquette chouquette force-pushed the chouquette/omnibus_cache branch 4 times, most recently from 86eac62 to e9bdfc8 Compare January 9, 2024 13:02
@chouquette chouquette force-pushed the chouquette/omnibus_cache branch from fcfd7cc to 3545f3f Compare February 29, 2024 09:59
They are expected to change almost always and would keep invalidating
the cache.
Not caching them avoids regenerating the cache when there's
no need to, which saves the few minutes it takes to recreate the cache
bundle and upload it to S3
so that we can measure the results until it's merged & further worked on
@chouquette chouquette force-pushed the chouquette/omnibus_cache branch from 9f05afb to bae0667 Compare March 4, 2024 15:14

pr-commenter bot commented Mar 18, 2024

Test changes on VM

Use this command from test-infra-definitions to manually test this PR's changes on a VM:

inv create-vm --pipeline-id=31521461 --os-family=ubuntu


pr-commenter bot commented Mar 18, 2024

Regression Detector

Regression Detector Results

Run ID: 1eb5b777-6a43-4ad3-99e9-9242695e691e
Baseline: 92d6c0a
Comparison: 1e414d2

Performance changes are noted in the perf column of each table:

  • ✅ = significantly better comparison variant performance
  • ❌ = significantly worse comparison variant performance
  • ➖ = no significant change in performance

No significant changes in experiment optimization goals

Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%

There were no significant changes in experiment optimization goals at this confidence level and effect size tolerance.

Experiments ignored for regressions

Regressions in experiments with settings containing erratic: true are ignored.

| perf | experiment | goal | Δ mean % | Δ mean % CI |
|------|------------|------|----------|-------------|
| | file_to_blackhole | % cpu utilization | +0.24 | [-5.66, +6.15] |

Fine details of change detection per experiment

| perf | experiment | goal | Δ mean % | Δ mean % CI |
|------|------------|------|----------|-------------|
| | tcp_syslog_to_blackhole | ingress throughput | +1.32 | [+1.24, +1.39] |
| | process_agent_real_time_mode | memory utilization | +0.78 | [+0.74, +0.83] |
| | idle | memory utilization | +0.27 | [+0.23, +0.31] |
| | file_to_blackhole | % cpu utilization | +0.24 | [-5.66, +6.15] |
| | otel_to_otel_logs | ingress throughput | +0.23 | [-0.18, +0.64] |
| | process_agent_standard_check_with_stats | memory utilization | +0.23 | [+0.19, +0.28] |
| | trace_agent_json | ingress throughput | +0.03 | [-0.01, +0.07] |
| | file_tree | memory utilization | +0.01 | [-0.10, +0.13] |
| | trace_agent_msgpack | ingress throughput | +0.01 | [-0.01, +0.02] |
| | uds_dogstatsd_to_api | ingress throughput | +0.00 | [-0.20, +0.20] |
| | tcp_dd_logs_filter_exclude | ingress throughput | +0.00 | [-0.02, +0.02] |
| | uds_dogstatsd_to_api_cpu | % cpu utilization | -0.06 | [-3.11, +2.98] |
| | process_agent_standard_check | memory utilization | -0.22 | [-0.27, -0.17] |
| | basic_py_check | % cpu utilization | -1.62 | [-4.19, +0.95] |
| | pycheck_1000_100byte_tags | % cpu utilization | -2.47 | [-7.32, +2.39] |

@Pythyu Pythyu added changelog/no-changelog qa/no-code-change No code change in Agent code requiring validation labels Mar 19, 2024
for k, v in environment.items():
    print(f'\tUsing environment variable {k} to compute cache key')
    h.update(str.encode(f'{k}={v}'))
# FIXME: include omnibus-ruby and omnibus-software version once they are pinned
Contributor

I think this needs to be updated?

Contributor Author

Indeed, that's a rather old comment. Removed

Contributor Author

And as you mentioned on Slack, the FIXME was actually valid. I just pushed a commit that actually fixes it.

Thanks again for noticing
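
For readers following this thread, the cache-key loop under review can be sketched as a standalone function. This is an illustration under assumptions, not the code merged in this PR: the function name `compute_cache_key`, the `pinned_versions` parameter, the choice of SHA-256, and the sorting of entries are all hypothetical. Only the `h.update(str.encode(f'{k}={v}'))` pattern and the idea of including the omnibus-ruby and omnibus-software versions come from the snippet and FIXME above.

```python
import hashlib


def compute_cache_key(environment, pinned_versions):
    """Hash environment variables and pinned tool versions into one cache key."""
    h = hashlib.sha256()  # assumption: the thread does not show which hash is used
    # Sorting makes the key independent of dict insertion order (a choice made here).
    for k, v in sorted(environment.items()):
        print(f'\tUsing environment variable {k} to compute cache key')
        h.update(str.encode(f'{k}={v}'))
    # Hypothetical handling of the FIXME: fold pinned versions into the key
    # so that bumping omnibus-ruby or omnibus-software invalidates the cache.
    for name, version in sorted(pinned_versions.items()):
        h.update(str.encode(f'{name}={version}'))
    return h.hexdigest()
```

With this shape, two jobs that share the same environment and the same pinned versions produce the same key, and changing either input produces a different one.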

@chouquette chouquette force-pushed the chouquette/omnibus_cache branch from 16020bb to e9d2dad Compare April 16, 2024 09:38

dd-devflow bot commented Apr 16, 2024

⚠️ MergeQueue

This merge request was unqueued

If you need support, contact us on Slack #devflow!



def _get_omnibus_commits(field):
    release_version = os.environ['RELEASE_VERSION_7']
Contributor

I think you want to check RELEASE_VERSION, and then RELEASE_VERSION_7 if the first one is not found, because of the way we set these variables currently.

Note for later: we should standardize all builds, make them set RELEASE_VERSION explicitly, to avoid having to use this hack.

Contributor Author

Indeed, this was causing a failure on Windows.
I might have some questions about these variables during the summit; it's a bit unclear to me.
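
The fallback suggested in this thread (check RELEASE_VERSION first, then RELEASE_VERSION_7) could look like the sketch below. The helper name `get_release_version` is hypothetical, and the fix actually pushed to the branch may differ.

```python
import os


def get_release_version(env=None):
    """Prefer RELEASE_VERSION, fall back to RELEASE_VERSION_7."""
    env = os.environ if env is None else env
    # `or` also treats an empty string as "not found", which is a choice made here.
    version = env.get('RELEASE_VERSION') or env.get('RELEASE_VERSION_7')
    if not version:
        raise KeyError('neither RELEASE_VERSION nor RELEASE_VERSION_7 is set')
    return version
```

Raising when both variables are missing keeps the failure explicit instead of silently computing a cache key from an empty version.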

@@ -11,6 +11,7 @@ if NOT DEFINED GO_VERSION_CHECK set GO_VERSION_CHECK=%~4

set OMNIBUS_BUILD=omnibus.build
set OMNIBUS_ARGS=--python-runtimes "%PY_RUNTIMES%"
set INSTALL_DIR=opt\datadog-agent
Contributor

Did it work for the Windows build? Is it a new directory to be created? In which parent directory?

Contributor Author

@chouquette chouquette Apr 16, 2024

It worked, as this is only used to determine the subdirectory in which to locate the cache, so it would be created under the path provided by OMNIBUS_GIT_CACHE_DIR, which points to C:\TEMP\omnibus-git-cache by default on Windows.
However, I believe this is actually not needed. I removed it and will check the pipeline results.

@chouquette chouquette requested a review from KSerrania April 16, 2024 14:13
Contributor

@KSerrania KSerrania left a comment

LGTM. If you plan on merging this today, can you sync with @FlorentClarret? This may conflict with the Python linter changes he's making in #24590, so he'll have to rebase.

Comment on lines +151 to +153
buildimages_hash = _get_build_images(ctx)
for img_hash in buildimages_hash:
    h.update(str.encode(img_hash))
Contributor

Not needed for now, but for debugging purposes, you may want to log explicitly what buildimages entries go in the cache key, like you do for the environment variables.
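
A minimal sketch of that suggestion follows, mirroring the log line already used for environment variables. The function name `update_with_build_images` and the log wording are hypothetical; the list of hashes is passed in directly rather than obtained from `_get_build_images(ctx)`, whose return format is not shown here.

```python
import hashlib


def update_with_build_images(h, buildimages_hash):
    """Feed each build image hash into the running digest, logging every entry."""
    for img_hash in buildimages_hash:
        # Log each entry so a cache-key mismatch can be traced back to its inputs.
        print(f'\tUsing build image hash {img_hash} to compute cache key')
        h.update(str.encode(img_hash))
    return h
```

The digest is unchanged compared with the loop under review; only the logging is added.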

@chouquette
Contributor Author

/merge


dd-devflow bot commented Apr 17, 2024

🚂 MergeQueue

This merge request is not mergeable yet, because of pending checks/missing approvals. It will be added to the queue as soon as checks pass and/or get approvals.
Note: if you pushed new commits since the last approval, you may need additional approval.
You can remove it from the waiting list with /remove command.

Use /merge -c to cancel this operation!

@chouquette
Contributor Author

/merge


dd-devflow bot commented Apr 17, 2024

🚂 MergeQueue

This merge request is not mergeable yet, because of pending checks/missing approvals. It will be added to the queue as soon as checks pass and/or get approvals.
Note: if you pushed new commits since the last approval, you may need additional approval.
You can remove it from the waiting list with /remove command.

Use /merge -c to cancel this operation!

@chouquette
Contributor Author

/merge -c


dd-devflow bot commented Apr 17, 2024

⚠️ MergeQueue

This merge request was unqueued

If you need support, contact us on Slack #devflow!

@chouquette
Contributor Author

/merge


dd-devflow bot commented Apr 17, 2024

🚂 MergeQueue

This merge request is not mergeable yet, because of pending checks/missing approvals. It will be added to the queue as soon as checks pass and/or get approvals.
Note: if you pushed new commits since the last approval, you may need additional approval.
You can remove it from the waiting list with /remove command.

Use /merge -c to cancel this operation!

Contributor

@iliakur iliakur left a comment

seems fine from agent ints perspective, tho would be nice to get rid of the repetitive always_build true


dd-devflow bot commented Apr 17, 2024

🚂 MergeQueue

Pull request added to the queue.

This build is next! (estimated merge in less than 49m)

Use /merge -c to cancel this operation!

@dd-mergequeue dd-mergequeue bot merged commit d3f3164 into main Apr 17, 2024
189 checks passed
@dd-mergequeue dd-mergequeue bot deleted the chouquette/omnibus_cache branch April 17, 2024 14:38
@github-actions github-actions bot added this to the 7.54.0 milestone Apr 17, 2024
CelianR pushed a commit that referenced this pull request Apr 26, 2024
Co-authored-by: alopezz <[email protected]>
Co-authored-by: Pythyu <[email protected]>
alexgallotta pushed a commit that referenced this pull request May 9, 2024
Co-authored-by: alopezz <[email protected]>
Co-authored-by: Pythyu <[email protected]>
Labels
changelog/no-changelog qa/no-code-change No code change in Agent code requiring validation team/agent-build-and-releases BaRX