CAPV: Release v29.0.0. #1459

Merged
Gacko merged 1 commit into master from capv-29 on Nov 2, 2024

Conversation

njuettner
Member

@njuettner njuettner commented Oct 22, 2024

Towards: giantswarm/roadmap#3710

Checklist

  • Roadmap issue created
  • Release uses latest stable Flatcar
  • Release uses latest Kubernetes patch version

Triggering E2E tests

To trigger the E2E test for each new Release added in this PR, add a comment with the following:

/run releases-test-suites

If you want to trigger conformance tests, you can do so by adding a comment similar to the following:

/run conformance-tests PROVIDER=capa RELEASE_VERSION=29.1.0

For more details see the README.md.


@Gacko Gacko changed the title from "Release: CAPV v29.0.0." to "CAPV: Release v29.0.0." Oct 22, 2024
@Gacko Gacko force-pushed the capv-29 branch 2 times, most recently from 583767c to d5c9eba October 22, 2024 16:44
@giantswarm giantswarm deleted a comment from tinkerers-ci bot Oct 22, 2024
@giantswarm giantswarm deleted a comment from tinkerers-ci bot Oct 22, 2024
@giantswarm giantswarm deleted a comment from tinkerers-ci bot Oct 22, 2024
@Gacko Gacko marked this pull request as ready for review October 22, 2024 18:15
@Gacko Gacko requested a review from a team as a code owner October 22, 2024 18:15
@giantswarm giantswarm deleted a comment from tinkerers-ci bot Oct 22, 2024
@giantswarm giantswarm deleted a comment from tityosbot Oct 22, 2024
Review thread on vsphere/v29.0.0/release.yaml (outdated, resolved)
Review thread on vsphere/v29.0.0/release.diff (outdated, resolved)
Review thread on vsphere/v29.0.0/README.md (outdated, resolved)
@Gacko Gacko force-pushed the capv-29 branch 2 times, most recently from 070d09e to 2ff2c65 October 23, 2024 15:33
@giantswarm giantswarm deleted a comment from tinkerers-ci bot Oct 23, 2024
@giantswarm giantswarm deleted a comment from tinkerers-ci bot Oct 23, 2024
Member

@TheoBrigitte TheoBrigitte left a comment

Can we have observability-bundle 1.7.0 as part of this release?

Review thread on vsphere/v29.0.0/README.md (resolved)
Review thread on vsphere/v29.0.0/README.md (resolved)
Review thread on vsphere/v29.0.0/release.diff (resolved)
Review thread on vsphere/v29.0.0/release.yaml (resolved)
@Gacko Gacko force-pushed the capv-29 branch 3 times, most recently from b909de4 to 875a40f October 24, 2024 18:19

@Gacko
Member

Gacko commented Oct 24, 2024

@vxav: Tests are failing because the image hasn't been copied to vSphere yet. Can you make sure the image for Flatcar 3975.2.2, Kubernetes 1.29.10 and OS Tooling 1.20.1 is present? Thank you!


@njuettner
Member Author

> @vxav: Tests are failing because the image hasn't been copied to vSphere yet. Can you make sure the image for Flatcar 3975.2.2, Kubernetes 1.29.10 and OS Tooling 1.20.1 is present? Thank you!

Copied it 👍🏻


@Gacko
Member

Gacko commented Oct 31, 2024

The same observability-bundle version used in this PR is also being used in already released WC releases. They worked perfectly fine when they got released and we didn't have any issues with tests.

Now it seems like you recently introduced a change to the logging-operator which obviously affects the observability-bundle in existing WC releases and now needs to be fixed to make these releases work again.

As a customer I'd expect these releases to be tested and working, so cluster creation shouldn't break out of nowhere.

Can you please elaborate on what has been changed in logging-operator and how this affects existing WC releases? I'd expect these releases to be stable and immutable, and implementing changes in an MC operator that alter the behavior of observability-bundle in existing releases definitely breaks this contract.


@Gacko
Member

Gacko commented Oct 31, 2024

/run releases-test-suites TARGET_SUITES=./providers/capv/standard PREVIOUS_RELEASE=28.0.1 TARGET_RELEASES=vsphere-29.0.0

@tinkerers-ci

tinkerers-ci bot commented Oct 31, 2024

releases-test-suites

Run name: pr-releases-1459-releases-test-suites9rnx2
Commit SHA: a9620c4
Result: Succeeded ✅

📋 View full results in Tekton Dashboard

Rerun trigger:
/run releases-test-suites


Tip

To re-run only the failed test suites, you can provide a TARGET_SUITES parameter with your trigger that points to the directory path of the test suites to run, e.g. /run releases-test-suites TARGET_SUITES=./providers/capa/standard to re-run the CAPA standard test suite. This supports multiple test suites, with each path separated by a comma.

Alternatively, or in addition, you can specify TARGET_RELEASES to trigger tests for specific releases, e.g. /run releases-test-suites TARGET_SUITES=./providers/capa/standard TARGET_RELEASES=aws-25.0.0-test.1
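
The trigger syntax described in the tip above can be read as a command name followed by KEY=VALUE parameters. The following is a minimal Python sketch of how such a comment could be split into its parts. It is not the actual tinkerers-ci implementation: the function name `parse_trigger` and the returned dictionary layout are assumptions made for illustration, and only TARGET_SUITES is split on commas because that is the only parameter the tip documents as accepting multiple values.

```python
def parse_trigger(comment: str) -> dict:
    """Parse a '/run <pipeline> KEY=VALUE ...' comment into its parts.

    Hypothetical helper for illustration only; not the bot's real parser.
    """
    tokens = comment.strip().split()
    if len(tokens) < 2 or tokens[0] != "/run":
        raise ValueError("expected '/run <pipeline> [KEY=VALUE ...]'")

    params = {}
    for token in tokens[2:]:
        key, sep, value = token.partition("=")
        if not sep or not key:
            raise ValueError(f"malformed parameter: {token!r}")
        params[key] = value

    # The tip states that TARGET_SUITES accepts several comma-separated paths.
    suites = [s for s in params.pop("TARGET_SUITES", "").split(",") if s]

    return {"pipeline": tokens[1], "target_suites": suites, "params": params}


trigger = parse_trigger(
    "/run releases-test-suites "
    "TARGET_SUITES=./providers/capv/standard,./providers/capv/upgrade "
    "PREVIOUS_RELEASE=28.0.1 TARGET_RELEASES=vsphere-29.0.0"
)
print(trigger["pipeline"])
print(trigger["target_suites"])
print(trigger["params"])
```

The remaining parameters (PREVIOUS_RELEASE, TARGET_RELEASES) are kept as plain strings here, since the thread does not say how the bot interprets them beyond the examples shown.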

@njuettner njuettner requested review from a team November 1, 2024 09:43
@njuettner
Member Author

@giantswarm/team-rocket I think we can finally move on, if you could take another look?

@Gacko
Member

Gacko commented Nov 1, 2024

/run releases-test-suites TARGET_SUITES=./providers/capv/upgrade PREVIOUS_RELEASE=28.0.1 TARGET_RELEASES=vsphere-29.0.0

@tinkerers-ci

tinkerers-ci bot commented Nov 1, 2024

releases-test-suites

Run name: pr-releases-1459-releases-test-suitesr6lqz
Commit SHA: a9620c4
Result: Failed ❌

📋 View full results in Tekton Dashboard

Rerun trigger:
/run releases-test-suites



@njuettner
Member Author

/run releases-test-suites TARGET_SUITES=./providers/capv/upgrade PREVIOUS_RELEASE=28.0.1 TARGET_RELEASES=vsphere-29.0.0

@tinkerers-ci

tinkerers-ci bot commented Nov 1, 2024

releases-test-suites

Run name: pr-releases-1459-releases-test-suites2cwrg
Commit SHA: a9620c4
Result: Failed ❌

📋 View full results in Tekton Dashboard

Rerun trigger:
/run releases-test-suites



@QuentinBisson
Contributor

/run releases-test-suites TARGET_SUITES=./providers/capv/upgrade PREVIOUS_RELEASE=28.0.1 TARGET_RELEASES=vsphere-29.0.0

@tinkerers-ci

tinkerers-ci bot commented Nov 2, 2024

releases-test-suites

Run name: pr-releases-1459-releases-test-suitesg68nr
Commit SHA: a9620c4
Result: Failed ❌

📋 View full results in Tekton Dashboard

Rerun trigger:
/run releases-test-suites



@QuentinBisson
Contributor

/run releases-test-suites TARGET_SUITES=./providers/capv/upgrade PREVIOUS_RELEASE=28.0.1 TARGET_RELEASES=vsphere-29.0.0

@Gacko
Member

Gacko commented Nov 2, 2024

@QuentinBisson or someone else from @giantswarm/team-atlas: Can you please reply to this? It would be very helpful, even if only for documentation. Also I'd be interested in what has changed between the different runs of Releases Test Suites as I'd prefer to see them reliably fixed instead of having them pass once out of ten. 🙂

@tinkerers-ci

tinkerers-ci bot commented Nov 2, 2024

releases-test-suites

Run name: pr-releases-1459-releases-test-suitesq2gvn
Commit SHA: a9620c4
Result: Succeeded ✅

📋 View full results in Tekton Dashboard

Rerun trigger:
/run releases-test-suites



@QuentinBisson
Contributor

QuentinBisson commented Nov 2, 2024

@Gacko I'll write something on Monday. I wanted to focus on fixing this first, but I did not forget your message :)

@Gacko
Member

Gacko commented Nov 2, 2024

Ok, thank you!

I'll run the standard tests one last time and merge this PR once they pass.

/run releases-test-suites TARGET_SUITES=./providers/capv/standard TARGET_RELEASES=vsphere-29.0.0

@QuentinBisson
Contributor

QuentinBisson commented Nov 2, 2024

There's currently a fixed branch of the logging operator on gcapeverde so tests should work 🤞🏻

@QuentinBisson
Contributor

I was going to run them again anyway 😅

@tinkerers-ci

tinkerers-ci bot commented Nov 2, 2024

releases-test-suites

Run name: pr-releases-1459-releases-test-suitesp5rrg
Commit SHA: a9620c4
Result: Succeeded ✅

📋 View full results in Tekton Dashboard

Rerun trigger:
/run releases-test-suites



@Gacko Gacko added the skip/ci label (Instructs PR Gatekeeper to ignore any required PR checks) Nov 2, 2024
@Gacko Gacko merged commit 1329100 into master Nov 2, 2024
5 checks passed
@Gacko Gacko deleted the capv-29 branch November 2, 2024 16:36
@QuentinBisson
Contributor

> The same observability-bundle version used in this PR is also being used in already released WC releases. They worked perfectly fine when they got released and we didn't have any issues with tests.
>
> Now it seems like you recently introduced a change to the logging-operator which obviously affects the observability-bundle in existing WC releases and now needs to be fixed to make these releases work again.
>
> As a customer I'd expect these releases to be tested and working, so cluster creation shouldn't break out of nowhere.
>
> Can you please elaborate on what has been changed in logging-operator and how this affects existing WC releases? I'd expect these releases to be stable and immutable, and implementing changes in an MC operator that alter the behavior of observability-bundle in existing releases definitely breaks this contract.

We are indeed configuring the observability-platform apps in releases via operators. We initially built the logging and observability operators as a safety mechanism so that we could change some of our apps' config on the fly (for prometheus-agent and so on). This was meant to counteract the lack or slowness of customer upgrades in the past: we were swarmed with alerts day and night, which was insufferable, and waiting for a customer to upgrade was a no-go.

We also used this mechanism to build some features on our apps, such as:

  • sharding of the monitoring agent based on the metrics in Prometheus and Mimir (because doing this otherwise would require someone to manage KEDA on WCs, and no one wants to)
  • secret management
  • dynamic enabling/disabling of logging and monitoring at the cluster level

This used to work quite well, but we recently enabled a feature flag on the logging-operator that replaced promtail with alloy in the observability platform (giantswarm/logging-operator#246), which caused issues last week.

This change had been manually tested beforehand, but the testing missed that the grafana-agent application was failing to deploy (because of an issue with its CRD management in our currently deployed release of it), which broke the cluster creation test.

In the meantime, a breaking configuration change in the alloy secret management was introduced in the observability-bundle, and the change was not properly reflected in the logging-operator. This caused the upgrade test to fail: the secret for alloy was not created, so alloy in CAPV 28 did not actually deploy :(

Last week, this became problematic and we are definitely sorry about all of this :(. I am opening a PM today so we can investigate how to move forward without so much operator work going on under the hood (we need some config coming from MCs, like the secret to talk to Loki and so on, but not as much as we have today). Our priority should be to find something that does not break any existing releases.
Would you be up for a discussion to find out how we could integrate better?

By the way, we've been having discussions about this topic for years now and I really thought everyone was aware of it. We really need to find a better way to move forward (cc @JosephSalisbury) and that will require improvements on the release and delivery process as well :)

6 participants