Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable CDC on DSE clusters #478

Merged
merged 13 commits into from
Jan 23, 2023
Merged

Conversation

Miles-Garnsey
Copy link
Member

What this PR does:

Ensures MCAC doesn't get enabled for DSE clusters.
Ensures the management API images come from the management API docker repository. These contain the CDC agent.

Which issue(s) this PR fixes:
Fixes #476

Checklist

  • Changes manually tested
  • Automated Tests added/updated
  • Documentation added/updated
  • CHANGELOG.md updated (not required for documentation PRs)
  • CLA Signed: DataStax CLA

@Miles-Garnsey Miles-Garnsey requested a review from a team as a code owner January 18, 2023 03:52
Copy link
Contributor

@emerkle826 emerkle826 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll leave approval to @burmanm, but I did have the one comment.

pkg/cdc/generate_config.go Outdated Show resolved Hide resolved
README.md Outdated
@@ -158,7 +158,7 @@ defaults:
cassandra:
repository: "k8ssandra/cass-management-api"
dse:
repository: "datastax/dse-server"
repository: "datastax/dse-mgmtapi-6_8"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This requires upgrade test to verify that existing clusters do not break, lets say 6.8.9 installations.

@@ -14,7 +14,7 @@ When leveraging a single endpoint ingress / load balancer we lose the ability to
k3d load image --cluster k3s-default \
datastax/cass-operator:1.3.0 \
datastax/cass-config-builder:1.0-ubi7 \
datastax/dse-server:6.8.4-ubi7 \
datastax/dse-mgmtapi-6_8:6.8.4-ubi7 \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This image does not exists.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated this so we're talking about a recent version. The UBI will exist tomorrow :)

@burmanm
Copy link
Contributor

burmanm commented Jan 18, 2023

cdc_successful_test must be modified to test with DSE also (and updated in the .github configs).

@burmanm
Copy link
Contributor

burmanm commented Jan 18, 2023

And of course CHANGELOG ;)

@Miles-Garnsey
Copy link
Member Author

Thanks @burmanm, I've made the requested changes.

@burmanm
Copy link
Contributor

burmanm commented Jan 18, 2023

No, not yet. The same integration test must work in DSE & OSS (just like we do for most tests), not separate one. There's a method that takes input the version and that should be used. No separate testfiles needed or maintaining of multiple tests per version.

@Miles-Garnsey
Copy link
Member Author

No, not yet. The same integration test must work in DSE & OSS (just like we do for most tests), not separate one. There's a method that takes input the version and that should be used. No separate testfiles needed or maintaining of multiple tests per version.

I don't know about that. The problem is, we need to delete Pulsar to ensure the two tests don't interfere with each other. Otherwise Pulsar will still have the data from the previous test which might result in inaccurate test results.

Also, if we do the tests serially, they will be very slow due to Pulsar taking a while to roll out (although image pulls might be reduced for Pulsar components, which could make things faster.. IDK.)

@burmanm
Copy link
Contributor

burmanm commented Jan 18, 2023

I don't know about that. The problem is, we need to delete Pulsar to ensure the two tests don't interfere with each other. Otherwise Pulsar will still have the data from the previous test which might result in inaccurate test results.

This is not what I meant. The test code must be the same (cdc_successful_test), the execution in different VM is done by the GHA. Just like all the other tests behave, there's a separate code that modifies the input file to match the correct serverVersion and serverType.

@Miles-Garnsey
Copy link
Member Author

This is not what I meant. The test code must be the same (cdc_successful_test), the execution in different VM is done by the GHA. Just like all the other tests behave, there's a separate code that modifies the input file to match the correct serverVersion and serverType.

I'm not following what you mean sorry. The parallelisation happens at the level of the test name doesn't it? And that means that we need to have two test files? Or am I wrong?

Maybe you can show me an example of the way you'd like it structured from another test.

@burmanm
Copy link
Contributor

burmanm commented Jan 18, 2023

I'm not following what you mean sorry. The parallelisation happens at the level of the test name doesn't it? And that means that we need to have two test files? Or am I wrong?

You are, the parallelisation between different settings happens in the GHA. Most of the other tests are all working this way, just look at for example https://github.com/k8ssandra/cass-operator/blob/master/tests/test_all_the_things/test_all_the_things_suite_test.go#L62

That method will return the deployment file as DSE / Cass (3.11/4.0.x/4.1.x ..). Those are then different tests to be run.

@burmanm
Copy link
Contributor

burmanm commented Jan 18, 2023

And separate note, this PR can't be merged as is, the image_config.yaml can't change in the master branch. Otherwise it will break clusters that are running older versions that our images do not support.

@emerkle826
Copy link
Contributor

@Miles-Garnsey I wanted to try to clarify a few things about the tests.

You added a new test, cdc_successful_dse, but if you look at the GitHub Actions test run here, you'll see that it actually ran using Cassandra 4.0.7, not DSE.

It looks like you cloned the existing cdc_successful test to make this new test, and in doing so, you copied the code that dynamically changes the test fixtures to specify the server version that is run in the GitHub Action matrix. The utility function is CreateTestFile, and it will modify the test fixture yaml file passed in based on environment variables set when the tests runs.

The environment variables that can be set are M_SERVER_VERSION, M_SERVER_TYPE and M_SERVER_IMAGE. If any of these are set when the test runs, calls to CreateTestFile will replace the test fixture data with those overrides. So you don't need to add that test, you need to put it in one of the GitHub Actions jobs that has DSE in the test matrix. That said, there isn't a spot where you can add cdc_successful as it should only run against Cassandra 4 and DSE. So you might need to create a new job with a matrix that specifies only those two server versions and only the cdc_successful integration_test.

Also, to clarify a little of what @burmanm is talking about:

the image_config.yaml can't change in the master branch. Otherwise it will break clusters that are running older versions that our images do not support.

The main issue here is that changing the default DSE image repo from datastax/dse-server to datastax/dse-mgmtapi-6_8 will break for anyone currently using DSE versions older than 6.8.25. That is because the datastax/dse-mgmtapi-6_8 repo does not have any DSE images for versions older than 6.8.25. If someone is currently using cass-operator with DSE 6.8.9, upgrading with this change will fail to find that image since it doesn't exist.

A possible solution to this last point would be to publish older DSE images (all that cass-operator supports) to the datastax/dse-mgmtapi-6_8 repo. But I imagine that would be every DSE version from 6.8.25 back to 6.8.9. So a lot of images to publish. And I don't think it would be feasible to effectively test each of those versions to make sure things work as they currnetly do with the datastax/dse-server repo images.

@Miles-Garnsey
Copy link
Member Author

Miles-Garnsey commented Jan 19, 2023

@emerkle826 and I have had a chat about this and he's walked me through some of your comments @burmanm.

Test versions/serverTypes

@emerkle826, thanks for explaining this, it looks like a bunch of changes were made a few months ago and the CDC test no longer applies the manifests verbatim. I'll do as you suggested and update the workflow instead of duplicating the tests.

And separate note, this PR can't be merged as is, the image_config.yaml can't change in the master branch. Otherwise it will break clusters that are running older versions that our images do not support.

Image registry change

Erik has provided some explanation of this above. It seems that the issue is that users may not have the image set, and (given we only publish 6.8.25+ via the management api image registry) a customer who has no image set but has the serverVersion set at e.g. 6.8.5 would experience problems due to the change in repo that we're implementing.

There are two solutions to this problem in my view:

  1. Make a note in the changelog that users on versions < 6.8.25 need to explicitly set their image coordinates (or ImageConfig) prior to upgrade.
  2. Publish images all the way back to 6.8.9 (or whatever the earliest version cass-operator's next release supports, I can't even find this information) in the management API registry so that the move is transparent to users. We'd also likely need to develop tests to ensure that all of these images work as expected.

I think that there is a consensus that we need to be building our own Docker images for DSE if we're going to be able to implement features like this.

@adejanovski
Copy link
Contributor

  • Make a note in the changelog that users on versions < 6.8.25 need to explicitly set their image coordinates (or ImageConfig) prior to upgrade.
  • Publish images all the way back to 6.8.9 (or whatever the earliest version cass-operator's next release supports, I can't even find this information) in the management API registry so that the move is transparent to users. We'd also likely need to develop tests to ensure that all of these images work as expected.

I'm not crazy about the former, we can't assume that folks will read the release notes before upgrading.
The latter would have my preference, but don't we have alternative solutions?
If we managed to implement something like "before 6.8.25 use datastax/dse-server and after use k8ssandra/dse-mgmt...." it would allow a smooth transition without the need for users to do something specific or us having to rebuild a whole lot of images.
@burmanm, do you think that would be feasible without spending too much time on it?

@burmanm
Copy link
Contributor

burmanm commented Jan 19, 2023

If we managed to implement something like "before 6.8.25 use datastax/dse-server and after use k8ssandra/dse-mgmt...." it would allow a smooth transition without the need for users to do something specific or us having to rebuild a whole lot of images.

That can be done in the images.go, but I'd say it should almost be 6.8.31 only automatic, etc. Whenever we do the decided cut. Until then users might have deployed clusters with older versions, which we would then have to test for compatibility.

In any case, not for next release (1.14.0).

@emerkle826
Copy link
Contributor

We do now have Management API images for DSE based on RedHat UBI images:
https://hub.docker.com/r/datastax/dse-mgmtapi-6_8/tags

@burmanm
Copy link
Contributor

burmanm commented Jan 19, 2023

As discussed today, the approach we will take is documented in #479. This will be a separate PR (target release 1.15.0). From this PR, the image_config.yaml as well as image.go changes should be removed.

@adejanovski
Copy link
Contributor

As discussed today, the approach we will take is documented in #479. This will be a separate PR (target release 1.15.0). From this PR, the image_config.yaml as well as image.go changes should be removed.

+1, let's do this

@@ -3,7 +3,7 @@ apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
annotations:
controller-gen.kubebuilder.io/version: v0.10.0
controller-gen.kubebuilder.io/version: v0.8.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrong controller-gen

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. It's supposed to be 0.10.0 (that PR was done 3 months ago). You're recreating these with 0.8.0.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, right, sorry! Fixed.

Those files get generated via kubebuilder don't they? I ran make controller-gen because I know that is local to the repo, but we use a system level kubebuilder in this repo don't we? Maybe we should be using a local one to avoid this problem.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's because the make targets use a local version of controller-gen which is under ./bin and is only re-downloaded if it's not already in that folder. You may have had an older version there, which can be fixed by just deleting the binary and re-running the make target.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it turns out I'd only updated the one in k8ssandra-operator, not cass-operator. My bad, apologies.

@@ -91,9 +91,12 @@ func updateAdditionalJVMOpts(optsSlice []string, CDCConfig *cassdcapi.CDCConfigu
}
}
CDCOpt := fmt.Sprintf("-javaagent:%s=%s", "/opt/cdc_agent/cdc-agent.jar", strings.Join(optsSlice, ","))
// We want to disable MCAC when the server being deployed is DSE
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also want to remove MCAC agent when MGMT_API_DISABLE_MCAC is set (like done in https://github.com/k8ssandra/management-api-for-apache-cassandra/blob/master/scripts/docker-entrypoint.sh#L43)

Does this process mess up with that logic?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe, but that's a problem to solve under that PR, not this one IMO. That one is still under development anyway.

Copy link
Member Author

@Miles-Garnsey Miles-Garnsey Jan 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I meant this PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I created #481 to track this. We need it also.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(separate PR)

- upgrade_operator # Test is not setup to run against 4.0
- additional_seeds #TODO: Fails against C* 4.0, fix in https://github.com/k8ssandra/cass-operator/issues/459
- scale_down_unbalanced_racks #TODO: Fails against C* 4.0 and DSE 6.8, fix in https://github.com/k8ssandra/cass-operator/issues/459
- upgrade_operator # Test is not setup to run against 4.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's up with these mass-reindents?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indentation has been pretty inconsistent so I've turned on Trim Trailing Whitespace and Yaml › Format: Enable in vs code, I think everyone should.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also reindented all our CRDs. And those are now being regenerated by the make manifests. So we have 10k line commits doing reindents and I have no idea if anything was changed in them (I just assumed they were re-created after changing the controller-gen version). We can't reindent stuff with these tools if they don't understand what's generated.

@adejanovski adejanovski mentioned this pull request Jan 20, 2023
5 tasks
@@ -299,7 +299,7 @@ jobs:
- name: set environment for DSE
if: ${{ startsWith( matrix.version, '6.8' ) }}
run: |
echo "M_SERVER_IMAGE=datastax/dse-mgmtapi-6_8:${{ matrix.version }}-jdk8" >> $GITHUB_ENV
echo "M_SERVER_IMAGE=datastax/dse-server:${{ matrix.version }}-jdk8" >> $GITHUB_ENV
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.. this isn't right. It should be in the mgmtapi-6_8 still.

@burmanm burmanm merged commit 270521c into k8ssandra:master Jan 23, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Enable CDC agent support for DSE
4 participants