Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove ingress and node-services during reconcile #674

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

janhoy
Copy link
Contributor

@janhoy janhoy commented Jan 3, 2024

Fixes #673

DRAFT, code largely generated by GitHub copilot, just throwing up the idea

@janhoy
Copy link
Contributor Author

janhoy commented Feb 20, 2024

Any takers for a review? It works when testing locally :)

Copy link
Contributor

@HoustonPutman HoustonPutman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me.

The tests already pass without the PR (we delete all services between tests), so I would hope they would pass with the PR. The way to actually test this would be to make a cloud with node services, switch it to just use the headless and make sure that the node services no longer exist. Not sure that's required, but it likely would be good to test. I think we saw issues in a previous release when trying to delete resources automatically.

controllers/solrcloud_controller.go Show resolved Hide resolved
controllers/solrcloud_controller.go Show resolved Hide resolved
@janhoy janhoy marked this pull request as ready for review February 26, 2024 23:06
@janhoy
Copy link
Contributor Author

janhoy commented Feb 27, 2024

switch it to just use the headless and make sure that the node services no longer exist.

I might take a stab at such a test one day. Extending one of the existing, so after verifying that node-services are there, call an "update" on the CRD, then wait for reconsile to happen, and check that they are gone. Is there some test method that waits for reconcile or do we need to do polling with timeout?

@HoustonPutman
Copy link
Contributor

switch it to just use the headless and make sure that the node services no longer exist.

I might take a stab at such a test one day. Extending one of the existing, so after verifying that node-services are there, call an "update" on the CRD, then wait for reconsile to happen, and check that they are gone. Is there some test method that waits for reconcile or do we need to do polling with timeout?

If you look at the existing unit tests, that's exactly how all of the Eventually...() functions work. They are eventually consistent checks. There should be some similar tests around changing state halfway through, such as creating necessary secrets for basicAuth after the SolrCloud has already been created. I can help you with the test, but obviously if you get started first, it'll help better with the learning 🙂

@HoustonPutman
Copy link
Contributor

Btw this is going to be impacted by the work in #692 . The two approaches should not necessarily conflict with eachother though. We might not want to prune old node services too heavily, in case the user is using autoscaling. At that point it's likely better to keep the old ones around. (which will result in less rolling restarts due to the hostAliases being updated when the new node services are created.)

janhoy added 2 commits May 6, 2024 14:16
# Conflicts:
#	controllers/solrcloud_controller.go
#	helm/solr-operator/Chart.yaml
@gerlowskija
Copy link
Contributor

Trying to follow up on some of these older PRs. AFAICT from the discussion above, the main thing holding up this PR is a test like the one Houston described in his earlier comment? Is that right @janhoy ?

@janhoy
Copy link
Contributor Author

janhoy commented Dec 7, 2024

Suppose so. I think I got stuck with the test part. Anyone may feel free to grab this, I won't pick it up soonish.

@gerlowskija
Copy link
Contributor

Alright, rather than introducing a net-new test, I was able to add some assertions into solrcloud_scaling_test.go that should validate the functionality here. There's still something not quite right with the test assertions (perhaps we're not waiting long enough to see the pod-services disappear?), but it's a start.

In the meantime, we should decide how we want this behavior to interact with the work in #692. One idea that Houston suggested offline is that rather than deleting orphaned services right away, we could instead give them an annotation that indicates that they should be deleted after X time (say, 30 minutes?). What do we all think about that?

@HoustonPutman
Copy link
Contributor

Ahhh @gerlowskija , I think we were misunderstanding. Also misunderstanding with regards to the test. This is only deleting services/ingresses if the entire externalAddressibility feature is changed. Not when the solrcloud is scaled up/down. This testing can probably be done in the unit tests, and we have no concerns around autoscaling.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Operator never deletes ingress or per-node services
3 participants