[DPE-5637, DPE-5276] Implement expose-external config option with values false (clusterip), nodeport and loadbalancer #328
Conversation
src/kubernetes_charm.py
Outdated
# Delete and re-create until https://bugs.launchpad.net/juju/+bug/2084711 resolved
if service_exists:
    logger.info(f"Deleting service {service_type=}")
    self._lightkube_client.delete(
        res=lightkube.resources.core_v1.Service,
        name=self._service_name,
        namespace=self.model.name,
    )
    logger.info(f"Deleted service {service_type=}")

logger.info(f"Applying service {service_type=}")
self._lightkube_client.apply(service, field_manager=self.app.name)
logger.info(f"Applied service {service_type=}")
Did you find more information about this? I don't think we should need to delete and re-create the service.
Filed a Juju bug: https://bugs.launchpad.net/juju/+bug/2084711, which has been triaged.
Essentially, we have included deletion + re-creation of the service as a workaround until we get help from Juju to determine what may be happening.
It seems like this might be a misconfiguration of MetalLB and not a Juju bug; I don't see how patching a K8s service not created by Juju would cause the Juju CLI to have issues.
Did you try the multiple IPs that @taurus-forever mentioned?
I did try using multiple IPs for MetalLB; unfortunately, that did not work. Additionally, in the bug report, I was able to confirm the issue using microk8s.kubectl without any charm code.
Did you test with EKS or GKE?
No, I have not yet tested with EKS or GKE. I don't believe that testing on these platforms should necessarily be a blocker for this PR.
We tested on AKS, and this issue did not manifest there.
> this issue did not manifest itself in AKS

It sounds like it might be a MetalLB + MicroK8s issue then? In that case, I think we should consider patching the service instead of deleting + re-creating.
I agree, but we are unable to run integration tests without deleting and re-creating (the tests experience a flicker of the Juju client and fail with a `Bad file descriptor` error). @paulomach please share your thoughts when you are able.
Like @shayancanonical, I validated the flaky behavior independently of a charm, but saw no issue in AKS.
Independently, this does not seem to be an issue in the charm/lightkube.
So let's not block the PR, since this can be refactored once we have a better understanding or a fix.
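For reference, a patch-based alternative to the delete + re-create workaround could look roughly like the sketch below. It uses lightkube's server-side apply patch and assumes the same `_lightkube_client`, `_service_name`, and `service` objects as in the diff above; it is not the charm's actual code.

```python
# Sketch only: patch the existing service in place instead of deleting it first.
import lightkube.resources.core_v1
from lightkube.types import PatchType


def _reconcile_service(self, service) -> None:
    # Server-side apply updates the service if it already exists and creates it
    # otherwise, so no explicit delete is required.
    self._lightkube_client.patch(
        res=lightkube.resources.core_v1.Service,
        name=self._service_name,
        obj=service,
        namespace=self.model.name,
        patch_type=PatchType.APPLY,
        field_manager=self.app.name,
    )
```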
.github/workflows/ci.yaml
Outdated
-uses: canonical/data-platform-workflows/.github/workflows/integration_test_charm.yaml@v22.0.0
+uses: canonical/data-platform-workflows/.github/workflows/integration_test_charm.yaml@feature/metallb
Guessing this is temporary for testing? Is there a dpw PR that needs review?
Yes, will open a PR in dpw shortly.
def _get_node_hosts(self) -> list[str]:
    """Return the node ports of nodes where units of this app are scheduled."""
    peer_relation = self.model.get_relation(self._PEER_RELATION_NAME)
How was this done before without the peer relation?
Should `self._*endpoint` be renamed to `endpoints`?
Prior to this PR, we were only providing the `host:node_port` of one unit in the application. Upon discussion, we realized that we would need to provide all of the `host:node_port`s where the units are scheduled. We are unable to determine the nodes where units are deployed without the peer relation, which provides all the available/active units.
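To illustrate the point above, here is a rough sketch of collecting one host per scheduled unit through the peer relation. The pod/node lookups and the pod-naming convention are assumptions for illustration, not the exact charm code.

```python
# Sketch only: gather the node host of every unit listed in the peer relation.
import lightkube.resources.core_v1


def _get_node_hosts(self) -> list[str]:
    peer_relation = self.model.get_relation(self._PEER_RELATION_NAME)
    if not peer_relation:
        return []
    hosts = set()
    # The peer relation contains all other active units; include this unit too.
    for unit in (*peer_relation.units, self.unit):
        # Assumed StatefulSet pod naming, e.g. "mysql-router-k8s/0" -> "mysql-router-k8s-0"
        pod_name = unit.name.replace("/", "-")
        pod = self._lightkube_client.get(
            lightkube.resources.core_v1.Pod, name=pod_name, namespace=self.model.name
        )
        node = self._lightkube_client.get(
            lightkube.resources.core_v1.Node, name=pod.spec.nodeName
        )
        addresses = {address.type: address.address for address in node.status.addresses}
        # Prefer an externally reachable address when the node reports one.
        hosts.add(addresses.get("ExternalIP") or addresses.get("InternalIP"))
    return sorted(hosts)
```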
.github/workflows/ci.yaml
Outdated
@@ -96,7 +96,7 @@ jobs:
      - lint
      - unit-test
      - build
-    uses: canonical/data-platform-workflows/.github/workflows/integration_test_charm.yaml@v23.0.2
+    uses: canonical/data-platform-workflows/.github/workflows/integration_test_charm.yaml@feature/metallb
Please revert before merging.
src/kubernetes_charm.py
Outdated
@@ -51,11 +61,18 @@
class KubernetesRouterCharm(abstract_charm.MySQLRouterCharm):
    """MySQL Router Kubernetes charm"""

    _PEER_RELATION_NAME = "mysql-router-peers"
    _SERVICE_PATCH_TIMEOUT = 5 * 60
Sleeping up to 5 minutes might produce more issues.
Do we have other options here? Time for Pebble notices?
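For context, assuming `_SERVICE_PATCH_TIMEOUT` bounds a poll on the service rather than a plain sleep, the wait could be sketched with tenacity along these lines (the names and the readiness condition are assumptions, not the charm's actual code):

```python
# Sketch only: poll the service until it is reconciled, bounded by the timeout.
import lightkube.resources.core_v1
import tenacity


@tenacity.retry(
    stop=tenacity.stop_after_delay(5 * 60),  # mirrors _SERVICE_PATCH_TIMEOUT
    wait=tenacity.wait_fixed(5),
    reraise=True,
)
def _wait_until_service_reconciled(self) -> None:
    service = self._lightkube_client.get(
        lightkube.resources.core_v1.Service,
        name=self._service_name,
        namespace=self.model.name,
    )
    # Example readiness check for a LoadBalancer service: an ingress address exists.
    if not (
        service.status
        and service.status.loadBalancer
        and service.status.loadBalancer.ingress
    ):
        raise RuntimeError("Service not yet reconciled")
```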
I like the tests. If feasible, it would be nice to have proper config validation, but that can be done later given the timing we want to achieve.
There are some other non-blocking comments.
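If the config validation suggested here is added later, one possible shape is an enum over the allowed values; this is only a sketch with assumed names, not the charm's implementation:

```python
# Sketch only: validate the expose-external config value against the allowed set.
import enum


class ExposeExternal(str, enum.Enum):
    FALSE = "false"  # ClusterIP service
    NODEPORT = "nodeport"  # NodePort service
    LOADBALANCER = "loadbalancer"  # LoadBalancer service


def parse_expose_external(value: str) -> ExposeExternal:
    try:
        return ExposeExternal(value)
    except ValueError:
        raise ValueError(
            f"Invalid expose-external={value!r}; expected 'false', 'nodeport' or 'loadbalancer'"
        )
```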
expose-external:
  description: |
    String to determine how to expose the MySQLRouter externally from the Kubernetes cluster.
    Possible values: 'false', 'nodeport', 'loadbalancer'
Nit: the default `false` may imply that `true` is a valid value. Change to something else? `no`? `none`?
@paulomach any ideas for alternatives? `expose-external: none`? `expose-external: clusterip`?
Additionally, the possible values are `false` and `nodeport` in kafka-k8s. It may not be a good idea to introduce inconsistencies.
Oh bummer, OK, maybe for another time.
Interestingly, it seems this was a Marc nitpick as well, which Mykola did not respond to in the original Kafka spec.
> default `false` may imply `true` is a valid value. Change to something else? `no`? `none`?

👋🏻 I subscribe to Paulo's comment here.
I understand we are "compromising" in order to have a similar interface across DP charms (i.e. kafka-k8s), but it is confusing, and we should all make an effort to reduce these occurrences in the future.
def external_connectivity(self, event) -> bool:
    """Whether any of the relations are marked as external."""
Is this still in use by the VM charm? https://github.com/canonical/mysql-router-operator/blob/60ad0549c590d48d77d37f83fe8d105f5a182d4a/src/machine_charm.py#L114
I think that this file has implementation details that significantly diverge from the VM implementation (how endpoints are determined). Thus, we should not share the `database_provides.py` file between VM and K8s.
Furthermore, we should take a more intentional approach to shared code between the routers.
Wait, what? Before this, `database_provides.py` was shared between VM & K8s.
IMO, `database_provides.py` should be identical between VM & K8s.
f"{unit_name}.{self._charm.app.name}", | ||
f"{unit_name}.{self._charm.app.name}.{self._charm.model_service_domain}", | ||
f"{service_name}.{self._charm.app.name}", | ||
f"{service_name}.{self._charm.app.name}.{self._charm.model_service_domain}", |
If a user tries to connect with Juju's service, should that be possible?
Also, can you double-check the changes to the SANs with @delgod if you haven't already?
Confirmed with Mykola. Answer: we should keep the SANs as permissive as possible. Added back unit-specific SANs in c5df314.
@@ -19,6 +19,12 @@ def model_service_domain(monkeypatch, request):
    monkeypatch.setattr(
        "kubernetes_charm.KubernetesRouterCharm.model_service_domain", request.param
    )
    monkeypatch.setattr(
Question: why is this being patched again here? Does it need to be?
It is because we're monkeypatching `model_service_domain` above, which will affect the output of `_get_hosts_ports`.
Why do we need to patch it in tests/unit/conftest.py and this file?
Great work @shayancanonical! I'm about to do a similar feature in… When do we check whether the "reconciliation" was successful? In the PR I saw this has been checked on the…
@theoctober19th usually…
Thanks for the explanation, @shayancanonical. We discussed this in our team and, thanks to @welpaolo, considered an additional possibility: set a flag to false in the peer relation databag whenever a service is being created/deleted, which then triggers a peer-relation-changed event. In that event we check for the availability of the service and either a) reset the flag and update the endpoints, or b) defer the peer-relation-changed hook and repeat the process when the deferred hook fires later. This can also be combined with checking the status of the service in other event hooks (including the update-status hook). This effectively checks service availability during either peer-relation-changed or another event hook, whichever occurs earlier.
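A minimal sketch of that flag-plus-defer flow in the ops framework is shown below; the handler, databag key, and helpers (`_service_is_available`, `_update_endpoints`) are assumptions for illustration, not an actual implementation.

```python
# Sketch only: mark the service as pending in the peer app databag, then confirm
# (or defer) from peer-relation-changed until the service is actually available.
import ops


class ServiceReconcileSketch(ops.CharmBase):
    _PEER_RELATION_NAME = "mysql-router-peers"

    def __init__(self, *args):
        super().__init__(*args)
        self.framework.observe(
            self.on[self._PEER_RELATION_NAME].relation_changed,
            self._on_peer_relation_changed,
        )

    def _expose_service(self) -> None:
        peer = self.model.get_relation(self._PEER_RELATION_NAME)
        if self.unit.is_leader():
            # Writing the app databag triggers peer-relation-changed on all units.
            peer.data[self.app]["service-reconciled"] = "false"
        # ... create/patch the Kubernetes service here ...

    def _on_peer_relation_changed(self, event: ops.RelationChangedEvent) -> None:
        peer = self.model.get_relation(self._PEER_RELATION_NAME)
        if peer.data[self.app].get("service-reconciled") != "false":
            return
        if self._service_is_available():
            if self.unit.is_leader():
                peer.data[self.app]["service-reconciled"] = "true"
            self._update_endpoints()
        else:
            # Re-check when the deferred hook is re-emitted (or from update-status).
            event.defer()

    def _service_is_available(self) -> bool:
        # Hypothetical check, e.g. a lightkube lookup of the service's status.
        raise NotImplementedError

    def _update_endpoints(self) -> None:
        # Hypothetical: publish the refreshed endpoints to related applications.
        raise NotImplementedError
```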
From what I understand, LGTM (I read this spec to catch up with the context) 👍🏻
I would suggest addressing all the unresolved comments before going forward with the merge. It seems there were many conversations before my arrival.
Nice work. Let's get this merged.
…lb for integration tests (#244): We need MetalLB enabled for integration tests where we are creating a `loadbalancer`-type service in canonical/mysql-router-k8s-operator#328. Add the ability to enable MetalLB.
## Issue
The `abstract_charm.py` has diverged from the K8s charm, following support for the HACluster charm integration + the `expose-external` config in the K8s charm.

## Solution
Standardize the files.
Counterpart PR in the K8s charm: canonical/mysql-router-k8s-operator#328
Prerequisite
Need to merge canonical/data-platform-workflows#244 for MetalLB support in CI

Issue
We have outlined the approach to expose our K8s charms externally in DA122.
Summary:
- `external-node-connectivity` provided in data_interfaces in K8s charms
- `expose-external` with values `false` (for clusterip svc), `nodeport` (for nodeport svc) and `loadbalancer` (for loadbalancer svc)

Solution
Implement the contents of the spec

Testing