
Extend MachineStatus to add ProviderID #565

Merged

Conversation

@vikaschoudhary16 (Contributor) commented Oct 28, 2018

A new field, ProviderID, is proposed for the Machine object's Spec. This field is required by higher-level consumers of cluster-api, such as the autoscaler, for machine-versus-node reconciliation logic. With cluster-api as a generic out-of-tree provider for the autoscaler, the autoscaler needs this field to build a provider-side view of the list of machines.

This field will be set by actuators and consumed by higher-level entities that interface with cluster-api as a generic provider.

Add ProviderID to the machine Spec

/cc @roberthbailey @hardikdr @enxebre @derekwaynecarr

@k8s-ci-robot (Contributor)

@vikaschoudhary16: GitHub didn't allow me to request PR reviews from the following users: hardikdr, enxebre.

Note that only kubernetes-sigs members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

A new field, ProviderID, is proposed to be added to MachineStatus. This field is required by higher-level consumers of cluster-api, such as the autoscaler, for machine-versus-node reconciliation logic. With cluster-api as a generic out-of-tree provider for the autoscaler, the autoscaler needs this field to build a provider-side view of the list of machines.

Add ProviderID to the machine status

/cc @roberthbailey @hardikdr @enxebre @derekwaynecarr

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Oct 28, 2018

// ProviderID is the identification ID of the machine provided by the provider
// +optional
ProviderID *string `json:"providerID,omitempty"`
Contributor

Please update this comment to make it clear that this field must match the provider ID as seen on the node object corresponding to this machine.
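Purely as an illustration (not part of this PR), an actuator might compose the value so that it matches the Node's spec.providerID. The AWS-style aws:///<zone>/<instance-id> format below is an assumption for the sketch only, since each provider defines its own scheme:

package example

import "fmt"

// buildProviderID composes a provider ID in the same shape as the corresponding
// Node's spec.providerID. The AWS-style format here is illustrative only; the
// actuator must use whatever scheme its cloud provider uses.
func buildProviderID(zone, instanceID string) *string {
	id := fmt.Sprintf("aws:///%s/%s", zone, instanceID)
	return &id
}

// Hypothetical usage inside an actuator, after the instance has been created:
//   machine.Spec.ProviderID = buildProviderID("us-east-1a", "i-0123456789abcdef0")
//   (or machine.Status.ProviderID, depending on where the field ends up)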

@alvaroaleman (Member)

My 2¢ on this:

  • I am in favor of adding this, but for a different reason: to allow the machine-controller/actuator to store an ID that identifies the instance at the cloud provider in case identifying information on the instance gets lost (a rename, someone deletes the tag, etc.).
  • I am opposed to merging anything on behalf of the cluster-autoscaler until its proposal is merged, so if people think the point above does not apply, IMHO we should not merge this.

@hardikdr (Member) left a comment

I agree with @alvaroaleman.
I support adding the ProviderID field to MachineStatus, but only for the reason mentioned above.
I recall that the autoscaler integration requires it, probably for mapping the Node object to the Machine object, and in turn to the MachineDeployment object. But that should rather be settled in the KEP opened for it.

@sidharthsurana (Contributor)

As I mentioned during today's SIG meeting as well, keeping ProviderID in MachineStatus makes the field non-portable. That is, this information will be lost if a pivot ever needs to be done, because during the pivot the Status field of an object is not transferred to the target cluster. So if we need this information at all, we should look at adding it as a field in the Spec part of the Machine object definition.

@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Nov 16, 2018
@vikaschoudhary16 (Contributor, Author)

@sidharthsurana
Since providerID is an "observed" attribute, I am not sure it makes sense to make it part of the spec.
Regarding the pivot point, can you please share some reference so that I can understand what a pivot actually means here? I could not understand what you meant by:

That is, this information will be lost if ever a pivot would need to be done.

@roberthbailey (Contributor)

@vikaschoudhary16 - Is it always observed? Or does that depend on the provider? I think the question comes down to whether it should be possible to use the provider ID to match a machine with an existing infrastructure resource via this field (which is how @sidharthsurana is using it), on the assumption that the caller knows a priori, through some means, what it should be set to.

The precedent here is the service IP (for k8s services), where you can ask for a specific one by setting the value in the spec, but if you leave it blank the service controller chooses one and fills it in for you. It may not be the best precedent to follow, but it seems to work pretty well in that case.
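For reference, the precedent being described corresponds roughly to this field in k8s.io/api/core/v1 (doc comment paraphrased, not quoted verbatim):

// ServiceSpec (excerpt, paraphrased): the user may request a specific cluster IP
// by setting this field; if it is left empty, the system allocates one and
// fills it into the spec.
ClusterIP string `json:"clusterIP,omitempty"`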

@vikaschoudhary16 (Contributor, Author)

@roberthbailey thanks!! sounds logical to me.

@vikaschoudhary16 vikaschoudhary16 force-pushed the add-provider-id branch 2 times, most recently from 9ffb881 to 4b38736, on November 19, 2018 04:53
@hardikdr (Member)

Are there any remaining blockers to merging this PR?
@vikaschoudhary16 @sidharthsurana

@vikaschoudhary16 (Contributor, Author)

@hardikdr none to my knowledge, thanks!!!

@sidharthsurana (Contributor)

@vikaschoudhary16 @hardikdr The point I was trying to make is as follows:

Say a user is using clusterctl to deploy the cluster. (Note that clusterctl is just an example; you could do the same using plain kubectl.) The general workflow would be that the cluster-api controllers and actuators first run in the bootstrap cluster, where the initial Cluster and Machine objects are created. The actuator comes into action and provisions the master machine. As per the description of this PR, the actuator would then update the Status of the Machine object with the ProviderID value. From the discussion above, it seems one use of this ProviderID field would be to match the machine to the node.

Now, before we get to the machine-to-node matching, the next step would be to pivot the cluster-api controllers and objects to the target cluster that was just created. The process for this is pretty simple: the cluster-api controllers are deployed on the target cluster, followed by copying the objects from the bootstrap cluster to the target cluster. And this is where the catch is. If you add the ProviderID property to the Status part of the Machine object, then when you copy the Machine object for the master over to the target cluster, the Machine object on the target cluster will NOT have anything in its Status field, because Status is ignored when the object is created on the target cluster.
Thus, if the idea of the ProviderID in the Status was to help match the Node object with the Machine, that would not happen, since the field is now missing from the Status part.
Hence my suggestion to make this part of the Spec and not the Status; that way, even when you copy over the definition of the Machine object, the information persists.

I hope the above explanation helps.
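A minimal sketch of the copy step being described, assuming controller-runtime clients for the bootstrap and target clusters and the cluster-api Machine type (the cluster-api import path varies by version). The point is that only metadata and Spec survive the Create call, while Status starts out empty on the target:

package example

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"

	clusterv1 "sigs.k8s.io/cluster-api/pkg/apis/cluster/v1alpha1" // path is version-dependent
)

// pivotMachine copies a Machine from the bootstrap cluster to the target cluster.
// Only metadata and Spec are carried over; Status is not honored on create, so a
// ProviderID recorded only in Status would be lost here.
func pivotMachine(ctx context.Context, bootstrap, target client.Client, key client.ObjectKey) error {
	var src clusterv1.Machine
	if err := bootstrap.Get(ctx, key, &src); err != nil {
		return err
	}
	dst := clusterv1.Machine{
		ObjectMeta: metav1.ObjectMeta{
			Name:        src.Name,
			Namespace:   src.Namespace,
			Labels:      src.Labels,
			Annotations: src.Annotations,
		},
		Spec: src.Spec, // Spec (and a ProviderID stored there) survives the copy
	}
	return target.Create(ctx, &dst)
}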

@alvaroaleman (Member)

@sidharthsurana But wouldn't then the better solution be to copy over the Status as well when pivoting?

@hardikdr (Member)

I would second the idea of also migrating the Status while pivoting.
Although, looking at the core Node object, the ProviderID field seems to fall under Spec.

@alvaroaleman @sidharthsurana thoughts? We could settle on Spec for now and move to Status in case we see a need, or vice versa?

@vikaschoudhary16 seems to have already moved the ProviderID to Spec. Based on the conclusion here, you may want to update the commit description, which still refers to MachineStatus.

@sidharthsurana (Contributor)

@sidharthsurana But wouldn't then the better solution be to copy over the Status as well when pivoting?

Yes, but if you are using kubectl to create the object, the Status, even if present in the object definition, will be ignored. It's more of a k8s client behavior.

@sidharthsurana (Contributor)

@alvaroaleman @sidharthsurana thoughts? We could settle on Spec for now and move to Status in case we see a need, or vice versa?

Here is my take: if ProviderID is a field that can be re-populated even if it is lost, then Status is okay, since even after a pivot the controller/actuator on the target cluster would re-populate it properly. But if we are depending on this property for the controller/actuator to work, then it needs to be in the Spec section.

@sidharthsurana (Contributor)

In general I am okay with adding this field; my only concern is where we add it. Since I am not 100% sure of the underlying use cases, hence the questions.

@alvaroaleman (Member)

Could the motivation be that it is a one-time observation and should not be updated later, which makes it appropriate for Spec? - not very sure.

Yes, I think so.

Yes, but if you are using kubectl to create the object, the Status, even if present in the object definition, will be ignored. It's more of a k8s client behavior.

Okay, but if that is the only reason to move the field to Spec, we should probably instead figure out how to make the client set the status. Maybe the issue here is that we have a Status subresource, so we need to use the Status client?

Here is my take: if ProviderID is a field that can be re-populated even if it is lost, then Status is okay, since even after a pivot the controller/actuator on the target cluster would re-populate it properly. But if we are depending on this property for the controller/actuator to work, then it needs to be in the Spec section.

Hm. I would view it more as: if this is immutable and will never be changed once populated, it belongs in Spec. If there is a chance it gets changed during the lifetime of a machine, it should be part of the Status. In the Kubermatic machine-controller we do recreate cloud provider instances if someone deleted them manually (e.g. by accident). This in turn means that the ID of those instances changes. If we put the ProviderID into the Spec, we basically cannot support this scenario, because it is not safe for the controller to update that field.
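For what it's worth, the "use the Status client" idea would look roughly like this with a controller-runtime client (imports and the Machine type as in the pivot sketch above). This is a sketch only and assumes the Machine CRD has the status subresource enabled:

// restoreStatus re-populates Status after the Machine has been created on the
// target cluster; Create ignores Status, so a separate write against the status
// subresource is needed.
func restoreStatus(ctx context.Context, c client.Client, m *clusterv1.Machine, saved clusterv1.MachineStatus) error {
	m.Status = saved
	return c.Status().Update(ctx, m)
}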

@enxebre (Member) commented Nov 22, 2018

If you add the ProviderID property to the Status part of the Machine object, then when you copy the Machine object for the master over to the target cluster, the Machine object on the target cluster will NOT have anything in its Status field, because Status is ignored when the object is created on the target cluster.

Makes sense. I see this as a consequence of an arbitrary process (pivoting) followed in some cases to bring up a cluster, so I think the tooling driving it should be responsible for ensuring its expectations. But I think we should not choose between object spec and object status based on it.

Could the motivation be that it is a one-time observation and should not be updated later, which makes it appropriate for Spec? - not very sure.

The difference I can see versus the node object (where it lives in the spec) is that a machine could be backing a different node over time. E.g. if the external state changes, i.e. an instance is manually deleted, the machine controller should eventually try to reconcile the desired state and create a new instance based on the existing machine object.

@vikaschoudhary16 (Contributor, Author)

Now I think it makes more sense to add it to Status. Thanks a lot, everyone, for the great discussion so far. To summarize why I now think it makes more sense to keep it in Status:

  1. It is possible that a machine object starts pointing to a different instance, e.g. if the original instance gets deleted or crashes. Therefore ProviderID is NOT immutable for a machine object, and keeping it in Status makes sense.
  2. In the Node object, ProviderID is immutable, which is why it makes sense for it to live in the Spec there.
  3. Regarding pivoting to the target cluster, I agree with @alvaroaleman and @enxebre that it should be handled by the client/tooling and should not influence where ProviderID lives in the machine object.

I am inclined to revert back to the initial change, i.e. ProviderID in Status.

@sidharthsurana @alvaroaleman @enxebre @hardikdr @roberthbailey thoughts?

@sidharthsurana (Contributor)

3. Regarding pivoting to the target cluster, I agree with @alvaroaleman and @enxebre that it should be handled by the client/tooling and should not influence where ProviderID lives in the machine object.

I agree that the tooling should handle keeping the status and spec in sync. However, keep in mind that even if the tooling tries to handle that, it would not be an atomic operation but a two-step one. That means the controller/actuator running on the target cluster will get notified of changes to the object twice, and the actuator then needs to handle this case gracefully.

Hm. I would view it more as: if this is immutable and will never be changed once populated, it belongs in Spec. If there is a chance it gets changed during the lifetime of a machine, it should be part of the Status. In the Kubermatic machine-controller we do recreate cloud provider instances if someone deleted them manually (e.g. by accident). This in turn means that the ID of those instances changes. If we put the ProviderID into the Spec, we basically cannot support this scenario, because it is not safe for the controller to update that field.

Although I get the motivation for making the spec immutable, IMHO that is somewhat subjective. I like the idea of the spec being immutable too, but in general there may be cases that demand the spec be treated as mutable as well. For instance, right now, if you want to upgrade the kubelet version, what does one do? We update the spec to the new kubelet version, and the actuator needs to handle that situation. Another case could be vertical scaling. What if I just want to increase, say, the CPU/memory of a worker? One way is to create a brand new machine with the new size and delete the old one. However, a better approach might be that, if the underlying provider supports hot-add of CPU/memory, we simply reconfigure the instance. For the latter to happen, one might have to update the spec to reflect the new intent.
I think the immutability/mutability of the spec is not a simple black-and-white case, considering that CRDs don't give us any strict enforcement of immutability.

Regarding the specific scenario you are referring to, @alvaroaleman: in general it is the simple question of, given a machine object, how does the actuator figure out whether the underlying actual instance exists? In general there are 3 options to resolve this.

  1. Decorate/enhance the Machine object with some sort of identifier from the underlying provider, so that the actuator can precisely query for the instance and resolve the question.
  2. Decorate the backend instance on the provider with some metadata from the Machine object, so that the actuator can query the underlying provider, filtering by that metadata from the Machine object.
  3. A mixture of both 1 and 2 above.

Our case here is scenario 1, so does it matter where we put that metadata inside the Machine object? For the case where someone accidentally deletes the backend instance: on the next reconciliation loop, when the actuator tries to look up the backend instance by that ID, it will not be able to find it, which is a clear indication that it needs to re-create the instance and, once it has been created, write the ID back to the same location (regardless of whether that is Spec or Status) with 100% confidence.
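A sketch of the look-up-and-recreate flow described above (imports and the Machine type as in the earlier sketches). cloudClient and its methods are hypothetical stand-ins for a provider SDK, not an actual cluster-api interface, and the sketch keeps the ID in Spec, though the same flow applies if it lives in Status:

// cloudClient is a hypothetical, minimal view of a provider SDK.
type cloudClient interface {
	InstanceExists(ctx context.Context, providerID string) (bool, error)
	CreateInstance(ctx context.Context, machine *clusterv1.Machine) (providerID string, err error)
}

// reconcileInstance checks whether the backing instance still exists and
// re-creates it (writing the fresh ID back to the same field) if it does not.
func reconcileInstance(ctx context.Context, cloud cloudClient, machine *clusterv1.Machine) error {
	if machine.Spec.ProviderID != nil {
		exists, err := cloud.InstanceExists(ctx, *machine.Spec.ProviderID)
		if err != nil {
			return err
		}
		if exists {
			return nil // the backing instance is still there; nothing to do
		}
	}
	// ID unset, or the instance is gone (e.g. deleted by accident): create a new
	// instance and record its ID in the same location.
	id, err := cloud.CreateInstance(ctx, machine)
	if err != nil {
		return err
	}
	machine.Spec.ProviderID = &id
	return nil
}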

@roberthbailey (Contributor)

We discussed this quite a bit during the meeting today. Some takeaways:

  • @justinsb said "The way spec and status was explained to me is that if it can be recovered then it goes into status and if not then it goes into spec", which would argue for... maybe the spec, but it depends on the environment.
  • @timothysc suggested that until we have a client other than the autoscaler that needs to know this on a consistent basis across multiple providers, we should consider using an annotation instead of making it a first-class field.
  • @hardikdr reiterated that having a strong and consistent link between the machine, the node, and the underlying infrastructure machine (physical or virtual) is still worth it, and that we should structure the provider ID the same way the cloud providers in k8s do.
  • Nate suggested that if we want to keep the field in status (because it's observed), we could use an annotation as a mechanism to pass the a priori value to the controller during the pivot.

@frobware

  • @timothysc suggested that until we have a client other than the autoscaler that needs to know this on a consistent basis across multiple providers, we should consider using an annotation instead of making it a first-class field.

This is the route I chose for the autoscaler / cluster-api implementation I have been working on here: openshift/kubernetes-autoscaler#14
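For context, an annotation-based link between a Node and its Machine might look roughly like the following (corev1 being k8s.io/api/core/v1). The annotation key is hypothetical and not necessarily the one used in the linked implementation:

// machineAnnotationKey is a hypothetical key; the linked implementation may use
// a different one.
const machineAnnotationKey = "cluster.k8s.io/machine"

// machineForNode returns the "<namespace>/<name>" value recorded on the Node,
// if the annotation is present.
func machineForNode(node *corev1.Node) (string, bool) {
	value, ok := node.Annotations[machineAnnotationKey]
	return value, ok
}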

@vikaschoudhary16 (Contributor, Author) commented Dec 5, 2018

We can do almost everything with annotations instead of using concrete APIs. In my opinion, annotations should be used for experimentation, where we are not sure whether an API will ever be needed. Here I think most of us agree that this information is required, so making it an API makes more sense to me. APIs provide greater confidence to build upon than annotations do. Different implementations, different annotation names, more confusion.

@hardikdr (Member)

Ping.
Let's please discuss this topic in the next meeting and try to settle it?
@vikaschoudhary16 @alvaroaleman @sidharthsurana @roberthbailey

@dlipovetsky (Contributor) commented Jan 23, 2019

I think the ProviderID field in the Machine object is similar to the node name field in the Pod object. That field is part of the Pod spec and is generally set by the scheduler, but it can be set by the user (e.g. to pin a Pod to a specific Node). The ProviderID field will also generally be set by a controller, but it may be set by the user. Therefore I think ProviderID can be in the Spec.
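For comparison, the Pod field being referenced looks roughly like this in k8s.io/api/core/v1 (doc comment paraphrased, not quoted verbatim):

// PodSpec (excerpt, paraphrased): nodeName is normally filled in by the
// scheduler, but a user may set it up front to place the Pod on a specific node.
NodeName string `json:"nodeName,omitempty"`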

@roberthbailey (Contributor)

As alluded to by @dlipovetsky above, we discussed this during the meeting today. We agreed that we wanted the field and that it should go into the spec.

We also agreed that when making this sort of change we need to update our documentation (both inline and in the gitbook). @hardikdr volunteered to pick this change up if you aren't able to make those changes as part of getting it merged.

@vikaschoudhary16 (Contributor, Author) commented Jan 25, 2019

Hey, thanks a lot for discussing this. I could not attend the cluster-api meeting because of a timezone problem. I am ready to do whatever it takes to get this merged.
I will coordinate with @hardikdr.

@roberthbailey (Contributor)

Thanks!

@vincepri (Member) left a comment

Minor comment


// ProviderID is the identification ID of the machine provided by the provider.
// This field must match the provider ID as seen on the node object corresponding to this machine.
// +optional
Member

Could we add some use cases and a more detailed description for this field? We should also make sure to communicate to users when the field should be set and when it could be consumed.
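One possible expansion along those lines, sketched here only as an illustration of what the reviewer is asking for (not necessarily the wording that was merged):

// ProviderID is the identification ID of the machine provided by the provider.
// This field must match the provider ID as seen on the node object corresponding
// to this machine. It is expected to be set by the actuator once the machine has
// been provisioned, and consumed by higher-level consumers of cluster-api (for
// example, a cluster autoscaler using cluster-api as a generic provider) to
// reconcile the provider's list of machines against the list of nodes.
// +optional
ProviderID *string `json:"providerID,omitempty"`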

Contributor Author

done!

@vikaschoudhary16 (Contributor, Author)

/retest

@vincepri (Member)

@vikaschoudhary16 Can you rebase on master?

/approve
/assign @detiber @roberthbailey

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vikaschoudhary16, vincepri

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 30, 2019
@vikaschoudhary16 (Contributor, Author)

@vincepri rebased!

@detiber (Member) left a comment

Minor nit on removal of a newline, otherwise lgtm

@@ -153,7 +166,6 @@ type MachineStatus struct {
// serialized/deserialized from this field.
// +optional
ProviderStatus *runtime.RawExtension `json:"providerStatus,omitempty"`

Member

Since ProviderStatus and Addresses are not directly related, it would be nice to keep this newline here to more easily differentiate them.

Contributor Author

Done, thanks!

@detiber (Member) commented Jan 30, 2019

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 30, 2019
@k8s-ci-robot k8s-ci-robot merged commit 3e0d971 into kubernetes-sigs:master Jan 30, 2019