
[RayCluster][Fix] Add expectations of RayCluster #2150

Open
wants to merge 13 commits into base: master

Conversation


@Eikykun Eikykun commented May 16, 2024

Why are these changes needed?

This PR attempts to address issues #715 and #1936 by adding expectation capabilities to ensure that pods are in the expected state during the next Reconcile following a pod deletion/creation.

Similar solutions can be referred to at:

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@kevin85421
Member

Hi @Eikykun, thank you for the PR! I will review it next week. Are you on Ray Slack? We can iterate more quickly there since this is a large PR. My Slack handle is "Kai-Hsun Chen (ray team)". Thanks!

@kevin85421
Member

I will review this PR tomorrow.

@kevin85421
Member

cc @rueian Would you mind giving this PR a review? I think I don't have enough time to review it today. Thanks!

Comment on lines 142 to 148
defer func() {
	if satisfied {
		ae.subjects.Delete(expectation)
	}
}()

satisfied, err = expectation.(*ActiveExpectation).isSatisfied()
Contributor

There are many read-after-write operations in the ActiveExpectations. Should we use a mutex to wrap these operations? For example, will the above ae.subjects.Delete(expectation) delete an unsatisfied expectation?

Author

The subjects field of ActiveExpectations is a ThreadSafeStore-backed store provided by k8s.io/client-go/tools/cache, so operations on ActiveExpectations.subjects are thread-safe.
Each item stored in ActiveExpectations.subjects (an ActiveExpectation) also uses such a store internally.
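For reference, a minimal sketch of why no extra mutex is needed around these operations, assuming only client-go's cache package (the keyed item type below is a hypothetical stand-in for an ActiveExpectation, not the PR's code):

package example

import "k8s.io/client-go/tools/cache"

// item is a hypothetical stand-in for an ActiveExpectation keyed by namespace/name.
type item struct{ key string }

func demo() {
	// cache.NewStore returns a thread-safe store; Add/GetByKey/Delete may be
	// called from concurrent goroutines without additional locking.
	subjects := cache.NewStore(func(obj interface{}) (string, error) {
		return obj.(*item).key, nil
	})
	_ = subjects.Add(&item{key: "default/raycluster-sample"})
	if obj, exists, _ := subjects.GetByKey("default/raycluster-sample"); exists {
		_ = subjects.Delete(obj)
	}
}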

Contributor

👍

return fmt.Errorf("fail to get active expectation item for %s when expecting: %s", key, err)
}

ae.recordTimestamp = time.Now()
Contributor

Should we use a mutex for updating the recordTimestamp?

Author

Should we use a mutex for updating the recordTimestamp?

Thanks for your review. 😺

Whether a mutex is needed for ActiveExpectation's recordTimestamp depends on how ActiveExpectations is used. Currently it is only used inside the controller's Reconcile func, where multiple workers reconcile concurrently, but only one worker handles a given reconcile.Request at any given time.

As a result, only one goroutine handles the ActiveExpectations associated with the same RayCluster, ensuring there won't be any concurrency issues with reading and writing.

However, this could cause issues if ActiveExpectations were used externally, for example from an EventHandler; we don't have that use case currently.
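For context, a hedged sketch of the concurrency model being relied on here (controller-runtime; the wiring below is illustrative, not this PR's code): several workers may reconcile different RayClusters in parallel, but the workqueue never hands the same reconcile.Request to two workers at once.

package example

import (
	rayv1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

func setupController(mgr ctrl.Manager, r reconcile.Reconciler) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&rayv1.RayCluster{}).
		// Multiple workers run concurrently, but the underlying workqueue
		// deduplicates keys, so a given RayCluster (and therefore its
		// expectations) is only ever handled by one worker at a time.
		WithOptions(controller.Options{MaxConcurrentReconciles: 4}).
		Complete(r)
}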

Contributor

👍

@rueian
Contributor

rueian commented May 30, 2024

Just wondering: if client-go's workqueue ensures that no more than one consumer can process an equivalent reconcile.Request at any given time, why don't we clear the related informer cache when needed?

@Eikykun
Author

Eikykun commented Jun 3, 2024

Just wondering: if client-go's workqueue ensures that no more than one consumer can process an equivalent reconcile.Request at any given time, why don't we clear the related informer cache when needed?

Apologies, I'm not quite clear about what "related informer cache" refers to.

@rueian
Contributor

rueian commented Jun 8, 2024

Just wondering: if client-go's workqueue ensures that no more than one consumer can process an equivalent reconcile.Request at any given time, why don't we clear the related informer cache when needed?

Apologies, I'm not quite clear about what "related informer cache" refers to.

According to #715, the root cause is the stale informer cache, so I am wondering if the issue can be solved by fixing the cache, for example by doing a manual Resync somehow.

@kevin85421
Member

I am reviewing this PR now. I will try to do a review iteration every 1 or 2 days.

Member

@kevin85421 kevin85421 left a comment

I just reviewed a small part of this PR. I will try to do another iteration tomorrow.

ray-operator/controllers/ray/raycluster_controller.go (outdated; resolved)
resource := ResourceInitializers[i.Kind]()
if err := i.Get(context.TODO(), types.NamespacedName{Namespace: namespace, Name: i.Name}, resource); err == nil {
	return true, nil
} else if errors.IsNotFound(err) && i.RecordTimestamp.Add(30*time.Second).Before(time.Now()) {
Member

What does this mean? Do you mean:

(1) The Pod is not found in the informer cache.
(2) KubeRay has already submitted a Create request to the K8s API server at t=RecordTimestamp. If the Create request was made more than 30 seconds ago, we assume it satisfies the expectation.

I can't understand (2). If we sent a request 30 seconds ago and the informer still hasn't received information about the Pod, there are two possibilities:

  • (a) There are delays between the K8s API server and the informer cache.
  • (b) The creation failed.

For case (a), it is OK for the function to say that the expectation is satisfied. However, for case (b), what will happen if the creation fails and we tell the KubeRay operator it is satisfied?

Author

Case (b) may not occur here, because the pod is only expected to be in the cache after it has been created successfully:

if err := r.Create(ctx, &pod); err != nil {
	return err
}
rayClusterExpectation.ExpectCreateHeadPod(key, pod.Namespace, pod.Name)

@kevin85421
Member

Btw, @Eikykun would you mind rebasing with the master branch and resolving the conflict? Thanks!

@Eikykun
Author

Eikykun commented Jun 12, 2024

According to #715, the root cause is the stale informer cache, so I am wondering if the issue can be solved by fixing the cache, for example by doing a manual Resync somehow.

Got it. From a problem-solving standpoint, if we don't rely on an informer in the controller and instead query the API server for pods directly, the cache consistency issue with etcd wouldn't occur. However, this approach would increase network traffic and reduce reconciliation efficiency.
As far as I understand, the Resync() method in DeltaFIFO is not intended to ensure cache consistency with etcd, but rather to prevent event loss by means of periodic reconciliation.
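A short illustrative snippet (plain client-go; not from this PR) of what the resync period actually controls: the shared informer periodically replays objects from its local store to the registered event handlers, rather than re-listing from the API server, so it guards against missed events rather than guaranteeing consistency with etcd.

package example

import (
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
)

func newPodInformer(cs kubernetes.Interface) {
	// The 30s resync interval re-delivers objects already held in the local
	// store to the update handlers; it does not fetch fresh state from etcd.
	factory := informers.NewSharedInformerFactory(cs, 30*time.Second)
	_ = factory.Core().V1().Pods().Informer()
}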

@Eikykun
Author

Eikykun commented Jun 12, 2024

Btw, @Eikykun would you mind rebasing with the master branch and resolving the conflict? Thanks!

Thanks for your review. I will go through the review comments and resolve the conflicts later.

@kevin85421
Member

@Eikykun would you mind installing pre-commit https://github.com/ray-project/kuberay/blob/master/ray-operator/DEVELOPMENT.md and fixing the linter issues? Thanks!

Member

@kevin85421 kevin85421 left a comment

At a quick glance, it seems that we create an ActiveExpectationItem for each Pod's creation, deletion, or update. I have some concerns about a scalability bottleneck caused by the memory usage. In ReplicaSet's source code, it seems to only track the number of Pods expected to be created or deleted per ReplicaSet.

@kevin85421
Member

At a quick glance, it seems that we create an ActiveExpectationItem for each Pod's creation, deletion, or update. I have some concerns about a scalability bottleneck caused by the memory usage. In ReplicaSet's source code, it seems to only track the number of Pods expected to be created or deleted per ReplicaSet.

Follow up for ^

@Eikykun
Author

Eikykun commented Jun 18, 2024

At a quick glance, it seems that we create an ActiveExpectationItem for each Pod's creation, deletion, or update. I have some concerns about a scalability bottleneck caused by the memory usage. In ReplicaSet's source code, it seems to only track the number of Pods expected to be created or deleted per ReplicaSet.

Sorry, I didn't have time to reply a few days ago.

An ActiveExpectationItem is removed once its expectation is fulfilled, so the memory usage depends on how many created or deleted pods have not yet been synchronized to the cache. It might not actually consume much memory? Also, ControllerExpectations caches each pod's UID: https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/controller_utils.go#L364
Therefore, I'm not quite sure which one is lighter, ActiveExpectationItem or ControllerExpectations.

I actually started with ControllerExpectations in RayCluster. I don't fully remember why I switched to ActiveExpectationItem; perhaps ControllerExpectations was more complicated: it requires a PodEventHandler to handle the Observed logic, which RayCluster would have to implement separately.

@Eikykun
Author

Eikykun commented Oct 29, 2024

Could you help approve the workflow? cc @kevin85421
If there are any questions, concerns, or additional changes needed on my part, please let me know.
I am happy to assist in any way possible to expedite the process :)

@kevin85421
Member

@Eikykun, thank you for following up! Sorry for the late review. I had concerns about merging such a large change before Ray Summit. Now, I have enough time to verify the correctness and stability of this PR. This is also one of the most important stability improvements in the post-summit roadmap.

https://docs.google.com/document/d/1YYuAQkHKz2UTFMnTDJLg4qnW2OAlYQqjvetP_nvt0nI/edit?tab=t.0

I will resume reviewing this PR this week.

@kevin85421
Member

cc @MortalHappiness can you also give this PR a review pass?

@MortalHappiness
Member

A few questions I'd like to ask:

  • Is ExpectedResourceType = "RayCluster" actually used? So far, I've only seen "Pod" being used.
  • Your implementation and the code in the description differ quite a bit, so I'm curious:
    • If only "Pod" is being used, why separate it into ActiveExpectation and RayClusterExpectation?
    • Why have implementations for both, with RayClusterExpectation composing ActiveExpectation? From what I can see, it seems like there only needs to be one. It feels more similar to ScaleExpectations, which mainly expects two operations: "Create" and "Delete."

@Eikykun
Author

Eikykun commented Nov 6, 2024

A few questions I'd like to ask:

  • Is ExpectedResourceType = "RayCluster" actually used? So far, I've only seen "Pod" being used.

  • Your implementation and the code in the description differ quite a bit, so I'm curious:

    • If only "Pod" is being used, why separate it into ActiveExpectation and RayClusterExpectation?
    • Why have implementations for both, with RayClusterExpectation composing ActiveExpectation? From what I can see, it seems like there only needs to be one. It feels more similar to ScaleExpectations, which mainly expects two operations: "Create" and "Delete."

This might be something left over from before the last simplification. Initially, I added many types like RayCluster, Service, etc., considering that resources other than scaled pods might also require expectations. If we only consider the scaling logic for each group, we can significantly simplify the code. In fact, I recently streamlined the code and reduced the scale expectation code to around 100 lines; you can find it in the latest commit (see the sketch below for the rough shape).
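For illustration, a hedged sketch of what such a slimmed-down, per-group scale expectation could look like; the names and signatures below are assumptions, not the PR's exact API.

package expectations

import "context"

// ScaleAction is the kind of pod change the controller just issued.
type ScaleAction string

const (
	CreatePod ScaleAction = "Create"
	DeletePod ScaleAction = "Delete"
)

// RayClusterScaleExpectation records pods the controller has created or
// deleted for a RayCluster group and reports, on the next reconcile, whether
// those changes are visible in the informer cache yet.
type RayClusterScaleExpectation interface {
	ExpectScalePod(namespace, rayClusterName, group, podName string, action ScaleAction)
	IsSatisfied(ctx context.Context, namespace, rayClusterName, group string) bool
	Delete(rayClusterName, namespace string)
}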

Comment on lines +17 to +18
fakeClient := clientFake.NewClientBuilder().WithRuntimeObjects().Build()
exp := NewRayClusterScaleExpectation(fakeClient)
Member

I don’t quite understand why our design needs to use a client and actually create an object. Take scale_expectations_test.go as an example: you can see that the test covers very simple things without involving any k8s operations. I believe our usage is almost identical, so we shouldn’t need to use the k8s client either.

https://github.com/openkruise/kruise/blob/c426ed9b1ea3565b1a18e2adba628a2d9b956d91/pkg/util/expectations/scale_expectations_test.go

Author

@Eikykun Eikykun Nov 7, 2024

The fundamental reason for adding a client was that we used GenerateName when creating a Pod, rather than knowing the PodName beforehand like with CloneSet. The ObserveScale() method relies on the PodName specified in ExpectScale().
https://github.com/openkruise/kruise/blob/c426ed9b1ea3565b1a18e2adba628a2d9b956d91/pkg/controller/cloneset/sync/cloneset_scale.go#L205
https://github.com/openkruise/kruise/blob/c426ed9b1ea3565b1a18e2adba628a2d9b956d91/pkg/controller/cloneset/cloneset_event_handler.go#L75

In the design of CloneSet's Expectations, the call to ExpectScale() must occur before creating or deleting a Pod (if it is called after the Create or Delete, ObserveScale() may be invoked in the podEventHandler before ExpectScale()). With GenerateName, however, we cannot obtain the PodName before the Create, which requires modifying the Expectation design so that the expectation can be recorded after the Pod has been created.
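To illustrate the GenerateName point, a sketch under assumptions (recordExpectation is a hypothetical stand-in for the PR's expectation call): the concrete pod name is assigned by the API server, so it is only available after Create returns.

package example

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// createWorkerPod shows that with GenerateName the pod's final name is only
// known after Create returns, so a name-keyed expectation can only be
// recorded afterwards.
func createWorkerPod(ctx context.Context, c client.Client, recordExpectation func(namespace, name string)) error {
	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Namespace:    "default",
			GenerateName: "raycluster-sample-worker-", // API server appends a random suffix
		},
	}
	if err := c.Create(ctx, pod); err != nil {
		return err
	}
	// pod.Name is now populated by the API server.
	recordExpectation(pod.Namespace, pod.Name)
	return nil
}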

Similar issues can be observed in the design of ReplicaSet, which uses UIDTrackingControllerExpectations to track pods but currently does not address UID tracking during pod creation. Introducing expectations similar to those in ReplicaSet would complicate the code.
https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/replicaset/replica_set.go#L582-L587
https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/replicaset/replica_set.go#L632-L638

The addition of a client leads to a simpler implementation, as we no longer need to handle complex ObserveScale logic in the podEventHandler. We simply check whether the pods in the informer satisfy the Expectations set during the last reconcile before proceeding with reconciliation, and release each satisfied Expectation.
The concern here might be whether the client becomes a performance bottleneck. In the code, if a Pod was created or deleted in the last reconcile, the next reconcile calls client.Get() to check whether that Pod meets the conditions, and deletes the Expectation on success. Under low watch latency, the number of client calls is roughly in line with the number of successful Pod creations/deletions; under high watch latency, repeated checks may occur more frequently. However, our practical validation so far shows that this does not cause a significant performance bottleneck.
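A minimal sketch of that check (assumed names; not the PR's exact code): at the start of the next reconcile, a client.Get against the informer-backed cache tells us whether a previously created Pod has become visible, and the expectation is released once it has.

package example

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// isCreateSatisfied reports whether a pod created in a previous reconcile is
// now visible through the (cache-backed) client, i.e. the expectation holds.
func isCreateSatisfied(ctx context.Context, c client.Client, namespace, podName string) (bool, error) {
	pod := &corev1.Pod{}
	err := c.Get(ctx, types.NamespacedName{Namespace: namespace, Name: podName}, pod)
	if err == nil {
		return true, nil // the watch event has arrived; release the expectation
	}
	if errors.IsNotFound(err) {
		return false, nil // cache not caught up yet; check again next reconcile
	}
	return false, err
}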
