
[RayCluster][Fix] Add expectations of RayCluster #2150

Open · wants to merge 2 commits into base: master
Conversation

@Eikykun commented May 16, 2024

Why are these changes needed?

This PR attempts to address issues #715 and #1936 by adding an expectations mechanism to ensure Pods are in the desired state during the next Reconcile following Pod deletion/creation.
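For readers new to the pattern, a minimal sketch of what such an expectations gate could look like is shown below. The identifiers (ExpectScalePod, Delete, the expectations package name) follow those visible in the diff hunks later in this thread; the interface shape and the IsSatisfied helper are illustrative assumptions, not the PR's actual code.

```go
// Minimal sketch of a scale-expectations gate for RayCluster reconciliation.
// Identifiers mirror those visible in the diff hunks below; everything else
// is an assumption for illustration.
package expectations

// ScaleAction describes the kind of in-flight Pod operation being tracked.
type ScaleAction string

const (
	Create ScaleAction = "Create"
	Delete ScaleAction = "Delete"
)

// RayClusterScaleExpectation lets the controller record Pod creations and
// deletions it has issued, and later ask whether the informer cache has
// observed them, so a Reconcile run does not act on stale Pod lists.
type RayClusterScaleExpectation interface {
	// ExpectScalePod records that a create/delete was issued for podName in
	// the given cluster and worker group.
	ExpectScalePod(namespace, clusterName, group, podName string, action ScaleAction)
	// IsSatisfied reports whether every recorded expectation for the cluster
	// has been observed in the cache (hypothetical helper).
	IsSatisfied(namespace, clusterName string) bool
	// Delete drops all expectations kept for a RayCluster.
	Delete(clusterName, namespace string)
}
```

The controller would consult IsSatisfied at the start of reconcilePods and record an expectation right after each Pod create/delete call, so a Reconcile that runs before the cache catches up can skip scaling decisions instead of acting on a stale Pod list.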

Similar solutions can be referred to at:

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@Eikykun force-pushed the 240516-exp branch 2 times, most recently from 169770d to 10120e3 on May 16, 2024 13:46
@kevin85421 self-assigned this May 16, 2024
@kevin85421 self-requested a review May 16, 2024 17:58
@kevin85421 (Member)

Hi @Eikykun, thank you for the PR! I will review it next week. Are you on Ray Slack? We can iterate more quickly there since this is a large PR. My Slack handle is "Kai-Hsun Chen (ray team)". Thanks!

@kevin85421 (Member)

I will review this PR tomorrow.

@kevin85421 (Member)

cc @rueian Would you mind giving this PR a review? I think I don't have enough time to review it today. Thanks!

@rueian (Contributor) commented May 30, 2024

Just wondering: since client-go's workqueue ensures that no more than one consumer can process an equivalent reconcile.Request at any given time, why don't we clear the related informer cache when needed?

@Eikykun (Author) commented Jun 3, 2024

> Just wondering: since client-go's workqueue ensures that no more than one consumer can process an equivalent reconcile.Request at any given time, why don't we clear the related informer cache when needed?

Apologies, I'm not quite clear about what "related informer cache" refers to.

@rueian (Contributor) commented Jun 8, 2024

> Just wondering: since client-go's workqueue ensures that no more than one consumer can process an equivalent reconcile.Request at any given time, why don't we clear the related informer cache when needed?
>
> Apologies, I'm not quite clear about what "related informer cache" refers to.

According to #715, the root cause is the stale informer cache, so I am wondering if the issue can be solved by fixing the cache, for example by doing a manual Resync somehow.

@kevin85421 (Member)

I am reviewing this PR now. I will try to do an iteration of review every 1 or 2 days.

@kevin85421 (Member) left a comment

I just reviewed a small part of this PR. I will try to do another iteration tomorrow.

@kevin85421 (Member)

Btw, @Eikykun would you mind rebasing with the master branch and resolving the conflict? Thanks!

@Eikykun (Author) commented Jun 12, 2024

> According to #715, the root cause is the stale informer cache, so I am wondering if the issue can be solved by fixing the cache, for example by doing a manual Resync somehow.

Got it. From a problem-solving standpoint, if we don't rely on an informer in the controller and instead query the ApiServer for Pods directly, the cache-consistency issue with etcd wouldn't occur. However, this approach would increase network traffic and affect reconciliation efficiency.
As far as I understand, the Resync() method in DeltaFIFO is not intended to ensure cache consistency with etcd, but rather to prevent event loss by means of periodic reconciliation.
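As a concrete point of comparison for querying the ApiServer directly, the sketch below uses controller-runtime's API reader (the client.Reader returned by mgr.GetAPIReader()), which bypasses the informer cache at the cost of the extra traffic mentioned above. The label selector is an assumption for illustration.

```go
// Sketch: listing Pods straight from the API server rather than the informer
// cache, trading extra API traffic for consistency. The "ray.io/cluster"
// label is assumed here for illustration.
package example

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// listPodsUncached reads Pods for one RayCluster directly from the API server.
// apiReader is expected to be the client.Reader returned by mgr.GetAPIReader().
func listPodsUncached(ctx context.Context, apiReader client.Reader, namespace, clusterName string) (*corev1.PodList, error) {
	pods := &corev1.PodList{}
	err := apiReader.List(ctx, pods,
		client.InNamespace(namespace),
		client.MatchingLabels{"ray.io/cluster": clusterName},
	)
	return pods, err
}
```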

@Eikykun (Author) commented Jun 12, 2024

> Btw, @Eikykun would you mind rebasing with the master branch and resolving the conflict? Thanks!

Thanks for your review. I will go through the review comments and resolve the conflicts later.

@kevin85421 (Member)

@Eikykun would you mind installing pre-commit (https://github.com/ray-project/kuberay/blob/master/ray-operator/DEVELOPMENT.md) and fixing the linter issues? Thanks!

@kevin85421 (Member) left a comment

At a quick glance, it seems that we create an ActiveExpectationItem for each Pod's creation, deletion, or update. I have some concerns about a scalability bottleneck caused by the memory usage. In ReplicaSet's source code, it seems to only track the number of Pods expected to be created or deleted per ReplicaSet.

@kevin85421 (Member)

> At a quick glance, it seems that we create an ActiveExpectationItem for each Pod's creation, deletion, or update. I have some concerns about a scalability bottleneck caused by the memory usage. In ReplicaSet's source code, it seems to only track the number of Pods expected to be created or deleted per ReplicaSet.

Follow up for ^

@Eikykun (Author) commented Jun 18, 2024

> At a quick glance, it seems that we create an ActiveExpectationItem for each Pod's creation, deletion, or update. I have some concerns about a scalability bottleneck caused by the memory usage. In ReplicaSet's source code, it seems to only track the number of Pods expected to be created or deleted per ReplicaSet.

Sorry, I didn't have time to reply a few days ago.

ActiveExpectationItem is removed after fulfilling its expectations. Therefore, the memory usage depends on how many Pods being created or deleted have not yet been synchronized to the cache. It might not actually consume much memory? Also, ControllerExpectations caches each Pod's UID: https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/controller_utils.go#L364
Therefore, I'm not quite sure which one is lighter, ActiveExpectationItem or ControllerExpectations.

I actually started with ControllerExpectations in RayCluster, but switched to ActiveExpectationItem; I'm not entirely sure why anymore, perhaps because ControllerExpectations was more complicated to use: it requires a PodEventHandler to handle the Observed logic, which RayCluster would need to implement separately.
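For comparison, the sketch below illustrates the count-based bookkeeping that ReplicaSet-style ControllerExpectations relies on: two counters per controller key, decremented from Pod event handlers, with no per-Pod items. It is a simplified stand-in written for this discussion, not the actual Kubernetes API.

```go
// Simplified illustration of ReplicaSet-style count-based expectations:
// only two counters per controller key, no per-Pod entries. A real
// implementation would also expire and delete stale entries.
package example

import "sync"

type counts struct {
	adds, dels int
}

type countExpectations struct {
	mu      sync.Mutex
	pending map[string]*counts // key: "namespace/name" of the owning controller
}

func newCountExpectations() *countExpectations {
	return &countExpectations{pending: map[string]*counts{}}
}

// ExpectCreations/ExpectDeletions record how many operations were just issued.
func (e *countExpectations) ExpectCreations(key string, n int) { e.adjust(key, n, 0) }
func (e *countExpectations) ExpectDeletions(key string, n int) { e.adjust(key, 0, n) }

// CreationObserved/DeletionObserved are called from event handlers once the
// cache finally sees the Pod appear or disappear.
func (e *countExpectations) CreationObserved(key string) { e.adjust(key, -1, 0) }
func (e *countExpectations) DeletionObserved(key string) { e.adjust(key, 0, -1) }

// SatisfiedExpectations reports whether the cache has caught up for this key.
func (e *countExpectations) SatisfiedExpectations(key string) bool {
	e.mu.Lock()
	defer e.mu.Unlock()
	c, ok := e.pending[key]
	return !ok || (c.adds <= 0 && c.dels <= 0)
}

func (e *countExpectations) adjust(key string, adds, dels int) {
	e.mu.Lock()
	defer e.mu.Unlock()
	c, ok := e.pending[key]
	if !ok {
		c = &counts{}
		e.pending[key] = c
	}
	c.adds += adds
	c.dels += dels
}
```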

@Eikykun (Author) commented Oct 29, 2024

Could you help approve the workflow? cc @kevin85421
If there are any questions, concerns, or additional changes needed on my part, please let me know.
I am happy to assist in any way possible to expedite the process :)

@kevin85421 (Member)

@Eikykun, thank you for following up! Sorry for the late review. I had concerns about merging such a large change before Ray Summit. Now, I have enough time to verify the correctness and stability of this PR. This is also one of the most important stability improvements in the post-summit roadmap.

https://docs.google.com/document/d/1YYuAQkHKz2UTFMnTDJLg4qnW2OAlYQqjvetP_nvt0nI/edit?tab=t.0

I will resume reviewing this PR this week.

@kevin85421 (Member)

cc @MortalHappiness, can you also give this PR a pass of review?

@MortalHappiness (Member)

A few questions I'd like to ask:

  • Is ExpectedResourceType = "RayCluster" actually used? So far, I've only seen "Pod" being used.
  • Your implementation and the code in the description differ quite a bit, so I'm curious:
    • If only "Pod" is being used, why separate it into ActiveExpectation and RayClusterExpectation?
    • Why have implementations for both, with RayClusterExpectation composing ActiveExpectation? From what I can see, it seems like there only needs to be one. It feels more similar to ScaleExpectations, which mainly expects two operations: "Create" and "Delete."

@Eikykun (Author) commented Nov 6, 2024

> A few questions I'd like to ask:
>
>   • Is ExpectedResourceType = "RayCluster" actually used? So far, I've only seen "Pod" being used.
>   • Your implementation and the code in the description differ quite a bit, so I'm curious:
>     • If only "Pod" is being used, why separate it into ActiveExpectation and RayClusterExpectation?
>     • Why have implementations for both, with RayClusterExpectation composing ActiveExpectation? From what I can see, it seems like there only needs to be one. It feels more similar to ScaleExpectations, which mainly expects two operations: "Create" and "Delete."

This might be an issue left over from the last simplification. Initially, I added many types like RayCluster, Service, etc., considering that more than Pod scaling might require expectations. If we only consider the scaling logic for each group, we can significantly simplify the code. In fact, I recently streamlined the code and reduced the scale-expectation code to around 100 lines. You can find it in the latest commit.

@MortalHappiness (Member) left a comment

Apart from the comments, could you rebase with the master branch?

@MortalHappiness (Member)

By the way, maybe you missed this because this comment is folded. #2150 (comment)

Could you either:

  • Change ExpectScalePod to return an error, just like Kubernetes does.
  • Add some comments to say that errors are ignored because it panics instead of returning an error.

@Eikykun (Author) commented Nov 18, 2024

> By the way, maybe you missed this because this comment is folded. #2150 (comment)
>
> Could you either:
>
>   • Change ExpectScalePod to return an error, just like Kubernetes does.
>   • Add some comments to say that errors are ignored because it panics instead of returning an error.

Thank you for your patient review. I have added some comments.

@MortalHappiness (Member) left a comment

LGTM. Thanks for your hard work!

@Eikykun (Author) commented Nov 21, 2024

@kevin85421 Can you merge the PR?

@kevin85421 (Member) left a comment

Looks good! I am still reviewing it. I am looking forward to merging the PR.

// The first reconciliation created a Pod. If the Pod is quickly deleted from etcd by another component
// before the second reconciliation, the expected condition would never be satisfied.
// Avoid this by setting a timeout.
isPodSatisfied = errors.IsNotFound(err) && rp.recordTimestamp.Add(ExpectationsTimeout).Before(time.Now())
Member:

It looks like if the error is not IsNotFound, we will still cache it even after it times out. Is it possible to cause a memory leak in some corner cases?

Author:

> It looks like if the error is not IsNotFound, we will still cache it even after it times out. Is it possible to cause a memory leak in some corner cases?

If it is not an IsNotFound error, it indicates that there is an issue with the cache of the controllerManager, meaning the object could not be retrieved from the cache correctly. This should not be seen as a corner case; rather, it should be considered a critical error. As we can observe from controller-runtime's CacheReader.Get(), these errors typically only occur when the cache is being used improperly.
https://github.com/kubernetes-sigs/controller-runtime/blob/main/pkg/cache/internal/cache_reader.go#L57-L105

If a Pod is successfully stored in the cache, it should ultimately be retrievable from the cache as well. As long as the Pod is successfully retrieved, the expectation will be cleared. Therefore, there won't be any memory leaks in this process.

Member:

Considering it a critical error that causes KubeRay to fail, or leaving it as a memory leak, is too aggressive for me.
Most users can tolerate creating additional Pods and then scaling them down, but they will complain if KubeRay crashes.

In ReplicaSet's implementation, the function SatisfiedExpectations returns true if it expired.

https://github.com/kubernetes/kubernetes/blob/40f222b6201d5c6476e5b20f57a4a7d8b2d71845/pkg/controller/controller_utils.go#L191-L193
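In other words, an expectation that outlives its timeout is simply treated as satisfied, so a lost event degrades into extra reconciliation work rather than a stuck controller. A minimal sketch of that check is below; the five-minute timeout and the field names are assumptions for illustration.

```go
// Sketch: expectations that are too old are treated as satisfied so a lost
// event cannot block reconciliation forever. Timeout value and field names
// are illustrative assumptions.
package example

import "time"

const expectationsTimeout = 5 * time.Minute

type timestampedExpectation struct {
	recordTimestamp time.Time
	fulfilled       bool
}

// satisfied returns true once the expectation is fulfilled or has expired.
func (e timestampedExpectation) satisfied(now time.Time) bool {
	if e.fulfilled {
		return true
	}
	return now.After(e.recordTimestamp.Add(expectationsTimeout))
}
```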

@kevin85421 (Member) left a comment

Some other issues (1, 2 don't need to be addressed in this PR):

  1. We store rayPod in the indexer, whereas ReplicaSet only stores an integer, which is more memory-efficient. We can run some benchmarks after this PR is merged to check for any memory issues. If so, we can switch to ReplicaSet's implementation.

  2. There might be some corner cases for suspend. Maybe we can only call deleteAllPods when there are no in-flight requests (i.e. expectation is satisfied).

  3. Possible memory leak if KubeRay misses the resource event.

ray-operator/controllers/ray/raycluster_controller.go (outdated, resolved)
@@ -184,6 +187,8 @@ func (r *RayClusterReconciler) Reconcile(ctx context.Context, request ctrl.Reque

// No match found
if errors.IsNotFound(err) {
// Clear all related expectations
rayClusterScaleExpectation.Delete(instance.Name, instance.Namespace)
Member:

Is it possible to cause a memory leak if KubeRay doesn't receive the resource event after the RayCluster CR is deleted? If it is possible, we should consider a solution, such as adding a finalizer to the RayCluster, to ensure the cleanup of the cache in a follow-up PR.

Author:

> Is it possible to cause a memory leak if KubeRay doesn't receive the resource event after the RayCluster CR is deleted? If it is possible, we should consider a solution, such as adding a finalizer to the RayCluster, to ensure the cleanup of the cache in a follow-up PR.

Perhaps we can clean up when rayCluster.DeletionTimestamp.IsZero() == false? This way, even if we lose the events of the RayCluster, we can still rely on the events from the Pods to trigger reconciliation.

Member:

> Perhaps we can clean up when rayCluster.DeletionTimestamp.IsZero() == false?

This still does not completely prevent the memory leak. Do I miss anything?

Author:

> Perhaps we can clean up when rayCluster.DeletionTimestamp.IsZero() == false?
>
> This still does not completely prevent the memory leak. Do I miss anything?

There is indeed such a possibility. It's not too complex to handle; I will address it in a follow-up PR.

ray-operator/controllers/ray/raycluster_controller.go (outdated, resolved)
@@ -804,6 +817,7 @@ func (r *RayClusterReconciler) reconcilePods(ctx context.Context, instance *rayv
}
logger.Info("reconcilePods", "The worker Pod has already been deleted", pod.Name)
} else {
rayClusterScaleExpectation.ExpectScalePod(pod.Namespace, instance.Name, worker.GroupName, pod.Name, expectations.Delete)
Member:

The deletion of Pods inside WorkersToDelete is idempotent. Expectations seem to be unnecessary in this case, but I am fine if we also use Expectations to track it.

Author:

> The deletion of Pods inside WorkersToDelete is idempotent. Expectations seem to be unnecessary in this case, but I am fine if we also use Expectations to track it.

Using Expectation here can help avoid repeatedly calling the APIServer to delete the same Pod.
