
[RayCluster][Fix] Add expectations of RayCluster #2150

Open · wants to merge 2 commits into base: master
Conversation

@Eikykun commented May 16, 2024

Why are these changes needed?

This PR attempts to address issues #715 and #1936 by adding an expectations mechanism to ensure Pods are in the desired state during the next Reconcile following Pod deletion/creation.
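For readers new to the pattern, a minimal sketch of what such an expectations gate could look like is shown below. The identifiers (ExpectScalePod, Delete, the expectations package name) follow those visible in the diff hunks later in this thread; the interface shape and the IsSatisfied helper are illustrative assumptions, not the PR's actual code.

```go
// Minimal sketch of a scale-expectations gate for RayCluster reconciliation.
// Identifiers mirror those visible in the diff hunks below; everything else
// is an assumption for illustration.
package expectations

// ScaleAction describes the kind of in-flight Pod operation being tracked.
type ScaleAction string

const (
	Create ScaleAction = "Create"
	Delete ScaleAction = "Delete"
)

// RayClusterScaleExpectation lets the controller record Pod creations and
// deletions it has issued, and later ask whether the informer cache has
// observed them, so a Reconcile run does not act on stale Pod lists.
type RayClusterScaleExpectation interface {
	// ExpectScalePod records that a create/delete was issued for podName in
	// the given cluster and worker group.
	ExpectScalePod(namespace, clusterName, group, podName string, action ScaleAction)
	// IsSatisfied reports whether every recorded expectation for the cluster
	// has been observed in the cache (hypothetical helper).
	IsSatisfied(namespace, clusterName string) bool
	// Delete drops all expectations kept for a RayCluster.
	Delete(clusterName, namespace string)
}
```

The controller would consult IsSatisfied at the start of reconcilePods and record an expectation right after each Pod create/delete call, so a Reconcile that runs before the cache catches up can skip scaling decisions instead of acting on a stale Pod list.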

Similar solutions can be referred to at:

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@Eikykun force-pushed the 240516-exp branch 2 times, most recently from 169770d to 10120e3 on May 16, 2024 13:46
@kevin85421 self-assigned this May 16, 2024
@kevin85421 self-requested a review May 16, 2024 17:58
@kevin85421 (Member)

Hi @Eikykun, thank you for the PR! I will review it next week. Are you on Ray Slack? We can iterate more quickly there since this is a large PR. My Slack handle is "Kai-Hsun Chen (ray team)". Thanks!

@kevin85421 (Member)

I will review this PR tomorrow.

@kevin85421 (Member)

cc @rueian Would you mind giving this PR a review? I think I don't have enough time to review it today. Thanks!

@rueian (Contributor) commented May 30, 2024

Just wondering: since client-go's workqueue ensures that no more than one consumer can process an equivalent reconcile.Request at any given time, why don't we clear the related informer cache when needed?

@Eikykun (Author) commented Jun 3, 2024

> Just wondering: since client-go's workqueue ensures that no more than one consumer can process an equivalent reconcile.Request at any given time, why don't we clear the related informer cache when needed?

Apologies, I'm not quite clear about what "related informer cache" refers to.

@rueian (Contributor) commented Jun 8, 2024

> Just wondering: since client-go's workqueue ensures that no more than one consumer can process an equivalent reconcile.Request at any given time, why don't we clear the related informer cache when needed?
>
> Apologies, I'm not quite clear about what "related informer cache" refers to.

According to #715, the root cause is the stale informer cache, so I am wondering if the issue can be solved by fixing the cache, for example by doing a manual Resync somehow.

@kevin85421 (Member)

I am reviewing this PR now. I will try to do an iteration of review every 1 or 2 days.

@kevin85421 (Member) left a comment

I just reviewed a small part of this PR. I will try to do another iteration tomorrow.

@kevin85421 (Member)

Btw, @Eikykun would you mind rebasing with the master branch and resolving the conflict? Thanks!

@Eikykun (Author) commented Jun 12, 2024

> According to #715, the root cause is the stale informer cache, so I am wondering if the issue can be solved by fixing the cache, for example by doing a manual Resync somehow.

Got it. From a problem-solving standpoint, if we don't rely on an informer in the controller and instead query the ApiServer for Pods directly, the cache-consistency issue with etcd wouldn't occur. However, this approach would increase network traffic and affect reconciliation efficiency.
As far as I understand, the Resync() method in DeltaFIFO is not intended to ensure cache consistency with etcd, but rather to prevent event loss by means of periodic reconciliation.
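As a concrete point of comparison for querying the ApiServer directly, the sketch below uses controller-runtime's API reader (the client.Reader returned by mgr.GetAPIReader()), which bypasses the informer cache at the cost of the extra traffic mentioned above. The label selector is an assumption for illustration.

```go
// Sketch: listing Pods straight from the API server rather than the informer
// cache, trading extra API traffic for consistency. The "ray.io/cluster"
// label is assumed here for illustration.
package example

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// listPodsUncached reads Pods for one RayCluster directly from the API server.
// apiReader is expected to be the client.Reader returned by mgr.GetAPIReader().
func listPodsUncached(ctx context.Context, apiReader client.Reader, namespace, clusterName string) (*corev1.PodList, error) {
	pods := &corev1.PodList{}
	err := apiReader.List(ctx, pods,
		client.InNamespace(namespace),
		client.MatchingLabels{"ray.io/cluster": clusterName},
	)
	return pods, err
}
```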

@Eikykun (Author) commented Jun 12, 2024

> Btw, @Eikykun would you mind rebasing with the master branch and resolving the conflict? Thanks!

Thanks for your review. I will go through the review comments and resolve the conflicts later.

@kevin85421 (Member)

@Eikykun would you mind installing pre-commit (https://github.com/ray-project/kuberay/blob/master/ray-operator/DEVELOPMENT.md) and fixing the linter issues? Thanks!

@kevin85421 (Member) left a comment

At a quick glance, it seems that we create an ActiveExpectationItem for each Pod's creation, deletion, or update. I have some concerns about a scalability bottleneck caused by the memory usage. In ReplicaSet's source code, it seems to only track the number of Pods expected to be created or deleted per ReplicaSet.

@kevin85421 (Member)

> At a quick glance, it seems that we create an ActiveExpectationItem for each Pod's creation, deletion, or update. I have some concerns about a scalability bottleneck caused by the memory usage. In ReplicaSet's source code, it seems to only track the number of Pods expected to be created or deleted per ReplicaSet.

Follow up for ^

@Eikykun (Author) commented Jun 18, 2024

> At a quick glance, it seems that we create an ActiveExpectationItem for each Pod's creation, deletion, or update. I have some concerns about a scalability bottleneck caused by the memory usage. In ReplicaSet's source code, it seems to only track the number of Pods expected to be created or deleted per ReplicaSet.

Sorry, I didn't have time to reply a few days ago.

ActiveExpectationItem is removed after fulfilling its expectations. Therefore, the memory usage depends on how many Pods being created or deleted have not yet been synchronized to the cache. It might not actually consume much memory? Also, ControllerExpectations caches each Pod's UID: https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/controller_utils.go#L364
Therefore, I'm not quite sure which one is lighter, ActiveExpectationItem or ControllerExpectations.

I actually started with ControllerExpectations in RayCluster, but switched to ActiveExpectationItem; I'm not entirely sure why anymore, perhaps because ControllerExpectations was more complicated to use: it requires a PodEventHandler to handle the Observed logic, which RayCluster would need to implement separately.
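For comparison, the sketch below illustrates the count-based bookkeeping that ReplicaSet-style ControllerExpectations relies on: two counters per controller key, decremented from Pod event handlers, with no per-Pod items. It is a simplified stand-in written for this discussion, not the actual Kubernetes API.

```go
// Simplified illustration of ReplicaSet-style count-based expectations:
// only two counters per controller key, no per-Pod entries. A real
// implementation would also expire and delete stale entries.
package example

import "sync"

type counts struct {
	adds, dels int
}

type countExpectations struct {
	mu      sync.Mutex
	pending map[string]*counts // key: "namespace/name" of the owning controller
}

func newCountExpectations() *countExpectations {
	return &countExpectations{pending: map[string]*counts{}}
}

// ExpectCreations/ExpectDeletions record how many operations were just issued.
func (e *countExpectations) ExpectCreations(key string, n int) { e.adjust(key, n, 0) }
func (e *countExpectations) ExpectDeletions(key string, n int) { e.adjust(key, 0, n) }

// CreationObserved/DeletionObserved are called from event handlers once the
// cache finally sees the Pod appear or disappear.
func (e *countExpectations) CreationObserved(key string) { e.adjust(key, -1, 0) }
func (e *countExpectations) DeletionObserved(key string) { e.adjust(key, 0, -1) }

// SatisfiedExpectations reports whether the cache has caught up for this key.
func (e *countExpectations) SatisfiedExpectations(key string) bool {
	e.mu.Lock()
	defer e.mu.Unlock()
	c, ok := e.pending[key]
	return !ok || (c.adds <= 0 && c.dels <= 0)
}

func (e *countExpectations) adjust(key string, adds, dels int) {
	e.mu.Lock()
	defer e.mu.Unlock()
	c, ok := e.pending[key]
	if !ok {
		c = &counts{}
		e.pending[key] = c
	}
	c.adds += adds
	c.dels += dels
}
```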

@Eikykun (Author) commented Oct 29, 2024

Could you help approve the workflow? cc @kevin85421
If there are any questions, concerns, or additional changes needed on my part, please let me know.
I am happy to assist in any way possible to expedite the process :)

@kevin85421 (Member)

@Eikykun, thank you for following up! Sorry for the late review. I had concerns about merging such a large change before Ray Summit. Now, I have enough time to verify the correctness and stability of this PR. This is also one of the most important stability improvements in the post-summit roadmap.

https://docs.google.com/document/d/1YYuAQkHKz2UTFMnTDJLg4qnW2OAlYQqjvetP_nvt0nI/edit?tab=t.0

I will resume reviewing this PR this week.

@kevin85421 (Member)

cc @MortalHappiness, can you also give this PR a pass of review?

@MortalHappiness (Member)

A few questions I'd like to ask:

  • Is ExpectedResourceType = "RayCluster" actually used? So far, I've only seen "Pod" being used.
  • Your implementation and the code in the description differ quite a bit, so I'm curious:
    • If only "Pod" is being used, why separate it into ActiveExpectation and RayClusterExpectation?
    • Why have implementations for both, with RayClusterExpectation composing ActiveExpectation? From what I can see, it seems like there only needs to be one. It feels more similar to ScaleExpectations, which mainly expects two operations: "Create" and "Delete."

@Eikykun (Author) commented Nov 6, 2024

> A few questions I'd like to ask:
>
>   • Is ExpectedResourceType = "RayCluster" actually used? So far, I've only seen "Pod" being used.
>   • Your implementation and the code in the description differ quite a bit, so I'm curious:
>     • If only "Pod" is being used, why separate it into ActiveExpectation and RayClusterExpectation?
>     • Why have implementations for both, with RayClusterExpectation composing ActiveExpectation? From what I can see, it seems like there only needs to be one. It feels more similar to ScaleExpectations, which mainly expects two operations: "Create" and "Delete."

This might be an issue left over from the last simplification. Initially, I added many types like RayCluster, Service, etc., considering that more than Pod scaling might require expectations. If we only consider the scaling logic for each group, we can significantly simplify the code. In fact, I recently streamlined the code and reduced the scale-expectation code to around 100 lines. You can find it in the latest commit.

@MortalHappiness (Member) left a comment

Apart from the comments, could you rebase with the master branch?

@MortalHappiness (Member)

By the way, maybe you missed this because this comment is folded. #2150 (comment)

Could you either:

  • Change ExpectScalePod to return an error, just like Kubernetes does.
  • Add some comments to say that errors are ignored because it panics instead of returning an error.

@Eikykun (Author) commented Nov 18, 2024

> By the way, maybe you missed this because this comment is folded. #2150 (comment)
>
> Could you either:
>
>   • Change ExpectScalePod to return an error, just like Kubernetes does.
>   • Add some comments to say that errors are ignored because it panics instead of returning an error.

Thank you for your patient review. I have added some comments.

@MortalHappiness (Member) left a comment

LGTM. Thanks for your hard work!

@Eikykun (Author) commented Nov 21, 2024

@kevin85421 Can you merge the PR?

@kevin85421 (Member) left a comment

Looks good! I am still reviewing it. I am looking forward to merging the PR.

// The first reconciliation created a Pod. If the Pod is quickly deleted from etcd by another component
// before the second reconciliation, the expected condition would never be satisfied.
// Avoid this by setting a timeout.
isPodSatisfied = errors.IsNotFound(err) && rp.recordTimestamp.Add(ExpectationsTimeout).Before(time.Now())
Member:

It looks like if the error is not IsNotFound, we will still cache it even after it times out. Is it possible to cause a memory leak in some corner cases?

Author:

> It looks like if the error is not IsNotFound, we will still cache it even after it times out. Is it possible to cause a memory leak in some corner cases?

If it is not an IsNotFound error, it indicates that there is an issue with the cache of the controllerManager, meaning the object could not be retrieved from the cache correctly. This should not be seen as a corner case; rather, it should be considered a critical error. As we can observe from controller-runtime's CacheReader.Get(), these errors typically only occur when the cache is being used improperly.
https://github.com/kubernetes-sigs/controller-runtime/blob/main/pkg/cache/internal/cache_reader.go#L57-L105

If a Pod is successfully stored in the cache, it should ultimately be retrievable from the cache as well. As long as the Pod is successfully retrieved, the expectation will be cleared. Therefore, there won't be any memory leaks in this process.

Member:

Considering it a critical error that causes KubeRay to fail, or leaving it as a memory leak, is too aggressive for me.
Most users can tolerate creating additional Pods and then scaling them down, but they will complain if KubeRay crashes.

In ReplicaSet's implementation, the function SatisfiedExpectations returns true if it expired.

https://github.com/kubernetes/kubernetes/blob/40f222b6201d5c6476e5b20f57a4a7d8b2d71845/pkg/controller/controller_utils.go#L191-L193
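In other words, an expectation that outlives its timeout is simply treated as satisfied, so a lost event degrades into extra reconciliation work rather than a stuck controller. A minimal sketch of that check is below; the five-minute timeout and the field names are assumptions for illustration.

```go
// Sketch: expectations that are too old are treated as satisfied so a lost
// event cannot block reconciliation forever. Timeout value and field names
// are illustrative assumptions.
package example

import "time"

const expectationsTimeout = 5 * time.Minute

type timestampedExpectation struct {
	recordTimestamp time.Time
	fulfilled       bool
}

// satisfied returns true once the expectation is fulfilled or has expired.
func (e timestampedExpectation) satisfied(now time.Time) bool {
	if e.fulfilled {
		return true
	}
	return now.After(e.recordTimestamp.Add(expectationsTimeout))
}
```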

@kevin85421 (Member) left a comment

Some other issues (1, 2 don't need to be addressed in this PR):

  1. We store rayPod in the indexer, whereas ReplicaSet only stores an integer, which is more memory-efficient. We can run some benchmarks after this PR is merged to check for any memory issues. If so, we can switch to ReplicaSet's implementation.

  2. There might be some corner cases for suspend. Maybe we can only call deleteAllPods when there are no in-flight requests (i.e. expectation is satisfied).

  3. Possible memory leak if KubeRay misses the resource event.

ray-operator/controllers/ray/raycluster_controller.go (outdated, resolved)
@@ -184,6 +187,8 @@ func (r *RayClusterReconciler) Reconcile(ctx context.Context, request ctrl.Reque

// No match found
if errors.IsNotFound(err) {
// Clear all related expectations
rayClusterScaleExpectation.Delete(instance.Name, instance.Namespace)
Member:

Is it possible to cause a memory leak if KubeRay doesn't receive the resource event after the RayCluster CR is deleted? If it is possible, we should consider a solution, such as adding a finalizer to the RayCluster, to ensure the cleanup of the cache in a follow-up PR.

Author:

> Is it possible to cause a memory leak if KubeRay doesn't receive the resource event after the RayCluster CR is deleted? If it is possible, we should consider a solution, such as adding a finalizer to the RayCluster, to ensure the cleanup of the cache in a follow-up PR.

Perhaps we can clean up when rayCluster.DeletionTimestamp.IsZero() == false? This way, even if we lose the events of the RayCluster, we can still rely on the events from the Pods to trigger reconciliation.

Member:

> Perhaps we can clean up when rayCluster.DeletionTimestamp.IsZero() == false?

This still does not completely prevent the memory leak. Do I miss anything?

Author:

> Perhaps we can clean up when rayCluster.DeletionTimestamp.IsZero() == false?
>
> This still does not completely prevent the memory leak. Do I miss anything?

There is indeed such a possibility. It's not too complex to handle; I will address it in a follow-up PR.

ray-operator/controllers/ray/raycluster_controller.go (outdated, resolved)
@@ -804,6 +817,7 @@ func (r *RayClusterReconciler) reconcilePods(ctx context.Context, instance *rayv
}
logger.Info("reconcilePods", "The worker Pod has already been deleted", pod.Name)
} else {
rayClusterScaleExpectation.ExpectScalePod(pod.Namespace, instance.Name, worker.GroupName, pod.Name, expectations.Delete)
Member:

The deletion of Pods inside WorkersToDelete is idempotent. Expectations seem to be unnecessary in this case, but I am fine if we also use Expectations to track it.

Author:

> The deletion of Pods inside WorkersToDelete is idempotent. Expectations seem to be unnecessary in this case, but I am fine if we also use Expectations to track it.

Using Expectation here can help avoid repeatedly calling the APIServer to delete the same Pod.
