[RayCluster][Fix] Add expectations of RayCluster #2150
base: master
Conversation
…iation with Expectations
Hi @Eikykun, thank you for the PR! I will review it next week. Are you on Ray Slack? We can iterate more quickly there since this is a large PR. My Slack handle is "Kai-Hsun Chen (ray team)". Thanks!
I will review this PR tomorrow.
cc @rueian Would you mind giving this PR a review? I think I don't have enough time to review it today. Thanks!
ray-operator/controllers/ray/expectations/active_expectation.go
```go
defer func() {
	if satisfied {
		ae.subjects.Delete(expectation)
	}
}()

satisfied, err = expectation.(*ActiveExpectation).isSatisfied()
```
There are many read-after-write operations in the `ActiveExpectations`. Should we use a mutex to wrap these operations? For example, will the above `ae.subjects.Delete(expectation)` delete an unsatisfied expectation?
The `subjects` field of `ActiveExpectations` is a `ThreadSafeStore`, using the store provided by `k8s.io/client-go/tools/cache`, so operations on `ActiveExpectations.subjects` are thread-safe. Each item stored in `ActiveExpectations.subjects` is an `ActiveExpectation`, which also uses a `ThreadSafeStore` internally.
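For illustration, here is a minimal sketch of how `ThreadSafeStore` from `k8s.io/client-go/tools/cache` guards concurrent access; the key and value used here are made up, not the PR's actual data:

```go
package main

import (
	"fmt"

	"k8s.io/client-go/tools/cache"
)

func main() {
	// NewThreadSafeStore serializes Add/Get/Delete internally with a lock,
	// so callers do not need an additional mutex for these individual calls.
	store := cache.NewThreadSafeStore(cache.Indexers{}, cache.Indices{})

	store.Add("default/raycluster-sample", "expect head pod creation") // hypothetical key/value
	if obj, exists := store.Get("default/raycluster-sample"); exists {
		fmt.Println("found expectation:", obj)
	}
	store.Delete("default/raycluster-sample")
}
```

Note that each call being thread-safe does not by itself make a read-then-delete sequence atomic, which is what the question above is getting at.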
👍
```go
	return fmt.Errorf("fail to get active expectation item for %s when expecting: %s", key, err)
}

ae.recordTimestamp = time.Now()
```
Should we use a mutex for updating the `recordTimestamp`?
> Should we use a mutex for updating the `recordTimestamp`?
Thanks for your review. 😺
Whether a mutex is needed for `ActiveExpectation`'s `recordTimestamp` depends on its usage context. Currently it is only used within the controller's `Reconcile` func; multiple workers reconcile concurrently there, but only one worker handles a given `reconcile.Request` at any given time.
As a result, only one goroutine handles the `ActiveExpectation`s associated with the same `RayCluster`, so there are no concurrency issues with reading and writing `recordTimestamp`.
However, this could become a problem if `ActiveExpectations` were used externally, for example by an `EventHandler`, but we don't have that use case currently.
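As a rough sketch of that setup, assuming the usual controller-runtime wiring and the `rayv1` import path used by KubeRay (the worker count and the reconciler shown here are illustrative, not the PR's code):

```go
package sketch

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller"

	rayv1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1"
)

// RayClusterReconciler is a placeholder type for this sketch.
type RayClusterReconciler struct {
	client.Client
}

func (r *RayClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// The workqueue guarantees that a given req (namespace/name) is processed
	// by at most one worker at a time, so per-cluster expectation state such
	// as recordTimestamp is not accessed concurrently from here.
	return ctrl.Result{}, nil
}

func (r *RayClusterReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&rayv1.RayCluster{}).
		WithOptions(controller.Options{MaxConcurrentReconciles: 4}). // several workers, still one per key
		Complete(r)
}
```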
👍
Co-authored-by: Rueian <[email protected]> Signed-off-by: Chaer <[email protected]>
Just wondering if the client-go's related informer cache could be fixed instead?
Co-authored-by: Rueian <[email protected]> Signed-off-by: Chaer <[email protected]>
Apologies, I'm not quite clear about what "related informer cache" refers to.
Signed-off-by: Chaer <[email protected]>
According to #715, the root cause is the stale informer cache, so I am wondering if the issue can be solved by fixing the cache, for example doing a manual …
I am reviewing this PR now. I will try to do a review iteration every 1 or 2 days.
I just reviewed a small part of this PR. I will try to do another iteration tomorrow.
```go
resource := ResourceInitializers[i.Kind]()
if err := i.Get(context.TODO(), types.NamespacedName{Namespace: namespace, Name: i.Name}, resource); err == nil {
	return true, nil
} else if errors.IsNotFound(err) && i.RecordTimestamp.Add(30*time.Second).Before(time.Now()) {
```
What does this mean? Do you mean:
(1) The Pod is not found in the informer cache.
(2) KubeRay has already submitted a `Create` request to the K8s API server at t=`RecordTimestamp`. If the `Create` request was made more than 30 seconds ago, we assume it satisfies the expectation.
I can't understand (2). If we sent a request 30 seconds ago and the informer still hasn't received information about the Pod, there are two possibilities:
- (a) There are delays between the K8s API server and the informer cache.
- (b) The creation failed.
For case (a), it is OK for the function to say that the expectation is satisfied. However, for case (b), what will happen if the creation fails and we tell the KubeRay operator it is satisfied?
Case (b) should not occur here, because the expectation is only recorded after the Pod has been created successfully, so the Pod can be expected to show up in the cache:

```go
if err := r.Create(ctx, &pod); err != nil {
	return err
}
rayClusterExpectation.ExpectCreateHeadPod(key, pod.Namespace, pod.Name)
```
Btw, @Eikykun would you mind rebasing with the master branch and resolving the conflict? Thanks!
Got it. From a problem-solving standpoint, if we don't rely on an informer in the controller and directly query the API server for Pods, the cache consistency issue with etcd wouldn't occur. However, this approach would increase network traffic and affect reconciliation efficiency.
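For example, such a direct read could be done with controller-runtime's non-cached reader. This is only a sketch of the trade-off being discussed, not part of the PR; the label key is an assumption:

```go
package sketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

// listPodsUncached reads Pods straight from the API server via GetAPIReader(),
// bypassing the informer cache, at the cost of one extra API call per reconcile.
func listPodsUncached(ctx context.Context, mgr manager.Manager, namespace, clusterName string) (*corev1.PodList, error) {
	apiReader := mgr.GetAPIReader() // client.Reader that is not backed by the cache

	var pods corev1.PodList
	if err := apiReader.List(ctx, &pods,
		client.InNamespace(namespace),
		client.MatchingLabels{"ray.io/cluster": clusterName}, // assumed label key
	); err != nil {
		return nil, err
	}
	return &pods, nil
}
```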
Thanks for your review. I will address the PR issues and resolve the conflicts later.
Signed-off-by: Chaer <[email protected]>
@Eikykun would you mind installing pre-commit (https://github.com/ray-project/kuberay/blob/master/ray-operator/DEVELOPMENT.md) and fixing the linter issues? Thanks!
At a quick glance, it seems that we create an `ActiveExpectationItem` for each Pod's creation, deletion, or update. I have some concerns about a scalability bottleneck caused by the memory usage. In ReplicaSet's source code, it seems to only track the number of Pods expected to be created or deleted per ReplicaSet.
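For comparison, a count-based approach in the spirit of ReplicaSet's `ControllerExpectations` could look roughly like the following. This is a hypothetical sketch with invented names, not code from this PR or from Kubernetes:

```go
package expectations

import "sync"

// CountExpectations tracks only how many creations/deletions are still
// pending per controller key, so memory stays proportional to the number of
// RayClusters rather than the number of Pods.
type CountExpectations struct {
	mu    sync.Mutex
	items map[string]*counts // key: "namespace/name" of the RayCluster
}

type counts struct {
	adds, dels int64
}

func NewCountExpectations() *CountExpectations {
	return &CountExpectations{items: map[string]*counts{}}
}

func (e *CountExpectations) ExpectCreations(key string, n int64) { e.add(key, n, 0) }
func (e *CountExpectations) ExpectDeletions(key string, n int64) { e.add(key, 0, n) }

// CreationObserved / DeletionObserved would be called from the Pod event handler.
func (e *CountExpectations) CreationObserved(key string) { e.add(key, -1, 0) }
func (e *CountExpectations) DeletionObserved(key string) { e.add(key, 0, -1) }

// Satisfied reports whether all expected creations and deletions for key
// have been observed, i.e. whether the reconciler may trust its cache again.
func (e *CountExpectations) Satisfied(key string) bool {
	e.mu.Lock()
	defer e.mu.Unlock()
	c, ok := e.items[key]
	return !ok || (c.adds <= 0 && c.dels <= 0)
}

func (e *CountExpectations) add(key string, adds, dels int64) {
	e.mu.Lock()
	defer e.mu.Unlock()
	c, ok := e.items[key]
	if !ok {
		c = &counts{}
		e.items[key] = c
	}
	c.adds += adds
	c.dels += dels
}
```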
Follow up for ^
Sorry, I didn't have time to reply a few days ago.
I started with …
Signed-off-by: Chaer <[email protected]>
Could you help approve the workflow? cc @kevin85421
@Eikykun, thank you for following up! Sorry for the late review. I had concerns about merging such a large change before Ray Summit. Now, I have enough time to verify the correctness and stability of this PR. This is also one of the most important stability improvements in the post-summit roadmap: https://docs.google.com/document/d/1YYuAQkHKz2UTFMnTDJLg4qnW2OAlYQqjvetP_nvt0nI/edit?tab=t.0 I will resume reviewing this PR this week.
cc @MortalHappiness, can you also give this PR a review pass?
A few questions I'd like to ask: …
This might be an issue left over from the last simplification. Initially, I added many types like RayCluster, Service, etc., considering that resources other than scaled Pods might also require expectations. If we only consider the scaling logic for each group, we can significantly simplify the code. In fact, I recently streamlined the code and reduced the scaling expectation code to around 100 lines. You can find it in the latest commit.
```go
fakeClient := clientFake.NewClientBuilder().WithRuntimeObjects().Build()
exp := NewRayClusterScaleExpectation(fakeClient)
```
I don’t quite understand why our design needs to use a client and actually create an object. Take `scale_expectations_test.go` as an example: you can see that the test covers very simple things without involving any k8s operations. I believe our usage is almost identical, so we shouldn’t need to use the k8s client either.
The fundamental reason for adding a client was that we use `GenerateName` when creating a Pod, rather than knowing the Pod name beforehand as CloneSet does. The `ObserveScale()` method relies on the Pod name specified in `ExpectScale()`.
https://github.com/openkruise/kruise/blob/c426ed9b1ea3565b1a18e2adba628a2d9b956d91/pkg/controller/cloneset/sync/cloneset_scale.go#L205
https://github.com/openkruise/kruise/blob/c426ed9b1ea3565b1a18e2adba628a2d9b956d91/pkg/controller/cloneset/cloneset_event_handler.go#L75
In the design of CloneSet's expectations, the call to `ScaleExpectations()` must occur before creating or deleting a Pod (if `ScaleExpectations()` is called after the Create or Delete, `ObserveScale()` may be invoked in the `podEventHandler` before `ScaleExpectations()`). With `GenerateName`, however, we cannot obtain the Pod name before Create, so the expectation design has to be changed to allow calling `ScaleExpectations()` after creating the Pod.
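A small sketch of the `GenerateName` constraint described above: the Pod name is only known after `Create` returns, so an expectation keyed by Pod name can only be recorded afterwards (the `record` callback stands in for whatever expectation API is used):

```go
package sketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// createWorkerPod shows why the expectation can only be recorded after Create:
// with GenerateName, the API server assigns metadata.name, and the client
// writes the server's response back into pod, so pod.Name is empty beforehand.
func createWorkerPod(ctx context.Context, c client.Client, record func(namespace, name string)) error {
	pod := corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Namespace:    "default",
			GenerateName: "raycluster-sample-worker-", // server appends a random suffix
		},
	}
	if err := c.Create(ctx, &pod); err != nil {
		return err
	}
	// pod.Name is now populated, so the expectation is recorded here.
	record(pod.Namespace, pod.Name)
	return nil
}
```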
A similar issue can be observed in the design of ReplicaSet, which uses `UIDTrackingControllerExpectations` to track Pods but currently does not address UID tracking during Pod creation. Introducing expectations similar to those in ReplicaSet would complicate the code.
https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/replicaset/replica_set.go#L582-L587
https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/replicaset/replica_set.go#L632-L638
Adding a client leads to a simpler implementation, as we no longer need to handle complex `ObserveScale` logic in the `podEventHandler`. We simply check whether the Pods in the informer meet the expectations set during the last reconcile before proceeding with reconciliation, and release the satisfied expectations.
The concern here might be whether the client becomes a performance bottleneck. In the code, if a Pod was created or deleted in the last reconcile, the next reconcile calls `client.Get()` to check whether the created or deleted Pod meets the conditions, and deletes the expectation on success. Under low watch latency, the number of client calls is roughly in line with the number of successful Pod creates/deletes. Under high watch latency, repeated checks may occur more frequently. However, our practical validation so far shows that it does not cause a significant performance bottleneck.
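A hedged sketch of that per-Pod check, with the expectation type and operation names invented for illustration rather than taken from the PR:

```go
package sketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

type scaleOp int

const (
	opCreate scaleOp = iota
	opDelete
)

// podExpectation records one Pod create/delete issued during the previous reconcile.
type podExpectation struct {
	Namespace, Name string
	Op              scaleOp
}

// isSatisfied checks, via the (cache-backed) client, whether an expectation
// recorded in the previous reconcile is now reflected in the informer cache.
// Satisfied expectations can then be deleted before reconciliation proceeds.
func isSatisfied(ctx context.Context, c client.Client, exp podExpectation) (bool, error) {
	var pod corev1.Pod
	err := c.Get(ctx, types.NamespacedName{Namespace: exp.Namespace, Name: exp.Name}, &pod)
	switch {
	case exp.Op == opCreate && err == nil:
		return true, nil // created Pod is visible in the cache
	case exp.Op == opDelete && apierrors.IsNotFound(err):
		return true, nil // deleted Pod is gone from the cache
	case err != nil && !apierrors.IsNotFound(err):
		return false, err // unexpected error; retry on the next reconcile
	default:
		return false, nil // not yet observed; skip or requeue the reconcile
	}
}
```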
Why are these changes needed?
This PR attempts to address issues #715 and #1936 by adding expectation capabilities to ensure the pod is in the desired state during the next Reconcile following pod deletion/creation.
Similar solutions can be referred to at:
Related issue number
Checks