katib metrics collector solution #685
Overview

This solution's key ideas are:

See the architecture below:

Data Structure

Example
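For illustration only, a single metric reported by the collector for a trial could be modeled with a small Go struct like the one below; the type and field names are assumptions, not Katib's actual API.

```go
package main

import (
	"fmt"
	"time"
)

// MetricLog is a hypothetical record of one metric observation reported by
// the metrics collector for a trial. Field names are illustrative only.
type MetricLog struct {
	TrialName string    // trial the metric belongs to
	Name      string    // metric name, e.g. "accuracy" or "loss"
	Value     string    // metric value as printed by the training code
	Timestamp time.Time // when the metric was observed
}

func main() {
	m := MetricLog{
		TrialName: "example-trial-1",
		Name:      "accuracy",
		Value:     "0.93",
		Timestamp: time.Now(),
	}
	fmt.Printf("%+v\n", m)
}
```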
Implementation Detail

Considering that when running a trial whose worker is a PyTorchJob/TFJob there may be multiple Pods running the model training task and emitting metrics, duplicate metrics could be generated. To avoid metric duplication for a trial, the metricsCollector works as a sidecar container that collects metrics and reports them to katib-manager, as shown below. The metricsCollector sidecar container should only be injected into the corresponding Pod, as above, by a Pod-level MutatingWebhook. However, since a Pod-level webhook degrades cluster performance because Pod operations are frequent, we will instead inject the metricsCollector sidecar container at the trial's Job level via a TFJob/PyTorchJob MutatingWebhook. A rough sketch of this injection is shown below.
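As a sketch of what the Job-level webhook could do when mutating a trial's pod template, the snippet below appends a metrics-collector container. The container name, image, flags, and manager address are assumptions for illustration, not the actual Katib implementation.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// injectMetricsCollector appends a metrics-collector sidecar to a pod
// template, roughly what a TFJob/PyTorchJob mutating webhook could do.
// The image, flags, and manager address are hypothetical.
func injectMetricsCollector(tmpl *corev1.PodTemplateSpec, trialName string) {
	sidecar := corev1.Container{
		Name:  "metrics-collector",
		Image: "example/metrics-collector:latest", // hypothetical image
		Args: []string{
			"--trial-name", trialName,
			"--manager-addr", "katib-manager.kubeflow:6789", // hypothetical flag and address
		},
	}
	tmpl.Spec.Containers = append(tmpl.Spec.Containers, sidecar)
}

func main() {
	tmpl := &corev1.PodTemplateSpec{
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{
				{Name: "pytorch", Image: "example/train:latest"},
			},
		},
	}
	injectMetricsCollector(tmpl, "example-trial-1")
	for _, c := range tmpl.Spec.Containers {
		fmt.Println(c.Name)
	}
}
```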
Future

The metrics collector mentioned above can not only be applied in Katib (persisting metric values into the Katib data backend) but is also useful in many other areas, for example early stopping of training, training monitoring, training process visualization, etc.
LGTM, generally. Let's discuss it today.
For injecting the metrics collector sidecar, @gaocegege and I have another implementation. In this new design:
Can you explain which component is referred to by "MutatingWebHook of metrics collector"?
@johnugeorge I think we should say |
Summary: in the worst case, we use a Pod-level webhook to inject the sidecar into the Pod. Since Kubernetes 1.15, webhooks support objectSelector, so we would not have the performance problem; see the sketch below.
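For illustration, here is a hedged sketch of such a selector using the admissionregistration/v1 Go types (the webhook name and label key are assumptions; on Kubernetes 1.15 the equivalent field lives in the v1beta1 API).

```go
package main

import (
	"fmt"

	admissionregistrationv1 "k8s.io/api/admissionregistration/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// Restrict a Pod-level mutating webhook with an objectSelector so that
	// only Pods carrying a (hypothetical) trial label are sent to it; all
	// other Pod operations bypass the webhook entirely.
	webhook := admissionregistrationv1.MutatingWebhook{
		Name: "sidecar-injector.katib.kubeflow.org", // hypothetical webhook name
		ObjectSelector: &metav1.LabelSelector{
			MatchLabels: map[string]string{
				"katib-metrics-collector-injection": "enabled", // hypothetical label
			},
		},
	}
	fmt.Printf("webhook %q selects pods with labels %v\n",
		webhook.Name, webhook.ObjectSelector.MatchLabels)
}
```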
In the Istio solution, if a namespace has the "istio-injection=enabled" label, all Pods in that namespace are injected by the Pod-level webhook, and its performance impact is small. So I think maybe we can handle it like this:
Yeah, I think so. In the current version, I think it is the best way.
Then @wuchunghsuan, please update the sidecar injection solution in the design doc to cover the two levels described above. Thanks.
OK, I got it.
Per discussion on Slack, in this solution we just focus on the metricsCollector:
So periodic metrics will be discussed under the early-stopping topic.
Can you take this task, @wuchunghsuan?
It can talk to the apiserver, but this has some problems:
Hi folks, what are you planning on delivering in 0.7?
For this solution, we add a metricsCollector sidecar container into the worker Pod. The metricsCollector sidecar container needs to collect the metrics, and it also needs to know that the worker container has finished so that it can exit, too; otherwise the Pod will keep running.
@richardsliu @johnugeorge @gaocegege IMO, maybe we have to implement solution 3, but I need your suggestions, too.
Thanks for the research!
I am not sure about it. Personally, I prefer the first option.
Per discussion with the team, we decided to choose the shared process namespace solution to solve this problem; a rough sketch of how the sidecar could detect worker completion this way is shown below.
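A minimal sketch of how the sidecar could notice that the worker has finished, assuming shareProcessNamespace: true is set on the Pod so the worker's processes are visible under /proc. The marker string, poll interval, and overall structure are assumptions, not the actual Katib implementation.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"time"
)

// workerRunning reports whether any visible process command line contains
// the given marker (e.g. the worker's entrypoint script name).
func workerRunning(marker string) bool {
	entries, err := os.ReadDir("/proc")
	if err != nil {
		return false
	}
	for _, e := range entries {
		if _, err := strconv.Atoi(e.Name()); err != nil {
			continue // not a PID directory
		}
		cmdline, err := os.ReadFile(filepath.Join("/proc", e.Name(), "cmdline"))
		if err != nil {
			continue
		}
		if bytes.Contains(cmdline, []byte(marker)) {
			return true
		}
	}
	return false
}

func main() {
	const marker = "train.py" // hypothetical worker entrypoint
	for workerRunning(marker) {
		time.Sleep(5 * time.Second)
	}
	fmt.Println("worker process exited; metrics collector can report final metrics and exit")
}
```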
Option 1 seems to be the best option, as it has no other dependency and it is specifically meant for this purpose.
@hougangliu @johnugeorge any update on this?
@jlewi for now, the metrics collector works well via Pod-level metricsCollector sidecar container injection (also included in the 0.7.0 release).
I think we can safely remove the 0.7 label and add 1.0, since the high-priority features are implemented now.
@gaocegege and @hougangliu should we file more fine-grained issues for the remaining work and then close this issue?