[manager & worker] Migrate dlk into worker interface #66
Comments
This might not be the right place to ask, but my question is loosely related to this issue. In Google Vizier, the API presented looks like this:
That is, client uses SDK to:
In katib, the workflow looks like this (correct me if I'm wrong):
The study config is used for all of these configuration knobs, e.g. the training command to run, job configuration, tunable parameters, etc. It seems to me that the study config is doing more than it should, which is not flexible in the long term. On the other hand, it seems we haven't had any discussion around a client SDK. I haven't thought this through thoroughly, but I want to start a discussion on the topic.
Yeah, I ran into the same problem when I tried to support tf-operator in Katib. I find it hard to configure the TFJob in the current design, since we maintain only one configuration file, studyconfig.
@ddysher Yes. In Katib, the while loop is in the katib manager ( How about making the API more flexible, as below?
This way, you can use Katib more flexibly. @gaocegege Does it work for your problem?
Yeah, this looks promising and closely mirrors the Vizier API. We can use a 'bottom-up' approach to design the API: start with the lowest-level API, where users are required to call individual functions themselves; then, once we have a clearer picture of how people are using the API, we can provide an ambassador component, like what Katib does today, to offer a higher-level API for users. WDYT?
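To make the 'bottom-up' idea concrete, here is a minimal sketch, not Katib's or Vizier's actual API. `ToyStudyClient`, its methods, `run_study`, and `toy_loss` are all hypothetical names; the client is an in-memory stand-in that uses random search instead of a real suggestion service. The low-level calls (`should_stop`, `get_suggestion`, `complete_trial`) are what users would call themselves, and `run_study` is the kind of higher-level helper that could be layered on top later.

```python
import random
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Trial:
    trial_id: int
    params: dict
    metric: Optional[float] = None  # filled in once the trial is completed


@dataclass
class ToyStudyClient:
    """Hypothetical in-memory stand-in for a suggestion service (random search)."""
    max_trials: int
    seed: int = 0
    trials: list = field(default_factory=list)

    def __post_init__(self):
        self._rng = random.Random(self.seed)

    # --- low-level API: callers drive the loop themselves ---
    def should_stop(self) -> bool:
        return len(self.trials) >= self.max_trials

    def get_suggestion(self) -> Trial:
        # Assumption: a single tunable parameter "lr", sampled uniformly.
        trial = Trial(len(self.trials), {"lr": self._rng.uniform(1e-4, 1e-1)})
        self.trials.append(trial)
        return trial

    def complete_trial(self, trial: Trial, metric: float) -> None:
        trial.metric = metric

    def best_trial(self) -> Trial:
        done = [t for t in self.trials if t.metric is not None]
        return min(done, key=lambda t: t.metric)


def run_study(client, evaluate):
    """Higher-level helper: owns the loop that callers would otherwise write."""
    while not client.should_stop():
        trial = client.get_suggestion()
        client.complete_trial(trial, evaluate(trial.params))
    return client.best_trial()


def toy_loss(params):
    # Stand-in for launching a training job and reading back its metric.
    return (params["lr"] - 0.01) ** 2


best = run_study(ToyStudyClient(max_trials=20), toy_loss)
print(best.params, best.metric)
```

With this split, someone who needs custom job launching (e.g. a TFJob) can skip `run_study` and write the loop against the low-level calls directly, while simple cases keep a one-line entry point.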
@ddysher Great. Then I'm going to try to break down the |
Is the proposal to have trainer code call Katib to get parameters? e.g. launch a TFJob that would call GetSuggestion? My expectation is that the loop
is not part of the TFJob/PyTorch job itself. Any thoughts about how metrics would be reported? A couple of things come to mind
Thoughts?
What is the current status of this? Have things changed in the current release? I think I had a similar question to @ddysher in kubeflow/examples#162. I've created #138 requesting a design doc to figure out how the different pieces fit together.
We already fixed this issue.
@jlewi Sorry for the late reply. The usage above is copied from the Vizier paper, but I agree that trainer code shouldn't be aware of such a low-level API. Ideally, launching pods and reporting metrics should be done somewhere else (I haven't had time to think this through yet). As you've already mentioned, I do think a design doc should come first before we dive into the details.
#46 (comment)