-
Notifications
You must be signed in to change notification settings - Fork 443
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Fix] add early stopped trials in converter #2004
[Fix] add early stopped trials in converter #2004
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @shaowei-su!
/lgtm
/assign @johnugeorge @tenzen-y
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shaowei-su Thanks for fixing this!
/lgtm
@kubeflow/wg-automl-leads Can you restart failure CI?
Due to the aysnc nature of the trials' updates, the suggestion service actually might be invoked without all the observed metrics available. I updated this logic too so that suggestion service return empty list if no new updates been made. |
@shaowei-su Thanks for updating! /assign @andreyvelich @johnugeorge |
logger.info("Succeeded Trials didn't change: {}\n".format(self.succeeded_trials)) | ||
logger.error("Succeeded Trials didn't change: {}\n".format(self.succeeded_trials)) | ||
logger.error("No new suggestions could be generated, return early..\n") | ||
return [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shaowei-su What if Trial was failed for some reason (e.g. networking) and we ask for the new Trials ?
In that case succeeded_trials
list will stay the same.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a great question @andreyvelich , I can think of two possible routes:
-
Add more conditional branches in the suggestions service, e.g is the service invoked as a result of failed trials? if so, suggestion service will return duplicated suggestions as before this PR change;
-
Retry for trials should be kept as a logic within trial templates themself i.e Job, TFJob, MPIJob etc.. HPO search experiment only make progress when there are succeeded (including early stopped ones) trials complete. In the case of multiple concurrent trials, suggestion service will not return new trials until next succeeded trial's ready.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add more conditional branches in the suggestions service, e.g is the service invoked as a result of failed trials?
But do we need conditional branch to the suggestion service ?
If succeeded_trials
= Succeeded Trials
+ Early Stopped Trials
, then if Trial was failed the Suggestion service will just ask for the new trials (e.g. with duplicated HPs) using current_request_number
.
Retry for trials should be kept as a logic within trial templates themself i.e Job, TFJob, MPIJob etc..
That's valid assumption. In that case, what should be the nature of MaxFailedTrialCount API ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But do we need conditional branch to the suggestion service ?
The reason is that suggestion service can't tell whether the current_request_number
is requested as a result of failed trails (in this case service should return duplicated HPs), or as a result of delay in trial status/observations updates (in this case service should wait until trials are fully updated).
The latter can be reproduced by this example and change the algo to bayesianoptimization
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MaxFailedTrialCount
still serve its purpose as a way to manage lifecycle of experiments. The difference would be whether we add up duplicated HP trials for counting. If suggestion services are to generate those duplicated HP trials, then a single failed HP trial with N failures will cause an experiment to fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when trial early stopped without observation updated in time
Should we avoid sending Early Stopped Trials without observation to the Suggestion service ? e.g. we don't send Trials with Metrics Unavailable condition. I think, Trials logically shouldn't be called "EarlyStopped" until Observation is ready for them. WDYT @shaowei-su @johnugeorge @tenzen-y ?
@shaowei-su you are right with your assumption about current_request_number
.
I know it is hacky that we set EarlyStopping trial condition from the Early Stopping service and only after that we add Observation Trial results. We might want to reconsider some early stopping design when we introduce push-based Metrics Collector: #577
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To fix this, we could try to add the early stopped but non-observation ready trials into the active trials count as well. wdyt?
Since Suggestion and Trial controller works in async manner, I think we should introduce another check to Suggestion Client as I mentioned above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually those trials are already filtered out on the suggestion service side.
What's missing here is the correct currentRequestNum, which should not be incremented before the early stopping metrics are fully updated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shaowei-su But is that correct approach to filter them on the Suggestion side instead of Katib Controller side ?
Some of the suggestion services might not use internal
implementation and just forget about it.
I think the Katib controller should just invoke suggestion service with the correct statuses for the Trials.
E.g. these Trial statuses.
Which means if EarlyStopped Trial doesn't have observation, Controller should not send it to Suggestion Service.
The main purpose of the Suggestion service is just produce new HPs based on Trial results.
I doubt that Suggestion service should do any additional work with Trials from the Katib controller.
What do other think about it ? @johnugeorge @gaocegege @tenzen-y
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andreyvelich I made a quick updates on the controller side filters and suggestion request count logic, ptal! thanks
if !t.IsObservationAvailable() { | ||
continue | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we also remove this check for the Trial observation since we don't send Trials without observation to the Suggestion ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd suggest to keep this filtering logic since others might run different version of controller
vs suggestion service
docker images.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Usually, we ask our users to run the same version of Katib Control Plane to avoid incompatibility issues.
Should we add TODO in this section to remove it in the future versions ?
What do others think @gaocegege @johnugeorge @tenzen-y ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, changed the filter logic to early stopped trials only and let's keep the service side check as it is.
@@ -343,6 +343,9 @@ func (g *General) ConvertTrials(ts []trialsv1beta1.Trial) []*suggestionapi.Trial | |||
if t.IsMetricsUnavailable() { | |||
continue | |||
} | |||
if !t.IsObservationAvailable() { | |||
continue | |||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, what about Failed
Trials (I guess the observation is empty for them, right)? Do we want to send them to the Suggestion service ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it make more sense to filter the failed trails out as those won't provide any updates to the suggestion service.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some Suggestion services record failed Trials on its own DataBase (e.g. Goptuna Suggestion)
@shaowei-su Do you have any concerns with sending Failed Trials to the Suggestion service ?
We add failed Trials to completed, so it won't ask Suggestion service to generate new Trials.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call, I didn't realize that failed trials are tracked by other suggestion services. Updated in the latest commit to filter on early stopped & without observations
trials only, PTAL.
a05ff34
to
8190f9c
Compare
Thank you for these updates @shaowei-su! |
if !t.IsObservationAvailable() && t.IsEarlyStopped() { | ||
continue | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shaowei-su Can we add a test for this condition to the TestSyncAssignments
func or to a new test function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @tenzen-y ! I added few more trails in the mock so this branching logic is also validated.
Btw, could help restart the failed unit tests? Looks like a flaky test failed.
e923f75
to
f354fdd
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shaowei-su Thanks for the updates! I appreciate your work in fixing this bug.
/lgtm
/cc @johnugeorge
/assign @johnugeorge |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the contribution @shaowei-su!
/approve
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: andreyvelich, shaowei-su The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Python suggestion services will convert the trials from request payload before passing in actual search algos. However, early stopped trials are filtered out in this process and thus no updates are provided to search algo and there are duplicated trials created.
What this PR does / why we need it:
Which issue(s) this PR fixes (optional, in
fixes #<issue number>(, fixes #<issue_number>, ...)
format, will close the issue(s) when PR gets merged):Fixes #
#2002
Checklist: