[Fix] add early stopped trials in converter #2004

shaowei-su · 2022-11-08T20:41:23Z

The Python suggestion services convert the trials from the request payload before passing them to the actual search algorithms. However, early stopped trials are filtered out in this process, so the search algorithm receives no updates and duplicated trials are created.
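
As a rough illustration of the behavior described above, here is a minimal sketch of a converter that keeps early stopped trials alongside succeeded ones. The `Trial` dataclass, the condition constants, and the `convert` function are hypothetical names invented for this example; they are not the actual types or helpers in pkg/suggestion/v1beta1/internal/trial.py.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical condition names, for illustration only.
SUCCEEDED = "SUCCEEDED"
EARLY_STOPPED = "EARLYSTOPPED"
FAILED = "FAILED"


@dataclass
class Trial:
    # Hypothetical trial shape; not the real Katib API message.
    name: str
    condition: str
    objective: Optional[float] = None


def convert(trials: List[Trial], include_early_stopped: bool = True) -> List[Trial]:
    """Keep only the trials whose results the search algorithm should see."""
    accepted = {SUCCEEDED}
    if include_early_stopped:
        accepted.add(EARLY_STOPPED)
    return [t for t in trials if t.condition in accepted]


trials = [
    Trial("t1", SUCCEEDED, 0.91),
    Trial("t2", EARLY_STOPPED, 0.42),
    Trial("t3", FAILED),
]

# Old behavior (assumed): early stopped trials are dropped.
before = convert(trials, include_early_stopped=False)
# Behavior after the fix (assumed): early stopped trials reach the algorithm.
after = convert(trials)
```

With early stopped trials dropped, the algorithm sees the same history on consecutive calls, which is one plausible way the duplicated suggestions in #2002 could arise.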

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #2002

Checklist:

Docs included if any changes are user facing

coveralls · 2022-11-08T20:44:54Z

Coverage increased (+0.03%) to 73.453% when pulling a05ff34 on shaowei-su:shaowei--fix-es-trials-parse into d97c8ae on kubeflow:master.

andreyvelich

Thank you @shaowei-su!
/lgtm
/assign @johnugeorge @tenzen-y

tenzen-y

@shaowei-su Thanks for fixing this!
/lgtm

@kubeflow/wg-automl-leads Can you restart the failed CI?

shaowei-su · 2022-11-09T23:52:14Z

Due to the async nature of the trials' updates, the suggestion service might actually be invoked before all the observed metrics are available. I updated this logic too, so that the suggestion service returns an empty list if no new updates have been made.
PTAL! @andreyvelich @tenzen-y Thanks!
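
The early return described in this comment can be sketched roughly as follows. `SketchService`, `get_suggestions`, and the `propose` callback are hypothetical names for this example only; the real logic lives in the suggestion services (for instance pkg/suggestion/v1beta1/skopt/base_service.py) and may differ in detail.

```python
from typing import Callable, Dict, List


class SketchService:
    """Hypothetical stand-in for a suggestion service; not Katib's actual class."""

    def __init__(self) -> None:
        # Number of observed trials seen on the previous call.
        self.succeeded_trials = 0

    def get_suggestions(
        self,
        observed: List[Dict],
        count: int,
        propose: Callable[[int], List[Dict]],
    ) -> List[Dict]:
        if self.succeeded_trials > 0 and len(observed) <= self.succeeded_trials:
            # No new observations since the last call: return early
            # instead of proposing the same hyperparameters again.
            return []
        self.succeeded_trials = len(observed)
        return propose(count)


def propose(n: int) -> List[Dict]:
    # Placeholder proposal function standing in for a real search algorithm.
    return [{"lr": 0.1 * (i + 1)} for i in range(n)]


svc = SketchService()
first = svc.get_suggestions([{"name": "t1"}], 2, propose)
second = svc.get_suggestions([{"name": "t1"}], 2, propose)
third = svc.get_suggestions([{"name": "t1"}, {"name": "t2"}], 1, propose)
```

In this sketch the second call returns an empty list because the set of observed trials did not grow, while the third call proceeds once a new observation arrives.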

pkg/suggestion/v1beta1/internal/trial.py

tenzen-y · 2022-11-10T21:28:59Z

@shaowei-su Thanks for updating!
/lgtm

/assign @andreyvelich @johnugeorge

andreyvelich · 2022-11-10T22:10:09Z

pkg/suggestion/v1beta1/skopt/base_service.py

-            logger.info("Succeeded Trials didn't change: {}\n".format(self.succeeded_trials))
+            logger.error("Succeeded Trials didn't change: {}\n".format(self.succeeded_trials))
+            logger.error("No new suggestions could be generated, return early..\n")
+            return []


@shaowei-su What if a Trial failed for some reason (e.g. networking) and we ask for new Trials?
In that case the succeeded_trials list will stay the same.

That's a great question @andreyvelich. I can think of two possible routes:

1. Add more conditional branches in the suggestion service, e.g. is the service invoked as a result of failed trials? If so, the suggestion service will return duplicated suggestions, as it did before this PR.

2. Retry for trials should be kept as logic within the trial templates themselves, i.e. Job, TFJob, MPIJob, etc. The HPO search experiment only makes progress when succeeded trials (including early stopped ones) complete. In the case of multiple concurrent trials, the suggestion service will not return new trials until the next succeeded trial is ready.

Add more conditional branches in the suggestions service, e.g is the service invoked as a result of failed trials?

But do we need a conditional branch in the suggestion service?
If succeeded_trials = Succeeded Trials + Early Stopped Trials, then if a Trial failed, the Suggestion service will just ask for new trials (e.g. with duplicated HPs) using current_request_number.

Retry for trials should be kept as a logic within trial templates themself i.e Job, TFJob, MPIJob etc..

That's a valid assumption. In that case, what should be the nature of the MaxFailedTrialCount API?

cc @johnugeorge @tenzen-y

But do we need conditional branch to the suggestion service ?

The reason is that the suggestion service can't tell whether the current_request_number is requested as a result of failed trials (in which case the service should return duplicated HPs) or as a result of a delay in trial status/observation updates (in which case the service should wait until the trials are fully updated).

The latter can be reproduced by taking this example and changing the algorithm to bayesianoptimization.

MaxFailedTrialCount still serves its purpose as a way to manage the lifecycle of experiments. The difference would be whether we include duplicated HP trials in the count. If suggestion services are to generate those duplicated HP trials, then a single failed HP trial with N failures will cause an experiment to fail.

when trial early stopped without observation updated in time

Should we avoid sending Early Stopped Trials without an observation to the Suggestion service? E.g. we don't send Trials with the Metrics Unavailable condition. I think Trials logically shouldn't be called "EarlyStopped" until the Observation is ready for them. WDYT @shaowei-su @johnugeorge @tenzen-y?
@shaowei-su you are right with your assumption about current_request_number.

I know it is hacky that we set EarlyStopping trial condition from the Early Stopping service and only after that we add Observation Trial results. We might want to reconsider some early stopping design when we introduce push-based Metrics Collector: #577

To fix this, we could try to add the early stopped but non-observation ready trials into the active trials count as well. wdyt?

Since the Suggestion and Trial controllers work asynchronously, I think we should introduce another check to the Suggestion Client, as I mentioned above.

Actually those trials are already filtered out on the suggestion service side.

What's missing here is the correct currentRequestNum, which should not be incremented before the early stopping metrics are fully updated.
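
The idea discussed above, treating early stopped trials whose observations are not yet ready as still active so that the request count is not raised too early, could be sketched like this. `TrialState`, `count_active`, `new_requests`, and the `parallel_trials` parameter are hypothetical names for this example and do not reflect the actual Katib controller code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TrialState:
    # Hypothetical flags for illustration; not Katib's Trial status fields.
    running: bool = False
    early_stopped: bool = False
    observation_ready: bool = False


def count_active(trials: List[TrialState]) -> int:
    """Count running trials plus early stopped trials still waiting for metrics."""
    return sum(
        1
        for t in trials
        if t.running or (t.early_stopped and not t.observation_ready)
    )


def new_requests(parallel_trials: int, trials: List[TrialState]) -> int:
    """Number of additional suggestions to request, never below zero."""
    return max(parallel_trials - count_active(trials), 0)


trials = [
    TrialState(running=True),
    TrialState(early_stopped=True, observation_ready=False),
    TrialState(early_stopped=True, observation_ready=True),
]
pending = new_requests(parallel_trials=3, trials=trials)
```

Under these assumptions only one new suggestion is requested: the early stopped trial without an observation still occupies a slot until its metrics arrive.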

@shaowei-su But is it the correct approach to filter them on the Suggestion side instead of the Katib Controller side?

Some of the suggestion services might not use the internal implementation and just forget about it.
I think the Katib controller should just invoke the suggestion service with the correct statuses for the Trials.
E.g. these Trial statuses.
Which means that if an EarlyStopped Trial doesn't have an observation, the Controller should not send it to the Suggestion Service.
The main purpose of the Suggestion service is just to produce new HPs based on Trial results.

I doubt that the Suggestion service should do any additional work with Trials from the Katib controller.

What do others think about it? @johnugeorge @gaocegege @tenzen-y

@andreyvelich I made some quick updates to the controller-side filters and the suggestion request count logic, PTAL! Thanks

andreyvelich · 2022-11-22T15:38:44Z

pkg/controller.v1beta1/suggestion/suggestionclient/suggestionclient.go

+		if !t.IsObservationAvailable() {
+			continue
+		}


Should we also remove this check for the Trial observation, since we don't send Trials without an observation to the Suggestion service?

I'd suggest keeping this filtering logic, since others might run different versions of the controller and suggestion service Docker images.

Usually, we ask our users to run the same version of the Katib Control Plane to avoid incompatibility issues.
Should we add a TODO in this section to remove it in future versions?
What do others think @gaocegege @johnugeorge @tenzen-y?

Same here, changed the filter logic to early stopped trials only and let's keep the service side check as it is.

andreyvelich · 2022-11-22T15:39:16Z

pkg/controller.v1beta1/suggestion/suggestionclient/suggestionclient.go

@@ -343,6 +343,9 @@ func (g *General) ConvertTrials(ts []trialsv1beta1.Trial) []*suggestionapi.Trial
 		if t.IsMetricsUnavailable() {
 			continue
 		}
+		if !t.IsObservationAvailable() {
+			continue
+		}


Also, what about Failed Trials (I guess the observation is empty for them, right)? Do we want to send them to the Suggestion service?

I think it makes more sense to filter the failed trials out, as those won't provide any updates to the suggestion service.

Some Suggestion services record failed Trials in their own database (e.g. Goptuna Suggestion).
@shaowei-su Do you have any concerns with sending Failed Trials to the Suggestion service?
We add failed Trials to completed, so it won't ask the Suggestion service to generate new Trials.

Good call, I didn't realize that failed trials are tracked by other suggestion services. Updated in the latest commit to filter out only early stopped trials without observations, PTAL.
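
For readers following the thread, here is a minimal Python rendering of the filter agreed on above: skip trials with unavailable metrics, skip early stopped trials that have no observation yet, and keep everything else, including failed trials. The actual change is in Go (`ConvertTrials` in suggestionclient.go); the `Trial` dataclass and `trials_to_send` function below are hypothetical and exist only for this sketch.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trial:
    # Hypothetical flags mirroring the checks quoted in the review;
    # not the real trialsv1beta1.Trial type.
    name: str
    early_stopped: bool = False
    metrics_unavailable: bool = False
    observation_available: bool = False


def trials_to_send(trials: List[Trial]) -> List[Trial]:
    """Return the trials that would be passed on to the suggestion service."""
    kept = []
    for t in trials:
        if t.metrics_unavailable:
            continue
        # Only early stopped trials without an observation are held back.
        if t.early_stopped and not t.observation_available:
            continue
        kept.append(t)
    return kept


trials = [
    Trial("succeeded", observation_available=True),
    Trial("es-ready", early_stopped=True, observation_available=True),
    Trial("es-pending", early_stopped=True),
    Trial("failed"),
    Trial("no-metrics", metrics_unavailable=True),
]
sent = [t.name for t in trials_to_send(trials)]
```

Note that the failed trial is still sent in this sketch, which matches the point raised above that some suggestion services track failed trials themselves.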

andreyvelich · 2022-11-28T12:28:18Z

Thank you for these updates @shaowei-su!
/lgtm
/assign @johnugeorge @gaocegege @tenzen-y

tenzen-y · 2022-11-28T17:42:02Z

pkg/controller.v1beta1/suggestion/suggestionclient/suggestionclient.go

+		if !t.IsObservationAvailable() && t.IsEarlyStopped() {
+			continue
+		}


@shaowei-su Can we add a test for this condition to the TestSyncAssignments func or to a new test function?

Thanks @tenzen-y! I added a few more trials in the mock so this branching logic is also validated.

Btw, could you help restart the failed unit tests? Looks like a flaky test failed.

Co-authored-by: Yuki Iwai <[email protected]>

tenzen-y

@shaowei-su Thanks for the updates! I appreciate your work in fixing this bug.
/lgtm

/cc @johnugeorge

tenzen-y · 2022-12-02T20:23:45Z

/assign @johnugeorge

andreyvelich

Thank you for the contribution @shaowei-su!
/approve

google-oss-prow · 2022-12-05T13:25:17Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich, shaowei-su

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [andreyvelich]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

google-oss-prow bot requested review from anencore94 and sperlingxx November 8, 2022 20:41

google-oss-prow bot added the size/XS label Nov 8, 2022

shaowei-su mentioned this pull request Nov 8, 2022

Duplicated suggestions generated when early stopping is enabled #2002

Closed

andreyvelich reviewed Nov 9, 2022

google-oss-prow bot assigned johnugeorge, tenzen-y and andreyvelich Nov 9, 2022

google-oss-prow bot added the lgtm label Nov 9, 2022

tenzen-y reviewed Nov 9, 2022

google-oss-prow bot removed the lgtm label Nov 9, 2022

tenzen-y reviewed Nov 10, 2022

pkg/suggestion/v1beta1/internal/trial.py (outdated, resolved)

google-oss-prow bot added the lgtm label Nov 10, 2022

andreyvelich reviewed Nov 10, 2022

google-oss-prow bot added size/M and removed lgtm size/XS labels Nov 21, 2022

andreyvelich reviewed Nov 22, 2022

shaowei-su force-pushed the shaowei--fix-es-trials-parse branch 2 times, most recently from a05ff34 to 8190f9c Compare November 26, 2022 21:51

google-oss-prow bot assigned gaocegege Nov 28, 2022

google-oss-prow bot added the lgtm label Nov 28, 2022

tenzen-y reviewed Nov 28, 2022

google-oss-prow bot added size/L and removed lgtm size/M labels Nov 30, 2022

shaowei su and others added 8 commits November 30, 2022 15:54

add early stopped trials in converter

60752f3

error out early

d2231ab

Update pkg/suggestion/v1beta1/internal/trial.py

523dbf0

Co-authored-by: Yuki Iwai <[email protected]>

add incomplete trial filter

33f9bcd

fix ut

ed2cbca

more fixes

58ff940

filter on es

dfea589

enrich existing tests

f354fdd

shaowei-su force-pushed the shaowei--fix-es-trials-parse branch from e923f75 to f354fdd Compare November 30, 2022 23:54

tenzen-y reviewed Dec 2, 2022

google-oss-prow bot requested a review from johnugeorge December 2, 2022 20:18

google-oss-prow bot added the lgtm label Dec 2, 2022

andreyvelich approved these changes Dec 5, 2022

google-oss-prow bot added the approved label Dec 5, 2022

google-oss-prow bot merged commit 1e4df8d into kubeflow:master Dec 5, 2022
