Chocolate service db exhausted #1122
Comments
Issue-Label Bot is automatically applying the labels:
Please mark this comment with 👍 or 👎 to give our bot feedback!
Thanks for the issue, I will have a look /cc @andreyvelich
@StefanoFioravanzo Thank you for the issue.
Writing here (after #1124) as it is more related:
@StefanoFioravanzo Can you show me your YAML file with the Experiment where you got this error? I think we should check it in the Validate Algorithm Settings step. You are right, let's discuss it in this issue.
Here it is:

```yaml
apiVersion: kubeflow.org/v1alpha3
kind: Experiment
metadata:
  labels:
    controller-tools.k8s.io: '1.0'
  name: katib-test
spec:
  algorithm:
    algorithmName: grid
  parallelTrialCount: 1
  maxFailedTrialCount: 6
  maxTrialCount: 30
  objective:
    additionalMetricNames:
    goal: 100
    objectiveMetricName: result
    type: maximize
  parameters:
  - feasibleSpace:
      max: '3'
      min: '1'
    name: b
    parameterType: int
  - name: c
    parameterType: categorical
    feasibleSpace:
      list:
      - "1"
      - "9"
      - "15"
  trialTemplate:
    goTemplate:
      rawTemplate: |
        apiVersion: batch/v1
        kind: Job
        metadata:
          name: {{.Trial}}
          namespace: {{.NameSpace}}
        spec:
          backoffLimit: 0
          template:
            metadata:
              annotations:
                sidecar.istio.io/inject: "false"
            spec:
              restartPolicy: Never
              containers:
              - name: {{.Trial}}
                image: <image>
                command:
                - python3 -u -c "<fn>"
```

This is the experiment; the search space is silly, just for testing. I omitted the image and the Python command, since they are in a private repository. What I have is just a Python function that returns the sum of the input arguments, that's it. So as you can see the maxTrialCount is 30 and the goal for result is 100, which could never be reached. After all the trials are completed, the suggestions pod raises the db exhausted error, when I would expect it to say that it has no more trials to run and exit gracefully. Maybe there could be a special experiment state for this case, when the goal could not be reached.
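As a quick sanity check on the numbers above (my own back-of-the-envelope illustration, not Katib code): the grid defined by this Experiment has only 3 × 3 = 9 distinct combinations, far fewer than maxTrialCount = 30, so a grid backend necessarily runs out of new suggestions well before 30 trials.

```python
from itertools import product

# Back-of-the-envelope check of the search space above (illustration only,
# not Katib code): b is an int in [1, 3], c is categorical over {"1", "9", "15"}.
b_values = [1, 2, 3]
c_values = ["1", "9", "15"]

grid = list(product(b_values, c_values))
print(len(grid))  # 9 distinct combinations -> at most 9 unique grid trials,
                  # so with maxTrialCount = 30 the grid is exhausted early.
```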
@StefanoFioravanzo I believe for your search space, 30 Trials should be enough.
Exactly! I get that error when the search space is fully explored and the goal has not been reached.
@StefanoFioravanzo I will test your search space with this amount of Trials and get back to you.
@StefanoFioravanzo I just figured out that your Experiment has
So what you are saying is that this behaviour is expected, when the
Yes, it is expected.
Hi @andreyvelich, We've seen that the controller speaks gRPC with the algorithm service, but their communication is limited to passing suggestions and successful answers, independently of what actually happens inside the algorithm service. To tackle this, and potentially more issues, we have thought of letting the controller have better insight into the state of the algorithm service. Let's get to our proposal using an example: a time comes when there are no more suggestions to generate. When this happens, the algorithm service returns zero suggestions. However, the corresponding Experiment and Suggestion will stay in Running.
However, the algorithm service knows when there are no more new suggestions to generate. This is true for the grid algorithm. That's why we thought of two things:
We have all this implemented, tested in specific use cases, and we can create a PR so that you can see for yourself. Of course, we can iterate on it. The rationale for this proposal is that users should define a search space and they shouldn't care about anything else. So, although failing early if max trials > possible suggestions would indeed solve the issue, we believe that it's not the right approach. Data scientists shouldn't be counting the exact number of combinations before submitting an Experiment.
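A rough sketch of the kind of signal being proposed, with hypothetical names (this is not the actual Katib gRPC API, just an illustration of the idea):

```python
# Hypothetical illustration of the proposal (names are made up, not the real
# Katib gRPC API): the suggestion service reports that the search space is
# exhausted instead of erroring out, so the controller can finish the
# Experiment gracefully rather than leaving it Running.
class GridSuggestionSketch:
    def __init__(self, search_space):
        # search_space: list of all remaining parameter combinations for the grid
        self.remaining = list(search_space)

    def get_suggestions(self, request_count):
        """Return up to request_count new assignments plus an exhaustion flag."""
        batch = self.remaining[:request_count]
        self.remaining = self.remaining[request_count:]
        return {
            "assignments": batch,
            "search_space_exhausted": not self.remaining,
        }

# Controller-side pseudo-logic: when a reply carries search_space_exhausted=True,
# stop requesting Trials and mark the Suggestion/Experiment as finished.
```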
Generally LGTM. It works only if the algorithm has the
Hello @gaocegege, Yes, this change shouldn't affect other algorithms, except if they also use the same code. Another good thing about this approach is this: in the future, we may want to communicate something new between the service and the controller. All it would need is:
Gotcha. SGTM. Maybe it is related to #1061
Thank you for writing this @elikatsis. We try to design the Katib controller, states, etc. to be independent of the Suggestion services. Is it correct to run an Experiment that in any situation can never be Succeeded? If a user decides to use the Grid algorithm, he/she knows in advance how many Trials can be generated. What do you think @johnugeorge?
Unfortunately, I'm not familiar with the Hyperband algorithm. Could you give some more context? What's the main idea of the algorithm? Skimming through it and seeing the following lines, it looks like Hyperband could also use an approach like this: katib/pkg/suggestion/v1beta1/hyperband/service.py, lines 32 to 35 in c2c5288
Whether the experiment succeeds or not is based on whether the goal is reached or not: katib/pkg/controller.v1alpha3/experiment/util/status_util.go, lines 153 to 158 in 5c4624f
While this stands true and the data scientist can calculate the combinations, we believe this is something they shouldn't be bothered with. We will open a PR early next week so that you can have a better look. We first need to smooth some rough edges, rebase on current master, and verify it still functions properly, in our use cases at least.
I believe Hyperband has strict restrictions on parallel Trials: katib/pkg/suggestion/v1beta1/hyperband/service.py, lines 215 to 216 in c2c5288
So Smax controls Parallel Trial Count.
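For context on that, here is a minimal sketch of the standard Hyperband bracket math (the usual formulation from the Hyperband paper, written as a generic illustration rather than taken from Katib's service.py), which shows why the largest bracket, indexed by s_max, bounds how many Trials are requested at once:

```python
import math

def hyperband_brackets(max_resource, eta=3.0):
    """Standard Hyperband schedule (generic sketch, not Katib's implementation)."""
    s_max = int(math.floor(math.log(max_resource) / math.log(eta)))
    budget = (s_max + 1) * max_resource
    for s in range(s_max, -1, -1):
        # n configurations evaluated in this bracket, each starting with resource r
        n = int(math.ceil(budget / max_resource * (eta ** s) / (s + 1)))
        r = max_resource * (eta ** -s)
        yield s, n, r

# The first bracket (s = s_max) asks for the most configurations at once,
# so s_max effectively caps the useful parallel Trial count.
for s, n, r in hyperband_brackets(max_resource=81, eta=3):
    print(f"bracket s={s}: {n} configs, initial resource {r:g}")
```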
That is not always true. We support running an Experiment without a goal (#1065). So when
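To make the distinction concrete, here is a hypothetical completion check covering both cases (illustration only; the actual logic is the Go code in experiment/util/status_util.go referenced above):

```python
def experiment_is_succeeded(best_value, goal, objective_type,
                            trials_completed, max_trial_count):
    """Illustrative completion check, not the actual Katib controller code."""
    if goal is not None:
        # Goal-based success: has the best observed metric reached the goal?
        if objective_type == "maximize" and best_value >= goal:
            return True
        if objective_type == "minimize" and best_value <= goal:
            return True
    # Without a goal (see #1065), the Experiment can only finish by exhausting
    # its trial budget (or, per this proposal, the whole search space).
    return trials_completed >= max_trial_count
```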
Thanks for this, indeed I had missed that. I changed the code to take it into consideration.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.
/kind bug
What steps did you take and what happened:
When creating experiments that use the grid search algorithm, sometimes trials stop being generated, even though the experiment is still seen as running. By printing the logs of the suggestions pods I can see the following:
Note that this is happening only sometimes; I was not able to pin down a specific configuration that causes it. With some parameter configurations it happens, with others the experiment completes by exploring the search space as expected.
What did you expect to happen:
I would at least expect to see the experiment failing (related to #1120), but a more informative message would be expected. The message `Chocolate db is exhausted, increase Search Space or decrease maxTrialCount!` doesn't make sense, as the maxTrialCount is still not reached and there are some more configurations of the search space that have not been tried.

Environment: