
Chocolate service db exhausted #1122

Closed
StefanoFioravanzo opened this issue Apr 3, 2020 · 23 comments · Fixed by #1205

Comments

@StefanoFioravanzo
Member

/kind bug

What steps did you take and what happened:
When creating experiments that use the grid search algorithm, trials sometimes stop being generated, even though the experiment is still shown as running.
By printing the logs of the suggestions pods I can see the following:

ERROR:sqlalchemy.pool.impl.NullPool:Exception during reset or similar
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool/base.py", line 680, in _finalize_fairy
    fairy._reset(pool)
  File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool/base.py", line 867, in _reset
    pool._dialect.do_rollback(self)
  File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/default.py", line 502, in do_rollback
    dbapi_connection.rollback()
sqlite3.ProgrammingError: SQLite objects created in a thread can only be used in that same thread. The object was created in thread id 140636731225856 and this is thread id 140636053288704.
ERROR:sqlalchemy.pool.impl.NullPool:Exception closing connection <sqlite3.Connection object at 0x7fe8b4397650>
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool/base.py", line 680, in _finalize_fairy
    fairy._reset(pool)
  File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool/base.py", line 867, in _reset
    pool._dialect.do_rollback(self)
  File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/default.py", line 502, in do_rollback
    dbapi_connection.rollback()
sqlite3.ProgrammingError: SQLite objects created in a thread can only be used in that same thread. The object was created in thread id 140636731225856 and this is thread id 140636053288704.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool/base.py", line 270, in _close_connection
    self._dialect.do_close(connection)
  File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/default.py", line 508, in do_close
    dbapi_connection.close()
sqlite3.ProgrammingError: SQLite objects created in a thread can only be used in that same thread. The object was created in thread id 140636731225856 and this is thread id 140636053288704.
DEBUG:filelock:Attempting to release lock 140636764717688 on my_db.db.lock
INFO:filelock:Lock 140636764717688 released on my_db.db.lock
INFO:BaseChocolateService:{'_chocolate_id': 0, '_loss': -83.971292, 'bm9kZXNfbnVtYmVy': 512.0}
INFO:BaseChocolateService:{'_chocolate_id': 1, '_loss': -82.655502, 'bm9kZXNfbnVtYmVy': 768.0}
DEBUG:filelock:Attempting to acquire lock 140636764717688 on my_db.db.lock
INFO:filelock:Lock 140636764717688 acquired on my_db.db.lock
DEBUG:filelock:Attempting to release lock 140636764717688 on my_db.db.lock
INFO:filelock:Lock 140636764717688 released on my_db.db.lock
INFO:BaseChocolateService:Chocolate db is exhausted, increase Search Space or decrease maxTrialCount!
INFO:BaseChocolateService:Chocolate db is exhausted, increase Search Space or decrease maxTrialCount!
INFO:BaseChocolateService:Chocolate db is exhausted, increase Search Space or decrease maxTrialCount!

Note that this happens only sometimes; I was not able to pin down a specific configuration that causes it. With some parameter configurations it happens, with others the experiment completes by exploring the search space as expected.
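For context on the traceback above: sqlite3 connections refuse to be used from any thread other than the one that created them. A minimal sketch of the usual workaround (illustrative only, not the actual Chocolate/Katib code) is to open the connection with check_same_thread=False and serialize access with a lock:

```python
import sqlite3
import threading

# check_same_thread=False lets the connection cross threads; the driver
# itself is still not thread-safe, so a lock must serialize all access.
conn = sqlite3.connect(":memory:", check_same_thread=False)
conn.execute("CREATE TABLE trials (id INTEGER, loss REAL)")
lock = threading.Lock()

def record_trial(trial_id, loss):
    with lock:
        conn.execute("INSERT INTO trials VALUES (?, ?)", (trial_id, loss))
        conn.commit()

# Without check_same_thread=False this call would raise
# sqlite3.ProgrammingError, exactly like the traceback above.
t = threading.Thread(target=record_trial, args=(0, -83.971292))
t.start()
t.join()
```

SQLAlchemy exposes the same flag via connect_args when creating the engine.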

What did you expect to happen:
I would at least expect to see the experiment failing (related to #1120), but a more informative message would also be expected. The message Chocolate db is exhausted, increase Search Space or decrease maxTrialCount! doesn't make sense, as maxTrialCount has not yet been reached and there are still configurations in the search space that have not been tried.

Environment:

  • Kubeflow version: 1.0
  • Minikube version: 1.2.0 (MiniKF Latest)
  • Kubernetes version: 1.14
@issue-label-bot issue-label-bot bot added the bug label Apr 3, 2020
@jlewi jlewi removed the bug label Apr 3, 2020
@gaocegege
Member

Thanks for the issue, I will have a look

/cc @andreyvelich

@andreyvelich
Member

@StefanoFioravanzo Thank you for the issue.
Try using the latest image for the Grid suggestion. This PR: #1116 should fix some of these problems.
Also, can you show your Search Space and maxTrialCount?

@StefanoFioravanzo
Member Author

Writing here (after #1124 ) as it is more related:

So, now when the search space is exhausted, but the goal is not reached and maxTrialCount is also not reached, the suggestion pod still outputs db is exhausted, increase Search Space or decrease maxTrialCount!. It would be ideal to have this reflected in the status of the experiment, and it should be simple to differentiate between the case where maxTrialCount has been reached and the case where the search space was fully explored. WDYT?

I am willing to help with this and send a PR; maybe you can give me some pointers?

@andreyvelich
Member

@StefanoFioravanzo Can you show me your yaml file with the Experiment where you got this error?

I think we should check this in the Validate Algorithm Settings step.
I am not sure a user should be able to start a Grid Experiment when maxTrialCount is greater than the count of all possible parameter combinations.

You are right, let's discuss in this issue.

@StefanoFioravanzo
Member Author

Here it is:

apiVersion: kubeflow.org/v1alpha3
kind: Experiment
metadata:
  labels:
    controller-tools.k8s.io: '1.0'
  name: katib-test
spec:
  algorithm:
    algorithmName: grid
  parallelTrialCount: 1
  maxFailedTrialCount: 6
  maxTrialCount: 30
  objective:
    additionalMetricNames:
    goal: 100
    objectiveMetricName: result
    type: maximize
  parameters:
  - feasibleSpace:
      max: '3'
      min: '1'
    name: b
    parameterType: int
  - name: c
    parameterType: categorical
    feasibleSpace:
      list:
        - "1"
        - "9"
        - "15"
  trialTemplate:
    goTemplate:
      rawTemplate: |
        apiVersion: batch/v1
        kind: Job
        metadata:
          name: {{.Trial}}
          namespace: {{.NameSpace}}
        spec:
          backoffLimit: 0
          template:
            metadata:
              annotations:
                sidecar.istio.io/inject: "false"
            spec:
              restartPolicy: Never
              containers:
                - name: {{.Trial}}
                  image: <image>
                  command:
                    - python3 -u -c "<fn>"

This is the experiment; the search space is trivial, just for testing. I omitted the image and the Python command, since they are in a private repository. What I have is just a Python function that returns the sum of its input arguments, that's it.

So, as you can see, maxTrialCount is 30 and the goal is 100, which can never be reached. After all the trials are completed, the suggestion pod raises the db exhausted error, when I would expect it to report that it has no more trials to run and exit gracefully. Maybe there could be a special experiment state for this case, when the goal could not be reached.
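For reference, counting the grid for the Experiment above (assuming a step of 1 for the int parameter): b can take 3 values and c has 3 choices, so grid search can generate at most 9 distinct trials, fewer than the maxTrialCount of 30:

```python
from itertools import product

b_values = [1, 2, 3]         # int parameter b: min 1, max 3 (step 1 assumed)
c_values = ["1", "9", "15"]  # categorical parameter c

grid = list(product(b_values, c_values))
print(len(grid))  # 9
```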

@andreyvelich
Member

@StefanoFioravanzo I believe that for your search space, 30 Trials should be enough.
After using the latest image for the Grid with this Experiment, are you still getting the DB exhausted error?

@StefanoFioravanzo
Member Author

Exactly! I get that error when the search space is fully explored and the goal has not been reached.

@andreyvelich
Member

@StefanoFioravanzo I will test your search space with this amount of Trials and get back to you.

@andreyvelich
Member

@StefanoFioravanzo I just figured out that your Experiment has a maxTrialCount greater than the number of all possible parameter combinations.
I believe that in your search space the maximum number of trials is 9.

@StefanoFioravanzo
Member Author

So what you are saying is that this behaviour is expected when maxTrialCount is greater than the number of all combinations?

@andreyvelich
Member

Yes, it is expected.
As I said above, I think we will validate it in ValidateAlgorithmSettings.
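A sketch of what such a check could look like (hypothetical helper names, not the actual ValidateAlgorithmSettings code, which lives in the Go controller):

```python
# Hypothetical validation sketch: reject a Grid Experiment whose
# maxTrialCount exceeds the number of parameter combinations.
def count_grid_combinations(parameters):
    total = 1
    for p in parameters:
        space = p["feasibleSpace"]
        if p["parameterType"] == "categorical":
            total *= len(space["list"])
        elif p["parameterType"] == "int":
            total *= int(space["max"]) - int(space["min"]) + 1  # step 1 assumed
    return total

def validate_grid(parameters, max_trial_count):
    combos = count_grid_combinations(parameters)
    if max_trial_count > combos:
        raise ValueError(
            f"maxTrialCount {max_trial_count} exceeds the {combos} "
            f"possible parameter combinations")
```

With the Experiment above (9 combinations, maxTrialCount: 30), this check would fail fast instead of exhausting the DB at runtime.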

@elikatsis
Member

Hi @andreyvelich,
thank you for holding #1205.
I'm writing to describe our (Arrikto) proposal to tackle this issue.

We've seen that the controller speaks gRPC with the algorithm service, but their communication is limited to passing suggestions and successful answers, independently of what actually happens inside the algorithm service.

To tackle this, and potentially more issues, we have thought of letting the controller have better insight into the state of the algorithm service.

Let's illustrate our proposal with an example: a time comes when there are no more suggestions to generate. When this happens, the algorithm service returns zero suggestions. However, the corresponding Experiment and Suggestion will stay in the Running state.
There are two problems with that:

  1. They are in a false state. They shouldn't be Running, as no new Trials will be created and nothing will change in general
  2. They get into an infinite reconciliation loop

However, the algorithm service knows when there are no more new suggestions to generate. This is true at least for the grid algorithm, since that is the one we have worked with. (Can you provide insight into what happens with other algorithms?)

That's why we thought of the following:

  1. Make the controller check the gRPC response status
  2. Have the grid algorithm return a NotFound gRPC error when it cannot produce any more suggestions
  3. Introduce a new (final) Suggestion state for that case: Exhausted
  4. If an Experiment has its Suggestion Exhausted and the goal is not reached, then the Experiment should move to the Failed state
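A sketch of point 2 (illustrative class and method names, not the actual Katib service definitions; assumes grpcio):

```python
import grpc

class GridSuggestionService:  # would subclass the generated gRPC servicer
    def __init__(self, search_space):
        self.search_space = list(search_space)
        self.cursor = 0

    def GetSuggestions(self, request, context):
        if self.cursor >= len(self.search_space):
            # Signal exhaustion with a dedicated status code so the
            # controller can tell it apart from an ordinary empty reply.
            context.abort(grpc.StatusCode.NOT_FOUND,
                          "Grid search space is exhausted")
        suggestion = self.search_space[self.cursor]
        self.cursor += 1
        return suggestion
```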

We have all this implemented, tested in specific use cases, and we can create a PR so that you can see for yourself. Of course, we can iterate on it.

The rationale for this proposal is that users should only have to define a search space; they shouldn't care about anything else.
If you think about it, deciding the maximum number of trials allowed isn't a data scientist's job. It is a setting that doesn't have much to do with the research part.
On the contrary, it's a system configuration, and it's the cluster admin who should have the last word on it, because it's a resource-allocation matter (it could even be modified by a webhook, who knows).

So, although failing early when maxTrialCount > possible suggestions would indeed solve the issue, we believe that it's not the right approach. Data scientists shouldn't be counting the exact number of combinations before submitting an Experiment.

@gaocegege
Member

gaocegege commented Jun 5, 2020

Generally LGTM. It works only if the algorithm can reach an Exhausted state, like grid search, right?

@elikatsis
Member

Hello @gaocegege,
thanks for reading the proposal!

Yes, this change shouldn't affect other algorithms, unless they also adopt the same error code.
What I mean is that we will decide on a way for the service to notify the controller that it is exhausted. What we propose is a gRPC error with a specific code.
If this applies to other services as well, then they can be extended to return the same thing.

Another good thing about this approach: in the future, we may want to communicate something new between the service and the controller. All it would take is:

  1. extend a service to return this via gRPC
  2. enable handling this on the controller side
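On the caller side, checking the gRPC response status could look roughly like this (a Python sketch with hypothetical names; the actual controller is written in Go):

```python
import grpc

def fetch_suggestions(stub, request):
    """Return (reply, exhausted). Hypothetical wrapper around the stub."""
    try:
        return stub.GetSuggestions(request), False
    except grpc.RpcError as err:
        if err.code() == grpc.StatusCode.NOT_FOUND:
            return None, True  # mark the Suggestion as Exhausted
        raise  # any other error keeps its current handling
```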

@gaocegege
Member

Gotcha. SGTM. Maybe it is related to #1061


@andreyvelich
Member

andreyvelich commented Jun 5, 2020

Thank you for writing this @elikatsis.
I am just worried that this change will be specific to Grid. I can't find problems in any other algorithms where maxTrialCount must be modified.
It's the same story for Hyperband: parallelTrialCount should be < maxParallelCount, depending on the AlgorithmSettings: https://github.com/kubeflow/katib/blob/master/pkg/suggestion/v1beta1/hyperband/service.py#L217. In that case, what can we provide for the data scientist?

We try to design the Katib controller, states, etc. independent of the Suggestion services,
so Katib users can create their own Suggestions and reuse the Katib controller logic.

Is it correct to run an Experiment that in no situation can be Succeeded? If a user decided to use the Grid algorithm, he/she knows in advance how many Trials can be generated.
In addition, we provide this information from ValidateAlgorithmSettings.

What do you think @johnugeorge ?

@elikatsis
Member

It's the same story for Hyperband: parallelTrialCount should be < maxParallelCount, depending on the AlgorithmSettings: https://github.com/kubeflow/katib/blob/master/pkg/suggestion/v1beta1/hyperband/service.py#L217. In that case, what can we provide for the data scientist?

Unfortunately, I'm not familiar with the Hyperband algorithm. Could you give some more context? What's the main idea of the algorithm? What are eta, r_l, and maxParallelCount?

Skimming it and seeing the following lines, it looks like Hyperband could also use an Exhausted state, but I may be wrong:

param = HyperBandParam.convert(alg_settings)
if param.current_s < 0:
    # Hyperband outer loop has finished
    return reply

Is it correct to run an Experiment that in no situation can be Succeeded?

Whether the experiment succeeds or not is based on whether the goal is reached.
We believe this is independent of whether the experiment should start or not, exactly because it may succeed before even reaching the point of exhaustion.
For example, if the first trial reaches the goal, then there will be no more trials and the experiment is marked as Succeeded, goal reached, etc.

if isObjectiveGoalReached {
    msg := "Experiment has succeeded because Objective goal has reached"
    instance.MarkExperimentStatusSucceeded(ExperimentGoalReachedReason, msg)
    instance.Status.CompletionTime = &now
    collector.IncreaseExperimentsSucceededCount(instance.Namespace)
    return
}

If a user decided to use the Grid algorithm, he/she knows in advance how many Trials can be generated.

While this is true and the data scientist can calculate the combinations, we believe this is something they shouldn't be bothered with.

We will open a PR early next week so that you can have a better look. We first need to smooth some rough edges, rebase on the current master, and verify that it still functions properly, at least in our use cases.

@andreyvelich
Member

I believe Hyperband has strict restrictions on parallel Trials:

smax = int(math.log(rl)/math.log(eta))
max_parallel = int(math.ceil(eta**smax))

So Smax controls Parallel Trial Count.
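Plugging example values into the two lines above (the values are illustrative, not from this thread):

```python
import math

eta, rl = 3.0, 9.0  # example values only
smax = int(math.log(rl) / math.log(eta))    # 2
max_parallel = int(math.ceil(eta ** smax))  # 9
print(smax, max_parallel)
```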

Whether the experiment succeeds or not is based on whether the goal is reached or not.

That is not true in every case. We support running an Experiment without a goal (#1065), so when maxTrialCount has been reached the Experiment is Succeeded.

@elikatsis
Member

That is not true in every case. We support running an Experiment without a goal (#1065), so when maxTrialCount has been reached the Experiment is Succeeded.

Thanks for this; indeed I had missed that. I changed the code to take it into consideration.
I opened #1213. The PR could be split into distinct PRs; for starters we wanted to show you the full picture of the implementation in an easy, compact way.

@stale

stale bot commented Nov 24, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale

stale bot commented Dec 19, 2020

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

@stale stale bot closed this as completed Dec 19, 2020