
Chocolate service db exhausted #1122

Closed
StefanoFioravanzo opened this issue Apr 3, 2020 · 23 comments · Fixed by #1205

Comments

@StefanoFioravanzo
Member

/kind bug

What steps did you take and what happened:
When creating experiments that use the grid search algorithm, trials sometimes stop being generated, even though the experiment is still shown as running.
By printing the logs of the suggestions pods I can see the following:

ERROR:sqlalchemy.pool.impl.NullPool:Exception during reset or similar
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool/base.py", line 680, in _finalize_fairy
    fairy._reset(pool)
  File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool/base.py", line 867, in _reset
    pool._dialect.do_rollback(self)
  File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/default.py", line 502, in do_rollback
    dbapi_connection.rollback()
sqlite3.ProgrammingError: SQLite objects created in a thread can only be used in that same thread. The object was created in thread id 140636731225856 and this is thread id 140636053288704.
ERROR:sqlalchemy.pool.impl.NullPool:Exception closing connection <sqlite3.Connection object at 0x7fe8b4397650>
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool/base.py", line 680, in _finalize_fairy
    fairy._reset(pool)
  File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool/base.py", line 867, in _reset
    pool._dialect.do_rollback(self)
  File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/default.py", line 502, in do_rollback
    dbapi_connection.rollback()
sqlite3.ProgrammingError: SQLite objects created in a thread can only be used in that same thread. The object was created in thread id 140636731225856 and this is thread id 140636053288704.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool/base.py", line 270, in _close_connection
    self._dialect.do_close(connection)
  File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/default.py", line 508, in do_close
    dbapi_connection.close()
sqlite3.ProgrammingError: SQLite objects created in a thread can only be used in that same thread. The object was created in thread id 140636731225856 and this is thread id 140636053288704.
DEBUG:filelock:Attempting to release lock 140636764717688 on my_db.db.lock
INFO:filelock:Lock 140636764717688 released on my_db.db.lock
INFO:BaseChocolateService:{'_chocolate_id': 0, '_loss': -83.971292, 'bm9kZXNfbnVtYmVy': 512.0}
INFO:BaseChocolateService:{'_chocolate_id': 1, '_loss': -82.655502, 'bm9kZXNfbnVtYmVy': 768.0}
DEBUG:filelock:Attempting to acquire lock 140636764717688 on my_db.db.lock
INFO:filelock:Lock 140636764717688 acquired on my_db.db.lock
DEBUG:filelock:Attempting to release lock 140636764717688 on my_db.db.lock
INFO:filelock:Lock 140636764717688 released on my_db.db.lock
INFO:BaseChocolateService:Chocolate db is exhausted, increase Search Space or decrease maxTrialCount!
INFO:BaseChocolateService:Chocolate db is exhausted, increase Search Space or decrease maxTrialCount!
INFO:BaseChocolateService:Chocolate db is exhausted, increase Search Space or decrease maxTrialCount!

Note that this happens only sometimes; I was not able to pin down a specific configuration that causes it. With some parameter configurations it happens, with others the experiment completes by exploring the search space as expected.
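For context on the traceback above: sqlite3 connections refuse to be used from any thread other than the one that created them. A minimal sketch of the usual workaround (illustrative only, not the actual Chocolate/Katib code) is to open the connection with check_same_thread=False and serialize access with a lock:

```python
import sqlite3
import threading

# check_same_thread=False lets the connection cross threads; the driver
# itself is still not thread-safe, so a lock must serialize all access.
conn = sqlite3.connect(":memory:", check_same_thread=False)
conn.execute("CREATE TABLE trials (id INTEGER, loss REAL)")
lock = threading.Lock()

def record_trial(trial_id, loss):
    with lock:
        conn.execute("INSERT INTO trials VALUES (?, ?)", (trial_id, loss))
        conn.commit()

# Without check_same_thread=False this call would raise
# sqlite3.ProgrammingError, exactly like the traceback above.
t = threading.Thread(target=record_trial, args=(0, -83.971292))
t.start()
t.join()
```

SQLAlchemy exposes the same flag via connect_args when creating the engine.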

What did you expect to happen:
I would at least expect to see the experiment failing (related to #1120), but a more informative message would also be expected. The message Chocolate db is exhausted, increase Search Space or decrease maxTrialCount! doesn't make sense, as maxTrialCount has not yet been reached and there are still configurations in the search space that have not been tried.

Environment:

  • Kubeflow version: 1.0
  • Minikube version: 1.2.0 (MiniKF Latest)
  • Kubernetes version: 1.14
@issue-label-bot issue-label-bot bot added the bug label Apr 3, 2020
@jlewi jlewi removed the bug label Apr 3, 2020
@gaocegege
Member

Thanks for the issue, I will have a look

/cc @andreyvelich

@andreyvelich
Member

@StefanoFioravanzo Thank you for the issue.
Try using the latest image for the Grid suggestion. This PR: #1116 should fix some of these problems.
Also, can you show your Search Space and maxTrialCount?

@StefanoFioravanzo
Member Author

Writing here (after #1124 ) as it is more related:

So, now when the search space is exhausted, but the goal is not reached and maxTrialCount is also not reached, the suggestion pod still outputs db is exhausted, increase Search Space or decrease maxTrialCount!. It would be ideal to have this reflected in the status of the experiment, and it should be simple to differentiate between the case where maxTrialCount has been reached and the case where the search space was fully explored. WDYT?

I am willing to help with this and send a PR; maybe you can give me some pointers?

@andreyvelich
Member

@StefanoFioravanzo Can you show me your yaml file with the Experiment where you got this error?

I think we should check this in the Validate Algorithm Settings step.
I am not sure a user should be able to start a Grid Experiment when maxTrialCount is greater than the count of all possible parameter combinations.

You are right, let's discuss in this issue.

@StefanoFioravanzo
Member Author

Here it is:

apiVersion: kubeflow.org/v1alpha3
kind: Experiment
metadata:
  labels:
    controller-tools.k8s.io: '1.0'
  name: katib-test
spec:
  algorithm:
    algorithmName: grid
  parallelTrialCount: 1
  maxFailedTrialCount: 6
  maxTrialCount: 30
  objective:
    additionalMetricNames:
    goal: 100
    objectiveMetricName: result
    type: maximize
  parameters:
  - feasibleSpace:
      max: '3'
      min: '1'
    name: b
    parameterType: int
  - name: c
    parameterType: categorical
    feasibleSpace:
      list:
        - "1"
        - "9"
        - "15"
  trialTemplate:
    goTemplate:
      rawTemplate: |
        apiVersion: batch/v1
        kind: Job
        metadata:
          name: {{.Trial}}
          namespace: {{.NameSpace}}
        spec:
          backoffLimit: 0
          template:
            metadata:
              annotations:
                sidecar.istio.io/inject: "false"
            spec:
              restartPolicy: Never
              containers:
                - name: {{.Trial}}
                  image: <image>
                  command:
                    - python3 -u -c "<fn>"

This is the experiment; the search space is trivial, just for testing. I omitted the image and the Python command, since they are in a private repository. What I have is just a Python function that returns the sum of its input arguments, that's it.

So, as you can see, maxTrialCount is 30 and the goal is 100, which can never be reached. After all the trials are completed, the suggestion pod raises the db exhausted error, when I would expect it to report that it has no more trials to run and exit gracefully. Maybe there could be a special experiment state for this case, when the goal could not be reached.
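For reference, counting the grid for the Experiment above (assuming a step of 1 for the int parameter): b can take 3 values and c has 3 choices, so grid search can generate at most 9 distinct trials, fewer than the maxTrialCount of 30:

```python
from itertools import product

b_values = [1, 2, 3]         # int parameter b: min 1, max 3 (step 1 assumed)
c_values = ["1", "9", "15"]  # categorical parameter c

grid = list(product(b_values, c_values))
print(len(grid))  # 9
```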

@andreyvelich
Member

@StefanoFioravanzo I believe that for your search space, 30 Trials should be enough.
After using the latest image for the Grid with this Experiment, are you still getting the DB exhausted error?

@StefanoFioravanzo
Member Author

Exactly! I get that error when the search space is fully explored and the goal has not been reached.

@andreyvelich
Member

@StefanoFioravanzo I will test your search space with this amount of Trials and get back to you.

@andreyvelich
Member

@StefanoFioravanzo I just figured out that your Experiment has a maxTrialCount greater than the number of all possible parameter combinations.
I believe that in your search space the maximum number of trials is 9.

@StefanoFioravanzo
Member Author

So what you are saying is that this behaviour is expected when maxTrialCount is greater than the number of all combinations?

@andreyvelich
Member

Yes, it is expected.
As I said above, I think we will validate it in ValidateAlgorithmSettings.
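A sketch of what such a check could look like (hypothetical helper names, not the actual ValidateAlgorithmSettings code, which lives in the Go controller):

```python
# Hypothetical validation sketch: reject a Grid Experiment whose
# maxTrialCount exceeds the number of parameter combinations.
def count_grid_combinations(parameters):
    total = 1
    for p in parameters:
        space = p["feasibleSpace"]
        if p["parameterType"] == "categorical":
            total *= len(space["list"])
        elif p["parameterType"] == "int":
            total *= int(space["max"]) - int(space["min"]) + 1  # step 1 assumed
    return total

def validate_grid(parameters, max_trial_count):
    combos = count_grid_combinations(parameters)
    if max_trial_count > combos:
        raise ValueError(
            f"maxTrialCount {max_trial_count} exceeds the {combos} "
            f"possible parameter combinations")
```

With the Experiment above (9 combinations, maxTrialCount: 30), this check would fail fast instead of exhausting the DB at runtime.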

@elikatsis
Member

Hi @andreyvelich,
thank you for holding #1205.
I'm writing to describe our (Arrikto) proposal to tackle this issue.

We've seen that the controller speaks gRPC with the algorithm service, but their communication is limited to passing suggestions and successful answers, independently of what actually happens inside the algorithm service.

To tackle this, and potentially more issues, we have thought of letting the controller have better insight into the state of the algorithm service.

Let's illustrate our proposal with an example: a time comes when there are no more suggestions to generate. When this happens, the algorithm service returns zero suggestions. However, the corresponding Experiment and Suggestion will stay in the Running state.
There are two problems with that:

  1. They are in a false state. They shouldn't be Running, as no new Trials will be created and nothing will change in general
  2. They get into an infinite reconciliation loop

However, the algorithm service knows when there are no more new suggestions to generate. This is true at least for the grid algorithm, since that is the one we have worked with. (Can you provide insight into what happens with other algorithms?)

That's why we thought of the following:

  1. Make the controller check the gRPC response status
  2. Have the grid algorithm return a NotFound gRPC error when it cannot produce any more suggestions
  3. Introduce a new (final) Suggestion state for that case: Exhausted
  4. If an Experiment has its Suggestion Exhausted and the goal is not reached, then the Experiment should move to the Failed state
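A sketch of point 2 (illustrative class and method names, not the actual Katib service definitions; assumes grpcio):

```python
import grpc

class GridSuggestionService:  # would subclass the generated gRPC servicer
    def __init__(self, search_space):
        self.search_space = list(search_space)
        self.cursor = 0

    def GetSuggestions(self, request, context):
        if self.cursor >= len(self.search_space):
            # Signal exhaustion with a dedicated status code so the
            # controller can tell it apart from an ordinary empty reply.
            context.abort(grpc.StatusCode.NOT_FOUND,
                          "Grid search space is exhausted")
        suggestion = self.search_space[self.cursor]
        self.cursor += 1
        return suggestion
```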

We have all this implemented, tested in specific use cases, and we can create a PR so that you can see for yourself. Of course, we can iterate on it.

The rationale for this proposal is that users should only have to define a search space; they shouldn't care about anything else.
If you think about it, deciding the maximum number of trials allowed isn't a data scientist's job. It is a setting that doesn't have much to do with the research part.
On the contrary, it's a system configuration, and it's the cluster admin who should have the last word on it, because it's a resource-allocation matter (it could even be modified by a webhook, who knows).

So, although failing early when maxTrialCount > possible suggestions would indeed solve the issue, we believe that it's not the right approach. Data scientists shouldn't be counting the exact number of combinations before submitting an Experiment.

@gaocegege
Member

gaocegege commented Jun 5, 2020

Generally LGTM. It works only if the algorithm can reach an Exhausted state, like grid search, right?

@elikatsis
Member

Hello @gaocegege,
thanks for reading the proposal!

Yes, this change shouldn't affect other algorithms, unless they also adopt the same error code.
What I mean is that we will decide on a way for the service to notify the controller that it is exhausted. What we propose is a gRPC error with a specific code.
If this applies to other services as well, then they can be extended to return the same thing.

Another good thing about this approach: in the future, we may want to communicate something new between the service and the controller. All it would take is:

  1. extend a service to return this via gRPC
  2. enable handling this on the controller side
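On the caller side, checking the gRPC response status could look roughly like this (a Python sketch with hypothetical names; the actual controller is written in Go):

```python
import grpc

def fetch_suggestions(stub, request):
    """Return (reply, exhausted). Hypothetical wrapper around the stub."""
    try:
        return stub.GetSuggestions(request), False
    except grpc.RpcError as err:
        if err.code() == grpc.StatusCode.NOT_FOUND:
            return None, True  # mark the Suggestion as Exhausted
        raise  # any other error keeps its current handling
```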

@gaocegege
Member

Gotcha. SGTM. Maybe it is related to #1061


@andreyvelich
Member

andreyvelich commented Jun 5, 2020

Thank you for writing this @elikatsis.
I am just worried that this change will be specific to Grid. I can't find problems in any other algorithms where maxTrialCount must be modified.
It's the same story for Hyperband: parallelTrialCount should be < maxParallelCount, depending on the AlgorithmSettings: https://github.com/kubeflow/katib/blob/master/pkg/suggestion/v1beta1/hyperband/service.py#L217. In that case, what can we provide for the data scientist?

We try to design the Katib controller, states, etc. independent of the Suggestion services,
so Katib users can create their own Suggestions and reuse the Katib controller logic.

Is it correct to run an Experiment that in no situation can be Succeeded? If a user decided to use the Grid algorithm, he/she knows in advance how many Trials can be generated.
In addition, we provide this information from ValidateAlgorithmSettings.

What do you think @johnugeorge ?

@elikatsis
Member

It's the same story for Hyperband: parallelTrialCount should be < maxParallelCount, depending on the AlgorithmSettings: https://github.com/kubeflow/katib/blob/master/pkg/suggestion/v1beta1/hyperband/service.py#L217. In that case, what can we provide for the data scientist?

Unfortunately, I'm not familiar with the Hyperband algorithm. Could you give some more context? What's the main idea of the algorithm? What are eta, r_l, and maxParallelCount?

Skimming it and seeing the following lines, it looks like Hyperband could also use an Exhausted state, but I may be wrong:

param = HyperBandParam.convert(alg_settings)
if param.current_s < 0:
    # Hyperband outer loop has finished
    return reply

Is it correct to run an Experiment that in no situation can be Succeeded?

Whether the experiment succeeds or not is based on whether the goal is reached.
We believe this is independent of whether the experiment should start or not, exactly because it may succeed before even reaching the point of exhaustion.
For example, if the first trial reaches the goal, then there will be no more trials and the experiment is marked as Succeeded, goal reached, etc.

if isObjectiveGoalReached {
    msg := "Experiment has succeeded because Objective goal has reached"
    instance.MarkExperimentStatusSucceeded(ExperimentGoalReachedReason, msg)
    instance.Status.CompletionTime = &now
    collector.IncreaseExperimentsSucceededCount(instance.Namespace)
    return
}

If a user decided to use the Grid algorithm, he/she knows in advance how many Trials can be generated.

While this is true and the data scientist can calculate the combinations, we believe this is something they shouldn't be bothered with.

We will open a PR early next week so that you can have a better look. We first need to smooth some rough edges, rebase on the current master, and verify that it still functions properly, at least in our use cases.

@andreyvelich
Member

I believe Hyperband has strict restrictions on parallel Trials:

smax = int(math.log(rl)/math.log(eta))
max_parallel = int(math.ceil(eta**smax))

So Smax controls Parallel Trial Count.
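Plugging example values into the two lines above (the values are illustrative, not from this thread):

```python
import math

eta, rl = 3.0, 9.0  # example values only
smax = int(math.log(rl) / math.log(eta))    # 2
max_parallel = int(math.ceil(eta ** smax))  # 9
print(smax, max_parallel)
```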

Whether the experiment succeeds or not is based on whether the goal is reached or not.

That is not true in every case. We support running an Experiment without a goal (#1065), so when maxTrialCount has been reached the Experiment is Succeeded.

@elikatsis
Member

That is not true in every case. We support running an Experiment without a goal (#1065), so when maxTrialCount has been reached the Experiment is Succeeded.

Thanks for this; indeed I had missed that. I changed the code to take it into consideration.
I opened #1213. The PR could be split into distinct PRs; for starters we wanted to show you the full picture of the implementation in an easy, compact way.

@stale

stale bot commented Nov 24, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale

stale bot commented Dec 19, 2020

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

@stale stale bot closed this as completed Dec 19, 2020