
Quickstart pipeline API having problems with stress tests #129

Closed
Tomcli opened this issue Jul 7, 2021 · 8 comments
Tomcli commented Jul 7, 2021

Describe the bug

@yhwang can you describe the errors that you found?

To Reproduce

Steps to reproduce the behavior:

  1. Deploy read-only MLX using the manifests from "Add MLX readonly install k8s manifests" #126
  2. Run stress tests against the MLX pipelines API



yhwang commented Jul 7, 2021

Here is the error message:

Traceback (most recent call last):
  File "/usr/src/app/swagger_server/util.py", line 259, in invoke_controller_impl
    results = impl_func(**parameters)
  File "/usr/src/app/swagger_server/controllers_impl/pipeline_service_controller_impl.py", line 194, in list_pipelines
    api_pipelines: [ApiPipeline] = load_data(ApiPipelineExtended, filter_dict=filter_dict, sort_by=sort_by,
  File "/usr/src/app/swagger_server/data_access/mysql_client.py", line 678, in load_data
    _verify_or_create_table(table_name, swagger_class)
  File "/usr/src/app/swagger_server/data_access/mysql_client.py", line 359, in _verify_or_create_table
    _validate_schema(table_name, swagger_class)
  File "/usr/src/app/swagger_server/data_access/mysql_client.py", line 440, in _validate_schema
    raise ApiError(err_msg)
swagger_server.util.ApiError: The MySQL table 'mlpipeline.pipelines_extended' does not match Swagger class 'ApiPipelineExtended'.
 Found table with columns:
  - 'UUID' varchar(255)
  - 'CreatedAtInSec' bigint(20)
  - 'Name' varchar(255)
  - 'Description' varchar(255)
  - 'Parameters' longtext
  - 'Status' varchar(255)
  - 'DefaultVersionId' varchar(255)
  - 'Namespace' varchar(255)
  - 'Annotations' longtext
  - 'Featured' tinyint(1)
  - 'PublishApproved' tinyint(1).
 Expected table with columns:
  - 'UUID' varchar(255)
  - 'CreatedAtInSec' bigint(20)
  - 'Name' varchar(255)
  - 'Description' longtext
  - 'Parameters' longtext
  - 'Status' varchar(255)
  - 'DefaultVersionId' varchar(255)
  - 'Namespace' varchar(63)
  - 'Annotations' longtext
  - 'Featured' tinyint(1)
  - 'PublishApproved' tinyint(1).
 Delete and recreate the table by calling the API endpoint 'DELETE /pipelines_extended/*' (500)

After importing the quickstart catalog, the pipelines URL works and I can see all pipeline cards. The stress test repeatedly requests 2 of the pipeline cards. After I ran the test for a while, the /apis/v1alpha1/pipelines API started sending back 500: Internal Server Error, and I saw the error message above in the mlx-api pod. I always start with 1 pod for mlx-api; after importing the quickstart catalog, I scale up to 3 or more pods. Not sure if this is related to the issue.


ckadner commented Jul 7, 2021

Could some pods have crashed? There is a code path in the MLX API that creates the pipelines table if it does not exist. That code path was never used before, since we always find the pipelines table already created by KFP or by the init_db.sh script I wrote for the quickstart with Docker Compose.
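The check-then-create path described here can be sketched as follows. This is a hypothetical simplification, not the actual mysql_client.py code; the names and the in-memory "database" are illustrative only.

```python
# Hypothetical sketch of the check-then-create path; a dict stands in
# for the MySQL catalog, and the cache mimics the per-process
# "remember it was verified once" behavior described in this thread.
database: dict[str, dict] = {}   # stands in for MySQL tables
_verified: set[str] = set()      # per-process cache: checked once at startup

def verify_or_create_table(table: str, schema: dict) -> str:
    if table in _verified:
        return "cached"          # later calls skip the schema check entirely
    if table not in database:
        database[table] = dict(schema)   # CREATE TABLE path (normally unused)
        _verified.add(table)
        return "created"
    if database[table] != schema:
        raise RuntimeError(f"table '{table}' does not match expected schema")
    _verified.add(table)
    return "verified"
```

Because the result is cached per process, a pod that created the table never re-validates it, while a freshly started pod runs the validation against whatever is already in the database.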


Tomcli commented Jul 7, 2021

@ckadner when I rerun the init_db.sh job, the tables are recreated and everything works fine. But once we run the stress test again, the above error pops up.

ckadner commented Jul 7, 2021

> @ckadner when I rerun the init_db.sh job, the tables are recreated and everything works fine. But once we ran the stress test again, then the above error will pop up.

That seems to indicate that the MLX API pod does not find the pipelines table and creates it with the wrong column length for the namespace column. This should not happen unless there is a new MySQL instance which does not get initialized in time before the first call to the MLX API's GET /apis/v1alpha1/pipelines.
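One common mitigation for this kind of startup race (not in the MLX API at the time of this discussion) is to poll for the table instead of taking the create path on the first failed lookup. A minimal sketch, where table_exists is a hypothetical callable that would query MySQL's information_schema:

```python
import time

def wait_for_db(table_exists, retries: int = 30, delay: float = 2.0) -> bool:
    """Poll until table_exists() returns True, instead of creating the
    table on the first failed lookup. Returns False if the database
    never became ready within retries * delay seconds."""
    for _ in range(retries):
        if table_exists():
            return True
        time.sleep(delay)
    return False
```

With a loop like this, a pod that starts before MySQL finishes running its init scripts simply waits, rather than racing the initialization with its own CREATE TABLE.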

ckadner commented Jul 7, 2021

This may be an instance of inopportune timing due to the stress test scenario. If we need to support that, I can make changes to the MLX API. (In the Docker Compose setup I made the catalog upload service dependent on the MySQL service having finished the initialization.)
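In Docker Compose, that startup ordering can be expressed with a healthcheck. An illustrative fragment (service names are assumptions, not the actual quickstart file):

    services:
      mysql:
        image: mysql:8
        healthcheck:
          test: ["CMD", "mysqladmin", "ping", "-h", "localhost"]
          interval: 5s
          retries: 10
      mlx-api:
        depends_on:
          mysql:
            condition: service_healthy

Note that condition: service_healthy waits for the healthcheck to pass, which is stronger than plain depends_on (which only orders container starts, not readiness).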

yhwang commented Jul 8, 2021

@ckadner I guess the problem is caused by the second or third pod when we scale up mlx-api. Like I mentioned, we always do the quickstart import with replicas=1, i.e. only the 1st pod. Then I scale mlx-api up to replicas=2 or 3, and this error shows up in the 2nd and 3rd pods.

ckadner commented Jul 8, 2021

> @ckadner I guess the problem is caused by the second or third pod when we scale up the mlx-api. Like I mentioned, we always do the quickstart import when the replicas=1, the 1st pod. Then I scale up the mlx-api to replicas=2 or 3. And this error will show up in 2nd and 3rd pod.

The 2nd and 3rd replicas of the MLX API connect to the same (already initialized) MySQL database.

  • The init_db.sql was not being run. In Docker Compose the mysql service gets initialized via a "magic" volume:
    volumes:
        - ./init_db.sql:/docker-entrypoint-initdb.d/init_db.sql
    MySQL uses this volume to find initialization scripts: anything under /docker-entrypoint-initdb.d/ is executed at MySQL startup (PR #126)
  • The 1st mlx-api pod does not find the pipelines table, runs CREATE TABLE with the incorrect namespace column, and internally remembers that it created it
  • The 2nd and 3rd mlx-api pods check and find that the pipelines table exists, but then go on to verify the table schema and complain about the incorrect namespace column length
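The sequence in the bullets above can be reproduced in a self-contained sketch. A dict stands in for MySQL, and the column types are taken from the error message quoted earlier; the point is that the create path and the validation path disagree on the Namespace/Description columns.

```python
# Self-contained reproduction of the race described above. These two
# schemas are deliberately different: the CREATE TABLE path writes one
# thing, the schema validation expects another -- that mismatch is the bug.
CREATE_SCHEMA = {"Namespace": "varchar(255)", "Description": "varchar(255)"}
EXPECTED_SCHEMA = {"Namespace": "varchar(63)", "Description": "longtext"}

database: dict[str, dict] = {}   # stands in for the shared MySQL instance

def pod_startup() -> str:
    """What each mlx-api replica does on its first pipelines request."""
    table = database.get("pipelines_extended")
    if table is None:
        # 1st pod: init_db.sql never ran, so the table is missing and gets
        # created with the (incorrect) create-path schema; this pod then
        # remembers it created the table and never re-validates it
        database["pipelines_extended"] = dict(CREATE_SCHEMA)
        return "created"
    if table != EXPECTED_SCHEMA:
        # 2nd/3rd pod: the table exists but fails schema validation
        raise RuntimeError("table 'mlpipeline.pipelines_extended' does not "
                           "match Swagger class 'ApiPipelineExtended'")
    return "verified"
```

The first call succeeds ("created"); any later replica hitting the same database raises the schema-mismatch error, which matches the 500s seen only on the scaled-up pods.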

@ckadner ckadner closed this as completed in 3c10fda Jul 8, 2021

ckadner commented Jul 16, 2021

The MLX API is not designed to be running with multiple replicas:

  • database schema initialization and/or schema verification is done once (and cached) at API startup (this issue)
  • API settings are currently file-based and stored in the API pod, so multiple pods can toggle each other's settings on/off (i.e. the Inference Services were either on or off depending on which API instance handled the request, see Disable Inference Services by default in read-only deployments #135)
  • GET request caching assumes single instance API deployment PR #140
  • there are likely more issues to be listed here :-)
