
[BUG] Provision failure returns incorrect error message #640

Closed
ohltyler opened this issue Apr 2, 2024 · 9 comments · Fixed by #642

Labels: bug (Something isn't working), v2.14.0

Comments


ohltyler commented Apr 2, 2024

If a workflow has a runtime failure when provisioning (e.g., the text_embedding processor doesn't exist when running the create_ingest_pipeline step on a cluster without the neural-search plugin installed), the state is correctly set to FAILED with a relevant error message. Example:

{
  "workflow_id": "LGKGn44B51-soDq-mga1",
  "error": "org.opensearch.flowframework.exception.WorkflowStepException during step create_ingest_pipeline, restStatus: BAD_REQUEST",
  "state": "FAILED"
}

However, after updating the cluster to address the runtime failure and re-running the provision API, the request fails and returns a misleading message:

{
  "error": "The template has already been provisioned: LGKGn44B51-soDq-mga1"
}

This should be updated to a more relevant message, along the lines of "The workflow failed provisioning, please deprovision and try again".

Note that after manually running the deprovision API and re-running provision, it works as expected. But it feels odd to have to run deprovision when it does nothing besides reset the state, since there were no created resources to clean up in the first place.
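
For what it's worth, a minimal sketch of the state check this issue is asking for, written in plain Java with hypothetical names (ProvisioningState, checkReprovision) rather than the plugin's actual code: when the prior attempt ended in FAILED, return a message that points the user at deprovision-and-retry instead of the generic "already been provisioned" text.

public class ReprovisionCheckSketch {

    // Hypothetical stand-in for the plugin's provisioning state values.
    enum ProvisioningState { NOT_STARTED, IN_PROGRESS, COMPLETED, FAILED }

    // Returns null when provisioning may proceed, otherwise a user-facing error message.
    static String checkReprovision(ProvisioningState state, String workflowId) {
        switch (state) {
            case COMPLETED:
            case IN_PROGRESS:
                return "The template has already been provisioned: " + workflowId;
            case FAILED:
                // Proposed behavior from this issue: tell the user what to do next
                // rather than claiming the template is already provisioned.
                return "The workflow failed provisioning, please deprovision and try again: " + workflowId;
            default:
                return null; // NOT_STARTED: provisioning may proceed
        }
    }

    public static void main(String[] args) {
        System.out.println(checkReprovision(ProvisioningState.FAILED, "LGKGn44B51-soDq-mga1"));
    }
}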

ohltyler added the bug (Something isn't working) and untriaged labels on Apr 2, 2024

ohltyler commented Apr 2, 2024

But it feels odd to have to run deprovision when it does nothing besides reset the state, since there were no created resources to clean up in the first place.

I don't have a good solution for this part off the top of my head, but from a user perspective it is a bit confusing. Worth exploring this edge case from a usability standpoint.

ohltyler removed the untriaged label on Apr 2, 2024

dbwiddis commented Apr 2, 2024

See #537, which requires the same solution.


ohltyler commented Apr 2, 2024

Additional issue found: the error field looks like it is not cleared, even after deprovisioning and reprovisioning:

{
  "workflow_id": "LGKGn44B51-soDq-mga1",
  "error": "org.opensearch.flowframework.exception.WorkflowStepException during step create_ingest_pipeline, restStatus: BAD_REQUEST",
  "state": "COMPLETED",
  "provisioning_progress": "DONE",
  "provision_start_time": 1712074192234,
  "provision_end_time": 1712074192545,
  "resources_created": [
    {
      "resource_type": "pipeline_id",
      "resource_id": "test-pipeline",
      "workflow_step_name": "create_ingest_pipeline",
      "workflow_step_id": "create_ingest_pipeline"
    },
    {
      "resource_type": "index_name",
      "resource_id": "my-knn-index",
      "workflow_step_name": "create_index",
      "workflow_step_id": "create_index"
    }
  ]
}


dbwiddis commented Apr 2, 2024

Additional issue found: the error field looks like it is not cleared, even after deprovisioning and reprovisioning:

If you review and approve #635, that problem will go away :-)
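
For illustration only, a minimal sketch (hypothetical field handling, not the actual #635 change) of the kind of reset that avoids a stale error: when a new provision run updates the workflow state document, the previous error field is cleared along with setting the new state.

import java.util.HashMap;
import java.util.Map;

public class StateResetSketch {

    // Builds the fields for a state-document update that starts a fresh provision run.
    static Map<String, Object> newProvisionRunFields(long startTimeMillis) {
        Map<String, Object> fields = new HashMap<>();
        fields.put("state", "PROVISIONING");
        fields.put("provision_start_time", startTimeMillis);
        fields.put("error", null); // explicitly clear the error left over from the failed run
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(newProvisionRunFields(System.currentTimeMillis()));
    }
}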


ohltyler commented Apr 2, 2024

Additional issue found: the error field looks like it is not cleared, even after deprovisioning and reprovisioning:

If you review and approve #635, that problem will go away :-)

Ha! Perfect

amitgalitz commented

On a similar note, should we add more detail to the error message itself? With "org.opensearch.flowframework.exception.WorkflowStepException during step create_ingest_pipeline, restStatus: BAD_REQUEST", we use WorkflowStepException to indicate that the step itself is the source of the error, but we don't propagate outside the logs the fact that the text_embedding processor isn't available in the cluster.

Today, if a user hits the create ingest pipeline API directly, they get more information without having to look at the logs:

{
    "error": {
        "root_cause": [
            {
                "type": "parse_exception",
                "reason": "No processor type exists with name [text_embeddinssg]",
                "processor_type": "text_embeddinssg"
            }
        ],
        "type": "parse_exception",
        "reason": "No processor type exists with name [text_embeddinssg]",
        "processor_type": "text_embeddinssg"
    },
    "status": 400
}

amitgalitz commented

@dbwiddis @ohltyler @owaiskazi19 @jackiehanyang @joshpalis
Can I get some input on this? I feel like it's actually a big deal, since there are many other cases where a user calling the underlying API directly would see an exception that our error message hides. Another example: if we register a local model in ML Commons without changing the settings, we get this back:

{
    "error": {
        "root_cause": [
            {
                "type": "illegal_argument_exception",
                "reason": "No eligible node found to execute this request. It's best practice to provision ML nodes to serve your models. You can disable this setting to serve the model on your data node for development purposes by disabling the \"plugins.ml_commons.only_run_on_ml_node\" configuration using the _cluster/setting api"
            }
        ],
        "type": "illegal_argument_exception",
        "reason": "No eligible node found to execute this request. It's best practice to provision ML nodes to serve your models. You can disable this setting to serve the model on your data node for development purposes by disabling the \"plugins.ml_commons.only_run_on_ml_node\" configuration using the _cluster/setting api"
    },
    "status": 400
}

All of that detail is hidden behind our error message.
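
One way to surface that detail, sketched with plain JDK exceptions and a hypothetical helper name (describeStepFailure) rather than the plugin's actual WorkflowStepException handling: walk the cause chain and append the deepest cause's message to the step error that gets stored in the workflow state.

public class RootCauseMessageSketch {

    // Builds a user-facing error string that includes the underlying cause, if any.
    static String describeStepFailure(String stepName, Exception stepException) {
        Throwable root = stepException;
        while (root.getCause() != null && root.getCause() != root) {
            root = root.getCause();
        }
        String base = stepException.getClass().getName() + " during step " + stepName;
        return root == stepException ? base : base + ", caused by: " + root.getMessage();
    }

    public static void main(String[] args) {
        // Simulates a step failure wrapping the kind of parse error shown above.
        Exception cause = new IllegalArgumentException("No processor type exists with name [text_embeddinssg]");
        Exception wrapped = new RuntimeException("create_ingest_pipeline failed", cause);
        System.out.println(describeStepFailure("create_ingest_pipeline", wrapped));
    }
}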

amitgalitz commented

Created a new issue for this: #670

dbwiddis commented

This issue was completed in #642. The new issue raised in the reopen comment is being tracked in #670.
