
[BUG] Provision failure returns incorrect error message #640

Closed
ohltyler opened this issue Apr 2, 2024 · 9 comments · Fixed by #642

Labels: bug (Something isn't working), v2.14.0

Comments


ohltyler commented Apr 2, 2024

If a workflow has a runtime failure when provisioning (e.g., the text_embedding processor doesn't exist when running the create_ingest_pipeline step on a cluster without the neural-search plugin installed), the state is correctly set to FAILED with a relevant error message. Example:

{
  "workflow_id": "LGKGn44B51-soDq-mga1",
  "error": "org.opensearch.flowframework.exception.WorkflowStepException during step create_ingest_pipeline, restStatus: BAD_REQUEST",
  "state": "FAILED"
}

However, after updating the cluster to address the runtime failure and re-running the provision API, the request fails and returns a misleading message:

{
  "error": "The template has already been provisioned: LGKGn44B51-soDq-mga1"
}

This should be updated to a more relevant message, along the lines of "The workflow failed provisioning, please deprovision and try again".

Note that after manually running the deprovision API and re-running provision, it works as expected. But it feels odd to have to run deprovision when it does nothing besides reset the state, since there were no created resources to clean up in the first place.
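
For what it's worth, a minimal sketch of the state check this issue is asking for, written in plain Java with hypothetical names (ProvisioningState, checkReprovision) rather than the plugin's actual code: when the prior attempt ended in FAILED, return a message that points the user at deprovision-and-retry instead of the generic "already been provisioned" text.

public class ReprovisionCheckSketch {

    // Hypothetical stand-in for the plugin's provisioning state values.
    enum ProvisioningState { NOT_STARTED, IN_PROGRESS, COMPLETED, FAILED }

    // Returns null when provisioning may proceed, otherwise a user-facing error message.
    static String checkReprovision(ProvisioningState state, String workflowId) {
        switch (state) {
            case COMPLETED:
            case IN_PROGRESS:
                return "The template has already been provisioned: " + workflowId;
            case FAILED:
                // Proposed behavior from this issue: tell the user what to do next
                // rather than claiming the template is already provisioned.
                return "The workflow failed provisioning, please deprovision and try again: " + workflowId;
            default:
                return null; // NOT_STARTED: provisioning may proceed
        }
    }

    public static void main(String[] args) {
        System.out.println(checkReprovision(ProvisioningState.FAILED, "LGKGn44B51-soDq-mga1"));
    }
}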

ohltyler added the bug (Something isn't working) and untriaged labels on Apr 2, 2024

ohltyler commented Apr 2, 2024

But it feels odd to have to run deprovision when it does nothing besides reset the state, since there were no created resources to clean up in the first place.

I don't have a good solution for this part off the top of my head, but from a user perspective it is a bit confusing. Worth exploring this edge case from a usability standpoint.

ohltyler removed the untriaged label on Apr 2, 2024

dbwiddis commented Apr 2, 2024

See #537, which requires the same solution.


ohltyler commented Apr 2, 2024

Additional issue found: the error field looks like it is not cleared, even after deprovisioning and reprovisioning:

{
  "workflow_id": "LGKGn44B51-soDq-mga1",
  "error": "org.opensearch.flowframework.exception.WorkflowStepException during step create_ingest_pipeline, restStatus: BAD_REQUEST",
  "state": "COMPLETED",
  "provisioning_progress": "DONE",
  "provision_start_time": 1712074192234,
  "provision_end_time": 1712074192545,
  "resources_created": [
    {
      "resource_type": "pipeline_id",
      "resource_id": "test-pipeline",
      "workflow_step_name": "create_ingest_pipeline",
      "workflow_step_id": "create_ingest_pipeline"
    },
    {
      "resource_type": "index_name",
      "resource_id": "my-knn-index",
      "workflow_step_name": "create_index",
      "workflow_step_id": "create_index"
    }
  ]
}


dbwiddis commented Apr 2, 2024

Additional issue found: the error field looks like it is not cleared, even after deprovisioning and reprovisioning:

If you review and approve #635, that problem will go away :-)
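
For illustration only, a minimal sketch (hypothetical field handling, not the actual #635 change) of the kind of reset that avoids a stale error: when a new provision run updates the workflow state document, the previous error field is cleared along with setting the new state.

import java.util.HashMap;
import java.util.Map;

public class StateResetSketch {

    // Builds the fields for a state-document update that starts a fresh provision run.
    static Map<String, Object> newProvisionRunFields(long startTimeMillis) {
        Map<String, Object> fields = new HashMap<>();
        fields.put("state", "PROVISIONING");
        fields.put("provision_start_time", startTimeMillis);
        fields.put("error", null); // explicitly clear the error left over from the failed run
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(newProvisionRunFields(System.currentTimeMillis()));
    }
}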


ohltyler commented Apr 2, 2024

Additional issue found: the error field looks like it is not cleared, even after deprovisioning and reprovisioning:

If you review and approve #635, that problem will go away :-)

Ha! Perfect

amitgalitz commented

On a similar note, should we add more detail to the error message itself? With "org.opensearch.flowframework.exception.WorkflowStepException during step create_ingest_pipeline, restStatus: BAD_REQUEST", we use WorkflowStepException to indicate that the step itself is the source of the error, but we don't propagate outside the logs the fact that the text_embedding processor isn't available in the cluster.

Today, if a user hits the create ingest pipeline API directly, they get more information without having to look at the logs:

{
    "error": {
        "root_cause": [
            {
                "type": "parse_exception",
                "reason": "No processor type exists with name [text_embeddinssg]",
                "processor_type": "text_embeddinssg"
            }
        ],
        "type": "parse_exception",
        "reason": "No processor type exists with name [text_embeddinssg]",
        "processor_type": "text_embeddinssg"
    },
    "status": 400
}

amitgalitz commented

@dbwiddis @ohltyler @owaiskazi19 @jackiehanyang @joshpalis
Can I get some input on this? I feel like it's actually a big deal, since there are many other cases where a user calling the underlying API directly would see an exception that our error message hides. Another example: if we register a local model in ML Commons without changing the settings, we get this back:

{
    "error": {
        "root_cause": [
            {
                "type": "illegal_argument_exception",
                "reason": "No eligible node found to execute this request. It's best practice to provision ML nodes to serve your models. You can disable this setting to serve the model on your data node for development purposes by disabling the \"plugins.ml_commons.only_run_on_ml_node\" configuration using the _cluster/setting api"
            }
        ],
        "type": "illegal_argument_exception",
        "reason": "No eligible node found to execute this request. It's best practice to provision ML nodes to serve your models. You can disable this setting to serve the model on your data node for development purposes by disabling the \"plugins.ml_commons.only_run_on_ml_node\" configuration using the _cluster/setting api"
    },
    "status": 400
}

All of that detail is hidden behind our error message.
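
One way to surface that detail, sketched with plain JDK exceptions and a hypothetical helper name (describeStepFailure) rather than the plugin's actual WorkflowStepException handling: walk the cause chain and append the deepest cause's message to the step error that gets stored in the workflow state.

public class RootCauseMessageSketch {

    // Builds a user-facing error string that includes the underlying cause, if any.
    static String describeStepFailure(String stepName, Exception stepException) {
        Throwable root = stepException;
        while (root.getCause() != null && root.getCause() != root) {
            root = root.getCause();
        }
        String base = stepException.getClass().getName() + " during step " + stepName;
        return root == stepException ? base : base + ", caused by: " + root.getMessage();
    }

    public static void main(String[] args) {
        // Simulates a step failure wrapping the kind of parse error shown above.
        Exception cause = new IllegalArgumentException("No processor type exists with name [text_embeddinssg]");
        Exception wrapped = new RuntimeException("create_ingest_pipeline failed", cause);
        System.out.println(describeStepFailure("create_ingest_pipeline", wrapped));
    }
}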

amitgalitz commented

Created a new issue for this: #670

dbwiddis commented

This issue was completed in #642. The new issue raised in the reopen comment is being tracked in #670.
