Pipeline logs are disappearing after 24h #1120

Open
AxoyTO opened this issue Oct 18, 2024 · 9 comments
Labels
bug Something isn't working

Comments

@AxoyTO

AxoyTO commented Oct 18, 2024

Bug Description

Roughly 24 hours after a pipeline run is created, all of its logs disappear from the Charmed Kubeflow UI, even though the log files are still present in the MinIO mlpipeline bucket (S3 API). This makes it difficult to troubleshoot and track the progress or failures of pipeline runs once the 24-hour period has passed.

[screenshot: Kubeflow UI no longer showing the run's logs]

!aws --endpoint-url $MINIO_ENDPOINT_URL s3 ls s3://mlpipeline
                           [...]
                           PRE addition-pipeline-4g94d/
                           PRE addition-pipeline-4qwt4/
                           PRE download-preprocess-train-deploy-pipeline-8wjv9/
                           PRE mnist-pipeline-fcmgr/
                           [...]
!aws --endpoint-url $MINIO_ENDPOINT_URL s3 ls s3://mlpipeline/download-preprocess-train-deploy-pipeline-8wjv9/download-preprocess-train-deploy-pipeline-8wjv9-system-container-impl-1190848556/
2024-10-15 15:27:49      10796 main.log
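As a stop-gap while the UI cannot display them, the archived log can still be fetched straight from MinIO with the same CLI (a minimal sketch, reusing the object key from the listing above):

# Print the archived container log to stdout (same notebook / CLI as above)
!aws --endpoint-url $MINIO_ENDPOINT_URL s3 cp s3://mlpipeline/download-preprocess-train-deploy-pipeline-8wjv9/download-preprocess-train-deploy-pipeline-8wjv9-system-container-impl-1190848556/main.log -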

To Reproduce

  1. Deploy Charmed Kubeflow 1.9 using Juju.
  2. Create a pipeline and run it.
  3. After the run completes, observe that logs are available in the Kubeflow UI.
  4. Wait for 24 hours after the pipeline run completes.
  5. Attempt to view the pipeline logs in the UI again.
    Expected: Logs should still be accessible.
    Actual: Logs are no longer visible in the UI, but are still present in the underlying MinIO/mlpipeline (AWS S3).

Environment

CKF: 1.9/stable
minio: ckf-1.9/stable
argo-controller: 3.4/stable
Juju: 3.5.4
See the full bundle at: https://paste.ubuntu.com/p/NXXFhDqmVn/

Relevant Log Output

<none>

Additional Context

The notebook used to create the pipeline, which was run on a notebook server with a GPU:

import kfp
from kfp import dsl, kubernetes

@dsl.component(
    base_image="tensorflow/tensorflow:latest-gpu",
    # packages_to_install=["tensorflow"]
)
def foo():
    '''Calculates sum of two arguments'''
    print("GPU Test")
    import tensorflow as tf
    print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))
    print("GPU Test")


@dsl.pipeline(
    name='Addition pipeline',
    description='An example pipeline that performs addition calculations.')
def foo_pipeline():
    task = (foo()
            .set_cpu_request(str(2))
            .set_cpu_limit(str(4))
            .set_memory_request("2Gi")
            .set_memory_limit("4Gi")
            .set_gpu_limit("1")
            .set_accelerator_type("nvidia.com/gpu")
           )

    task = kubernetes.set_image_pull_policy(task=task, policy="Always")

    task = kubernetes.add_toleration(
        task=task,
        key="sku",
        operator="Equal",
        value="gpu",
        effect="NoSchedule",
    )

namespace = "admin"

client = kfp.Client()

run = client.create_run_from_pipeline_func(
    run_name="gpu_test",
    pipeline_func=foo_pipeline,
    namespace=namespace,
    experiment_name="gpu-foo")

Could be related to upstream: kubeflow/pipelines#7617

AxoyTO added the bug label on Oct 18, 2024

Thank you for reporting your feedback to us!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-6494.

This message was autogenerated

@kimwnasptd
Contributor

First of all, I managed to reproduce this by:

  1. Creating a run
  2. Waiting for about a day

The underlying Argo Workflow was fully deleted, along with all the pipeline Pods in the user namespace. After that, the UI showed the error above.
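A quick way to confirm the Workflow and its pods were garbage-collected (a sketch; the admin namespace matches the run from this report, adjust as needed):

# Argo Workflows left in the user namespace; a GC'd run no longer shows up here
kubectl -n admin get workflows.argoproj.io
# The run's pods are removed together with the Workflow
kubectl -n admin get pods -l workflows.argoproj.io/workflow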

@kimwnasptd
Contributor

kimwnasptd commented Oct 30, 2024

Looking at the kfp-persistence logs I saw the following:

2024-10-29T08:51:37.111Z [persistenceagent] time="2024-10-29T08:51:37Z" level=info msg="Syncing Workflow (tutorial-data-passing-6nvfj): success, processing complete." Workflow=tutorial-data-passing-6nvfj
2024-10-29T08:51:37.111Z [persistenceagent] time="2024-10-29T08:51:37Z" level=info msg="Success while syncing resource (admin/tutorial-data-passing-6nvfj)"
2024-10-29T08:51:40.103Z [persistenceagent] time="2024-10-29T08:51:40Z" level=info msg="Syncing Workflow (tutorial-data-passing-6nvfj): success, processing complete." Workflow=tutorial-data-passing-6nvfj
2024-10-29T08:51:40.103Z [persistenceagent] time="2024-10-29T08:51:40Z" level=info msg="Success while syncing resource (admin/tutorial-data-passing-6nvfj)"
2024-10-29T08:51:55.600Z [persistenceagent] time="2024-10-29T08:51:55Z" level=info msg="Syncing Workflow (tutorial-data-passing-6nvfj): success, processing complete." Workflow=tutorial-data-passing-6nvfj

But in the logs from the upstream ml-pipeline-persistenceagent pod I see the following:

time="2024-10-30T14:33:37Z" level=info msg="Wait for shut down"
time="2024-10-30T14:33:38Z" level=info msg="Syncing Workflow (tutorial-data-passing-xgj6p): success, processing complete." Workflow=tutorial-data-passing-xgj6p
time="2024-10-30T14:33:38Z" level=info msg="Success while syncing resource (kubeflow-user-example-com/tutorial-data-passing-xgj6p)"
time="2024-10-30T14:33:40Z" level=info msg="Skip syncing Workflow (tutorial-data-passing-xgj6p): workflow marked as persisted."
time="2024-10-30T14:33:40Z" level=info msg="Success while syncing resource (kubeflow-user-example-com/tutorial-data-passing-xgj6p)"
time="2024-10-30T14:34:08Z" level=info msg="Skip syncing Workflow (tutorial-data-passing-xgj6p): workflow marked as persisted."
time="2024-10-30T14:34:08Z" level=info msg="Success while syncing resource (kubeflow-user-example-com/tutorial-data-passing-xgj6p)"
time="2024-10-30T14:34:38Z" level=info msg="Skip syncing Workflow (tutorial-data-passing-xgj6p): workflow marked as persisted."
time="2024-10-30T14:34:38Z" level=info msg="Success while syncing resource (kubeflow-user-example-com/tutorial-data-passing-xgj6p)"
time="2024-10-30T14:35:08Z" level=info msg="Skip syncing Workflow (tutorial-data-passing-xgj6p): workflow marked as persisted."
time="2024-10-30T14:35:08Z" level=info msg="Success while syncing resource (kubeflow-user-example-com/tutorial-data-passing-xgj6p)"

What is weird is that in the Charmed pod we see the Syncing Workflow... message multiple times, while in the upstream component we see it only once (subsequent syncs are skipped with "workflow marked as persisted"). I'm not entirely sure whether this is related to the issue, though.

@kimwnasptd
Contributor

Also, after a quick search in the upstream project I found a similar issue:
kubeflow/pipelines#8935

That one seems to be resolved, though, and the fix was merged in KFP 2.2.0, which is the version used by Kubeflow 1.9.

@kimwnasptd
Contributor

I've managed to reproduce this with upstream Kubeflow 1.9 and 1.9.1 as well. We can track its progress in kubeflow/pipelines#11357

A potential temporary workaround would be to adjust the TTL_SECONDS_AFTER_WORKFLOW_FINISH env var of the persistence agent (we'd need to add a charm config option for this) so that the GC deadline is longer than a day.
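On a plain upstream deployment the override could look roughly like this (a sketch; the deployment name and namespace are taken from the default KFP manifests, and on Charmed Kubeflow a manual edit like this would likely be reverted by the charm, hence the need for a config option):

# Raise the persistence agent's workflow GC deadline to 7 days (604800 seconds)
kubectl -n kubeflow set env deployment/ml-pipeline-persistenceagent \
    TTL_SECONDS_AFTER_WORKFLOW_FINISH=604800
# Verify the value after the rollout
kubectl -n kubeflow get deployment ml-pipeline-persistenceagent \
    -o jsonpath='{.spec.template.spec.containers[0].env}'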

@kimwnasptd
Contributor

kimwnasptd commented Nov 8, 2024

Looking a bit more into the KFP source code I found the ARGO_ARCHIVE_LOGS env var, which is set to false by default (!):
https://github.com/kubeflow/pipelines/blob/2.2.0/frontend/server/configs.ts#L89

Enabling this env var results in the logs now failing with a different error:

Error response: Could not get main container logs: S3Error: The specified key does not exist.

Which means the ml-pipeline-ui Pod (kubeflow namespace) is now also trying to reach MinIO. This makes sense: after inspecting the source code further, it looks like the logs handler talks to MinIO directly when this setting is enabled:

  1. https://github.com/kubeflow/pipelines/blob/2.3.0/frontend/server/handlers/pod-logs.ts#L61
  2. https://github.com/kubeflow/pipelines/blob/2.3.0/frontend/server/app.ts#L202

(From the above we can also deduce that ml-pipeline-ui does not proxy these requests via the ml-pipeline-ui-artifacts deployment in the user namespace, as it does for a pipeline's artifacts.)
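To double-check what the UI server is currently running with, the env of the deployment can be inspected (a sketch assuming the upstream ml-pipeline-ui Deployment in the kubeflow namespace; under CKF the kfp-ui workload is charm-managed, so names will differ):

# List the env vars of the KFP UI server container (ARGO_ARCHIVE_LOGS should appear here once set)
kubectl -n kubeflow get deployment ml-pipeline-ui \
    -o jsonpath='{range .spec.template.spec.containers[0].env[*]}{.name}={.value}{"\n"}{end}'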

@kimwnasptd
Contributor

After looking at the logs from ml-pipeline-ui I see the following:

Getting logs for pod:tutorial-data-passing-8dntv-system-container-impl-2572032214 from mlpipeline/logs/tutorial-data-passing-8dntv-system-container-impl/tutorial-data-passing-8dntv-system-container-impl-2572032214/main.log

But this is wrong, since the path under which the logs are actually exposed in MinIO is the following:

mlpipeline/artifacts/tutorial-data-passing-8dntv/2024/11/08/tutorial-data-passing-8dntv-system-container-impl-2572032214/main.log

So kfp-ui is not calculating the path for the logs correctly in KFP 2.2.0.
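This can be checked directly against MinIO with the same CLI as in the bug description (a sketch; both object keys are the ones quoted above):

# The key the UI requests does not exist...
aws --endpoint-url $MINIO_ENDPOINT_URL s3 ls s3://mlpipeline/logs/tutorial-data-passing-8dntv-system-container-impl/tutorial-data-passing-8dntv-system-container-impl-2572032214/main.log
# ...while the archived log actually lives under the artifacts/ prefix
aws --endpoint-url $MINIO_ENDPOINT_URL s3 ls s3://mlpipeline/artifacts/tutorial-data-passing-8dntv/2024/11/08/tutorial-data-passing-8dntv-system-container-impl-2572032214/main.log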

@kimwnasptd
Contributor

kimwnasptd commented Nov 8, 2024

After looking more closely at the source code, it looks like this was a bug in V2 that exists all the way up to KFP v2.2.0. It is resolved upstream in kubeflow/pipelines#11010.

However, after trying the v2.3.0 image, the UI pod's server process kept restarting and I was seeing a lot of errors like the following:

/server/node_modules/node-fetch/lib/index.js:1491
                        reject(new FetchError(`request to ${request.url} failed, reason: ${err.message}`, 'system', err));
                               ^
FetchError: request to http://metadata/computeMetadata/v1/instance/attributes/cluster-name failed, reason: getaddrinfo EAI_AGAIN metadata
    at ClientRequest.<anonymous> (/server/node_modules/node-fetch/lib/index.js:1491:11)
    at ClientRequest.emit (node:events:517:28)
    at Socket.socketErrorListener (node:_http_client:501:9)
    at Socket.emit (node:events:517:28)
    at emitErrorNT (node:internal/streams/destroy:151:8)
    at emitErrorCloseNT (node:internal/streams/destroy:116:3)
    at process.processTicksAndRejections (node:internal/process/task_queues:82:21) {
  type: 'system',
  errno: 'EAI_AGAIN',
  code: 'EAI_AGAIN'
}

And in the UI I see

Error response: upstream connect error or disconnect/reset before headers. reset reason: remote connection failure, transport failure reason: delayed connect error: 111

This looks like it comes from a GKE-specific handler of KFP, caused by
https://github.com/kubeflow/pipelines/blob/2.3.0/frontend/server/app.ts#L237-L238, and is reported upstream in kubeflow/pipelines#11247

EDIT: We also have a dedicated bug in canonical/kfp-operators#584

@kimwnasptd
Contributor

kimwnasptd commented Nov 8, 2024

In the end I managed to get the logs working by:

  1. Updating the image of the UI to gcr.io/ml-pipeline/frontend:2.3.0
  2. Adding the env var ARGO_ARCHIVE_LOGS: "true"
  3. Adding the env var DISABLE_GKE_METADATA: "true"

To resolve this for CKF we'll need to bump KFP to 2.3.0 and also allow the kfp-ui charm to set those env vars dynamically.
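For reference, on a plain upstream deployment the same workaround could be applied roughly as follows (a sketch; deployment name and namespace assume the default KFP manifests, and on CKF the equivalent changes need to go through the kfp-ui charm):

# Bump the KFP UI image to 2.3.0, which contains the log-path fix
kubectl -n kubeflow set image deployment/ml-pipeline-ui '*=gcr.io/ml-pipeline/frontend:2.3.0'
# Read archived logs from object storage and skip the GKE metadata lookup
kubectl -n kubeflow set env deployment/ml-pipeline-ui \
    ARGO_ARCHIVE_LOGS=true DISABLE_GKE_METADATA=true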
