[bug] Can't see logs from the UI when Argo Workflow is deleted by Persistence Agent #11357

Closed
kimwnasptd opened this issue Nov 5, 2024 · 3 comments · Fixed by #11552

@kimwnasptd
Member

kimwnasptd commented Nov 5, 2024

Environment

Steps to reproduce

  1. Create an Experiment and a Run from the Data Passing pipeline
  2. Update the TTL_SECONDS_AFTER_WORKFLOW_FINISH env var in ml-pipeline-persistence Deployment to be something short, like 60
  3. Wait for the Argo Workflow to succeed
  4. After the configured TTL elapses, the persistence agent marks the workflow as completed and deletes the Argo Workflow
  5. Access the UI and try to see the logs from one pod
  6. The request fails with "Failed to retrieve pod logs."
  7. Clicking Details opens a popup with "Error response: Could not get main container logs: Error: Unable to retrieve workflow status: [object Object]."

Expected result

I would expect to still see the Pod logs after the workflow is deleted, since Argo archives them in MinIO.

Materials and reference

I see references to the same error in #11010 and #11339, but I'm not entirely sure it's the same issue.

Note that I'm only using the upstream manifests and their example installation, which doesn't change the MinIO/Argo installation from what's provided in this repo.

Also, when looking at the requests, even if the Workflow is GCed I see requests for logs to the following URL:
http://localhost:8080/pipeline/k8s/pod/logs?podname=tutorial-data-passing-h2c74-system-container-impl-306858994&runid=8542d9b2-89ee-47bc-a8fc-7210978115eb&podnamespace=kubeflow-user-example-com&createdat=2024-11-05

Not sure if this is expected, but it seems odd that the UI tries to get k8s pod logs when we know the pod no longer exists in the cluster.
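
To make the request easier to read, here is a small Python sketch that splits the URL above into its path and query parameters. It only uses the standard library and the URL quoted in this report; the helper name `log_request_params` is made up for illustration and is not part of KFP.

```python
from urllib.parse import urlsplit, parse_qs

# URL copied from the browser request shown in this report.
url = (
    "http://localhost:8080/pipeline/k8s/pod/logs"
    "?podname=tutorial-data-passing-h2c74-system-container-impl-306858994"
    "&runid=8542d9b2-89ee-47bc-a8fc-7210978115eb"
    "&podnamespace=kubeflow-user-example-com"
    "&createdat=2024-11-05"
)


def log_request_params(u):
    """Return (path, params) for a log request URL (hypothetical helper)."""
    parts = urlsplit(u)
    # parse_qs returns a list per key; each key appears once here.
    params = {k: v[0] for k, v in parse_qs(parts.query).items()}
    return parts.path, params


path, params = log_request_params(url)
print(path)
for name, value in params.items():
    print(f"{name} = {value}")
```

The request targets the `k8s/pod/logs` path and carries the pod name, run ID, namespace, and creation date.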

Labels

/area frontend
/area backend


Impacted by this bug? Give it a 👍.

@kimwnasptd
Member Author

After looking around more, I managed to see the logs persisted in MinIO after the Argo Workflow was deleted, by:

  1. Deploying KFP 2.3.0, which includes the fix from "fix(frontend): retrieve archived logs from correct location" (#11010)
  2. Setting the following env vars in the ml-pipeline-ui Deployment
    1. ARGO_ARCHIVE_LOGS: "true"
    2. DISABLE_GKE_METADATA: "true"
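
For anyone who wants to try this without editing manifests, the two settings above could be expressed as a Deployment patch. The Python sketch below only builds and prints the patch document; the container name `ml-pipeline-ui` is an assumption (it may differ in your installation), and the function name is hypothetical.

```python
import json


def ui_env_patch(container="ml-pipeline-ui"):
    """Build a patch body that sets the two env vars on the UI container.

    The container name is an assumption; check your Deployment spec.
    """
    env = [
        {"name": "ARGO_ARCHIVE_LOGS", "value": "true"},
        {"name": "DISABLE_GKE_METADATA", "value": "true"},
    ]
    return {
        "spec": {
            "template": {
                "spec": {"containers": [{"name": container, "env": env}]}
            }
        }
    }


patch = ui_env_patch()
print(json.dumps(patch, indent=2))
```

Saved to a file, this could be applied with something like `kubectl patch deployment ml-pipeline-ui -n kubeflow --patch-file patch.json`, assuming the Deployment lives in the `kubeflow` namespace.
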

Shouldn't we set these env vars by default in the manifests, which the default installation also uses?
cc @juliusvonkohout

@juliusvonkohout
Member

Yes, we should. I wonder why they are not set by default in the upstream KFP manifests. I would prefer to change them there.

Also, many people change ARGO_KEYFORMAT = 'artifacts/{{workflow.name}}/{{workflow.creationTimestamp.Y}}/{{workflow.creationTimestamp.m}}/{{workflow.creationTimestamp.d}}/{{pod.name}}' in the Argo workflow controller configmap, so it would be good to have it as an environment variable exposed directly in the upstream KFP manifests and not just somewhere in the code.
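
To illustrate what that key format resolves to, here is a rough Python sketch that expands the placeholders for one pod. This is not the frontend's actual code, just an illustration of the template; `render_key` is a hypothetical helper, and the workflow name is inferred from the pod name in the original report.

```python
import re
from datetime import datetime, timezone

KEY_FORMAT = (
    "artifacts/{{workflow.name}}/{{workflow.creationTimestamp.Y}}/"
    "{{workflow.creationTimestamp.m}}/{{workflow.creationTimestamp.d}}/{{pod.name}}"
)


def render_key(fmt, workflow_name, created, pod_name):
    """Expand the {{...}} placeholders in an archive key format (illustrative)."""
    values = {
        "workflow.name": workflow_name,
        "workflow.creationTimestamp.Y": f"{created.year:04d}",
        "workflow.creationTimestamp.m": f"{created.month:02d}",
        "workflow.creationTimestamp.d": f"{created.day:02d}",
        "pod.name": pod_name,
    }
    # An unknown placeholder raises KeyError instead of being silently dropped.
    return re.sub(r"\{\{\s*([\w.]+)\s*\}\}", lambda m: values[m.group(1)], fmt)


key = render_key(
    KEY_FORMAT,
    "tutorial-data-passing-h2c74",  # assumed workflow name
    datetime(2024, 11, 5, tzinfo=timezone.utc),
    "tutorial-data-passing-h2c74-system-container-impl-306858994",
)
print(key)
```

If the controller configmap uses a different layout, the UI has to be given the same format string, otherwise it will look for the archived logs under the wrong prefix.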

@droctothorpe
Contributor

FWIW, it can be added as an env var to the UI deployment so you don't have to tweak the application code. Long term, I'd like to make logs first-class output artifacts and store them in MLMD, which would make all of this guesswork on the UI side unnecessary.
