Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[feature] display main.log as artifact for each step #10036

Open
Linchin opened this issue Sep 26, 2023 · 53 comments
Open

[feature] display main.log as artifact for each step #10036

Linchin opened this issue Sep 26, 2023 · 53 comments
Labels
area/backend area/frontend kind/feature lifecycle/stale The issue / pull request is stale, any activities remove this label.

Comments

@Linchin
Copy link
Contributor

Linchin commented Sep 26, 2023

Feature Area

/area frontend
/area backend

What feature would you like to see?

A feature similar to v1 behavior. In the details page of each step, the log appears as one of the artifacts that users can view:
image

In v2, this main log artifact is no longer displayed. It would be great if we could add a similar section to show the logs, or use the log artifact as a source for the "logs" panel after the pod is deleted, instead of directly pulling from the (already deleted) pod.

What is the use case or pain point?

Because all the logs of completed pods are auto deleted by Kubernetes after 24 hours, the users can no longer access the logs of earlier pipeline runs.

Is there a workaround currently?

No workaround.

/cc @kromanow94


Love this idea? Give it a 👍.

Copy link

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the lifecycle/stale The issue / pull request is stale, any activities remove this label. label Dec 26, 2023
@revolutionisme
Copy link

I read that this issue was not included in the previous release as mentioned in this comment , just wanted to ask whether we have an update on this or not.

@stale stale bot removed the lifecycle/stale The issue / pull request is stale, any activities remove this label. label Jan 29, 2024
Copy link

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the lifecycle/stale The issue / pull request is stale, any activities remove this label. label Mar 30, 2024
Copy link

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

@AnnKatrinBecker
Copy link

/reopen

Is readding this feature to v2 planned for future releases? It would be very helpful.

Copy link

@AnnKatrinBecker: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Is readding this feature to v2 planned for future releases? It would be very helpful.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@HumairAK
Copy link
Collaborator

I'm a bit unfamiliar with the v1 version of this, did this come from argo's archive logging feature?

In v2 kfp aims to be agnostic to the backing engine, so if this is an argo specific feature, we may need to consider a separate feature/proposal on how to handle persistent logging in general for kfp.

@TobiasGoerke
Copy link
Contributor

I'm a bit unfamiliar with the v1 version of this, did this come from argo's archive logging feature?

In v2 kfp aims to be agnostic to the backing engine, so if this is an argo specific feature, we may need to consider a separate feature/proposal on how to handle persistent logging in general for kfp.

Yes, Argo Workflows stores the logs separately and given only kfp metadata, there currently seems to be no way to obtain the logs' locations. You'd need to access the Argo Workflow object directly to obtain the info.

Including this feature right into kfp would make sense and imo is much needed, as otherwise, users aren't able to view logs after the step's pod has been deleted.

@sanchesoon
Copy link

Hi @HumairAK What do you think about this ?
Points:

  • DS engineer would be interesting to see logs as artifact, as part of models producer investigation
  • logs can have different retention against artifacts: user can clean them to reduce space usage
  • it is better to have interactive logs

Thoughts:

  • in launcher POD: use extra container for waiting logs and publish or use other approaches: fluentd.... it is up to user how to collect logs and put logs. But by default we will do by sidecar container. For this we should disable argoworflow logs.
  • during run we subscribe real container logs stream (this logic already implemented)
  • allow launcher use separate bucket for logs or use the same as artifact
  • in sdk we should be able disable logs

@HumairAK
Copy link
Collaborator

/reopen

Copy link

@HumairAK: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@google-oss-prow google-oss-prow bot reopened this May 27, 2024
@HumairAK
Copy link
Collaborator

I agree persisted logging is very important, but we should clarify if we need them as artifacts, or do users just want to see logs after pods have been removed?

For example, if we have persisted logs, the log viewer could instead fetch the logs from the persisted location instead of pods (I believe this is the logic currently present in ui, though it may not be utilized in v2). Could it be confusing to see 2 different locations where logs are being shown?

DS engineer would be interesting to see logs as artifact, as part of models producer investigation

Not disagreeing, but could you elaborate on specifics?

I like the idea of decoupling implementation to existing solutions (or at least giving the user the option to, if we opt to natively implement),

in sdk we should be able disable logs

I think we should also allow users to disable/enable not only in the SDK, but via api & ui as well.

@stale stale bot removed the lifecycle/stale The issue / pull request is stale, any activities remove this label. label May 27, 2024
@sanchesoon
Copy link

sanchesoon commented May 27, 2024

Why engineers may interesting in this:

  • some frameworks for ML logging kind of metrics (I took this picture from the Internet )
    image
  • kubeflow run is snapshot of inputs/intermediate-outputs/target-outputs/code/config which can help to reproduce or compare, so for the same combinations of inputs/intermediate/code/config, it is possible to have different logs due to different conditions in cluster: gpu, cpu, drivers.... that is why logs is part of this snapshot

v1 version was implemented correctly from concept view, because it combine logs interaction during run (life time stream from pod) and logs persistent for long term.

Saving to object storage can be default, but it is also possible to use other solutions for logs streaming like, for example ELK, kafaka, or Google Logging.... so this can be optional

Current workaround with argoworkflow backend is using keyfomat like /artifacts/{workflow_id}/{pod_id}/main.log via patching artifact_ui split function and use v1 handler.

@tom-pavz
Copy link

Current workaround with argoworkflow backend is using keyfomat like /artifacts/{workflow_id}/{pod_id}/main.log via patching artifact_ui split function and use v1 handler.

@sanchesoon Can you give more details on this workaround? I am currently using KFP 2.1.0 with the normal argoworkflow backend. I really want a way to provide users the ability to view their pipeline tasks' logs simply by clicking through the KFP UI.

@droctothorpe
Copy link
Contributor

Another argument in favor of storing the log as an output artifact is what if an admin elects to update the keyFormat? The logs for historical runs will no longer be accessible, and there won't be a way for the UI to support both the old and new patterns at once.

@kromanow94
Copy link

My personal preference for maturity and stability would be to go with the MLMD and google review or an alternative store if possible. I'd even investigate if creating additional table in the mlpipeline db makes sense just to store the log artifacts URIs for given Step ID. Considering all the discussion, I don't feel competent enough to suggest if the workaround with the env var in the config.ts is suitable for now... If I were to make this decision myself, I wouldn't do it.

@droctothorpe
Copy link
Contributor

droctothorpe commented Jul 10, 2024

I'm starting to lean in that direction as well, but what the heck do I know, I am just a plebe user.

Do ya'll think it make sense to submit a smaller, discrete PR that corrects the UI's incorrect guess for the time being until we can implement a broader fix that adds the log URI to MLMD? We can aim to submit the PR tomorrow (at AWS Summit all day today) to give you a sense of the scope.

@kromanow94
Copy link

If this can be added to the KF 1.9 then maybe it makes sense to unblock this with the idea you've proposed with config.ts. If it's not going to be a part of the v1.9 release because how short on time we are, maybe not so much.

@thesuperzapper
Copy link
Member

I just want to highlight what I said in #10036 (comment) again.

This is very clearly an accidental regression, because KFP V1 stores the logs in an MLMD artifact (it just did not automatically forward the logs tab to the appropriate artifacts and required users to click the step and view it).

The only additional complexity (which is probably why this was overlooked) is that in KFP v2 now each step is more than one container, each with its own logs.

Assuming we store the logs for all steps as an artifact (even the non-executor steps) we should correctly show them separately in the logs tab probably with some kind of sub-tab for each container.

@droctothorpe
Copy link
Contributor

@chensun @zijianjoy @connor-mccarthy Do you have any objections to adding the driver and executor logs to the list of output artifacts for each component? I just want to make sure that removing that functionality was not a deliberate design choice before we re-implement it in v2. Thank you!

@droctothorpe
Copy link
Contributor

I think the executor outputs are being extracted from the IR, and currently the IR doesn't include information about the container logs. We need to decide if we want to include the container logs as executor outputs, which is quite complicated because (a) the path is dynamically inferred by keyFormat (for AWF) and not available client-side at IR compilation time, and (b) is extremely AWF specific in a way that could undermine workflow orchestration agnosticism in the IR.

@droctothorpe
Copy link
Contributor

POC for the interim solution: #11010. This will make it possible for anyone to implement a solution in a way that accommodates their unique keyFormat syntax without having to modify the actual code.

I plan to join the community call tomorrow to hear from Google about the proposed target-state solution of storing logs as output artifacts.

@droctothorpe
Copy link
Contributor

#11010 was just merged 🎉 . It might make make sense to keep this issue open to continue discussion about wether or not to / how to surface driver and executor logs as explicit outputs in the GUI.

@kromanow94
Copy link

Hey, I noticed the PR is already included in the 2.3.0. I'm trying to run some tests but I get the following error and I'm not able to get the logs from the Pod Step. I'm having two issues:

The ml-pipeline-ui Pod is exiting the main process with the following error:

/server/node_modules/node-fetch/lib/index.js:1491
                        reject(new FetchError(`request to ${request.url} failed, reason: ${err.message}`, 'system', err));
                               ^
FetchError: request to http://metadata/computeMetadata/v1/instance/attributes/cluster-name failed, reason: getaddrinfo ENOTFOUND metadata
    at ClientRequest.<anonymous> (/server/node_modules/node-fetch/lib/index.js:1491:11)
    at ClientRequest.emit (node:events:517:28)
    at Socket.socketErrorListener (node:_http_client:501:9)
    at Socket.emit (node:events:517:28)
    at emitErrorNT (node:internal/streams/destroy:151:8)
    at emitErrorCloseNT (node:internal/streams/destroy:116:3)
    at process.processTicksAndRejections (node:internal/process/task_queues:82:21) {
  type: 'system',
  errno: 'ENOTFOUND',
  code: 'ENOTFOUND'
}

In the short period of time when the ml-pipeline-ui is responsive before it's exiting with the above mentioned error code, I'm getting this error:

Could not get main container logs: Error: Unable to retrieve workflow status: [object Object].

Do you know what might be causing the issue?

@kromanow94
Copy link

FYI, I was testing with kubeflow/manifests 1.9.0 and only changed the container images tags for pipeline components to 2.3.0.

@droctothorpe
Copy link
Contributor

@kromanow94, someone in the (deprecated) Kubeflow slack workspace recently DMed me reporting what looks like a very similar error and a corresponding solution:

image
Could not get main container logs: Error: Unable to retrieve workflow status: [object Object].

That error is raised here. It is raised when the workflow corresponding to the run has been garbage collected.

When that fails, it should move on to trying to retrieve the logs directly from the artifact store. That sequencing (K8s API lookup -> WF manifest lookup -> artifact store lookup) is defined here. Note that the third step only executes if archiveLogs is true.

@kromanow94
Copy link

Thanks, I solved the issue with the ml-pipeline-ui with the following workaround:

$ kubectl -n kubeflow expose deployment ml-pipeline --name metadata --dry-run=client -oyaml | yq 'del(.metadata.labels)' -y | kubectl apply -f -

It seems that I forgot about the env ARGO_ARCHIVE_LOGS="true" as well.

Now with those two issues resolved, there is some other issue I'm having right now. It seems the frontend uses the workflow pod name in place of the workflow name to get the logs.

The Argo Workflows and ml-pipeline-ui are set with the following key format:

artifacts/{{workflow.name}}/{{workflow.creationTimestamp.Y}}/{{workflow.creationTimestamp.m}}/{{workflow.creationTimestamp.d}}/{{pod.name}}

Now, I'm trying a scenario when I manually delete just one Workflow Pod and see if I can get the logs. I'm running that with curl with the following command and error:

$ curl "https://kubeflow.example.com/pipeline/k8s/pod/logs?podname=addition-pipeline-pb72t-3709849268&runid=62911214-2e5c-46b4-ae26-d660809b8a5b&podnamespace=glo-k8s-admin&createdat=2024-09-10" -H "Authorization: Bearer $(kubectl -n glo-k8s-admin create token default-editor)"
Could not get main container logs: S3Error: 503

Looking at logs from the ml-pipeline-ui, I see the following line:

GET /pipeline/k8s/pod/logs?podname=addition-pipeline-pb72t-3709849268&runid=62911214-2e5c-46b4-ae26-d660809b8a5b&podnamespace=glo-k8s-admin&createdat=2024-09-10
Getting logs for pod, addition-pipeline-pb72t-3709849268, from mlpipeline/artifacts/addition-pipeline-pb72t-3709849268/2024/09/10/addition-pipeline-pb72t-3709849268/main.log.

But, having a look at the storage in minio, the path seems to be different in the workflow name:

mlpipeline/artifacts/addition-pipeline-pb72t/2024/09/10/addition-pipeline-pb72t-3709849268/main.log

I also run the same test when both the ArgoWF and its Pods were deleted with the same result.

@kromanow94
Copy link

I also run another scenario with one of the default pipelines. Using the Run of [Tutorial] DSL - Control structures will create a set of Pods with the following names:

tutorial-control-flows-gbjpn-system-dag-driver-1941995042
tutorial-control-flows-gbjpn-system-dag-driver-2529298331
tutorial-control-flows-gbjpn-system-container-driver-3821465773
tutorial-control-flows-gbjpn-system-container-driver-467613593
tutorial-control-flows-gbjpn-system-container-impl-2018360503
tutorial-control-flows-gbjpn-system-dag-driver-3793006146
tutorial-control-flows-gbjpn-system-container-driver-1781539373
tutorial-control-flows-gbjpn-system-dag-driver-1653906037
tutorial-control-flows-gbjpn-system-dag-driver-477249553
tutorial-control-flows-gbjpn-system-dag-driver-3564565339
tutorial-control-flows-gbjpn-system-dag-driver-511401279
tutorial-control-flows-gbjpn-system-container-impl-1647980155
tutorial-control-flows-gbjpn-system-container-driver-133111245
tutorial-control-flows-gbjpn-system-container-impl-902191899
tutorial-control-flows-gbjpn-system-dag-driver-2964226956
tutorial-control-flows-gbjpn-system-dag-driver-820485803
tutorial-control-flows-gbjpn-system-container-driver-2311769198
tutorial-control-flows-gbjpn-system-container-impl-3092937256

With the pipeline above, I confirm the mechanism works and I was able to get to the logs through the artifacts when the Pod was no longer available on the cluster.

For my other test I used pipeline created with the following script:

from kfp import dsl
import kfp

client = kfp.Client()

experiment_name = "my-experiment"
experiment_namespace = "glo-k8s-admin"


@dsl.component
def add(a: float, b: float) -> float:
    """Calculates sum of two arguments"""
    return a + b


@dsl.pipeline(
    name="Addition pipeline",
    description="An example pipeline that performs addition calculations.",
)
def add_pipeline(
    a: float = 1.0,
    b: float = 7.0,
):
    first_add_task = add(a=a, b=4.0)
    second_add_task = add(a=first_add_task.output, b=b)


try:
    print("getting experiment...")
    experiment = client.get_experiment(
        experiment_name=experiment_name, namespace=experiment_namespace
    )
    print("got experiment!")
except Exception:
    print("creating experiment...")
    experiment = client.create_experiment(
        name=experiment_name, namespace=experiment_namespace
    )
    print("created experiment!")

client.create_run_from_pipeline_func(
    add_pipeline,
    arguments={"a": 7.0, "b": 8.0},
    experiment_id=experiment.experiment_id,
    enable_caching=False,
)

Which creates Pods with following names (the name differs from example in previous comment because it's a new run):

addition-pipeline-8fbx5-1329381822
addition-pipeline-8fbx5-2476489665
addition-pipeline-8fbx5-2909346888
addition-pipeline-8fbx5-579177391
addition-pipeline-8fbx5-822281223

If I understand correctly, this is because of how the Workflow Name is being provided by formatting the pod name:

https://github.com/kubeflow/pipelines/blob/2.3.0/frontend/server/workflow-helper.ts#L169

https://github.com/kubeflow/pipelines/blob/2.3.0/frontend/server/workflow-helper.ts#L214

Is this because of the difference in how the KFPv1 and KFPv2 works?

@droctothorpe
Copy link
Contributor

droctothorpe commented Sep 10, 2024

v2 pod names include suffixes like the first example you shared. Is the Addition pipeline pipeline you linked authored / submitted with KFP v1? If so, the logs should be stored as proper output artifacts that you can access in the Inputs/Outputs tab when you click on the node. That being said, it shouldn't be too difficult to add support for this alternative log retrieval mechanism in v1 if there is a demand for it. It's a long-standing issue in v1, v1 is in somewhat of a frozen state, and there's an alternative solution in place (the Inputs/Outputs tab), so this PR didn't fix the logs tab in v1.

@droctothorpe
Copy link
Contributor

Also, in case it's related, please make note of this issue: #11019.

@droctothorpe
Copy link
Contributor

That being said, continuing to do some extra validation on my end to see if I can repro the error.

@kromanow94
Copy link

Hey @droctothorpe , were you able to have a closer look here?

TBH, I'm not exactly sure if the Pipeline I created using the script from #10036 (comment) is KFP v1 or KFP v2. I deployed it using kfp==2.8.0 so I assumed it's KFP v2. Do you know if I should change anything in the script to explicitly configure it for KFP v2?

@droctothorpe
Copy link
Contributor

When I ran that same pipeline, my pods had different names (i.e. they adhered to the v2 nomenclature):
image

Are you positive that you submitted the run with v2?

@droctothorpe
Copy link
Contributor

Here are the logs in the UI for that pipeline even after the pods and workflow CR have been GCed:

image

@droctothorpe
Copy link
Contributor

droctothorpe commented Sep 24, 2024

Hey, I noticed the PR is already included in the 2.3.0. I'm trying to run some tests but I get the following error and I'm not able to get the logs from the Pod Step. I'm having two issues:

The ml-pipeline-ui Pod is exiting the main process with the following error:

/server/node_modules/node-fetch/lib/index.js:1491
                        reject(new FetchError(`request to ${request.url} failed, reason: ${err.message}`, 'system', err));
                               ^
FetchError: request to http://metadata/computeMetadata/v1/instance/attributes/cluster-name failed, reason: getaddrinfo ENOTFOUND metadata
    at ClientRequest.<anonymous> (/server/node_modules/node-fetch/lib/index.js:1491:11)
    at ClientRequest.emit (node:events:517:28)
    at Socket.socketErrorListener (node:_http_client:501:9)
    at Socket.emit (node:events:517:28)
    at emitErrorNT (node:internal/streams/destroy:151:8)
    at emitErrorCloseNT (node:internal/streams/destroy:116:3)
    at process.processTicksAndRejections (node:internal/process/task_queues:82:21) {
  type: 'system',
  errno: 'ENOTFOUND',
  code: 'ENOTFOUND'
}

In the short period of time when the ml-pipeline-ui is responsive before it's exiting with the above mentioned error code, I'm getting this error:

Could not get main container logs: Error: Unable to retrieve workflow status: [object Object].

Do you know what might be causing the issue?

FYI adding the following env var to the ml-pipeline-ui might be an easier fix for this error:

- name: DISABLE_GKE_METADATA
   value: "true"

Shout out to @boarder7395 for sharing it.

@boarder7395
Copy link
Contributor

boarder7395 commented Sep 24, 2024

@kromanow94 I think you might be having the same issue I did.

The two changes required for everything to work are

  • Update environment variable ARGO_KEYFORMAT on deployment ml-pipeline-ui to any format you want (we include {{workflow.namespace}} to ensure separation of tenant data assets).
  • Update the workflow-controller configmap value of keyFormat to be the same format as above. Make sure to restart the deployment after updating the configmap.

What I missed was the second item which resulted in the issues you had above.

@kromanow94
Copy link

Hey, I was able to get the main.log artifact after the Workflow and Pods were deleted with latest version from kubeflow/manifests v1.9.1 which uses the Kubeflow Pipelines version 2.3.0.

I only had to put those two env variables for the ml-pipeline-ui Deployment and all works:

- name: ARGO_ARCHIVE_LOGS
  value: "true"
- name: DISABLE_GKE_METADATA
  value: "true"

Though still a workaround in code, it works.

Are there any plans for the proper solution that registers the main.log as artifact?

@droctothorpe
Copy link
Contributor

I started working on turning the outputs into proper artifacts. There are a lot of complex design obstacles. There's no simple way to do it that preserves backend agnosticism. We have to decide if we want to rely on AWF to archive the logs and just record the relationship between the execution and the log files in MLMD, which is extremely backend-specific, or just have the launcher capture and publish logs directly, which is kind of reinventing something that AWF already does, and also doesn't account for driver logs. Bandwidth permitting, I may author a design doc. Maybe we can discuss it in the community call.

@kromanow94
Copy link

Amazing, thank you so much for your time and effort!

Copy link

github-actions bot commented Jan 5, 2025

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the lifecycle/stale The issue / pull request is stale, any activities remove this label. label Jan 5, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/backend area/frontend kind/feature lifecycle/stale The issue / pull request is stale, any activities remove this label.
Projects
Status: Needs triage
Development

No branches or pull requests