[backend] Panic while connection to default cache endpoint ml-pipeline.kubeflow:8887 #9702

Open
andre-lx opened this issue Jul 5, 2023 · 32 comments

Comments

@andre-lx

andre-lx commented Jul 5, 2023

Environment

  • How did you deploy Kubeflow Pipelines (KFP)?
    Manifests
  • KFP version:
    2.0.0
  • KFP SDK version:
kfp                   2.0.1
kfp-pipeline-spec     0.2.2
kfp-server-api        2.0.0

Steps to reproduce

Hello, we are trying to migrate from Pipelines 1.8.5 to 2.0.0, but after applying the manifests we are having some issues.

Running the "hello world" example from JupyterLab:

from kfp import dsl

@dsl.component
def say_hello(name: str) -> str:
    hello_text = f'Hello, {name}!'
    print(hello_text)
    return hello_text

@dsl.pipeline
def hello_pipeline(recipient: str) -> str:
    hello_task = say_hello(name=recipient)
    return hello_task.output

from kfp import compiler

compiler.Compiler().compile(hello_pipeline, 'pipeline.yaml')

from kfp.client import Client

client = Client()
run = client.create_run_from_pipeline_package(
    'pipeline.yaml',
    arguments={
        'recipient': 'World',
    },
)

Or running the generated pipeline.yaml directly through the UI, we always get the following error on the third pod that is started:

time="2023-07-05T14:19:23.912Z" level=info msg="capturing logs" argo=true
time="2023-07-05T14:19:23.945Z" level=info msg="capturing logs" argo=true
I0705 14:19:23.966873      51 launcher_v2.go:90] input ComponentSpec:{
  "inputDefinitions": {
    "parameters": {
      "name": {
        "parameterType": "STRING"
      }
    }
  },
  "outputDefinitions": {
    "parameters": {
      "Output": {
        "parameterType": "STRING"
      }
    }
  },
  "executorLabel": "exec-say-hello"
}
I0705 14:19:23.967498      51 cache.go:139] Cannot detect ml-pipeline in the same namespace, default to ml-pipeline.kubeflow:8887 as KFP endpoint.
I0705 14:19:23.967512      51 cache.go:116] Connecting to cache endpoint ml-pipeline.kubeflow:8887
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x941c29]

goroutine 1 [running]:
github.com/kubeflow/pipelines/backend/src/v2/metadata.(*Client).PublishExecution(0xc000b29920, {0x20a4878, 0xc000058040}, 0x0, 0x0, {0x0, 0x0, 0xc000b60000?}, 0x4)
	/go/src/github.com/kubeflow/pipelines/backend/src/v2/metadata/client.go:388 +0x69
github.com/kubeflow/pipelines/backend/src/v2/component.(*LauncherV2).publish(0x1d3c167?, {0x20a4878?, 0xc000058040?}, 0x1?, 0x1?, {0x0?, 0x1a51660?, 0xc0006a63a0?}, 0xc73bb0?)
	/go/src/github.com/kubeflow/pipelines/backend/src/v2/component/launcher_v2.go:266 +0x9b
github.com/kubeflow/pipelines/backend/src/v2/component.(*LauncherV2).Execute.func2()
	/go/src/github.com/kubeflow/pipelines/backend/src/v2/component/launcher_v2.go:144 +0x65
github.com/kubeflow/pipelines/backend/src/v2/component.(*LauncherV2).Execute(0xc00028e540, {0x20a4878, 0xc000058040})
	/go/src/github.com/kubeflow/pipelines/backend/src/v2/component/launcher_v2.go:156 +0x91e
main.run()
	/go/src/github.com/kubeflow/pipelines/backend/src/v2/cmd/launcher-v2/main.go:98 +0x3ed
main.main()
	/go/src/github.com/kubeflow/pipelines/backend/src/v2/cmd/launcher-v2/main.go:47 +0x19
time="2023-07-05T14:19:24.950Z" level=info msg="sub-process exited" argo=true error="<nil>"
Error: exit status 2
time="2023-07-05T14:19:25.918Z" level=info msg="sub-process exited" argo=true error="<nil>"
Error: exit status 2

The service ml-pipeline.kubeflow:8887 exists.

Everything works great on version 1.8.5.

If you need the logs from the other two pods, please let me know. I also checked the logs of all the Kubeflow services and couldn't find any issue.
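
The launcher log above falls back to ml-pipeline.kubeflow:8887 before the panic. As a minimal sketch, assuming Python is available in a pod inside the same cluster, the following checks whether that host and port accept a plain TCP connection. The helper name `can_connect` and the 3-second default timeout are illustrative and not part of KFP. A successful connect only shows that the port is reachable; it does not prove that the service behind it is healthy or that no proxy is interfering.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


# Example call, using the endpoint from the launcher log in this issue:
#   can_connect("ml-pipeline.kubeflow", 8887)
```

Running the call from a pod in the namespace where the pipeline steps run gives a closer picture than running it from a workstation.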

Impacted by this bug? Give it a 👍.

@zijianjoy
Collaborator

/assign @Linchin

@Linchin
Contributor

Linchin commented Jul 13, 2023

Hi @andre-lx, thank you for bringing up this issue. I tried the same pipeline on a newly deployed 2.0.0 cluster, and the run finished without issue. Looking at the log you provided, we have

github.com/kubeflow/pipelines/backend/src/v2/metadata.(*Client).PublishExecution(0xc000b29920, {0x20a4878, 0xc000058040}, 0x0, 0x0, {0x0, 0x0, 0xc000b60000?}, 0x4)
/go/src/github.com/kubeflow/pipelines/backend/src/v2/metadata/client.go:388 +0x69

The metadata client seems to come from version 2.0.0-rc.2 instead of version 2.0.0. Could you double-check that you applied the manifest of version 2.0.0? Try applying the manifest again (here) and see if the issue persists.

@Linchin
Contributor

Linchin commented Jul 14, 2023

Also, could you let me know which way you used to deploy KFP, standalone or via kubeflow?

@andre-lx
Author

Hi @Linchin, I just checked and we are using the following image:

images:
- name: gcr.io/ml-pipeline/metadata-envoy
newTag: 2.0.0

The deployment was done using the following file: https://github.com/kubeflow/pipelines/blob/2.0.0/manifests/kustomize/env/platform-agnostic-multi-user/kustomization.yaml

Thanks

@nithin8702

Hi @andre-lx @Linchin
Same issue we are also facing. Did you get a chance to fix it?

@andre-lx
Author

Hi @andre-lx @Linchin Same issue we are also facing. Did you get a chance to fix it?

I had to revert it to 1.8.5 for now.

@nithin8702

@halilagin

I have the same error. Here are the details.

  1. Running in standalone mode
  2. Running in virtual cluster (everything is working but cannot run pipelines)
  3. All pods are working
  4. I can upload and run pipelines on UI, but the pod is failing
  5. Using the pipelines version 2.0.0
  6. Generating the pipeline with the command below
    kfp dsl compile --py v2/hello_world.py --output hello_world.pipeline.json

@chensun chensun self-assigned this Aug 8, 2023
@chensun chensun moved this to P1 in KFP v2 Aug 8, 2023
@Linchin Linchin removed their assignment Aug 28, 2023

github-actions bot commented Nov 7, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the lifecycle/stale label Nov 7, 2023
@pffijt

pffijt commented Dec 7, 2023

I also have this issue in my Kubeflow 1.8 environment.
Kubeflow 1.8 is using the pipelines backend 2.0.3

I released my environment with the kubeflow manifest 1.8.

Can someone fix this issue?

@stale stale bot removed the lifecycle/stale label Dec 7, 2023
@taiynlee

the same issue on kubeflow 1.8

@svn123

svn123 commented May 14, 2024

I have faced a similar issue. I have a full Kubeflow 1.8 environment installed, and the pipeline backend metadata envoy is version 2.0.3. Is this issue resolved?

@umka1332

umka1332 commented Jun 1, 2024

I've faced a similar issue, and it was due to the proxy settings on the pod/step. After removing the proxy settings the issue was gone.

@pschoen-itsc

@umka1332 This solved the problem for me also. But do you know a way how I can still set proxy env vars to connect to the internet?

@pschoen-itsc

pschoen-itsc commented Jun 24, 2024

Just tested successfully that setting NO_PROXY to '*.kubeflow,*.local' seems to work together with http(s)_proxy.
It makes sense that the connection to ml-pipeline fails without NO_PROXY because then all traffic will be routed through the given proxy. It is just strange that it has seemed to work before updating kubeflow.

@gregsheremeta
Contributor

If anyone following this can reliably reproduce this issue...

we always get the following error on the third pod that is started

I also need to see the log on the second pod (driver) that is started. Thanks.

@suanshs

suanshs commented Aug 28, 2024

@umka1332 This solved the problem for me also. But do you know a way how I can still set proxy env vars to connect to the internet?

Just tested successfully that setting NO_PROXY to '.kubeflow,.local' seems to work together with http(s)_proxy. It makes sense that the connection to ml-pipeline fails without NO_PROXY because then all traffic will be routed through the given proxy. It is just strange that it has seemed to work before updating kubeflow.

How did you solve this? I tried to set the no_proxy environment variables but it did not work for me. @umka1332

@pschoen-itsc

@umka1332 This solved the problem for me also. But do you know a way how I can still set proxy env vars to connect to the internet?

Just tested successfully that setting NO_PROXY to '.kubeflow,.local' seems to work together with http(s)_proxy. It makes sense that the connection to ml-pipeline fails without NO_PROXY because then all traffic will be routed through the given proxy. It is just strange that it has seemed to work before updating kubeflow.

How did you solve this? I tried to set the no_proxy environment variables but it did not work for me. @umka1332

It is important to set NO_PROXY (all uppercase). I also had to add the kube api-server IP to NO_PROXY.
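
A minimal sketch of assembling such a NO_PROXY value, assuming it is built in code that runs inside the cluster (for example in a notebook pod). Kubernetes injects `KUBERNETES_SERVICE_HOST` into pods, and the sketch treats that value as the API server address; whether it is the same IP referred to above is an assumption. The helper name `build_no_proxy` is hypothetical.

```python
import os
from typing import Mapping, Sequence


def build_no_proxy(entries: Sequence[str],
                   environ: Mapping[str, str] = os.environ) -> str:
    """Join NO_PROXY entries, adding the in-cluster API server IP if known."""
    result = [e.strip() for e in entries if e.strip()]
    # Kubernetes sets KUBERNETES_SERVICE_HOST inside pods; it is absent elsewhere.
    api_server = environ.get("KUBERNETES_SERVICE_HOST", "").strip()
    if api_server and api_server not in result:
        result.append(api_server)
    return ",".join(result)
```

The resulting string could then be passed as the value of the uppercase NO_PROXY variable on the pipeline step.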

@stevenkitter

1.8.1 kubeflow has the same problem....

@stevenkitter

I solved this problem by deleting the proxy settings. You have to remove the proxy; if you need extra packages, build an image that already contains them and use that.

@suanshs

suanshs commented Aug 28, 2024

from kfp import dsl
from kfp import compiler

@dsl.component()
def say_hello() :
    import time
    time.sleep(1900)
    hello_text = f'Hello!'
    print(hello_text)

@dsl.pipeline
def hello_pipeline():
    hello_task = say_hello()
    hello_task.set_env_variable(name='NO_PROXY', value='*.kubeflow,*.local')
    hello_task.set_env_variable(name='no_proxy', value='*.kubeflow,*.local')
    hello_task.set_caching_options(False)
    

compiler.Compiler().compile(hello_pipeline, package_path='pipeline.yaml')

I tried running this, but it did not work for me. Is there something I am missing here? @pschoen-itsc @umka1332

@pschoen-itsc

@suanshs Seems like you are having a different problem. If you don't have any proxies set to begin with, then you also should not need the NO_PROXY settings. Can you provide logs of all the containers of the failing pod?

@suanshs

suanshs commented Aug 28, 2024

@pschoen-itsc
Following are the logs from the main container of the failing pod

time="2024-08-28T14:19:16.866Z" level=info msg="capturing logs" argo=true
time="2024-08-28T14:19:16.900Z" level=info msg="capturing logs" argo=true
I0828 14:19:16.922099      53 launcher_v2.go:90] input ComponentSpec:{
  "executorLabel": "exec-say-hello"
}
I0828 14:19:16.922671      53 cache.go:116] Connecting to cache endpoint ml-pipeline.kubeflow:8887
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x941c29]

goroutine 1 [running]:
github.com/kubeflow/pipelines/backend/src/v2/metadata.(*Client).PublishExecution(0xc000afc720, {0x20a4878, 0xc000196000}, 0x0, 0x0, {0x0, 0x0, 0xc0004dc000?}, 0x4)
	/go/src/github.com/kubeflow/pipelines/backend/src/v2/metadata/client.go:388 +0x69
github.com/kubeflow/pipelines/backend/src/v2/component.(*LauncherV2).publish(0x467387?, {0x20a4878?, 0xc000196000?}, 0x1?, 0x1?, {0x0?, 0x1a51660?, 0xc0004c6060?}, 0xbbfbb0?)
	/go/src/github.com/kubeflow/pipelines/backend/src/v2/component/launcher_v2.go:266 +0x9b
github.com/kubeflow/pipelines/backend/src/v2/component.(*LauncherV2).Execute.func2()
	/go/src/github.com/kubeflow/pipelines/backend/src/v2/component/launcher_v2.go:144 +0x65
github.com/kubeflow/pipelines/backend/src/v2/component.(*LauncherV2).Execute(0xc000306460, {0x20a4878, 0xc000196000})
	/go/src/github.com/kubeflow/pipelines/backend/src/v2/component/launcher_v2.go:156 +0x91e
main.run()
	/go/src/github.com/kubeflow/pipelines/backend/src/v2/cmd/launcher-v2/main.go:98 +0x3ed
main.main()
	/go/src/github.com/kubeflow/pipelines/backend/src/v2/cmd/launcher-v2/main.go:47 +0x19
time="2024-08-28T14:19:17.903Z" level=info msg="sub-process exited" argo=true error="<nil>"
Error: exit status 2
time="2024-08-28T14:19:18.871Z" level=info msg="sub-process exited" argo=true error="<nil>"
Error: exit status 2

Following are the logs from the wait container

time="2024-08-28T14:19:16.138Z" level=info msg="Starting Workflow Executor" executorType=emissary version=v3.3.10
time="2024-08-28T14:19:16.141Z" level=info msg="Creating a emissary executor"
time="2024-08-28T14:19:16.141Z" level=info msg="Using executor retry strategy" Duration=1s Factor=1.6 Jitter=0.5 Steps=5
time="2024-08-28T14:19:16.141Z" level=info msg="Executor initialized" deadline="0001-01-01 00:00:00 +0000 UTC" includeScriptOutput=false namespace=kubeflow podName=hello-pipeline-2clrb-1334336905 template="{\"name\":\"system-container-impl\",\"inputs\":{\"parameters\":[{\"name\":\"pod-spec-patch\",\"value\":\"{\\\"containers\\\":[{\\\"name\\\":\\\"main\\\",\\\"image\\\":\\\"docker-dev-artifactory.workday.com/ml/kubeflow/python-3.7:latest\\\",\\\"command\\\":[\\\"/var/run/argo/argoexec\\\",\\\"emissary\\\",\\\"--\\\",\\\"/kfp-launcher/launch\\\",\\\"--pipeline_name\\\",\\\"hello-pipeline\\\",\\\"--run_id\\\",\\\"5610709d-50b9-4833-8e2d-7e72a19a97ec\\\",\\\"--execution_id\\\",\\\"91\\\",\\\"--executor_input\\\",\\\"{\\\\\\\"inputs\\\\\\\":{},\\\\\\\"outputs\\\\\\\":{\\\\\\\"outputFile\\\\\\\":\\\\\\\"/tmp/kfp_outputs/output_metadata.json\\\\\\\"}}\\\",\\\"--component_spec\\\",\\\"{\\\\\\\"executorLabel\\\\\\\":\\\\\\\"exec-say-hello\\\\\\\"}\\\",\\\"--pod_name\\\",\\\"$(KFP_POD_NAME)\\\",\\\"--pod_uid\\\",\\\"$(KFP_POD_UID)\\\",\\\"--mlmd_server_address\\\",\\\"$(METADATA_GRPC_SERVICE_HOST)\\\",\\\"--mlmd_server_port\\\",\\\"tcp://10.100.242.77:8080\\\",\\\"--\\\"],\\\"args\\\":[\\\"sh\\\",\\\"-c\\\",\\\"\\\\nif ! 
[ -x \\\\\\\"$(command -v pip)\\\\\\\" ]; then\\\\n    python3 -m ensurepip || python3 -m ensurepip --user || apt-get install python3-pip\\\\nfi\\\\n\\\\nPIP_DISABLE_PIP_VERSION_CHECK=1 python3 -m pip install --quiet     --no-warn-script-location 'kfp==2.0.1' \\\\u0026\\\\u0026 \\\\\\\"$0\\\\\\\" \\\\\\\"$@\\\\\\\"\\\\n\\\",\\\"sh\\\",\\\"-ec\\\",\\\"program_path=$(mktemp -d)\\\\nprintf \\\\\\\"%s\\\\\\\" \\\\\\\"$0\\\\\\\" \\\\u003e \\\\\\\"$program_path/ephemeral_component.py\\\\\\\"\\\\npython3 -m kfp.components.executor_main                         --component_module_path                         \\\\\\\"$program_path/ephemeral_component.py\\\\\\\"                         \\\\\\\"$@\\\\\\\"\\\\n\\\",\\\"\\\\nimport kfp\\\\nfrom kfp import dsl\\\\nfrom kfp.dsl import *\\\\nfrom typing import *\\\\n\\\\ndef say_hello() :\\\\n    import time\\\\n    time.sleep(1900)\\\\n    hello_text = f'Hello, Suansh!'\\\\n    print(hello_text)\\\\n\\\\n\\\",\\\"--executor_input\\\",\\\"{{$}}\\\",\\\"--function_to_execute\\\",\\\"say_hello\\\"],\\\"env\\\":[{\\\"name\\\":\\\"NO_PROXY\\\",\\\"value\\\":\\\"172.17.68.189,.kubeflow,.local\\\"},{\\\"name\\\":\\\"no_proxy\\\",\\\"value\\\":\\\"172.17.68.189,.kubeflow,.local\\\"}],\\\"resources\\\":{}}]}\"}]},\"outputs\":{},\"metadata\":{\"annotations\":{\"sidecar.istio.io/inject\":\"false\"}},\"container\":{\"name\":\"\",\"image\":\"gcr.io/ml-pipeline/should-be-overridden-during-runtime\",\"command\":[\"should-be-overridden-during-runtime\"],\"envFrom\":[{\"configMapRef\":{\"name\":\"metadata-grpc-configmap\",\"optional\":true}}],\"env\":[{\"name\":\"KFP_POD_NAME\",\"valueFrom\":{\"fieldRef\":{\"fieldPath\":\"metadata.name\"}}},{\"name\":\"KFP_POD_UID\",\"valueFrom\":{\"fieldRef\":{\"fieldPath\":\"metadata.uid\"}}}],\"resources\":{},\"volumeMounts\":[{\"name\":\"kfp-launcher\",\"mountPath\":\"/kfp-launcher\"}]},\"volumes\":[{\"name\":\"kfp-launcher\",\"emptyDir\":{}}],\"initContainers\":[{\"name\":\"kfp-launcher\",\"image\":\"gcr.io/ml
-pipeline/kfp-launcher@sha256:80cf120abd125db84fa547640fd6386c4b2a26936e0c2b04a7d3634991a850a4\",\"command\":[\"launcher-v2\",\"--copy\",\"/kfp-launcher/launch\"],\"resources\":{\"limits\":{\"cpu\":\"500m\",\"memory\":\"128Mi\"},\"requests\":{\"cpu\":\"100m\"}},\"volumeMounts\":[{\"name\":\"kfp-launcher\",\"mountPath\":\"/kfp-launcher\"}]}],\"archiveLocation\":{\"archiveLogs\":true,\"s3\":{\"endpoint\":\"minio.kubeflow:9000\",\"bucket\":\"mlpipeline\",\"insecure\":true,\"accessKeySecret\":{\"name\":\"mlpipeline-minio-artifact\",\"key\":\"accesskey\"},\"secretKeySecret\":{\"name\":\"mlpipeline-minio-artifact\",\"key\":\"secretkey\"},\"key\":\"artifacts/kubeflow/hello-pipeline-2clrb/2024-08-28/hello-pipeline-2clrb-1334336905\"}},\"podSpecPatch\":\"{\\\"containers\\\":[{\\\"name\\\":\\\"main\\\",\\\"image\\\":\\\"docker-dev-artifactory.workday.com/ml/kubeflow/python-3.7:latest\\\",\\\"command\\\":[\\\"/var/run/argo/argoexec\\\",\\\"emissary\\\",\\\"--\\\",\\\"/kfp-launcher/launch\\\",\\\"--pipeline_name\\\",\\\"hello-pipeline\\\",\\\"--run_id\\\",\\\"5610709d-50b9-4833-8e2d-7e72a19a97ec\\\",\\\"--execution_id\\\",\\\"91\\\",\\\"--executor_input\\\",\\\"{\\\\\\\"inputs\\\\\\\":{},\\\\\\\"outputs\\\\\\\":{\\\\\\\"outputFile\\\\\\\":\\\\\\\"/tmp/kfp_outputs/output_metadata.json\\\\\\\"}}\\\",\\\"--component_spec\\\",\\\"{\\\\\\\"executorLabel\\\\\\\":\\\\\\\"exec-say-hello\\\\\\\"}\\\",\\\"--pod_name\\\",\\\"$(KFP_POD_NAME)\\\",\\\"--pod_uid\\\",\\\"$(KFP_POD_UID)\\\",\\\"--mlmd_server_address\\\",\\\"$(METADATA_GRPC_SERVICE_HOST)\\\",\\\"--mlmd_server_port\\\",\\\"tcp://10.100.242.77:8080\\\",\\\"--\\\"],\\\"args\\\":[\\\"sh\\\",\\\"-c\\\",\\\"\\\\nif ! 
[ -x \\\\\\\"$(command -v pip)\\\\\\\" ]; then\\\\n    python3 -m ensurepip || python3 -m ensurepip --user || apt-get install python3-pip\\\\nfi\\\\n\\\\nPIP_DISABLE_PIP_VERSION_CHECK=1 python3 -m pip install --quiet     --no-warn-script-location 'kfp==2.0.1' \\\\u0026\\\\u0026 \\\\\\\"$0\\\\\\\" \\\\\\\"$@\\\\\\\"\\\\n\\\",\\\"sh\\\",\\\"-ec\\\",\\\"program_path=$(mktemp -d)\\\\nprintf \\\\\\\"%s\\\\\\\" \\\\\\\"$0\\\\\\\" \\\\u003e \\\\\\\"$program_path/ephemeral_component.py\\\\\\\"\\\\npython3 -m kfp.components.executor_main                         --component_module_path                         \\\\\\\"$program_path/ephemeral_component.py\\\\\\\"                         \\\\\\\"$@\\\\\\\"\\\\n\\\",\\\"\\\\nimport kfp\\\\nfrom kfp import dsl\\\\nfrom kfp.dsl import *\\\\nfrom typing import *\\\\n\\\\ndef say_hello() :\\\\n    import time\\\\n    time.sleep(1900)\\\\n    hello_text = f'Hello, Suansh!'\\\\n    print(hello_text)\\\\n\\\\n\\\",\\\"--executor_input\\\",\\\"{{$}}\\\",\\\"--function_to_execute\\\",\\\"say_hello\\\"],\\\"env\\\":[{\\\"name\\\":\\\"NO_PROXY\\\",\\\"value\\\":\\\"172.17.68.189,.kubeflow,.local\\\"},{\\\"name\\\":\\\"no_proxy\\\",\\\"value\\\":\\\"172.17.68.189,.kubeflow,.local\\\"}],\\\"resources\\\":{}}]}\"}" version="&Version{Version:v3.3.10,BuildDate:2022-11-29T18:18:30Z,GitCommit:b19870d737a14b21d86f6267642a63dd14e5acd5,GitTag:v3.3.10,GitTreeState:clean,GoVersion:go1.17.13,Compiler:gc,Platform:linux/amd64,}"
time="2024-08-28T14:19:16.141Z" level=info msg="Starting deadline monitor"
time="2024-08-28T14:19:18.142Z" level=info msg="Main container completed"
time="2024-08-28T14:19:18.142Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2024-08-28T14:19:18.142Z" level=info msg="Saving logs"
time="2024-08-28T14:19:18.142Z" level=info msg="S3 Save path: /tmp/argo/outputs/logs/main.log, key: artifacts/kubeflow/hello-pipeline-2clrb/2024-08-28/hello-pipeline-2clrb-1334336905/main.log"
time="2024-08-28T14:19:18.142Z" level=info msg="Creating minio client using static credentials" endpoint="minio.kubeflow:9000"
time="2024-08-28T14:19:18.142Z" level=info msg="Saving file to s3" bucket=mlpipeline endpoint="minio.kubeflow:9000" key=artifacts/kubeflow/hello-pipeline-2clrb/2024-08-28/hello-pipeline-2clrb-1334336905/main.log path=/tmp/argo/outputs/logs/main.log
time="2024-08-28T14:19:18.151Z" level=info msg="not deleting local artifact" localArtPath=/tmp/argo/outputs/logs/main.log
time="2024-08-28T14:19:18.151Z" level=info msg="Successfully saved file: /tmp/argo/outputs/logs/main.log"
time="2024-08-28T14:19:18.151Z" level=info msg="No output parameters"
time="2024-08-28T14:19:18.151Z" level=info msg="No output artifacts"
time="2024-08-28T14:19:18.168Z" level=info msg="Create workflowtaskresults 201"
time="2024-08-28T14:19:18.169Z" level=info msg="Killing sidecars []"
time="2024-08-28T14:19:18.169Z" level=info msg="Alloc=6749 TotalAlloc=12722 Sys=24786 NumGC=4 Goroutines=9"

Following are the logs from

@pschoen-itsc

@suanshs Do you also have logs of the istio sidecar or do you have no istio deployed?

@mmazurekgda

Just tested successfully that setting NO_PROXY to '.kubeflow,.local' seems to work together with http(s)_proxy. It makes sense that the connection to ml-pipeline fails without NO_PROXY because then all traffic will be routed through the given proxy. It is just strange that it has seemed to work before updating kubeflow.

Thanks! This helped me a lot!

@cybernagle
Contributor

@suanshs Do you also have logs of the istio sidecar or do you have no istio deployed?

Hi, I'm facing the same issue with the istio-proxy sidecar injected, and setting the NO_PROXY environment variable does not fix it. :(

@cybernagle
Contributor

cybernagle commented Nov 14, 2024

Hi Folks,

I was able to resolve the issue. The root cause was that I was using Istio sidecar injection for the workflow pods. However, during the init container stage, the kfp-launcher attempts to connect to the endpoint metadata-grpc-service.kubeflow:8080 before the istio-proxy is ready.

I found a related issue here: istio/istio#23802. As suggested, adding the following label to the container resolved the issue:

traffic.sidecar.istio.io/excludeOutboundPorts: "8080"
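
In Istio's documentation this key is a pod annotation whose value is a comma-separated list of ports. A minimal sketch, assuming the pod metadata is available as a plain dictionary, of setting it without touching existing annotations. The helper name `exclude_outbound_ports` is hypothetical. How the annotation gets attached to KFP workflow pods depends on the deployment and is not shown here.

```python
import copy
from typing import Any, Dict, Iterable

EXCLUDE_OUTBOUND_PORTS = "traffic.sidecar.istio.io/excludeOutboundPorts"


def exclude_outbound_ports(pod_metadata: Dict[str, Any],
                           ports: Iterable[int]) -> Dict[str, Any]:
    """Return a copy of pod metadata with the Istio exclusion annotation set."""
    updated = copy.deepcopy(pod_metadata)
    annotations = updated.setdefault("annotations", {})
    # The annotation value is a string holding a comma-separated port list.
    annotations[EXCLUDE_OUTBOUND_PORTS] = ",".join(str(p) for p in ports)
    return updated
```

Passing `[8080]` reproduces the value shown above; the input dictionary is left unmodified.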

@umka1332

Sorry for the late response (fortunately I've gathered more knowledge about the topic now).
There are multiple issues with the proxy, and the right fix depends on what you are trying to do.
One way is to not set a proxy at all, but then you need a custom base_image for components that already has the kfp SDK installed, and you must explicitly tell components not to install the kfp SDK. By default a plain python:3.7 image is used and the kfp SDK is installed at runtime.
The other way is to set the proxy together with a no_proxy value appropriate for your cluster, but you also need to include ,.kubeflow and probably ,.kubeflow,local there. Please note that in this case kubeflow is the namespace where Kubeflow and/or ml-pipelines are installed.
Also, whenever you set a proxy, always set both the uppercase and lowercase variants of the http_proxy, https_proxy and no_proxy env vars, just to be sure.
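
A minimal sketch of the last point, producing both spellings of each proxy variable from one set of values. The helper name `proxy_env` is hypothetical, and falling back to the HTTP proxy when no HTTPS proxy is given is an assumption of this sketch.

```python
from typing import Dict, Optional


def proxy_env(http_proxy: str, https_proxy: Optional[str],
              no_proxy: str) -> Dict[str, str]:
    """Return proxy settings under both lowercase and uppercase names."""
    values = {
        "http_proxy": http_proxy,
        # Assumption: reuse the HTTP proxy when no HTTPS proxy is given.
        "https_proxy": https_proxy or http_proxy,
        "no_proxy": no_proxy,
    }
    env: Dict[str, str] = {}
    for name, value in values.items():
        env[name] = value
        env[name.upper()] = value
    return env
```

Each resulting name/value pair could then be applied to a task with `set_env_variable(name=..., value=...)`, the call used in the pipeline snippet earlier in this thread.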

@Naegionn

I have the same issue on kubeflow 1.9 with KFP API 2.2 installed using charmed kubeflow.
It happens "randomly" roughly every 1/20 pods.

@arunbenoyv

arunbenoyv commented Jan 22, 2025

Hi, I have been facing the same issue, and it's still not quite clear to me why this is happening. Is there any update on when it will be fixed, or is there a workaround in the meantime?

I have tried setting no_proxy, but that hasn't worked.

KFP was installed with Kubeflow manifest 1.9.1.

@mprahl
Contributor

mprahl commented Jan 22, 2025

/assign @dandawg


@mprahl: GitHub didn't allow me to assign the following users: dandawg.

Note that only kubeflow members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

/assign @dandawg

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
