ml-pipeline-metadata-grpc-server and MySQL connection with mysql_native_password? #13

Open
JWebDev opened this issue Nov 11, 2024 · 12 comments

Comments

@JWebDev

JWebDev commented Nov 11, 2024

Hi Krzysztof,

I'm trying to install Kubeflow on my Kubernetes cluster (Bare Metal).
I've been looking for helm charts for a long time and realized that they are almost non-existent. Stumbled upon your development and decided to give it a try.
But I have some difficulties, so I decided to ask a question/suggestion here, hopefully, you will take a look and be able to help me or at least answer.

The first thing I really didn't like, although I understand that it is historical and no one can change it now, is the rigid tie-in to Istio.
I can understand the need to use Istio, but this limits future options, as not everyone needs it and not everyone supports it.

I am installing step by step, not including all components to start with, only Dashboard and Pipelines so far.

The first problem I encountered is the inability to start “ml-pipeline-metadata-grpc-server” with this error.
I'm ruling out a connectivity or permissions problem, since I'm passing the root user credentials and the connection exists.

Most likely the problem is in the “MySQL authentication plugin” as I am using MySQL 8 and the mysql_native_password plugin is disabled there.
Is there any way to customize this parameter in “ml-pipeline-metadata-grpc-server” or is it impossible?

WARNING: Logging before InitGoogleLogging() is written to STDERR
E1111 01:42:47.463356 1 mysql_metadata_source.cc:174] MySQL database was not initialized. Please ensure your MySQL server is running. Also, this error might be caused by starting from MySQL 8.0, mysql_native_password used by MLMD is not supported as a default for authentication plugin. Please follow <https://dev.mysql.com/blog-archive/upgrading-to-mysql-8-0-default-authentication-plugin-considerations/>to fix this issue.
F1111 01:42:47.463472 1 metadata_store_server_main.cc:555] Check failed: absl::OkStatus() == status (OK vs. INTERNAL: mysql_real_connect failed: errno: , error: [mysql-error-info='']) MetadataStore cannot be created with the given connection config.
*** Check failure stack trace: ***

My second question concerns istio-gateway and certificates.
I've already accepted that I can't start Kubeflow without it and will install all the necessary components. But is it possible to somehow configure that I don't need to assign a certificate via certificate-manager?
I don't have a public installation and I don't want to expose Kubeflow to the outside to protect it later. But the way it's done there, you can't start istio-gateway without a certificate and for that you have to assign a domain and get a LetsEncrypt certificate and so on, which is additional and unnecessary work.
I saw your comment that at the moment installation without certificate-manager is not possible.
What is the best way to get around this?

Thanks!

@kromanow94
Owner

Hey, about the mysql_native_password, can you check if this helps?

kubeflow/pipelines#9549 (comment)

About the istio and certificates:

  • I too would like to see Kubeflow without Istio one day but that's currently not supported and it's not an issue with this Helm Chart or the examples. Some of the Kubeflow components deploy Istio resources in response to Kubeflow CRDs (I think it's related to models and notebooks). I also raised this topic in the Kubeflow Community to try and decouple istio and kubeflow but there is little incentive to do so.
  • The cert-manager is needed mostly for internal components for the Admission Controller Webhooks. Do you configure the Istio Gateway CRD with Certs? If so, most probably you don't need this.

@JWebDev
Author

JWebDev commented Nov 20, 2024

Hi @kromanow94 ,

Thank you very much for the reply.

Regarding mysql_native_password: I decided to install MySQL 8.0 separately with an explicit default_authentication_plugin=mysql_native_password, because in version 9 this plugin can no longer be enabled at all.
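
For anyone hitting the same MLMD error, it may help to confirm which authentication plugin a MySQL account actually uses before changing server settings. The following is a minimal sketch that only builds the query; the helper name plugin_check_sql is hypothetical, and the host and user in the usage note are assumed from the values posted in this thread.

```shell
# Hypothetical helper: prints a query that lists the authentication plugin
# configured for one MySQL account. The account name is passed in by the caller.
plugin_check_sql() {
  user="$1"
  printf "SELECT user, host, plugin FROM mysql.user WHERE user = '%s';\n" "$user"
}
```

It could be run as `mysql -h mysql -u root -p -e "$(plugin_check_sql root)"`. A plugin value of caching_sha2_password (the MySQL 8.0 default) rather than mysql_native_password would be consistent with the error above.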

Regarding Istio. I understood you.
As I wrote: no, I don't want to make Kubeflow visible externally. It has to be a local Kubeflow with the UI forwarded to my local computer.

And in addition.
You have one TODO for kfam => TODO: check if this is used in values.yaml. I think it is used. In particular, when the central dashboard initializes, it makes a call like this:

Using Profiles service at http://profiles-controller-kfam.kubeflow.svc.cluster.local:8081/kfam
Server listening on port http://localhost:8082 (in production mode)

I continued the installation and got a little further, but got stuck again. If you don't mind, I'll ask questions here so I don't have to create a new ticket. But if you say, I will create a new ticket.

So, the installation went well, I didn't notice any serious errors. Before that I installed all istio crd-s, istiod and istio-gateway.

In istio-system CronJob kubeflow-m2m-oidc-configurator-* is constantly running and it is successful.

The problem now is with the certificates. I'm not really sure if I need the admissionWebhook component? Do I(kubeflow) need it?

Currently the pod can't start because of this error:
MountVolume.SetUp failed for volume “webhook-cert” : secret “admission-webhook-tls-certs” not found

Because of the same error pvcviewer-* does not start.
MountVolume.SetUp failed for volume “cert” : secret “pvcviewer-controller-tls-certs” not found

And ml-pipeline-persistenceagent-* also fails with something strange. I don't understand why it needs argoproj.

W1120 04:08:49.076675 1 client_config.go:617] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
time="2024-11-20T04:08:49Z" level=info msg="Setting up event handlers"
time="2024-11-20T04:08:49Z" level=info msg="Setting up event handlers"
time="2024-11-20T04:08:49Z" level=info msg="Setting up event handlers"
time="2024-11-20T04:08:49Z" level=info msg="Starting The persistence agent"
time="2024-11-20T04:08:49Z" level=info msg="Waiting for informer caches to sync"
W1120 04:08:49.094855 1 reflector.go:324] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: failed to list *v1alpha1.Workflow: the server could not find the requested resource (get workflows.argoproj.io)
E1120 04:08:49.094890 1 reflector.go:138] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: Failed to watch *v1alpha1.Workflow: failed to list *v1alpha1.Workflow: the server could not find the requested resource (get workflows.argoproj.io)
W1120 04:08:49.928145 1 reflector.go:324] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: failed to list *v1alpha1.Workflow: the server could not find the requested resource (get workflows.argoproj.io)

I can forward the gui, but I can't see anything there. Only Homepage. All other menu links don't work. And it crashes with the same error. I attach a screenshot.
image

But if I forward ml-pipelines-ui separately, it seems to be working.
image

I install kubeflow with these values.

knativeIntegration:
  enabled: false

pipelines:
  enabled: true
  config:
    db:
      user:
        value: root
      password:
        value: kubeflow
      host:
        value: mysql
      mlmdDatabaseName:
        value: metadb
      pipelineDatabaseName:
        value: mlpipeline
      cacheDatabaseName:
        value: cachedb
    objectStore:
      existingSecretName: mlpipeline-minio-artifact
  cache:
    enabled: false
  ui:
    enabled: true

katib:
  enabled: false
  dbmanager:
    enabled: false
  istioIntegration:
    create: false
    enabled: false

Sorry for the trouble. Hope this helps someone in the future.
Thanks for your help!

@JWebDev
Author

JWebDev commented Dec 3, 2024

Hi @kromanow94 . Are you still available to help, or should I look for other ways to install? Thanks for the reply.

@kromanow94
Owner

Hey @JWebDev , sorry for the late reply. Just yesterday I got back from my holidays :).

Thank you for all of the details. I'm happy you were able to move forward with the issue.

In this scenario, do you install cert-manager at all? Is it healthy? In the past I noticed such errors when there were issues with the cert-manager. It doesn't have to be configured with any provider like Lets Encrypt but it has to exist and be healthy on the cluster because Kubeflow relies on a few self-signed certificates it uses mostly for Admission Webhooks.

The cert-manager is used to provide a few certificates for the Admission Webhooks, and those are using self-signed certificates. For example:

  1. Self-signed issuer admission-webhook-selfsigned-issuer is created on the cluster.
    • This is created by the kubeflow helm chart if the certManagerIntegration.enabled: true.
  2. Certificate admission-webhook-cert is created and it requests certificate using self-signed issuer admission-webhook-selfsigned-issuer. It points to the admission-webhook-tls-certs as the target K8s Secret that should hold the certificate.
    • This is created by the helm chart if the certManagerIntegration.enabled: true.
  3. cert-manager creates secret admission-webhook-tls-certs.
  4. The secret admission-webhook-tls-certs is used in admission-webhook Deployment and in the admission-webhook MutatingWebhookConfiguration through annotation cert-manager.io/inject-ca-from: kubeflow/admission-webhook-cert.
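
The chain above can be checked link by link. This is a sketch, not part of the Helm Chart: the helper name check_webhook_certs and the KUBECTL override are hypothetical, and the kubeflow namespace is assumed from the inject-ca-from annotation value.

```shell
# KUBECTL can be overridden (an assumption made here so the check can be dry-run).
KUBECTL="${KUBECTL:-kubectl}"

# Hypothetical helper: reports which links of the issuer -> certificate -> secret
# chain exist in the given namespace (defaults to "kubeflow", an assumption).
check_webhook_certs() {
  ns="${1:-kubeflow}"
  rc=0
  for res in \
    issuers.cert-manager.io/admission-webhook-selfsigned-issuer \
    certificates.cert-manager.io/admission-webhook-cert \
    secret/admission-webhook-tls-certs
  do
    if "$KUBECTL" -n "$ns" get "$res" >/dev/null 2>&1; then
      echo "ok: $res"
    else
      echo "missing: $res"
      rc=1
    fi
  done
  return "$rc"
}
```

Running `check_webhook_certs kubeflow` and seeing the issuer and certificate present but the secret missing would suggest that cert-manager itself is not issuing, which is the situation described above.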

The ml-pipeline-persistenceagent needs Argo WF CRDs because it will actively look for the Workflows and synchronize them with the Kubeflow Database (MySQL) so the resulting Workflows (which are created as a response for the Kubeflow Pipelines Runs) are saved with their status.
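
A quick way to tell whether the persistence agent's list/watch error comes from a missing CRD is to ask the API server for it directly. A minimal sketch; the helper name and the KUBECTL override are hypothetical, and the CRD name is taken from the log message quoted earlier in the thread.

```shell
# KUBECTL can be overridden (an assumption made here so the check can be dry-run).
KUBECTL="${KUBECTL:-kubectl}"

# Hypothetical helper: succeeds only if the Workflow CRD named in the
# persistence agent log (workflows.argoproj.io) is registered on the cluster.
argo_workflow_crd_present() {
  "$KUBECTL" get crd workflows.argoproj.io >/dev/null 2>&1
}
```

For example: `argo_workflow_crd_present || echo "Argo Workflows CRDs are missing"`. A present CRD only removes the list/watch error; it says nothing about whether an Argo Workflows controller is running.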

Having that in mind, are you using any of the quickstart scripts available under example/helm/quickstart.*? In your scenario it seems you might use the example/helm/quickstart.helm.local.sh. It installs cert-manager, Argo Workflows and all other dependencies. It serves as a reference point on how to install Kubeflow and how the dependencies should be configured. Please have a look at this reference as it may help in finding the issues with the target configuration.

About the kfam in the values.yaml, thanks for pointing that out. I agree it needs some cleanup.

Let me know if that helps and if you have any other issues.

@JWebDev
Author

JWebDev commented Dec 3, 2024

Hi @kromanow94

Glad to hear from you. Hope you had a good vacation. )))

I looked at the error again after my post and realized that I was missing cert-manager-webhook Mutation WebHook for some reason. After the update of cert-manager the error with certificates disappeared. All services started.
One more important point that I have not found documented anywhere: Kubeflow requires the Argo Workflows CRDs to be installed. I installed them additionally.

kubectl apply -f https://raw.githubusercontent.com/argoproj/argo-workflows/refs/heads/main/manifests/base/crds/minimal/argoproj.io_clusterworkflowtemplates.yaml
kubectl apply -f https://raw.githubusercontent.com/argoproj/argo-workflows/refs/heads/main/manifests/base/crds/minimal/argoproj.io_cronworkflows.yaml
kubectl apply -f https://raw.githubusercontent.com/argoproj/argo-workflows/refs/heads/main/manifests/base/crds/minimal/argoproj.io_workflows.yaml
kubectl apply -f https://raw.githubusercontent.com/argoproj/argo-workflows/refs/heads/main/manifests/base/crds/minimal/argoproj.io_workflowtemplates.yaml

May be useful to someone. Is this enough or do I need to install Argo Workflows and somehow specify Kubeflow installation method? I would be grateful if you could clarify this point.

Regarding your question. Yes, I have cert-manager installed. And as I described above - it seems to have solved the problem with certificates.

The same problem remains with Sorry, /pipeline/ is not a valid page. I have attached the screenshot above.
After researching the issue, I realize that there is probably one last problem with istio-gateway. I understand that the frontend will only work through the istio-gateway. I will now look at how to switch its service to ClusterIP, because istio-gateway is deployed with a LoadBalancer service by default and I don't want Kubeflow to be exposed through a LoadBalancer.

I just need kubectl port-forward to localhost.

Maybe you know how I can get around this?

Thanks for the help!

@kromanow94
Owner

Thank you, I really enjoyed my holiday :).

About the Argo Workflows, yes, the full installation is required. This is because Argo Workflows is used as a backend and when the Kubeflow Pipeline Run is created, an Argo Workflow Object is created on the cluster.

I agree that this information is not emphasized enough. The default Kubeflow installation that you get from kustomize in the kubeflow/manifests repository just installs everything automatically... But then it's hard and cumbersome to configure for enterprise use.

Some of the information on Argo WF requirement you can find here:

The full installation reference for this Helm Chart with all the dependencies, their values and ordering can be seen under the link below.

Your issue with accessing the cluster seems familiar. If you were to use any of the quickstart scripts (like the quickstart.helm.local.sh) or at least reference what configuration options are provided for each of the requirements/dependencies, you'd see that in the example/helm/values.istio-ingressgateway.yaml there is the following config:

service:
  type: ClusterIP

With the above, it should be just enough to execute the following to make kubeflow available through the kubectl port-forward:

$ kubectl -n istio-ingress port-forward services/istio-ingressgateway 8080:80

If you were to use the quickstart, you'd also see that M2M connectivity is also enabled and configured by default:

$ curl localhost:8080/api/workgroup/exists -H "Authorization: Bearer $(kubectl -n kubeflow-user-example-com create token default-editor)"

@JWebDev
Author

JWebDev commented Dec 3, 2024

@kromanow94 Thaaaaanks.

Where were you with this script before? )))
I just didn't see it, and now I realize it has everything I need.
I wasted so much time researching to get a Kubeflow installation. Now I see what I'm still missing.

Ok, I think I understand with istio-gateway. I need to do istio-gateway forwarding at the end, not centraldashboard.
Now I will reinstall everything with your script, hopefully it will work.

I looked at the script. Everything is more or less clear, as I have already understood a lot of things during my installation research. It seems to be one of the most complicated installations I've done.

  1. Your example specifies MySQL 9.21.2. Does this installation(version) work? As I wrote above, versions above 8.0 have a problem with default_authentication_plugin=mysql_native_password.

Please mention these scripts in the Readme. I didn't notice anything about them there.

Thanks again!

@JWebDev
Author

JWebDev commented Dec 4, 2024

  1. Do I need metacontroller for Kubeflow? What is it used for? I ask because metacontroller throws this error:
{"level":"error","ts":"2024-12-04T03:03:16Z","msg":"Reconciler error","controller":"composite-metacontroller","object":{"name":"ml-pipeline-profile-controller"},"namespace":"","name":"ml-pipeline-profile-controller","reconcileID":"5fdc495a-95aa-40e1-9b8f-40fa50442497","error":"CustomResourceDefinition.apiextensions.k8s.io \"namespaces.\" not found","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:263\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:224"}
{"level":"info","ts":"2024-12-04T03:03:16Z","logger":"composite","msg":"Sync CompositeController","name":"ml-pipeline-profile-controller"}
  2. Why do I need cluster-local-gateway? I have not installed Knative.

Finally I went step by step through the whole script. Well structured. Thanks.
I ran everything with no problems and no errors, except for the metacontroller error I described above.

Got the istio-ingressgateway forwarding. The dex authorization page appeared. Authorized with my user.

The Kubeflow dashboard was displayed, but unfortunately there are some problems with the certificate. I don't know what to look for or where to look, as the logs show no errors in the kubeflow or istio namespaces and everything appears correct.

upstream connect error or disconnect/reset before headers. retried and the latest reset reason: remote connection failure, transport failure reason: TLS_error:|268435703:SSL routines:OPENSSL_internal:WRONG_VERSION_NUMBER:TLS_error_end:TLS_error_end

Almost all pages do not load with this error. I'm attaching a screenshot.

I have changed the username everywhere to 'kubeflow-user'.

Also 'Endpoints' shows Sorry, /models/ is not a valid page.

The pages that display correctly are 'Home', 'Experiments' and 'Manage Contributors'. On 'Experiments' I can select all steps and so on, so that page is fully working.

image

Any idea how I can fix these errors?
Thank you very much for your help.

@kromanow94
Owner

I agree, the information about quickstart scripts could be made more clear. I think it makes sense to include that as part of the #14.

Yes, this installation works with the MySQL used in the quickstarts. It was tested by a number of people, a number of times, without issues. FYI, this is the mysql image in the bitnami/mysql Helm Chart:

$ helm template mysql mysql \
  --namespace kubeflow \
  --repo https://charts.bitnami.com/bitnami \
  --version 9.21.2 \
  --values example/helm/values.mysql.yaml \
  | grep image:
          image: docker.io/bitnami/mysql:8.0.36-debian-12-r8

Metacontroller is used for ml-pipeline-profile-controller (a.k.a. kubeflow-pipelines-profile-controller). If you want to run Kubeflow Pipelines in multi-tenant mode, it is required.

Your error with metacontroller seems to be caused by the extra dot in the "namespaces." string, as suggested by the error message:

"error":"CustomResourceDefinition.apiextensions.k8s.io \"namespaces.\" not found"

Please double check if you did any change accidentally in the CompositeController for ml-pipeline-profile-controller. This is definitely not something that shows up when running the quickstart scripts.

For the cluster-local-gateway, if you don't use knative, it's not needed.

As for the issue with displaying the Central Dashboard and the pages for Pipelines and Models, I don't know where the issue is. What I can suggest is to start with a fresh install using clean git-tree for v0.4.0 and then gradually make changes to see what broke. Because I don't think the issue is related to the quickstarts and the reference configuration provided there.

@JWebDev
Author

JWebDev commented Dec 4, 2024

Thanks for the quick reply.

Oh, for some reason I thought the version “9.21.2” was the version of image. Now I see what version it is in mysql.

Metacontroller - understood

CompositeController - no, I didn't make any changes there at all.

I installed selectively and with my values. For example, I do not use MinIO because I have Ceph. And I copied all the values and changed them to my own values. So just checkout will not help me unfortunately.

But I can't check after each installation what exactly is broken, because to start everything I need to install almost all components.

Ok, I will try to find and understand the error. Thanks.

If you have any more ideas, please write me.

@kromanow94
Owner

If something comes to mind, I'll let you know. Also, feel free to add more findings and please share the final resolution :).

@JWebDev
Author

JWebDev commented Dec 4, 2024

Hi. @kromanow94

I found a mistake! )))

Here's how to fix it. kubectl label namespace kubeflow istio-injection=enabled )))). I already had namespace, I meant to run this command, but forgot, and just remembered. Restarted the pods. And everything works.
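
To confirm that the label is actually in place before restarting anything, the namespaces carrying it can be listed with a label selector. A minimal sketch; the helper name and the KUBECTL override are hypothetical.

```shell
# KUBECTL can be overridden (an assumption made here so the check can be dry-run).
KUBECTL="${KUBECTL:-kubectl}"

# Hypothetical helper: succeeds if the given namespace appears among the
# namespaces labelled istio-injection=enabled.
namespace_injection_enabled() {
  "$KUBECTL" get namespace -l istio-injection=enabled -o name 2>/dev/null \
    | grep -qx "namespace/$1"
}
```

For example: `namespace_injection_enabled kubeflow && kubectl -n kubeflow rollout restart deployment`. The restart matters because the sidecar is injected only when a pod is created, so pods that existed before the label was added keep running without it.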

3 more small things.

  1. istiod doesn't pick up the global proxy settings for some reason; I had to correct them manually in the istio-sidecar-injector ConfigMap.
global:
  proxy:
    resources: {}
  2. The Endpoints menu will not work because the KServe component (add-on) is not installed. Am I understanding this correctly?
  3. Every page load triggers GET http://localhost:8080/api/metrics 405 (Method Not Allowed) VM2819 vendor.bundle.js:743 (visible in the screenshot). Any ideas? The log doesn't show any errors.

Thanks!
