From 9622403695466a038786fe6400e26ee3f4bd9b7e Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Edu=20Gonz=C3=A1lez=20de=20la=20Herr=C3=A1n?= <25320357+eedugon@users.noreply.github.com>
Date: Fri, 25 Oct 2024 18:14:50 +0200
Subject: [PATCH] elastic agent on k8s troubleshooting added

---
 .../troubleshooting/troubleshooting.asciidoc  | 125 +++++++++++++++++-
 1 file changed, 123 insertions(+), 2 deletions(-)

diff --git a/docs/en/ingest-management/troubleshooting/troubleshooting.asciidoc b/docs/en/ingest-management/troubleshooting/troubleshooting.asciidoc
index 593ecddad..046ea0b82 100644
--- a/docs/en/ingest-management/troubleshooting/troubleshooting.asciidoc
+++ b/docs/en/ingest-management/troubleshooting/troubleshooting.asciidoc
@@ -833,6 +833,127 @@ To resolve this, either install {agent} without the `--unprivileged` flag so tha
 
 [discrete]
 [[agent-kubernetes-kustomize]]
-== Problems installing Elastic Agent on Kubernetes through `kustomize`
+== Troubleshoot {agent} installation on Kubernetes with Kustomize
 
-TBD :)
+Potential issues during {agent} installation on Kubernetes can be categorized into two main areas:
+
+. <<agent-kustomize-manifest>>.
+. <<agent-kustomize-after>>.
+
+[discrete]
+[[agent-kustomize-manifest]]
+=== Problems related to the creation of objects within the manifest
+
+When troubleshooting installations performed with https://github.com/kubernetes-sigs/kustomize[Kustomize], it's good practice to inspect the output of the rendered manifest. To do this, take the installation command provided by Kibana Onboarding and replace the final part, `| kubectl apply -f-`, with a redirection to a local file. This allows for easier analysis of the rendered output.
+
+For example, the following command, originally provided by Kibana for an {agent} Standalone installation, has been modified to redirect the output for troubleshooting purposes:
+
+[source,sh]
+----
+kubectl kustomize https://github.com/elastic/elastic-agent/deploy/kubernetes/elastic-agent-kustomize/default/elastic-agent-standalone\?ref\=v8.15.3 | sed -e 's/JUFQSV9LRVkl/ZDAyNnZaSUJ3eWIwSUlCT0duRGs6Q1JfYmJoVFRUQktoN2dXTkd0FNMtdw==/g' -e "s/%ES_HOST%/https:\/\/7a912e8674a34086eacd0e3d615e6048.us-west2.gcp.elastic-cloud.com:443/g" -e "s/%ONBOARDING_ID%/db687358-2c1f-4ec9-86e0-8f1baa4912ed/g" -e "s/\(docker.elastic.co\/beats\/elastic-agent:\).*$/\18.15.3/g" -e "/{CA_TRUSTED}/c\ " > elastic_agent_installation_complete_manifest.yaml
+----
+
+The previous command generates a local file named `elastic_agent_installation_complete_manifest.yaml`, which you can use for further analysis. It contains the complete set of resources required for the {agent} installation, including:
+
+* RBAC objects (`ServiceAccounts`, `Roles`, etc.).
+
+* `ConfigMaps` and `Secrets` for the {agent} configuration.
+
+* {agent} Standalone deployed as a `DaemonSet`.
+
+* https://github.com/kubernetes/kube-state-metrics[Kube-state-metrics] deployed as a `Deployment`.
+
+The content of this file is equivalent to the one obtained by following the <> document, with the exception of `kube-state-metrics`, which is not included in the other method.
+
+Possible issues:
+
+* If your user doesn't have *cluster-admin* privileges, the creation of the RBAC resources might fail.
+
+* Some Kubernetes security mechanisms (like https://kubernetes.io/docs/concepts/security/pod-security-standards/[Pod Security Standards]) could cause parts of the manifest to be rejected, as `hostNetwork` access and `hostPath` volumes are required.
+
+* If `kube-state-metrics` is already installed in your cluster, that part of the manifest might fail to install, or your existing resources might be updated without notice.
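Before applying the rendered manifest, a quick sanity check is to list the resource kinds it declares and compare them against the expected set described above. The following sketch runs against a hypothetical, trimmed excerpt of a rendered manifest; in practice you would point the `grep` at `elastic_agent_installation_complete_manifest.yaml`.

```shell
# Hypothetical excerpt of a rendered manifest; the real
# elastic_agent_installation_complete_manifest.yaml contains the full
# resource definitions.
cat > manifest_excerpt.yaml <<'EOF'
kind: ServiceAccount
---
kind: ClusterRole
---
kind: ClusterRoleBinding
---
kind: ConfigMap
---
kind: DaemonSet
---
kind: Deployment
EOF

# List and count the resource kinds declared in the manifest. If the RBAC
# objects or the DaemonSet are missing here, the problem lies in the
# rendering step rather than in a cluster-side rejection.
grep '^kind:' manifest_excerpt.yaml | sort | uniq -c
```

If a kind you expect is absent from the output, re-check the Kustomize URL and the `sed` substitutions in the installation command before suspecting the cluster.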
+
+[discrete]
+[[agent-kustomize-after]]
+=== Failures occurring within specific components after installation
+
+If the installation is correct and all resources are deployed, but data is not flowing as expected (for example, no data appears on the *[Metrics Kubernetes] Cluster Overview* dashboard), check the following items:
+
+. Check the status of the resources and ensure they are all in `Running` state:
++
+[source,sh]
+----
+kubectl get pods -n kube-system | grep elastic
+kubectl get pods -n kube-system | grep kube-state-metrics
+----
+
+. Describe the Pods if they are in `Pending` state:
++
+[source,sh]
+----
+kubectl describe pod <pod-name> -n kube-system
+----
+
+. Check the logs of the {agent} Pods and `kube-state-metrics`, and look for errors:
++
+[source,sh]
+----
+kubectl logs <elastic-agent-pod-name> -n kube-system
+kubectl logs <elastic-agent-pod-name> -n kube-system | grep -i error
+----
++
+[source,sh]
+----
+kubectl logs <kube-state-metrics-pod-name> -n kube-system
+----
+
+Possible issues:
+
+* Connectivity, authorization, or authentication issues when connecting to Elasticsearch:
++
+Ensure that the API key and the Elasticsearch endpoint used during the installation are correct and reachable from within the Pods.
++
+In an already installed system, the API key is stored in a `Secret` named `elastic-agent-creds-<hash>`, and the endpoint is configured in the `ConfigMap` named `elastic-agent-configs-<hash>`.
+
+* Only cluster-level metrics (provided by `kube-state-metrics`) are missing:
++
+These metrics (`state_*`) are retrieved by the Pod acting as `leader` (as described in <>). To troubleshoot that situation:
++
+. Check which Pod owns the leadership `lease` in the cluster:
++
+[source,sh]
+----
+kubectl get lease -n kube-system elastic-agent-cluster-leader
+----
++
+. Check the logs of that Pod for errors when connecting to `kube-state-metrics`, and confirm whether the `state_*` metrics are being sent.
++
+One way to check whether `state_*` metrics are being delivered to Elasticsearch is to look at the log lines containing the `"Non-zero metrics in the last 30s"` message and inspect the values of the `state_*` metricsets within those lines:
++
+[source,sh]
+----
+kubectl logs -n kube-system elastic-agent-xxxx | grep "Non-zero metrics" | grep "state_"
+----
++
+If the previous command returns something like `"state_pod":{"events":213,"success":213}` for all `state_*` metricsets, the metrics are being delivered.
++
+. As a last resort, if you believe none of the Pods is acting as a leader, you can try deleting the `lease` to generate a new one:
++
+[source,sh]
+----
+kubectl delete lease -n kube-system elastic-agent-cluster-leader
+# wait a few seconds and check the lease again
+kubectl get lease -n kube-system elastic-agent-cluster-leader
+----
+
+* Performance problems:
++
+Monitor the CPU and memory usage of the {agent} Pods and adjust the requests and limits in the manifest when needed. Refer to <> for more details about the required resources.
+
+Extra resources for {agent} on Kubernetes troubleshooting and information:
+
+* <>.
+
+* https://github.com/elastic/elastic-agent/tree/main/deploy/kubernetes/elastic-agent-kustomize/default[{agent} Kustomize Templates] documentation and resources.
+
+* Other examples and manifests to deploy https://github.com/elastic/elastic-agent/tree/main/deploy/kubernetes[{agent} on Kubernetes].
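As a closing, self-contained illustration of the `state_*` delivery check described in the leader-election steps above, the following sketch filters the relevant counters out of a log line. The sample line is hypothetical and heavily trimmed; with a live cluster you would feed the pipeline from `kubectl logs` instead of `echo`.

```shell
# Hypothetical, trimmed fragment of an Elastic Agent "Non-zero metrics"
# log line; real lines are JSON objects with many more fields.
line='... "Non-zero metrics in the last 30s" ... "state_pod":{"events":213,"success":213}, "state_node":{"events":9,"success":9} ...'

# Extract each state_* metricset together with its counters; matching
# "events" and "success" values indicate the metrics are being delivered
# without failures.
echo "$line" | grep -o '"state_[a-z_]*":{[^}]*}'
```

A metricset whose `success` count stays below its `events` count, or one that never appears at all, points at the leader Pod's connection to `kube-state-metrics` or to Elasticsearch.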