Fleet Server unhealthy on pr cloud deployment #575

juliaElastic · 2022-06-17T14:47:08Z

Created a cloud deployment from this PR: elastic/kibana#134565
Cloud link: https://kibana-pr-134565.kb.us-west2.gcp.elastic-cloud.com:9243/app/fleet/agents
password "8IViQT7Ol1Ki5J1ABFyda1hN"
username "elastic"

Fleet Server shows up as unhealthy.
Agent/Kibana version: 8.4.0 (main)

I checked another Kibana PR deployment, and Fleet Server was healthy there.

Cloud Admin URL:
https://admin.found.no/deployments/efa47ae977e6d61437a23065eec13880

Agent Logs:

https://kibana-ops-buildkite-monitoring.kb.us-central1.gcp.cloud.es.io:9243/app/logs/link-to/host-logs/kb-n2-2-047bf3bf51b94a40?time=1655466891264

Seeing this in logs, might be related:

Jun 17, 2022 @ 14:01:11.000	time="2022-06-17T12:01:11.550880609Z" level=warning msg="grpc: addrConn.createTransport failed to connect to {localhost  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing only one connection allowed\". Reconnecting..." module=grpc
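For anyone triaging exported agent logs, a minimal sketch of pulling the key=value fields out of a line like the one above. The helper name `parse_logfmt` and the regular expression are illustrative assumptions, not part of any Elastic tooling; they only handle the simple `key=value` and `key="quoted value"` shapes seen in this log line.

```python
import re

# Matches key=value pairs where the value is either a double-quoted string
# (backslash escapes allowed) or a bare token without whitespace.
PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')


def parse_logfmt(line: str) -> dict:
    """Hypothetical helper: split a key=value log line into a dict."""
    fields = {}
    for key, raw in PAIR.findall(line):
        if raw.startswith('"'):
            # Drop the surrounding quotes and unescape embedded quotes.
            raw = raw[1:-1].replace('\\"', '"')
        fields[key] = raw
    return fields


# The warning copied from this issue (without the leading Kibana timestamp).
line = (
    r'time="2022-06-17T12:01:11.550880609Z" level=warning '
    r'msg="grpc: addrConn.createTransport failed to connect to '
    r'{localhost  <nil> 0 <nil>}. Err :connection error: desc = '
    r'\"transport: Error while dialing only one connection allowed\". '
    r'Reconnecting..." module=grpc'
)
fields = parse_logfmt(line)
```

Filtering on `fields["level"]` and `fields["module"]` should make it easier to see how often this gRPC warning repeats compared with other messages.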

Integration Server Logs:
https://logging.us-west2.gcp.elastic-cloud.com/app/r/s/rancid-raspy-iron


ph · 2022-06-17T14:48:34Z

@michalpristas can you investigate how serious this is, so we can prioritize it?

juliaElastic · 2022-06-21T07:02:17Z

I noticed something strange in the package policies.
There is a managed agent policy, "Elastic Cloud agent policy", which uses the "Elastic Cloud internal output".
But when I query the package policies directly from the index, I see that the APM and Fleet Server package policies use fleet-default-output, which is not the output defined in the agent policy.
Maybe there is a bug in the preconfiguration code in Fleet?

.kibana/_search?q=type:ingest-package-policies

{
  "_id": "ingest-package-policies:elastic-cloud-fleet-server",
  "_score": 5.1603312,
  "_source": {
    "ingest-package-policies": {
      "name": "Fleet Server",
      "namespace": "default",
      "package": {
        "name": "fleet_server",
        "title": "Fleet Server",
        "version": "1.2.0"
      },
      "enabled": true,
      "policy_id": "policy-elastic-agent-on-cloud",
      "output_id": "fleet-default-output"
    }
  }
}

.kibana/_search?q=type:ingest-outputs

          "ingest-outputs": {
            "name": "Elastic Cloud internal output",
            "type": "elasticsearch",
            "hosts": [
              "http://89a580b8af164e9d94dee28aed08d8b5.containerhost:9244"
            ],
            "is_default": false,
            "is_default_monitoring": false,
            "is_preconfigured": true,
            "output_id": "es-containerhost"
          },
          "type": "ingest-outputs",
          "references": [],
          "migrationVersion": {
            "ingest-outputs": "8.0.0"
          },
          "coreMigrationVersion": "8.4.0",
          "updated_at": "2022-06-17T18:16:11.538Z"
        }
      },
      {
        "_index": ".kibana_8.4.0_001",
        "_id": "ingest-outputs:a09a5397-7b9a-5a73-a622-e29f4c635658",
        "_score": 5.7339883,
        "_source": {
          "ingest-outputs": {
            "name": "default",
            "is_default": true,
            "is_default_monitoring": true,
            "type": "elasticsearch",
            "hosts": [
              "https://89a580b8af164e9d94dee28aed08d8b5.us-west2.gcp.elastic-cloud.com:443"
            ],
            "output_id": "fleet-default-output"

I could reproduce this locally:

1. Add a preconfigured agent policy that uses a non-default output (with a Fleet Server package policy).
2. Change the default output's host to something inaccessible.
3. Trigger Fleet setup by navigating to the Fleet UI.
4. Check the contents of the package policy saved objects: the output refers to the default output instead of the one set in the agent policy.
5. Check the Fleet Server enroll command: it contains the Elasticsearch URL of the default output, which is wrong.
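The comparison in the reproduction steps can be sketched in Python. This is only an illustration working on already-fetched saved-object data: the function name `find_output_mismatches` and the `expected_output_by_policy` mapping are hypothetical, and the expected output ID `es-containerhost` is taken from the "Elastic Cloud internal output" document shown earlier in this comment.

```python
def find_output_mismatches(package_policies, expected_output_by_policy):
    """Return (name, actual_output, expected_output) for each package policy
    whose output_id differs from the output expected for its agent policy."""
    mismatches = []
    for pp in package_policies:
        expected = expected_output_by_policy.get(pp["policy_id"])
        actual = pp.get("output_id")
        # Skip policies for which no expectation is known.
        if expected is not None and actual != expected:
            mismatches.append((pp["name"], actual, expected))
    return mismatches


# Package policy fields copied from the ingest-package-policies excerpt.
package_policies = [
    {
        "name": "Fleet Server",
        "policy_id": "policy-elastic-agent-on-cloud",
        "output_id": "fleet-default-output",
    },
]

# Assumption: the managed agent policy should use the preconfigured
# "Elastic Cloud internal output" (output_id "es-containerhost").
expected_output_by_policy = {"policy-elastic-agent-on-cloud": "es-containerhost"}

mismatches = find_output_mismatches(package_policies, expected_output_by_policy)
for name, actual, expected in mismatches:
    print(f"{name}: uses {actual}, expected {expected}")
```

With the data from this issue, the Fleet Server package policy is reported as a mismatch, which matches what the saved objects show.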

EDIT: I checked an 8.2.3 cloud staging instance, and it has the same package policy and output setup as 8.4. Fleet Server comes up as healthy there.
Maybe something changed on the Agent/Fleet Server side regarding where the output host is taken from?

juliaElastic · 2022-06-21T09:25:36Z

I think this state of the outputs is not the real problem. On other PR cloud deployments the same setup works fine, with a healthy Fleet Server: elastic/kibana#131322

Only my two PRs have this issue:
elastic/kibana#134673
elastic/kibana#134565

michalpristas · 2022-06-21T10:56:15Z

could you extract logs for me?
i still don't have access to read logs in cloud for some reason, already raised with IT

juliaElastic · 2022-06-22T08:11:03Z

> could you extract logs for me? i still don't have access to read logs in cloud for some reason, already raised with IT

@michalpristas
Here are the Agent logs from the last day.
elastic_agent_and_apm_logs_134565.csv

I think one reason this instance stopped is that it ran out of memory: the cloud-ci Integration Server has 512 MB RAM, and I tried to enroll 10k agents.

However, today I tried to start a new ESS cluster with oblt-cli, and Fleet Server does not start up at all on 8.4.0-SNAPSHOT.
Agent logs of this instance:
elastic_agent_logs_ess-sarxy-custom.csv

This might be the same issue: elastic/fleet-server#1574

jlind23 · 2024-09-17T09:34:43Z

[Clean up] This hasn't been looked at and hasn't had any traction for the past two years, hence closing. I am happy to reopen if need be.
cc @ycombinator

juliaElastic added the bug label Jun 17, 2022

ph added the Team:Elastic-Agent-Control-Plane label Jun 17, 2022

ph assigned michalpristas Jun 17, 2022

juliaElastic mentioned this issue Jun 21, 2022

[Fleet] Package policies not using the output from agent policy elastic/kibana#134821

Open

jlind23 closed this as not planned Sep 17, 2024
