Fleet Server unhealthy on pr cloud deployment #575

Closed
juliaElastic opened this issue Jun 17, 2022 · 6 comments
Labels: bug, Team:Elastic-Agent-Control-Plane

Comments

@juliaElastic
Contributor

juliaElastic commented Jun 17, 2022

Created a cloud deployment from this PR: elastic/kibana#134565
Cloud link: https://kibana-pr-134565.kb.us-west2.gcp.elastic-cloud.com:9243/app/fleet/agents
password "8IViQT7Ol1Ki5J1ABFyda1hN"
username "elastic"

Fleet Server shows up as unhealthy.
Agent/Kibana version: 8.4.0 (main)

I checked another Kibana PR deployment, and Fleet Server was healthy there.

Cloud Admin URL:
https://admin.found.no/deployments/efa47ae977e6d61437a23065eec13880

Agent Logs:

https://kibana-ops-buildkite-monitoring.kb.us-central1.gcp.cloud.es.io:9243/app/logs/link-to/host-logs/kb-n2-2-047bf3bf51b94a40?time=1655466891264

Seeing this in logs, might be related:

Jun 17, 2022 @ 14:01:11.000	time="2022-06-17T12:01:11.550880609Z" level=warning msg="grpc: addrConn.createTransport failed to connect to {localhost  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing only one connection allowed\". Reconnecting..." module=grpc
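
To filter for lines like the one above, the key=value part of the entry can be split into fields. This is a minimal sketch, not part of the Agent tooling: `parse_logfmt` is a hypothetical helper, and it assumes the log body is logfmt-style with backslash-escaped quotes, as in the sample.

```python
import re

# Matches key=value pairs where the value is either a double-quoted string
# (backslash escapes allowed) or a bare token without whitespace.
_PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')


def parse_logfmt(line):
    """Parse a logfmt-style line into a dict, unescaping quoted values."""
    fields = {}
    for key, raw in _PAIR.findall(line):
        if len(raw) >= 2 and raw.startswith('"') and raw.endswith('"'):
            # Strip the surrounding quotes and turn \" back into ".
            raw = re.sub(r'\\(.)', r'\1', raw[1:-1])
        fields[key] = raw
    return fields


# The log body from the entry above (Kibana timestamp prefix omitted).
SAMPLE = r'''time="2022-06-17T12:01:11.550880609Z" level=warning msg="grpc: addrConn.createTransport failed to connect to {localhost  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing only one connection allowed\". Reconnecting..." module=grpc'''

if __name__ == "__main__":
    fields = parse_logfmt(SAMPLE)
    print(fields["level"], fields["module"])
    print(fields["msg"])
```

With the fields separated, it is straightforward to count how often `module=grpc` warnings with "only one connection allowed" occur in an exported log file.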

Integration Server Logs:
https://logging.us-west2.gcp.elastic-cloud.com/app/r/s/rancid-raspy-iron

juliaElastic added the bug label Jun 17, 2022
ph added the Team:Elastic-Agent-Control-Plane label Jun 17, 2022
@ph
Contributor

ph commented Jun 17, 2022

@michalpristas, can you investigate how serious this is so that we can prioritize it?

@juliaElastic
Contributor Author

juliaElastic commented Jun 21, 2022

I noticed something strange in the package policies.
There is a managed policy, Elastic Cloud agent policy, which uses the Elastic Cloud internal output.
But when I query the package policies from the .kibana index, I see that the apm and fleet server package policies use fleet-default-output, which is not the output defined in the agent policy.
Maybe there is a bug in the preconfiguration code in Fleet?

.kibana/_search?q=type:ingest-package-policies

        "_id": "ingest-package-policies:elastic-cloud-fleet-server",
        "_score": 5.1603312,
        "_source": {
          "ingest-package-policies": {
            "name": "Fleet Server",
            "namespace": "default",
            "package": {
              "name": "fleet_server",
              "title": "Fleet Server",
              "version": "1.2.0"
            },
            "enabled": true,
            "policy_id": "policy-elastic-agent-on-cloud",
            "output_id": "fleet-default-output",
          }
        }

.kibana/_search?q=type:ingest-outputs

          "ingest-outputs": {
            "name": "Elastic Cloud internal output",
            "type": "elasticsearch",
            "hosts": [
              "http://89a580b8af164e9d94dee28aed08d8b5.containerhost:9244"
            ],
            "is_default": false,
            "is_default_monitoring": false,
            "is_preconfigured": true,
            "output_id": "es-containerhost"
          },
          "type": "ingest-outputs",
          "references": [],
          "migrationVersion": {
            "ingest-outputs": "8.0.0"
          },
          "coreMigrationVersion": "8.4.0",
          "updated_at": "2022-06-17T18:16:11.538Z"
        }
      },
      {
        "_index": ".kibana_8.4.0_001",
        "_id": "ingest-outputs:a09a5397-7b9a-5a73-a622-e29f4c635658",
        "_score": 5.7339883,
        "_source": {
          "ingest-outputs": {
            "name": "default",
            "is_default": true,
            "is_default_monitoring": true,
            "type": "elasticsearch",
            "hosts": [
              "https://89a580b8af164e9d94dee28aed08d8b5.us-west2.gcp.elastic-cloud.com:443"
            ],
            "output_id": "fleet-default-output"
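
The comparison above can be sketched as a small check over the search hits. This is an illustration under assumptions, not Fleet code: `find_output_mismatches` is a hypothetical helper, and the expected output per agent policy is supplied by hand (here `es-containerhost` for `policy-elastic-agent-on-cloud`, as read from the excerpts) rather than looked up from the agent policy saved object.

```python
def find_output_mismatches(package_policy_hits, expected_output_by_policy):
    """Return (package policy _id, actual output_id, expected output_id)
    for every package policy whose output differs from the output expected
    for its agent policy. Policies without an expectation are skipped."""
    mismatches = []
    for hit in package_policy_hits:
        pp = hit["_source"]["ingest-package-policies"]
        expected = expected_output_by_policy.get(pp["policy_id"])
        if expected is not None and pp.get("output_id") != expected:
            mismatches.append((hit["_id"], pp.get("output_id"), expected))
    return mismatches


# Trimmed copy of the package policy hit shown above.
hits = [{
    "_id": "ingest-package-policies:elastic-cloud-fleet-server",
    "_source": {"ingest-package-policies": {
        "name": "Fleet Server",
        "policy_id": "policy-elastic-agent-on-cloud",
        "output_id": "fleet-default-output",
    }},
}]

# Assumption: the managed cloud policy should use the internal output.
expected = {"policy-elastic-agent-on-cloud": "es-containerhost"}

if __name__ == "__main__":
    for pp_id, actual, wanted in find_output_mismatches(hits, expected):
        print(f"{pp_id}: output_id={actual}, expected {wanted}")
```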

I could reproduce this locally:

  • add a preconfigured agent policy that uses a non-default output (with a fleet server package policy)
  • change the default output's host to something inaccessible
  • trigger Fleet setup by navigating to the Fleet UI
  • check the contents of the package-policies saved objects - the output refers to the default output instead of the one set in the agent policy
  • see that the fleet server enroll command indeed contains the ES URL for the default output, which is wrong

image

EDIT: I checked an 8.2.3 cloud staging instance, and the package policies and outputs have the same setup there as in 8.4. Fleet Server comes up as healthy there.
Maybe something changed on the Agent/Fleet Server side regarding where the output host is taken from?

@juliaElastic
Contributor Author

I think this state of the outputs is not the real problem. On other PR cloud deployments the same setup works fine with a healthy Fleet Server: elastic/kibana#131322

Only my two PRs have this issue:
elastic/kibana#134673
elastic/kibana#134565

@michalpristas
Contributor

could you extract logs for me?
i still don't have access to read logs in cloud for some reason, already raised with IT

@juliaElastic
Contributor Author

juliaElastic commented Jun 22, 2022

could you extract logs for me? i still don't have access to read logs in cloud for some reason, already raised with IT

@michalpristas
Here are the Agent logs from the last day.
elastic_agent_and_apm_logs_134565.csv

I think one reason this instance stopped is that it ran out of memory: the cloud-ci Integration Server has 512 MB RAM, and I tried to enroll 10k agents.
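
For a rough sense of scale, the memory available per enrolled agent can be estimated. This is a back-of-the-envelope sketch only: it assumes "512 MB" means 512 MiB and that all of it were available for agent state, which overstates the real budget since the processes on the instance need memory of their own.

```python
MIB = 1024 * 1024  # bytes per MiB


def per_agent_budget_bytes(total_ram_mib, agents):
    """Upper bound on memory per agent if all RAM were spent on agents."""
    return total_ram_mib * MIB / agents


if __name__ == "__main__":
    budget = per_agent_budget_bytes(512, 10_000)
    print(f"{budget / 1024:.1f} KiB per agent")  # prints "52.4 KiB per agent"
```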

However, today I tried to start a new ESS cluster with oblt-cli, and Fleet Server does not start up at all on 8.4.0-SNAPSHOT.
Agent logs of this instance:
elastic_agent_logs_ess-sarxy-custom.csv

This might be the same issue: elastic/fleet-server#1574

@jlind23
Contributor

jlind23 commented Sep 17, 2024

[Clean up] This hasn't been looked at and hasn't had any traction for the past two years, hence closing. I am happy to reopen if need be.
cc @ycombinator

jlind23 closed this as not planned Sep 17, 2024
4 participants