
[Fleet] Unable to reassign a large number agents to a new policy #134318

Closed
ablnk opened this issue Jun 8, 2022 · 7 comments · Fixed by #134673
Labels
bug (Fixes for quality problems that affect the customer experience)
Project:FleetScaling
QA:Validated (Issue has been validated by QA)
Team:Fleet (Team label for Observability Data Collection Fleet team)

Comments

ablnk commented Jun 8, 2022

This defect was found during Fleet scale testing. Please note that it may be caused by Horde; behavior may differ with real agents. During the Fleet Scale sync we agreed to file issues even when it's not clearly a bug.

I reproduced it with 10k agents.

Environment:
ESS cluster was built with oblt-cli.
Horde was used to deploy agents.

Kibana version:
8.3.0 SNAPSHOT

Precondition to reproduce:
Have 10k agents in Fleet under the same policy.

Steps to reproduce:

  1. Click the select-all checkbox, then "Select everything on all pages".
  2. Select "Assign to a new policy" from the actions menu.
  3. Select a new policy from the drop-down list.
  4. Click the "Assign policy" button and check the result.

Actual result:
After the bulk_reassign request hangs for ~2 minutes, the following message appears in the UI: "Unable to reassign agent policy. Backend closed connection"
[Screenshot: "Unable to reassign policy" error]
In response to the bulk_reassign request, a 502 error is returned:

```json
{ "ok": false, "message": "backend closed connection" }
```

Here are a few logs that I managed to find:
{ "_index": ".ds-elastic-cloud-logs-8-2022.05.28-000004", "_id": "QEYCRIEBFt3ZXvKtOm26", "_version": 1, "_score": 1, "_source": { "@timestamp": "2022-06-08T15:50:02.248Z", "log.level": "error", "ecs.version": "1.6.0", "service.name": "fleet-server", "message": "http: TLS handshake error from 10.47.192.54:47390: EOF\n", "log": { "file": { "path": "/app/elastic-agent/data/logs/es-containerhost/fleet-server-20220608-20.ndjson" }, "offset": 7539620 }, "service": { "node": { "name": "instance-0000000000" }, "version": "8.3.0-SNAPSHOT", "type": "agent", "name": "perf-ocuka-custom", "id": "f9e9545143faecb03be3530807b8a0b6" }, "agent": { "type": "filebeat", "version": "8.3.0", "ephemeral_id": "cd40f34b-d445-4e22-9371-5ca7458a9d98", "id": "e2c0815c-7b65-4bb7-82f9-47732f46bf48", "name": "f6b2fffd32a5" }, "host": { "name": "f6b2fffd32a5" }, "input": { "type": "log" }, "event": { "dataset": "agent.log" }, "cloud": { "availability_zone": "us-west2-b" }, "ecs": { "version": "8.0.0" } }, "fields": { "service.name": [ "perf-ocuka-custom", "fleet-server" ], "service.id": [ "f9e9545143faecb03be3530807b8a0b6" ], "service.node.name": [ "instance-0000000000" ], "input.type": [ "log" ], "log.offset": [ 7539620 ], "message": [ "http: TLS handshake error from 10.47.192.54:47390: EOF\n" ], "cloud.availability_zone": [ "us-west2-b" ], "service.type": [ "agent" ], "agent.type": [ "filebeat" ], "@timestamp": [ "2022-06-08T15:50:02.248Z" ], "service.version": [ "8.3.0-SNAPSHOT" ], "agent.id": [ "e2c0815c-7b65-4bb7-82f9-47732f46bf48" ], "ecs.version": [ "8.0.0", "1.6.0" ], "log.file.path": [ "/app/elastic-agent/data/logs/es-containerhost/fleet-server-20220608-20.ndjson" ], "log.level": [ "error" ], "agent.ephemeral_id": [ "cd40f34b-d445-4e22-9371-5ca7458a9d98" ], "agent.name": [ "f6b2fffd32a5" ], "agent.version": [ "8.3.0" ], "host.name": [ "f6b2fffd32a5" ], "event.dataset": [ "agent.log" ] } }

{ "_index": ".ds-elastic-cloud-logs-8-2022.05.28-000004", "_id": "ckUARIEBFt3ZXvKt51qf", "_version": 1, "_score": 1, "_source": { "agent": { "name": "112994cdba43", "id": "e29a2670-beca-4701-a449-49336d6229e6", "type": "filebeat", "ephemeral_id": "cf9770bd-bb40-49c2-a175-115e717be344", "version": "8.3.0" }, "log": { "file": { "path": "/app/logs/kibana-json.log" }, "offset": 89315 }, "fileset": { "name": "log" }, "message": "Cancelling task endpoint:user-artifact-packager \"endpoint:user-artifact-packager:1.0.0\" as it expired at 2022-06-08T15:48:04.140Z after running for 01m 31s (with timeout set at 1m).", "error": { "message": "field [kibana.log.meta.pid] doesn't exist" }, "cloud": { "availability_zone": "us-west2-b" }, "input": { "type": "log" }, "@timestamp": "2022-06-08T15:48:35.440Z", "ecs": { "version": "1.12.0" }, "service": { "node": { "name": "instance-0000000001" }, "name": "perf-ocuka-custom", "id": "f9e9545143faecb03be3530807b8a0b6", "type": "kibana", "version": "8.3.0-SNAPSHOT" }, "host": { "name": "112994cdba43" }, "event": { "ingested": "2022-06-08T15:48:38.942568566Z", "created": "2022-06-08T15:48:37.863Z", "module": "kibana", "dataset": "kibana.log" }, "kibana": { "log": { "meta": { "process": { "pid": 20 }, "trace": { "id": "c585a21988e91a00659f09da1d091565" }, "ecs": { "version": "8.0.0" }, "log": { "level": "WARN", "logger": "plugins.taskManager" }, "transaction": { "id": "1894b2fe51908601" } } } } }, "fields": { "service.id": [ "f9e9545143faecb03be3530807b8a0b6" ], "kibana.log.meta.process.pid": [ 20 ], "kibana.log.meta.transaction.id": [ "1894b2fe51908601" ], "service.node.name": [ "instance-0000000001" ], "kibana.log.meta.log.level": [ "WARN" ], "cloud.availability_zone": [ "us-west2-b" ], "service.type": [ "kibana" ], "agent.type": [ "filebeat" ], "event.module": [ "kibana" ], "agent.name": [ "112994cdba43" ], "host.name": [ "112994cdba43" ], "kibana.log.meta.log.logger": [ "plugins.taskManager" ], "service.name": [ "perf-ocuka-custom" ], "fileset.name": [ "log" ], "kibana.log.meta.ecs.version": [ "8.0.0" ], "input.type": [ "log" ], "log.offset": [ 89315 ], "message": [ "Cancelling task endpoint:user-artifact-packager \"endpoint:user-artifact-packager:1.0.0\" as it expired at 2022-06-08T15:48:04.140Z after running for 01m 31s (with timeout set at 1m)." ], "kibana.log.meta.trace.id": [ "c585a21988e91a00659f09da1d091565" ], "event.ingested": [ "2022-06-08T15:48:38.942Z" ], "@timestamp": [ "2022-06-08T15:48:35.440Z" ], "service.version": [ "8.3.0-SNAPSHOT" ], "agent.id": [ "e29a2670-beca-4701-a449-49336d6229e6" ], "ecs.version": [ "1.12.0" ], "error.message": [ "field [kibana.log.meta.pid] doesn't exist" ], "event.created": [ "2022-06-08T15:48:37.863Z" ], "log.file.path": [ "/app/logs/kibana-json.log" ], "agent.ephemeral_id": [ "cf9770bd-bb40-49c2-a175-115e717be344" ], "agent.version": [ "8.3.0" ], "event.dataset": [ "kibana.log" ] } }

Expected behavior:
Selected agents are reassigned to the new policy. If it's not technically possible to reassign that many agents, this option must not be available to the user in the GUI.

ablnk added the bug label Jun 8, 2022
nimarezainia added the Team:Fleet label Jun 8, 2022
joshdover transferred this issue from elastic/fleet-server Jun 14, 2022
@joshdover (Contributor) commented:

@juliaElastic could this be related to #133388? The "backend closed connection" error is interesting; it seems to indicate that some timeout or max request size is being reached in the proxy layer in Cloud. Can we test this as part of your scaling efforts in 8.4?

You should be able to use https://github.com/elastic/horde to emulate 10k agents, though you may need to do this on a VM to reach that size. @ablnk we may want to work with you on reproducing this.

jen-huang changed the title from "Unable to reassign a large number agents to a new policy" to "[Fleet] Unable to reassign a large number agents to a new policy" Jun 14, 2022
@juliaElastic (Contributor) commented:

I checked what is going on in reassign, and I think the reason for the timeout is that the bulk reassign process is not very efficient.
There is logic that queries all agents (max 10k) and goes through them one by one to check whether each agent can be reassigned (neither the old nor the new policy may be managed). I think this is the biggest cause of the slowness.
https://github.com/elastic/kibana/blob/main/x-pack/plugins/fleet/server/services/agents/reassign.ts#L116

```ts
export async function reassignAgentIsAllowed(
```
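
To make the bottleneck concrete, here is a minimal sketch of the flow described above. All names, types, and signatures are hypothetical stand-ins for illustration, not the actual Fleet source:

```ts
// Hypothetical stand-in for Fleet's internal agent type.
interface Agent {
  id: string;
  policy_id?: string;
  // ...many other fields that the reassign check never reads
}

type GetAgents = (opts: { kuery: string; perPage: number }) => Promise<Agent[]>;
type IsAllowed = (agent: Agent, newPolicyId: string) => Promise<boolean>;

// Paraphrase of the current flow: one query for up to 10k full agent
// documents, then a sequential per-agent permission check.
async function bulkReassignSketch(
  getAgents: GetAgents,
  isReassignAllowed: IsAllowed,
  kuery: string,
  newPolicyId: string
): Promise<string[]> {
  const agents = await getAgents({ kuery, perPage: 10_000 }); // single large fetch
  const allowedIds: string[] = [];
  for (const agent of agents) {
    // Awaiting inside the loop serializes 10k checks, which is where the
    // request time goes before the Cloud proxy closes the connection.
    if (await isReassignAllowed(agent, newPolicyId)) {
      allowedIds.push(agent.id);
    }
  }
  return allowedIds;
}
```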

Further improvements can be made (see the sketch after this list):

  • only agent.id and agent.policy_id are used from the agent objects, so it would be more efficient to query only those fields
  • if the process is still slow or uses too much memory for 10k agents, the action can be done in batches of e.g. 1000
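
As a rough sketch of both suggestions, assuming the `.fleet-agents` index and the official `@elastic/elasticsearch` client (the index name, sort field, and endpoint are assumptions for illustration):

```ts
import { Client } from '@elastic/elasticsearch';
import type { QueryDslQueryContainer } from '@elastic/elasticsearch/lib/api/types';

const es = new Client({ node: 'http://localhost:9200' }); // placeholder endpoint

// Stream agents in batches of 1000, fetching only the two fields the
// reassign check actually reads. A unique sort field (assumed here to be
// a keyword-mapped agent.id) keeps search_after pagination stable.
async function* agentBatches(query: QueryDslQueryContainer, batchSize = 1000) {
  let searchAfter: Array<string | number> | undefined;
  for (;;) {
    const res = await es.search({
      index: '.fleet-agents',               // assumed index name
      size: batchSize,
      _source: ['agent.id', 'policy_id'],   // improvement 1: narrow the source
      sort: [{ 'agent.id': 'asc' }],        // assumed unique keyword field
      search_after: searchAfter,
      query,
    });
    const hits = res.hits.hits;
    if (hits.length === 0) return;
    yield hits;                             // improvement 2: batches of <= 1000
    searchAfter = hits[hits.length - 1].sort as Array<string | number>;
  }
}
```

With search_after pagination each request stays small and memory is bounded by one batch, instead of holding 10k full documents at once.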

@juliaElastic (Contributor) commented:

Also noticed that the force flag is not used at all in reassign, whereas in unenroll and upgrade the force flag is applied to execute the action on managed agents as well. I think the force behaviour should be consistent across all actions.
https://github.com/elastic/kibana/blob/main/x-pack/plugins/fleet/server/services/agents/reassign.ts#L80
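
For illustration only, a minimal sketch of how a shared guard could make force behave the same way across actions (the helper name and shape are hypothetical, not the actual Fleet code):

```ts
// Hypothetical shared guard for unenroll, upgrade, and reassign: agents on
// a managed policy are only acted on when force is explicitly set.
function assertActionAllowed(
  isManagedPolicy: boolean,
  action: 'unenroll' | 'upgrade' | 'reassign',
  force = false
): void {
  if (isManagedPolicy && !force) {
    throw new Error(
      `Cannot ${action} an agent on a managed policy; pass force to override.`
    );
  }
}
```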

@joshdover (Contributor) commented:

@juliaElastic should we allow reassign on managed agents at all? (I'm assuming "managed" here means agents assigned to a managed policy, like Elastic Cloud agent)

@juliaElastic (Contributor) commented:

> @juliaElastic should we allow reassign on managed agents at all? (I'm assuming "managed" here means agents assigned to a managed policy, like Elastic Cloud agent)

It is not allowed currently. I was asking about the force flag, as it is already implemented for other actions, e.g. to unenroll from a managed policy.

jen-huang added the QA:Ready for Testing (Code is merged and ready for QA to validate) label Jun 29, 2022
@amolnater-qasource commented:

Hi @juliaElastic @jen-huang
We have revalidated this feature on the latest 8.4 snapshot and found it working fine.

  • We are successfully able to reassign more than 10k agents to any other policy.

Build details:
VERSION: 8.4.0 Snapshot
BUILD: 54194
COMMIT: f94d5ff

Screenshots: [attached]

Please let us know if we are missing anything here.
Thanks

jen-huang added the QA:Validated label and removed the QA:Ready for Testing label Jul 11, 2022
@amolnater-qasource commented:

Hi Team
We have revalidated this feature on the latest 8.4 BC1 Kibana cloud environment and found it working fine.

Observations:

  • We are successfully able to reassign more than 10k agents to any other policy.

Build details:
BUILD: 54999
COMMIT: 58f7eaf

Screen Recording:

Agents.-.Fleet.-.Elastic.-.Google.Chrome.2022-08-01.16-09-26.mp4

Hence, we are marking it as QA: Validated.
Thanks
