[Fleet] Unable to reassign a large number of agents to a new policy #134318
@juliaElastic could this be related to #133388? You should be able to use https://github.com/elastic/horde to emulate 10k agents, though you may need to do this on a VM to reach that size. @ablnk we may want to work with you on reproducing this.
I checked what is going on in reassign, and I think the reason for the timeout is that the bulk reassign process is not too efficient.
Further improvements can be made; a sketch of the underlying inefficiency is below.
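To illustrate the kind of inefficiency described above, here is a minimal, hypothetical sketch. The names (`esClient`, `AGENTS_INDEX`) and the update flow are assumptions for illustration, not the actual Fleet implementation:

```ts
import { Client } from '@elastic/elasticsearch';

const esClient = new Client({ node: 'http://localhost:9200' });
const AGENTS_INDEX = '.fleet-agents'; // assumed index name for this sketch

// Slow pattern: one update request per agent. With 10k agents this is
// 10k HTTP round trips, which can easily exceed a ~2 minute proxy timeout.
async function reassignOneByOne(agentIds: string[], newPolicyId: string) {
  for (const id of agentIds) {
    await esClient.update({
      index: AGENTS_INDEX,
      id,
      doc: { policy_id: newPolicyId },
    });
  }
}

// Faster pattern: a single _bulk request carrying all the updates,
// so Elasticsearch does the fan-out server-side.
async function reassignInBulk(agentIds: string[], newPolicyId: string) {
  await esClient.bulk({
    operations: agentIds.flatMap((id) => [
      { update: { _index: AGENTS_INDEX, _id: id } },
      { doc: { policy_id: newPolicyId } },
    ]),
  });
}
```

Batching into one `_bulk` request trades per-agent round trips for a single larger payload, which is usually the difference between seconds and a proxy-level timeout at this scale.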
Also noticed that …
@juliaElastic should we allow reassign on managed agents at all? (I'm assuming "managed" here means agents assigned to a managed policy, like the Elastic Cloud agent.)
It is not currently allowed; I was asking for a force flag, as implemented for other actions, e.g. to unenroll from a managed policy.
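For comparison, here is a hedged sketch of what a force flag looks like on an action that already supports it. The endpoint and body shape follow the Fleet bulk unenroll API, but treat the exact fields, host, and agent IDs as assumptions:

```ts
// Assumed shape of the existing bulk unenroll call, where `force`
// overrides the guard on agents enrolled in a managed/hosted policy.
await fetch('https://<kibana-host>/api/fleet/agents/bulk_unenroll', {
  method: 'POST',
  headers: { 'kbn-xsrf': 'true', 'Content-Type': 'application/json' },
  body: JSON.stringify({
    agents: ['agent-id-1', 'agent-id-2'], // placeholder IDs
    force: true, // act even on managed-policy agents
  }),
});
// The request in this thread is an analogous `force` option on bulk_reassign.
```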
Hi @juliaElastic @jen-huang
Build details: …
Please let us know if we are missing anything here.
Hi Team,
Observations: …
Build details: …
Screen Recording: Agents.-.Fleet.-.Elastic.-.Google.Chrome.2022-08-01.16-09-26.mp4
Hence, we are marking it as QA: Validated.
This defect was found during Fleet scale testing. Please note that it may be caused by Horde; behavior may differ with real agents. During the Fleet Scale sync we agreed to file issues even when something is not clearly a bug.
I reproduced it with 10k agents.
Environment:
ESS cluster was built with oblt-cli.
Horde was used to deploy agents.
Kibana version:
8.3.0 SNAPSHOT
Precondition to reproduce:
Have 10k agents in Fleet under the same policy.
Steps to reproduce:
1. In Fleet > Agents, select all agents under the policy.
2. From the bulk actions menu, choose the option to assign the selected agents to a new policy and pick a different policy.
Actual result:
After the bulk_reassign request hangs for ~2 minutes, the following message appears in the UI: "Unable to reassign agent policy. Backend closed connection"
In response to the bulk_reassign request, a 502 error is returned:

```json
{ "ok": false, "message": "backend closed connection" }
```
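For reference, the failing call can be reproduced directly against the Fleet API. This is a hedged sketch: the route is the documented `bulk_reassign` endpoint, but the host, credentials, and the kuery selecting the agents are placeholders:

```ts
// Hypothetical reproduction of the failing request against Kibana's Fleet API.
const res = await fetch('https://<kibana-host>/api/fleet/agents/bulk_reassign', {
  method: 'POST',
  headers: {
    'kbn-xsrf': 'true',
    'Content-Type': 'application/json',
    Authorization: `ApiKey ${process.env.KIBANA_API_KEY}`, // placeholder auth
  },
  body: JSON.stringify({
    agents: 'policy_id : "old-policy-id"', // kuery selecting the 10k agents (placeholder)
    policy_id: 'new-policy-id',
  }),
});
// With 10k agents, the proxy closes the connection after ~2 minutes:
console.log(res.status); // 502, body: { ok: false, message: "backend closed connection" }
```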
Here are a few logs that I managed to find:
{ "_index": ".ds-elastic-cloud-logs-8-2022.05.28-000004", "_id": "QEYCRIEBFt3ZXvKtOm26", "_version": 1, "_score": 1, "_source": { "@timestamp": "2022-06-08T15:50:02.248Z", "log.level": "error", "ecs.version": "1.6.0", "service.name": "fleet-server", "message": "http: TLS handshake error from 10.47.192.54:47390: EOF\n", "log": { "file": { "path": "/app/elastic-agent/data/logs/es-containerhost/fleet-server-20220608-20.ndjson" }, "offset": 7539620 }, "service": { "node": { "name": "instance-0000000000" }, "version": "8.3.0-SNAPSHOT", "type": "agent", "name": "perf-ocuka-custom", "id": "f9e9545143faecb03be3530807b8a0b6" }, "agent": { "type": "filebeat", "version": "8.3.0", "ephemeral_id": "cd40f34b-d445-4e22-9371-5ca7458a9d98", "id": "e2c0815c-7b65-4bb7-82f9-47732f46bf48", "name": "f6b2fffd32a5" }, "host": { "name": "f6b2fffd32a5" }, "input": { "type": "log" }, "event": { "dataset": "agent.log" }, "cloud": { "availability_zone": "us-west2-b" }, "ecs": { "version": "8.0.0" } }, "fields": { "service.name": [ "perf-ocuka-custom", "fleet-server" ], "service.id": [ "f9e9545143faecb03be3530807b8a0b6" ], "service.node.name": [ "instance-0000000000" ], "input.type": [ "log" ], "log.offset": [ 7539620 ], "message": [ "http: TLS handshake error from 10.47.192.54:47390: EOF\n" ], "cloud.availability_zone": [ "us-west2-b" ], "service.type": [ "agent" ], "agent.type": [ "filebeat" ], "@timestamp": [ "2022-06-08T15:50:02.248Z" ], "service.version": [ "8.3.0-SNAPSHOT" ], "agent.id": [ "e2c0815c-7b65-4bb7-82f9-47732f46bf48" ], "ecs.version": [ "8.0.0", "1.6.0" ], "log.file.path": [ "/app/elastic-agent/data/logs/es-containerhost/fleet-server-20220608-20.ndjson" ], "log.level": [ "error" ], "agent.ephemeral_id": [ "cd40f34b-d445-4e22-9371-5ca7458a9d98" ], "agent.name": [ "f6b2fffd32a5" ], "agent.version": [ "8.3.0" ], "host.name": [ "f6b2fffd32a5" ], "event.dataset": [ "agent.log" ] } }
{ "_index": ".ds-elastic-cloud-logs-8-2022.05.28-000004", "_id": "ckUARIEBFt3ZXvKt51qf", "_version": 1, "_score": 1, "_source": { "agent": { "name": "112994cdba43", "id": "e29a2670-beca-4701-a449-49336d6229e6", "type": "filebeat", "ephemeral_id": "cf9770bd-bb40-49c2-a175-115e717be344", "version": "8.3.0" }, "log": { "file": { "path": "/app/logs/kibana-json.log" }, "offset": 89315 }, "fileset": { "name": "log" }, "message": "Cancelling task endpoint:user-artifact-packager \"endpoint:user-artifact-packager:1.0.0\" as it expired at 2022-06-08T15:48:04.140Z after running for 01m 31s (with timeout set at 1m).", "error": { "message": "field [kibana.log.meta.pid] doesn't exist" }, "cloud": { "availability_zone": "us-west2-b" }, "input": { "type": "log" }, "@timestamp": "2022-06-08T15:48:35.440Z", "ecs": { "version": "1.12.0" }, "service": { "node": { "name": "instance-0000000001" }, "name": "perf-ocuka-custom", "id": "f9e9545143faecb03be3530807b8a0b6", "type": "kibana", "version": "8.3.0-SNAPSHOT" }, "host": { "name": "112994cdba43" }, "event": { "ingested": "2022-06-08T15:48:38.942568566Z", "created": "2022-06-08T15:48:37.863Z", "module": "kibana", "dataset": "kibana.log" }, "kibana": { "log": { "meta": { "process": { "pid": 20 }, "trace": { "id": "c585a21988e91a00659f09da1d091565" }, "ecs": { "version": "8.0.0" }, "log": { "level": "WARN", "logger": "plugins.taskManager" }, "transaction": { "id": "1894b2fe51908601" } } } } }, "fields": { "service.id": [ "f9e9545143faecb03be3530807b8a0b6" ], "kibana.log.meta.process.pid": [ 20 ], "kibana.log.meta.transaction.id": [ "1894b2fe51908601" ], "service.node.name": [ "instance-0000000001" ], "kibana.log.meta.log.level": [ "WARN" ], "cloud.availability_zone": [ "us-west2-b" ], "service.type": [ "kibana" ], "agent.type": [ "filebeat" ], "event.module": [ "kibana" ], "agent.name": [ "112994cdba43" ], "host.name": [ "112994cdba43" ], "kibana.log.meta.log.logger": [ "plugins.taskManager" ], "service.name": [ "perf-ocuka-custom" ], "fileset.name": [ "log" ], "kibana.log.meta.ecs.version": [ "8.0.0" ], "input.type": [ "log" ], "log.offset": [ 89315 ], "message": [ "Cancelling task endpoint:user-artifact-packager \"endpoint:user-artifact-packager:1.0.0\" as it expired at 2022-06-08T15:48:04.140Z after running for 01m 31s (with timeout set at 1m)." ], "kibana.log.meta.trace.id": [ "c585a21988e91a00659f09da1d091565" ], "event.ingested": [ "2022-06-08T15:48:38.942Z" ], "@timestamp": [ "2022-06-08T15:48:35.440Z" ], "service.version": [ "8.3.0-SNAPSHOT" ], "agent.id": [ "e29a2670-beca-4701-a449-49336d6229e6" ], "ecs.version": [ "1.12.0" ], "error.message": [ "field [kibana.log.meta.pid] doesn't exist" ], "event.created": [ "2022-06-08T15:48:37.863Z" ], "log.file.path": [ "/app/logs/kibana-json.log" ], "agent.ephemeral_id": [ "cf9770bd-bb40-49c2-a175-115e717be344" ], "agent.version": [ "8.3.0" ], "event.dataset": [ "kibana.log" ] } }
Expected behavior:
Selected agents are reassigned to the new policy. If it is not technically possible to reassign that many agents, the option must not be available to the user in the UI.
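If the product decision is to cap the bulk action rather than support arbitrary selection sizes, a UI-side guard could look like this hypothetical sketch (the threshold and function name are invented for illustration):

```ts
// Hypothetical guard: disable the reassign bulk action above a supported
// selection size instead of letting the request time out at the proxy.
const MAX_REASSIGN_SELECTION = 5000; // invented threshold, for illustration only

function isReassignActionEnabled(selectedAgentCount: number): boolean {
  return selectedAgentCount <= MAX_REASSIGN_SELECTION;
}
```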