
[Fleet] Improving bulk actions for more than 10k agents #134565

Merged: 18 commits merged into elastic:main on Jun 24, 2022

Conversation

@juliaElastic (Contributor) commented Jun 16, 2022

Summary

Improving bulk actions for more than 10k agents #133388

Changed getAllAgentsByKuery (used by bulk actions only) to query all agents with a point-in-time (PIT) query and search_after for datasets larger than 10k.
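For readers unfamiliar with the pattern, here is a minimal sketch of PIT + search_after pagination, assuming the v8 @elastic/elasticsearch client; the index name, sort field, and function name are illustrative placeholders, not the actual Fleet implementation:

```ts
import { Client } from '@elastic/elasticsearch';

// ES returns at most 10k hits per request (the from + size window limit)
const SO_SEARCH_LIMIT = 10000;

async function getAllAgents(esClient: Client, query: Record<string, any>) {
  // Open a point in time so every page reads a consistent snapshot of the index
  const pit = await esClient.openPointInTime({
    index: '.fleet-agents',
    keep_alive: '1m',
  });
  const agents: Array<Record<string, any>> = [];
  let searchAfter: any[] | undefined;
  try {
    while (true) {
      const res = await esClient.search({
        size: SO_SEARCH_LIMIT,
        // A search that carries a PIT must not also name an index
        pit: { id: pit.id, keep_alive: '1m' },
        // search_after needs a deterministic sort to resume from
        sort: [{ enrolled_at: 'desc' }],
        search_after: searchAfter,
        query,
      });
      const page = res.hits.hits;
      agents.push(...page.map((hit) => hit._source as Record<string, any>));
      if (page.length < SO_SEARCH_LIMIT) break; // last page reached
      // Resume the next request after the final hit's sort values
      searchAfter = page[page.length - 1].sort;
    }
  } finally {
    // Close the PIT even if a page fails (cf. the "try catch around close pit" commit)
    await esClient.closePointInTime({ id: pit.id }).catch(() => {});
  }
  return agents;
}
```

Unlike from/size paging, this walks past the 10k window because each request resumes from the previous page's last sort values.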

Tested locally by changing SO_SEARCH_LIMIT to 5 and bulk-actioning more than 10 agents by selecting all at once (with a page size of 5 in the UI).

Pending work:

  • Find a way to write an API integration test without having to put more than 10k agents into ES. This could be an internal API endpoint that takes the page size as a parameter.

    • Added an internal API that accepts a perPage value smaller than 10k, and added an integration test to verify the logic. The response returns the real total value and the first 10 agents in items.
    • Example:

      GET kbn:/internal/fleet/agents?perPage=1000

      {
        "items": [...],
        "total": 9009
      }
    
  • Test with more than 10k agents actually enrolled with horde

  • Change the logic to perform the actions in batches rather than on all agents at once in memory; we might hit a memory limit if we try to do it in one go (see the sketch after this list).
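A possible shape for that batching, as a sketch; the page fetcher is assumed to wrap the PIT query above, and all names here are illustrative:

```ts
interface AgentPage {
  agents: Array<{ id: string }>;
  // sort values of the last hit, used to fetch the next page
  searchAfter?: any[];
}

async function forEachAgentBatch(
  fetchPage: (searchAfter?: any[]) => Promise<AgentPage>,
  applyAction: (agentIds: string[]) => Promise<void>
): Promise<void> {
  let searchAfter: any[] | undefined;
  do {
    const page = await fetchPage(searchAfter);
    if (page.agents.length === 0) break;
    // Act on this batch before fetching the next one, so only one
    // batch of agent documents is held in memory at a time
    await applyAction(page.agents.map((agent) => agent.id));
    searchAfter = page.searchAfter;
  } while (searchAfter !== undefined);
}
```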



@juliaElastic (Contributor, Author) commented:

@elasticmachine merge upstream

@juliaElastic marked this pull request as ready for review on June 21, 2022 10:01
@juliaElastic requested a review from a team as a code owner on June 21, 2022 10:01
@botelastic bot added the Team:Fleet label (Team label for Observability Data Collection Fleet team) on Jun 21, 2022
@elasticmachine (Contributor) commented:

Pinging @elastic/fleet (Team:Fleet)

@juliaElastic (Contributor, Author) commented:

@elasticmachine merge upstream

@juliaElastic (Contributor, Author) commented:

I've come across this issue once before when trying to action >10k agents; it occurred when updating that many documents in Elasticsearch at once.
I think doing the action in batches has to be done as part of this improvement.

```
info [o.e.c.r.a.AllocationService] [ftr] failing shard [FailedShard[routingEntry=[.kibana_task_manager_8.4.0_001][0], node[yASEYyvATfmHN2bdvzeFjA], [P], s[STARTED], a[id=KQUHNOhNQBeFJsGLpuTmzA], message=shard failure, reason [index id[task:reports:monitor] origin[PRIMARY] seq#[10919]], failure=java.nio.file.FileSystemException: /Users/juliabardi/kibana/kibana/.es/cluster-ftr/data/indices/c7blFdO1TW-ydEqxJITOHw/0/index/_4cc.fdm: Too many open files, markAsStale=true]]
      java.nio.file.FileSystemException: /Users/juliabardi/kibana/kibana/.es/cluster-ftr/data/indices/c7blFdO1TW-ydEqxJITOHw/0/index/_4cc.fdm: Too many open files
        at sun.nio.fs.UnixException.translateToIOException(UnixException.java:100) ~[?:?]
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106) ~[?:?]
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111) ~[?:?]
        at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:218) ~[?:?]
        at java.nio.fi
```

```ts
const result: BulkActionResult = {
// ...
let results;
// ...
if (!skipSuccess) {
```

@juliaElastic (Contributor, Author) commented on the diff:

Omitting successful agents from the result to avoid hitting the HTTP response size limit (currently only for actions on more than 10k agents).
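The idea, as a hedged sketch (types and names here are illustrative, not the actual Fleet code):

```ts
interface AgentActionResult {
  id: string;
  success: boolean;
  error?: Error;
}

// When skipSuccess is set (i.e. more than 10k agents were actioned),
// drop successful entries so the response stays within size limits
function buildBulkActionItems(results: AgentActionResult[], skipSuccess: boolean) {
  return results
    .filter((result) => !skipSuccess || !result.success)
    .map((result) => ({
      id: result.id,
      success: result.success,
      ...(result.error ? { error: result.error.message } : {}),
    }));
}
```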

@juliaElastic mentioned this pull request on Jun 22, 2022
@juliaElastic (Contributor, Author) commented Jun 23, 2022

Test results on the 8.3 branch (8.4 doesn't work):

ESS instance:
https://612aed4bdc0641f8a17788adf2b02685.us-west2.gcp.elastic-cloud.com:9243/app/fleet/agents
user: admin
password: 5tL7wVK7PIahRl9qoftvxhYM

  • enrolled 15k agents into Agent policy 1 with horde, all healthy
  • created Agent policy 2
  • selected all 15k agents (leaving out the Fleet Server) and reassigned them all at once
  • the bulk action took 8-11s, all successful (agent IDs are not listed in the response, since the logic skips successful agent IDs when more than 10k agents are actioned)


  • scheduled upgrade took 13s; it shows up in the UI as separate 5k and 10k notifications, probably because the batch logic generated slightly different timestamps


  • aborted upgrade: clicked the upgrade action on all 15k agents, took 10s, all successful, then aborted the upgrade


  • after the abort, some agents went offline; the count slowly decreased as agents came back healthy. After about 1 hour, all agents were healthy again; the last ones offline showed activity within the previous 9-10 minutes. Maybe we can increase the offline timeout to 10m.
  • unenroll all took 11s, all successful

  • force unenroll all took 39s, all successful

Issue with Fleet Server:

  • after enrolling 15k agents, Fleet Server is still healthy, but the Add agent flyout shows the message to Enable Integrations Server
  • after performing the upgrade + abort, Fleet Server goes offline, though it eventually comes back healthy
  • this might be an issue of the container not being sized large enough


  • after unenrolling all agents, Add agent flyout works fine again


@juliaElastic requested a review from kpollich on June 23, 2022 14:19
@kpollich (Member) left a comment:

Changes after review LGTM. Really great set of performance and consistency improvements here. Thank you for all your work on this!

```diff
@@ -65,6 +65,7 @@ export const postBulkAgentsUnenrollHandler: RequestHandler<
   ...agentOptions,
   revoke: request.body?.revoke,
   force: request.body?.force,
+  batchSize: request.body?.batchSize,
```
@kpollich (Member) commented on the diff:
Thank you for clarifying this. I understand the implementation here much better now 👍
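For reference, a request exercising the new parameter might look like this (a sketch of the documented bulk unenroll endpoint; the agent IDs and batch size here are placeholder values):

```
POST kbn:/api/fleet/agents/bulk_unenroll
{
  "agents": ["agent-id-1", "agent-id-2"],
  "revoke": true,
  "batchSize": 5000
}
```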

@kibana-ci (Collaborator) commented:

💚 Build Succeeded

Metrics [docs]

Public APIs missing comments

Total count of every public API that lacks a comment. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats comments for more detailed information.

| id    | before | after | diff |
| ----- | ------ | ----- | ---- |
| fleet | 1308   | 1309  | +1   |

Unknown metric groups

API count

| id    | before | after | diff |
| ----- | ------ | ----- | ---- |
| fleet | 1435   | 1436  | +1   |

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @juliaElastic

@juliaElastic added the v8.3.1 and auto-backport (Deprecated - use backport:version if exact versions are needed) labels on Jun 24, 2022
@juliaElastic merged commit 2732f26 into elastic:main on Jun 24, 2022
kibanamachine pushed a commit that referenced this pull request Jun 24, 2022
* changed getAllAgentsByKuery to query all agents with pit and search_after

* added internal api to test pit query

* changed reassign to work on batches of 10k

* unenroll in batches

* upgrade in batches

* fixed upgrade

* added tests

* cleanup

* revert changes in getAllAgentsByKuery

* renamed perPage to batchSize in bulk actions

* fixed test

* try catch around close pit

Co-authored-by: Kibana Machine <[email protected]>
(cherry picked from commit 2732f26)
@kibanamachine (Contributor) commented:

💚 All backports created successfully

Branch: 8.3 (backport created successfully)

Note: Successful backport PRs will be merged automatically after passing CI.

Questions?

Please refer to the Backport tool documentation

kibanamachine added a commit that referenced this pull request Jun 24, 2022

…35104)

(cherry picked from commit 2732f26)

Co-authored-by: Julia Bardi <[email protected]>
@tylersmalley added the ci:cloud-deploy (Create or update a Cloud deployment) label and removed the ci:deploy-cloud label on Aug 17, 2022
asmith-elastic added a commit to elastic/ingest-docs that referenced this pull request Nov 14, 2024

This change documents the ability to use a batchSize body parameter for the bulk_reassign_agents_request, which was introduced in the following PR: elastic/kibana#134565

This will help align the documentation with the Kibana API docs: https://www.elastic.co/guide/en/fleet/current/fleet-apis.html#bulkReassignAgents
asmith-elastic added a commit to elastic/ingest-docs that referenced this pull request Nov 18, 2024 (#1465)
mergify bot pushed commits to elastic/ingest-docs that referenced this pull request Nov 18, 2024 (#1465, cherry picked from commit 421a653)
kilfoyle added commits to elastic/ingest-docs that referenced this pull request Nov 18, 2024 (#1476, #1477; cherry picked from commit 421a653; co-authored by Austin Smith and David Kilfoyle)
Labels: auto-backport (Deprecated - use backport:version if exact versions are needed), ci:cloud-deploy (Create or update a Cloud deployment), release_note:fix, Team:Fleet (Team label for Observability Data Collection Fleet team), v8.3.1, v8.4.0