
[Fleet] Improving bulk actions for more than 10k agents #134565

Merged: 18 commits merged into elastic:main on Jun 24, 2022

Conversation

@juliaElastic (Contributor) commented Jun 16, 2022

Summary

Improving bulk actions for more than 10k agents #133388

Changed getAllAgentsByKuery (used by bulk actions only) to query all agents with a point-in-time (PIT) query and search_after for datasets larger than 10k.
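For readers unfamiliar with the pattern, here is a minimal sketch of PIT + search_after pagination, assuming the v8 @elastic/elasticsearch client; the index name, sort field, and function name are illustrative placeholders, not the actual Fleet implementation:

```ts
import { Client } from '@elastic/elasticsearch';

// ES returns at most 10k hits per request (the from + size window limit)
const SO_SEARCH_LIMIT = 10000;

async function getAllAgents(esClient: Client, query: Record<string, any>) {
  // Open a point in time so every page reads a consistent snapshot of the index
  const pit = await esClient.openPointInTime({
    index: '.fleet-agents',
    keep_alive: '1m',
  });
  const agents: Array<Record<string, any>> = [];
  let searchAfter: any[] | undefined;
  try {
    while (true) {
      const res = await esClient.search({
        size: SO_SEARCH_LIMIT,
        // A search that carries a PIT must not also name an index
        pit: { id: pit.id, keep_alive: '1m' },
        // search_after needs a deterministic sort to resume from
        sort: [{ enrolled_at: 'desc' }],
        search_after: searchAfter,
        query,
      });
      const page = res.hits.hits;
      agents.push(...page.map((hit) => hit._source as Record<string, any>));
      if (page.length < SO_SEARCH_LIMIT) break; // last page reached
      // Resume the next request after the final hit's sort values
      searchAfter = page[page.length - 1].sort;
    }
  } finally {
    // Close the PIT even if a page fails (cf. the "try catch around close pit" commit)
    await esClient.closePointInTime({ id: pit.id }).catch(() => {});
  }
  return agents;
}
```

Unlike from/size paging, this walks past the 10k window because each request resumes from the previous page's last sort values.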

Tested locally by changing SO_SEARCH_LIMIT to 5 and bulk-actioning more than 10 agents by selecting all at once (with a page size of 5 in the UI).

Pending work:

  • Find a way to write an API integration test without having to put more than 10k agents into ES. This could be an internal API endpoint that takes the page size as a parameter.

    • Added an internal API that accepts a perPage value smaller than 10k, and added an integration test to verify the logic. The response returns the real total value and the first 10 agents in items.
    • Example:

      GET kbn:/internal/fleet/agents?perPage=1000

      {
        "items": [...],
        "total": 9009
      }
    
  • Test with more than 10k agents actually enrolled with horde

  • Change the logic to perform the actions in batches rather than on all agents at once in memory; we might hit a memory limit if we try to do it in one go (see the sketch after this list).
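A possible shape for that batching, as a sketch; the page fetcher is assumed to wrap the PIT query above, and all names here are illustrative:

```ts
interface AgentPage {
  agents: Array<{ id: string }>;
  // sort values of the last hit, used to fetch the next page
  searchAfter?: any[];
}

async function forEachAgentBatch(
  fetchPage: (searchAfter?: any[]) => Promise<AgentPage>,
  applyAction: (agentIds: string[]) => Promise<void>
): Promise<void> {
  let searchAfter: any[] | undefined;
  do {
    const page = await fetchPage(searchAfter);
    if (page.agents.length === 0) break;
    // Act on this batch before fetching the next one, so only one
    // batch of agent documents is held in memory at a time
    await applyAction(page.agents.map((agent) => agent.id));
    searchAfter = page.searchAfter;
  } while (searchAfter !== undefined);
}
```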



@juliaElastic (Contributor, Author) commented:

@elasticmachine merge upstream

@juliaElastic marked this pull request as ready for review on June 21, 2022 10:01
@juliaElastic requested a review from a team as a code owner on June 21, 2022 10:01
@botelastic bot added the Team:Fleet label (Team label for Observability Data Collection Fleet team) on Jun 21, 2022
@elasticmachine (Contributor) commented:

Pinging @elastic/fleet (Team:Fleet)

@juliaElastic (Contributor, Author) commented:

@elasticmachine merge upstream

@juliaElastic (Contributor, Author) commented:

I've come across this issue once before when trying to action >10k agents; it occurred when updating that many documents in Elasticsearch at once.
I think doing the action in batches has to be done as part of this improvement.

```
info [o.e.c.r.a.AllocationService] [ftr] failing shard [FailedShard[routingEntry=[.kibana_task_manager_8.4.0_001][0], node[yASEYyvATfmHN2bdvzeFjA], [P], s[STARTED], a[id=KQUHNOhNQBeFJsGLpuTmzA], message=shard failure, reason [index id[task:reports:monitor] origin[PRIMARY] seq#[10919]], failure=java.nio.file.FileSystemException: /Users/juliabardi/kibana/kibana/.es/cluster-ftr/data/indices/c7blFdO1TW-ydEqxJITOHw/0/index/_4cc.fdm: Too many open files, markAsStale=true]]
      java.nio.file.FileSystemException: /Users/juliabardi/kibana/kibana/.es/cluster-ftr/data/indices/c7blFdO1TW-ydEqxJITOHw/0/index/_4cc.fdm: Too many open files
        at sun.nio.fs.UnixException.translateToIOException(UnixException.java:100) ~[?:?]
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106) ~[?:?]
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111) ~[?:?]
        at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:218) ~[?:?]
        at java.nio.fi
```

```ts
const result: BulkActionResult = {
// ...
let results;
// ...
if (!skipSuccess) {
```

@juliaElastic (Contributor, Author) commented on the diff:

Omitting successful agents from the result to avoid hitting the HTTP response size limit (currently only for actions on more than 10k agents).
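The idea, as a hedged sketch (types and names here are illustrative, not the actual Fleet code):

```ts
interface AgentActionResult {
  id: string;
  success: boolean;
  error?: Error;
}

// When skipSuccess is set (i.e. more than 10k agents were actioned),
// drop successful entries so the response stays within size limits
function buildBulkActionItems(results: AgentActionResult[], skipSuccess: boolean) {
  return results
    .filter((result) => !skipSuccess || !result.success)
    .map((result) => ({
      id: result.id,
      success: result.success,
      ...(result.error ? { error: result.error.message } : {}),
    }));
}
```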

@juliaElastic mentioned this pull request on Jun 22, 2022
@juliaElastic (Contributor, Author) commented Jun 23, 2022

Test results on the 8.3 branch (8.4 doesn't work):

ESS instance:
https://612aed4bdc0641f8a17788adf2b02685.us-west2.gcp.elastic-cloud.com:9243/app/fleet/agents
user: admin
password: 5tL7wVK7PIahRl9qoftvxhYM

  • enrolled 15k agents into Agent policy 1 with horde, all healthy
  • created Agent policy 2
  • selected all 15k agents (leaving out the Fleet Server) and reassigned them all at once
  • the bulk action took 8-11s, all successful (agent IDs are not listed in the response, since the logic skips successful agent IDs when more than 10k agents are actioned)


  • scheduled upgrade took 13s; it shows up in the UI as separate 5k and 10k notifications, probably because the batch logic generated slightly different timestamps


  • aborted upgrade: clicked the upgrade action on all 15k agents, took 10s, all successful, then aborted the upgrade


  • after the abort, some agents went offline; the count slowly decreased as agents came back healthy. After about 1 hour, all agents were healthy again; the last ones offline showed activity within the previous 9-10 minutes. Maybe we can increase the offline timeout to 10m.
  • unenroll all took 11s, all successful

  • force unenroll all took 39s, all successful

Issue with Fleet Server:

  • after enrolling 15k agents, Fleet Server is still healthy, but the Add agent flyout shows the message to Enable Integrations Server
  • after performing the upgrade + abort, Fleet Server goes offline, though it eventually comes back healthy
  • this might be an issue of the container not being sized large enough


  • after unenrolling all agents, Add agent flyout works fine again


@juliaElastic requested a review from kpollich on June 23, 2022 14:19
@kpollich (Member) left a comment:

Changes after review LGTM. Really great set of performance and consistency improvements here. Thank you for all your work on this!

```diff
@@ -65,6 +65,7 @@ export const postBulkAgentsUnenrollHandler: RequestHandler<
   ...agentOptions,
   revoke: request.body?.revoke,
   force: request.body?.force,
+  batchSize: request.body?.batchSize,
```
@kpollich (Member) commented on the diff:
Thank you for clarifying this. I understand the implementation here much better now 👍
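For reference, a request exercising the new parameter might look like this (a sketch of the documented bulk unenroll endpoint; the agent IDs and batch size here are placeholder values):

```
POST kbn:/api/fleet/agents/bulk_unenroll
{
  "agents": ["agent-id-1", "agent-id-2"],
  "revoke": true,
  "batchSize": 5000
}
```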

@kibana-ci (Collaborator) commented:

💚 Build Succeeded

Metrics [docs]

Public APIs missing comments

Total count of every public API that lacks a comment. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats comments for more detailed information.

| id    | before | after | diff |
| ----- | ------ | ----- | ---- |
| fleet | 1308   | 1309  | +1   |

Unknown metric groups

API count

| id    | before | after | diff |
| ----- | ------ | ----- | ---- |
| fleet | 1435   | 1436  | +1   |

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @juliaElastic

@juliaElastic added the v8.3.1 and auto-backport (Deprecated - use backport:version if exact versions are needed) labels on Jun 24, 2022
@juliaElastic merged commit 2732f26 into elastic:main on Jun 24, 2022
kibanamachine pushed a commit that referenced this pull request Jun 24, 2022
* changed getAllAgentsByKuery to query all agents with pit and search_after

* added internal api to test pit query

* changed reassign to work on batches of 10k

* unenroll in batches

* upgrade in batches

* fixed upgrade

* added tests

* cleanup

* revert changes in getAllAgentsByKuery

* renamed perPage to batchSize in bulk actions

* fixed test

* try catch around close pit

Co-authored-by: Kibana Machine <[email protected]>
(cherry picked from commit 2732f26)
@kibanamachine (Contributor) commented:

💚 All backports created successfully

Branch: 8.3 (backport created successfully)

Note: Successful backport PRs will be merged automatically after passing CI.

Questions?

Please refer to the Backport tool documentation

kibanamachine added a commit that referenced this pull request Jun 24, 2022

…35104)

(cherry picked from commit 2732f26)

Co-authored-by: Julia Bardi <[email protected]>
@tylersmalley added the ci:cloud-deploy (Create or update a Cloud deployment) label and removed the ci:deploy-cloud label on Aug 17, 2022
asmith-elastic added a commit to elastic/ingest-docs that referenced this pull request Nov 14, 2024

This change documents the ability to use a batchSize body parameter for the bulk_reassign_agents_request, which was introduced in the following PR: elastic/kibana#134565

This will help align the documentation with the Kibana API docs: https://www.elastic.co/guide/en/fleet/current/fleet-apis.html#bulkReassignAgents
asmith-elastic added a commit to elastic/ingest-docs that referenced this pull request Nov 18, 2024 (#1465)
mergify bot pushed commits to elastic/ingest-docs that referenced this pull request Nov 18, 2024 (#1465, cherry picked from commit 421a653)
kilfoyle added commits to elastic/ingest-docs that referenced this pull request Nov 18, 2024 (#1476, #1477; cherry picked from commit 421a653; co-authored by Austin Smith and David Kilfoyle)
Labels: auto-backport (Deprecated - use backport:version if exact versions are needed), ci:cloud-deploy (Create or update a Cloud deployment), release_note:fix, Team:Fleet (Team label for Observability Data Collection Fleet team), v8.3.1, v8.4.0