-
Notifications
You must be signed in to change notification settings - Fork 141
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove cluster setting for native memory circuit breaker #1607
Comments
We can move to a node level CB but removing this CB can this lead to OOM killer on some nodes? because they ran out of memory? |
@navneet1v idea is that individual nodes would reject requests instead of a node rejecting the request because some other node is under duress. |
This can lead to partial success of bulk requests, which I am worried about. The main worry is it the right user experience? |
Is partial success worse than complete failure though?
Yes, issue is around user experience. |
I think the ultimate purpose of circuit breaker is to avoid OOM. To achieve that, both cluster level and node level will work. For cluster level, the downside is 1. code complexity as @jmazanec15 mentioned, 2. Less efficient use of resource; If user create very large index in single shard and trigger the CB, it will block the use of knn field in all the other nodes in the cluster. For node level, the downside is 1. the request will fail depending on which node the request will reach to as @navneet1v mentioned. For the downside for node level CB, as it is the same behavior with OpenSearch core with other requests(node level CB), it should be acceptable as long as we provide node level stats to indicate which node has triggered CB. From backward compatibility stand point, failing entire request -> failing partial requests looks okay. |
For bwc, I think my concern is more around the public knn.circuit_breaker.triggered setting. My concern with removing is that users may read this value from the cluster settings to check health. |
One option could be, we keep update the value even though the actual CB is handled in node level.
Not sure with this, we can achieve the goal of reducing code complexity. |
I think we would need to support it. But I think we could deprecate the setting and then pull out some of the sweeper jobs into separate classes to simplify things.
With these changes, I think it would simiplify circuit breaker logic and memory logic on the node. It would remove circular dependencies. And it would give a deprecation path for the setting. |
Description
Right now, as part of our k-NN circuit breaker and memory management service, we update a cluster setting knn.circuit_breaker.triggered that is then used to block operations such as ingest (ref). The setting allows there to be one value across the whole cluster - so, if one node circuit breaker is tripped, all ingestion is blocked.
I dont think this coordination across the cluster is necessary. It ends up making the circuit breaking logic more complex than it needs to be. It is inconsistent with OpenSearch circuit breaking, which only takes into account the node whose circuit breaker is tripped.
Proposal
Im proposing as a first step of #1582, we remove (or deprecate) this setting and make the circuit breaking logic not require cluster synchronization.
The text was updated successfully, but these errors were encountered: