Remove cluster setting for native memory circuit breaker #1607

jmazanec15 · 2024-04-10T15:36:20Z

Description

Right now, as part of our k-NN circuit breaker and memory management service, we update a cluster setting knn.circuit_breaker.triggered that is then used to block operations such as ingest (ref). The setting allows there to be one value across the whole cluster - so, if one node circuit breaker is tripped, all ingestion is blocked.

I dont think this coordination across the cluster is necessary. It ends up making the circuit breaking logic more complex than it needs to be. It is inconsistent with OpenSearch circuit breaking, which only takes into account the node whose circuit breaker is tripped.

Proposal

Im proposing as a first step of #1582, we remove (or deprecate) this setting and make the circuit breaking logic not require cluster synchronization.

navneet1v · 2024-04-12T18:29:49Z

We can move to a node level CB but removing this CB can this lead to OOM killer on some nodes? because they ran out of memory?

jmazanec15 · 2024-04-15T16:26:08Z

@navneet1v idea is that individual nodes would reject requests instead of a node rejecting the request because some other node is under duress.

navneet1v · 2024-04-15T17:49:03Z

@navneet1v idea is that individual nodes would reject requests instead of a node rejecting the request because some other node is under duress.

This can lead to partial success of bulk requests, which I am worried about. The main worry is it the right user experience?

jmazanec15 · 2024-04-16T21:31:28Z

This can lead to partial success of bulk requests, which I am worried about.

Is partial success worse than complete failure though?

The main worry is it the right user experience?

Yes, issue is around user experience.

heemin32 · 2024-07-25T18:10:38Z

I think the ultimate purpose of circuit breaker is to avoid OOM. To achieve that, both cluster level and node level will work.

For cluster level, the downside is 1. code complexity as @jmazanec15 mentioned, 2. Less efficient use of resource; If user create very large index in single shard and trigger the CB, it will block the use of knn field in all the other nodes in the cluster.

For node level, the downside is 1. the request will fail depending on which node the request will reach to as @navneet1v mentioned.

For the downside for node level CB, as it is the same behavior with OpenSearch core with other requests(node level CB), it should be acceptable as long as we provide node level stats to indicate which node has triggered CB.

From backward compatibility stand point, failing entire request -> failing partial requests looks okay.

jmazanec15 · 2024-07-25T18:19:43Z

From backward compatibility stand point, failing entire request -> failing partial requests looks okay.

For bwc, I think my concern is more around the public knn.circuit_breaker.triggered setting. My concern with removing is that users may read this value from the cluster settings to check health.

heemin32 · 2024-07-25T18:48:09Z

From backward compatibility stand point, failing entire request -> failing partial requests looks okay.

For bwc, I think my concern is more around the public knn.circuit_breaker.triggered setting. My concern with removing is that users may read this value from the cluster settings to check health.

One option could be, we keep update the value even though the actual CB is handled in node level.

Set it as true when a node trigger CB.
Set it as false when a node reset CB and all other nodes haven't triggered CB.

Not sure with this, we can achieve the goal of reducing code complexity.

jmazanec15 · 2024-07-25T19:04:15Z

I think we would need to support it. But I think we could deprecate the setting and then pull out some of the sweeper jobs into separate classes to simplify things.

We could move the cluster setting untrigger into its own class (https://github.com/opensearch-project/k-NN/blob/2.15.0.0/src/main/java/org/opensearch/knn/index/KNNCircuitBreaker.java#L74-L107)
We could move the local unsetter into its own class (https://github.com/opensearch-project/k-NN/blob/2.15.0.0/src/main/java/org/opensearch/knn/index/KNNCircuitBreaker.java#L60-L71)
We simplify the interface for KNNCircuitBreaker to be stateless and just have a method that checks if the cache's capacity has been reached or not. Then, throughout code, check here if we need to check anything
Remove setting updater on capacity miss in cache manager and have separate class handle that (https://github.com/opensearch-project/k-NN/blob/main/src/main/java/org/opensearch/knn/index/memory/NativeMemoryCacheManager.java#L388)

With these changes, I think it would simiplify circuit breaker logic and memory logic on the node. It would remove circular dependencies. And it would give a deprecation path for the setting.

github-actions bot added the untriaged label Apr 10, 2024

jmazanec15 added Enhancements Increases software capabilities beyond original client specifications and removed untriaged labels Apr 10, 2024

vamshin assigned ryanbogan Jul 25, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove cluster setting for native memory circuit breaker #1607

Remove cluster setting for native memory circuit breaker #1607

jmazanec15 commented Apr 10, 2024

navneet1v commented Apr 12, 2024

jmazanec15 commented Apr 15, 2024

navneet1v commented Apr 15, 2024

jmazanec15 commented Apr 16, 2024

heemin32 commented Jul 25, 2024 •

edited

Loading

jmazanec15 commented Jul 25, 2024

heemin32 commented Jul 25, 2024

jmazanec15 commented Jul 25, 2024

Remove cluster setting for native memory circuit breaker #1607

Remove cluster setting for native memory circuit breaker #1607

Comments

jmazanec15 commented Apr 10, 2024

Description

Proposal

navneet1v commented Apr 12, 2024

jmazanec15 commented Apr 15, 2024

navneet1v commented Apr 15, 2024

jmazanec15 commented Apr 16, 2024

heemin32 commented Jul 25, 2024 • edited Loading

jmazanec15 commented Jul 25, 2024

heemin32 commented Jul 25, 2024

jmazanec15 commented Jul 25, 2024

heemin32 commented Jul 25, 2024 •

edited

Loading