[BUG] Security privilege evaluation for wildcard index pattern can stall network threads (HTTP/Transport) #5022
Comments
cc @nibix (for visibility) |
It's likely that this is fixed by #4380. @Bukhtawar how many indices and roles do you have on the affected cluster? |
Thanks @nibix I see two problems
For 1: with the '*' wildcard pattern we could have users with 10k indices and hundreds of roles. Currently this takes more than 300s without the optimisation. If the optimisation reduces this to less than a couple of seconds, I am good. However, if the time is proportional to the number of indices, which could grow to tens of thousands in the future, and privilege evaluation then takes tens of seconds, I strongly feel this task needs to be offloaded from the network threads. It would be good to see some benchmarks for this. |
The optimization yields constant time for all cases except index expressions with patterns other than the full wildcard. See this for benchmarks and before-after comparisons: https://eliatra.com/blog/performance-improvements-for-the-access-control-layer-of-opensearch/ By "network threads", do you mean the threads processing the transport requests? I do not think that offloading privilege evaluation from the transport threads will be possible without major conceptual changes. At the moment, the access control concept is tightly coupled to the execution of transport requests. |
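To illustrate why a full-wildcard grant can be answered without looking at the index list at all, here is a minimal Java sketch. It is not the code from #4380; the names RoleSketch, WildcardPrivilegeSketch, naiveAllowed and fastAllowed are hypothetical, and the pattern matcher only understands "*" and a trailing "*" prefix, which is an assumption made to keep the example short.

```java
import java.util.List;
import java.util.Set;

// Hypothetical, simplified role model: a set of actions granted on a list of index patterns.
final class RoleSketch {
    final List<String> indexPatterns;
    final Set<String> actions;
    // Precomputed once when the role is built, so requests do not rescan the patterns.
    final boolean grantsAllIndices;

    RoleSketch(List<String> indexPatterns, Set<String> actions) {
        this.indexPatterns = indexPatterns;
        this.actions = actions;
        this.grantsAllIndices = indexPatterns.contains("*");
    }
}

final class WildcardPrivilegeSketch {
    // Naive check: every requested index must be matched by some role.
    // Cost grows with indices x roles x patterns.
    static boolean naiveAllowed(List<RoleSketch> roles, String action, List<String> indices) {
        for (String index : indices) {
            boolean covered = false;
            for (RoleSketch role : roles) {
                if (!role.actions.contains(action)) {
                    continue;
                }
                for (String pattern : role.indexPatterns) {
                    if (matches(pattern, index)) {
                        covered = true;
                        break;
                    }
                }
                if (covered) {
                    break;
                }
            }
            if (!covered) {
                return false;
            }
        }
        return true;
    }

    // Shortcut: if any role grants the action on "*", the answer does not depend
    // on how many indices were requested. Other patterns fall back to the naive path.
    static boolean fastAllowed(List<RoleSketch> roles, String action, List<String> indices) {
        for (RoleSketch role : roles) {
            if (role.grantsAllIndices && role.actions.contains(action)) {
                return true;
            }
        }
        return naiveAllowed(roles, action, indices);
    }

    // Assumption: only "*" and "prefix*" patterns are supported in this sketch.
    static boolean matches(String pattern, String index) {
        if (pattern.equals("*")) {
            return true;
        }
        if (pattern.endsWith("*")) {
            return index.startsWith(pattern.substring(0, pattern.length() - 1));
        }
        return pattern.equals(index);
    }
}
```

With the shortcut, the cost for a role that grants the action on "*" depends only on the number of roles, not on the number of indices; requests that rely on narrower patterns still pay the per-index cost, which matches the exception noted above.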
I have yet to look at the benchmarks. However, if the worst-case privilege evaluation takes anywhere from 5 seconds to tens of seconds, we don't have a choice but to offload it; otherwise all critical requests would either slow down or stall.
In this context I am referring to the http worker threads that are handling the REST requests. I do feel that forking in the transport action class in this case the
But more generically this needs to sit at TransportActions around these lines. Alternatively we can introduce a transport filter to apply this to selective actions. |
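The general idea of forking a potentially slow check off the calling thread can be sketched with plain JDK primitives. This is only an illustration under stated assumptions: OffloadSketch and evaluateAsync are hypothetical names, the executor stands in for whatever dedicated pool would be chosen, and none of this reflects the actual OpenSearch transport action or action filter APIs.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

final class OffloadSketch {
    // Runs the (possibly expensive) privilege check on the given executor and
    // returns immediately with a future, so the calling thread (for example an
    // event loop thread) stays free to serve other requests.
    static CompletableFuture<Boolean> evaluateAsync(Supplier<Boolean> privilegeCheck,
                                                    ExecutorService executor) {
        return CompletableFuture.supplyAsync(privilegeCheck, executor);
    }

    public static void main(String[] args) {
        // Hypothetical dedicated pool for authorization work.
        ExecutorService authzPool = Executors.newFixedThreadPool(2);
        try {
            CompletableFuture<Boolean> result = evaluateAsync(() -> {
                // Placeholder for the real evaluation over indices and roles.
                return Boolean.TRUE;
            }, authzPool);

            // Continue the request only once the decision is available,
            // without blocking the caller in the meantime.
            result.thenAccept(allowed ->
                System.out.println(allowed ? "proceed with request" : "reject request"))
                  .join(); // join() is used here only so the demo exits cleanly
        } finally {
            authzPool.shutdown();
        }
    }
}
```

The trade-off is that the continuation now runs on a different thread, so any per-request context that the access control layer keeps on the calling thread would have to be carried across, which is part of the coupling concern raised earlier in the thread.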
I'd expect the optimized privilege evaluation code to run within milliseconds for all relevant cases. |
[Triage] I noticed that ISM is also applying an action filter
When is this filter from ISM applied exactly? Edit: This filter is not adding considerable time to the request. Based on the line in ISM it is exiting early and continuing with the rest of the chain of action filters. |
What is the bug?
A clear and concise description of the bug.
Observed during an OpenSearch Dashboards call: the Discover page, trying to list the * index pattern, triggers a */_field_caps API call, which performs fine-grained privilege evaluation. When the number of indices is large and/or multiple roles are configured, privilege evaluation slows down. If the HTTP network threads, aka event loop threads (meant for async IO, i.e. reading from and writing to the socket channel), perform CPU- or IO-intensive work, other requests bound to the same socket may stall (since wildcard privilege evaluation is a function of the number of indices and roles). This might manifest as request timeouts, delays and, in the worst case, failing external health checks.
Even a _bulk request takes roughly between a few hundred milliseconds and a few seconds, which results in elevated latencies. Similarly, if this evaluation happens on transport threads, they can also get stalled.
How can one reproduce the bug?
Steps to reproduce the behavior:
*/_field_caps API calls using one of those user roles
What is the expected behavior?
Proposal
What is your host/environment?
Do you have any screenshots?
If applicable, add screenshots to help explain your problem.
Do you have any additional context?
Add any other context about the problem.