Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SOLR-17158 Terminate distributed processing quickly when query limit is reached - Initial impl #2379

Closed
wants to merge 2 commits into from

Conversation

gus-asf
Copy link
Contributor

@gus-asf gus-asf commented Mar 31, 2024

still needs better tests, and it is still incorrectly caching empty results. Also, a few TODOs remain.

https://issues.apache.org/jira/browse/SOLR-17158

@gus-asf gus-asf changed the title SOLR-17158 Initial impl SOLR-17158 Terminate distributed processing quickly when query limit is reached - Initial impl Mar 31, 2024
@gus-asf gus-asf requested a review from sigram April 1, 2024 03:44
@@ -227,18 +243,27 @@ public ShardResponse takeCompletedOrError() {

private ShardResponse take(boolean bailOnError) {
try {
// although nothing in this class guarantees that pending has been incremented to the total
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@CaoManhDat if you can double check my reading of this code and let me know if this comment is inaccurate in any way I'd appreciate it. (I see your name in the git blame a bunch here)

@gus-asf
Copy link
Contributor Author

gus-asf commented Apr 1, 2024

One thing folks may want to review is the addition of synchronization around manipulations of HttpShardHandler#responseCancelableMap which were necessary to avoid CME such as:

{
  "responseHeader":{
    "zkConnected":true,
    "status":500,
    "QTime":18,
    "params":{
      "q":"document",
      "indent":"true",
      "q.op":"OR",
      "timeAllowed":"5",
      "allowPartialResults":"false",
      "useParams":"",
      "_":"1711320289606"    }  },
  "error":{
    "trace":"java.util.ConcurrentModificationException
at java.base/java.util.HashMap$HashIterator.nextNode(HashMap.java:1597)
at java.base/java.util.HashMap$ValueIterator.next(HashMap.java:1625)
at org.apache.solr.handler.component.HttpShardHandler.cancelAll(HttpShardHandler.java:257)
at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:577)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:238)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:2886)
at org.apache.solr.servlet.HttpSolrCall.executeCoreRequest(HttpSolrCall.java:876)
at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:560)
at org.apache.solr.servlet.SolrDispatchFilter.dispatch(SolrDispatchFilter.java:254)
at org.apache.solr.servlet.SolrDispatchFilter.lambda$doFilter$0(SolrDispatchFilter.java:215)
at org.apache.solr.servlet.ServletUtils.traceHttpRequestExecution2(ServletUtils.java:247)
at org.apache.solr.servlet.ServletUtils.rateLimitRequest(ServletUtils.java:211)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:209)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:192)
at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:210)
at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1635)

This happens because the code is iterating an unsynchronized hash map when the request thread signals a cancel operation. I chose to also enclose pending.incrementAndGet() or decrementAndGet() and responses.take() and even though they are by themselves probablly safe, there's a loop on the value of pending > 0 and these three operations are logically linked so it seemed a bit dangerous to leave them independent.

Also I noticed that ShardeResponse.code is being set, but the IDE flags it as "assigned but never read", and against the recommendation of the previously existing comments it's not volatile...

@@ -228,6 +228,7 @@ public interface CommonParams {
METRICS_PATH);
String APISPEC_LOCATION = "apispec/";
String INTROSPECT = "/_introspect";
String ALLOW_PARTIAL_RESULTS = "allowPartialResults";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be replaced with (or used instead of) CommonParams.PARTIAL_RESULTS, and the associated docs in the ref guide needs to be changed accordingly.

These two parameters carry the exactly same meaning and serve the same purpose - i.e. that returning partial results is either acceptable or not. For QueryLimits if it's not acceptable (partialResults=false, whichever way this value was set) and the query limits are exceeded then partial results will be discarded by throwing an exception inside components (which is implemented in QueryLimits.maybeExitWithPartialResults). And here exactly the same behavior is expected, consistent with CommonParams.PARTIAL_RESULTS.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, will check/fix

return;
}
// all variables that set inside this listener must be at least volatile
responseCancellableMap.put(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe if we used a ConcurrentHashMap we could avoid locking here altogether?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't feel comfortable with that solution because this map is updated while looping on an Atomic integer pending and also is updated immediately after responses.take() all of which appear to need to happen as a unit. The prior code wasn't touching it from the child threads, but since I'm changing that I want to be sure that all of these operations happen as a unit to avoid potentially weird cases where the loop fails to exit but then the map is empty etc...

gus-asf added 2 commits April 10, 2024 19:38
…ly caching empty results. Also, a few TODOs remain.
Still need to tweak result attribute again, as this version is too
complicated, and though I did write upgrade notes those will change
(and hopefully get simpler) after said tweak and the main docs still
need updated.
@github-actions github-actions bot added documentation Improvements or additions to documentation configs module:ltr labels Apr 11, 2024
Comment on lines +387 to +389
case "allowPartialResultsDefault":
builder.setAllowPartialResultsDefault(it.boolVal(true));
break;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer you not bother with all the machinery in solr.xml/NodeConfig and instead have a simple EnvUtils (env or sys prop) to enable. It's just a boolean; doesn't have any interesting config to it. Yes there are other booleans here but I think we Solr maintainers should consider how NodeConfig might systematically / dynamically understand primitive values (boolean, integer, ...) and apply to EnvUtils automatically without having to touch the NodeConfig class (which is kind of a pain; that builder!). For example imagine "solr.search.partialResults" being settable as a system property, or also settable in solr.xml automatically via "searchPartialResults".

CC @janhoy

Another take on this is that, shouldn't the user simply go to their solrconfig.xml and set a default partialResults like we all do for miscellaneous things for search?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, IMHO having a per-collection setting instead of a global sysprop is more flexible (and you can still change it globally using property substitution).

Copy link

This PR had no visible activity in the past 60 days, labeling it as stale. Any new activity will remove the stale label. To attract more reviewers, please tag someone or notify the [email protected] mailing list. Thank you for your contribution!

@github-actions github-actions bot added the stale PR not updated in 60 days label Jun 19, 2024
Copy link
Contributor

@dsmiley dsmiley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, this PR touches 33 files, which is super surprising compared to #2493
Can this be simplified or split up?

Comment on lines +68 to +73
Boolean userParamAllowPartial = params.getBool(CommonParams.ALLOW_PARTIAL_RESULTS);
if (userParamAllowPartial != null) {
return !userParamAllowPartial;
} else {
return ALLOW_PARTIAL_RESULTS_DEFAULT;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we use the overloaded version of getBool that provides a default? We'd then have a one-liner here.

@github-actions github-actions bot removed the stale PR not updated in 60 days label Aug 2, 2024
@gus-asf
Copy link
Contributor Author

gus-asf commented Aug 16, 2024

Wow, this PR touches 33 files, which is super surprising compared to #2493 Can this be simplified or split up?

Ok finally back to this. Sorry about the delay, the merge of a multi-threading implementation that broke (or prevented use of) features I had just released co-opted most of my available Solr time...

@gus-asf
Copy link
Contributor Author

gus-asf commented Aug 26, 2024

Reworked version compatible with current main, and touching fewer files here: #2666

@gus-asf gus-asf closed this Aug 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants