-
Notifications
You must be signed in to change notification settings - Fork 690
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SOLR-17192: Add "field-limiting" URP to catch ill-designed schemas #2395
SOLR-17192: Add "field-limiting" URP to catch ill-designed schemas #2395
Conversation
The URP, `NumFieldLimitingUpdateRequestProcessor`, blocks all update requests that go through `processAdd` if the core exceeds a configurable threshold of fields. As the blocking only triggers _after_ the limit has been exceeded, it doesn't technically prevent users from getting into the bad state. But it forces a discussion around schema design, at which point users may either raise the limit, disable the URP, or go back to the drawing-board on their setup.
Still TBD - additional tests, ref-guide docs. |
In a sample solrconfig.xml, configuration for the proposed URP would look like:
|
Not currently passing; mostly just committing to transfer work between workstations.
This reverts commit 5512246.
This should all be ready to go: I've added unit and integration tests, One big question I have is: should we add this into the "_default" configset's default URP-chain? At least for 10.x? I'd argue we should. Safety guardrails like this one aren't any good unless users enable them, and having it be enabled by default is the best way to do that. We might get some complaints from folks who run into this during an inplace 9->10 upgrade, but:
I'd love a sanity-check/+1 from someone else though before I make this enabled by default on |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some questions/comments...
solr/core/src/java/org/apache/solr/update/processor/NumFieldLimitingUpdateRequestProcessor.java
Outdated
Show resolved
Hide resolved
...src/java/org/apache/solr/update/processor/NumFieldLimitingUpdateRequestProcessorFactory.java
Show resolved
Hide resolved
solr/core/src/java/org/apache/solr/update/processor/NumFieldsMonitor.java
Outdated
Show resolved
Hide resolved
.../org/apache/solr/update/processor/NumFieldLimitingUpdateRequestProcessorIntegrationTest.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seeing the non-trivial checks determine the doc will be indexed in this core looks painful to me. If we have an URP, document that it needs to be placed to the right of DistributedUpdateProcessor and/or we enhance Solr to make sure that certain URPs cannot be misplaced relative to that pivotal URP.
I like the SolrEventListener for commits part. I feel like this should be its own component someone should be able to configure if they want, like for monitoring purposes but don't necessarily want to block adds. To block all indexing, it could set SolrCore.readOnly = true
and be done with it; no URP.
I see more test configs here; these are a heavy burden for our project to maintain. Please try to write the test to generate the configuration so that the test and its config is self-contained. You can see an example of this in org.apache.solr.search.TestThinCache#setupSolrHome
Sorry if I rained on your parade, attempting to get this in by EOW. I like the idea of having checks like this by default but I'd like Solr to have such built-in plugins simply be registered automatically unless you set some env/sys-prop to disable, like "solr.core.eventListener.fieldLimit.enabled". We already register a ton of query parsers even though nobody asked for them :-). I think that should be the model for built-in plugins so that we can continue slowly to reduce what's in a solrconfig.xml yet also allow someone to disable. |
solr/core/src/java/org/apache/solr/update/processor/NumFieldLimitingUpdateRequestProcessor.java
Outdated
Show resolved
Hide resolved
...src/java/org/apache/solr/update/processor/NumFieldLimitingUpdateRequestProcessorFactory.java
Show resolved
Hide resolved
solr/core/src/java/org/apache/solr/update/processor/NumFieldsMonitor.java
Outdated
Show resolved
Hide resolved
solr/core/src/test-files/solr/configsets/cloud-minimal-field-limiting/conf/solrconfig.xml
Outdated
Show resolved
Hide resolved
...test/org/apache/solr/update/processor/NumFieldLimitingUpdateRequestProcessorFactoryTest.java
Show resolved
Hide resolved
.../org/apache/solr/update/processor/NumFieldLimitingUpdateRequestProcessorIntegrationTest.java
Outdated
Show resolved
Hide resolved
Oh no worries; there's no particular rush. That was just how long I was planning to wait if no one commented in the interim. I'm glad you both chimed in! Though tbh I can't quite tell where you are on the PR approach overall:
Please let me know! One last reply inline:
IMO That's what I like about the URP approach - it gives us more granularity into the different types of "/update" operations and allows us to get a nice (or at least, nicer) error message back to users. |
I don't like the complexity in this URP relating to tolerance of where the URP is placed in the chain; I'd feel better if the URP were simplified from that concern and we expect the user to place it at an appropriate spot. We don't have complexity like that elsewhere and/or I argue it's not the right approach. I sympathize with why an URP feels right. Okay. On each |
Yeah I understand. I still feel a little reluctant I guess, but I'm likely being paranoid. It's harder to feel good about the "win" in making things safer for users if the change opens up another trappy state when misconfigured. But I'll take out the leader-check for now. We can always re-evaluate for a subsequent release if we see people getting bitten by this. And in any case there are other ways to mitigate the risk of misconfig (e.g. putting into the "_default" config for 10.0 so users needn't tweak things). |
One error case that removing the leader logic will introduce is that due to segment merging, etc. not all replicas will necessarily agree on the number of fields. So there's a chance that when NRT leaders forward the doc along to replicas, that it might get a mix of success/failure responses back. Just noting this here as a reminder to explain this case in the little blurb the URP gets in the ref guide. |
solr/core/src/java/org/apache/solr/update/processor/NumFieldLimitingUpdateRequestProcessor.java
Outdated
Show resolved
Hide resolved
...src/java/org/apache/solr/update/processor/NumFieldLimitingUpdateRequestProcessorFactory.java
Show resolved
Hide resolved
Alright, I've removed a lot of the leadership and shard checking logic based on review feedback. I think this should be ready to go, unless you have remaining concerns @dsmiley ? I've also added the new URP to the URP-chain defined in the "_default" configset ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lots of feedback... some is irrelevant depending on what path we take. Especially, read my comment in the factory first.
import org.slf4j.Logger; | ||
import org.slf4j.LoggerFactory; | ||
|
||
public class NumFieldLimitingUpdateRequestProcessor extends UpdateRequestProcessor { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
javadoc needed, even if one line linking to a factory for further info.
Or don't make public :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missed this earlier; done!
...src/java/org/apache/solr/update/processor/NumFieldLimitingUpdateRequestProcessorFactory.java
Outdated
Show resolved
Hide resolved
solr/core/src/java/org/apache/solr/update/processor/NumFieldsMonitor.java
Outdated
Show resolved
Hide resolved
.../org/apache/solr/update/processor/NumFieldLimitingUpdateRequestProcessorIntegrationTest.java
Outdated
Show resolved
Hide resolved
+ | ||
Solr performance can degrade and even become unstable if cores accumulate too many (e.g. more than 500) fields. The "NumFieldLimiting" URP is offered as a safeguard that helps users notice potentially-dangerous schema design and/or mis-use of dynamic fields, before these performance and stability problems would manifest. | ||
Note that the field count an index reports can be influenced by deleted (but not yet purged) documents, and may vary from replica to replica. | ||
In order to avoid these sort of discrepancies between replicas, use of this URP should almost always precede DistributedUpdateProcessor in when running in SolrCloud mode. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This advice surprised me. I guess why not, as only the first node receiving the batch will evaluate this logic. It may/may-not be the leader but whatever.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, the URP certainly can't go after DistributedUpdateProcessor for the reason laid out in the docs - you don't want some subset of your replicas rejecting certain docs because their segments have them disagree on the current field count.
Putting it before DUP also has downsides, as you pointed out. The field count check is going to happen on that somewhat arbitrary "first node", instead of a the more authoritative "leader". And that adds wonkiness - if replicas disagree on field-count, this means you could get into situations where a given update could either succeed or fail depending on whether it gets lucky or not with its "first node". (Hopefully this isn't a big deal though, since the URP isn't trying to be an exact limit - but it's still ugly and potentially confusing.)
This was what earlier versions of this PR were trying to avoid by gating the field-count check so that it only happened for docs that were being processed by their shard leader.
<updateProcessor class="solr.NumFieldLimitingUpdateRequestProcessorFactory" name="max-fields"> | ||
<int name="maxFields">${solr.test.maxFields:1234}</int> | ||
</updateProcessor> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why declare this separately instead of in the one chain that uses it? It's not obvious, when looking at the chain, that there is this additional processor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tbh I wasn't aware that you could declare processors (that require config) directly in a processor chain. I can't find an example of that. Looking through several other solrconfig.xml files for URPs that require configuration...they all look similar to what I've done here afaict.
(See the _default configset Solr ships and the default configset in the 'core' modules tests...both look similar to the example here)
Maybe I'm misunderstanding you or missing something though - can you point to an example of how you'd prefer to see this declared, and a reason it's preferable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having no-config vs config doesn't conceptually change where the element can be placed.
ex:
<updateProcessor blah lbah />
vs
<updateProcessor blah blah>
<int blah blah/>
</updateProcessor>
Can be in the same places.
I went to the ref guide's page on URPs and observe the first example shows a chain with several URPs, one of which has config. I think it's natural to place URPs inline in the chain. Separating it out is for composability if there are multiple chains and URPs and you want to avoid possible repetition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've updated this to be in line with your request.
Though I'm still curious if there's a reason to prefer this going forward, or whether it's subjective. I've not paid much attention historically to how these appear in our configsets, but naively I'd assume that more composability is better.
solr/core/src/java/org/apache/solr/update/processor/NumFieldsMonitor.java
Outdated
Show resolved
Hide resolved
solr/core/src/java/org/apache/solr/update/processor/NumFieldsMonitor.java
Outdated
Show resolved
Hide resolved
|
||
// TODO Should we skip to the next URP if running in SolrCloud and *not* a leader? | ||
return new NumFieldLimitingUpdateRequestProcessor( | ||
req, next, maximumFields, numFieldsMonitor.getCurrentNumFields(), warnOnly); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Woah, okay, I wasn't expecting this. The numFieldsMonitor.getCurrentNumFields
is only read on load of the URP; it's not per-doc. If that's adequate (I suppose; this is just a soft limit) then this can be much simpler -- just go read the SolrIndexSearcher right now to get the field count! No NumFieldsMonitor needed. Furthermore the URP could be renamed to simply "BlockNewDocsUrp" and is small enough can be an inner class really.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally, I find the NumFieldsMonitor approach preferable to using a SIS here directly:
- it allows the URP and factory to be abstracted away from exactly how the currentNumFields is calculated.
- it allows us to make as few (potentially expensive) calls to
SIS.getFieldInfos()
as possible. (I know we could also avoid calls by doing things like hanging on to a searcher reference and checking whether the searcher has changed, but IMO this is at least as complex as the NumFieldsMonitor approach.) - as epugh pointed out in an earlier comment, NumFieldsMonitor seems like it could be applied in other places.
If there are other votes I'm happy to go with whatever the majority considers simpler. But otherwise I'll stick with NumFieldsMonitor approach.
Furthermore the URP could be renamed to simply "BlockNewDocsUrp"
I don't love "NumFieldLimitingURP" as a name, but I like that the name itself suggests why a user would want to use it. "Want to cap the fields in your index? Then use this thing..."
BlockNewDocs captures what the URP does, but doesn't capture the "why", which makes it less discoverable to users IMO (And you could also interpret it as distinguishing between "new" and "existing" docs, which isn't the case.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
factory to be abstracted away from exactly how the currentNumFields is calculated
Unless that abstraction shows utility here, I don't think it's worth another class.
it allows us to make as few (potentially expensive) calls to SIS.getFieldInfos()
That's not expensive! req.getSearcher().getFieldInfos()
It gets a searcher, then returns a value in a field. Furthermore you'll do it just once on the indexing stream request.
as epugh pointed out in an earlier comment, NumFieldsMonitor seems like it could be applied in other places.
Sorry, what?
I didn't mean rename the factory (it's well named!). I think the URP should be a small inner class of this factory named whatever makes sense. Not public.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be clear, I am proposing that the whole NumFieldsMonitor thing (and how we init it, etc.) be replaced by no more than one line of code. :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
replaced by no more than one line of code
NumFieldsMonitor is more LOC and more Java classes, but IMO both of those are red-herrings in terms of complexity and maintenance burden. My subjective opinion is that an unfamiliar reader would have an easier time grokking NumFieldsMonitor than they would parsing through the reference-checking logic that (as I understand it) the alternative would entail.
That said I'm surprised to hear your confidence in "one line". My understanding of what you proposed, based on your earlier comment here, sounded like a good bit more than that. It'd need to: hold onto a weak-reference to the previous searcher, do some reference equality check on each invocation, fetch the updated number of fields, update the weak-ref if the current searcher is now different, etc. If using withSearcher()
there's IOException handling to do as well, etc.
If I'm misunderstanding your suggestion and overcomplicating it in my head, and it's really just one line, please share 😄 . I'm happy to take a commit on this branch if you've got a minute!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Multiple ideas have been discussed, so it's a bit confusing. My epiphany from your recent changes is that we only need to know how many fields exist at the time that the URP is instantiated, unlike the first incarnation. This ought to be a major simplifier and it nullifies my idea around wanting a WeakReference to a SolrIndexSearcher. May I do the change to your PR here in one commit that you might revert if you don't like it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Absolutely, that'd be great. Thanks @dsmiley !
EDIT - oops, I see I responded here too late and you've already created a PR against my fork. Will go take a look at that. In future, you're welcome to push directly to my dev branch, I don't mind. Whatever you prefer.
See gerlowskija#5 for a commit that makes a large simplification:
The single line I refer to earlier is as follows in my PR to your PR: |
* No NumFieldsMonitor * No named URP; anonymous class is fine * URP need not be inserted in the happy path!
Thanks for the change David - apologies for the back and forth trying to understand what you were suggesting. I've merged your PR into this branch. With that wrapped up, I'm going to re-run tests/check/precommit and aim to merge this afternoon. |
Cool; please remember to mention me in CHANGES.txt. And thanks for adding this protection. |
…pache#2395) The URPF, `NumFieldLimitingUpdateRequestProcessorFactory`, blocks all update requests that go through `processAdd` if the core exceeds a configurable threshold of fields. The factory accepts two parameters: `maxFields` is a required integer representing the maximum field threshold, and `warnOnly` is an optional boolean that (when enabled) has the URP chain log warnings instead of blocking updates. The factory is included in the default configset, with warnOnly=true and maxFields=1000. ('warnOnly' will be false starting in the 10.0 release) --------- Co-authored-by: David Smiley <[email protected]>
…2395) The URPF, `NumFieldLimitingUpdateRequestProcessorFactory`, blocks all update requests that go through `processAdd` if the core exceeds a configurable threshold of fields. The factory accepts two parameters: `maxFields` is a required integer representing the maximum field threshold, and `warnOnly` is an optional boolean that (when enabled) has the URP chain log warnings instead of blocking updates. The factory is included in the default configset, with warnOnly=true and maxFields=1000. ('warnOnly' will be false starting in the 10.0 release) --------- Co-authored-by: David Smiley <[email protected]>
…pache#2395) The URPF, `NumFieldLimitingUpdateRequestProcessorFactory`, blocks all update requests that go through `processAdd` if the core exceeds a configurable threshold of fields. The factory accepts two parameters: `maxFields` is a required integer representing the maximum field threshold, and `warnOnly` is an optional boolean that (when enabled) has the URP chain log warnings instead of blocking updates. The factory is included in the default configset, with warnOnly=false and maxFields=1000. (More lenient settings will be used on branch_9x) --------- Co-authored-by: David Smiley <[email protected]>
…pache#2395) The URPF, `NumFieldLimitingUpdateRequestProcessorFactory`, blocks all update requests that go through `processAdd` if the core exceeds a configurable threshold of fields. The factory accepts two parameters: `maxFields` is a required integer representing the maximum field threshold, and `warnOnly` is an optional boolean that (when enabled) has the URP chain log warnings instead of blocking updates. The factory is included in the default configset, with warnOnly=false and maxFields=1000. (More lenient settings will be used on branch_9x) --------- Co-authored-by: David Smiley <[email protected]>
https://issues.apache.org/jira/browse/SOLR-17192
Description
Solr isn't infinitely scalable when it comes to the number of fields in each core/collection. Most deployments start to experience problems any time a core has upwards of a few hundred fields. Usually this doesn't exhibit itself right away. instead waiting until segment-merge or some other time to rear its head.
Despite this being a known limitation Solr doesn't have any (active) way of helping users avoid this, excepting one or two references buried in the Solr ref-gudie.
Solution
This commit adds a new UpdateRequestProcessor,
NumFieldsLimitingUpdateRequestProcessor
, that flags potentially-dangerous schema design for users by failing all updates once the relevant core has exceeded a configurable limit. The proposed URP has two configuration properties:maxFields
- (required) a positive integer value representing the maximum number of fields a core should be allowed to have.warnOnly
- (optional, defaults to 'false' if not specified) a boolean flag indicating whether Solr should "really" fail requests once the threshold has been hit, or should merely log a warning instead.Note: In order to be as "lightweight" as possible, this URP only updates its "view" of the number of fields in the core on each commit. Of course, this means that it operates on somewhat of a delay, and isn't a guarantee that users won't exceed the threshold at all. Rather, the real goal of the URP is less about serving as a hard limit, and more about forcing users to reconsider their schema design once the number of fields starts trending into "questionable" territory.
Tests
Unit tests in
NumFieldsLimitingUpdateRequestProcessorFactoryTest
. Still needs some higher level testing prior to merge.Checklist
Please review the following and check all that apply:
main
branch../gradlew check
.