Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SOLR-17192: Add "field-limiting" URP to catch ill-designed schemas #2395

Merged

Conversation

gerlowskija
Copy link
Contributor

@gerlowskija gerlowskija commented Apr 9, 2024

https://issues.apache.org/jira/browse/SOLR-17192

Description

Solr isn't infinitely scalable when it comes to the number of fields in each core/collection. Most deployments start to experience problems any time a core has upwards of a few hundred fields. Usually this doesn't exhibit itself right away. instead waiting until segment-merge or some other time to rear its head.

Despite this being a known limitation Solr doesn't have any (active) way of helping users avoid this, excepting one or two references buried in the Solr ref-gudie.

Solution

This commit adds a new UpdateRequestProcessor, NumFieldsLimitingUpdateRequestProcessor, that flags potentially-dangerous schema design for users by failing all updates once the relevant core has exceeded a configurable limit. The proposed URP has two configuration properties:

  • maxFields - (required) a positive integer value representing the maximum number of fields a core should be allowed to have.
  • warnOnly - (optional, defaults to 'false' if not specified) a boolean flag indicating whether Solr should "really" fail requests once the threshold has been hit, or should merely log a warning instead.

Note: In order to be as "lightweight" as possible, this URP only updates its "view" of the number of fields in the core on each commit. Of course, this means that it operates on somewhat of a delay, and isn't a guarantee that users won't exceed the threshold at all. Rather, the real goal of the URP is less about serving as a hard limit, and more about forcing users to reconsider their schema design once the number of fields starts trending into "questionable" territory.

Tests

Unit tests in NumFieldsLimitingUpdateRequestProcessorFactoryTest. Still needs some higher level testing prior to merge.

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.
  • I have added documentation for the Reference Guide

The URP, `NumFieldLimitingUpdateRequestProcessor`, blocks all update
requests that go through `processAdd` if the core exceeds a configurable
threshold of fields.

As the blocking only triggers _after_ the limit has been exceeded, it
doesn't technically prevent users from getting into the bad state.

But it forces a discussion around schema design, at which point users
may either raise the limit, disable the URP, or go back to the
drawing-board on their setup.
@gerlowskija
Copy link
Contributor Author

Still TBD - additional tests, ref-guide docs.

@gerlowskija
Copy link
Contributor Author

In a sample solrconfig.xml, configuration for the proposed URP would look like:

  <updateProcessor class="solr.NumFieldLimitingUpdateRequestProcessorFactory" name="max-fields">
    <int name="maxFields">123</int>
  </updateProcessor>
  ...
  <updateRequestProcessorChain name="add-unknown-fields-to-the-schema"
           processor="uuid,remove-blank,field-name-mutating,parse-boolean,parse-long,parse-double,parse-date,add-schema-fields,max-fields">

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Apr 10, 2024
@gerlowskija
Copy link
Contributor Author

This should all be ready to go: I've added unit and integration tests, ./gradlew test passes, there's ref-guide coverage and decent Javadocs, etc. I'll aim to merge EOW, absent any feedback folks might have here.

One big question I have is: should we add this into the "_default" configset's default URP-chain? At least for 10.x?

I'd argue we should. Safety guardrails like this one aren't any good unless users enable them, and having it be enabled by default is the best way to do that. We might get some complaints from folks who run into this during an inplace 9->10 upgrade, but:

  1. We tell users to carefully reconsider their configsets when doing upgrades, so this shouldn't be a surprise.
  2. We can highlight it clearly in the upgrade notes
  3. Users who hit this on an in-place upgrade can always bump the limit, toggle the URP into 'warnOnly' mode, or disable the URP altogether if they really want.

I'd love a sanity-check/+1 from someone else though before I make this enabled by default on main/10.x, so please chime in!

Copy link
Contributor

@epugh epugh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some questions/comments...

Copy link
Contributor

@dsmiley dsmiley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seeing the non-trivial checks determine the doc will be indexed in this core looks painful to me. If we have an URP, document that it needs to be placed to the right of DistributedUpdateProcessor and/or we enhance Solr to make sure that certain URPs cannot be misplaced relative to that pivotal URP.

I like the SolrEventListener for commits part. I feel like this should be its own component someone should be able to configure if they want, like for monitoring purposes but don't necessarily want to block adds. To block all indexing, it could set SolrCore.readOnly = true and be done with it; no URP.

I see more test configs here; these are a heavy burden for our project to maintain. Please try to write the test to generate the configuration so that the test and its config is self-contained. You can see an example of this in org.apache.solr.search.TestThinCache#setupSolrHome

@dsmiley
Copy link
Contributor

dsmiley commented Apr 10, 2024

Sorry if I rained on your parade, attempting to get this in by EOW.

I like the idea of having checks like this by default but I'd like Solr to have such built-in plugins simply be registered automatically unless you set some env/sys-prop to disable, like "solr.core.eventListener.fieldLimit.enabled". We already register a ton of query parsers even though nobody asked for them :-). I think that should be the model for built-in plugins so that we can continue slowly to reduce what's in a solrconfig.xml yet also allow someone to disable.

@gerlowskija
Copy link
Contributor Author

Sorry if I rained on your parade, attempting to get this in by EOW.

Oh no worries; there's no particular rush. That was just how long I was planning to wait if no one commented in the interim. I'm glad you both chimed in!

Though tbh I can't quite tell where you are on the PR approach overall:

  1. Are you '-1' this being an URP at all?
  2. Are you OK with this being an URP, with the minor tweaks you mentioned (e.g. documenting that it goes after DUP, etc.)?
  3. Are you only OK with this being an URP, but only if we make some improvements to the URP framework (e.g. to ensure certain URPs always follow DUP)
  4. ...something else?

Please let me know!

One last reply inline:

To block all indexing, it could set SolrCore.readOnly = true and be done with it

IMO SolrCore.readOnly is too blunt a tool for the "field limiting" this PR attempts. First because readOnly would block deletions and operations that we'd want to allow. And second because a user blocked by 'readOnly' would have their /update requests fail with a message that's pretty inscrutable or at least unrelated to the root cause.

That's what I like about the URP approach - it gives us more granularity into the different types of "/update" operations and allows us to get a nice (or at least, nicer) error message back to users.

@dsmiley
Copy link
Contributor

dsmiley commented Apr 11, 2024

I don't like the complexity in this URP relating to tolerance of where the URP is placed in the chain; I'd feel better if the URP were simplified from that concern and we expect the user to place it at an appropriate spot. We don't have complexity like that elsewhere and/or I argue it's not the right approach.

I sympathize with why an URP feels right. Okay. On each addDoc, don't we just need to do the check on the current SolrIndexSearcher but remember who the searcher was so that the next addDoc can see it's the same searcher and if so don't do the check? It'd need to cache the searcher reference for the instance equality; use a WeakReference so we allow the searcher to close & GC. If we do this, we don't need a commit event listener, thus don't need an additional component / confusing interaction complexity. Don't get the SolrIndexSearcher from the request (we don't want a per-request cache instance), get it from the SolrCore so we see a possibly evolving SolrIndexSearcher.

@github-actions github-actions bot removed the configs label Apr 16, 2024
@gerlowskija
Copy link
Contributor Author

gerlowskija commented Apr 16, 2024

I don't like the complexity in this URP relating to tolerance of where the URP is placed in the chain; I'd feel better if the URP were simplified from that concern and we expect the user to place it at an appropriate spot

Yeah I understand. I still feel a little reluctant I guess, but I'm likely being paranoid. It's harder to feel good about the "win" in making things safer for users if the change opens up another trappy state when misconfigured.

But I'll take out the leader-check for now. We can always re-evaluate for a subsequent release if we see people getting bitten by this. And in any case there are other ways to mitigate the risk of misconfig (e.g. putting into the "_default" config for 10.0 so users needn't tweak things).

@gerlowskija
Copy link
Contributor Author

One error case that removing the leader logic will introduce is that due to segment merging, etc. not all replicas will necessarily agree on the number of fields. So there's a chance that when NRT leaders forward the doc along to replicas, that it might get a mix of success/failure responses back.

Just noting this here as a reminder to explain this case in the little blurb the URP gets in the ref guide.

@gerlowskija
Copy link
Contributor Author

Alright, I've removed a lot of the leadership and shard checking logic based on review feedback. I think this should be ready to go, unless you have remaining concerns @dsmiley ?

I've also added the new URP to the URP-chain defined in the "_default" configset (/solr/server/solr/configsets/_default). My intent here is that this part of the PR would be a "main" only change. 9.x would either remain without the URP in the _default configset, or have it enabled but in "warnOnly" mode (my personal preference). I used a relatively high field limit, 1000, but we can raise it if folks think 1000 is too low?

Copy link
Contributor

@dsmiley dsmiley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lots of feedback... some is irrelevant depending on what path we take. Especially, read my comment in the factory first.

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class NumFieldLimitingUpdateRequestProcessor extends UpdateRequestProcessor {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

javadoc needed, even if one line linking to a factory for further info.
Or don't make public :-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missed this earlier; done!

+
Solr performance can degrade and even become unstable if cores accumulate too many (e.g. more than 500) fields. The "NumFieldLimiting" URP is offered as a safeguard that helps users notice potentially-dangerous schema design and/or mis-use of dynamic fields, before these performance and stability problems would manifest.
Note that the field count an index reports can be influenced by deleted (but not yet purged) documents, and may vary from replica to replica.
In order to avoid these sort of discrepancies between replicas, use of this URP should almost always precede DistributedUpdateProcessor in when running in SolrCloud mode.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This advice surprised me. I guess why not, as only the first node receiving the batch will evaluate this logic. It may/may-not be the leader but whatever.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, the URP certainly can't go after DistributedUpdateProcessor for the reason laid out in the docs - you don't want some subset of your replicas rejecting certain docs because their segments have them disagree on the current field count.

Putting it before DUP also has downsides, as you pointed out. The field count check is going to happen on that somewhat arbitrary "first node", instead of a the more authoritative "leader". And that adds wonkiness - if replicas disagree on field-count, this means you could get into situations where a given update could either succeed or fail depending on whether it gets lucky or not with its "first node". (Hopefully this isn't a big deal though, since the URP isn't trying to be an exact limit - but it's still ugly and potentially confusing.)

This was what earlier versions of this PR were trying to avoid by gating the field-count check so that it only happened for docs that were being processed by their shard leader.

Comment on lines 47 to 49
<updateProcessor class="solr.NumFieldLimitingUpdateRequestProcessorFactory" name="max-fields">
<int name="maxFields">${solr.test.maxFields:1234}</int>
</updateProcessor>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why declare this separately instead of in the one chain that uses it? It's not obvious, when looking at the chain, that there is this additional processor.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tbh I wasn't aware that you could declare processors (that require config) directly in a processor chain. I can't find an example of that. Looking through several other solrconfig.xml files for URPs that require configuration...they all look similar to what I've done here afaict.

(See the _default configset Solr ships and the default configset in the 'core' modules tests...both look similar to the example here)

Maybe I'm misunderstanding you or missing something though - can you point to an example of how you'd prefer to see this declared, and a reason it's preferable?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having no-config vs config doesn't conceptually change where the element can be placed.
ex:
<updateProcessor blah lbah />
vs

<updateProcessor blah blah>
  <int blah blah/>
</updateProcessor>

Can be in the same places.
I went to the ref guide's page on URPs and observe the first example shows a chain with several URPs, one of which has config. I think it's natural to place URPs inline in the chain. Separating it out is for composability if there are multiple chains and URPs and you want to avoid possible repetition.

Copy link
Contributor Author

@gerlowskija gerlowskija May 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated this to be in line with your request.

Though I'm still curious if there's a reason to prefer this going forward, or whether it's subjective. I've not paid much attention historically to how these appear in our configsets, but naively I'd assume that more composability is better.


// TODO Should we skip to the next URP if running in SolrCloud and *not* a leader?
return new NumFieldLimitingUpdateRequestProcessor(
req, next, maximumFields, numFieldsMonitor.getCurrentNumFields(), warnOnly);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Woah, okay, I wasn't expecting this. The numFieldsMonitor.getCurrentNumFields is only read on load of the URP; it's not per-doc. If that's adequate (I suppose; this is just a soft limit) then this can be much simpler -- just go read the SolrIndexSearcher right now to get the field count! No NumFieldsMonitor needed. Furthermore the URP could be renamed to simply "BlockNewDocsUrp" and is small enough can be an inner class really.

Copy link
Contributor Author

@gerlowskija gerlowskija May 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally, I find the NumFieldsMonitor approach preferable to using a SIS here directly:

  • it allows the URP and factory to be abstracted away from exactly how the currentNumFields is calculated.
  • it allows us to make as few (potentially expensive) calls to SIS.getFieldInfos() as possible. (I know we could also avoid calls by doing things like hanging on to a searcher reference and checking whether the searcher has changed, but IMO this is at least as complex as the NumFieldsMonitor approach.)
  • as epugh pointed out in an earlier comment, NumFieldsMonitor seems like it could be applied in other places.

If there are other votes I'm happy to go with whatever the majority considers simpler. But otherwise I'll stick with NumFieldsMonitor approach.

Furthermore the URP could be renamed to simply "BlockNewDocsUrp"

I don't love "NumFieldLimitingURP" as a name, but I like that the name itself suggests why a user would want to use it. "Want to cap the fields in your index? Then use this thing..."

BlockNewDocs captures what the URP does, but doesn't capture the "why", which makes it less discoverable to users IMO (And you could also interpret it as distinguishing between "new" and "existing" docs, which isn't the case.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

factory to be abstracted away from exactly how the currentNumFields is calculated

Unless that abstraction shows utility here, I don't think it's worth another class.

it allows us to make as few (potentially expensive) calls to SIS.getFieldInfos()

That's not expensive! req.getSearcher().getFieldInfos() It gets a searcher, then returns a value in a field. Furthermore you'll do it just once on the indexing stream request.

as epugh pointed out in an earlier comment, NumFieldsMonitor seems like it could be applied in other places.

Sorry, what?

I didn't mean rename the factory (it's well named!). I think the URP should be a small inner class of this factory named whatever makes sense. Not public.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear, I am proposing that the whole NumFieldsMonitor thing (and how we init it, etc.) be replaced by no more than one line of code. :-)

Copy link
Contributor Author

@gerlowskija gerlowskija May 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

replaced by no more than one line of code

NumFieldsMonitor is more LOC and more Java classes, but IMO both of those are red-herrings in terms of complexity and maintenance burden. My subjective opinion is that an unfamiliar reader would have an easier time grokking NumFieldsMonitor than they would parsing through the reference-checking logic that (as I understand it) the alternative would entail.

That said I'm surprised to hear your confidence in "one line". My understanding of what you proposed, based on your earlier comment here, sounded like a good bit more than that. It'd need to: hold onto a weak-reference to the previous searcher, do some reference equality check on each invocation, fetch the updated number of fields, update the weak-ref if the current searcher is now different, etc. If using withSearcher() there's IOException handling to do as well, etc.

If I'm misunderstanding your suggestion and overcomplicating it in my head, and it's really just one line, please share 😄 . I'm happy to take a commit on this branch if you've got a minute!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multiple ideas have been discussed, so it's a bit confusing. My epiphany from your recent changes is that we only need to know how many fields exist at the time that the URP is instantiated, unlike the first incarnation. This ought to be a major simplifier and it nullifies my idea around wanting a WeakReference to a SolrIndexSearcher. May I do the change to your PR here in one commit that you might revert if you don't like it?

Copy link
Contributor Author

@gerlowskija gerlowskija May 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Absolutely, that'd be great. Thanks @dsmiley !

EDIT - oops, I see I responded here too late and you've already created a PR against my fork. Will go take a look at that. In future, you're welcome to push directly to my dev branch, I don't mind. Whatever you prefer.

@gerlowskija
Copy link
Contributor Author

Thanks for the re-review @dsmiley . Pending feedback from others I'm going to stick with the NumFieldMonitor approach (more details here), but I've addressed the rest of your suggestions. Lmk what you think!

@dsmiley
Copy link
Contributor

dsmiley commented May 6, 2024

See gerlowskija#5 for a commit that makes a large simplification:

  • No NumFieldsMonitor
  • No named URP; anonymous class is fine (and less code)
  • URP need not be inserted in the happy path!

The single line I refer to earlier is as follows in my PR to your PR:
final int currentNumFields = req.getSearcher().getFieldInfos().size();

* No NumFieldsMonitor
* No named URP; anonymous class is fine
* URP need not be inserted in the happy path!
@gerlowskija
Copy link
Contributor Author

Thanks for the change David - apologies for the back and forth trying to understand what you were suggesting. I've merged your PR into this branch.

With that wrapped up, I'm going to re-run tests/check/precommit and aim to merge this afternoon.

@dsmiley
Copy link
Contributor

dsmiley commented May 8, 2024

Cool; please remember to mention me in CHANGES.txt. And thanks for adding this protection.

@gerlowskija gerlowskija merged commit d9bd58b into apache:main May 8, 2024
4 checks passed
@gerlowskija gerlowskija deleted the SOLR-17192-add-urp-to-limit-fields branch May 8, 2024 15:45
gerlowskija added a commit to gerlowskija/solr that referenced this pull request May 8, 2024
…pache#2395)

The URPF, `NumFieldLimitingUpdateRequestProcessorFactory`, blocks all
update requests that go through `processAdd` if the core exceeds a
configurable threshold of fields.

The factory accepts two parameters: `maxFields` is a required integer
representing the maximum field threshold,  and `warnOnly` is an optional
boolean that (when enabled) has the URP chain log warnings instead of
blocking updates.

The factory is included in the default configset, with warnOnly=true and
maxFields=1000.  ('warnOnly' will be false starting in the 10.0 release)

---------

Co-authored-by: David Smiley <[email protected]>
gerlowskija added a commit that referenced this pull request May 8, 2024
…2395)

The URPF, `NumFieldLimitingUpdateRequestProcessorFactory`, blocks all
update requests that go through `processAdd` if the core exceeds a
configurable threshold of fields.

The factory accepts two parameters: `maxFields` is a required integer
representing the maximum field threshold,  and `warnOnly` is an optional
boolean that (when enabled) has the URP chain log warnings instead of
blocking updates.

The factory is included in the default configset, with warnOnly=true and
maxFields=1000.  ('warnOnly' will be false starting in the 10.0 release)

---------

Co-authored-by: David Smiley <[email protected]>
dsmiley added a commit to iamsanjay/solr that referenced this pull request May 8, 2024
…pache#2395)


The URPF, `NumFieldLimitingUpdateRequestProcessorFactory`, blocks all
update requests that go through `processAdd` if the core exceeds a
configurable threshold of fields.

The factory accepts two parameters: `maxFields` is a required integer
representing the maximum field threshold,  and `warnOnly` is an optional
boolean that (when enabled) has the URP chain log warnings instead of
blocking updates.

The factory is included in the default configset, with warnOnly=false and
maxFields=1000.  (More lenient settings will be used on branch_9x)

---------

Co-authored-by: David Smiley <[email protected]>
janhoy pushed a commit to solrbot/apache-_-solr that referenced this pull request May 8, 2024
…pache#2395)


The URPF, `NumFieldLimitingUpdateRequestProcessorFactory`, blocks all
update requests that go through `processAdd` if the core exceeds a
configurable threshold of fields.

The factory accepts two parameters: `maxFields` is a required integer
representing the maximum field threshold,  and `warnOnly` is an optional
boolean that (when enabled) has the URP chain log warnings instead of
blocking updates.

The factory is included in the default configset, with warnOnly=false and
maxFields=1000.  (More lenient settings will be used on branch_9x)

---------

Co-authored-by: David Smiley <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cat:index client:solrj configs documentation Improvements or additions to documentation tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants