SOLR-17049: Fix Replica Down on startup logic #2432
Conversation
Nice patch! It would be great to get a test in that could at least catch this change doing nothing at all. That's probably not a simple request, though, and I wouldn't want it to hold up the fix: such a test would be of limited value in the near term, and the rest of the tests are good enough at catching anything this might break.
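Purely to illustrate the kind of smoke test meant here (not something this PR includes), a sketch assuming the usual SolrCloudTestCase / MiniSolrCloudCluster helpers; the collection name, config name, and counts are made up:

```java
import org.apache.solr.client.solrj.request.CollectionAdminRequest;
import org.apache.solr.cloud.SolrCloudTestCase;
import org.apache.solr.embedded.JettySolrRunner; // package differs on older branches
import org.junit.BeforeClass;
import org.junit.Test;

public class ReplicaDownOnStartupTest extends SolrCloudTestCase {

  @BeforeClass
  public static void setupCluster() throws Exception {
    configureCluster(2)
        .addConfig("conf", configset("cloud-minimal"))
        .configure();
  }

  @Test
  public void testReplicasComeBackAfterNodeRestart() throws Exception {
    CollectionAdminRequest.createCollection("startupDown", "conf", 1, 2)
        .process(cluster.getSolrClient());
    cluster.waitForActiveCollection("startupDown", 1, 2);

    // Restart one node; the startup logic under review should mark its
    // replicas DOWN, and they should then recover to ACTIVE.
    JettySolrRunner node = cluster.getJettySolrRunner(0);
    node.stop();
    cluster.waitForJettyToStop(node);
    node.start();

    cluster.waitForActiveCollection("startupDown", 1, 2);
  }
}
```

A restart-and-wait test like this only proves the node comes back healthy; asserting the intermediate DOWN state would need more plumbing, which is the "not a simple request" part.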
@@ -181,29 +183,46 @@ public String getShardId(String nodeName, String coreName) {
}

public String getShardId(String collectionName, String nodeName, String coreName) {
Nobody calls this; let's remove it.
It's a public SolrJ API so I hesitate to remove it in a minor release.
Sorry; I definitely don't think we should hold ourselves to that high a standard. As you know, there are a ton of classes and methods that are public across the SolrJ JARs. The risk of not giving ourselves permission to remove things is that contributions that might otherwise come simply don't, because it's too much work to maintain backwards compatibility.
Can mark deprecated for now; don't need to remove it yet.
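If deprecation is the route taken, the change could be as small as this (the Javadoc wording is only a suggestion):

```java
/**
 * @deprecated Not called within Solr itself; kept only for SolrJ back-compat
 *             and a candidate for removal in a future major release.
 */
@Deprecated
public String getShardId(String collectionName, String nodeName, String coreName) {
  // ... existing implementation unchanged ...
}
```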
for (CollectionRef ref : states) {
  DocCollection coll = ref.get();
(I know this already existed.) Looping over all collections scares me where I work. Maybe that's a LazyDocCollection, and here we call get(), which defaults to disallowing the cached state, i.e. we need to fetch from ZK.
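To spell out the concern, a sketch (not the PR's code), assuming CollectionRef's cached/uncached get variants as they exist in ClusterState today:

```java
import java.util.Collection;
import org.apache.solr.common.cloud.ClusterState;
import org.apache.solr.common.cloud.DocCollection;

class CollectionLoopSketch {
  // 'states' stands for the CollectionRef values held by ClusterState.
  static void visitAll(Collection<ClusterState.CollectionRef> states) {
    for (ClusterState.CollectionRef ref : states) {
      // get() is get(false), i.e. "do not trust a cached copy": for a lazy
      // ref this forces a read from ZooKeeper, so the loop can turn into one
      // ZK fetch per collection in the cluster.
      DocCollection coll = ref.get();
      if (coll == null) {
        continue; // collection disappeared between listing and fetching
      }
      // ... inspect coll (replicas on this node, PRS flag, etc.) ...
    }
  }
}
```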
ClusterState clusterState = cc.getZkController().getClusterState();
Map<String, List<Replica>> replicasPerCollectionOnNode =
    clusterState.getReplicaNamesPerCollectionOnNode(nodeName);
It appears this is only called on the current node. If true, couldn't we instead list CoreDescriptors to find cores on the current node? This scales much better than looping all collections that exist in the entire cluster.
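For comparison, the core-descriptor approach suggested here would look roughly like the sketch below; it stays entirely local and never enumerates the cluster, but it assumes CoreContainer's descriptors are already loaded, which the reply below points out is not the case at this point of startup:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.solr.core.CoreContainer;
import org.apache.solr.core.CoreDescriptor;

class LocalCoresSketch {
  /** Group locally registered core names by collection, without touching ZK. */
  static Map<String, List<String>> coreNamesPerCollection(CoreContainer cc) {
    Map<String, List<String>> result = new HashMap<>();
    for (CoreDescriptor cd : cc.getCoreDescriptors()) {
      // getCloudDescriptor() is null in standalone mode; in SolrCloud it
      // carries the collection/shard/replica identity of the local core.
      if (cd.getCloudDescriptor() == null) continue;
      String collection = cd.getCloudDescriptor().getCollectionName();
      result.computeIfAbsent(collection, k -> new ArrayList<>()).add(cd.getName());
    }
    return result;
  }
}
```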
That would be nice, but there are two problems.
- The core descriptors are not yet available; that's actually why the wait didn't work initially.
- What if the cores no longer exist locally? The cluster state could still list them there, making them look active and healthy on startup.
Fair point. I wish the replica's state in PRS were an ephemeral node; then this would be a non-issue. No need to mark anything down, ZK does it for you :-). @murblanc often speaks of wishing for this.
So it should be mentioned that this is only really a problem because of PRS. In order to determine whether a collection needs to be managed by the node itself, we have to load the collection and check whether it is PRS-enabled. If the overseer could handle the updates for PRS, it would be much less of a strain.
As for the initial issue that the node doesn't know which collections to watch, we could have the overseer give that information back to the node, telling it which collections to watch for state.
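To make the first point concrete: just answering "does this node even have PRS work to do for this collection?" already forces the collection to be materialized (and, for a lazy ref, read from ZK). A sketch, assuming DocCollection's isPerReplicaState() flag:

```java
import org.apache.solr.common.cloud.ClusterState;
import org.apache.solr.common.cloud.DocCollection;

class PrsCheckSketch {
  /** True if the collection stores replica state per-replica (PRS). */
  static boolean needsPerReplicaHandling(ClusterState.CollectionRef ref) {
    DocCollection coll = ref.get(); // for lazy refs this may read from ZK
    return coll != null && coll.isPerReplicaState();
  }
}
```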
To be clear, the overseer should have all of the collections' state anyway, so it's much less of an issue. I do understand that looping through thousands (or hundreds of thousands) of collections is a problem in itself; without storing replica information in two places (under nodes and under shards), it's sometimes a necessary evil.
Having replica state be an ephemeral node could be useful in this regard; I wonder how it would complicate the general loading and reading of cluster state (and how it would impact ZK load for large clusters).
To be clear, the overseer should have all of the collections' state anyway, so it's much less of an issue
I've heard this might be true but haven't found how to see this in the code. You imply, I suppose, that LazyCollectionRef is never used on an Overseer node? How?
I guess I am handwaving a bit. But if all cluster-mutating events go through the overseer, then the overseer should have all the most recent versions of all collections, right? (Sure, it won't have the states of collections that have not been referenced since the node became overseer, but that issue goes away after the first "node-down" event)
But if all cluster-mutating events go through the overseer ...
Yeah, probably this. Locally I want to add either a log or a metric on org.apache.solr.common.cloud.ZkStateReader.DocCollectionWatches#activeCollectionCount to help monitor this and understand it.
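A minimal sketch of the "log it for now" idea; the hook point and the way the count is obtained are assumptions here, only activeCollectionCount itself comes from the comment above:

```java
import java.lang.invoke.MethodHandles;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class WatchCountLogger {
  private static final Logger log =
      LoggerFactory.getLogger(MethodHandles.lookup().lookupClass());

  /** Emit the number of actively watched collections, e.g. periodically or on node-down events. */
  static void logActiveCollectionWatches(int activeCollectionCount) {
    // This is the number that drives the cost of the loop discussed above.
    log.info("Actively watched collections: {}", activeCollectionCount);
  }
}
```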
Even if all collections are PRS, we will have to do this loop.
Totally understood. Eventually, if ephemeral nodes are used for replica state, this loop would be removed, as the replica's state would be interpreted as down.
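The appeal of the ephemeral-node idea is that ZooKeeper deletes the znode when the owning session dies, so a crashed or restarting node never leaves a stale "active" marker behind and there is nothing to mark down on startup. A bare-bones illustration with the plain ZooKeeper client; the path layout and empty payload are made up, not Solr's actual PRS format:

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

class EphemeralReplicaStateSketch {
  /**
   * Publish an "alive" marker for a replica. Because the znode is EPHEMERAL,
   * ZooKeeper removes it automatically when the owning session expires.
   */
  static void publishAlive(ZooKeeper zk, String collection, String replicaName)
      throws KeeperException, InterruptedException {
    // The parent path is assumed to already exist as a persistent node.
    String path = "/collections/" + collection + "/replica_alive/" + replicaName;
    zk.create(path, new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
  }
}
```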
We would still need the loop to bring the replica states back to ACTIVE, right?
I'd keep that overseer alive somewhere. It would make an excellent case study. I think it's pretty rare - I'm hard-pressed to think of anything I've ever run into that comes close to its performance / cluster impact in comparison to the infrequent number of bits it actually has to manage and distribute. It's honestly breathtaking in its own way. The level of independence in its work, the amount of information involved ... if I ever teach a software course, I'd pull it out of a jar. You can't just waltz into code like that. There are a lot of lessons tied up in that code.
+1 to merge
(cherry picked from commit 1b582e9)
I don't think we should add additional methods to an important class like ClusterState unless we think they're "worthy". This PR introduces a method called only by ZkController, so let's instead put this logic in ZkController and keep ClusterState smaller, with fewer obscure methods.
https://issues.apache.org/jira/browse/SOLR-17049
The list of replicas should not be determined by what exists locally, but instead by what exists in ZooKeeper.