
SOLR-17519: CloudSolrClient with HTTP ClusterState can forget live nodes and then fail #2935

Open · wants to merge 5 commits into base: main
Conversation

@mlbiscoc (Contributor) commented Jan 3, 2025

https://issues.apache.org/jira/browse/SOLR-17519

Description

In BaseHttpClusterStateProvider, if all of the initially passed nodes except one go down, the CSP will only be aware of that one live node. If that node then goes down and any of the other initially passed nodes recover, the CSP cannot fetch the latest cluster state of live nodes, because it only holds the address of the now-downed node.

Solution

The CSP holds an immutable initialNodes set, which is never removed from the live nodes, keeping the CSP resilient. If the case above were to occur, the CSP would still be able to fetch the cluster state and the latest set of live nodes, because liveNodes always contains the initial set. liveNodes can also hold new nodes added to the cluster after CSP initialization, but those are removable, unlike the initial nodes.
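The core idea can be sketched as a tiny standalone class (hypothetical names; this is a minimal illustration of the merge, not the actual Solr class):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Minimal sketch: initialNodes is captured once at construction and unioned
// into every liveNodes update, so the initially configured addresses are
// never forgotten even when those nodes are temporarily down.
class ResilientNodeTracker {
    private final Set<String> initialNodes;   // immutable after construction
    private volatile Set<String> liveNodes;

    ResilientNodeTracker(List<String> configuredNodes) {
        this.initialNodes = Set.copyOf(configuredNodes);
        this.liveNodes = initialNodes;
    }

    /** Called after each successful cluster-state fetch. */
    void updateLiveNodes(Set<String> fetchedLiveNodes) {
        Set<String> merged = new HashSet<>(fetchedLiveNodes);
        merged.addAll(initialNodes);          // initial nodes are never removed
        this.liveNodes = Set.copyOf(merged);
    }

    Set<String> getLiveNodes() {
        return liveNodes;
    }
}
```

Even if a fetch returns a set that omits every initial node, the merged result still contains them, so a later refresh can fall back to those addresses.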

Tests

testClusterStateProviderDownedInitialLiveNodes covers the scenario described above and previously failed.
testClusterStateProviderLiveNodesWithNewHost covers live nodes with a third host added after the CSP is initialized; that host is removable, but the initial nodes always remain present and reachable by the CSP.

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Solr maintainers access to contribute to my PR branch. (optional but recommended, not available for branches on forks living under an organisation)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.
  • I have added documentation for the Reference Guide.

@dsmiley (Contributor) left a comment:

Thanks for the PR!

Comment on lines 55 to 57
private Set<String> initialNodes;
volatile Set<String> liveNodes;
volatile Set<String> knownNodes;
Contributor:
Right away, I'm surprised. I can understand needing two sets, but three? IMO knownNodes doesn't need to exist; it's the same as liveNodes. I see below you've changed many references from liveNodes to knownNodes, but I recommend against doing that because the exact wording "liveNodes" is extremely pervasive in SolrCloud (a well-known concept), so let's not vary it in just this class or any class.

Contributor Author:

I originally did it with just liveNodes, but liveNodes is returned to the client, and from its name I was thinking they would assume the nodes are "live". If we just place all the nodes into liveNodes, that's not necessarily true, right? If the current set holds all the initially passed nodes, it isn't guaranteed they're live, which is why I switched to knownNodes, while liveNodes is what was actually fetched from ZooKeeper.

Contributor:

There's no guarantee that getLiveNodes will return a list of reachable nodes. A moment after it returns, a node could become unreachable. So it's "best effort", and the caller has to deal with failures by trying the other nodes in the list and/or getting a possibly refreshed list.

Contributor Author:

That's fair. Removed knownNodes and put everything into liveNodes, always including the initial set.

@@ -65,6 +68,8 @@ public void init(List<String> solrUrls) throws Exception {
urlScheme = solrUrl.startsWith("https") ? "https" : "http";
try (SolrClient initialClient = getSolrClient(solrUrl)) {
this.liveNodes = fetchLiveNodes(initialClient);
this.initialNodes = Set.copyOf(liveNodes);
Contributor:

I think the idea of this JIRA issue is that we'll take the Solr URLs as configured and use them as the initial / backup liveNodes. I think this is a very simple idea to document/understand/implement.

Comment on lines 46 to 47
private static JettySolrRunner jettyNode1;
private static JettySolrRunner jettyNode2;
Contributor:

Static fields in tests that refer to Solr should be nulled out, so there's some burden to them. Here... I think you could avoid these altogether and simply call cluster.getJettySolrRunner(0), which isn't bad!

Contributor:

Could add a convenience method if you like.

}

private void waitForCSPCacheTimeout() throws InterruptedException {
Thread.sleep(6000);
Contributor:

This test should set the system property solr.solrj.cache.timeout.sec to maybe 1ms, and then you can merely sleep for like 100 milliseconds.

Contributor Author:

The cache timeout setting is in seconds and pulls an int, so it can't be set to 1ms. This probably should have had better granularity instead of increments of seconds.

I could change cacheTimeout to take milliseconds as the int instead, but that would silently change behavior for people who set the system property to something other than the default. So if they set the cache timeout to 5 seconds, it would now turn into 5ms if they are unaware of this change.

@@ -61,6 +67,7 @@ public abstract class BaseHttpClusterStateProvider implements ClusterStateProvid
private int cacheTimeout = EnvUtils.getPropertyAsInteger("solr.solrj.cache.timeout.sec", 5);

public void init(List<String> solrUrls) throws Exception {
this.initialNodes = getNodeNamesFromSolrUrls(solrUrls);
Contributor Author (@mlbiscoc) commented Jan 10, 2025:

> I think the idea of this JIRA issue is that we'll take the Solr URLs as configured and use them as the initial / backup liveNodes. I think this is a very simple idea to document/understand/implement.

That makes sense. That leaves me with a few questions, then. Shouldn't this take a list of URL or URI Java objects, to verify the URLs are not malformed, instead of a list of Strings? I created the functions below to convert these Strings into cluster-state node names for liveNodes.

Contributor:

Great idea to do basic malformed-URL checks like this.

Comment on lines 428 to 444
public Set<String> getNodeNamesFromSolrUrls(List<String> urls)
throws URISyntaxException, MalformedURLException {
Set<String> set = new HashSet<>();
for (String url : urls) {
String nodeNameFromSolrUrl = getNodeNameFromSolrUrl(url);
set.add(nodeNameFromSolrUrl);
}
return Collections.unmodifiableSet(set);
}

/** URL to cluster state node name (http://127.0.0.1:12345/solr to 127.0.0.1:12345_solr) */
public String getNodeNameFromSolrUrl(String solrUrl)
throws MalformedURLException, URISyntaxException {
URL url = new URI(solrUrl).toURL();
return url.getAuthority() + url.getPath().replace('/', '_');
}

Contributor Author:

Not sure if these methods belong here, but they're required to convert the initial set of String URLs into cluster-state node names.

Comment on lines +423 to +425
Set<String> liveNodes = new HashSet<>(nodes);
liveNodes.addAll(this.initialNodes);
this.liveNodes = Set.copyOf(liveNodes);
Contributor:

Why is this three lines of extra copying instead of nothing more than this.liveNodes = Set.copyOf(nodes);? That is, why are we touching/using initialNodes at all here?
Maybe throw an IllegalArgumentException if nodes.isEmpty().

Contributor Author:

My idea was to avoid another loop through the initial nodes if liveNodes was exhausted. With this setter, initialNodes would always exist in liveNodes. Let me refactor again from your suggestion.

Comment on lines 430 to 435
Set<String> set = new HashSet<>();
for (String url : urls) {
String nodeNameFromSolrUrl = getNodeNameFromSolrUrl(url);
set.add(nodeNameFromSolrUrl);
}
return Collections.unmodifiableSet(set);
Contributor:

Could easily be converted to a single expression with streams
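One possible single-expression stream version (a sketch, not the actual Solr code; the checked exceptions thrown by the conversion method prevent a bare method reference in map(), so they are wrapped here in an unchecked IllegalArgumentException):

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

class NodeNames {
    /** URL to cluster-state node name (http://127.0.0.1:12345/solr to 127.0.0.1:12345_solr). */
    static String getNodeNameFromSolrUrl(String solrUrl)
            throws MalformedURLException, URISyntaxException {
        URL url = new URI(solrUrl).toURL();
        return url.getAuthority() + url.getPath().replace('/', '_');
    }

    /** Single stream expression replacing the explicit loop above. */
    static Set<String> getNodeNamesFromSolrUrls(List<String> urls) {
        return urls.stream()
                .map(u -> {
                    try {
                        return getNodeNameFromSolrUrl(u);
                    } catch (MalformedURLException | URISyntaxException e) {
                        throw new IllegalArgumentException("Malformed Solr URL: " + u, e);
                    }
                })
                .collect(Collectors.toUnmodifiableSet());
    }
}
```

Collectors.toUnmodifiableSet() also removes the need for the explicit Collections.unmodifiableSet wrap.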

Comment on lines +438 to +443
/** URL to cluster state node name (http://127.0.0.1:12345/solr to 127.0.0.1:12345_solr) */
public String getNodeNameFromSolrUrl(String solrUrl)
throws MalformedURLException, URISyntaxException {
URL url = new URI(solrUrl).toURL();
return url.getAuthority() + url.getPath().replace('/', '_');
}
Contributor:

Can you find other code in Solr doing this? Surely it's somewhere.

Contributor:

This should be a static method with a unit test.

Contributor Author:

Hmm, maybe... I found getBaseUrlForNodeName, but it's going the other way (nodeName to URL). Let me look around more.

@@ -229,10 +236,9 @@ > getCacheTimeout()) {
for (String nodeName : liveNodes) {
Contributor:

If we exhaust liveNodes, shouldn't we then try the initial configured nodes, and then only failing that, throw an exception?

@dsmiley (Contributor) commented Jan 10, 2025

I simplified my characterization of how I think this should be done:

  • on initialization, copy the configured URLs to a field for safe keeping. Convert to URL if you like (validates if malformed)
  • on initialization, liveNodes shall be empty; we haven't fetched it yet. Ideally we don't even try to on initialization.
  • when getLiveNodes is called and it's not out of date, return it. If it's out of date, loop live nodes and then the initial configured URLs for the first server that will respond to fetch the current live nodes.
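The steps above can be sketched as a runnable toy (hypothetical names and a pluggable fetcher stand in for the real HTTP calls; this is not the actual Solr implementation):

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Sketch of the proposed flow: configured URLs are kept for safe keeping,
// liveNodes starts empty, and a stale getLiveNodes() tries cached live
// nodes first, then the configured URLs, before giving up.
class LiveNodesRefresher {
    private final Set<String> configuredUrls;            // initial / backup nodes
    private final Function<String, Set<String>> fetcher; // throws if node unreachable
    private Set<String> liveNodes = Set.of();            // empty until first fetch
    private boolean stale = true;

    LiveNodesRefresher(List<String> configuredUrls, Function<String, Set<String>> fetcher) {
        this.configuredUrls = new LinkedHashSet<>(configuredUrls);
        this.fetcher = fetcher;
    }

    Set<String> getLiveNodes() {
        if (!stale) return liveNodes;
        // Candidates: cached live nodes first, then the configured URLs.
        List<String> candidates = new ArrayList<>(liveNodes);
        candidates.addAll(configuredUrls);
        for (String node : candidates) {
            try {
                liveNodes = fetcher.apply(node);
                stale = false;
                return liveNodes;
            } catch (RuntimeException e) {
                // node unreachable; try the next candidate
            }
        }
        throw new IllegalStateException("Could not fetch live nodes from any known node");
    }
}
```

Only after both the cached live nodes and the configured URLs are exhausted does the refresh fail, matching the "try live nodes, then initial configured nodes, then throw" ordering described above.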
