
SOLR-17519: CloudSolrClient with HTTP ClusterState can forget live nodes and then fail #2935

Open · wants to merge 5 commits into base: main
Conversation

@mlbiscoc (Contributor) commented Jan 3, 2025

https://issues.apache.org/jira/browse/SOLR-17519

Description

In BaseHttpClusterStateProvider, if all of the initially passed nodes except one go down, the CSP will only be aware of that one live node. If that node then goes down and any of the other initially passed nodes recover, the CSP cannot fetch the latest cluster state of live nodes, because it only holds the address of the now-downed node.

Solution

The CSP holds an immutable initialNodes set, which is never removed from the live nodes, keeping the CSP resilient. If the case above were to occur, the CSP would still be able to fetch the cluster state and the latest set of live nodes, because liveNodes always contains the initial set. liveNodes can also hold new nodes added to the cluster after CSP initialization, but those are removable, unlike the initial nodes.
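The core idea can be sketched as a tiny standalone class (hypothetical names; this is a minimal illustration of the merge, not the actual Solr class):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Minimal sketch: initialNodes is captured once at construction and unioned
// into every liveNodes update, so the initially configured addresses are
// never forgotten even when those nodes are temporarily down.
class ResilientNodeTracker {
    private final Set<String> initialNodes;   // immutable after construction
    private volatile Set<String> liveNodes;

    ResilientNodeTracker(List<String> configuredNodes) {
        this.initialNodes = Set.copyOf(configuredNodes);
        this.liveNodes = initialNodes;
    }

    /** Called after each successful cluster-state fetch. */
    void updateLiveNodes(Set<String> fetchedLiveNodes) {
        Set<String> merged = new HashSet<>(fetchedLiveNodes);
        merged.addAll(initialNodes);          // initial nodes are never removed
        this.liveNodes = Set.copyOf(merged);
    }

    Set<String> getLiveNodes() {
        return liveNodes;
    }
}
```

Even if a fetch returns a set that omits every initial node, the merged result still contains them, so a later refresh can fall back to those addresses.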

Tests

testClusterStateProviderDownedInitialLiveNodes covers the scenario described above and previously failed.
testClusterStateProviderLiveNodesWithNewHost covers live nodes with a third host added after the CSP is initialized; that host is removable, but the initial nodes always remain present and reachable by the CSP.

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Solr maintainers access to contribute to my PR branch. (optional but recommended, not available for branches on forks living under an organisation)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.
  • I have added documentation for the Reference Guide.

@dsmiley (Contributor) left a comment:

Thanks for the PR!

Comment on lines 55 to 57
private Set<String> initialNodes;
volatile Set<String> liveNodes;
volatile Set<String> knownNodes;
Contributor:
Right away, I'm surprised. I can understand needing two sets, but three? IMO knownNodes doesn't need to exist; it's the same as liveNodes. I see below you've changed many references from liveNodes to knownNodes, but I recommend against doing that because the exact wording "liveNodes" is extremely pervasive in SolrCloud (a well-known concept), so let's not vary it in just this class or any class.

Contributor Author:

I originally did it with just liveNodes, but liveNodes is returned to the client, and from its name I was thinking they would assume the nodes are "live". If we just place all the nodes into liveNodes, that's not necessarily true, right? If the current set holds all the initially passed nodes, it isn't guaranteed they're live, which is why I switched to knownNodes, while liveNodes is what was actually fetched from ZooKeeper.

Contributor:

There's no guarantee that getLiveNodes will return a list of reachable nodes. A moment after it returns, a node could become unreachable. So it's "best effort", and the caller has to deal with failures by trying the other nodes in the list and/or getting a possibly refreshed list.

Contributor Author:

That's fair. Removed knownNodes and put everything into liveNodes, always including the initial set.

@@ -65,6 +68,8 @@ public void init(List<String> solrUrls) throws Exception {
urlScheme = solrUrl.startsWith("https") ? "https" : "http";
try (SolrClient initialClient = getSolrClient(solrUrl)) {
this.liveNodes = fetchLiveNodes(initialClient);
this.initialNodes = Set.copyOf(liveNodes);
Contributor:

I think the idea of this JIRA issue is that we'll take the Solr URLs as configured and use them as the initial / backup liveNodes. I think this is a very simple idea to document/understand/implement.

Comment on lines 46 to 47
private static JettySolrRunner jettyNode1;
private static JettySolrRunner jettyNode2;
Contributor:

Static fields in tests that refer to Solr should be nulled out, so there's some burden to them. Here... I think you could avoid these altogether and simply call cluster.getJettySolrRunner(0), which isn't bad!

Contributor:

Could add a convenience method if you like.

}

private void waitForCSPCacheTimeout() throws InterruptedException {
Thread.sleep(6000);
Contributor:

This test should set the system property solr.solrj.cache.timeout.sec to maybe 1ms, and then you can merely sleep for like 100 milliseconds.

Contributor Author:

The cache timeout setting is in seconds and pulls an int, so it can't be set to 1ms. This probably should have had better granularity instead of increments of seconds.

I could change cacheTimeout to take milliseconds as the int instead, but that would silently change behavior for people who set the system property to something other than the default. So if they set the cache timeout to 5 seconds, it would now turn into 5ms if they are unaware of this change.

@@ -61,6 +67,7 @@ public abstract class BaseHttpClusterStateProvider implements ClusterStateProvid
private int cacheTimeout = EnvUtils.getPropertyAsInteger("solr.solrj.cache.timeout.sec", 5);

public void init(List<String> solrUrls) throws Exception {
this.initialNodes = getNodeNamesFromSolrUrls(solrUrls);
Contributor Author (@mlbiscoc) commented Jan 10, 2025:

> I think the idea of this JIRA issue is that we'll take the Solr URLs as configured and use them as the initial / backup liveNodes. I think this is a very simple idea to document/understand/implement.

That makes sense. That leaves me with a few questions, then. Shouldn't this take a list of URL or URI Java objects, to verify the URLs are not malformed, instead of a list of Strings? I created the functions below to convert these Strings into cluster-state node names for liveNodes.

Contributor:

Great idea to do basic malformed-URL checks like this.

Comment on lines 428 to 444
public Set<String> getNodeNamesFromSolrUrls(List<String> urls)
throws URISyntaxException, MalformedURLException {
Set<String> set = new HashSet<>();
for (String url : urls) {
String nodeNameFromSolrUrl = getNodeNameFromSolrUrl(url);
set.add(nodeNameFromSolrUrl);
}
return Collections.unmodifiableSet(set);
}

/** URL to cluster state node name (http://127.0.0.1:12345/solr to 127.0.0.1:12345_solr) */
public String getNodeNameFromSolrUrl(String solrUrl)
throws MalformedURLException, URISyntaxException {
URL url = new URI(solrUrl).toURL();
return url.getAuthority() + url.getPath().replace('/', '_');
}

Contributor Author:

Not sure if these methods belong here, but they're required to convert the initial set of String URLs into cluster-state node names.

Comment on lines +423 to +425
Set<String> liveNodes = new HashSet<>(nodes);
liveNodes.addAll(this.initialNodes);
this.liveNodes = Set.copyOf(liveNodes);
Contributor:

Why is this three lines of extra copying instead of nothing more than this.liveNodes = Set.copyOf(nodes);? That is, why are we touching/using initialNodes at all here?
Maybe throw an IllegalArgumentException if nodes.isEmpty().

Contributor Author:

My idea was to avoid another loop through the initial nodes if liveNodes was exhausted. With this setter, initialNodes would always exist in liveNodes. Let me refactor again from your suggestion.

Comment on lines 430 to 435
Set<String> set = new HashSet<>();
for (String url : urls) {
String nodeNameFromSolrUrl = getNodeNameFromSolrUrl(url);
set.add(nodeNameFromSolrUrl);
}
return Collections.unmodifiableSet(set);
Contributor:

Could easily be converted to a single expression with streams
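One possible single-expression stream version (a sketch, not the actual Solr code; the checked exceptions thrown by the conversion method prevent a bare method reference in map(), so they are wrapped here in an unchecked IllegalArgumentException):

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

class NodeNames {
    /** URL to cluster-state node name (http://127.0.0.1:12345/solr to 127.0.0.1:12345_solr). */
    static String getNodeNameFromSolrUrl(String solrUrl)
            throws MalformedURLException, URISyntaxException {
        URL url = new URI(solrUrl).toURL();
        return url.getAuthority() + url.getPath().replace('/', '_');
    }

    /** Single stream expression replacing the explicit loop above. */
    static Set<String> getNodeNamesFromSolrUrls(List<String> urls) {
        return urls.stream()
                .map(u -> {
                    try {
                        return getNodeNameFromSolrUrl(u);
                    } catch (MalformedURLException | URISyntaxException e) {
                        throw new IllegalArgumentException("Malformed Solr URL: " + u, e);
                    }
                })
                .collect(Collectors.toUnmodifiableSet());
    }
}
```

Collectors.toUnmodifiableSet() also removes the need for the explicit Collections.unmodifiableSet wrap.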

Comment on lines +438 to +443
/** URL to cluster state node name (http://127.0.0.1:12345/solr to 127.0.0.1:12345_solr) */
public String getNodeNameFromSolrUrl(String solrUrl)
throws MalformedURLException, URISyntaxException {
URL url = new URI(solrUrl).toURL();
return url.getAuthority() + url.getPath().replace('/', '_');
}
Contributor:

Can you find other code in Solr doing this? Surely it's somewhere.

Contributor:

This should be a static method with a unit test.

Contributor Author:

Hmm, maybe... I found getBaseUrlForNodeName, but it's going the other way (nodeName to URL). Let me look around more.

@@ -229,10 +236,9 @@ > getCacheTimeout()) {
for (String nodeName : liveNodes) {
Contributor:

If we exhaust liveNodes, shouldn't we then try the initial configured nodes, and then only failing that, throw an exception?

@dsmiley (Contributor) commented Jan 10, 2025

I simplified my characterization of how I think this should be done:

  • on initialization, copy the configured URLs to a field for safe keeping. Convert to URL if you like (validates if malformed)
  • on initialization, liveNodes shall be empty; we haven't fetched it yet. Ideally we don't even try to on initialization.
  • when getLiveNodes is called and it's not out of date, return it. If it's out of date, loop live nodes and then the initial configured URLs for the first server that will respond to fetch the current live nodes.
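The steps above can be sketched as a runnable toy (hypothetical names and a pluggable fetcher stand in for the real HTTP calls; this is not the actual Solr implementation):

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Sketch of the proposed flow: configured URLs are kept for safe keeping,
// liveNodes starts empty, and a stale getLiveNodes() tries cached live
// nodes first, then the configured URLs, before giving up.
class LiveNodesRefresher {
    private final Set<String> configuredUrls;            // initial / backup nodes
    private final Function<String, Set<String>> fetcher; // throws if node unreachable
    private Set<String> liveNodes = Set.of();            // empty until first fetch
    private boolean stale = true;

    LiveNodesRefresher(List<String> configuredUrls, Function<String, Set<String>> fetcher) {
        this.configuredUrls = new LinkedHashSet<>(configuredUrls);
        this.fetcher = fetcher;
    }

    Set<String> getLiveNodes() {
        if (!stale) return liveNodes;
        // Candidates: cached live nodes first, then the configured URLs.
        List<String> candidates = new ArrayList<>(liveNodes);
        candidates.addAll(configuredUrls);
        for (String node : candidates) {
            try {
                liveNodes = fetcher.apply(node);
                stale = false;
                return liveNodes;
            } catch (RuntimeException e) {
                // node unreachable; try the next candidate
            }
        }
        throw new IllegalStateException("Could not fetch live nodes from any known node");
    }
}
```

Only after both the cached live nodes and the configured URLs are exhausted does the refresh fail, matching the "try live nodes, then initial configured nodes, then throw" ordering described above.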
