SOLR-17519: CloudSolrClient with HTTP ClusterState can forget live nodes and then fail #2935
base: main
Conversation
Thanks for the PR!
private Set<String> initialNodes;
volatile Set<String> liveNodes;
volatile Set<String> knownNodes;
Right away, I'm surprised. I can understand needing two sets, but three? IMO knownNodes doesn't need to exist; that's the same as liveNodes. I see below you've changed many references from liveNodes to knownNodes, but I recommend against doing that because the exact wording "liveNodes" is extremely pervasive in SolrCloud (a well-known concept), so let's not vary it in just this class or any other.
I originally did it with just liveNodes, but liveNodes is returned to the client, and from its name I was thinking they would assume those nodes are "live". If we just place all the nodes into liveNodes, that's not necessarily true, right? If the current set holds all the initially passed nodes, it isn't guaranteed they are live, which is why I switched to knownNodes, while liveNodes is what was actually fetched from ZooKeeper.
There's no guarantee that getLiveNodes is going to return a list of reachable nodes. A moment after returning it, a node could become unreachable. So it's "best effort" and the caller has to deal with failures by trying the other nodes in the list and/or getting a possibly refreshed list.
That's fair. Removed knownNodes and put everything into liveNodes, while always keeping the initial set inside it.
@@ -65,6 +68,8 @@ public void init(List<String> solrUrls) throws Exception {
      urlScheme = solrUrl.startsWith("https") ? "https" : "http";
      try (SolrClient initialClient = getSolrClient(solrUrl)) {
        this.liveNodes = fetchLiveNodes(initialClient);
        this.initialNodes = Set.copyOf(liveNodes);
I think the idea of this JIRA issue is that we'll take the Solr URLs as configured and use them as the initial / backup liveNodes. I think this is a very simple idea to document/understand/implement.
private static JettySolrRunner jettyNode1;
private static JettySolrRunner jettyNode2;
Static fields in tests that refer to Solr should be null'ed out, so there's some burden to them. Here... I think you could avoid these altogether and simply call cluster.getJettySolrRunner(0), which isn't bad!
could add a convenience method if you like.
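For illustration, such a convenience method might look like this (the helper name is an assumption, not from the PR; cluster is the test's MiniSolrCloudCluster):

// Hypothetical helper so the test can drop the static JettySolrRunner fields entirely.
private static JettySolrRunner jetty(int index) {
  return cluster.getJettySolrRunner(index);
}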
}

private void waitForCSPCacheTimeout() throws InterruptedException {
  Thread.sleep(6000);
}
This test should set the system property solr.solrj.cache.timeout.sec to maybe 1ms, and then you can merely sleep for like 100 milliseconds.
The cache timeout setting is in seconds and pulls an int, so I can't make it 1ms. This probably should have had better granularity instead of increments of whole seconds.
I could change cacheTimeout to take the int as milliseconds instead, but that would change the meaning of the system property for people not using the default. So if they set the cache timeout to 5, meaning 5 seconds, it now turns into 5ms if they are not aware of this change.
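Given that granularity, a hedged sketch of the property-based approach in the test (the 1-second value and the @BeforeClass placement are assumptions for illustration, not code from this PR):

// Hypothetical test setup: shrink the cache timeout to 1 second via the system property,
// so the wait only needs to be slightly longer than 1 second instead of 6.
@BeforeClass
public static void shrinkCspCacheTimeout() {
  System.setProperty("solr.solrj.cache.timeout.sec", "1");
}

private void waitForCSPCacheTimeout() throws InterruptedException {
  Thread.sleep(1100); // a bit more than the 1-second timeout set above
}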
@@ -61,6 +67,7 @@ public abstract class BaseHttpClusterStateProvider implements ClusterStateProvider {
  private int cacheTimeout = EnvUtils.getPropertyAsInteger("solr.solrj.cache.timeout.sec", 5);

  public void init(List<String> solrUrls) throws Exception {
    this.initialNodes = getNodeNamesFromSolrUrls(solrUrls);
I think the idea of this JIRA issue is that we'll take the Solr URLs as configured and use them as the initial / backup liveNodes. I think this is a very simple idea to document/understand/implement.
That makes sense. This leaves me with a few questions then. Shouldn't this take a list of URL or URI Java objects to verify actual non-malformed URLs instead of a list of Strings? I created the functions below to convert these Strings into cluster-state node names for liveNodes.
Great idea to do basic malformed-URL checks like this.
public Set<String> getNodeNamesFromSolrUrls(List<String> urls)
    throws URISyntaxException, MalformedURLException {
  Set<String> set = new HashSet<>();
  for (String url : urls) {
    String nodeNameFromSolrUrl = getNodeNameFromSolrUrl(url);
    set.add(nodeNameFromSolrUrl);
  }
  return Collections.unmodifiableSet(set);
}

/** URL to cluster state node name (http://127.0.0.1:12345/solr to 127.0.0.1:12345_solr) */
public String getNodeNameFromSolrUrl(String solrUrl)
    throws MalformedURLException, URISyntaxException {
  URL url = new URI(solrUrl).toURL();
  return url.getAuthority() + url.getPath().replace('/', '_');
}
Not sure if these methods belong here, but something is required to convert the initial set of String URLs into cluster-state node names.
Set<String> liveNodes = new HashSet<>(nodes);
liveNodes.addAll(this.initialNodes);
this.liveNodes = Set.copyOf(liveNodes);
Why is this three lines of extra copying instead of nothing more than this.liveNodes = Set.copyOf(nodes);? That is, why are we touching / using initialNodes at all here?
Maybe throw an IllegalArgumentException if nodes.isEmpty().
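For illustration, the simpler setter being suggested here might look like the following (the enclosing method name is assumed; only the three quoted lines appear in the diff):

// Hypothetical simplification: trust the fetched nodes directly instead of merging initialNodes in.
private void setLiveNodes(Set<String> nodes) {
  if (nodes.isEmpty()) {
    throw new IllegalArgumentException("liveNodes must not be empty");
  }
  this.liveNodes = Set.copyOf(nodes);
}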
My idea was to avoid another loop through a second set if liveNodes was exhausted; initialNodes would always exist in liveNodes with this set method. Let me refactor again based on your suggestion.
Set<String> set = new HashSet<>();
for (String url : urls) {
  String nodeNameFromSolrUrl = getNodeNameFromSolrUrl(url);
  set.add(nodeNameFromSolrUrl);
}
return Collections.unmodifiableSet(set);
Could easily be converted to a single expression with streams
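For illustration, that streams version could look like this; it assumes getNodeNameFromSolrUrl is reworked to throw an unchecked exception on malformed input, since the checked exceptions declared above don't compose cleanly inside map():

// import java.util.stream.Collectors;
public Set<String> getNodeNamesFromSolrUrls(List<String> urls) {
  return urls.stream()
      .map(this::getNodeNameFromSolrUrl) // assumes unchecked exceptions (see note above)
      .collect(Collectors.toUnmodifiableSet());
}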
/** URL to cluster state node name (http://127.0.0.1:12345/solr to 127.0.0.1:12345_solr) */
public String getNodeNameFromSolrUrl(String solrUrl)
    throws MalformedURLException, URISyntaxException {
  URL url = new URI(solrUrl).toURL();
  return url.getAuthority() + url.getPath().replace('/', '_');
}
can you find other code in Solr doing this? Surely it's somewhere.
Should be a static method with a unit test.
Hmm, maybe... I found getBaseUrlForNodeName, but it's going the other way (nodeName -> URL). Let me look around more.
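For illustration, a minimal unit test along those lines, assuming the conversion is made a static method on BaseHttpClusterStateProvider (the test class name here is hypothetical):

import static org.junit.Assert.assertEquals;

import org.junit.Test;

public class GetNodeNameFromSolrUrlTest {
  @Test
  public void testUrlToNodeName() throws Exception {
    // mirrors the javadoc example: http://127.0.0.1:12345/solr -> 127.0.0.1:12345_solr
    assertEquals(
        "127.0.0.1:12345_solr",
        BaseHttpClusterStateProvider.getNodeNameFromSolrUrl("http://127.0.0.1:12345/solr"));
  }
}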
@@ -229,10 +236,9 @@ > getCacheTimeout()) {
      for (String nodeName : liveNodes) {
If we exhaust liveNodes, shouldn't we then try the initial configured nodes, and then only failing that, throw an exception?
A simplified characterization of how I think this should be done:
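For illustration only, one possible shape of that fallback (the fetchClusterState(client) helper is an assumption, and the field names are taken from the diff snippets above; this is not the reviewer's actual sketch):

// Hypothetical fallback: try the freshest liveNodes first, then the initially configured
// nodes, and only throw once every candidate has failed.
Set<String> candidates = new LinkedHashSet<>(liveNodes);
candidates.addAll(initialNodes); // configured URLs act as a backup, not as permanent liveNodes
for (String nodeName : candidates) {
  String baseUrl = Utils.getBaseUrlForNodeName(nodeName, urlScheme);
  try (SolrClient client = getSolrClient(baseUrl)) {
    return fetchClusterState(client); // assumed helper that queries the node for cluster state
  } catch (Exception e) {
    // this node was unreachable; fall through to the next candidate
  }
}
throw new SolrServerException("Could not reach any live node or initially configured node");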
https://issues.apache.org/jira/browse/SOLR-17519
Description

In BaseHttpClusterStateProvider, if all initially passed nodes except one go down, the CSP will only be aware of that one live node. If that node then goes down and any of the other initially passed nodes recover, the CSP cannot connect to get the latest cluster state of live nodes, because it only holds the address of the now-downed node.

Solution

The CSP holds an immutable initialNodes set, which is never removed from live nodes, to keep the provider resilient. If the case above occurs, the CSP can still get cluster state and the latest set of live nodes, because live nodes always contains the initial set. Live nodes can also hold new nodes that are added to the cluster after CSP initialization, but those are removable, unlike the initial nodes.
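A minimal sketch of the mechanism described above, using the field names seen in the diff snippets earlier in this PR (not a copy of the final code):

// initialNodes: node names derived from the configured Solr URLs; never dropped.
// liveNodes: the latest fetched live nodes, always kept as a superset of initialNodes.
private Set<String> initialNodes;
private volatile Set<String> liveNodes;

private void updateLiveNodes(Set<String> fetchedNodes) {
  Set<String> merged = new HashSet<>(fetchedNodes);
  merged.addAll(initialNodes); // the initial nodes stay even if the fetched state omits them
  this.liveNodes = Set.copyOf(merged);
}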
Tests

testClusterStateProviderDownedInitialLiveNodes tests the case above and used to fail.
testClusterStateProviderLiveNodesWithNewHost tests live nodes with a third host added after the CSP is initialized; that new host is removable, but the initial nodes are always still present and reachable by the CSP.

Checklist
Please review the following and check all that apply:
I have developed this patch against the main branch.
I have run ./gradlew check.