[BTS-1644] ResignLeadership + Wait #21401

Open. Wants to merge 5 commits into base: devel

maierlars
Contributor

@maierlars maierlars commented Oct 28, 2024

Scope & Purpose

Added waitForInSync and waitForInSyncTimeout parameters to the resign leadership job. They let the user wait for a given amount of time to make sure that common in-sync followers exist. Previously they were simply ignored, which could cause downtime.

Design Doc: https://github.com/arangodb/documents/pull/136

@maierlars maierlars self-assigned this Oct 28, 2024
@cla-bot cla-bot bot added the cla-signed label Oct 28, 2024
@@ -337,7 +344,28 @@ bool ResignLeadership::start(bool& aborts) {

// Schedule shard relocations
if (!scheduleMoveShards(pending)) {
finish("", "", false, "Could not schedule MoveShard.");
LOG_TOPIC("d4473", DEBUG, Logger::SUPERVISION)
Member

I would make this a "WARN" log level, according to our policy.

Member

Maybe "INFO", but it should show up in the logs whenever it happens in production, IMHO.

Contributor Author

I figured it is enough to display this message when the job actually fails (see further down). Furthermore, scheduleMoveShards already creates log messages that explain why it won't start yet.

@maierlars maierlars marked this pull request as ready for review October 29, 2024 13:40
Contributor

@jvolmer jvolmer left a comment

I've got two type-improvement suggestions, otherwise LGTM.

<< "Not starting resign leadership job because some shards have no "
"common in sync follower";
// check if a timeout value is specified
if (_waitForInSyncTimeout > 0) {
Contributor

Do I understand correctly that if _waitForInSyncTimeout == 0, there is no timeout and we can wait indefinitely? This is a bit confusing, because a timeout of 0 normally means "time out immediately".
Possible fix: we could use an optional instead of a plain number, so that if the snapshot velocypack does not include waitForInSyncTimeout, the option is std::nullopt.
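
A minimal sketch of the optional-based idea, to make the two meanings explicit. It is not code from this PR: the alias `Seconds` and the helpers `parseWaitForInSyncTimeout` and `waitExpired` are hypothetical names, and the raw value is assumed to have been read from the job description already (no velocypack calls are shown).

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <optional>

// Hypothetical alias, not taken from the PR.
using Seconds = std::chrono::seconds;

// Hypothetical helper: `raw` is std::nullopt when the attribute
// waitForInSyncTimeout is absent from the job description.
inline std::optional<Seconds> parseWaitForInSyncTimeout(
    std::optional<std::uint64_t> raw) {
  if (!raw.has_value()) {
    return std::nullopt;  // attribute missing: wait without a deadline
  }
  // An explicit 0 now means "time out immediately", as one would expect.
  return Seconds{static_cast<Seconds::rep>(*raw)};
}

// Returns true once the job has waited at least as long as the timeout.
// Without a timeout the wait never expires.
inline bool waitExpired(std::optional<Seconds> timeout, Seconds waited) {
  assert(waited.count() >= 0);
  return timeout.has_value() && waited >= *timeout;
}
```

With this shape, the check `_waitForInSyncTimeout > 0` would turn into a test for `has_value()`, and "no timeout" no longer shares a representation with "zero seconds".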

@@ -55,6 +55,8 @@ struct ResignLeadership : public Job {

std::string _server;
bool _undoMoves{true};
bool _waitForInSync{false};
uint64_t _waitForInSyncTimeout{30 * 60};
Contributor

30 minutes sounds like a lot to me, but perhaps this is correct.
This value is actually set to 0 by default in the ResignLeadership constructor, so I guess it would be more readable to set it to zero here as well.

Contributor

Another thing: it would be nice to make it visible that the two belong together. As far as I understand, _waitForInSyncTimeout does not make any sense if _waitForInSync is false.
E.g. via a std::variant<bool, std::tuple<bool, TimeoutInSec>>, or, combined with the comment above, a std::variant<bool, std::tuple<bool, std::optional<TimeoutInSec>>>
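
One way this grouping could look, as a sketch only. It deliberately swaps the literal `std::variant<bool, std::tuple<...>>` for a variant over two named structs, which expresses the same constraint (a timeout can only exist when waiting is enabled). `DontWait`, `WaitForInSync`, `InSyncPolicy`, `shouldWait` and `deadlineReached` are hypothetical names, and `TimeoutInSec` is assumed here to be `std::chrono::seconds`.

```cpp
#include <cassert>
#include <chrono>
#include <optional>
#include <variant>

// Assumption: TimeoutInSec is a plain seconds duration.
using TimeoutInSec = std::chrono::seconds;

// Hypothetical alternatives replacing the two separate members.
struct DontWait {};
struct WaitForInSync {
  std::optional<TimeoutInSec> timeout;  // std::nullopt: no deadline
};

// Default-constructs to DontWait, mirroring `_waitForInSync{false}`.
using InSyncPolicy = std::variant<DontWait, WaitForInSync>;

inline bool shouldWait(InSyncPolicy const& policy) {
  return std::holds_alternative<WaitForInSync>(policy);
}

// True only if waiting is enabled, a timeout is set, and it has elapsed.
inline bool deadlineReached(InSyncPolicy const& policy, TimeoutInSec waited) {
  assert(waited.count() >= 0);
  auto const* wait = std::get_if<WaitForInSync>(&policy);
  return wait != nullptr && wait->timeout.has_value() &&
         waited >= *wait->timeout;
}
```

The point of the design is that a state like "not waiting, but with a 30 minute timeout" cannot be represented at all, so readers do not have to work out which field takes precedence.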

3 participants