refactor: Change recursive_mutex to mutex in DatabaseRotatingImp #5276
base: develop
Conversation
- Follow-up to #4989, which stated "Ideally, the code should be rewritten so it doesn't hold the mutex during the callback and the mutex should be changed back to a regular mutex."
Codecov Report
Attention: Patch coverage is

Additional details and impacted files

```
@@           Coverage Diff            @@
##           develop    #5276   +/-   ##
=========================================
- Coverage     78.2%    78.2%   -0.0%
=========================================
  Files          790      790
  Lines        67639    67646      +7
  Branches      8160     8163      +3
=========================================
- Hits         52869    52867      -2
- Misses       14770    14779      +9
```
* Use a second mutex to protect the backends from modification
* Remove a bunch of warning comments
```cpp
// backendMutex_ is only needed when the *Backend_ members are modified.
// Reads are protected by the general mutex_.
std::mutex backendMutex_;
```
As this sounds like a typical single-writer, one-or-more-readers scenario, is it possible to use a single shared_mutex here instead of these two mutexes?
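For readers unfamiliar with the pattern being suggested: a single `std::shared_mutex` lets many readers hold the lock concurrently while a writer excludes everyone. A minimal sketch (the `Backends` class and its members are hypothetical, not the PR's actual code):

```cpp
#include <mutex>
#include <shared_mutex>
#include <string>

// Hypothetical illustration of the single-writer / multi-reader pattern.
class Backends
{
    mutable std::shared_mutex mutex_;
    std::string writable_ = "backend-a";

public:
    // Readers take a shared_lock: many can hold it concurrently.
    std::string
    writableName() const
    {
        std::shared_lock lock(mutex_);
        return writable_;
    }

    // The single writer takes a unique_lock, excluding all readers.
    void
    rotate(std::string next)
    {
        std::unique_lock lock(mutex_);
        writable_ = std::move(next);
    }
};
```

The appeal of the suggestion is that one mutex then covers both the reads currently guarded by `mutex_` and the writes guarded by `backendMutex_`.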
It's possible, but there are risks. The biggest one is that I'd have to take a shared_lock at the start of rotateWithLock, and upgrade it to a unique_lock after the callback. If there is somehow ever a second caller to that function, or even a different caller that upgrades the lock, there is a potential deadlock.
@bthomee @vvysokikh1 Ok, it took waaaaaaay longer than it should have because I kept trying clever things that didn't work or turned out unsupported, but I rewrote the locking, and changed to a shared mutex, and I think I've got a pretty foolproof solution here. And a unit test to exercise it.
But don't take my word for it. The point of code reviews is to spot the stuff I didn't consider.
I think your solution doesn't completely solve the issue. It's still technically possible to deadlock: calling rotateWithLock from inside the callback will deadlock on your new mutex.
If it's good enough for now, please add comments to rotateWithLock() warning any user against calling rotateWithLock(), directly or indirectly, from the callback.
* upstream/develop: Updates Conan dependencies (5256)
- Rewrite the locking in DatabaseRotatingImp::rotateWithLock to use a shared_lock, and write a unit test to show (as much as possible) that it won't deadlock.
* upstream/develop: fix: Do not allow creating Permissioned Domains if credentials are not enabled (5275) fix: issues in `simulate` RPC (5265)
```cpp
std::unique_lock writeLock(mutex_);
if (!rotating)
{
    // Once this flag is set, we're committed to doing the work and
    // returning true.
    rotating = true;
}
else
{
    // This should only be reachable through unit tests.
    XRPL_ASSERT(
        unitTest_,
        "ripple::NodeStore::DatabaseRotatingImp::rotateWithLock "
        "unit testing");
    return false;
}
```
Why do we need to lock the mutex here? I would assume we can make `rotating` an atomic bool and use compare_exchange to switch this flag safely.
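The compare_exchange approach being suggested could look roughly like this. This is a free-standing sketch of the reviewer's idea, not the PR's actual code; the function names are hypothetical:

```cpp
#include <atomic>

// Guard re-entry with an atomic flag flipped via compare_exchange
// instead of taking a mutex to test-and-set a plain bool.
std::atomic<bool> rotating{false};

// Returns true if this caller won the right to rotate; false if a
// rotation is already in progress.
bool
tryBeginRotation()
{
    bool expected = false;
    // Succeeds for exactly one caller; concurrent callers observe
    // `rotating == true` and get false back.
    return rotating.compare_exchange_strong(expected, true);
}

void
endRotation()
{
    rotating.store(false);
}
```

The trade-off is that the atomic only protects the flag itself; any other state read or written during rotation still needs its own synchronization.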
> Why do we need to lock the mutex here? I would assume we can make `rotating` an atomic bool and use compare_exchange to switch this flag safely.

I got rid of `rotating` entirely.
```cpp
auto const writableBackend = [&] {
    std::shared_lock readLock(mutex_);
    XRPL_ASSERT(
        rotating,
        "ripple::NodeStore::DatabaseRotatingImp::rotateWithLock rotating "
        "flag set");

    return writableBackend_;
}();

auto newBackend = f(writableBackend->getName());
```
I don't think this lambda and read lock are actually required with the current implementation. We only use the write lock before (which might be switched to an atomic) and after. Assuming the previous synchronization block switches the rotating flag, no other 'write' thread should be able to proceed and capture the writeLock while we are here.
```cpp
// This should only be reachable through unit tests.
XRPL_ASSERT(
    unitTest_,
    "ripple::NodeStore::DatabaseRotatingImp::rotateWithLock "
    "unit testing");
return false;
```
I think this comment doesn't work. It can be reached not only with unit tests, but also by an accidental concurrent call to rotateWithLock, or an indirect call to rotateWithLock from the callback.
> I think this comment doesn't work. It can be reached not only with unit tests, but also by an accidental concurrent call to rotateWithLock, or an indirect call to rotateWithLock from the callback.

It will only be intentionally reachable through tests. An accidental concurrent call is not intentional. Will update.
```cpp
// "Shared mutexes do not support direct transition from shared to unique
// ownership mode: the shared lock has to be relinquished with
// unlock_shared() before exclusive ownership may be obtained with lock()."
mutable std::shared_timed_mutex mutex_;
```
What is the reason for choosing a timed mutex here? I believe shared_mutex would be enough.
> What is the reason for choosing a timed mutex here? I believe shared_mutex would be enough.

No longer relevant, but I had been working with `try_lock_for` for a while, and missed changing this back.
```cpp
// Because of the "rotating" flag, there should be no way any other thread
// is waiting at this point. As long as they all release the shared_lock
// before taking the unique_lock (which they have to, because upgrading is
```
For read/write threads, you can also consider using boost::upgrade_lock [1] wrapping a boost::upgrade_mutex [2] here instead of std::shared_lock with a std::shared_(timed)_mutex, as it supports upgrading a read lock to a write lock without unlocking in between. Read-only threads can then use boost::shared_lock while wrapping that same boost::upgrade_mutex.

[1] https://www.boost.org/doc/libs/1_86_0/doc/html/thread/synchronization.html#thread.synchronization.locks.upgrade_lock
[2] https://www.boost.org/doc/libs/1_86_0/doc/html/thread/synchronization.html#thread.synchronization.mutex_types.upgrade_mutex
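To make the contrast concrete: with only the standard library (no Boost), a thread holding a shared_lock must release it, reacquire as a unique_lock, and then re-validate the state it observed, because another writer may have slipped in during the unlocked gap. `boost::upgrade_lock` removes that gap (and the re-check). A sketch of the standard-library fallback, with illustrative names not taken from the PR:

```cpp
#include <mutex>
#include <shared_mutex>

// Illustrative state: a generation counter bumped by each "rotation".
struct RotatingState
{
    mutable std::shared_mutex mutex_;
    int generation_ = 0;

    int
    read() const
    {
        std::shared_lock readLock(mutex_);
        return generation_;
    }

    // Caller previously read generation_ under a shared_lock and has
    // since released it (std::shared_mutex cannot upgrade in place).
    // Returns false if another writer slipped in during the gap.
    bool
    upgradeAndBump(int seenGeneration)
    {
        std::unique_lock write(mutex_);
        if (generation_ != seenGeneration)
            return false;  // state changed in the gap: caller must retry
        ++generation_;
        return true;
    }
};
```

With `boost::upgrade_mutex`, upgrade ownership is exclusive, so the holder can upgrade atomically and the `seenGeneration` re-check becomes unnecessary.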
Thanks for those links. That's exactly what I was looking for previously, but (obviously) didn't find. I'll update again ASAP.
* upstream/develop: fix: Amendment to add transaction flag checking functionality for Credentials (5250) fix: Omit superfluous setCurrentThreadName call in GRPCServer.cpp (5280)
- Use a boost::upgrade_mutex, which implements a clean read -> write lock upgrade.
```cpp
boost::upgrade_lock readLock(mutex_, boost::defer_lock);
```
nit: I would suggest renaming this from readLock, as this is not actually a read lock but a unique lock that allows other shared_locks. The name is a bit confusing, as it suggests multiple threads can enter, while this is not the case.
> nit: I would suggest renaming this from readLock, as this is not actually a read lock but a unique lock that allows other shared_locks. The name is a bit confusing, as it suggests multiple threads can enter, while this is not the case.

Done. Also added a comment, since a lot of folks are unfamiliar with this class.
```cpp
savedState.lastRotated = lastRotated;
state_db_.setState(savedState);

clearCaches(validatedSeq);
```
In line 363, `clearCaches(validatedSeq)` is already called - is calling it here again needed?
> In line 363, `clearCaches(validatedSeq)` is already called - is calling it here again needed?

I believe so, because some time could have passed between the two calls, and this helps clear out any entries that were added in that time.
```cpp
bool const unitTest_;

// Implements the "UpgradeLockable" concept
// https://www.boost.org/doc/libs/1_86_0/doc/html/thread/synchronization.html#thread.synchronization.mutex_concepts.upgrade_lockable
```
```diff
- // https://www.boost.org/doc/libs/1_86_0/doc/html/thread/synchronization.html#thread.synchronization.mutex_concepts.upgrade_lockable
+ // https://www.boost.org/doc/libs/1_83_0/doc/html/thread/synchronization.html#thread.synchronization.mutex_concepts.upgrade_lockable
```

We're currently using 1.83. Keeping it as-is is fine, but this is simply more accurate.
> We're currently using 1.83. Keeping it as-is is fine, but this is simply more accurate.

I updated it to link to the latest release because I don't expect it to change much, and don't want to have to chase version numbers.
I really don't know how to say this without sounding like a complete killjoy, but 1.87 is the latest...
```cpp
// This function should be the only one taking any kind of unique/write
// lock, and should only be called once at a time by its synchronous caller.
```
What are the consequences if the synchronous caller calls it multiple times? Can we protect against that, or at least detect it, and would it make sense to do so?
> What are the consequences if the synchronous caller calls it multiple times? Can we protect against that, or at least detect it, and would it make sense to do so?

If it calls it sequentially, then it'll just rotate multiple times, which would be dumb. If it calls it in parallel, all but one will fail. If it calls it recursively through the callback, the recursive call will fail.
I've added test cases that demonstrate all three of those possibilities, though the focus is on the latter two.
```cpp
// boost::upgrade_mutex guarantees that only one thread can have "upgrade
// ownership" at a time, so this is 100% safe, and guaranteed to avoid
// deadlock.
boost::upgrade_to_unique_lock writeLock(readLock);
```
Is there a risk that readers keep arriving such that the shared lock is always held for reading by at least one of them, and therefore starving the writer?
Yeah, I think such a situation is theoretically possible. I'm not sure this is actually the case here, but there are no standard C++ or Boost locks that provide writer fairness. If we decide to ensure this, something like the following would be required:

```cpp
boost::shared_mutex mutex_;
std::atomic<bool> writerWaiting_{false};

void anyReadFunction() {
    while (writerWaiting_.load(std::memory_order_acquire)) {
        std::this_thread::yield();  // Back off if a writer is waiting
    }
    boost::shared_lock lock(mutex_);
    // ...
}

void rotateWithLock(/* ... */) {
    boost::upgrade_lock<boost::shared_mutex> upgradeLock(mutex_);
    // ...
    writerWaiting_.store(true, std::memory_order_release);  // should do it the RAII way, this is an example
    boost::upgrade_to_unique_lock<boost::shared_mutex> writeLock(upgradeLock);
    // ...
    writerWaiting_.store(false, std::memory_order_release);
}
```
The description for shared_mutex includes:

> Note the lack of reader-writer priority policies in shared_mutex. This is due to an algorithm credited to Alexander Terekhov which lets the OS decide which thread is the next to get the lock without caring whether a unique lock or shared lock is being sought. This results in a complete lack of reader or writer starvation. It is simply fair.

Since `upgrade_mutex` is an extension, I think it's safe to assume it uses the same algorithm.
I don't think this is applicable. Bart mentions the situation where the writer would never have a chance to acquire the lock, since there are always some read threads holding it (they can always enter, as other read locks won't prevent it).
This is a highly theoretical situation though; I don't think we can hit it here.
EDIT: Looking into the implementation of upgrade_mutex, I can see that this seems to be addressed there. Not an issue.
I do not like that now the code, which is meant to be more robust, has apparently become more brittle by the switch to …
I think that we do not need to pass a lambda to …
High Level Overview of Change
Follow-up to #4989, which stated "Ideally, the code should be rewritten so it doesn't hold the mutex during the callback and the mutex should be changed back to a regular mutex."
This rewrites the code so that the lock is not held during the callback. Instead it locks twice, once before and once after. This is safe due to the structure of the code, but is checked after the second lock. This allows `mutex_` to be changed back to a regular mutex.

Context of Change
From #4989:
Type of Change
Test Plan
Testing can be the same as that for #4989, plus ensure that there are no regressions.
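The "lock twice" shape the description outlines (snapshot state under the lock, run the callback unlocked, then relock to commit) can be sketched as follows. This is a simplified illustration with hypothetical names, not the PR's actual implementation:

```cpp
#include <functional>
#include <mutex>
#include <string>

// Illustrative stand-in for the rotating-database owner.
class RotatingSketch
{
    std::mutex mutex_;
    std::string writable_ = "old";

public:
    std::string
    rotateWithLock(std::function<std::string(std::string const&)> const& f)
    {
        std::string name;
        {
            std::lock_guard lock(mutex_);  // first lock: snapshot state
            name = writable_;
        }
        // The callback runs with no lock held, so it can safely call
        // back into methods that take mutex_ without self-deadlocking.
        auto next = f(name);
        {
            std::lock_guard lock(mutex_);  // second lock: commit rotation
            writable_ = next;
        }
        return next;
    }
};
```

The safety argument in the description rests on the synchronous caller invoking this at most once at a time; the real code verifies its assumptions after the second lock is taken.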