
refactor: Change recursive_mutex to mutex in DatabaseRotatingImp #5276

Open · wants to merge 12 commits into base: develop
Conversation

Collaborator

@ximinez ximinez commented Feb 4, 2025

High Level Overview of Change

Follow-up to #4989, which stated "Ideally, the code should be rewritten so it doesn't hold the mutex during the callback and the mutex should be changed back to a regular mutex."

This rewrites the code so that the lock is not held during the callback. Instead, it locks twice: once before the callback and once after. This is safe given the structure of the code, and that assumption is verified after the second lock is taken. This allows mutex_ to be changed back to a regular mutex.
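For illustration, a minimal sketch of the new shape of the code (simplified and hypothetical, not the actual implementation; member names follow the real code):

    // Sketch only: condensed from the approach described above.
    bool
    DatabaseRotatingImp::rotateWithLock(
        std::function<std::unique_ptr<Backend>(std::string const&)> const& f)
    {
        std::shared_ptr<Backend> writable;
        {
            std::lock_guard lock(mutex_);  // first lock: snapshot state
            writable = writableBackend_;
        }

        // The callback runs with no lock held, so it may safely call
        // functions (e.g. fetchNodeObject) that take mutex_ themselves.
        auto newBackend = f(writable->getName());

        std::lock_guard lock(mutex_);  // second lock: commit the rotation
        if (writable != writableBackend_)
            return false;  // another rotation intervened; don't commit
        archiveBackend_ = std::move(writableBackend_);
        writableBackend_ = std::move(newBackend);
        return true;
    }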

Context of Change

From #4989:

The rotateWithLock function holds a lock while it calls a callback function that's passed in by the caller. This is a problematic design that needs to be used very carefully. In this case, at least one caller passed in a callback that eventually relocks the mutex on the same thread, causing UB (a deadlock was observed). The caller was from SHAMapStoreImpl, and it called clearCaches. This clearCaches can potentially call fetchNodeObject, which then tries to relock the mutex.

This patch resolves the issue by changing the mutex type to a recursive_mutex. Ideally, the code should be rewritten so it doesn't hold the mutex during the callback and the mutex should be changed back to a regular mutex.
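The failure mode, in miniature (hypothetical names; relocking a std::mutex on the thread that already holds it is undefined behavior, observed here as a deadlock):

    #include <mutex>

    std::mutex m;

    void fetchNodeObject()        // stand-in for the real call chain
    {
        std::lock_guard lock(m);  // relocks m on the same thread: UB
    }

    void rotateWithLock()
    {
        std::lock_guard lock(m);  // lock held during the callback...
        fetchNodeObject();        // ...which re-enters and deadlocks
    }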

Type of Change

  • Refactor (non-breaking change that only restructures code)

Test Plan

Testing can be the same as for #4989, plus ensuring that there are no regressions.

- Follow-up to #4989, which stated "Ideally, the code should be
  rewritten so it doesn't hold the mutex during the callback and the
  mutex should be changed back to a regular mutex."

codecov bot commented Feb 4, 2025

Codecov Report

Attention: Patch coverage is 80.76923% with 5 lines in your changes missing coverage. Please review.

Project coverage is 78.2%. Comparing base (a079bac) to head (8b90537).

Files with missing lines | Patch % | Lines
src/xrpld/nodestore/detail/DatabaseRotatingImp.cpp | 73.3% | 4 Missing ⚠️
src/xrpld/app/misc/SHAMapStoreImp.cpp | 90.9% | 1 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff            @@
##           develop   #5276     +/-   ##
=========================================
- Coverage     78.2%   78.2%   -0.0%     
=========================================
  Files          790     790             
  Lines        67639   67646      +7     
  Branches      8160    8163      +3     
=========================================
- Hits         52869   52867      -2     
- Misses       14770   14779      +9     
Files with missing lines | Coverage Δ
src/xrpld/nodestore/DatabaseRotating.h | 100.0% <ø> (ø)
src/xrpld/nodestore/detail/DatabaseRotatingImp.h | 66.7% <ø> (ø)
src/xrpld/app/misc/SHAMapStoreImp.cpp | 76.3% <90.9%> (-0.8%) ⬇️
src/xrpld/nodestore/detail/DatabaseRotatingImp.cpp | 67.0% <73.3%> (+6.3%) ⬆️

... and 4 files with indirect coverage changes


* Use a second mutex to protect the backends from modification
* Remove a bunch of warning comments
@ximinez ximinez requested a review from Bronek February 6, 2025 00:01
Comment on lines 83 to 85
// backendMutex_ is only needed when the *Backend_ members are modified.
// Reads are protected by the general mutex_.
std::mutex backendMutex_;
Collaborator

As this sounds like a typical single-write and one-or-more-read scenario, is it possible to use a single shared_mutex here instead of these two mutexes?

Collaborator Author

It's possible, but there are risks. The biggest one is that I'd have to take a shared_lock at the start of rotateWithLock, and upgrade it to a unique_lock after the callback. If there is somehow ever a second caller to that function, or even a different caller that upgrades the lock, there is a potential deadlock.
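In sketch form (hypothetical; std::shared_mutex has no upgrade operation, so a naive "upgrade" has to acquire the unique lock while still holding the shared one):

    #include <mutex>
    #include <shared_mutex>

    std::shared_mutex m;

    void rotate()
    {
        std::shared_lock read(m);  // shared ownership during the callback
        // ... callback ...

        // Naive "upgrade": wait for exclusive ownership while still
        // holding the shared lock. If a second thread does the same,
        // each unique_lock waits for the other's shared_lock to be
        // released, which never happens: deadlock. (Even one thread
        // deadlocks against itself here.)
        std::unique_lock write(m);
    }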

Collaborator Author

@bthomee @vvysokikh1 Ok, it took waaaaaaay longer than it should have because I kept trying clever things that didn't work or turned out to be unsupported, but I rewrote the locking, and changed to a shared mutex, and I think I've got a pretty foolproof solution here. And a unit test to exercise it.

But don't take my word for it. The point of code reviews is to spot the stuff I didn't consider.

Collaborator

@vvysokikh1 vvysokikh1 left a comment

I think your solution does not completely solve the issue. It's still technically possible to deadlock: calling rotateWithLock from inside the callback will deadlock on your new mutex.

If it's good enough for now, please add comments to rotateWithLock() warning users not to call rotateWithLock(), directly or indirectly, from the callback.

* upstream/develop:
  Updates Conan dependencies (5256)
- Rewrite the locking in DatabaseRotatingImp::rotateWithLock to use
  a shared_lock, and write a unit test to show (as much as possible)
  that it won't deadlock.
* upstream/develop:
  fix: Do not allow creating Permissioned Domains if credentials are not enabled (5275)
  fix: issues in `simulate` RPC (5265)
Comment on lines 55 to 70
std::unique_lock writeLock(mutex_);
if (!rotating)
{
// Once this flag is set, we're committed to doing the work and
// returning true.
rotating = true;
}
else
{
// This should only be reachable through unit tests.
XRPL_ASSERT(
unitTest_,
"ripple::NodeStore::DatabaseRotatingImp::rotateWithLock "
"unit testing");
return false;
}
Collaborator

Why do we need to lock the mutex here? I would assume we can make rotating an atomic bool and use compare_exchange to switch this flag safely.
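A sketch of that suggestion (hypothetical, assuming rotating becomes a std::atomic<bool> member):

    #include <atomic>

    std::atomic<bool> rotating_{false};

    bool tryStartRotation()
    {
        bool expected = false;
        // Atomically set the flag only if it was clear. Exactly one
        // concurrent caller wins; the rest fail without taking a mutex.
        return rotating_.compare_exchange_strong(expected, true);
    }

    // When the rotation finishes (or fails), clear the flag:
    // rotating_.store(false);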

Collaborator Author

Why do we need to lock the mutex here? I would assume we can make rotating an atomic bool and use compare_exchange to switch this flag safely.

I got rid of rotating entirely.

Comment on lines 72 to 82
auto const writableBackend = [&] {
std::shared_lock readLock(mutex_);
XRPL_ASSERT(
rotating,
"ripple::NodeStore::DatabaseRotatingImp::rotateWithLock rotating "
"flag set");

return writableBackend_;
}();

auto newBackend = f(writableBackend->getName());
Collaborator

I don't think this lambda and read lock are actually required in the current implementation. We only use the write lock before (which might be switched to an atomic) and after. Assuming the previous synchronization block switches the rotating flag, no other 'write' thread should be able to proceed and capture writeLock while we are here.

src/xrpld/app/misc/SHAMapStoreImp.cpp (resolved)
Comment on lines 64 to 69
// This should only be reachable through unit tests.
XRPL_ASSERT(
unitTest_,
"ripple::NodeStore::DatabaseRotatingImp::rotateWithLock "
"unit testing");
return false;
Collaborator

I think this comment doesn't hold. The branch can be reached not only from unit tests, but also by an accidental concurrent call to rotateWithLock, or an indirect call to rotateWithLock from the callback.

Collaborator Author

I think this comment doesn't hold. The branch can be reached not only from unit tests, but also by an accidental concurrent call to rotateWithLock, or an indirect call to rotateWithLock from the callback.

It will only be intentionally reachable through tests. An accidental concurrent call is not intentional. Will update.

// "Shared mutexes do not support direct transition from shared to unique
// ownership mode: the shared lock has to be relinquished with
// unlock_shared() before exclusive ownership may be obtained with lock()."
mutable std::shared_timed_mutex mutex_;
Collaborator

What is the reason for choosing a timed mutex here? I believe shared_mutex would be enough here.

Collaborator Author

What is the reason for choosing a timed mutex here? I believe shared_mutex would be enough here.

No longer relevant, but I had been working with try_lock_for for a while, and missed changing this back.


// Because of the "rotating" flag, there should be no way any other thread
// is waiting at this point. As long as they all release the shared_lock
// before taking the unique_lock (which they have to, because upgrading is
Collaborator

@bthomee bthomee Feb 10, 2025

For read/write threads, you can also consider using boost::upgrade_lock [1] wrapping a boost::upgrade_mutex [2] here instead of std::shared_lock with a std::shared_(timed)_mutex, as it supports upgrading a read lock to a write lock without unlocking in between. Read-only threads can then use boost::shared_lock while wrapping that same boost::upgrade_mutex.

[1] https://www.boost.org/doc/libs/1_86_0/doc/html/thread/synchronization.html#thread.synchronization.locks.upgrade_lock
[2] https://www.boost.org/doc/libs/1_86_0/doc/html/thread/synchronization.html#thread.synchronization.mutex_types.upgrade_mutex
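A sketch of that pattern (illustrative only; boost::shared_mutex is used here since it models the same UpgradeLockable concept as boost::upgrade_mutex):

    #include <boost/thread/locks.hpp>
    #include <boost/thread/shared_mutex.hpp>

    boost::shared_mutex mutex_;

    void rotate()
    {
        // Only one thread may hold upgrade ownership at a time, but it
        // coexists with shared (read) locks held by other threads.
        boost::upgrade_lock<boost::shared_mutex> upgradeLock(mutex_);
        // ... run the callback while readers continue ...

        // Blocks until all shared locks are released, then converts
        // upgrade ownership to exclusive ownership. Since only one
        // thread can hold upgrade ownership, this cannot deadlock.
        boost::upgrade_to_unique_lock<boost::shared_mutex> writeLock(
            upgradeLock);
        // ... swap the backends ...
    }

    void read()
    {
        boost::shared_lock<boost::shared_mutex> readLock(mutex_);
        // ... fetch ...
    }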

Collaborator Author

Thanks for those links. That's exactly what I was looking for previously, but (obviously) didn't find. I'll update again ASAP.

* upstream/develop:
  fix: Amendment to add transaction flag checking functionality for Credentials (5250)
  fix: Omit superfluous setCurrentThreadName call in GRPCServer.cpp (5280)
- Use a boost::upgrade_mutex, which implements a clean read -> write
  lock upgrade.

boost::upgrade_lock readLock(mutex_, boost::defer_lock);
Collaborator

nit: I would suggest renaming this from readLock (as this is not actually a read lock, but a unique lock that allows other shared_locks). The name is a bit confusing, as it suggests multiple threads can enter, which is not the case.

Collaborator Author

nit: I would suggest renaming this from readLock (as this is not actually a read lock, but a unique lock that allows other shared_locks). The name is a bit confusing, as it suggests multiple threads can enter, which is not the case.

Done. Also added a comment, since a lot of folks are unfamiliar with this class.

savedState.lastRotated = lastRotated;
state_db_.setState(savedState);

clearCaches(validatedSeq);
Collaborator

In line 363, clearCaches(validatedSeq) is already called - is calling it again here needed?

Collaborator Author

In line 363, clearCaches(validatedSeq) is already called - is calling it again here needed?

I believe so because some time could have passed between the two calls, and this helps clear out any entries that were added in that time.

bool const unitTest_;

// Implements the "UpgradeLockable" concept
// https://www.boost.org/doc/libs/1_86_0/doc/html/thread/synchronization.html#thread.synchronization.mutex_concepts.upgrade_lockable
Collaborator

Suggested change
// https://www.boost.org/doc/libs/1_86_0/doc/html/thread/synchronization.html#thread.synchronization.mutex_concepts.upgrade_lockable
// https://www.boost.org/doc/libs/1_83_0/doc/html/thread/synchronization.html#thread.synchronization.mutex_concepts.upgrade_lockable

We're currently using 1.83. Keeping it as-is is fine, but this is simply more accurate.

Collaborator Author

We're currently using 1.83. Keeping it as-is is fine, but this is simply more accurate.

I updated it to link to the latest release because I don't expect it to change much, and don't want to have to chase version numbers.

Collaborator

I really don't know how to say this without sounding like a complete killjoy, but 1.87 is the latest...

Comment on lines +50 to +51
// This function should be the only one taking any kind of unique/write
// lock, and should only be called once at a time by its synchronous caller.
Collaborator

What are the consequences if the synchronous caller calls it multiple times? Can we protect against that, or at least detect it, and would it make sense to do so?

Collaborator Author

What are the consequences if the synchronous caller calls it multiple times? Can we protect against that, or at least detect it, and would it make sense to do so?

If it calls it sequentially, then it'll just rotate multiple times, which would be dumb. If it calls it in parallel, all but one will fail. If it calls it recursively through the callback, the recursive call will fail.

I've added test cases that demonstrate all three of those possibilities, though the focus is on the latter two.

// boost::upgrade_mutex guarantees that only one thread can have "upgrade
// ownership" at a time, so this is 100% safe, and guaranteed to avoid
// deadlock.
boost::upgrade_to_unique_lock writeLock(readLock);
Collaborator

Is there a risk that readers keep arriving such that the shared lock is always held for reading by at least one of them, thereby starving the writer?

Collaborator

Yeah, I think such a situation is theoretically possible. I'm not sure it's actually the case here, but if we decide to prevent it, no standard C++ or Boost lock provides writer fairness. If we decide to ensure it, something like this would be required:

    boost::shared_mutex mutex_;
    std::atomic<bool> writerWaiting_{false};
    
    void anyReadFunction() {
        while(writerWaiting_.load(std::memory_order_acquire)) {
            std::this_thread::yield(); // Back off if writer is waiting
        }
        boost::shared_lock lock(mutex_);
        ...
    }
    
    void rotateWithLock(....) {
        boost::upgrade_lock<boost::shared_mutex> upgradeLock(mutex_);
        ...
        writerWaiting_.store(true, std::memory_order_release); // should do it RAII-way, this is an example 
        boost::upgrade_to_unique_lock<boost::shared_mutex> writeLock(upgradeLock);
        ...
        writerWaiting_.store(false, std::memory_order_release);
    }

Collaborator Author

The description for shared_mutex includes:

Note the lack of reader-writer priority policies in shared_mutex. This is due to an algorithm credited to Alexander Terekhov which lets the OS decide which thread is the next to get the lock without caring whether a unique lock or shared lock is being sought. This results in a complete lack of reader or writer starvation. It is simply fair.

Since upgrade_mutex is an extension, I think it's safe to assume it uses the same algorithm.

Collaborator

@vvysokikh1 vvysokikh1 Feb 12, 2025

I don't think this is applicable. Bart mentions the situation where the writer would never have a chance to acquire the lock, since there are always some read threads holding it (they can always enter, as other read locks won't prevent it).

This is a highly theoretical situation, though; I don't think we can hit it here.

EDIT: looking into the implementation of upgrade_mutex, I can see that this seems to be addressed there. Not an issue.

@Bronek Bronek self-requested a review February 12, 2025 18:05
Collaborator

Bronek commented Feb 12, 2025

I do not like that the code, which is meant to be more robust, has apparently become more brittle through the switch to a shared_lock that needs to be upgraded in the critical section. Maybe I simply do not understand the rationale.

Collaborator

Bronek commented Feb 12, 2025

I think we do not need to pass a lambda to rotateWithLock at all. I do not see a reason for a second call to clearCaches or an update of state_db_ in SHAMapStoreImp::run, both in a critical section controlled by DatabaseRotatingImp. We most likely do not need that second call to clearCaches, and the update of state_db_ could most likely be moved outside of this critical section. See #5286 for what I mean, @ximinez.
