Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Delay lock taking slightly #6627

Conversation

eddyashton
Copy link
Member

@eddyashton eddyashton commented Nov 11, 2024

Pulling standalone fixes out of #6616 to be merged separately.

Hidden by our suppressions file, but the snapshot_test unit test spots a lock inversion:

$ ./snapshot_test 
[doctest] doctest version is "2.4.11"
[doctest] run with "--help" for options
==================
WARNING: ThreadSanitizer: lock-order-inversion (potential deadlock) (pid=63651)
  Cycle in lock order graph: M0 (0x7b4400000be8) => M1 (0x7ffcc1b2f898) => M0

  Mutex M1 acquired here while holding mutex M0 in main thread:
    #0 pthread_mutex_lock <null> (snapshot_test+0x8051a) (BuildId: ba8a8b25dd81ecc61267ef6b9ef050dfb1148ded)
    #1 std::__1::mutex::lock() <null> (libc++.so.1+0x4af15) (BuildId: e3dee72a81fed73680e4d05b6858c5327d95f499)
    #2 ccf::kv::BaseTx::get_map_and_change_set_by_name(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, bool) /data/src/3.CCF/build.tsan/../src/kv/tx.cpp:64:21 (snapshot_test+0x1ec439) (BuildId: ba8a8b25dd81ecc61267ef6b9ef050dfb1148ded)
    #3 ccf::kv::ValueHandle<std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, ccf::kv::serialisers::BlitSerialiser<std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>>, ccf::kv::serialisers::ZeroBlitUnitCreator>* ccf::kv::BaseTx::get_handle_by_name<ccf::kv::ValueHandle<std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, ccf::kv::serialisers::BlitSerialiser<std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>>, ccf::kv::serialisers::ZeroBlitUnitCreator>>(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, bool) /data/src/3.CCF/build.tsan/../include/ccf/tx.h:97:43 (snapshot_test+0x1add77) (BuildId: ba8a8b25dd81ecc61267ef6b9ef050dfb1148ded)
    #4 ccf::kv::TypedValue<std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, ccf::kv::serialisers::BlitSerialiser<std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>>, ccf::kv::serialisers::ZeroBlitUnitCreator>::ReadOnlyHandle* ccf::kv::ReadOnlyTx::ro<ccf::kv::TypedValue<std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, ccf::kv::serialisers::BlitSerialiser<std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>>, ccf::kv::serialisers::ZeroBlitUnitCreator>>(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) /data/src/3.CCF/build.tsan/../include/ccf/tx.h:186:14 (snapshot_test+0x1713aa) (BuildId: ba8a8b25dd81ecc61267ef6b9ef050dfb1148ded)
    #5 ccf::HashedTxHistory<ccf::MerkleTreeHistory>::init_from_snapshot(std::__1::vector<unsigned char, std::__1::allocator<unsigned char>> const&) /data/src/3.CCF/build.tsan/../src/node/history.h:682:33 (snapshot_test+0x1713aa)
    #6 ccf::kv::Store::deserialise_snapshot(unsigned char const*, unsigned long, std::__1::vector<std::__1::unique_ptr<ccf::kv::ConsensusHook, std::__1::default_delete<ccf::kv::ConsensusHook>>, std::__1::allocator<std::__1::unique_ptr<ccf::kv::ConsensusHook, std::__1::default_delete<ccf::kv::ConsensusHook>>>>&, std::__1::vector<unsigned long, std::__1::allocator<unsigned long>>*, bool) /data/src/3.CCF/build.tsan/../src/kv/store.h:514:17 (snapshot_test+0x146349) (BuildId: ba8a8b25dd81ecc61267ef6b9ef050dfb1148ded)
    #7 DOCTEST_ANON_FUNC_14() /data/src/3.CCF/build.tsan/../src/node/test/snapshot.cpp:125:7 (snapshot_test+0xf4b67) (BuildId: ba8a8b25dd81ecc61267ef6b9ef050dfb1148ded)
    #8 doctest::Context::run() /data/src/3.CCF/build.tsan/../3rdparty/test/doctest/doctest.h:7007:21 (snapshot_test+0xf0de6) (BuildId: ba8a8b25dd81ecc61267ef6b9ef050dfb1148ded)
    #9 main /data/src/3.CCF/build.tsan/../src/node/test/snapshot.cpp:186:21 (snapshot_test+0xf7090) (BuildId: ba8a8b25dd81ecc61267ef6b9ef050dfb1148ded)

    Hint: use TSAN_OPTIONS=second_deadlock_stack=1 to get more informative warning message

  Mutex M0 acquired here while holding mutex M1 in main thread:
    #0 pthread_mutex_lock <null> (snapshot_test+0x8051a) (BuildId: ba8a8b25dd81ecc61267ef6b9ef050dfb1148ded)
    #1 std::__1::mutex::lock() <null> (libc++.so.1+0x4af15) (BuildId: e3dee72a81fed73680e4d05b6858c5327d95f499)
    #2 ccf::kv::Store::rollback(ccf::kv::TxID const&, unsigned long) /data/src/3.CCF/build.tsan/../src/kv/store.h:617:14 (snapshot_test+0x14c7bc) (BuildId: ba8a8b25dd81ecc61267ef6b9ef050dfb1148ded)
    #3 ccf::kv::Store::fill_maps(std::__1::vector<unsigned char, std::__1::allocator<unsigned char>> const&, bool, unsigned long&, unsigned long&, std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, ccf::kv::MapChanges, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const, ccf::kv::MapChanges>>>&, std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::shared_ptr<ccf::kv::AbstractMap>, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const, std::__1::shared_ptr<ccf::kv::AbstractMap>>>>&, ccf::ClaimsDigest&, std::__1::optional<ccf::crypto::Sha256Hash>&, bool) /data/src/3.CCF/build.tsan/../src/kv/store.h:734:7 (snapshot_test+0x150b2b) (BuildId: ba8a8b25dd81ecc61267ef6b9ef050dfb1148ded)
    #4 non-virtual thunk to ccf::kv::Store::fill_maps(std::__1::vector<unsigned char, std::__1::allocator<unsigned char>> const&, bool, unsigned long&, unsigned long&, std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, ccf::kv::MapChanges, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const, ccf::kv::MapChanges>>>&, std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::shared_ptr<ccf::kv::AbstractMap>, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const, std::__1::shared_ptr<ccf::kv::AbstractMap>>>>&, ccf::ClaimsDigest&, std::__1::optional<ccf::crypto::Sha256Hash>&, bool) /data/src/3.CCF/build.tsan/../src/kv/store.h (snapshot_test+0x152813) (BuildId: ba8a8b25dd81ecc61267ef6b9ef050dfb1148ded)
    #5 ccf::kv::CFTExecutionWrapper::apply(bool) /data/src/3.CCF/build.tsan/../src/kv/deserialise.h:83:19 (snapshot_test+0x1bed1b) (BuildId: ba8a8b25dd81ecc61267ef6b9ef050dfb1148ded)
    #6 DOCTEST_ANON_FUNC_14() /data/src/3.CCF/build.tsan/../src/node/test/snapshot.cpp:171:48 (snapshot_test+0xf5c64) (BuildId: ba8a8b25dd81ecc61267ef6b9ef050dfb1148ded)
    #7 doctest::Context::run() /data/src/3.CCF/build.tsan/../3rdparty/test/doctest/doctest.h:7007:21 (snapshot_test+0xf0de6) (BuildId: ba8a8b25dd81ecc61267ef6b9ef050dfb1148ded)
    #8 main /data/src/3.CCF/build.tsan/../src/node/test/snapshot.cpp:186:21 (snapshot_test+0xf7090) (BuildId: ba8a8b25dd81ecc61267ef6b9ef050dfb1148ded)

SUMMARY: ThreadSanitizer: lock-order-inversion (potential deadlock) (/data/src/3.CCF/build.tsan/snapshot_test+0x8051a) (BuildId: ba8a8b25dd81ecc61267ef6b9ef050dfb1148ded) in pthread_mutex_lock
==================
===============================================================================
[doctest] test cases:  1 |  1 passed | 0 failed | 0 skipped
[doctest] assertions: 13 | 13 passed | 0 failed |
[doctest] Status: SUCCESS!
ThreadSanitizer: reported 1 warnings

During rollback we hold the maps_lock, and eventually take the history.state_lock to touch the History's state. In the init_from_snapshot path, we take the history.state_lock very early and then try to read from the KV (which internally requires taking maps_lock to establish which tables exist). This causes a lock cycle.

I think the fix is simple and uncontroversial - delay taking the lock in the init_from_snapshot path until after the KV read. This mutex is only supposed to protect the History's internal state, so arbitrary work may come before it. Also, the KV read should be monotonic.

@eddyashton eddyashton added 5.x-todo PRs which should be backported to 5.x run-long-test Run Long Test job labels Nov 11, 2024
@eddyashton eddyashton requested a review from a team November 11, 2024 15:50
@achamayou achamayou enabled auto-merge November 11, 2024 16:52
@achamayou achamayou added this pull request to the merge queue Nov 11, 2024
Merged via the queue into microsoft:main with commit d8020c0 Nov 11, 2024
13 checks passed
@achamayou achamayou deleted the delay_lock_until_afer_read_during_init_from_snapshot branch November 11, 2024 18:02
@eddyashton eddyashton added the auto-backport Automatically backport this PR to LTS branch label Nov 12, 2024
ghost pushed a commit that referenced this pull request Nov 12, 2024
Co-authored-by: Amaury Chamayou <[email protected]>
(cherry picked from commit d8020c0)
@ghost ghost added the backported This PR was successfully backported to LTS branch label Nov 12, 2024
achamayou pushed a commit that referenced this pull request Nov 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5.x-todo PRs which should be backported to 5.x auto-backport Automatically backport this PR to LTS branch backported This PR was successfully backported to LTS branch run-long-test Run Long Test job
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants