Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug/rescan datarace #820

Closed
wants to merge 19 commits into from
Closed

Conversation

MStreet3
Copy link
Contributor

@MStreet3 MStreet3 commented Oct 15, 2022

Background

There is datarace in the neutrino client implementation that
is causing test flakes in lnd. Specifically, there was a shared
struct property rescan whose updates resulted in a data race with
the possibility of a panic when rescan was nil. Further investigation
revealed race conditions on the rescanQuit and rescanErr
props of the client as well.

Changes

This PR modifies the design of the neutrino client to better reflect
the requirement that only a single Rescan object instance be
running at a time. To that end, this PR removes the rescan property
on the client and replaces it with two buffered channels (cap = 1), one each for
a rescanner and the rescanner's signal channel. In this way there is no
need to maintain the state scanning. Now any method can
infer that the client is scanning if it is able to read a scanner from the channel.

This PR resolves a data race on the rescanErr channel via a
new error chan consumer method that takes all new error channels
from rescanner restarts and feeds any read errors onto rescanErr.

Testing

This PR adds two interfaces rescan.Interface and NeutrinoChainService
to abstract out the dependency on structs from the neutrino package.

This PR adds a file of mock implementations for these interfaces and uses them
in a new unit test file. The unit tests added are not intended to capture all
the functionality of the client, but to demonstrate that the data races are resolved
and these tests should be run with the -race flag enabled.

Reviewing

  1. Checkout 93ca685 and run tests.
    TestNeutrinoClientNotifyReceivedRescan will fail with a segmentation fault
    in the neutrino client due torescan being set to nil.

  2. Checkout 6e1d604 and verify that
    the segmentation fault is fixed and TestNeutrinoClientNotifyReceivedRescan
    passes, but that -race flag testing fails.

  3. Checkout 56af455 run tests and
    verify that TestNeutrinoClientNotifyReceivedRescan passes with -race flag enabled.

All changes after 56af455 are cosmetic and
cleanup related.

fixes #819

@MStreet3 MStreet3 force-pushed the bug/rescan-datarace branch 5 times, most recently from 05658d0 to 00ea81b Compare October 16, 2022 02:33
@MStreet3 MStreet3 marked this pull request as draft October 16, 2022 02:56
@MStreet3 MStreet3 force-pushed the bug/rescan-datarace branch from 5edb8ab to acea149 Compare October 16, 2022 03:52
@MStreet3 MStreet3 marked this pull request as ready for review October 16, 2022 05:42
@MStreet3 MStreet3 force-pushed the bug/rescan-datarace branch 7 times, most recently from f2b6301 to 4ae09a5 Compare October 17, 2022 15:01
Copy link
Collaborator

@guggero guggero left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for your work on this! The rescan logic is definitely something that needs to be cleaned up.
But I have to admit, this is very hard to review. All the changes are in a single commit so it's almost impossible to reason about everything at the same time. Even with the new tests added that makes it very hard to make sure no dead lock or new data races are introduced with the changes. And that the formatting doesn't match the rest of the code base doesn't really help the review either.

Can you please split things up a bit more for easier review? Thanks!

chain/mocks_test.go Outdated Show resolved Hide resolved
chain/mocks_test.go Show resolved Hide resolved
chain/mocks_test.go Outdated Show resolved Hide resolved
chain/mocks_test.go Outdated Show resolved Hide resolved
chain/neutrino.go Outdated Show resolved Hide resolved
chain/neutrino.go Outdated Show resolved Hide resolved

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
Add a NeutrinoChainService interface to provide an abstract
implementation of all the methods exported by a *neutrino.ChainService.
Add an internal rescan package for use in the neutrino client to
allow abstract access to the public methods of a *neutrino.Rescan.
Add mock implementations for NeutrinoChainService and the rescan
Interface.
Add TestNeutrinoClientSequentialStartStop to verify that the
neutrino client can have Start() and Stop() called sequentially.
Test case fails with -race flag enabled.
Resolve the data race in the existing unit test by removing the
variable aliasing on s.rescanErr in notification handler
and by only writing ClientConnected notification after the notification
handler is running.

Write ClientConnected notification via a select statement as a
goroutine is unnecessary to send to the unbounded queue.
Add TestNeutrinoClientNotifyReceived to verify that calls to
NotifyReceived call the Update method of the rescan object.  Verify
that there is no race condition on any state variables of the
NeutrinoClient.  Pass test with -race flag enabled.

Add a newRescanner private property of the NeutrinoClient to
dynamically inject the rescan constructor.  Cannot test without dyanmic
dependency injection as the neutrino client previously expected a
*neutrino.ChainService struct to create the rescan object.  Test
with a mockRescan object requires that this assumption be overridden
for unit tests.
Move transform of outpoints to a separate method to make Rescan
method easier to read.
Add a unit test that demonstrates a segmentation fault exists when
concurrent calls to NotifyReceived, NotifyBlocks and Rescan methods of
the NeutrinoClient are executed.  The rescan property is set to nil
and a segmentation fault arises.
Fix segmentation fault by using a channel to share the single
instance of the rescan object.  TestNeutrinoClientNotifyReceivedRescan
passes, but fails with -race flag.

Remove rescan and scanning properties and modify test to
use rescanCh.
Use a channel of channels to safely share the proper channel to close
for quitting the running rescan.  Add rescanQuit as a parameter to each
of the notification handlers.  Wrap the notification handlers as
closures and pass the single quit channel to the handlers.
Resolve the data race on s.rescanErr by setting it only once in the
constructor and by consuming the error channels returned from starting
a rescan process.
Use createRescanner method to create the rescan and rescanQuit
channels in a single place.
@MStreet3 MStreet3 force-pushed the bug/rescan-datarace branch from 4eeed4d to a21bc67 Compare October 22, 2022 06:09
@MStreet3 MStreet3 marked this pull request as draft October 24, 2022 13:12
@MStreet3
Copy link
Contributor Author

Converting to draft state to address comments from @guggero

@MStreet3
Copy link
Contributor Author

See #822

@MStreet3 MStreet3 closed this Oct 26, 2022
@MStreet3 MStreet3 deleted the bug/rescan-datarace branch December 2, 2022 12:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

NeutrinoClient data race
2 participants