-
Notifications
You must be signed in to change notification settings - Fork 601
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bug/rescan datarace #820
Bug/rescan datarace #820
Conversation
05658d0
to
00ea81b
Compare
5edb8ab
to
acea149
Compare
f2b6301
to
4ae09a5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for your work on this! The rescan logic is definitely something that needs to be cleaned up.
But I have to admit, this is very hard to review. All the changes are in a single commit so it's almost impossible to reason about everything at the same time. Even with the new tests added that makes it very hard to make sure no dead lock or new data races are introduced with the changes. And that the formatting doesn't match the rest of the code base doesn't really help the review either.
Can you please split things up a bit more for easier review? Thanks!
Add a NeutrinoChainService interface to provide an abstract implementation of all the methods exported by a *neutrino.ChainService.
Add an internal rescan package for use in the neutrino client to allow abstract access to the public methods of a *neutrino.Rescan.
Add mock implementations for NeutrinoChainService and the rescan Interface.
Add TestNeutrinoClientSequentialStartStop to verify that the neutrino client can have Start() and Stop() called sequentially. Test case fails with -race flag enabled.
Resolve the data race in the existing unit test by removing the variable aliasing on s.rescanErr in notification handler and by only writing ClientConnected notification after the notification handler is running. Write ClientConnected notification via a select statement as a goroutine is unnecessary to send to the unbounded queue.
Add TestNeutrinoClientNotifyReceived to verify that calls to NotifyReceived call the Update method of the rescan object. Verify that there is no race condition on any state variables of the NeutrinoClient. Pass test with -race flag enabled. Add a newRescanner private property of the NeutrinoClient to dynamically inject the rescan constructor. Cannot test without dyanmic dependency injection as the neutrino client previously expected a *neutrino.ChainService struct to create the rescan object. Test with a mockRescan object requires that this assumption be overridden for unit tests.
Move transform of outpoints to a separate method to make Rescan method easier to read.
Add a unit test that demonstrates a segmentation fault exists when concurrent calls to NotifyReceived, NotifyBlocks and Rescan methods of the NeutrinoClient are executed. The rescan property is set to nil and a segmentation fault arises.
Fix segmentation fault by using a channel to share the single instance of the rescan object. TestNeutrinoClientNotifyReceivedRescan passes, but fails with -race flag. Remove rescan and scanning properties and modify test to use rescanCh.
Use a channel of channels to safely share the proper channel to close for quitting the running rescan. Add rescanQuit as a parameter to each of the notification handlers. Wrap the notification handlers as closures and pass the single quit channel to the handlers.
Resolve the data race on s.rescanErr by setting it only once in the constructor and by consuming the error channels returned from starting a rescan process.
Use createRescanner method to create the rescan and rescanQuit channels in a single place.
4eeed4d
to
a21bc67
Compare
Converting to draft state to address comments from @guggero |
See #822 |
Background
There is datarace in the neutrino client implementation that
is causing test flakes in
lnd
. Specifically, there was a sharedstruct property
rescan
whose updates resulted in a data race withthe possibility of a panic when
rescan
wasnil
. Further investigationrevealed race conditions on the
rescanQuit
andrescanErr
props of the client as well.
Changes
This PR modifies the design of the neutrino client to better reflect
the requirement that only a single
Rescan
object instance berunning at a time. To that end, this PR removes the
rescan
propertyon the client and replaces it with two buffered channels (
cap
= 1), one each fora rescanner and the rescanner's signal channel. In this way there is no
need to maintain the state
scanning
. Now any method caninfer that the client is scanning if it is able to read a scanner from the channel.
This PR resolves a data race on the
rescanErr
channel via anew error chan consumer method that takes all new error channels
from rescanner restarts and feeds any read errors onto
rescanErr
.Testing
This PR adds two interfaces
rescan.Interface
andNeutrinoChainService
to abstract out the dependency on structs from the
neutrino
package.This PR adds a file of mock implementations for these interfaces and uses them
in a new unit test file. The unit tests added are not intended to capture all
the functionality of the client, but to demonstrate that the data races are resolved
and these tests should be run with the
-race
flag enabled.Reviewing
Checkout 93ca685 and run tests.
TestNeutrinoClientNotifyReceivedRescan
will fail with a segmentation faultin the neutrino client due to
rescan
being set tonil
.Checkout 6e1d604 and verify that
the segmentation fault is fixed and
TestNeutrinoClientNotifyReceivedRescan
passes, but that
-race
flag testing fails.Checkout 56af455 run tests and
verify that
TestNeutrinoClientNotifyReceivedRescan
passes with-race
flag enabled.All changes after 56af455 are cosmetic and
cleanup related.
fixes #819