fix chanArb deadlock #9253

Merged
merged 3 commits into from
Nov 20, 2024

Collaborator

@ziggie1984 ziggie1984 commented Nov 9, 2024

This PR does 2 things:

  1. Fixes #8149 ([bug]: ChannelArbitrator does not cleanly stop). The ChainArbitrator now starts a dedicated goroutine that is responsible for stopping the goroutines of the channel arbitrators once they are fully resolved.

  2. Starts the channel arbitrators during startup in an errGroup and collects the results concurrently. This makes sure LND starts up correctly. The channel arbitrators sometimes depend on other subsystems, for example taproot assets, so we need to make sure we do not block here forever.

EDIT:

Changed the approach. This will prevent the deadlock from happening.

A separate PR will be created to start the arbitrator concurrently.

Contributor

coderabbitai bot commented Nov 9, 2024

Important

Review skipped

Auto reviews are limited to specific labels.

🏷️ Labels to auto review (1)
  • llm-review

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


@ziggie1984 ziggie1984 self-assigned this Nov 9, 2024
@ziggie1984 ziggie1984 force-pushed the fix-chanArb-deadlock branch 2 times, most recently from 7a5003b to b6296d2 Compare November 9, 2024 14:24
@@ -1192,9 +1232,6 @@ func (c *ChainArbitrator) ForceCloseContract(chanPoint wire.OutPoint) (*wire.Msg
// channel has finished its final funding flow, it should be registered with
// the ChainArbitrator so we can properly react to any on-chain events.
func (c *ChainArbitrator) WatchNewChannel(newChan *channeldb.OpenChannel) error {
c.Lock()
Collaborator Author

I'm not sure, but I think we cannot hold the ChainArb lock here while also starting the ChannelArbitrator: the ChannelArbitrator might call ResolveContract, which needs the ChainArb lock as well. This deadlock was probably never seen in the wild, but I think we need to unlock the ChainArb before starting the ChannelArb?
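
To make the concern concrete, here is a minimal sketch of the pattern, not lnd's actual code. The names `registry`, `child`, `watch` and `resolve` are hypothetical stand-ins for the ChainArbitrator, a ChannelArbitrator, WatchNewChannel and ResolveContract. The parent registers the child under its mutex and releases the mutex before starting the child, so a callback from the child cannot block on a lock the caller still holds.

```go
package main

import (
	"fmt"
	"sync"
)

// registry is a hypothetical stand-in for the ChainArbitrator: it owns a
// mutex-guarded map of active children.
type registry struct {
	mu     sync.Mutex
	active map[string]*child
}

// child is a hypothetical stand-in for a ChannelArbitrator.
type child struct {
	id       string
	resolved bool
	parent   *registry
}

func newRegistry() *registry {
	return &registry{active: make(map[string]*child)}
}

// resolve removes a child from the registry and needs the registry lock.
func (r *registry) resolve(id string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.active, id)
}

// start calls back into the registry if the child is already resolved.
func (c *child) start() {
	if c.resolved {
		c.parent.resolve(c.id)
	}
}

// watch registers the child under the lock and releases the lock before
// calling start. Holding r.mu across c.start() would block forever inside
// resolve, because sync.Mutex is not reentrant.
func (r *registry) watch(c *child) {
	r.mu.Lock()
	r.active[c.id] = c
	r.mu.Unlock()

	c.start()
}

// count returns the number of registered children.
func (r *registry) count() int {
	r.mu.Lock()
	defer r.mu.Unlock()
	return len(r.active)
}

func main() {
	r := newRegistry()
	r.watch(&child{id: "a", resolved: true, parent: r})
	r.watch(&child{id: "b", parent: r})
	fmt.Println("active children:", r.count())
}
```

If `watch` used `defer r.mu.Unlock()` instead, the first call in `main` would never return, which is the scenario described in the comment above.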

@ziggie1984 ziggie1984 force-pushed the fix-chanArb-deadlock branch 3 times, most recently from ce71936 to 1d76335 Compare November 9, 2024 15:47
@ziggie1984 ziggie1984 marked this pull request as ready for review November 9, 2024 15:49
@ziggie1984 ziggie1984 force-pushed the fix-chanArb-deadlock branch 4 times, most recently from c46f6b9 to b91eee3 Compare November 9, 2024 20:46
@yyforyongyu
Member

I would reassess this issue after blockbeat, as it greatly refactors the resolvers and the issue will likely be gone. That would also reduce rebase conflicts on either side.

@ziggie1984
Collaborator Author

I would reassess this issue after blockbeat, as it greatly refactors the resolvers and the issue will likely be gone. That would also reduce rebase conflicts on either side.

That would be cool. The main reason is that some external services might depend on the successful startup of LND; however, there are also dependencies on them when starting the ChannelArbitrator, so let's see then.

@guggero guggero mentioned this pull request Nov 11, 2024
8 tasks
Collaborator

@guggero guggero left a comment

Thanks a lot for the quick fixes! I think we need to handle a couple of edge cases a bit more, but the general approach looks good!

@@ -1192,9 +1232,6 @@ func (c *ChainArbitrator) ForceCloseContract(chanPoint wire.OutPoint) (*wire.Msg
// channel has finished its final funding flow, it should be registered with
// the ChainArbitrator so we can properly react to any on-chain events.
func (c *ChainArbitrator) WatchNewChannel(newChan *channeldb.OpenChannel) error {
c.Lock()
defer c.Unlock()

Collaborator

We now no longer guard the read access to c.activeChannels below (L1205). Perhaps we need to introduce an RWMutex instead?
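
A rough sketch of what an RWMutex-guarded lookup could look like. The `chainArb` and `arbitrator` types and the `register` and `lookup` methods are hypothetical, reduced stand-ins and not the real ChainArbitrator fields or methods.

```go
package main

import (
	"fmt"
	"sync"
)

// arbitrator is a hypothetical stand-in for a ChannelArbitrator.
type arbitrator struct{ name string }

// chainArb is a hypothetical, reduced stand-in for the ChainArbitrator.
type chainArb struct {
	mu             sync.RWMutex
	activeChannels map[string]*arbitrator
}

// register takes the write lock because it mutates the map.
func (c *chainArb) register(chanPoint string, arb *arbitrator) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.activeChannels[chanPoint] = arb
}

// lookup only takes the read lock, so concurrent readers do not block each
// other, and the lock is released before the caller uses the result.
func (c *chainArb) lookup(chanPoint string) (*arbitrator, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	arb, ok := c.activeChannels[chanPoint]
	return arb, ok
}

func main() {
	c := &chainArb{activeChannels: make(map[string]*arbitrator)}
	c.register("txid:0", &arbitrator{name: "chan-0"})

	if arb, ok := c.lookup("txid:0"); ok {
		fmt.Println("found", arb.name)
	}
}
```

The read lock only protects the map access itself. Anything done with the returned arbitrator afterwards happens outside the lock, which is what avoids holding it while a channel arbitrator starts.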

chainArb := c.activeChannels[chanPoint]
c.Unlock()
if chainArb != nil {
arbLog := chainArb.log
Collaborator

Could arbLog still be nil at this point? Perhaps the above condition should be if chainArb != nil && chainArb.log != nil?

c.wg.Add(1)
go c.channelAttendant(bestHeight)
return nil
err = c.wg.Go(func(ctx context.Context) {
Collaborator

nit: could return directly here.

Member

So what does this do exactly? No error is returned from channelAttendant.

Was this issue actually introduced by adding a goroutine in the prior commit?

// timeouts for itests and normal operations.
ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)

// Create an errgroup with the context
Collaborator

nit: missing full stop at end of sentence, here and a couple of places below.

for _, arbitrator := range c.activeChannels {
startState, ok := startStates[arbitrator.cfg.ChanPoint]
if !ok {
stopAndLog()
// In case we encounter an error we need to cancel the
Collaborator

nit: add empty line before comment if it isn't at the start of a block.


// Start arbitrator in a separate goroutine
go func() {
errChan <- arbitrator.Start(startState)
Collaborator

Isn't the whole goal of the errGroup that we have a context that is canceled when an error occurs?
But now we're starting a new goroutine inside the err group goroutine just so we can abort Start()?
This will work I think. But perhaps another possible approach would be to pass the cancellable context into Start() and abort on context cancel there?
Otherwise we'd kind of kill/abandon the goroutines spawned in Start() on shutdown.
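
A minimal sketch of the context-based alternative, assuming Start could take a context. The function `startArbitrator`, the `ready` channel and the `demo` helper are hypothetical and only illustrate aborting a blocking start on cancellation.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// startArbitrator is a hypothetical stand-in for a Start method that accepts
// a context. It waits for a dependency to signal readiness, but gives up as
// soon as the context is cancelled.
func startArbitrator(ctx context.Context, ready <-chan struct{}) error {
	select {
	case <-ready:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// demo runs startArbitrator with a short timeout against a dependency that
// is either already ready or never becomes ready.
func demo(depReady bool) error {
	ctx, cancel := context.WithTimeout(
		context.Background(), 50*time.Millisecond,
	)
	defer cancel()

	ready := make(chan struct{})
	if depReady {
		close(ready)
	}

	return startArbitrator(ctx, ready)
}

func main() {
	fmt.Println("dependency ready:", demo(true))
	fmt.Println("dependency stuck:", demo(false))
}
```

With this shape the caller does not need an extra goroutine just to abandon a blocked start: the start call itself returns once the context ends.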

Collaborator Author

very good idea

Member

Also why isn't arb.Start() returned directly as the error here? That would actually use the errgroup features.

Zooming out, what we want here is that chanArb.Start() actually doesn't block, but AFAICT, it'll wait for all the goroutines to start below, which can still enter the deadlock scenario we were trying to resolve.

Collaborator Author

Not sure if I understand your question, but I introduced the errGroup and this goroutine because otherwise I cannot fail the remaining goroutines as soon as one goroutine fails with an error. The normal errGroup waits until all goroutines are done, which would bring us into the deadlock again. But yeah, it is not really necessary with the new approach.

select {
// As soon as the context cancels we can be sure the
// errGroup has finished waiting.
case <-ctx.Done():
Collaborator

I'm not sure we can rely on the context being canceled here. The Godoc for the err group says "the first call to return a non-nil error cancels the group's context". But if there is no failure, there won't be a cancel.

Perhaps instead of using an err group we just create an error channel sized for the number of arbitrators we start, start them all in goroutines, and then wait for them to complete here by reading as many results (an error or nil) from the channel as there are arbitrators.
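
The error-channel idea could be sketched roughly as follows, using only the standard library. `startAll`, `errStart`, `succeed` and `fail` are hypothetical names for illustration and the start functions stand in for arbitrator.Start(startState).

```go
package main

import (
	"errors"
	"fmt"
)

// errStart is a hypothetical error used by the demo below.
var errStart = errors.New("arbitrator failed to start")

// startAll runs every start function in its own goroutine and then reads
// exactly one result per function from a buffered channel. It returns the
// first non-nil error it receives, or nil if all functions succeeded.
func startAll(starters []func() error) error {
	// The buffer holds one slot per goroutine, so no sender ever blocks.
	errChan := make(chan error, len(starters))

	for _, start := range starters {
		go func(start func() error) {
			errChan <- start()
		}(start)
	}

	var firstErr error
	for range starters {
		if err := <-errChan; err != nil && firstErr == nil {
			firstErr = err
		}
	}

	return firstErr
}

func succeed() error { return nil }
func fail() error    { return errStart }

func main() {
	fmt.Println("all ok:", startAll([]func() error{succeed, succeed}))
	fmt.Println("one fails:", startAll([]func() error{succeed, fail}))
}
```

This variant still waits for every start function to return, so a start call that blocks forever would block the caller too. Returning on the first error instead would be safe for the senders, since the buffered channel never blocks them.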

@@ -258,6 +258,10 @@ type ChainArbitrator struct {
// methods and interface it needs to operate.
cfg ChainArbitratorConfig

// resolveContract is a channel which is used to signal the cleanup of
Member

Re the commit comment: do we have a demonstration of the supposed deadlock?

@@ -509,44 +517,24 @@ func (c *ChainArbitrator) ResolveContract(chanPoint wire.OutPoint) error {
return err
}

// Now that the channel has been marked as fully closed, we'll stop
Member

So what's the deadlock scenario here? That the channel arb calls this function while the chain arb is trying to stop it?

I think that can alternatively be handled with an async call from the chan arb. At that point, it's shutting down, and can't really do much with any error returned here as all the contracts have been resolved (channel is fully closed).

c.wg.Add(1)
go c.resolveContract(contract, immediate)
err := c.wg.Go(func(ctx context.Context) {
c.resolveContract(contract, immediate)
Member

Same here re not returning an err at all.


@Roasbeef
Member

Seeing this laid out a bit, I wonder if we should entertain the other idea that @ziggie1984 had: modify ForceClose to only conditionally try to make the chan close summary.

In terms of breaking changes, we can sidestep that by using a new set of functional options for the main arg. This way we only need to update callers at the site of the new unit tests, and then also the ChainArb.
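
For readers unfamiliar with the pattern, a minimal sketch of functional options follows. All names here (`forceClose`, `ForceCloseOpt`, `WithSkipResolutions`, `closeSummary`) are hypothetical and are not taken from lnd. The point is that existing callers keep calling the function without arguments and only new call sites pass an option.

```go
package main

import "fmt"

// forceCloseOpts collects the optional behaviour. All names are hypothetical.
type forceCloseOpts struct {
	skipResolutions bool
}

// ForceCloseOpt is a functional option that modifies forceCloseOpts.
type ForceCloseOpt func(*forceCloseOpts)

// WithSkipResolutions asks forceClose not to build the resolutions.
func WithSkipResolutions() ForceCloseOpt {
	return func(o *forceCloseOpts) {
		o.skipResolutions = true
	}
}

// closeSummary is a reduced stand-in for a force close summary.
type closeSummary struct {
	closeTx     string
	resolutions []string // nil when the caller skipped them
}

// forceClose builds the summary and only creates the resolutions when the
// caller did not opt out.
func forceClose(opts ...ForceCloseOpt) closeSummary {
	var cfg forceCloseOpts
	for _, opt := range opts {
		opt(&cfg)
	}

	summary := closeSummary{closeTx: "commit-tx"}
	if !cfg.skipResolutions {
		summary.resolutions = []string{"commit", "anchor"}
	}

	return summary
}

func main() {
	// Existing callers are unchanged and still get the resolutions.
	fmt.Println(len(forceClose().resolutions))

	// A new caller opts out explicitly.
	fmt.Println(len(forceClose(WithSkipResolutions()).resolutions))
}
```

Because the options are variadic, adding the parameter does not break any existing call site.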

@dstadulis dstadulis assigned Roasbeef and unassigned ziggie1984 Nov 12, 2024
@ziggie1984
Collaborator Author

ziggie1984 commented Nov 12, 2024

@Roasbeef ok changed the approach for now. Just going for the Optional Resolution approach.

Will create a separate PR to make the startup async.

This will definitely solve the deadlock issue, but we should still add the async arbitrator feature (will create a separate PR).


// Resolutions contains all the data required for resolving the
// different output types of a commitment transaction.
Resolutions fn.Option[Resolutions]
Collaborator Author

I went for this approach here; it would be too much of a change to introduce options for all the separate types, for example:

AnchorResolution fn.Option[AnchorResolution] ...

because we use the nil case a lot and also pass it into other structures. Maybe a deeper refactor in the long run, but not now.
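
As an illustration of the single-option approach, here is a self-contained sketch with a tiny generic optional type. `Option`, `Some`, `None`, `WhenSome`, `resolutions`, `summary` and `describe` are hypothetical and written only for this example. They are not the API of lnd's fn package.

```go
package main

import "fmt"

// Option is a minimal, hypothetical optional type. It only illustrates the
// idea and is not the API of lnd's fn package.
type Option[T any] struct {
	value T
	ok    bool
}

// Some wraps a present value.
func Some[T any](v T) Option[T] { return Option[T]{value: v, ok: true} }

// None returns an empty option.
func None[T any]() Option[T] { return Option[T]{} }

// WhenSome calls f only if a value is present.
func (o Option[T]) WhenSome(f func(T)) {
	if o.ok {
		f(o.value)
	}
}

// resolutions is a reduced stand-in for the grouped resolution data.
type resolutions struct {
	anchor string
	htlcs  int
}

// summary groups all resolutions behind one option instead of wrapping each
// resolution type separately.
type summary struct {
	closeTx string
	res     Option[resolutions]
}

// describe reports what a consumer would do with the summary.
func describe(s summary) string {
	out := "broadcast " + s.closeTx
	s.res.WhenSome(func(r resolutions) {
		out += fmt.Sprintf(", resolve anchor %s and %d htlcs", r.anchor, r.htlcs)
	})

	return out
}

func main() {
	fmt.Println(describe(summary{closeTx: "tx", res: None[resolutions]()}))
	fmt.Println(describe(summary{
		closeTx: "tx",
		res:     Some(resolutions{anchor: "a", htlcs: 2}),
	}))
}
```

Inside the grouped struct the individual fields can keep their existing nil semantics, which matches the reasoning above for not wrapping every resolution type on its own.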

Collaborator

@guggero guggero left a comment

Thanks a lot for the fix! This approach makes a lot of sense to me. Just a couple of minor suggestions, otherwise LGTM 🎉

@guggero
Collaborator

guggero commented Nov 13, 2024

Some itests now seem to fail. Perhaps we need to increase some timeouts or wait for a different signal since things are now a bit more async?

    harness.go:353: Finished the setup, now running tests...
    --- FAIL: TestLightningNetworkDaemon/tranche00/05-of-174/btcd/channel_backup_restore_basic (55.79s)
        --- FAIL: TestLightningNetworkDaemon/tranche00/05-of-174/btcd/channel_backup_restore_basic/restore_from_RPC_backup (52.25s)
            harness_rpc.go:100: 
                	Error Trace:	/home/runner/work/lnd/lnd/lntest/rpc/harness_rpc.go:100
                	            				/home/runner/work/lnd/lnd/lntest/rpc/lnd.go:46
                	            				/home/runner/work/lnd/lnd/lntest/harness_assertion.go:90
                	            				/home/runner/work/lnd/lnd/lntest/wait/wait.go:51
                	            				/home/runner/work/lnd/lnd/lntest/wait/wait.go:27
                	            				/opt/hostedtoolcache/go/1.22.6/x64/src/runtime/asm_amd64.s:1695
                	Error:      	Received unexpected error:
                	            	rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:10630: connect: connection refused"
                	Messages:   	carol: failed to call ListPeers
            harness_assertion.go:105: 
                	Error Trace:	/home/runner/work/lnd/lnd/lntest/harness_assertion.go:105
                	            				/home/runner/work/lnd/lnd/lntest/harness_assertion.go:239
                	            				/home/runner/work/lnd/lnd/itest/lnd_channel_backup_test.go:1564
                	            				/home/runner/work/lnd/lnd/itest/lnd_channel_backup_test.go:231
                	            				/home/runner/work/lnd/lnd/itest/lnd_channel_backup_test.go:459
                	            				/home/runner/work/lnd/lnd/itest/lnd_channel_backup_test.go:427
                	Error:      	Received unexpected error:
                	            	method did not return within the timeout
                	Test:       	TestLightningNetworkDaemon/tranche00/05-of-174/btcd/channel_backup_restore_basic/restore_from_RPC_backup
                	Messages:   	unable to connect carol to dave, got error: peers not connected within 30s seconds

@ziggie1984
Collaborator Author

Hmm, strange, this PR did not introduce any new timeout issue. I'll take a look.

@ziggie1984 ziggie1984 force-pushed the fix-chanArb-deadlock branch 2 times, most recently from df295ea to f42d24a Compare November 13, 2024 15:33
@ziggie1984
Collaborator Author

Passes now @guggero, but will add another safety check which crashed the tests prior. Should never happen but safety first.

@Roasbeef
Member

Passes now @guggero, but will add another safety check which crashed the tests prior. Should never happen but safety first.

Which check is this?

Is it the idea I had to test that this doesn't deadlock in the litd context we observed?

@ziggie1984
Collaborator Author

Tested this PR via an itest: it works as expected and no deadlock happens.

Member

@Roasbeef Roasbeef left a comment

LGTM 📡

@Roasbeef
Member

Can land after a rebase.

We don't always need the resolutions in the local force close
summary so we make it an option.
@guggero guggero merged commit 4b563e6 into lightningnetwork:master Nov 20, 2024
28 of 34 checks passed
Successfully merging this pull request may close these issues.

[bug]: ChannelArbitrator does not cleanly stop
4 participants