Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tapchannel: improve aux signer signal handling #1118

Merged
merged 2 commits into from
Sep 12, 2024

Conversation

jharveyb
Copy link
Contributor

@jharveyb jharveyb commented Sep 9, 2024

Fixes #1114 .

This PR updates the aux signer to handle cancel and quit signals in a similar fashion to the 'primary' sig job handler in lnwallet/sigpool. This also improves safety as the cancel channel can now be closed by both tapd and lnd.

@jharveyb jharveyb requested review from guggero and Roasbeef September 9, 2024 11:03
@coveralls
Copy link

coveralls commented Sep 9, 2024

Pull Request Test Coverage Report for Build 10815203740

Details

  • 20 of 43 (46.51%) changed or added relevant lines in 2 files are covered.
  • 20 unchanged lines in 4 files lost coverage.
  • Overall coverage increased (+0.02%) to 40.173%

Changes Missing Coverage Covered Lines Changed/Added Lines %
tapchannel/aux_leaf_signer.go 8 31 25.81%
Files with Coverage Reduction New Missed Lines %
tappsbt/create.go 2 53.22%
asset/asset.go 2 81.8%
commitment/tap.go 4 83.91%
universe/interface.go 12 50.22%
Totals Coverage Status
Change from base Build 10783848836: 0.02%
Covered Lines: 23999
Relevant Lines: 59739

💛 - Coveralls

Copy link
Member

@guggero guggero left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice bug hunting and fix!

tapchannel/auf_leaf_signer_test.go Outdated Show resolved Hide resolved
tapchannel/auf_leaf_signer_test.go Outdated Show resolved Hide resolved
tapchannel/auf_leaf_signer_test.go Outdated Show resolved Hide resolved
tapchannel/aux_leaf_signer.go Outdated Show resolved Hide resolved
log.Tracef("Processing %d aux sig jobs", len(sigJobs))

for idx := range sigJobs {
sigJob := sigJobs[idx]
cancelAndErr := func(err error) {
log.Errorf("Error processing aux sig job: %v", err)

close(sigJob.Cancel)
// Check that the cancel signal was not already sent
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we're expected to be the only ones closing the sigJob.Cancel channel, we could also do this with a sync.Once. But I guess this works too and would also work in case someone else closes the channel.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think we need to straighten out exactly who/what can close the quit channel, otherwise we'll invariably run into a double close panic.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lnd can close the channel, it is shared amongst the 'classic' sig jobs and these aux sig jobs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated s.t. lnd is the only one that will be closing the channel (simpler flow, having tapd close was not really needed)

Copy link
Member

@Roasbeef Roasbeef left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice job digging into this!

I think the changes re the quit/cancel operations make sense, and one of the added tests does demonstrate a problem in this area.

The one thing I'm not sure if we've figured out is: why was the goroutine blocking on a channel that should always be buffered?

tapchannel/auf_leaf_signer_test.go Show resolved Hide resolved
tapchannel/auf_leaf_signer_test.go Show resolved Hide resolved
tapchannel/auf_leaf_signer_test.go Outdated Show resolved Hide resolved
tapchannel/auf_leaf_signer_test.go Outdated Show resolved Hide resolved
tapchannel/auf_leaf_signer_test.go Show resolved Hide resolved
log.Tracef("Processing %d aux sig jobs", len(sigJobs))

for idx := range sigJobs {
sigJob := sigJobs[idx]
cancelAndErr := func(err error) {
log.Errorf("Error processing aux sig job: %v", err)

close(sigJob.Cancel)
// Check that the cancel signal was not already sent
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think we need to straighten out exactly who/what can close the quit channel, otherwise we'll invariably run into a double close panic.

tapchannel/aux_leaf_signer.go Show resolved Hide resolved
@jharveyb
Copy link
Contributor Author

Ok, updated to address initial comments.

I think this should get merged as-is and then have a follow-up PR, following the sketch here:

#1114 (comment)

@jharveyb jharveyb force-pushed the aux_signer_batching_fixes branch from 1c6f95e to 4b03773 Compare September 11, 2024 10:40
@dstadulis dstadulis added this to the v0.4.2 milestone Sep 11, 2024
In this commit, we add checks of the aux signer cancel and quit signals
at all points during aux sig batch processing when a response may be
sent. This mirrors the signal handling used in the lnwallet sigpool
worker goroutine. We also update the early exit logic to not close the
cancel channel; only the caller, lnd, should mutate that channel.
@jharveyb jharveyb force-pushed the aux_signer_batching_fixes branch from 4b03773 to 4eb1aeb Compare September 11, 2024 15:53
@jharveyb jharveyb requested a review from Roasbeef September 11, 2024 15:55
log.Errorf("Error processing aux sig job: %v", err)

close(sigJob.Cancel)
sigJob.Resp <- lnwallet.AuxSigJobResp{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wherever we send to a channel, we should also read on the Quit channel to make sure we never block on a send (even if it's buffered).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, so if we receive Quit we abandon sending the err? that may then block lnd then IIUC.

Also, if we select on Quit, and we entered this function from a Quit case statement, we would never send the error, since we're reading an already closed channel.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, yeah, you're right. We have to assume the Resp channel is buffered here though (which it is), then this should never block.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have to assume the Resp channel is buffered here though (which it is), then this should never block.

Yeah this is what was puzzling w.r.t the original stack trace that led us down this line of inquiry: unless some weird mutation happened, why would tapd block on the channel send?

Copy link
Member

@Roasbeef Roasbeef left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Roasbeef Roasbeef merged commit 2f0947d into main Sep 12, 2024
18 checks passed
@guggero guggero deleted the aux_signer_batching_fixes branch September 13, 2024 06:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: ✅ Done
Development

Successfully merging this pull request may close these issues.

tapchannel - [bug]: potential deadlock when signing second level HTLCs
5 participants