Peer backup #8490
base: master
Conversation
@Chinwendu20 - in general, the code that adds a new feature bit to the set of features that we advertise should happen after the code that implements the functionality. So I think it would make sense to put the implementation in this PR too
Nice job on identifying the first steps here. I'll echo what @ellemouton said and say that you shouldn't implement the BOLT9 changes until everything else is done.
Main feedback here on the actual implementation, though, is that this is one of those situations where I'd recommend not newtyping (one of the few times I'll ever say this). The reason for this is that the blob storage is truly opaque so the value of newtyping is not helpful. We newtype things because we might have two pieces of data that are "representationally identical" but "semantically different".
I anticipate that as we start to figure out how we want to use this protocol feature we will want to further parse the data in these buffers and so we will either wrap them in newtypes during parsing or we will actually explode the structure out to its constituent rows.
I think if you remove the `PeerStorageBlob` and `YourPeerStorageBlob` types, pull that thread and run all the issues to ground, you'll arrive at something that represents a complete and correct first step here.
lnwire/features.go
Outdated
OptionWantStorageOptional:    "option_want_storage",
OptionWantStorageRequired:    "option_want_storage",
OptionProvideStorageOptional: "option_provide_storage",
OptionProvideStorageRequired: "option_provide_storage",
Please make this consistent with the rest of the map you see above by dropping the `Option`/`option` prefix and replacing the underscores (`_`) with dashes (`-`).
Thank you, I will fix this when I submit a PR.
lnwire/peer_storage.go
Outdated
var ErrPeerStorageBytesExceeded = fmt.Errorf("peer storage bytes exceede" +
	"d")
use `errors.New` since there's nothing to format.
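For illustration only, a minimal sketch of what that would look like for the error in the diff above (assuming nothing else about the surrounding file):

```go
package lnwire

import "errors"

// ErrPeerStorageBytesExceeded is returned when a peer storage blob exceeds
// the maximum number of bytes we are willing to accept.
var ErrPeerStorageBytesExceeded = errors.New("peer storage bytes exceeded")
```

errors.New is the idiomatic choice when there are no format verbs; fmt.Errorf only adds value when arguments need to be interpolated.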
It turns out we should leave this as is. I'm retracting my comments here.
May I know why you retracted it?
The errors package was included as a stand-in and we would like to phase it out over time. At least that's my understanding CC @Roasbeef
`fmt.Errorf` calls `errors` under the hood though... so not sure we can actually phase it out.
lnwire/peer_storage.go
Outdated
// PeerStorageBlob is the type of the data sent by peers to other peers for
// backup.
type PeerStorageBlob []byte
This might be one of the few times that I actually think you don't want to newtype this. Since we are dealing with truly opaque bytes, I don't see the value of wrapping it with a newtype.
The way of encoding this is different from normal bytes in the `readElement` function. We write the length of the blob first, then the bytes, if I understand the spec correctly.
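For context, here is a rough sketch of the length-prefixed encoding being described. The uint16 prefix and the helper names are assumptions for illustration, not lnd's actual readElement/writeElement signatures:

```go
package peerstorage

import (
	"encoding/binary"
	"io"
)

// writeBlob writes the blob length as a big-endian uint16 followed by the
// raw bytes, mirroring how variable-length fields are framed on the wire.
func writeBlob(w io.Writer, blob []byte) error {
	if err := binary.Write(w, binary.BigEndian, uint16(len(blob))); err != nil {
		return err
	}
	_, err := w.Write(blob)
	return err
}

// readBlob reads the uint16 length prefix and then exactly that many bytes.
func readBlob(r io.Reader) ([]byte, error) {
	var length uint16
	if err := binary.Read(r, binary.BigEndian, &length); err != nil {
		return nil, err
	}

	blob := make([]byte, length)
	if _, err := io.ReadFull(r, blob); err != nil {
		return nil, err
	}

	return blob, nil
}
```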
lnwire/your_peer_storage.go
Outdated
// YourPeerStorageBlob is the type of the data stored by peers as backup for
// other peers. This message is sent in response to the PeerStorage message.
type YourPeerStorageBlob []byte
Yeah I especially think we should avoid having two distinct newtypes for this since this should be identical in both places. Namely, the contents of YourPeerStorageBlob should exactly match the contents of PeerStorageBlob in a prior message.
* [Implement feature bits and message in the peer backup proposal](https://github.com/lightningnetwork/lnd/pull/8490)
  This PR implements the feature bits and messages in the peer backup proposal
  referenced here: https://github.com/lightning/bolts/pull/1110
As a matter of policy, you should never add the advertisement bits until the entire proposal is implemented properly. It should be the last step in the process. Any change you implement should always be done "back to front". Like so:
- Implement core axioms/operations
- Plug it into the rest of the system
- Implement the on switch
- Advertise
Force-pushed from 55f7ec1 to 5d2ef8f
!! This is not gofmt'ed yet and there might be formatting nits. I have converted this to a draft PR; I would like to get initial feedback on the design before I submit a proper one. TODO:
Unfortunately I have to Approach NACK this one. I think the purpose of this has been misunderstood. I suggest we get on a call ASAP to clear up the misunderstanding so you can proceed with better clarity on what needs to be done.
The short version here is that there are a handful of things that need to be handled:

- We need to implement a storage layer that we use to load and store data that our peer tells us to store. This should not be done via the chanbackup package.
- We need to implement an API that can load and store data that we choose to store with our peer on a per peer basis.
- We would like to implement a more advanced striping system on top of the basic single message blob.
- We need to think about what data needs to go into that backup layer so that we can recover in-flight htlcs when the SCB is restored.
- We need to think about how we want to structure our striping in a way that is both robust to peers going offline, and is incentive aligned.
There is a lot to be done here and it won't all be done in a single PR. I would like to see the first PR only implement 1 and 2. From there we will add follow up PRs that address later bullets here. Let's just start with the minimum viable protocol implementation.
peer/brontide.go
Outdated
// peerStorageDelay is the required for a peer to stay before acking a
// YourPeerStorageMessage. This required to reduce spamming.
peerStorageDelay = 2 * time.Second
Please remove this for the time being. This is a security related implementation detail that may or may not be the approach we want to go with. Don't commit to this approach at this stage.
It is just that it was part of the BOLT document.
peer/brontide.go
Outdated
// SendBackup indicates if to send back up to this field.
SendBackup int

// LatestPeerBackup is the latest backup we have stored with a peer.
LatestPeerBackup []byte

// SendBackupMtx provides a concurrency safe way to update the
// `sendBackup` field as well as `LatestPeerBackup` field.
SendBackupMtx sync.RWMutex
These fields need a lot more explanation. This commit seems split between whether or not it is implementing the request layer for storing our data on our peer using this protocol, or if it is implementing the storage layer locally.
peer/brontide.go
Outdated
func (p *Brontide) handleYourPeerStorageMessage(msg *lnwire.YourPeerStorage) {

	if p.LatestPeerBackup == nil {
This seems like it ought to be !=
Correct. Note that this is untested yet; I just wanted to show a rough design of the implementation.
peer/brontide.go
Outdated
SendBackup int

// LatestPeerBackup is the latest backup we have stored with a peer.
LatestPeerBackup []byte
I'm not clear on what this field's purpose is, or why it is necessary.
An in-memory way to show the latest backup we have with the peer. It would also help with recovering the backed-up data from the peer, as that field is assigned the data when we receive a `yourpeerstorage` message from the peer.
chanbackup/peerbackup_multi.go
Outdated
This commit is premature. We have a lot more intermediate work to do before this commit is appropriate.
chanbackup/peerbackup.go
Outdated
Yeah you'll need to drop this commit too.
server.go
Outdated
func (s *server) sendBackupToPeer(data []byte) error {
	serverPeers := s.Peers()

	if len(data) > lnwire.MaxPeerStorageBytes {
		data = data[:lnwire.MaxPeerStorageBytes+1]
	}

	for _, p := range serverPeers {
		p.SendBackupMtx.Lock()
		if p.RemoteFeatures().HasFeature(
			lnwire.OptionProvideStorageOptional) &&
			p.SendBackup == 0 {

			if err := p.SendMessage(false, &lnwire.PeerStorage{
				Blob: data,
			}); err != nil {
				return fmt.Errorf("unable to send backup "+
					"storage to peer(%v), received error: "+
					"%v", p.PubKey(), err)
			}
			p.SendBackup = 1
			p.LatestPeerBackup = data
		}
		p.SendBackupMtx.Unlock()
	}

	return nil
}
This is absolutely not what we want to do. It opaquely truncates the data passed in (which is not something we should do without warning). It also seems to send things when the SendBackup flag is disabled. Finally it replicates the same data across all peers.
I would add a log message for the warning. When the sendBackup flag is zero it is not disabled; if there is any point of confusion in the code that contradicts that, I will correct it when I submit another iteration of this.
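For what it's worth, a minimal sketch of the kind of explicit guard the reviewer seems to be asking for, i.e. rejecting oversized data instead of silently truncating it (the constant value below is purely illustrative):

```go
package peerstorage

import "fmt"

// maxPeerStorageBytes stands in for lnwire.MaxPeerStorageBytes; the exact
// value here is only illustrative.
const maxPeerStorageBytes = 65531

// checkBackupSize rejects oversized blobs explicitly instead of truncating
// them behind the caller's back.
func checkBackupSize(data []byte) error {
	if len(data) > maxPeerStorageBytes {
		return fmt.Errorf("backup blob of %d bytes exceeds the %d "+
			"byte peer storage limit", len(data),
			maxPeerStorageBytes)
	}

	return nil
}
```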
server.go
Outdated
func (s *server) prunePeerBackup(data []byte) error {
	channels, err := s.chanStateDB.FetchClosedChannels(false)
	if err != nil {
		return err
	}

	var peerBackupToDelete []string
	for _, channel := range channels {
		p, err := s.FindPeer(channel.RemotePub)
		if err != nil {
			return err
		}

		bestHeight, err := s.cc.BestBlockTracker.BestHeight()
		if err != nil {
			return err
		}

		if len(s.peerBackup.RetrieveBackupForPeer(p.String())) > 0 &&
			channel.CloseHeight+minChainBlocksForWipe >=
				bestHeight {

			peerBackupToDelete = append(peerBackupToDelete,
				p.String())
		}
	}

	return s.peerBackup.PrunePeerStorage(peerBackupToDelete)
}
I don't understand what this is supposed to accomplish. This says that if there is a single channel that is past the wipe window for a peer that we delete the storage for that peer?
Thanks, maybe adding an extra check of `len(p.activeChannels) == 0` would address the concern?
server.go
Outdated
func (s *server) retrievePeerBackup() []byte {
	serverPeers := s.Peers()

	for _, p := range serverPeers {
		p.SendBackupMtx.Lock()
		if p.LatestPeerBackup != nil {
			return p.LatestPeerBackup
		}
		p.SendBackupMtx.Unlock()
	}

	return nil
}
Seems like there are a lot of assumptions being made here and I don't think they are right. This is treating all peer backups as interchangeable.
By interchangeable do you mean the same?
server.go
Outdated
backup := s.peerBackup.RetrieveBackupForPeer(p.String())
if backup != nil {
	err := p.SendMessage(false, lnwire.NewYourPeerStorageMsg(
		backup))
	if err != nil {
		srvrLog.Errorf("unable to send backup to peer "+
			"%v", err)
		return
	}
}
As far as I can tell, we never actually store our peer's blobs so this won't do anything will it?
Please point me to any part of this PR that indicates an assumption that the blob is not stored
That is where the
Thanks for the review and there is nothing unfortunate about it. I agree we need to get on the call and understand both our perspectives on this.
Here is what needs to happen in this project overall. We will focus on each step individually. Do not do more than the required scope for each PR. Each step should be a separate PR. We will focus on merging the earlier ones before reviewing the later ones.

Preliminaries

This peer backup protocol outlined in bolts#1110 really has two different things going on. First, there is the service provider part of the protocol, where we accept requests to store blobs for our peers, and retrieve them later and give them back upon reconnection. Second, there is the service consumer where our node requests that our peer store data for us. These two parts of the protocol basically have no overlap. The only code sharing they should have is the wire message format. Everything else is going to be different.

Step 1: Allow our peers to store data with us.

I request that you reduce the scope of this PR to only handle this case. The key components we need are as follows:
Step 2: Allow our node to store data with our peer. (NOT THIS PR!)
Commentary

Notice how none of this really mentions backups yet. That's because how we choose to use this peer storage layer is a decision that happens once we have the capability. We will get there in due time, but before we can we have to make sure these parts are right. LND is mission critical software and so it's important that we make all of these changes to it methodically and not take any shortcuts.
Chiming in here to co-sign @ProofOfKeags's comment above. We need to zoom out a bit to do some more upfront planning to make sure each PR that progressively implements and integrates this feature is well scoped.

In the first phase (this PR), we just want to be able to read and write the blobs we store with the remote peer (and them for us) that we have/had channels open with. This can be shipped by itself, as it implements the generic protocol feature, but doesn't add any lnd specific logic yet. He's described this phase above, and it can be broken down into two PRs as noted.

In the second phase, we'll start to actually store our

In the third phase, we'd combine the work in the prior phases to implement the peer storage aware SCB recovery. At this point, we're storing blobs for recovery with our peers. If we assume a user only has a seed, then we can use the channel graph to try to find our old channel peers, crawling each peer until we think we have a complete SCB picture. These blobs can then be used to bootstrap the SCB protocol as we know it today.

In a fourth later phase, we can implement the ideas on #8472 to start to store HTLC state on a best effort basis. This changes more frequently (potentially several times a second, or even minute) so we need something to aggregate the updates, upload them, and then later on sweeper awareness to start sweeping the HTLCs.
Okay thank you @Roasbeef and @ProofOfKeags
What about if we still have their blob on file but no active channels with the peer? I think we should still send it since it has not been cleared yet. That means we would only advertise one feature bit for this PR, the one that advertises that we are storing other peers' data, right? I would also like to understand why we would prefer using kvdb over files to store the data. Thank you.
Yeah this is perfectly alright to include in this PR. We can provide it if we have it on file and then it can be up to us to clear that data when we deem appropriate (taking into account the 2016 block delay recommended in the spec).
We use kvdb for just about all of our persistence in LND. The SCB file is an exception and that exception is made due to its use context. In the case of SCB we want to partition it away from other node state so that users can easily save it to another storage medium. We do not need that as a property of this system here because users need not back up the data our peers send us here. That said, if you do your interface design well we should be able to easily interchange between a file based and a kvdb based scheme with minimal consequences to the contact surface between this code and the rest of the codebase. 🙂
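To illustrate that kind of interface boundary (the names below are hypothetical, not lnd's actual API), something along these lines would let a file-backed and a kvdb-backed implementation be swapped freely:

```go
package peerstorage

// PeerDataStore is a hypothetical interface for persisting the opaque blobs
// our peers ask us to store. Both a flat-file implementation and a kvdb
// implementation could satisfy it, so the rest of the codebase never needs
// to know which backend is in use.
type PeerDataStore interface {
	// Store persists (or overwrites) the blob for the given peer.
	Store(peerPubKey [33]byte, blob []byte) error

	// Retrieve returns the blob previously stored for the peer, or nil
	// if none exists.
	Retrieve(peerPubKey [33]byte) ([]byte, error)

	// Prune removes the stored blob for the given peer, e.g. once all
	// channels with it have been closed for long enough.
	Prune(peerPubKey [33]byte) error
}
```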
Force-pushed from 5d2ef8f to 7344ebd
peer/brontide.go
Outdated
if err := p.writeMessage(&lnwire.YourPeerStorage{
	Blob: data,
}); err != nil {
	return fmt.Errorf("unable to send "+
So no delay here? Even though it was included in the spec @ProofOfKeags?
The spec simply says that we may delay things to rate limit. IMO rate limiting is a decision we should make but the delay is not necessarily the angle we have to go after. I am suggesting we defer that decision to a later commit. Here we just need to get the bones in place.
Also yeah I wouldn't put the delay here at all. The idea of the delay would be to withhold the ack of a peer storage message, not delay the retrieval.
peerstorage.go
Outdated
for {
	select {
	case e := <-k.Epochs:
I do not know if I should do this or use a time.Ticker to call bestHeight on the blockViewer at a particular time interval, but I heard something about that introducing flaky tests.
imo you should keep the epochs, don't use a timer
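As a rough sketch of the epoch-driven approach (the types and names here are illustrative stand-ins, not the exact chain notifier wiring):

```go
package peerstorage

import "log"

// blockEpoch is a stand-in for a chain notifier's new-block event; in lnd
// this information would come from a block epoch subscription.
type blockEpoch struct {
	Height uint32
}

// pruneLoop runs the pruning check on every new block instead of polling on
// a time.Ticker. The quit channel lets the caller shut the loop down cleanly.
func pruneLoop(epochs <-chan blockEpoch, quit <-chan struct{},
	prune func(height uint32) error) {

	for {
		select {
		case e := <-epochs:
			if err := prune(e.Height); err != nil {
				log.Printf("unable to prune peer storage: %v",
					err)
			}

		case <-quit:
			return
		}
	}
}
```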
I do not know if we should use a garbageCollector or let the node operators themselves decide when they would like to release their peer storage via an lncli command. The spec does say "at least", so maybe we can leave it for the node operator to decide.
Signed-off-by: Ononiwu Maureen <[email protected]>
This commit introduces new feature bits to enable backing up data with peers. Signed-off-by: Ononiwu Maureen <[email protected]>
This commit adds the peer backup storage message as well as functions to encode and decode them. Signed-off-by: Ononiwu Maureen <[email protected]>
Signed-off-by: Ononiwu Maureen <[email protected]>
Signed-off-by: Ononiwu Maureen <[email protected]>
@Chinwendu20 - the race condition is coming from this PR. Perhaps fix that first & then re-request review 🙏
Signed-off-by: Ononiwu Maureen <[email protected]>
Signed-off-by: Ononiwu Maureen <[email protected]>
Signed-off-by: Ononiwu Maureen <[email protected]>
Signed-off-by: Ononiwu Maureen <[email protected]>
Signed-off-by: Ononiwu Maureen <[email protected]>
Signed-off-by: Ononiwu Maureen <[email protected]>
Signed-off-by: Ononiwu Maureen <[email protected]>
Signed-off-by: Ononiwu Maureen <[email protected]>
In this commit, a new goroutine is added to manage the delay in persisting backupData shared by peers. This change serves as a safety check to ensure that a flood of PeerStorage messages from peers does not degrade our performance by causing multiple database transactions within a short period. Signed-off-by: Ononiwu Maureen <[email protected]>
Signed-off-by: Ononiwu Maureen <[email protected]>
Signed-off-by: Ononiwu Maureen <[email protected]>
I have updated the PR, but I think a new subsystem or ticker would not be needed; we can store the data in memory, then persist it when we want to quit the connection.
After discussing with @saubyk I think this might not be the best approach because there is no guarantee that lnd would gracefully shut down.
Yeah we can't guarantee that there won't be power failure or some other fatal event. We should persist eagerly, although it need not be synchronous.
warning := "received peer storage message but not " + | ||
"advertising required feature bit" |
This leaks data to the peer. We should not give indication that we understand the feature here.
case <-p.quit:
	if data == nil {
		return
	}

	// Store the data immediately and exit.
	err := p.cfg.PeerDataStore.Store(data)
	if err != nil {
		peerLog.Warnf("Failed to store peer "+
			"backup data: %v", err)
	}

	return
As discussed in the main PR thread this is not the approach we want to take. You shouldn't need a separate peerStorageWriter thread. It is fine to persist it in the main handler thread, or fork a single goroutine for overwriting the data so it doesn't block the main readHandler thread.
This will also alleviate the need to use `Cond`s, which are notoriously difficult to use correctly.
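A rough sketch of that suggestion, with the dependencies reduced to a small stand-in interface (the names here are illustrative, not the PR's actual types):

```go
package peerstorage

import (
	"log"
	"sync"
)

// Store is a stand-in for the PeerDataStore dependency referenced in the
// diff above.
type Store interface {
	Store(data []byte) error
}

// persistAsync hands the latest blob off to a short-lived goroutine so the
// caller (e.g. the peer's readHandler) never blocks on the database write.
// The WaitGroup lets shutdown wait for any in-flight write to finish.
func persistAsync(wg *sync.WaitGroup, store Store, data []byte) {
	wg.Add(1)
	go func() {
		defer wg.Done()

		if err := store.Store(data); err != nil {
			log.Printf("failed to store peer backup data: %v", err)
		}
	}()
}
```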
Do you mean in the `Store` method?
removing review request until previous review addressed 👍
Looks like there is a hanging discussion re the data storage
@Chinwendu20, remember to re-request review from reviewers when ready
!lightninglabs-deploy mute
Change Description
This PR implements the feature bits and messages in the peer backup proposal:
lightning/bolts#1110
Read comment for more info: #8490 (comment)
This PR adds support for storing other peers' backup data. It introduces a storage layer which persists this data and resends it to peers on reconnection.
Start reviewing from here: a9388032baa72a044adc5256d2633151b8012798
Steps to Test
Interoperability test with cln: https://github.com/Chinwendu20/Miscellanous/blob/master/peerstoragetest/README.md
Pull Request Checklist
Testing
Code Style and Documentation
Use [skip ci] in the commit message for small changes.
📝 Please see our Contribution Guidelines for further guidance.