
Peer backup #8490

Open: wants to merge 20 commits into master

Conversation

@Chinwendu20 (Contributor) commented on Feb 19, 2024

Change Description

This PR implements the feature bits and messages in the peer backup proposal:
lightning/bolts#1110

Read comment for more info: #8490 (comment)

This PR adds support for storing other peers' backup data. It introduces a storage layer which persists this data and resends it to peers on reconnection.

Start reviewing from here: a9388032baa72a044adc5256d2633151b8012798

Steps to Test

Interoperability test with cln: https://github.com/Chinwendu20/Miscellanous/blob/master/peerstoragetest/README.md

Pull Request Checklist

Testing

  • Your PR passes all CI checks.
  • Tests covering the positive and negative (error) paths are included.
  • Bug fixes contain tests triggering the bug to prevent regressions.

Code Style and Documentation

📝 Please see our Contribution Guidelines for further guidance.

coderabbitai bot commented on Feb 19, 2024

Important

Auto Review Skipped

Auto reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.



@ellemouton (Collaborator) left a comment:

@Chinwendu20 - in general, the code that adds a new feature bit to the set of features that we advertise should happen after the code that implements the functionality. So I think it would make sense to put the implementation in this PR too

@ProofOfKeags (Collaborator) left a comment:

Nice job on identifying the first steps here. I'll echo what @ellemouton said and say that you shouldn't implement the BOLT9 changes until everything else is done.

Main feedback here on the actual implementation, though, is that this is one of those situations where I'd recommend not newtyping (one of the few times I'll ever say this). The reason for this is that the blob storage is truly opaque so the value of newtyping is not helpful. We newtype things because we might have two pieces of data that are "representationally identical" but "semantically different".

I anticipate that as we start to figure out how we want to use this protocol feature we will want to further parse the data in these buffers and so we will either wrap them in newtypes during parsing or we will actually explode the structure out to its constituent rows.

I think if you remove the PeerStorageBlob and YourPeerStorageBlob types, pull that thread and run all the issues to ground, you'll arrive at something that represents a complete and correct first step here.

Comment on lines 340 to 343
OptionWantStorageOptional: "option_want_storage",
OptionWantStorageRequired: "option_want_storage",
OptionProvideStorageOptional: "option_provide_storage",
OptionProvideStorageRequired: "option_provide_storage",
Collaborator:

Please make this consistent with the rest of the map you see above by dropping the Option/option prefix and replacing the underscores (_) with dashes (-)
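
For illustration, a minimal sketch of the naming being requested (the constant names mirror the snippet above; the exact strings are a suggestion, not necessarily the final lnwire values):

// A sketch of the requested convention: no "option_" prefix, dashes
// instead of underscores, matching the rest of the feature-name map.
OptionWantStorageOptional:    "want-storage",
OptionWantStorageRequired:    "want-storage",
OptionProvideStorageOptional: "provide-storage",
OptionProvideStorageRequired: "provide-storage",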

@Chinwendu20 (Contributor, Author) commented on Feb 25, 2024:

Thank you, I will fix this when I submit a PR.

var ErrPeerStorageBytesExceeded = fmt.Errorf("peer storage bytes exceede" +
"d")
Collaborator:

use errors.New since there's nothing to format.
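
For reference, a minimal sketch of the suggested change (same error text, just built with errors.New):

// errors.New suffices here because there is nothing to format.
var ErrPeerStorageBytesExceeded = errors.New("peer storage bytes exceeded")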

Collaborator:

It turns out we should leave this as is. I'm retracting my comments here.

Contributor (Author):

May I know why you retracted it?

Collaborator:

The errors package was included as a stand-in and we would like to phase it out over time. At least that's my understanding CC @Roasbeef

Collaborator:

fmt.Errorf calls errors under the hood though... so not sure we can actually phase it out.

Comment on lines 16 to 18
// PeerStorageBlob is the type of the data sent by peers to other peers for
// backup.
type PeerStorageBlob []byte
Collaborator:

This might be one of the few times that I actually think you don't want to newtype this. Since we are dealing with truly opaque bytes, I don't see the value of wrapping it with a newtype.

Contributor (Author):

The way this is encoded differs from normal bytes in the readElement function: we write the length of the blob first, then the bytes, if I understand the spec correctly.
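
For illustration, a rough sketch of the length-prefixed encoding described above, assuming a 2-byte big-endian length prefix; the helper names are hypothetical and only stand in for the actual lnwire element read/write logic:

package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// writeBlob writes a 2-byte big-endian length followed by the raw blob
// bytes, i.e. "length first, then the bytes".
func writeBlob(w io.Writer, blob []byte) error {
	if err := binary.Write(w, binary.BigEndian, uint16(len(blob))); err != nil {
		return err
	}
	_, err := w.Write(blob)
	return err
}

// readBlob reverses writeBlob: read the length, then exactly that many
// bytes.
func readBlob(r io.Reader) ([]byte, error) {
	var length uint16
	if err := binary.Read(r, binary.BigEndian, &length); err != nil {
		return nil, err
	}
	blob := make([]byte, length)
	if _, err := io.ReadFull(r, blob); err != nil {
		return nil, err
	}
	return blob, nil
}

func main() {
	var buf bytes.Buffer
	if err := writeBlob(&buf, []byte("peer backup blob")); err != nil {
		panic(err)
	}
	blob, err := readBlob(&buf)
	fmt.Println(string(blob), err)
}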

Comment on lines 9 to 11
// YourPeerStorageBlob is the type of the data stored by peers as backup for
// other peers. This message is sent in response to the PeerStorage message.
type YourPeerStorageBlob []byte
Collaborator:

Yeah I especially think we should avoid having two distinct newtypes for this since this should be identical in both places. Namely, the contents of YourPeerStorageBlob should exactly match the contents of PeerStorageBlob in a prior message.

Comment on lines 281 to 295
* [Implement feature bits and message in the peer backup proposal](https://github.com/lightningnetwork/lnd/pull/8490)
This PR implements the feature bits and messages in the peer backup proposal
referenced here: https://github.com/lightning/bolts/pull/1110
Collaborator:

As a matter of policy, you should never add the advertisement bits until the entire proposal is implemented properly. It should be the last step in the process. Any change you implement should always be done "back to front". Like so:

  1. Implement core axioms/operations
  2. Plug it into the rest of the system
  3. Implement the on switch
  4. Advertise

@Chinwendu20 (Contributor, Author) commented on Feb 25, 2024:

!! This is not gofmt'ed yet and there might be formatting nits.

I have converted this to a draft PR; I would like to get initial feedback on the design before I submit a proper one.

TODO:

  • Add a prunepeerbackup command (that would be the way we get rid of irrelevant peer backups).
  • Add restorefrompeer to the lnd config, i.e. on start-up we attempt to restore channel backups from peers.
    Format: restorefrompeer = 1 means restore from peers, 0 means do not.

@ProofOfKeags (Collaborator) left a comment:

Unfortunately I have to Approach NACK this one. I think the purpose of this has been misunderstood. I suggest we get on a call ASAP to clear up the misunderstanding so you can proceed with better clarity on what needs to be done.

The short version here is that there are a handful of things that need to be handled:

  1. We need to implement a storage layer that we use to load and store data that our peer tells us to store. This should not be done via the chanbackup package.
  2. We need to implement an API that can load and store data that we choose to store with our peer on a per peer basis.
  3. We would like to implement a more advanced striping system on top of the basic single message blob.
  4. We need to think about what data needs to go into that backup layer so that we can recover in-flight htlcs when the SCB is restored
  5. We need to think about how we want to structure our striping in a way that is both robust to peers going offline, and is incentive aligned.

There is a lot to be done here and it won't all be done in a single PR. I would like to see the first PR only implement 1 and 2. From there we will add follow up PRs that address later bullets here. Let's just start with the minimum viable protocol implementation.

peer/brontide.go Outdated
Comment on lines 78 to 81

// peerStorageDelay is the time a peer is required to wait before
// acking a YourPeerStorage message. This is required to reduce
// spamming.
peerStorageDelay = 2 * time.Second
Collaborator:

Please remove this for the time being. This is a security related implementation detail that may or may not be the approach we want to go with. Don't commit to this approach at this stage.

Contributor (Author):

It is just that it was part of the bolt document

peer/brontide.go Outdated
Comment on lines 512 to 520
// SendBackup indicates if to send back up to this field.
SendBackup int

// LatestPeerBackup is the latest backup we have stored with a peer.
LatestPeerBackup []byte

// SendBackupMtx provides a concurrency safe way to update the
// `sendBackup` field as well as `LatestPeerBackup` field.
SendBackupMtx sync.RWMutex
Collaborator:

These fields need a lot more explanation. This commit seems split between whether or not it is implementing the request layer for storing our data on our peer using this protocol, or if it is implementing the storage layer locally.

peer/brontide.go Outdated

func (p *Brontide) handleYourPeerStorageMessage(msg *lnwire.YourPeerStorage) {

if p.LatestPeerBackup == nil {
Collaborator:

This seems like it ought to be !=

Contributor (Author):

Correct. Note that this is untested yet; I just wanted to show a rough design of the implementation.

peer/brontide.go Outdated
SendBackup int

// LatestPeerBackup is the latest backup we have stored with a peer.
LatestPeerBackup []byte
Collaborator:

I'm not clear on what this field's purpose is, or why it is necessary.

Contributor (Author):

An in-memory way to track the latest backup we have with the peer. It would also help with recovering the backed-up data from the peer, as that field is assigned the data when we receive a yourpeerstorage message from the peer.

Collaborator:

This commit is premature. We have a lot more intermediate work to do before this commit is appropriate.

Collaborator:

Yeah you'll need to drop this commit too.

server.go Outdated
Comment on lines 3443 to 3471
func (s *server) sendBackupToPeer(data []byte) error {
serverPeers := s.Peers()

if len(data) > lnwire.MaxPeerStorageBytes {
data = data[:lnwire.MaxPeerStorageBytes+1]
}

for _, p := range serverPeers {
p.SendBackupMtx.Lock()
if p.RemoteFeatures().HasFeature(
lnwire.OptionProvideStorageOptional) &&
p.SendBackup == 0 {

if err := p.SendMessage(false, &lnwire.PeerStorage{
Blob: data,
}); err != nil {
return fmt.Errorf("unable to send backup "+
"storage to peer(%v), received error: "+
"%v", p.PubKey(), err)
}
p.SendBackup = 1
p.LatestPeerBackup = data

}
p.SendBackupMtx.Unlock()
}

return nil
}
Collaborator:

This is absolutely not what we want to do. It opaquely truncates the data passed in (which is not something we should do without warning). It also seems to send things when the SendBackup flag is disabled. Finally, it replicates the same data across all peers.

Contributor (Author):

I would add a log message for the warning. When the sendBackup flag is zero it is not disabled; if there is any point of confusion in the code that contradicts this, I will correct it when I submit another iteration of this.

server.go Outdated
Comment on lines 3473 to 3501
func (s *server) prunePeerBackup(data []byte) error {
channels, err := s.chanStateDB.FetchClosedChannels(false)
if err != nil {
return err
}

var peerBackupToDelete []string
for _, channel := range channels {
p, err := s.FindPeer(channel.RemotePub)
if err != nil {
return err
}

bestHeight, err := s.cc.BestBlockTracker.BestHeight()

if err != nil {
return err
}
if len(s.peerBackup.RetrieveBackupForPeer(p.String())) > 0 &&
channel.CloseHeight+minChainBlocksForWipe >=
bestHeight {

peerBackupToDelete = append(peerBackupToDelete, p.
String())
}
}

return s.peerBackup.PrunePeerStorage(peerBackupToDelete)
}
Collaborator:

I don't understand what this is supposed to accomplish. This says that if there is a single channel that is past the wipe window for a peer that we delete the storage for that peer?

Contributor (Author):

Thanks, maybe adding an extra check of len(p.activeChannels) == 0 would address the concern?

server.go Outdated
Comment on lines 3503 to 3515
func (s *server) retrievePeerBackup() []byte {
serverPeers := s.Peers()

for _, p := range serverPeers {
p.SendBackupMtx.Lock()
if p.LatestPeerBackup != nil {
return p.LatestPeerBackup
}
p.SendBackupMtx.Unlock()
}

return nil
}
Collaborator:

Seems like there are a lot of assumptions being made here and I don't think they are right. This is treating all peer backups as interchangeable.

Contributor (Author):

By interchangeable do you mean the same?

server.go Outdated
Comment on lines 3984 to 3994
backup := s.peerBackup.RetrieveBackupForPeer(p.String())
if backup != nil {
err := p.SendMessage(false, lnwire.NewYourPeerStorageMsg(
backup))

if err != nil {
srvrLog.Errorf("unable to send backup to peer "+
"%v", err)
return
}
}
Collaborator:

As far as I can tell, we never actually store our peer's blobs so this won't do anything will it?

Contributor (Author):

Please point me to any part of this PR that indicates an assumption that the blob is not stored

@Chinwendu20 (Contributor, Author) commented:

> Unfortunately I have to Approach NACK this one. I think the purpose of this has been misunderstood. I suggest we get on a call ASAP to clear up the misunderstanding so you can proceed with better clarity on what needs to be done.
>
> The short version here is that there are a handful of things that need to be handled:
>
>   1. We need to implement a storage layer that we use to load and store data that our peer tells us to store. This should not be done via the chanbackup package.

That is where the swapper comes in.

>   2. We need to implement an API that can load and store data that we choose to store with our peer on a per peer basis.
>   3. We would like to implement a more advanced striping system on top of the basic single message blob.
>   4. We need to think about what data needs to go into that backup layer so that we can recover in-flight htlcs when the SCB is restored.
>   5. We need to think about how we want to structure our striping in a way that is both robust to peers going offline, and is incentive aligned.
>
> There is a lot to be done here and it won't all be done in a single PR. I would like to see the first PR only implement 1 and 2. From there we will add follow up PRs that address later bullets here. Let's just start with the minimum viable protocol implementation.

Thanks for the review and there is nothing unfortunate about it. I agree we need to get on the call and understand both our perspectives on this.

@ProofOfKeags (Collaborator) commented:

Here is what needs to happen in this project overall. We will focus on each step individually. Do not do more than the required scope for each PR. Each step should be a separate PR. We will focus on merging the earlier ones before reviewing the later ones.

Preliminaries

This peer backup protocol outlined in bolts#1110 really has two different things going on.

First, there is the service provider part of the protocol, where we accept requests to store blobs for our peers, and retrieve them later and give them back upon reconnection.

Second, there is the service consumer part of the protocol, where our node requests that our peer store data for us.

These two parts of the protocol basically have no overlap. The only code sharing they should have is the wire message format. Everything else is going to be different.

Step 1: Allow our peers to store data with us.

I request that you reduce the scope of this PR to only handle this case. The key components we need are as follows:

  1. An interface that manages this data on disk (a minimal sketch follows this list). It should be a simple key-value store, making use of the kvdb package. The key should be our peer's public key, and the value should be the blob.
  2. A message handler on the Brontide that handles the peer_storage message. This should make use of the above interface to store the data on disk.
  3. In the Start method of the Brontide, add logic after loadActiveChannels so that if there are any active channels, we yield to them your_peer_storage if we have one on file.
  4. Inside the layer that manages the disk storage, we should include a goroutine that periodically runs and garbage collects blobs that are no longer in use (what you refer to as pruning in this PR).
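
For illustration only, a minimal sketch of what the interface in item 1 might look like, together with a trivial in-memory implementation for tests. All names here are hypothetical rather than the actual lnd API; a production version would back the same interface with a kvdb bucket:

package peerstorage

import "sync"

// PeerDataStore persists the opaque blobs our peers ask us to store,
// keyed by the peer's serialized public key.
type PeerDataStore interface {
	// Store overwrites the blob held for the given peer.
	Store(peerPub [33]byte, blob []byte) error

	// Retrieve returns the blob held for the given peer, or nil if
	// nothing is on file.
	Retrieve(peerPub [33]byte) ([]byte, error)

	// Prune deletes the blobs for peers we no longer need to serve.
	Prune(peerPubs ...[33]byte) error
}

// memStore is a map-backed implementation, useful for unit tests. A
// kvdb-backed version would satisfy the same interface.
type memStore struct {
	mu    sync.Mutex
	blobs map[[33]byte][]byte
}

func newMemStore() *memStore {
	return &memStore{blobs: make(map[[33]byte][]byte)}
}

func (m *memStore) Store(peerPub [33]byte, blob []byte) error {
	m.mu.Lock()
	defer m.mu.Unlock()

	// Copy the blob so the caller may reuse its slice.
	m.blobs[peerPub] = append([]byte(nil), blob...)
	return nil
}

func (m *memStore) Retrieve(peerPub [33]byte) ([]byte, error) {
	m.mu.Lock()
	defer m.mu.Unlock()

	return m.blobs[peerPub], nil
}

func (m *memStore) Prune(peerPubs ...[33]byte) error {
	m.mu.Lock()
	defer m.mu.Unlock()

	for _, pk := range peerPubs {
		delete(m.blobs, pk)
	}
	return nil
}

Keeping the interface this small should make it easy to swap the backing store (file, kvdb, in-memory) without touching the Brontide handler.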

Step 2: Allow our node to store data with our peer. (NOT THIS PR!)

  1. Add to the Brontide a method that takes an arbitrary blob and then creates a peer_storage message and sends it to the peer.
  2. Upon initial send we need to track it and await an initial your_peer_storage message. When we receive this message we will mark it as fully committed.
  3. If we receive a your_peer_storage message (without an outstanding verification step) then we will verify that blob (more on that below) and if it passes verification, then we will (only for now) attach it to the Brontide struct. This will change later as we design more advanced ways to make use of this API.
  4. The API that sends out our peer_storage message should verify that the requested message is sufficiently small, should append a fixed-length checksum to the end, and then encrypt it (see the sketch after this list).
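
For illustration, a rough sketch of item 4 only. The size limit, 4-byte checksum, and AES-GCM are placeholder assumptions, not the scheme lnd will necessarily adopt:

package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"crypto/sha256"
	"errors"
	"fmt"
)

// maxPlaintextBytes is a stand-in bound; the real limit comes from the
// spec's maximum blob size minus the crypto overhead.
const maxPlaintextBytes = 60000

// checksumLen is the fixed checksum length appended to the plaintext.
const checksumLen = 4

// prepareBlob enforces the size limit, appends a fixed-length checksum,
// and encrypts the result with AES-GCM under a 32-byte key, mirroring
// the steps listed in item 4.
func prepareBlob(key, plaintext []byte) ([]byte, error) {
	if len(plaintext) > maxPlaintextBytes {
		return nil, errors.New("blob too large to store with peer")
	}

	// Append a truncated SHA-256 checksum so the blob can be verified
	// when the peer echoes it back in your_peer_storage.
	sum := sha256.Sum256(plaintext)
	payload := append(append([]byte(nil), plaintext...), sum[:checksumLen]...)

	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	aead, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}

	nonce := make([]byte, aead.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}

	// Prepend the nonce so the verification step can decrypt later.
	return append(nonce, aead.Seal(nil, nonce, payload, nil)...), nil
}

func main() {
	key := make([]byte, 32)
	blob, err := prepareBlob(key, []byte("channel state snapshot"))
	fmt.Println(len(blob), err)
}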

Commentary

Notice how none of this really mentions backups yet. That's because how we choose to use this peer storage layer is a decision that happens once we have the capability. We will get there in due time, but before we can we have to make sure these parts are right. LND is mission critical software and so it's important that we make all of these changes to it methodically and not take any shortcuts.

@Roasbeef (Member) commented:

Chiming in here to co-sign @ProofOfKeags's comment above. We need to zoom out a bit to do some more upfront planning to make sure each PR that progressively implements and integrates this feature is well scoped.

In the first phase (this PR), we just want to be able to read and write the blobs we store with the remote peer (and they with us) that we have/had channels open with. This can be shipped by itself, as it implements the generic protocol feature, but doesn't add any lnd-specific logic yet. He's described this phase above, and it can be broken down into two PRs as noted.

In the second phase, we'll start to actually store our lnd-specific data amongst our supporting peers. We need to deploy the first phase before this, as otherwise there'll be no nodes out there we can use to implement and test this PR in the wild. In this phase, we'll need to do some upfront design to determine how we select a peer for storage (maybe we only want high uptime peers as an example), and also how we select which SCBs to store with which peers (see the RAID striping idea in this issue).

In the third phase, we'd combine the work in the prior phases to implement peer-storage-aware SCB recovery. At this point, we're storing blobs for recovery with our peers. If we assume a user only has a seed, then we can use the channel graph to try to find our old channel peers, crawling each peer until we think we have a complete SCB picture. These blobs can then be used to bootstrap the SCB protocol as we know it today.

In a fourth later phase, we can implement the ideas on #8472 to start to store HTLC state on a best effort basis. This changes more frequently (potentially several times a second, or even minute) so we need something to aggregate the updates, upload them, and then later on sweeper awareness to start sweeping the HTLCs.

@Chinwendu20 (Contributor, Author) commented:

Okay thank you @Roasbeef and @ProofOfKeags

> In the Start method of the Brontide, add logic after loadActiveChannels so that if there are any active channels, we yield to them your_peer_storage if we have one on file.

What about if we still have their blob on file but no active channels with the peer? I think we should still send it since it has not been cleared yet.

That means we would only advertise one feature bit for this PR, the one that advertises that we are storing other peers' data, right?

I would also like to understand why we would prefer using kvdb over files to store the data.

Thank you.

@ProofOfKeags (Collaborator) commented:

> What about if we still have their blob on file but no active channels with the peer? I think we should still send it since it has not been cleared yet.

Yeah this is perfectly alright to include in this PR. We can provide it if we have it on file and then it can be up to us to clear that data when we deem appropriate (taking into account the 2016 block delay recommended in the spec).

> I would also like to understand why we would prefer using kvdb over files to store the data.

We use kvdb for just about all of our persistence in LND. The SCB file is an exception and that exception is made due to its use context. In the case of SCB we want to partition it away from other node state so that users can easily save it to another storage medium. We do not need that as a property of this system here because users need not back up the data our peers send us here.

That said, if you do your interface design well we should be able to easily interchange between a file based and a kvdb based scheme with minimal consequences to the contact surface between this code and the rest of the codebase. 🙂

peer/brontide.go Outdated
if err := p.writeMessage(&lnwire.YourPeerStorage{
Blob: data,
}); err != nil {
return fmt.Errorf("unable to send "+
Contributor (Author):

So no delay here? Even though it was included in the spec @ProofOfKeags?

Collaborator:

The spec simply says that we may delay things to rate limit. IMO rate limiting is a decision we should make but the delay is not necessarily the angle we have to go after. I am suggesting we defer that decision to a later commit. Here we just need to get the bones in place.

Collaborator:

Also, yeah, I wouldn't put the delay here at all. The idea of the delay would be to withhold the ack of a peer storage message, not delay the retrieval.

peerstorage.go Outdated

for {
select {
case e := <-k.Epochs:
Contributor (Author):

I do not know if I should do this or use a time.Ticker to call bestHeight on the blockViewer at a particular time interval, but I heard something about that introducing flaky tests.

Collaborator:

IMO you should keep the epochs; don't use a timer.
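
For illustration, a rough sketch of the epoch-driven loop being suggested; the blockEpoch type and pruneAtHeight callback are hypothetical stand-ins rather than lnd's actual block-epoch API:

package peerstorage

// blockEpoch is a hypothetical stand-in for a new-block notification.
type blockEpoch struct {
	Height uint32
}

// gcLoop garbage collects stale peer blobs whenever a new block
// arrives, instead of polling on a time.Ticker. It exits when the
// epoch stream closes or quit is closed.
func gcLoop(epochs <-chan blockEpoch, quit <-chan struct{},
	pruneAtHeight func(height uint32)) {

	for {
		select {
		case e, ok := <-epochs:
			if !ok {
				return
			}
			pruneAtHeight(e.Height)

		case <-quit:
			return
		}
	}
}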

@Chinwendu20 (Contributor, Author) commented:

I do not know if we should use a garbage collector or let the node operators themselves decide that they would like to release their peer storage by issuing an lncli command. The spec does say "at least", so maybe we can leave it for the node operator to decide.

Ononiwu Maureen and others added 5 commits May 12, 2024 06:34
Signed-off-by: Ononiwu Maureen <[email protected]>
This commit introduces new feature bits to enable
backing up data with peers.

Signed-off-by: Ononiwu Maureen <[email protected]>
This commit adds the peer backup storage message
as well as functions to encode and decode them.

Signed-off-by: Ononiwu Maureen <[email protected]>
@ellemouton (Collaborator) commented:

@Chinwendu20 - the race condition is coming from this PR. Perhaps fix that first & then re-request review 🙏

@ellemouton ellemouton removed their request for review May 14, 2024 10:06
In this commit, a new goroutine is added to manage the delay in
persisting backupData shared by peers.

This change serves as a safety check to ensure that a flood of
PeerStorage messages from peers does not degrade our performance by
causing multiple database transactions within a short period.

Signed-off-by: Ononiwu Maureen <[email protected]>
Signed-off-by: Ononiwu Maureen <[email protected]>
@Chinwendu20 (Contributor, Author) commented:

I have updated the PR, but I think a new subsystem or ticker would not be needed; we can store the data in memory and then persist it when we want to quit the connection.

@Chinwendu20 (Contributor, Author) commented:

> I have updated the PR, but I think a new subsystem or ticker would not be needed; we can store the data in memory and then persist it when we want to quit the connection.

After discussing with @saubyk I think this might not be the best approach because there is no guarantee that lnd would gracefully shut down.

@ProofOfKeags (Collaborator) commented:

Yeah we can't guarantee that there won't be power failure or some other fatal event. We should persist eagerly, although it need not be synchronous.

Comment on lines +4187 to +4188
warning := "received peer storage message but not " +
"advertising required feature bit"
Collaborator:

This leaks data to the peer. We should not give indication that we understand the feature here.

Comment on lines +808 to +820
case <-p.quit:
if data == nil {
return
}

// Store the data immediately and exit.
err := p.cfg.PeerDataStore.Store(data)
if err != nil {
peerLog.Warnf("Failed to store peer "+
"backup data: %v", err)
}

return
Collaborator:

As discussed in the main PR thread this is not the approach we want to take. You shouldn't need a separate peerStorageWriter thread. It is fine to persist it in the main handler thread, or fork a single goroutine for overwriting the data so it doesn't block the main readHandler thread.

This will also alleviate the need to use Conds which are notoriously difficult to use correctly.
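
For illustration, a rough sketch of the single-goroutine pattern being suggested. The names p.cfg.PeerDataStore, p.quit and peerLog are taken from the snippet quoted above, while p.wg, the channel hand-off, and the method name are assumptions:

// startBlobWriter forks one goroutine that persists incoming blobs so
// the readHandler never blocks on a database transaction. The
// readHandler simply sends the latest blob on newBlob.
func (p *Brontide) startBlobWriter(newBlob <-chan []byte) {
	p.wg.Add(1)
	go func() {
		defer p.wg.Done()

		for {
			select {
			case blob := <-newBlob:
				err := p.cfg.PeerDataStore.Store(blob)
				if err != nil {
					peerLog.Warnf("Failed to store peer "+
						"backup data: %v", err)
				}

			case <-p.quit:
				return
			}
		}
	}()
}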

Contributor (Author):

Do you mean in the Store method?

@ellemouton (Collaborator) left a comment:

removing review request until previous review addressed 👍

Looks like there is a hanging discussion re the data storage

@lightninglabs-deploy commented:

@Chinwendu20, remember to re-request review from reviewers when ready

@saubyk (Collaborator) commented on Sep 3, 2024:

!lightninglabs-deploy mute
