
Peer storage #2888

Open · wants to merge 5 commits into master
Conversation

@thomash-acinq (Member) commented Jul 25, 2024

@t-bast (Member) left a comment

I thought this was an easy PR, but as I dived into it I found many subtle aspects that have a big impact on scenarios where users restore from seed. That's why I have multiple non-trivial comments, for which we need to decide how we'd like eclair to behave.

@codecov-commenter

Codecov Report

Attention: Patch coverage is 67.85714% with 36 lines in your changes missing coverage. Please review.

Project coverage is 86.01%. Comparing base (f02c98b) to head (103fc65).
Report is 5 commits behind head on master.

Files with missing lines | Patch % | Lines
...cala/fr/acinq/eclair/db/sqlite/SqlitePeersDb.scala | 69.69% | 10 Missing ⚠️
...c/main/scala/fr/acinq/eclair/db/pg/PgPeersDb.scala | 70.00% | 9 Missing ⚠️
.../main/scala/fr/acinq/eclair/db/DualDatabases.scala | 0.00% | 6 Missing ⚠️
.../scala/fr/acinq/eclair/db/PeerStorageCleaner.scala | 0.00% | 5 Missing ⚠️
...-core/src/main/scala/fr/acinq/eclair/io/Peer.scala | 79.16% | 5 Missing ⚠️
...ir-core/src/main/scala/fr/acinq/eclair/Setup.scala | 50.00% | 1 Missing ⚠️


Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2888      +/-   ##
==========================================
- Coverage   86.16%   86.01%   -0.15%     
==========================================
  Files         224      227       +3     
  Lines       20073    20198     +125     
  Branches      813      829      +16     
==========================================
+ Hits        17295    17373      +78     
- Misses       2778     2825      +47     
Files with missing lines | Coverage Δ
...core/src/main/scala/fr/acinq/eclair/Features.scala | 100.00% <100.00%> (ø)
...re/src/main/scala/fr/acinq/eclair/NodeParams.scala | 91.78% <100.00%> (+0.04%) ⬆️
.../eclair/wire/protocol/LightningMessageCodecs.scala | 99.47% <100.00%> (+0.01%) ⬆️
...q/eclair/wire/protocol/LightningMessageTypes.scala | 99.14% <ø> (ø)
...fr/acinq/eclair/wire/protocol/PeerStorageTlv.scala | 100.00% <100.00%> (ø)
...ir-core/src/main/scala/fr/acinq/eclair/Setup.scala | 75.00% <50.00%> (-0.30%) ⬇️
.../scala/fr/acinq/eclair/db/PeerStorageCleaner.scala | 0.00% <0.00%> (ø)
...-core/src/main/scala/fr/acinq/eclair/io/Peer.scala | 89.09% <79.16%> (-0.54%) ⬇️
.../main/scala/fr/acinq/eclair/db/DualDatabases.scala | 9.60% <0.00%> (-0.34%) ⬇️
...c/main/scala/fr/acinq/eclair/db/pg/PgPeersDb.scala | 80.43% <70.00%> (-5.28%) ⬇️
... and 1 more

... and 19 files with indirect coverage changes


@t-bast (Member) left a comment

I'm a bit afraid of the DoS risks associated with this feature: our node will be a very likely target (everyone will want to store some data with us), so we must be very careful that this cannot be exploited.

@@ -517,6 +527,14 @@ class Peer(val nodeParams: NodeParams,
d.peerConnection forward unknownMsg
stay()

case Event(store: PeerStorageStore, d: ConnectedData) if nodeParams.features.hasFeature(Features.ProvideStorage) && d.channels.nonEmpty =>
startSingleTimer("peer-storage-write", WritePeerStorage, nodeParams.peerStorageWriteDelay)
Member:

I'm not sure this works as you intended: since startSingleTimer replaces a previously scheduled timer with the same key, if our peer sends PeerStorageStore more frequently than nodeParams.peerStorageWriteDelay, we will never persist their backup.
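To illustrate the concern, here is a minimal standalone sketch using plain akka Timers (this is not the eclair Peer code; the Debouncer actor and the Store/Flush messages are made up for the example). Because startSingleTimer with the same key cancels and re-arms the pending timer, a steady stream of writes keeps postponing the flush indefinitely:

import akka.actor.{Actor, ActorSystem, Props, Timers}
import scala.concurrent.duration._

object DebounceDemo extends App {
  case object Store
  case object Flush

  class Debouncer extends Actor with Timers {
    override def receive: Receive = {
      case Store =>
        // same key => the pending timer (if any) is cancelled and restarted from zero
        timers.startSingleTimer("flush", Flush, 1.second)
      case Flush =>
        println("flushed to disk")
    }
  }

  val system = ActorSystem("debounce-demo")
  val debouncer = system.actorOf(Props(new Debouncer))
  import system.dispatcher
  // send Store every 500 ms: Flush never fires while the writes keep coming,
  // which is the behaviour described above for a peer spamming PeerStorageStore
  system.scheduler.scheduleWithFixedDelay(0.millis, 500.millis, debouncer, Store)
}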

Member:

This is fixed in #2945, along with some other nits.

Member (Author):

This was on purpose: if a peer updates the storage all the time, there is no need to persist every intermediate version; it's better to wait for the updates to stop.

connect(remoteNodeId, peer, peerConnection2, switchboard, channels = Set(ChannelCodecsSpec.normal), sendInit = false, peerStorage = Some(hex"0123456789"))
peerConnection2.send(peer, PeerStorageStore(hex"1111"))
connect(remoteNodeId, peer, peerConnection3, switchboard, channels = Set(ChannelCodecsSpec.normal), sendInit = false, peerStorage = Some(hex"1111"))
assert(nodeParams.db.peers.getStorage(remoteNodeId).contains(hex"abcdef")) // Because of the delayed writes, the original value hasn't been updated yet.
Member:

This can be flaky: depending on machine timing, this could contain any of the peer storage versions. We only need the eventually condition IMO.
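For reference, a sketch of keeping only the eventual assertion (reusing the identifiers from the quoted test; the ScalaTest Eventually import and the hex"1111" expectation, i.e. the last blob stored in this scenario, are assumptions):

import org.scalatest.concurrent.Eventually._

// sketch only: drop the assert on the intermediate value and keep a single eventual
// assertion, so the test no longer depends on when the delayed write happens to land
eventually {
  assert(nodeParams.db.peers.getStorage(remoteNodeId).contains(hex"1111"))
}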

Member:

This is fixed in #2945.

Comment on lines +112 to +116
using(pg.prepareStatement("UPDATE local.peer_storage SET removed_peer_at = ? WHERE node_id = ?")) { statement =>
statement.setTimestamp(1, TimestampSecond.now().toSqlTimestamp)
statement.setString(2, nodeId.value.toHex)
statement.executeUpdate()
}
Member:

Doing the storage removal this way creates the following bug:

  • we close our channel with a peer, thus setting removed_peer_at
  • our peer then opens a new channel with us
  • they keep using this active channel
  • after 30 days, we will delete their data from our DB even though we now have an active channel

I think this should be done differently:

  • remove the removed_peer_at column from peer_storage
  • create a dedicated table local.peer_storage_cleanup (node_id TEXT NOT NULL PRIMARY KEY, peer_removed_at TIMESTAMP WITH TIME ZONE NOT NULL)
  • in removePeer add a row in local.peer_storage_cleanup
  • in the PeerStorageCleaner, once per day:
    • list all entries in local.peer_storage_cleanup
    • for each node_id in this list:
      • go through the Switchboard to obtain the peer actor (if it exists) and request GetPeerChannels
      • if we have a NORMAL channel, remove this node_id from the peer_storage_cleanup table
      • otherwise, if we've exceeded removalDelay, remove this node_id from the peer_storage_cleanup table and remove the associated peer storage from local.peer_storage

This guarantees that we don't delete storage from peers with whom we re-open a channel. It is a bit cumbersome to do; if you can think of a better way to avoid this issue, feel free to implement that! A rough sketch of the schema side follows below.
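For illustration only, a rough sketch of the schema side of this proposal, mirroring the prepared-statement style of the PgPeersDb snippet quoted above (the exact statements, including the ON CONFLICT clause, are assumptions rather than the final implementation; the NORMAL-channel check itself lives in the PeerStorageCleaner actor):

// dedicated clean-up table, replacing the removed_peer_at column on local.peer_storage
statement.executeUpdate("CREATE TABLE local.peer_storage_cleanup (node_id TEXT NOT NULL PRIMARY KEY, peer_removed_at TIMESTAMP WITH TIME ZONE NOT NULL)")

// in removePeer: schedule the peer's storage for clean-up instead of timestamping it in place
using(pg.prepareStatement("INSERT INTO local.peer_storage_cleanup (node_id, peer_removed_at) VALUES (?, ?) ON CONFLICT (node_id) DO NOTHING")) { statement =>
  statement.setString(1, nodeId.value.toHex)
  statement.setTimestamp(2, TimestampSecond.now().toSqlTimestamp)
  statement.executeUpdate()
}

// in the daily PeerStorageCleaner pass, only once the peer has no NORMAL channel and
// removalDelay has elapsed since peer_removed_at:
using(pg.prepareStatement("DELETE FROM local.peer_storage WHERE node_id = ?")) { statement =>
  statement.setString(1, nodeId.value.toHex)
  statement.executeUpdate()
}
using(pg.prepareStatement("DELETE FROM local.peer_storage_cleanup WHERE node_id = ?")) { statement =>
  statement.setString(1, nodeId.value.toHex)
  statement.executeUpdate()
}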

@@ -517,6 +527,14 @@ class Peer(val nodeParams: NodeParams,
d.peerConnection forward unknownMsg
stay()

case Event(store: PeerStorageStore, d: ConnectedData) if nodeParams.features.hasFeature(Features.ProvideStorage) && d.channels.nonEmpty =>
Member:

Checking d.channels.nonEmpty isn't enough; it exposes us to the following DoS attack:

  • Mallory creates a dummy node_id, sends open_channel and peer_storage_store
  • We store her blob of data in our DB
  • Mallory disconnects, aborting the channel open, so it didn't cost her anything
  • She does the same thing with other node_ids in a loop

We end up storing a large amount of data in our peer_storage DB that we will only clean up after the removalDelay. I think we should only store our peer's data when we have an active channel (in the NORMAL state): this ensures that some on-chain fees have been paid, which protects against DoS. This matches our current behavior, where we only store Phoenix data as part of an active channel's commitments object.

This is a bit cumbersome to do because checking that we have an active channel is asynchronous: we need to send a message to our d.channels actors and wait for their responses (we can use the PeerChannelsCollector for that). I'm not sure yet exactly how we can rework the code to achieve this without introducing too much complexity; how do you think we should do this? It probably makes sense to brainstorm this before actually implementing it. A rough sketch of one possible shape is included below.
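Purely as a starting point for that brainstorm, a hedged sketch of one possible shape (the queryPeerChannels helper is hypothetical, and the PeerChannels / state / NORMAL / blob names only approximate eclair's GetPeerChannels plumbing, so treat all of them as assumptions): buffer the blob in the FSM data and only schedule the DB write once at least one channel is confirmed to be in the NORMAL state.

case Event(store: PeerStorageStore, d: ConnectedData) if nodeParams.features.hasFeature(Features.ProvideStorage) =>
  // hypothetical helper: spawns a PeerChannelsCollector for this peer and has it
  // reply to this actor with the aggregated channel states (PeerChannels)
  queryPeerChannels(d)
  // keep the latest blob in memory only; nothing is written to the DB yet
  stay() using d.copy(peerStorage = Some(store.blob))

case Event(channels: PeerChannels, d: ConnectedData) if d.peerStorage.nonEmpty =>
  // only persist if at least one channel has reached NORMAL, i.e. on-chain fees have
  // been paid, which is the anti-DoS condition described above
  if (channels.channels.exists(_.state == NORMAL)) {
    startSingleTimer("peer-storage-write", WritePeerStorage, nodeParams.peerStorageWriteDelay)
  }
  stay()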

@t-bast mentioned this pull request on Nov 22, 2024