contractcourt: Persist Resolution Outcomes #4157
Conversation
Force-pushed 5a31554 to 31f4f30
Discussed the design of this PR offline. One thing that I found important in the discussion is the PendingChannels rpc output. It is likely that there are some gaps in that output at the moment: once resolvers are marked as resolved, their data disappears from the report. This PR would store that data on disk, which could possibly be used as another data source for PendingChannels, but will there be problems with race conditions?
Fixing PendingChannels looks like a good intermediate step to me before exposing the data of fully closed channels, and serves as a check that the design fits that too.
Another thought is that concerns could be more separated by keeping all the db code outside of the resolvers. Let them just expose a struct that channel arbitrator takes and persists.
Finally, thinking about this issue without too much consideration for the existing code: if resolvers were db-stateless (I think they should be, and #3688 brings them very close to that) and we also get rid of the 'morphing' of one resolver into the other, I think it could be good to keep all the resolvers in memory for as long as the channel is still pending close. Do all the reporting to PendingChannels directly from the in-memory resolver state. Then, when everything is finished, ChannelArbitrator fetches those in-memory reports and persists them all to disk in the same transaction where the channel is marked fully closed.
In general, I think there is a risk with this issue of building new functionality on top of a not so solid foundation and not getting the best value for the effort. Especially when adding new db code, the cost to change it again later can be high.
My proposal would be to do things in this order:
- Remove nursery
- Make resolvers stateless (probably already done when nursery is removed)
- Keep resolvers in memory as long as channel is still pending close (fix PendingChannels report)
- Atomically persist resolver reports when channel is fully resolved and report that along with the close summary
While I think it is valuable to do things in order, I think we're currently throwing away a lot of data that is really useful for people to have. Would we be able to reproduce it entirely in this new stateless world order? And even if we can, would it be worth reprocessing all these special case channels (where we'd have to dig them out of historical chan bucket etc)? If not, I'd like to explore how we can keep this data around in the most future proof way.
That's what I like about adding a new bucket (rather than keeping around the old resolvers). I'm pretty confident that the information stored in this bucket is what is needed user side to get good insight into what went down on chain for each of their channels. Even if we do get rid of the morphing resolver (contest resolver -> success/timeout), that stage still does need to be recorded for the user; they need to know whether we went on chain with a
Not sure if this can be done atomically in the current setup (which is a prio). But agreed, in the theoretical stateless case, handoff of a report/outcome of some kind to channel arbitrator makes sense.
The potential race here would be that something is unresolved when we get our current set of resolvers from memory, then it resolves but does not write to disk in time for when we get the already resolved set from disk. I don't think this is a huge concern. Potential options here:
If we're going to be rehauling the channel arbitrator in the next few months, I'd say that it's worth waiting. But if that's not going to happen, I would like to start storing this information.
Great points @carlaKC. I don't think we cut off future possibilities of removing the nursery or removing the checkpointing from resolvers if we go through with this change. Those changes are IMO nice-to-haves, as the existing system does work, but is hauling around a bit of technical debt. The changes proposed in this PR would scratch several needs on our end, and also those of our major users, w.r.t wanting to have air-tight accounting of on-chain sweeping and fees (also I'm all for filling in any low hanging gaps along the way).
I worry that if we go the "long way around", then we'd risk not landing this core feature, since as we all know, at times things have a tendency to sprawl. There's also the matter of the extra review cycles (particularly the nursery changes), as they touch a pretty critical area w.r.t safe operation.
In the end, I think #2 straddles a happy medium of making a smaller change here which allows us to finally have sane records of past channel history, without cutting off any possible refactors or slight re-designs in this area. Ultimately, I think we'll need to be storing this data anyway.
Completed an initial cursory pass. It also occurs to me that this'll be rather useful for debugging weird conditions related to chain claims. As is right now, we only have logs to go off of, which themselves at times are missing critical information.
Force-pushed f5dae08 to 690f9c6
Force-pushed 01166e6 to 9d7c576
Still need to take a pass through the tests, but the diff is looking pretty good now IMO. The main comment I have is if we actually need that new method to scan for transactions in a height range, given that the sign desc will always contain the input value information we need. I might be missing something here though.
```go
// Checkpoint the resolver with a closure that will write the outcome
// of the resolver and its sweep transaction to disk.
return nil, c.Checkpoint(c, func(tx kvdb.RwTx) error {
	return c.PutResolverReport(tx, report)
```
All of the resolvers now end with this `PutResolverReport` call. I'd try to move that to the caller of `Resolve` so that it isn't duplicated and always guaranteed to be executed. If you unify resolver report and contract report, the caller can request the report through the existing `report()` method.
We want the `PutReport` call to be made in the same tx as our final `Checkpoint` (otherwise we could checkpoint, restart and then won't resume the complete resolver but also won't have the report). I'm unsure whether it makes sense to move the final `Checkpoint` out of each resolver. The same could have been said of `Checkpoint` itself before we started writing reports, but it leaves `Resolve` incomplete to move it out imo.
I would prefer moving the `CheckPoint` out from the resolver rather than complicating the resolver with the closure.
Force-pushed 6cdc1f3 to f5a1e79
Force-pushed 3c6d0ab to 2d777aa
Main comment is still the `ResolverType` / `ResolverOutcome` structure. I know we've seen many iterations already, but as this is part of the database and exposed over rpc, imo we need to be sure it is the optimal way to represent the resolution results.
```go
	// claimed the outpoint. This may be a sweep transaction, or a first
	// stage success/timeout transaction.
	SpendTxID *chainhash.Hash
}
```
Still interested in an answer to this.
channeldb/reports.go:

```go
// ResolverOutcomeTimeout indicates that a contract was timed out on
// chain.
ResolverOutcomeTimeout ResolverOutcome = 4
```
That it is timeout, is already encoded in the resolver type. Isn't this lost?
channeldb/reports.go:

```go
// ResolverOutcomeFirstStage indicates that a htlc had to be claimed
// over two stages, with this outcome representing the confirmation
// of our success/timeout tx.
ResolverOutcomeFirstStage ResolverOutcome = 5
```
So there will be two `ResolverTypeIncomingHtlc` entries in the `resolversBucket`. One with outcome `ResolverOutcomeFirstStage` and another one with `ResolverOutcomeClaimed`?
```diff
 select {
-case preimage := <-preimageSubscription.WitnessUpdates:
+case preimage := <-witnessUpdates:
```
I miss writing of the success resolution outcome in this resolver. Shouldn't the happy path also be concluded with a write to the new bucket?
The success resolver handles the write, since the claim has not confirmed yet at this stage.
```go
// If we have a success tx, we append a report to represent our first
// stage claim.
if h.htlcResolution.SignedSuccessTx != nil {
```
When exactly is the outcome of the first stage written to the bucket? I would expect that to happen after the presigned tx confirms.
```go
// Once our success tx has confirmed, we add a resolution for
// our success tx first stage transaction.
successTx := h.htlcResolution.SignedTimeoutTx
```
Confusing to call the timeout tx `successTx`?
```go
OutPoint:        h.htlcResolution.ClaimOutpoint,
Amount:          amt,
ResolverType:    channeldb.ResolverTypeOutgoingHtlc,
ResolverOutcome: channeldb.ResolverOutcomeTimeout,
```
Still struggling with the definition of those outcomes. Inside the context of an outgoing htlc, we can only get it or not get it.
```diff
 c.reportLock.Unlock()

 c.resolved = true
-return nil, nil
+return nil, c.PutResolverReport(nil, report)
```
I still don't fully understand why `PutResolverReport` needs to be called directly. I'd think that always calling `CheckPoint` in resolvers makes it clearer what is going on: it checkpoints the full state (internal + reports) of a resolver, and that is the only function that is used.
I know there is access via the arb config to other functions, but maybe that isn't a good thing anyway.
```proto
    stages. This outcome represents the broadcast of a timeout or success
    transaction for this two stage htlc claim.
    */
    FIRST_STAGE = 4;
```
Rename to `CLAIMED_FIRST_STAGE` to make it clear that this is a claim result?
isn't this a precursor to `CLAIMED` and `TIMEOUT` tho?
contractcourt/anchor_resolver.go:

```go
// Anchor was swept by someone else. This is possible after the
// 16 block csv lock.
case sweep.ErrRemoteSpend:
	c.log.Warnf("our anchor spent by someone else")
	outcome = channeldb.ResolverOutcomeAbandoned
```
Unclaimed?
Discussed some of the outstanding comments offline. Still not totally happy with the PR, but approving anyway to no longer hold up progress. That makes for the required two approvals, but I think it would still be good if @cfromknecht takes a final look at just the `.proto` changes.
a few small nits
```proto
    stages. This outcome represents the broadcast of a timeout or success
    transaction for this two stage htlc claim.
    */
    FIRST_STAGE = 4;
```
isn't this a precursor to `CLAIMED` and `TIMEOUT` tho?
Force-pushed 2d777aa to b888fa9
LGTM 🙌
hold off on merging until #971 is in
Add a new top level bucket which holds closed channels nested by chain hash which contains additional information about channel closes. We add resolver resolutions under their own key so that we can extend the bucket with additional information if required.
To allow us to write the outcome of our resolver to disk, we add optional resolver reports to the CheckPoint function. Variadic params are used because some checkpoints may have no reports (when the resolver is not yet complete) and some may have two (in the case of a two stage resolution).
Our current set of reports contains much of the information we will need to persist contract resolutions. We add a function to create resolver reports from our existing set of resolutions.
For incoming htlcs that are timed out or failed (invalid htlc or invoice condition not met), we save a single on chain resolution because we don't need to take any actions on them ourselves (we don't need to worry about 2 stage claims, since this is the success path for our peer).
Checkpoint our htlc claims with on chain resolutions, including our first stage success tx where required.
When a remote peer claims one of our outgoing htlcs on chain, we do not care whether they claimed with multiple stages. We simply store the claim outcome then forget the resolver.
@carlaKC needs rebase
Force-pushed b888fa9 to 177c314
This PR adds a new bucket to store the outcome of on-chain resolutions. The original thought for this was just to keep the entire briefcase around; the beginning of that approach is coded up here.
Reasons for a new bucket:
This change does not currently persist fees for resolutions. This will be done in a follow up, and the reports are serialized on disk in a manner which allows us to easily add fees at a later stage without a migration. Fees are a non-trivial element to add because we batch sweeps, so would need to modify our spend notifications to attribute a portion of fees to a specific output. We also have cases (like successTx) where fees are not supplied by the wallet, so are not detected as fees by btcwallet.
Fixes #2472