Adds auto-retry logic to watchtower-plugin #168
Conversation
Force-pushed 74cc35d to 558c236
Initial review, will need to give retrier.rs another pass.
Force-pushed 558c236 to 1b11635
Fixed the typos, will work on the fundamental issues next.
Force-pushed 57c680c to 2aae5d8
Added the changes requested on Discord so the state of the retriers can be accessed. I've slightly adapted the code so it also takes advantage of this to, for instance, send the retry notification. This should be ready modulo discussing the register errors' severity (and any found issue or bug, of course).
Force-pushed 94f087e to 9f39e02
Force-pushed f3e4d71 to 3d193aa
We can now make a distinction between user-initiated retry requests and retry requests coming from on_commitment_revocation.

When we know a retrier has failed, we ignore on_commitment_revocation's retry requests because we already know what the outcome will be (another failure), and we don't want to stay in this retry failure loop forever, so we ignore the request to cut it off.

The cutting off had an edge-case race condition where a user-made retry request would also be ignored (bad UX). We introduced the additional complexity of not allowing retriers to set their tower statuses if they failed, but let the manager set the tower statuses for them instead in the manager's logic loop.

I think that, since we now know who initiated the requests from the newly introduced data type (RevocationData), we can make it so that we never ignore a user's request and avoid the bad UX issue, and also not defer setting the tower status for failed retriers but do it immediately.
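For reference, a minimal sketch of such a data type. Only the variant names (None/Fresh/Stale) are confirmed in this thread; the payload types are assumptions:

use std::collections::HashSet;

type Locator = [u8; 16]; // stand-in for the plugin's own locator type (assumption)

enum RevocationData {
    /// User-initiated retry (e.g. via the retrytower rpc): a signal with no new data.
    None,
    /// A single locator coming straight from on_commitment_revocation.
    Fresh(Locator),
    /// Locators re-loaded from the database (bootstrap or idle wake-up).
    Stale(HashSet<Locator>),
}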
Edit: I don't know why I press buttons with green ticks unconsciously 🤦.
watchtower-plugin/src/retrier.rs
Outdated
        crate::AppointmentStatus::Pending,
    ),
);
retrier.set_status(RetrierStatus::Stopped);
Flagging the retrier as stopped here will cause the retrier not to start in this cycle but in the next one, after 1s. This can be fixed with self.start_retrying(retrier.clone()) after this line.

I feel this code block (else if let Some(t) = retrier.get_el....) fits in the Retrier::should_start() method.
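A minimal sketch of what folding that check into should_start could look like, assuming the field names seen elsewhere in this thread (status, pending_appointments) plus a hypothetical auto_retry_delay field:

use std::collections::HashSet;
use std::sync::Mutex;
use std::time::{Duration, Instant};

#[derive(Clone)]
enum RetrierStatus {
    Stopped,
    Running,
    Idle(Instant),
    Failed,
}

struct Retrier {
    status: Mutex<RetrierStatus>,
    pending_appointments: Mutex<HashSet<[u8; 16]>>, // locators (stand-in type)
    auto_retry_delay: u64,
}

impl Retrier {
    fn should_start(&self) -> bool {
        match self.status.lock().unwrap().clone() {
            // Stopped retriers (re)start only if there is pending data.
            RetrierStatus::Stopped => !self.pending_appointments.lock().unwrap().is_empty(),
            // Idle retriers become eligible again once the auto-retry delay has elapsed.
            RetrierStatus::Idle(since) => {
                since.elapsed() >= Duration::from_secs(self.auto_retry_delay)
            }
            // Running and failed retriers must not be started again.
            _ => false,
        }
    }
}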
watchtower-plugin/src/retrier.rs
Outdated
// Notice that setting the state as Unreachable, when needed, is delayed after the Retrier is evicted from the retriers map.
// Not doing so will allow the user to manually retry before the Retrier has been deleted, making that retry to instantly fail.
// Also, there is nothing to do here for Subscription errors.
The issue was that we considered unreachable to be a complete failure and we wouldn't try again (Failed: we insist on removing the retrier even if we got new appointments for it, since we generally think that new appointments come from on_commitment_revocation and not the user), but now we consider unreachable to go to idle, which is retryable just fine (i think xD).
watchtower-plugin/src/retrier.rs
Outdated
@@ -261,23 +449,25 @@ impl Retrier {
     .await
     .map_err(|e| {
         log::debug!("Cannot renew registration with tower. Error: {:?}", e);
-        Error::permanent("Cannot renew registration with tower")
+        Error::permanent(RetryError::Register(
We can make this one Error::transient to help with network errors and the like.
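For context, a self-contained sketch of that distinction in the backoff crate (using the synchronous backoff::retry here; the plugin uses the async variant): transient errors are fed back into the backoff strategy and retried, permanent ones abort it immediately.

use backoff::{retry, Error, ExponentialBackoff};

fn main() {
    let mut attempts = 0;
    let result: Result<(), Error<&str>> = retry(ExponentialBackoff::default(), || {
        attempts += 1;
        if attempts < 3 {
            // Transient: the strategy backs off and calls the operation again.
            Err(Error::transient("temporary network error"))
        } else {
            Ok(())
        }
    });
    assert!(result.is_ok());
    println!("succeeded after {} attempts", attempts);
}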
watchtower-plugin/src/retrier.rs
Outdated
if e.is_permanent() {
    self.set_status(RetrierStatus::Failed);
} else {
    log::debug!("Starting to idle");
    self.set_status(RetrierStatus::Idle(Instant::now()));
    // Clear all pending appointments so they do not waste any memory while idling
    self.pending_appointments.lock().unwrap().clear();
    self.wt_client
        .lock()
        .unwrap()
        .set_tower_status(self.tower_id, crate::TowerStatus::Unreachable);
}

match e {
    RetryError::Register(_) => {
        log::info!("Setting {} status as subscription error", self.tower_id);
        self.wt_client
            .lock()
            .unwrap()
            .set_tower_status(self.tower_id, TowerStatus::SubscriptionError)
    }
Okay, so let's assume for a second that RetryError::Register is permanent and it occurred.
So we will do:
if e.is_permanent() {
// This will remove the retrier from the active retriers map held by `WTClient`.
// Now users will be granted manual retry based on `TowerStatus::is_retryable`.
self.set_status(RetrierStatus::Failed);
}
Now in the match:
match e {
RetryError::Register(_) => {
log::info!("Setting {} status as subscription error", self.tower_id);
self.wt_client
.lock()
.unwrap()
// We set the tower status to subscription error which is `retryable`.
.set_tower_status(self.tower_id, TowerStatus::SubscriptionError)
}
...
We effectively created the same manual retry race condition we had before.
The user will be allowed to retry (neither the retrier is in the map of active retriers held by WTClient, nor is its tower in a non-retryable state), but if a user actually issues a manual retry at this point (before the failed retrier is removed by the retry manager), the request will go through, but the retry manager will ignore it anyway and remove the failed retrier in the next cycle.
@sr-gi Does this make more sense?

This should be general over any permanent error and not just Register (if it even were to be permanent).

Generally:

permanent error -> failed retrier
failed retrier -> unsafe to set the tower status
Ok, my bad, I failed to see that Retrier::set_status was still removing the data from WTClient::retriers. I thought I had moved it to Retrier::remove_if_failed.

I think we can simply rename remove_if_failed to remove_if_done, and make it remove data from retriers either if the retrier has failed or if it has finished (status == Stopped and no pending appointments). Then simply take the retrier-removing part out of set_status.

Otherwise, we need to carry whether an error is permanent out of Retrier::start and check it in the RetryManager:
fn set_status(&self, status: RetrierStatus) {
*self.status.lock().unwrap() = status.clone();
// Add active retriers to the WTClient
if self.is_running() || self.is_idle() {
log::debug!("Adding {} to active retriers", self.tower_id);
self.wt_client
.lock()
.unwrap()
.retriers
.insert(self.tower_id, status);
}
}
/// Removes our retrier identifier from the WTClient if the retrier is done (either stopped with no appointments or failed)
pub fn remove_if_done(&self) {
    if self.failed() || !self.should_start() {
log::debug!("Removing {} from active retriers", self.tower_id);
self.wt_client
.lock()
.unwrap()
.retriers
.remove(&self.tower_id);
}
}
Notice that, if done like this, the retrier won't be removed from retriers if it's stopped and has pending data, but in this case the retrier will be re-started in the next iteration cycle, so we may be saving one operation against retriers:

self.stopped
- with pending appointments -> remove, re-add in the next cycle
- without pending appointments -> remove

self.should_start
- true -> retrier stays in retriers as stopped and is updated in the next iteration (it's a hashmap)
- false -> remove
Ok, no, nvm, this actually moves the race condition to another place in the codebase. The retrier should be removed from retriers when it is stopped; otherwise, the race condition will exist in the idle->stopped->running transition.

pending_appointments will be loaded when the idle->stopped transition happens, but this is not reflected in the retriers map, so send_to_retrier may miss some data. This should be better:
fn set_status(&self, status: RetrierStatus) {
*self.status.lock().unwrap() = status.clone();
// Add or remove retriers from WTClient based on the RetrierStatus
if self.is_running() || self.is_idle() {
log::debug!("Adding {} to active retriers", self.tower_id);
self.wt_client
.lock()
.unwrap()
.retriers
.insert(self.tower_id, status);
} else if self.is_stopped() {
log::debug!("Removing retrier {} from active retriers", self.tower_id);
self.wt_client
.lock()
.unwrap()
.retriers
.remove(&self.tower_id);
}
}
/// Removes our retrier identifier from the WTClient if the retrier has failed
pub fn remove_if_failed(&self) {
if self.failed() {
log::debug!(
"Removing failed retrier {} from active retriers",
self.tower_id
);
self.wt_client
.lock()
.unwrap()
.retriers
.remove(&self.tower_id);
}
}
@sr-gi That would do.
Another solution I can think of is identifying user-made retry requests and force-executing them (even with failed retriers). This is easy now since we have an enum for the revocation data.
Approaching it this way would relieve us from thinking about whether a race condition will happen in manual retry or not, and we can safely set the tower status from any retrier/thread other than the RetryManager, knowing that if a user executes a retry request on this very same tower, the RetryManager will force-execute their request.
Some (incomplete/pseudocode) example of what I mean:
loop {
match self.unreachable_towers.try_recv() {
Ok((tower_id, data)) => {
match data {
// None for user-made retry requests.
RevocationData::None => {
if let Some(retrier) = self.retriers.get(&tower_id) {
// Indicates that the retry manager should force start that
// retrier in the next retry cycle.
retrier.force_start = true;
// Now load the pending appointments and extend the retrier with them.
retrier.pending_appointments.extend(
load_appointment_locators(
tower_id,
crate::AppointmentStatus::Pending,
),
);
} else {
// Just add a new retrier
// ...
}
}
// Handle stale and fresh ...
_ => unreachable!()
}
}
Err(TryRecvError::Empty) => {
// 1- Remove useless retriers (but keep force_start ones even if failed).
// ...
// 2- Start the shut down retriers one by one.
for retrier in self.retriers.values() {
if retrier.force_start {
retrier.set_status(RetrierStatus::Stopped);
retrier.force_start = false;
}
if retrier.should_start() {
self.start_retrying(retrier.clone())
}
}
}
_ => panic!(),
    }
}
One more solution is to allow or disallow manual retries based on retriers and not on tower status.
Thus, if there is a retrier for tower X we don't let the user retry (a retrier in any status, generally, other than idle), and if there is not, the user can send RevocationData::None peacefully.
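A minimal sketch of gating manual retries on the retriers map; the map shape and the stand-in type names are assumptions based on this thread:

use std::collections::HashMap;
use std::time::Instant;

#[derive(Clone)]
enum RetrierStatus {
    Stopped,
    Running,
    Idle(Instant),
    Failed,
}

type TowerId = [u8; 33]; // stand-in for the plugin's own type (assumption)

fn can_manually_retry(retriers: &HashMap<TowerId, RetrierStatus>, tower_id: TowerId) -> bool {
    match retriers.get(&tower_id) {
        // No retrier owns this tower: the user may kick one off.
        None => true,
        // Idle retriers can be woken up early by the user.
        Some(RetrierStatus::Idle(_)) => true,
        // Running/stopped/failed retriers already own this tower.
        Some(_) => false,
    }
}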
I like the idea, but I think it requires a pretty big refactor overall, doesn't it?
I would feel more comfortable merging this incremental change and considering a followup for a complete refactor of this logic based on the RevocationData type. If that's something you'd like to work on, I'd be more than happy to review it.
> I like the idea, but I think it requires a pretty big refactor overall, doesn't it?
> I would feel more comfortable merging this incremental change and considering a followup for a complete refactor of this logic based on the RevocationData type.

+1

> If that's something you'd like to work on I'd be more than happy to review it.

Yeah sure 😊
Leaving this as unresolved, we can open an issue to keep track of it.
Pushed a fix for the race condition that was going on.
Cool, there is only one thing missing tho: should sub and register errors be permanent or temporary? My take on this (note: I might be a bit biased to the temporary/idle side xD) is that sub errors go temporary/idle and register errors can be either temporary or permanent:
If we go by this, we would need to introduce some sort of logic saying whether each individual variant in the RetryError enum is permanent or temporary. Edit: Sorry, I completely missed it.
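A sketch of what such per-variant severity could look like, going by the variants that appear in this thread; the payload types and the bool on Subscription (flagging whether the error is final) are assumptions:

enum RetryError {
    /// Subscription/registration errors; the bool flags whether the error is final.
    Subscription(String, bool),
    /// The tower could not be reached: always worth retrying later.
    Unreachable,
    /// The tower returned a receipt signed with an unknown key.
    Misbehaving(String), // stand-in payload for the misbehavior proof
    /// The tower has been abandoned by the user.
    Abandoned,
}

impl RetryError {
    fn is_permanent(&self) -> bool {
        matches!(
            self,
            RetryError::Subscription(_, true) | RetryError::Misbehaving(_) | RetryError::Abandoned
        )
    }
}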
LGTM
Force-pushed 97aadbc to b86797b
Force-pushed b86797b to 3447081
Rebased. This should be good for a final diff and check.
// FIXME: Here we should be able to check this, however, due to httpmock limitations, we cannot return a response based on the request.
// Therefore, both requests will be responded to with the same data. Given pending_appointments is a HashSet, we cannot even know which request
// will be sent first (sets are initialized with a random state, which decides the order of iteration).
// https://github.com/alexliesenfeld/httpmock/issues/49
// assert!(!wt_client.lock().unwrap().retriers.contains_key(&tower_id));
// assert!(wt_client
//     .lock()
//     .unwrap()
//     .towers
//     .get(&tower_id)
//     .unwrap()
//     .pending_appointments
//     .is_empty());

// This is not much tbh, but it looks like it's the best we can do at the moment without experiencing random errors.
// Depending on which appointment is sent first, the api will be hit either one or two times.
assert!(api_mock.hits() >= 1 && api_mock.hits() <= 2);
task.abort();
Q: What retrier error happens when the receipt doesn't match the appointment? Is it a signature error?
AddAppointmentError::SignatureError
Some nits, LGTM otherwise.
watchtower-plugin/src/retrier.rs
Outdated
self.wt_client
    .lock()
    .unwrap()
    .set_tower_status(self.tower_id, crate::TowerStatus::Unreachable);
nit: you can omit crate::.
    .unwrap()
    .retriers
    .remove(&self.tower_id);
}
Can you put a comment here saying we shouldn't remove on the status being Failed, so as not to allow a user to manually retry just yet (until the retrier is removed)?
watchtower-plugin/src/retrier.rs
Outdated
// Notice that setting the state as Unreachable, when needed, is delayed after the Retrier is evicted from the retriers map.
// Not doing so will allow the user to manually retry before the Retrier has been deleted, making that retry to instantly fail.
// Also, there is nothing to do here for Subscription errors.
_ => (),
> Not doing so will allow the user to manually retry before the Retrier has been deleted

It's now based on the retrier being in the retriers map and not solely on tower status. Let's update this piece to:

> No need to set the tower status to unreachable because we already set it when idling above

Or we can actually move the piece that sets the tower status to unreachable from the idle branch in the e.is_permanent() check to this place instead.
Umm, this made me realize that for RetryError::Subscription(_, false) we were setting the tower status to Unreachable and right after to SubscriptionError (given RetryError::Subscription(_, false) is not permanent).

So we'll need to change the match to RetryError::Subscription(_, false) and put the aforementioned code in the default branch:
log::warn!("Retry strategy gave up for {}. {}", self.tower_id, e);
if e.is_permanent() {
self.set_status(RetrierStatus::Failed);
}
match e {
RetryError::Subscription(_, false) => {
log::info!("Setting {} status as subscription error", self.tower_id);
self.wt_client
.lock()
.unwrap()
.set_tower_status(self.tower_id, TowerStatus::SubscriptionError)
}
RetryError::Misbehaving(p) => {
log::warn!("Cannot recover known tower_id from the appointment receipt. Flagging tower as misbehaving");
self.wt_client
.lock()
.unwrap()
.flag_misbehaving_tower(self.tower_id, p);
}
RetryError::Abandoned => {
log::info!("Skipping retrying abandoned tower {}", self.tower_id)
}
// This covers `RetryError::Unreachable` and `RetryError::Subscription(_, false)`
_ => {
log::debug!("Starting to idle");
self.set_status(RetrierStatus::Idle(Instant::now()));
// Clear all pending appointments so they do not waste any memory while idling
self.pending_appointments.lock().unwrap().clear();
self.wt_client
.lock()
.unwrap()
.set_tower_status(self.tower_id, TowerStatus::Unreachable);
}
}
Good point.

> So we'll need to change the match to RetryError::Subscription(_, false) and put the aforementioned code in the default branch

I guess you meant RetryError::Subscription(_, true), and in the code snippet as well.
indeed
Force-pushed 3447081 to 575cc34
- Updates `Retrier::run` to return more meaningful errors. `Retrier::run` used to simply return a message; revamps it to return RetryError variants so we can handle return cases better.
- Adds an additional state to `RetrierStatus`: Idle. Retriers that fail due to an accumulation of transient errors will be flagged as Idle instead of Failed and retried later on (based on `auto_retry_delay`). Notice Retrier data is not kept in memory while a retrier is Idle. Instead, data is cleared and loaded again from the database when the `Retrier` is ready to run again.
- Revamps how revocation data is sent to the `RetryManager`: the RetryManager used to receive locators one by one via unreachable_towers. This is due to them being mainly fed by `on_commitment_revocation`, which generates them one by one. However, both when manually retrying and when bootstrapping from an already populated database, multiple appointments may be pending for the same tower, hence needing to call `unreachable_towers.send` multiple times for the same tower. This itself was not a big deal, given we didn't really need to differentiate between the cases. We do now, though. In order to implement periodic retries while allowing manual retries, we need to be able to signal the state transition to the `Retrier` without providing any new data:
  - If a Retrier is idle and we receive data through `on_commitment_revocation`, we need to append that data to the `Retrier`.
  - If a Retrier is idle and we receive data through a manual retry, we need to change the state of the `Retrier` without adding any new data to it.
  In order to implement this, we've added an additional map to `WTClient` that reports the state of the active retriers. Retriers are active only if they are running or idle.
- Also reworks `WTClient::set_tower_status` to update the status only if the new one does not match the old one. This is simply to reduce the boilerplate of having to perform this check in other parts of the plugin codebase.
The re-register logic had a bug where the `TowerSummary` that was kept in memory after re-registering was sweeping the references to both the pending and invalid appointments. While this data was still in the database, re-registering may have made it look like the data was not there.
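A minimal sketch of the re-register fix described above, with stand-in names (the `TowerSummary` fields here are assumptions): refresh the subscription data in place instead of rebuilding the summary, so the pending/invalid references are not swept.

use std::collections::HashSet;

type Locator = [u8; 16]; // stand-in type

#[derive(Default)]
struct TowerSummary {
    available_slots: u32,
    subscription_expiry: u32,
    pending_appointments: HashSet<Locator>,
    invalid_appointments: HashSet<Locator>,
}

impl TowerSummary {
    // Update only the subscription fields on re-registration;
    // pending/invalid appointment references are left untouched.
    fn renew(&mut self, available_slots: u32, subscription_expiry: u32) {
        self.available_slots = available_slots;
        self.subscription_expiry = subscription_expiry;
    }
}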
Force-pushed 575cc34 to 39e817c
Current State

The current version of the watchtower-plugin for CLN will try to re-send data to towers that are not reachable for various reasons using an exponential backoff strategy. However, once the strategy is given up on, the tower will be flagged as unreachable and not retried any further (unless the user manually triggers it through the retrytower rpc).

Improvements

This PR adds auto-retry logic, so once a specific unreachable tower is given up on, the retrier is not removed but remains idle and is launched again after a specific delay (controlled by watchtower-auto-retry-delay). From this point on, retriers will only be removed from the RetryManager if they fail due to a permanent error, but not due to a transient one (in which case they will be auto-retried). In order to achieve this, the retriers now return more meaningful errors (instead of strings) that are used to distinguish why a retrier had to give up retrying, if so.

The way data is shared between the WTClient and the RetryManager has also been updated in order to achieve this. Beforehand, data was sent one by one in (tower_id, locator) tuples; now it is sent in (tower_id, RevocationData) tuples, in order to distinguish between retries/bootstraps and sending data because of a channel update.
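A self-contained sketch of the new payload shape; the stand-in types and the use of a std channel are assumptions (the plugin defines its own TowerId/Locator types and channel):

use std::collections::HashSet;
use std::sync::mpsc;

type TowerId = [u8; 33]; // stand-ins for the plugin's own types (assumptions)
type Locator = [u8; 16];

enum RevocationData {
    None,                    // manual retry/bootstrap signal: no new data
    Fresh(Locator),          // a single locator from on_commitment_revocation
    Stale(HashSet<Locator>), // a batch loaded back from the database
}

fn main() {
    let (tx, rx) = mpsc::channel::<(TowerId, RevocationData)>();
    let tower_id = [0u8; 33];

    // Channel update: one fresh locator, as before.
    tx.send((tower_id, RevocationData::Fresh([1u8; 16]))).unwrap();
    // Manual retry: wake the retrier without appending any data.
    tx.send((tower_id, RevocationData::None)).unwrap();

    while let Ok((_, data)) = rx.try_recv() {
        match data {
            RevocationData::None => println!("user-initiated retry"),
            RevocationData::Fresh(_) => println!("fresh locator from a channel update"),
            RevocationData::Stale(batch) => println!("{} locators from bootstrap", batch.len()),
        }
    }
}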
Bug fixes

This also fixes a bug related to re-registering, where data regarding pending and invalid appointments was being swept from memory after re-registering.