
Adds auto-retry logic to watchtower-plugin #168

Merged

sr-gi merged 4 commits into talaia-labs:master on Jan 10, 2023

Conversation

@sr-gi (Member) commented Dec 22, 2022

Current State

The current version of the watchtower-plugin for CLN will try to re-send data to towers that are unreachable (for various reasons) using an exponential backoff strategy. However, once the strategy gives up, the tower is flagged as unreachable and not retried any further (unless the user manually triggers it through the retrytower RPC).
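For reference, exponential backoff simply doubles the wait between attempts. A minimal, generic sketch of the idea (the function and parameter names are illustrative, not the plugin's actual implementation):

use std::time::Duration;

/// Yields exponentially growing delays: base, 2*base, 4*base, ...
fn backoff_delays(base: Duration, max_attempts: u32) -> impl Iterator<Item = Duration> {
    (0..max_attempts).map(move |i| base * 2u32.pow(i))
}

fn main() {
    // E.g. with a 1s base and 5 attempts: 1s, 2s, 4s, 8s, 16s.
    for delay in backoff_delays(Duration::from_secs(1), 5) {
        println!("next retry in {:?}", delay);
    }
}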

Improvements

This PR adds auto-retry logic: once a specific unreachable tower is given up on, the retrier is not removed but remains idle, and it is launched again after a specific delay (controlled by watchtower-auto-retry-delay). From this point on, retriers are only removed from the RetryManager if they fail due to a permanent error, not due to a transient one (in which case they are auto-retried).

In order to achieve this, the retriers now return more meaningful errors (instead of strings), which are used to distinguish why a retrier had to give up retrying, if it did.
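As a rough illustration of what such a structured error could look like (the variant names follow the ones quoted later in this thread: Register, Subscription, Misbehaving, Unreachable, Abandoned; the payload types here are assumptions):

/// Stand-in for the real misbehavior proof type.
struct MisbehaviorProof;

/// Sketch of a structured retry error replacing the old string messages.
enum RetryError {
    /// The tower could not be reached.
    Unreachable,
    /// Registering (or re-registering) with the tower failed.
    Register(String),
    /// Subscription issue; the flag marks whether the failure is final.
    Subscription(String, bool),
    /// The tower misbehaved; carries a proof of it.
    Misbehaving(MisbehaviorProof),
    /// The tower was abandoned by the user while retrying.
    Abandoned,
}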

The way data is shared between the WTClient and the RetryManager has also been updated to achieve this. Previously, data was sent one by one in (tower_id, locator) tuples; now it is sent in (tower_id, RevocationData) tuples, in order to distinguish retries/bootstraps from data sent because of a channel update.
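A sketch of what such a payload type could look like (the Fresh/Stale variant names are assumptions based on the "handle stale and fresh" wording later in this thread; RevocationData::None is quoted verbatim below):

use std::collections::HashSet;

/// Stand-in for the 16-byte appointment locator used by the plugin.
type Locator = [u8; 16];

/// Distinguishes why data is being sent to a retrier.
enum RevocationData {
    /// A single locator produced by a channel update (on_commitment_revocation).
    Fresh(Locator),
    /// A batch loaded from the database on bootstrap or periodic retry.
    Stale(HashSet<Locator>),
    /// No data at all: a user-initiated retry, purely a state-transition signal.
    None,
}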

Bug fixes

This PR also fixes a bug where data regarding pending and invalid appointments was being swept from memory after re-registering.

@sr-gi force-pushed the cln-plugin-auto-retry branch from 74cc35d to 558c236 on December 22, 2022 17:36
@sr-gi requested a review from mariocynicys on December 22, 2022 17:50
@sr-gi added the feature request (New feature or request), Seeking Code Review (review me pls), and cln-plugin (Stuff related to watchtower-plugin) labels on Dec 22, 2022
@sr-gi linked an issue on Dec 22, 2022 that may be closed by this pull request
@mariocynicys (Collaborator) left a comment

Initial review; will need to give retrier.rs another pass.

watchtower-plugin/src/main.rs (outdated; resolved)
watchtower-plugin/src/retrier.rs (outdated; resolved)
watchtower-plugin/src/retrier.rs (outdated; resolved)
watchtower-plugin/src/retrier.rs (outdated; resolved)
watchtower-plugin/src/retrier.rs (outdated; resolved)
@sr-gi force-pushed the cln-plugin-auto-retry branch from 558c236 to 1b11635 on December 27, 2022 09:49
@sr-gi (Member, Author) commented Dec 27, 2022

Fixed the typos, will work on the fundamental issues next

@sr-gi force-pushed the cln-plugin-auto-retry branch 3 times, most recently from 57c680c to 2aae5d8 on December 29, 2022 12:35
@sr-gi (Member, Author) commented Dec 29, 2022

Added the changes requested on Discord so that the state of the retriers can be accessed by the WTClient.

I've slightly adapted the code so it also takes advantage of this to, for instance, send the retry notification to the RetryManager without the need to include any associated data with it.

This should be ready modulo discussing the register errors' severity (and any issues or bugs found, ofc).

@sr-gi force-pushed the cln-plugin-auto-retry branch 6 times, most recently from 94f087e to 9f39e02 on December 30, 2022 16:11
@sr-gi force-pushed the cln-plugin-auto-retry branch from f3e4d71 to 3d193aa on January 2, 2023 10:27
mariocynicys previously approved these changes Jan 5, 2023

@mariocynicys (Collaborator) left a comment

We can now make a distinction between user-initiated retry requests and retry requests coming from on_commitment_revocation.

When we knew a retrier had failed, we would ignore on_commitment_revocation's retry requests: we already knew what the outcome would be (another failure) and we didn't want to stay in this retry failure loop forever, so we ignored the request to cut it off.

The cutting off had an edge-case race condition where a user-made retry request would also be ignored (bad UX). We introduced the additional complexity of not allowing retriers to set their tower statuses if they failed, and instead let the manager set the tower statuses for them in the manager's logic loop.

I think that since we now know who initiated the requests from the newly introduced data type (RevocationData), we can make it so that we never ignore a user's request and avoid the bad UX issue, and also not defer setting the tower status for failed retriers but do it immediately.

Edit: I don't know why I press buttons with green ticks unconsciously 🤦.

watchtower-plugin/tests/test.py (outdated; resolved)
        crate::AppointmentStatus::Pending,
    ),
);
retrier.set_status(RetrierStatus::Stopped);
@mariocynicys (Collaborator):

Flagging the retrier as stopped here will cause the retrier not to start in this cycle but in the next one, after 1s.
This can be fixed with self.start_retrying(retrier.clone()) after this line.

I feel this code block (else if let Some(t) = retrier.get_el....) fits in the Retrier::should_start() method.

watchtower-plugin/src/retrier.rs (outdated; resolved)
watchtower-plugin/src/retrier.rs (outdated; resolved)
Comment on lines 417 to 425
// Notice that setting the state as Unreachable, when needed, is delayed after the Retrier is evicted from the retriers map.
// Not doing so will allow the user to manually retry before the Retrier has been deleted, making that retry to instantly fail.
// Also, there is nothing to do here for Subscription errors.
@mariocynicys (Collaborator):

The issue was that we considered unreachable to be a complete failure and wouldn't try again (Failed: we insisted on removing the retrier even if we got new appointments for it, because we generally assume new appointments come from on_commitment_revocation and not from the user), but now unreachable goes to idle, which can be retried just fine (I think xD).

@@ -261,23 +449,25 @@ impl Retrier {
         .await
         .map_err(|e| {
             log::debug!("Cannot renew registration with tower. Error: {:?}", e);
-            Error::permanent("Cannot renew registration with tower")
+            Error::permanent(RetryError::Register(
@mariocynicys (Collaborator):

We can make this one Error::transient to help with network errors and the like.

watchtower-plugin/src/retrier.rs (outdated; resolved)
watchtower-plugin/src/retrier.rs (outdated; resolved)
watchtower-plugin/src/main.rs (resolved)
@mariocynicys self-requested a review January 5, 2023 14:07
@mariocynicys dismissed their stale review January 5, 2023 14:12

Clicked approve by mistake

Comment on lines 386 to 405
if e.is_permanent() {
    self.set_status(RetrierStatus::Failed);
} else {
    log::debug!("Starting to idle");
    self.set_status(RetrierStatus::Idle(Instant::now()));
    // Clear all pending appointments so they do not waste any memory while idling
    self.pending_appointments.lock().unwrap().clear();
    self.wt_client
        .lock()
        .unwrap()
        .set_tower_status(self.tower_id, crate::TowerStatus::Unreachable);
}

match e {
    RetryError::Register(_) => {
        log::info!("Setting {} status as subscription error", self.tower_id);
        self.wt_client
            .lock()
            .unwrap()
            .set_tower_status(self.tower_id, TowerStatus::SubscriptionError)
    }
@mariocynicys (Collaborator):

Okay so let's assume for a second that RetryError::Register is permanent and it occurred.
So we will do:

if e.is_permanent() {
	// This will remove the retrier from the active retriers map held by `WTClient`.
	// Now users will be granted manual retry based on `TowerStatus::is_retryable`.
	self.set_status(RetrierStatus::Failed);
}

Now in the match:

match e {
	RetryError::Register(_) => {
		log::info!("Setting {} status as subscription error", self.tower_id);
		self.wt_client
			.lock()
			.unwrap()
			// We set the tower status to subscription error which is `retryable`.
			.set_tower_status(self.tower_id, TowerStatus::SubscriptionError)
	}
	...

We effectively created the same manual retry race condition we had before.

The user will be allowed to retry (neither the retrier is in the map of active retriers held by WTClient, nor is its tower in a non-retryable state), but if a user actually issues a manual retry at this point (before the failed retrier is removed by the retry manager), the request will go through but the retry manager will ignore it anyway and remove the failed retrier in the next cycle.

@mariocynicys (Collaborator):

@sr-gi Does this make more sense?

This should be general over any permanent error and not just Register (if it even were to be permanent).

Generally:
permanent error -> failed retrier
failed retrier -> unsafe to set the tower status

@sr-gi (Member, Author):

Ok, my bad, I failed to see that Retrier::set_status was still removing the data from WTClient::retriers. I thought I had moved it to Retrier::remove_if_failed.

I think we can simply rename remove_if_failed to remove_if_done, and make it remove data from retriers either if the retrier has failed or if it has finished (status == Stopped and no pending appointments). Then simply take the retrier-removal part out of set_status.

Otherwise, we need to carry whether an error is permanent out of Retrier::start and have the RetryManager check it:

fn set_status(&self, status: RetrierStatus) {
    *self.status.lock().unwrap() = status.clone();

    // Add active retriers to the WTClient
    if self.is_running() || self.is_idle() {
        log::debug!("Adding {} to active retriers", self.tower_id);
        self.wt_client
            .lock()
            .unwrap()
            .retriers
            .insert(self.tower_id, status);
    }
}
/// Removes our retrier identifier from the WTClient if the retrier is done (either stopped with no appointments or failed)
pub fn remove_if_done(&self) {
    if self.failed() || !self.should_start() {
        log::debug!("Removing {} from active retriers", self.tower_id);
        self.wt_client
            .lock()
            .unwrap()
            .retriers
            .remove(&self.tower_id);
    }
}

@sr-gi (Member, Author) commented Jan 9, 2023

Notice that, if done like this, the retrier won't be removed from retriers if it's stopped and has pending data, but in this case the retrier will be re-started in the next iteration cycle, so we may be saving one operation against retriers:

self.stopped

  • with pending appointments -> remove, re-add in the next cycle
  • without pending appointments -> remove

self.should_start

  • true -> retrier stays in retriers as stopped and is updated in the next iteration (it's a hashmap)
  • false -> remove

@sr-gi (Member, Author):

Ok, no, nvm, this actually moves the race condition to another place in the codebase. The retrier should be removed from retriers when it is stopped; otherwise, the race condition will exist in the idle->stopped->running transition.

pending_appointments will be loaded when the idle->stopped transition happens, but this is not reflected in the retriers map, so send_to_retrier may miss some data. This should be better:

fn set_status(&self, status: RetrierStatus) {
    *self.status.lock().unwrap() = status.clone();

    // Add or remove retriers from WTClient based on the RetrierStatus
    if self.is_running() || self.is_idle() {
        log::debug!("Adding {} to active retriers", self.tower_id);
        self.wt_client
            .lock()
            .unwrap()
            .retriers
            .insert(self.tower_id, status);
    } else if self.is_stopped() {
        log::debug!("Removing retrier {} from active retriers", self.tower_id);
        self.wt_client
            .lock()
            .unwrap()
            .retriers
            .remove(&self.tower_id);
    }
}
/// Removes our retrier identifier from the WTClient if the retrier has failed
pub fn remove_if_failed(&self) {
    if self.failed() {
        log::debug!(
            "Removing failed retrier {} from active retriers",
            self.tower_id
        );
        self.wt_client
            .lock()
            .unwrap()
            .retriers
            .remove(&self.tower_id);
    }
}

@mariocynicys (Collaborator) commented Jan 9, 2023

@sr-gi That would do.

Another solution I can think of is identifying user-made retry requests and force-executing them (even with failed retriers). This is easy now since we have an enum for the revocation data.

Approaching it this way would relieve us from thinking about whether a race condition will happen in manual retry or not, and we can safely set the tower status from any retrier/thread other than the RetryManager, knowing that if a user executes a retry request on this very same tower, the RetryManager will force-execute their request.

Some (incomplete/pseudocode) example of what I mean:

loop {
    match self.unreachable_towers.try_recv() {
        Ok((tower_id, data)) => {
            match data {
                // None for user-made retry requests.
                RevocationData::None => {
                    if let Some(retrier) = self.retriers.get(&tower_id) {
                        // Indicates that the retry manager should force start that
                        // retrier in the next retry cycle.
                        retrier.force_start = true;
                        // Now load the pending appointments and extend the retrier with them.
                        retrier.pending_appointments.extend(
                            load_appointment_locators(
                                tower_id,
                                crate::AppointmentStatus::Pending,
                            ),
                        );
                    } else {
                        // Just add a new retrier
                        // ...
                    }
                }
                // Handle stale and fresh ...
                _ => unreachable!()
            }
        }
        Err(TryRecvError::Empty) => {
            // 1- Remove useless retriers (but keep force_start ones even if failed).
            // ...
            // 2- Start the shut down retriers one by one.
            for retrier in self.retriers.values() {
                if retrier.force_start {
                    retrier.set_status(RetrierStatus::Stopped);
                    retrier.force_start = false;
                }
                if retrier.should_start() {
                    self.start_retrying(retrier.clone())
                }
            }
        }
        _ => panic!(),
    }
}

@mariocynicys (Collaborator):

One more solution is to allow or disallow manual retries based on retriers and not on tower status.

Thus, if there is a retrier for tower X we don't let the user retry (a retrier in any status generally, other than idle), and if there is not, the user can send RevocationData::None peacefully.
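A minimal sketch of that gating, assuming stub types for WTClient and RetrierStatus (the helper function is hypothetical, not part of this PR):

use std::collections::HashMap;
use std::time::Instant;

type TowerId = [u8; 33];

enum RetrierStatus {
    Stopped,
    Running,
    Idle(Instant),
    Failed,
}

struct WTClient {
    retriers: HashMap<TowerId, RetrierStatus>,
}

/// Manual retries are allowed only when no non-idle retrier owns the tower.
fn can_manually_retry(wt_client: &WTClient, tower_id: TowerId) -> bool {
    match wt_client.retriers.get(&tower_id) {
        // An idle retrier can be kicked back into action by the user,
        // and no retrier at all means the user may retry freely.
        Some(RetrierStatus::Idle(_)) | None => true,
        // Any other status means a retrier already owns this tower.
        Some(_) => false,
    }
}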

@sr-gi (Member, Author):

I like the idea, but I think it requires a pretty big refactor overall, doesn't it?

I would feel more comfortable merging this incremental change and considering a followup for a complete refactor of this logic based on the RevocationData type. If that's something you'd like to work on I'd be more than happy to review it.

@mariocynicys (Collaborator):

I like the idea, but I think it requires a pretty big refactor overall, doesn't it?

I would feel more comfortable merging this incremental change and considering a followup for a complete refactor of this logic based on the RevocationData type.

+1

If that's something you'd like to work on I'd be more than happy to review it.

Yeah sure 😊

@sr-gi (Member, Author):

Leaving this as unresolved; we can open an issue to keep track of it.

@sr-gi (Member, Author) commented Jan 9, 2023

Pushed a fix for the race condition that was going on with the Retrier. @mariocynicys if you're happy with all the changes, I'll squash the fixing commits and give this a final look before merging.

@mariocynicys (Collaborator) commented Jan 9, 2023

Pushed a fix for the race condition that was going on with the Retrier. @mariocynicys if you're happy with all the changes, I'll squash the fixing commits and give this a final look before merging.

Cool, there is only one thing missing tho: should sub and register errors be permanent or tempo?

My take on this (note: I might be a bit biased toward the temporary/idle side xD) is that sub errors go tempo/idle and register errors can be either tempo or permanent:

  • tempo if we couldn't reach the tower, just like unreachable.
  • permanent if receipt.verify(&tower_id) fails.

If we go by this, we would need to introduce some sort of logic saying whether each individual variant in the RetryError enum is temporary or fatal. We can add this as a followup, but keep [sub => tempo & register => either permanent or tempo (no strong choice)] for now.

Edit: Sorry, I completely missed 3c2b292. It implements what I wrote in this comment already.
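For reference, a compact sketch of what that per-variant severity could look like, following the mapping settled on later in this thread (Subscription carries a "permanent" flag; payload types are trimmed/assumed, and Register is treated as transient here since the thread leaves its severity open):

enum RetryError {
    Unreachable,
    Register(String),
    Subscription(String, bool),
    Misbehaving(String),
    Abandoned,
}

impl RetryError {
    /// Permanent errors kill the retrier; transient ones send it to idle,
    /// from where it is auto-retried later.
    fn is_permanent(&self) -> bool {
        matches!(
            self,
            RetryError::Subscription(_, true)
                | RetryError::Misbehaving(_)
                | RetryError::Abandoned
        )
    }
}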

@mariocynicys (Collaborator) left a comment

LGTM

@sr-gi force-pushed the cln-plugin-auto-retry branch from 97aadbc to b86797b on January 9, 2023 15:23
@sr-gi force-pushed the cln-plugin-auto-retry branch from b86797b to 3447081 on January 9, 2023 15:25
@sr-gi (Member, Author) commented Jan 9, 2023

LGTM

Rebased. This should be good for a final diff and check.

Comment on lines +1264 to +1281
// FIXME: Here we should be able to check this, however, due to httpmock limitations, we cannot return a response based on the request.
// Therefore, both requests will be responded to with the same data. Given pending_appointments is a HashSet, we cannot even know which request
// will be sent first (sets are initialized with a random state, which decides the order of iteration).
// https://github.com/alexliesenfeld/httpmock/issues/49
// assert!(!wt_client.lock().unwrap().retriers.contains_key(&tower_id));
// assert!(wt_client
// .lock()
// .unwrap()
// .towers
// .get(&tower_id)
// .unwrap()
// .pending_appointments
// .is_empty());

// This is not much tbh, but it looks like it's the best we can do at the moment without experiencing random errors.
// Depending on what appointment is sent first the api will be hit either one or two times.
assert!(api_mock.hits() >= 1 && api_mock.hits() <= 2);
task.abort();
@mariocynicys (Collaborator):

Q: What retrier error happens when the receipt doesn't match the appointment? Is it a signature error?

@sr-gi (Member, Author):

AddAppointmentError::SignatureError

@mariocynicys (Collaborator) left a comment

Some nits, LGTM otherwise.

self.wt_client
    .lock()
    .unwrap()
    .set_tower_status(self.tower_id, crate::TowerStatus::Unreachable);
@mariocynicys (Collaborator):

nit: you can omit crate::.

    .unwrap()
    .retriers
    .remove(&self.tower_id);
}
@mariocynicys (Collaborator):

Can you put a comment here saying we shouldn't remove on the status being Failed, so as not to allow a user to manually retry just yet (until the retrier is removed)?

Comment on lines 423 to 426
// Notice that setting the state as Unreachable, when needed, is delayed after the Retrier is evicted from the retriers map.
// Not doing so will allow the user to manually retry before the Retrier has been deleted, making that retry to instantly fail.
// Also, there is nothing to do here for Subscription errors.
_ => (),
@mariocynicys (Collaborator):

Not doing so will allow the user to manually retry before the Retrier has been deleted

It's now based on the retrier being present in the retriers map and not solely on the tower status. Let's update this piece to:
"No need to set the tower status to unreachable because we already set it when idling above."
Or we could actually move the piece that sets the tower status to unreachable from the idle branch of the e.is_permanent() check to this place instead.

@sr-gi (Member, Author):

Umm, this made me realize that for RetryError::Subscription(_, false) we were setting the tower status to Unreachable and right after to SubscriptionError (given RetryError::Subscription(_, false) is not permanent).

So we'll need to change the match to RetryError::Subscription(_, false) and put the aforementioned code in the default branch:

log::warn!("Retry strategy gave up for {}. {}", self.tower_id, e);
if e.is_permanent() {
    self.set_status(RetrierStatus::Failed);
}

match e {
    RetryError::Subscription(_, false) => {
        log::info!("Setting {} status as subscription error", self.tower_id);
        self.wt_client
            .lock()
            .unwrap()
            .set_tower_status(self.tower_id, TowerStatus::SubscriptionError)
    }
    RetryError::Misbehaving(p) => {
        log::warn!("Cannot recover known tower_id from the appointment receipt. Flagging tower as misbehaving");
        self.wt_client
            .lock()
            .unwrap()
            .flag_misbehaving_tower(self.tower_id, p);
    }
    RetryError::Abandoned => {
        log::info!("Skipping retrying abandoned tower {}", self.tower_id)
    }
    // This covers `RetryError::Unreachable` and `RetryError::Subscription(_, false)`
    _ => {
        log::debug!("Starting to idle");
        self.set_status(RetrierStatus::Idle(Instant::now()));
        // Clear all pending appointments so they do not waste any memory while idling
        self.pending_appointments.lock().unwrap().clear();
        self.wt_client
            .lock()
            .unwrap()
            .set_tower_status(self.tower_id, TowerStatus::Unreachable);
    }
}

@mariocynicys (Collaborator):

Good point,

So we'll need to change the match to RetryError::Subscription(_, false) and put the

I guess you meant RetryError::Subscription(_, true) and in the code snippet as well.

@sr-gi (Member, Author):

indeed

@sr-gi force-pushed the cln-plugin-auto-retry branch from 3447081 to 575cc34 on January 10, 2023 12:13
sr-gi added 4 commits January 10, 2023 13:20
- Updates `Retrier::run` to return more meaningful errors. `Retrier::run` used to simply return a message; this revamps it to return RetryError variants so we can handle return cases better.

- Adds an additional state to `RetrierStatus`: Idle. Retriers that fail due to an accumulation of transient errors will be flagged as Idle instead of Failed and retried later on (based on `auto_retry_delay`). Notice that Retrier data is not kept in memory while a retrier is Idle. Instead, data is cleared and loaded again from the database when the `Retrier` is ready to run again.

- Revamps how revocation data is sent to the `RetryManager`: the RetryManager used to receive locators one by one via unreachable_towers. This is due to them being mainly fed by `on_commitment_revocation`, which generates them one by one. However, both when manually retrying and when bootstrapping from an already populated database, multiple appointments may be pending for the same tower, hence needing to call `unreachable_towers.send` multiple times for the same tower. This itself was not a big deal, given we didn't really need to differentiate between the cases. We do now, though.
In order to implement periodic retries while allowing manual retries we need to be able to signal the state transition to the
`Retrier` without providing any new data:

- If a Retrier is idle and we receive data through `on_commitment_revocation`, we need to append that data to the `Retrier`.
- If a Retrier is idle and we receive data through a manual retry, we need to change the state of the `Retrier` without
  adding any new data to it.

In order to implement this, we've added an additional map to `WTClient` that reports the state of the active retriers. Retriers are active only if they are running or idle.

- Also reworks `WTClient::set_tower_status` to update the status only if the new one does not match the old one. This is simply to reduce the boilerplate of having to perform this check in other parts of the plugin codebase (see the sketch after these commit notes).
The re-register logic had a bug where the `TowerSummary` kept in memory after re-registering was sweeping the references to both the pending and invalid appointments.

While this data was still in the database, re-registering may have made it look like the data was not there.
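The `set_tower_status` rework mentioned above boils down to an update-only-on-change guard. A minimal sketch with stub types (field names are assumed, not the plugin's actual ones):

use std::collections::HashMap;

type TowerId = [u8; 33];

#[derive(Clone, Copy, PartialEq)]
enum TowerStatus {
    Reachable,
    Unreachable,
    SubscriptionError,
}

struct TowerSummary {
    status: TowerStatus,
}

struct WTClient {
    towers: HashMap<TowerId, TowerSummary>,
}

impl WTClient {
    /// Updates the tower status only if the new one differs from the old one,
    /// saving callers from repeating the comparison everywhere.
    pub fn set_tower_status(&mut self, tower_id: TowerId, status: TowerStatus) {
        if let Some(tower) = self.towers.get_mut(&tower_id) {
            if tower.status != status {
                tower.status = status;
            }
        }
    }
}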
@sr-gi force-pushed the cln-plugin-auto-retry branch from 575cc34 to 39e817c on January 10, 2023 12:20
@sr-gi merged commit 8c08b5f into talaia-labs:master on Jan 10, 2023
@sr-gi removed the Seeking Code Review (review me pls) label on Jan 10, 2023
Labels: cln-plugin (Stuff related to watchtower-plugin), feature request (New feature or request)

Merging this pull request may close: Implement periodic retries

2 participants