
Adds auto-retry logic to watchtower-plugin #168

Merged

sr-gi merged 4 commits into talaia-labs:master on Jan 10, 2023

Conversation

@sr-gi (Member) commented Dec 22, 2022

Current State

The current version of the watchtower-plugin for CLN will try to re-send data to towers that are unreachable (for various reasons) using an exponential backoff strategy. However, once the strategy gives up, the tower is flagged as unreachable and not retried any further (unless the user manually triggers it through the retrytower RPC).
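For reference, exponential backoff simply doubles the wait between attempts. A minimal, generic sketch of the idea (the function and parameter names are illustrative, not the plugin's actual implementation):

use std::time::Duration;

/// Yields exponentially growing delays: base, 2*base, 4*base, ...
fn backoff_delays(base: Duration, max_attempts: u32) -> impl Iterator<Item = Duration> {
    (0..max_attempts).map(move |i| base * 2u32.pow(i))
}

fn main() {
    // E.g. with a 1s base and 5 attempts: 1s, 2s, 4s, 8s, 16s.
    for delay in backoff_delays(Duration::from_secs(1), 5) {
        println!("next retry in {:?}", delay);
    }
}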

Improvements

This PR adds auto-retry logic: once a specific unreachable tower is given up on, the retrier is not removed but remains idle, and it is launched again after a specific delay (controlled by watchtower-auto-retry-delay). From this point on, retriers are only removed from the RetryManager if they fail due to a permanent error, not due to a transient one (in which case they are auto-retried).

In order to achieve this, the retriers now return more meaningful errors (instead of strings), which are used to distinguish why a retrier had to give up retrying, if it did.
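As a rough illustration of what such a structured error could look like (the variant names follow the ones quoted later in this thread: Register, Subscription, Misbehaving, Unreachable, Abandoned; the payload types here are assumptions):

/// Stand-in for the real misbehavior proof type.
struct MisbehaviorProof;

/// Sketch of a structured retry error replacing the old string messages.
enum RetryError {
    /// The tower could not be reached.
    Unreachable,
    /// Registering (or re-registering) with the tower failed.
    Register(String),
    /// Subscription issue; the flag marks whether the failure is final.
    Subscription(String, bool),
    /// The tower misbehaved; carries a proof of it.
    Misbehaving(MisbehaviorProof),
    /// The tower was abandoned by the user while retrying.
    Abandoned,
}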

The way data is shared between the WTClient and the RetryManager has also been updated to achieve this. Previously, data was sent one by one in (tower_id, locator) tuples; now it is sent in (tower_id, RevocationData) tuples, in order to distinguish retries/bootstraps from data sent because of a channel update.
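A sketch of what such a payload type could look like (the Fresh/Stale variant names are assumptions based on the "handle stale and fresh" wording later in this thread; RevocationData::None is quoted verbatim below):

use std::collections::HashSet;

/// Stand-in for the 16-byte appointment locator used by the plugin.
type Locator = [u8; 16];

/// Distinguishes why data is being sent to a retrier.
enum RevocationData {
    /// A single locator produced by a channel update (on_commitment_revocation).
    Fresh(Locator),
    /// A batch loaded from the database on bootstrap or periodic retry.
    Stale(HashSet<Locator>),
    /// No data at all: a user-initiated retry, purely a state-transition signal.
    None,
}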

Bug fixes

This PR also fixes a bug where data regarding pending and invalid appointments was being swept from memory after re-registering.

@sr-gi force-pushed the cln-plugin-auto-retry branch from 74cc35d to 558c236 on December 22, 2022 17:36
@sr-gi requested a review from mariocynicys on December 22, 2022 17:50
@sr-gi added the feature request (New feature or request), Seeking Code Review (review me pls), and cln-plugin (Stuff related to watchtower-plugin) labels on Dec 22, 2022
@sr-gi linked an issue on Dec 22, 2022 that may be closed by this pull request
@mariocynicys (Collaborator) left a comment

Initial review; will need to give retrier.rs another pass.

watchtower-plugin/src/main.rs (outdated; resolved)
watchtower-plugin/src/retrier.rs (outdated; resolved)
watchtower-plugin/src/retrier.rs (outdated; resolved)
watchtower-plugin/src/retrier.rs (outdated; resolved)
watchtower-plugin/src/retrier.rs (outdated; resolved)
@sr-gi force-pushed the cln-plugin-auto-retry branch from 558c236 to 1b11635 on December 27, 2022 09:49
@sr-gi (Member, Author) commented Dec 27, 2022

Fixed the typos, will work on the fundamental issues next

@sr-gi force-pushed the cln-plugin-auto-retry branch 3 times, most recently from 57c680c to 2aae5d8 on December 29, 2022 12:35
@sr-gi (Member, Author) commented Dec 29, 2022

Added the changes requested on Discord so that the state of the retriers can be accessed by the WTClient.

I've slightly adapted the code so it also takes advantage of this to, for instance, send the retry notification to the RetryManager without the need to include any associated data with it.

This should be ready modulo discussing the register errors' severity (and any issues or bugs found, ofc).

@sr-gi force-pushed the cln-plugin-auto-retry branch 6 times, most recently from 94f087e to 9f39e02 on December 30, 2022 16:11
@sr-gi force-pushed the cln-plugin-auto-retry branch from f3e4d71 to 3d193aa on January 2, 2023 10:27
mariocynicys previously approved these changes Jan 5, 2023

@mariocynicys (Collaborator) left a comment

We can now make a distinction between user-initiated retry requests and retry requests coming from on_commitment_revocation.

When we knew a retrier had failed, we would ignore on_commitment_revocation's retry requests: we already knew what the outcome would be (another failure) and we didn't want to stay in this retry failure loop forever, so we ignored the request to cut it off.

The cutting off had an edge-case race condition where a user-made retry request would also be ignored (bad UX). We introduced the additional complexity of not allowing retriers to set their tower statuses if they failed, and instead let the manager set the tower statuses for them in the manager's logic loop.

I think that since we now know who initiated the requests from the newly introduced data type (RevocationData), we can make it so that we never ignore a user's request and avoid the bad UX issue, and also not defer setting the tower status for failed retriers but do it immediately.

Edit: I don't know why I press buttons with green ticks unconsciously 🤦.

watchtower-plugin/tests/test.py (outdated; resolved)
        crate::AppointmentStatus::Pending,
    ),
);
retrier.set_status(RetrierStatus::Stopped);
@mariocynicys (Collaborator):

Flagging the retrier as stopped here will cause the retrier not to start in this cycle but in the next one, after 1s.
This can be fixed with self.start_retrying(retrier.clone()) after this line.

I feel this code block (else if let Some(t) = retrier.get_el....) fits in the Retrier::should_start() method.

watchtower-plugin/src/retrier.rs (outdated; resolved)
watchtower-plugin/src/retrier.rs (outdated; resolved)
Comment on lines 417 to 425
// Notice that setting the state as Unreachable, when needed, is delayed after the Retrier is evicted from the retriers map.
// Not doing so will allow the user to manually retry before the Retrier has been deleted, making that retry to instantly fail.
// Also, there is nothing to do here for Subscription errors.
@mariocynicys (Collaborator):

The issue was that we considered unreachable to be a complete failure and wouldn't try again (Failed: we insisted on removing the retrier even if we got new appointments for it, because we generally assume new appointments come from on_commitment_revocation and not from the user), but now unreachable goes to idle, which can be retried just fine (I think xD).

@@ -261,23 +449,25 @@ impl Retrier {
         .await
         .map_err(|e| {
             log::debug!("Cannot renew registration with tower. Error: {:?}", e);
-            Error::permanent("Cannot renew registration with tower")
+            Error::permanent(RetryError::Register(
@mariocynicys (Collaborator):

We can make this one Error::transient to help with network errors and the like.

watchtower-plugin/src/retrier.rs (outdated; resolved)
watchtower-plugin/src/retrier.rs (outdated; resolved)
watchtower-plugin/src/main.rs (resolved)
@mariocynicys self-requested a review January 5, 2023 14:07
@mariocynicys dismissed their stale review January 5, 2023 14:12

Clicked approve by mistake

Comment on lines 386 to 405
if e.is_permanent() {
    self.set_status(RetrierStatus::Failed);
} else {
    log::debug!("Starting to idle");
    self.set_status(RetrierStatus::Idle(Instant::now()));
    // Clear all pending appointments so they do not waste any memory while idling
    self.pending_appointments.lock().unwrap().clear();
    self.wt_client
        .lock()
        .unwrap()
        .set_tower_status(self.tower_id, crate::TowerStatus::Unreachable);
}

match e {
    RetryError::Register(_) => {
        log::info!("Setting {} status as subscription error", self.tower_id);
        self.wt_client
            .lock()
            .unwrap()
            .set_tower_status(self.tower_id, TowerStatus::SubscriptionError)
    }
@mariocynicys (Collaborator):

Okay so let's assume for a second that RetryError::Register is permanent and it occurred.
So we will do:

if e.is_permanent() {
	// This will remove the retrier from the active retriers map held by `WTClient`.
	// Now users will be granted manual retry based on `TowerStatus::is_retryable`.
	self.set_status(RetrierStatus::Failed);
}

Now in the match:

match e {
	RetryError::Register(_) => {
		log::info!("Setting {} status as subscription error", self.tower_id);
		self.wt_client
			.lock()
			.unwrap()
			// We set the tower status to subscription error which is `retryable`.
			.set_tower_status(self.tower_id, TowerStatus::SubscriptionError)
	}
	...

We effectively created the same manual retry race condition we had before.

The user will be allowed to retry (neither the retrier is in the map of active retriers held by WTClient, nor is its tower in a non-retryable state), but if a user actually issues a manual retry at this point (before the failed retrier is removed by the retry manager), the request will go through but the retry manager will ignore it anyway and remove the failed retrier in the next cycle.

@mariocynicys (Collaborator):

@sr-gi Does this make more sense?

This should be general over any permanent error and not just Register (if it even were to be permanent).

Generally:
permanent error -> failed retrier
failed retrier -> unsafe to set the tower status

@sr-gi (Member, Author):

Ok, my bad, I failed to see that Retrier::set_status was still removing the data from WTClient::retriers. I thought I had moved it to Retrier::remove_if_failed.

I think we can simply rename remove_if_failed to remove_if_done, and make it remove data from retriers either if the retrier has failed or if it has finished (status == Stopped and no pending appointments). Then simply take the retrier-removal part out of set_status.

Otherwise, we need to carry whether an error is permanent out of Retrier::start and have the RetryManager check it:

fn set_status(&self, status: RetrierStatus) {
    *self.status.lock().unwrap() = status.clone();

    // Add active retriers to the WTClient
    if self.is_running() || self.is_idle() {
        log::debug!("Adding {} to active retriers", self.tower_id);
        self.wt_client
            .lock()
            .unwrap()
            .retriers
            .insert(self.tower_id, status);
    }
}
/// Removes our retrier identifier from the WTClient if the retrier is done (either stopped with no appointments or failed)
pub fn remove_if_done(&self) {
    if self.failed() || !self.should_start() {
        log::debug!("Removing {} from active retriers", self.tower_id);
        self.wt_client
            .lock()
            .unwrap()
            .retriers
            .remove(&self.tower_id);
    }
}

@sr-gi (Member, Author) commented Jan 9, 2023

Notice that, if done like this, the retrier won't be removed from retriers if it's stopped and has pending data, but in this case the retrier will be re-started in the next iteration cycle, so we may be saving one operation against retriers:

self.stopped

  • with pending appointments -> remove, re-add in the next cycle
  • without pending appointments -> remove

self.should_start

  • true -> retrier stays in retriers as stopped and is updated in the next iteration (it's a hashmap)
  • false -> remove

@sr-gi (Member, Author):

Ok, no, nvm, this actually moves the race condition to another place in the codebase. The retrier should be removed from retriers when it is stopped; otherwise, the race condition will exist in the idle->stopped->running transition.

pending_appointments will be loaded when the idle->stopped transition happens, but this is not reflected in the retriers map, so send_to_retrier may miss some data. This should be better:

fn set_status(&self, status: RetrierStatus) {
    *self.status.lock().unwrap() = status.clone();

    // Add or remove retriers from WTClient based on the RetrierStatus
    if self.is_running() || self.is_idle() {
        log::debug!("Adding {} to active retriers", self.tower_id);
        self.wt_client
            .lock()
            .unwrap()
            .retriers
            .insert(self.tower_id, status);
    } else if self.is_stopped() {
        log::debug!("Removing retrier {} from active retriers", self.tower_id);
        self.wt_client
            .lock()
            .unwrap()
            .retriers
            .remove(&self.tower_id);
    }
}
/// Removes our retrier identifier from the WTClient if the retrier has failed
pub fn remove_if_failed(&self) {
    if self.failed() {
        log::debug!(
            "Removing failed retrier {} from active retriers",
            self.tower_id
        );
        self.wt_client
            .lock()
            .unwrap()
            .retriers
            .remove(&self.tower_id);
    }
}

@mariocynicys (Collaborator) commented Jan 9, 2023

@sr-gi That would do.

Another solution I can think of is identifying user-made retry requests and force-executing them (even with failed retriers). This is easy now since we have an enum for the revocation data.

Approaching it this way would relieve us from thinking about whether a race condition will happen in manual retry or not, and we can safely set the tower status from any retrier/thread other than the RetryManager, knowing that if a user executes a retry request on this very same tower, the RetryManager will force-execute their request.

Some (incomplete/pseudocode) example of what I mean:

loop {
    match self.unreachable_towers.try_recv() {
        Ok((tower_id, data)) => {
            match data {
                // None for user-made retry requests.
                RevocationData::None => {
                    if let Some(retrier) = self.retriers.get(&tower_id) {
                        // Indicates that the retry manager should force start that
                        // retrier in the next retry cycle.
                        retrier.force_start = true;
                        // Now load the pending appointments and extend the retrier with them.
                        retrier.pending_appointments.extend(
                            load_appointment_locators(
                                tower_id,
                                crate::AppointmentStatus::Pending,
                            ),
                        );
                    } else {
                        // Just add a new retrier
                        // ...
                    }
                }
                // Handle stale and fresh ...
                _ => unreachable!()
            }
        }
        Err(TryRecvError::Empty) => {
            // 1- Remove useless retriers (but keep force_start ones even if failed).
            // ...
            // 2- Start the shut down retriers one by one.
            for retrier in self.retriers.values() {
                if retrier.force_start {
                    retrier.set_status(RetrierStatus::Stopped);
                    retrier.force_start = false;
                }
                if retrier.should_start() {
                    self.start_retrying(retrier.clone())
                }
            }
        }
        _ => panic!(),
    }
}

@mariocynicys (Collaborator):

One more solution is to allow or disallow manual retries based on retriers and not on tower status.

Thus, if there is a retrier for tower X we don't let the user retry (a retrier in any status generally, other than idle), and if there is not, the user can send RevocationData::None peacefully.
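A minimal sketch of that gating, assuming stub types for WTClient and RetrierStatus (the helper function is hypothetical, not part of this PR):

use std::collections::HashMap;
use std::time::Instant;

type TowerId = [u8; 33];

enum RetrierStatus {
    Stopped,
    Running,
    Idle(Instant),
    Failed,
}

struct WTClient {
    retriers: HashMap<TowerId, RetrierStatus>,
}

/// Manual retries are allowed only when no non-idle retrier owns the tower.
fn can_manually_retry(wt_client: &WTClient, tower_id: TowerId) -> bool {
    match wt_client.retriers.get(&tower_id) {
        // An idle retrier can be kicked back into action by the user,
        // and no retrier at all means the user may retry freely.
        Some(RetrierStatus::Idle(_)) | None => true,
        // Any other status means a retrier already owns this tower.
        Some(_) => false,
    }
}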

@sr-gi (Member, Author):

I like the idea, but I think it requires a pretty big refactor overall, doesn't it?

I would feel more comfortable merging this incremental change and considering a followup for a complete refactor of this logic based on the RevocationData type. If that's something you'd like to work on I'd be more than happy to review it.

@mariocynicys (Collaborator):

I like the idea, but I think it requires a pretty big refactor overall, doesn't it?

I would feel more comfortable merging this incremental change and considering a followup for a complete refactor of this logic based on the RevocationData type.

+1

If that's something you'd like to work on I'd be more than happy to review it.

Yeah sure 😊

@sr-gi (Member, Author):

Leaving this as unresolved; we can open an issue to keep track of it.

@sr-gi (Member, Author) commented Jan 9, 2023

Pushed a fix for the race condition that was going on with the Retrier. @mariocynicys if you're happy with all the changes, I'll squash the fixing commits and give this a final look before merging.

@mariocynicys (Collaborator) commented Jan 9, 2023

Pushed a fix for the race condition that was going on with the Retrier. @mariocynicys if you're happy with all the changes, I'll squash the fixing commits and give this a final look before merging.

Cool, there is only one thing missing tho: should sub and register errors be permanent or tempo?

My take on this (note: I might be a bit biased toward the temporary/idle side xD) is that sub errors go tempo/idle and register errors can be either tempo or permanent:

  • tempo if we couldn't reach the tower, just like unreachable.
  • permanent if receipt.verify(&tower_id) fails.

If we go by this, we would need to introduce some sort of logic saying whether each individual variant in the RetryError enum is temporary or fatal. We can add this as a followup, but keep [sub => tempo & register => either permanent or tempo (no strong choice)] for now.

Edit: Sorry, I completely missed 3c2b292. It implements what I wrote in this comment already.
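For reference, a compact sketch of what that per-variant severity could look like, following the mapping settled on later in this thread (Subscription carries a "permanent" flag; payload types are trimmed/assumed, and Register is treated as transient here since the thread leaves its severity open):

enum RetryError {
    Unreachable,
    Register(String),
    Subscription(String, bool),
    Misbehaving(String),
    Abandoned,
}

impl RetryError {
    /// Permanent errors kill the retrier; transient ones send it to idle,
    /// from where it is auto-retried later.
    fn is_permanent(&self) -> bool {
        matches!(
            self,
            RetryError::Subscription(_, true)
                | RetryError::Misbehaving(_)
                | RetryError::Abandoned
        )
    }
}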

@mariocynicys (Collaborator) left a comment

LGTM

@sr-gi force-pushed the cln-plugin-auto-retry branch from 97aadbc to b86797b on January 9, 2023 15:23
@sr-gi force-pushed the cln-plugin-auto-retry branch from b86797b to 3447081 on January 9, 2023 15:25
@sr-gi (Member, Author) commented Jan 9, 2023

LGTM

Rebased. This should be good for a final diff and check.

Comment on lines +1264 to +1281
// FIXME: Here we should be able to check this, however, due to httpmock limitations, we cannot return a response based on the request.
// Therefore, both requests will be responded to with the same data. Given pending_appointments is a HashSet, we cannot even know which request
// will be sent first (sets are initialized with a random state, which decides the order of iteration).
// https://github.com/alexliesenfeld/httpmock/issues/49
// assert!(!wt_client.lock().unwrap().retriers.contains_key(&tower_id));
// assert!(wt_client
// .lock()
// .unwrap()
// .towers
// .get(&tower_id)
// .unwrap()
// .pending_appointments
// .is_empty());

// This is not much tbh, but it looks like it's the best we can do at the moment without experiencing random errors.
// Depending on what appointment is sent first the api will be hit either one or two times.
assert!(api_mock.hits() >= 1 && api_mock.hits() <= 2);
task.abort();
@mariocynicys (Collaborator):

Q: What retrier error happens when the receipt doesn't match the appointment? Is it a signature error?

@sr-gi (Member, Author):

AddAppointmentError::SignatureError

@mariocynicys (Collaborator) left a comment

Some nits, LGTM otherwise.

self.wt_client
    .lock()
    .unwrap()
    .set_tower_status(self.tower_id, crate::TowerStatus::Unreachable);
@mariocynicys (Collaborator):

nit: you can omit crate::.

    .unwrap()
    .retriers
    .remove(&self.tower_id);
}
@mariocynicys (Collaborator):

Can you put a comment here saying we shouldn't remove on the status being Failed, so as not to allow a user to manually retry just yet (until the retrier is removed)?

Comment on lines 423 to 426
// Notice that setting the state as Unreachable, when needed, is delayed after the Retrier is evicted from the retriers map.
// Not doing so will allow the user to manually retry before the Retrier has been deleted, making that retry to instantly fail.
// Also, there is nothing to do here for Subscription errors.
_ => (),
@mariocynicys (Collaborator):

Not doing so will allow the user to manually retry before the Retrier has been deleted

It's now based on the retrier being present in the retriers map and not solely on the tower status. Let's update this piece to:
"No need to set the tower status to unreachable because we already set it when idling above."
Or we could actually move the piece that sets the tower status to unreachable from the idle branch of the e.is_permanent() check to this place instead.

@sr-gi (Member, Author):

Umm, this made me realize that for RetryError::Subscription(_, false) we were setting the tower status to Unreachable and right after to SubscriptionError (given RetryError::Subscription(_, false) is not permanent).

So we'll need to change the match to RetryError::Subscription(_, false) and put the aforementioned code in the default branch:

log::warn!("Retry strategy gave up for {}. {}", self.tower_id, e);
if e.is_permanent() {
    self.set_status(RetrierStatus::Failed);
}

match e {
    RetryError::Subscription(_, false) => {
        log::info!("Setting {} status as subscription error", self.tower_id);
        self.wt_client
            .lock()
            .unwrap()
            .set_tower_status(self.tower_id, TowerStatus::SubscriptionError)
    }
    RetryError::Misbehaving(p) => {
        log::warn!("Cannot recover known tower_id from the appointment receipt. Flagging tower as misbehaving");
        self.wt_client
            .lock()
            .unwrap()
            .flag_misbehaving_tower(self.tower_id, p);
    }
    RetryError::Abandoned => {
        log::info!("Skipping retrying abandoned tower {}", self.tower_id)
    }
    // This covers `RetryError::Unreachable` and `RetryError::Subscription(_, false)`
    _ => {
        log::debug!("Starting to idle");
        self.set_status(RetrierStatus::Idle(Instant::now()));
        // Clear all pending appointments so they do not waste any memory while idling
        self.pending_appointments.lock().unwrap().clear();
        self.wt_client
            .lock()
            .unwrap()
            .set_tower_status(self.tower_id, TowerStatus::Unreachable);
    }
}

@mariocynicys (Collaborator):

Good point,

So we'll need to change the match to RetryError::Subscription(_, false) and put the

I guess you meant RetryError::Subscription(_, true) and in the code snippet as well.

@sr-gi (Member, Author):

indeed

@sr-gi force-pushed the cln-plugin-auto-retry branch from 3447081 to 575cc34 on January 10, 2023 12:13
sr-gi added 4 commits January 10, 2023 13:20
- Updates `Retrier::run` to return more meaningful errors. `Retrier::run` used to simply return a message; this revamps it to return RetryError variants so we can handle return cases better.

- Adds an additional state to `RetrierStatus`: Idle. Retriers that fail due to an accumulation of transient errors will be flagged as Idle instead of Failed and retried later on (based on `auto_retry_delay`). Notice that Retrier data is not kept in memory while a retrier is Idle. Instead, data is cleared and loaded again from the database when the `Retrier` is ready to run again.

- Revamps how revocation data is sent to the `RetryManager`: the RetryManager used to receive locators one by one via unreachable_towers. This is due to them being mainly fed by `on_commitment_revocation`, which generates them one by one. However, both when manually retrying and when bootstrapping from an already populated database, multiple appointments may be pending for the same tower, hence needing to call `unreachable_towers.send` multiple times for the same tower. This itself was not a big deal, given we didn't really need to differentiate between the cases. We do now, though.
In order to implement periodic retries while allowing manual retries we need to be able to signal the state transition to the
`Retrier` without providing any new data:

- If a Retrier is idle and we receive data through `on_commitment_revocation`, we need to append that data to the `Retrier`.
- If a Retrier is idle and we receive data through a manual retry, we need to change the state of the `Retrier` without
  adding any new data to it.

In order to implement this, we've added an additional map to `WTClient` that reports the state of the active retriers. Retriers are active only if they are running or idle.

- Also reworks `WTClient::set_tower_status` to update the status only if the new one does not match the old one. This is simply to reduce the boilerplate of having to perform this check in other parts of the plugin codebase (see the sketch after these commit notes).
The re-register logic had a bug where the `TowerSummary` kept in memory after re-registering was sweeping the references to both the pending and invalid appointments.

While this data was still in the database, re-registering may have made it look like the data was not there.
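The `set_tower_status` rework mentioned above boils down to an update-only-on-change guard. A minimal sketch with stub types (field names are assumed, not the plugin's actual ones):

use std::collections::HashMap;

type TowerId = [u8; 33];

#[derive(Clone, Copy, PartialEq)]
enum TowerStatus {
    Reachable,
    Unreachable,
    SubscriptionError,
}

struct TowerSummary {
    status: TowerStatus,
}

struct WTClient {
    towers: HashMap<TowerId, TowerSummary>,
}

impl WTClient {
    /// Updates the tower status only if the new one differs from the old one,
    /// saving callers from repeating the comparison everywhere.
    pub fn set_tower_status(&mut self, tower_id: TowerId, status: TowerStatus) {
        if let Some(tower) = self.towers.get_mut(&tower_id) {
            if tower.status != status {
                tower.status = status;
            }
        }
    }
}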
@sr-gi force-pushed the cln-plugin-auto-retry branch from 575cc34 to 39e817c on January 10, 2023 12:20
@sr-gi merged commit 8c08b5f into talaia-labs:master on Jan 10, 2023
@sr-gi removed the Seeking Code Review (review me pls) label on Jan 10, 2023
Labels: cln-plugin (Stuff related to watchtower-plugin), feature request (New feature or request)

Merging this pull request may close: Implement periodic retries

2 participants