Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Audit how fxa-client recovers from expired push subscriptions #3127

Closed
rfk opened this issue May 15, 2020 · 25 comments
Closed

Audit how fxa-client recovers from expired push subscriptions #3127

rfk opened this issue May 15, 2020 · 25 comments
Assignees

Comments

@rfk
Copy link
Contributor

rfk commented May 15, 2020

Thanks to the vagaries of The Cloud, it's possible for an FxA client device to register a push subscription endpoint with the FxA server, but for that push subscription endpoint to later become invalid. The FxA server detects this when it tries to send a push notification to that device and fails, and it reacts by setting a pushEndpointExpired flag on the device record.

What mechanisms do we have for detecting this on the client and updating the push subscription, and are they working as expected?

Analysis of push notification metrics for send-tab suggest that up to 25% 20% of attempts to send a tab, fail to send the corresponding push notification, and that it's mostly due to this "expired push endpoint" issue.


Summary of identified follow-up work:

┆Issue is synchronized with this Jira Task
┆Story Points: 8
┆Epic: Trusted Send Tab Telemetry
┆Sprint: SYNC - end 2020-06-05

@rfk
Copy link
Contributor Author

rfk commented May 15, 2020

Here's where desktop applies this logic, as part of refreshing its list of peer devices:

https://dxr.mozilla.org/mozilla-central/rev/80ba3f3cfaf9b31341bca28e2e6aea383cec8acc/services/fxaccounts/FxAccountsDevice.jsm#222

Notice how it just reaches on in to the push service, zaps the subscription and requests a new one. We can't do the same for the FxA rust component, because the push service lives outside of FxA and is managed by the app. So we probably need some events that trigger to cause to app to re-register the push subscription.

@rfk
Copy link
Contributor Author

rfk commented May 15, 2020

@mhammond suggested in slack, that we could also take a look at the device records on the server, look for ones that have pushEndpointExpired=true, and see if there are any patterns in the data there (e.g. if it's way more common on certain platforms).

@rfk
Copy link
Contributor Author

rfk commented May 21, 2020

@ckarlof described a situation recently where he hadn't used Fenix for several days, and when he tried to send a tab from desktop to Fenix it did not arrive. Launching the Fenix app and allowing it to sync seemed to fix send-tab without any other interaction, and subsequent sends from desktop to Fenix worked as expected.

What I suspect happened here, is that Fenix's push subscription became invalid while the app was not in use, meaning that it stopped being able to receive sent tabs in a timely manner. When the app was started and synced, it discovered the expired push endpoint, automatically repaired it, and was once again able to properly receive tabs.

Interestingly, Chris also reported that Fenec did not seem to have this problem, which makes me wonder if there's some background smarts in the Fennec app to keep the push subscriptions active while the app is not in use.

@ckarlof
Copy link

ckarlof commented May 21, 2020

I tested it again. Starting from Desktop, I sent a tab to Fennec and Fenix. I got no notification for either from my phone.

After opening Fennec, and triggering a manual sync, I got the tab I sent to Fennec. I did a manual sync on Fenix, and did not receive the tab.

Trying again on Fennec, I now get instant notifications for tabs sent to Fennec. Fenix is still broken in receiving tabs sent from Desktop.

@eoger
Copy link
Contributor

eoger commented May 21, 2020

@jonalmeida are there Android logs that @ckarlof can pull to figure out what's happening here?

I did a manual sync on Fenix, and did not receive the tab.

I don't think Fenix polls the send tab commands on manual sync. Maybe we should? On the other hand, if the user has to rely on Sync Now to get their tabs we've already failed.

Fenix is still broken in receiving tabs sent from Desktop.

On the Desktop side, in the browser toolbox await fxAccounts.device.refreshDeviceList({ignoreCached: true}); fxAccounts.device.recentDeviceList would print the devices list, including the pushEndpointExpired flag.

@jonalmeida
Copy link
Collaborator

@eoger Yes, you should see a warning log in there if that flag was propagated to us from FxaPushSupportFeature:

logger.warn("We have been notified that our push subscription has expired; renewing registration.")

I mentioned this in another ticket I think, but we try to avoid DDOS-ing FxA if we already went down this flow within the last 24 hours. So you might not see that log again until then.

@rfk
Copy link
Contributor Author

rfk commented May 21, 2020

I mentioned this in another ticket I think,

It was in a now-resolved comment in this google doc, for anyone going hunting for that context.

@jonalmeida
Copy link
Collaborator

I don't think Fenix polls the send tab commands on manual sync. Maybe we should? On the other hand, if the user has to rely on Sync Now to get their tabs we've already failed.

Sorry, I missed replying to this. Sync Now button should be polling for device commands. If that's not working, then something else there is in a bad state.

@rfk
Copy link
Contributor Author

rfk commented May 27, 2020

Sounds like there's more going on in @ckarlof's case than my initial theory, so let's spin that off into a standalone issue to ensure we don't lose track of the overall thread here: https://bugzilla.mozilla.org/show_bug.cgi?id=1641147

@rfk
Copy link
Contributor Author

rfk commented May 27, 2020

Turns out I had a Firefox Beta install that's in a similar boat to what I described above - connected to my account, but idle for a couple of weeks. Here's what I observed:

  • Sending tabs from Desktop to this device didn't seem to work. Logs on the desktop device said they were successfully sent, but no notification was received on the target device. Inspecting the raw state of my device list, I saw that this device had the pushendpointExpired: true flag in its device record, meaning that FxA would not even attempt to send it push notifications.
  • After launch the app on my phone, I went into Account Settings and observed that it said "synchronizing now". It stayed in this state for a long time, much longer than I think is reasonable.
    • logcat suggests that it spend 10 minutes apply 5000 history records, in many small transactions.
    • pleasantly, the browser seemed quite responsive during this process, including e.g. awesomebar suggestions (which is kind of the point of the small transactions)
  • After it finished syncing, I was able to "sync now". Doing so caused the previously-sent tab (and some others from earlier that I'd forgotten about!) to show up immediately.
  • Sending more tabs from Desktop to Fenix does not trigger a notification, but they do show up when doing "sync now".

So as far as I can tell, this Fenix instance is not successfully restoring its push endpoint on the FxA server. I'll dig into logcat and see what else I can find out.

@rfk
Copy link
Contributor Author

rfk commented May 27, 2020

logger.warn("We have been notified that our push subscription has expired; renewing registration.")

I found instances of this log line being emitted from my Fenix, which is good, but it doesn't seem to be successfully resolving the problem.

@rfk
Copy link
Contributor Author

rfk commented May 27, 2020

@jonalmeida so I can confirm from logs that my Fenix hit the code you linked here to discover that its push subscription is expired, and it appears to actually get a new subscription with the push service. Some log snippets:

$ grep -i push logcat_beta.txt | tail -n 20
05-27 18:32:44.496  5490  7062 D WifiPermissionsUtil: canAccessScanResults: pkgName = com.sec.spp.push, uid = 10053
05-27 18:32:44.502  6659  6659 E SPPClientService: [[PushClientService]] reqType : 0
05-27 18:32:44.515  6659  6659 E SPPClientService: [[PushClientService]] F:false, D:false, E:true, T:false, S:true, R:false
05-27 18:32:45.508  5490  7330 D SamsungAlarmManager: Cancel Alarm calling from uid:10053 pid :6659 / op:PendingIntent{2d6347a: PendingIntentRecord{55058a5 com.sec.spp.push broadcastIntent}}
05-27 18:32:45.571  5490  7330 D SamsungAlarmManager: Cancel Alarm calling from uid:10053 pid :6659 / op:PendingIntent{692da21: PendingIntentRecord{c717dc0 com.sec.spp.push broadcastIntent}}
05-27 18:32:57.198 27653 27704 I FxaDeviceConstellation: Current device needs push endpoint registration
05-27 18:32:57.200 27653 27653 W ConstellationObserver: We have been notified that our push subscription has expired; renewing registration.
05-27 18:32:57.200 27653 27653 W AutoPushFeature: Forcing registration renewal by deleting our (cached) token.
05-27 18:33:31.097 27653 27694 I AutoPushFeature: Received a new registration token from push service.
05-27 18:33:31.098 27653 27860 I [INFO][Leanplum]: [com.leanplum.LeanplumCloudMessagingProvider::onRegistrationIdReceived::3]: Device registered for push notifications with registration token, [SOME BASE64 TOKEN HERE]
05-27 18:33:31.098 27653 27694 D push_ffi: push_update
05-27 18:33:31.957 27653 27694 D push_ffi: push_get_subscription
05-27 18:33:49.489 27653 27688 I FxaDeviceConstellation: Current device needs push endpoint registration
05-27 18:33:49.489 27653 27653 W ConstellationObserver: We have been notified that our push subscription has expired; renewing registration.
05-27 18:33:49.489 27653 27653 W AutoPushFeature: Forcing registration renewal by deleting our (cached) token.
05-27 18:33:51.576 27653 27860 I [INFO][Leanplum]: [com.leanplum.LeanplumCloudMessagingProvider::onRegistrationIdReceived::3]: Device registered for push notifications with registration token, [A DIFFERENT BASE64 TOKEN HERE]
05-27 18:33:51.577 27653 27694 I AutoPushFeature: Received a new registration token from push service.
05-27 18:33:51.578 27653 27694 D push_ffi: push_update
05-27 18:33:51.801 27653 27694 D push_ffi: push_get_subscription
05-27 18:34:00.023  5490  5490 V AlarmManagerEXT: (AppSync) com.sec.spp.push : 900(900)

However, this seems to just repeat each time the device syncs.

Naively, I would expect the onSubscriptionChanged handler to be called somewhere in this process, in order to send the new push subscription endpoint to the FxA server and clear the pushEndpointExpired flag. But the logs suggest this isn't triggering, and I don't know enough about the chain of dependencies here to dig in further at this stage. Does that sound right to you?

@ckarlof
Copy link

ckarlof commented May 27, 2020

Turns out I had a Firefox Beta install that's in a similar boat to what I described above - connected to my account, but idle for a couple of weeks. Here's what I observed:

* Sending tabs from Desktop to this device didn't seem to work. Logs on the desktop device said they were successfully sent, but no notification was received on the target device. Inspecting the raw state of my device list, I saw that this device had the `pushendpointExpired: true` flag in its device record, meaning that FxA would not even _attempt_ to send it push notifications.

* After launch the app on my phone, I went into Account Settings and observed that it said "synchronizing now". It stayed in this state for a long time, much longer than I think is reasonable.
  
  * logcat suggests that it spend 10 minutes apply 5000 history records, in many small transactions.
  * pleasantly, the browser seemed quite responsive during this process, including e.g. awesomebar suggestions (which is kind of the point of the small transactions)

* After it finished syncing, I was able to "sync now". Doing so caused the previously-sent tab (and some others from earlier that I'd forgotten about!) to show up immediately.

* Sending more tabs from Desktop to Fenix does not trigger a notification, but they do show up when doing "sync now".

So as far as I can tell, this Fenix instance is not successfully restoring its push endpoint on the FxA server. I'll dig into logcat and see what else I can find out.

Ryan, this roughly captures what I experienced as well.

@rfk
Copy link
Contributor Author

rfk commented May 28, 2020

Naively, I would expect the onSubscriptionChanged handler to be called somewhere in this
process, in order to send the new push subscription endpoint to the FxA server and clear the
pushEndpointExpired flag. But the logs suggest this isn't triggering

I filed mozilla-mobile/android-components#7143 to follow up further on this specific thread of reasoning, since it's easy for individual items to get lost in this kind of general "audit" bug.

@jonalmeida
Copy link
Collaborator

jonalmeida commented May 28, 2020

Sorry about the slow response!

However, this seems to just repeat each time the device syncs.

Naively, I would expect the onSubscriptionChanged handler to be called somewhere in this process, in order to send the new push subscription endpoint to the FxA server and clear the pushEndpointExpired flag. But the logs suggest this isn't triggering, and I don't know enough about the chain of dependencies here to dig in further at this stage. Does that sound right to you?

This is interesting! It's good to also know our checks for the pushEndpointExpired work in a-c.

05-27 18:32:57.200 27653 27653 W AutoPushFeature: Forcing registration renewal by deleting our (cached) token.
05-27 18:33:31.097 27653 27694 I AutoPushFeature: Received a new registration token from push service.

This flag is a bit special, in that, we don't request a new subscription change immediately. We nuke the app's push token to get a fresh token from the push provider (Firebase) to trigger the re-subscription. That's what we see above.

05-27 18:33:31.098 27653 27860 I [INFO][Leanplum]: [com.leanplum.LeanplumCloudMessagingProvider::onRegistrationIdReceived::3]: Device registered for push notifications with registration token, [SOME BASE64 TOKEN HERE]

That's also why you can see the marketing SDK recover from this with the new push token that they send to their own push servers. The [SOME BASE64 TOKEN HERE] is token we receive as well that needs to go to our autopush servers too.

05-27 18:33:31.098 27653 27694 D push_ffi: push_update
05-27 18:33:31.957 27653 27694 D push_ffi: push_get_subscription

This is great! We received the token and provided it to native layer.

05-27 18:33:49.489 27653 27688 I FxaDeviceConstellation: Current device needs push endpoint registration

It seems here we're trying to perform some action on the account (send tab I assume) and we run into this flow again.

What's missing is the call to check if our subscriptions are still valid.

This happens when the feature is initialized which in turn happens on app startup in Fenix.

What should happen then, is during the next cold boot (swipe the app away before starting the app), we should trigger that mechanism. This will fire the onSubscriptionChanged when we get back the new subscriptions from the native layer as you rightfully mentioned above which should provide the new subscription for Sync.

@rfk did you device eventually get push working in a day? (You could try now since that's how long it took me to get back 😅 )

@jonalmeida
Copy link
Collaborator

If this isn't happening, then we would have to look into why verify_connection does not give us the new subscriptions after this takes place.

@rfk
Copy link
Contributor Author

rfk commented May 28, 2020

did you device eventually get push working in a day?

Nope, the FxA server still reports pushEndpointExpired: true, indicating that it hasn't received an updated subscription.

@jonalmeida
Copy link
Collaborator

Right, so when we provide a new push token, we're not getting an indication that the subscriptions need to be renewed to pass it on to the consumers.

Btw, thanks for looking into this @rfk !

@rfk
Copy link
Contributor Author

rfk commented May 29, 2020

Glad to be able to reproduce! Just re-upping, I think we should spin this into a dedicated bug (e.g. mozilla-mobile/android-components#7143) for followup since I expect that we may have several other threads of discussion under the "audit" banner of this issue.

@rfk
Copy link
Contributor Author

rfk commented Jun 4, 2020

I spent a bit of time digging into how the moving parts fit together here on iOS; /cc @garvankeeley for context. Some observations (with the caveat that I could easily be missing something):

  • The app calls the FxA setDevicePushSubscription method in exactly one place: here, in the "on successful registration" handler of its push machinery.
    • Thus, we assume that if the device ever needs to re-register with APNS again, the didRegisterForRemoteNotificationsWithDeviceToken will be invoked a second time. That seems reasonable to me, and I see a few places where the app detects a push-related error, clears its APNS-related state, and allows it to re-register itself.
    • We also assume that re-registering with APNS is the only reason why our FxA push subscription URL might change. It's not clear to me if that's true.
  • When the app refreshes its FxA device list, there is logic that checks whether its push subscription has been marked as expired here. However we don't seem to do anything with that information except for write a debug log message. It feels like we should be dispatching some sort of notification here to that the apps knows it needs repair (or alternately, we should have telemetry assuring us that this doesn't occur in practice).
  • I wasn't able to find any code for checking the connection between firefox-ios and the autopush bridge server. AFAICT the code assumes that once we've registered with the bridge server then it will keep our default subscription endpoint alive indefinitely.

I'm going to file a followup bug to get some telemetry around the subscriptionExpired logging, to help understand how much this may be happening in practice.

@garvankeeley
Copy link
Contributor

When this is called

from client code, I could add a check (client-side) to localDevice.subscriptionExpired and if false, log that to Sentry. Thoughts?

Or if we know we should be doing something at that point, we could log to Sentry and take necessary action to refresh the subscription.

@rfk
Copy link
Contributor Author

rfk commented Jun 4, 2020

I could add a check (client-side) to localDevice.subscriptionExpired and if false, log that to Sentry.
Thoughts?

I think it would be valuable. I believe @eoger is also investigating whether we can found out how many iOS devices might be in this state from server data, and it the answer is "lots" then maybe we don't need any logging here.

Or if we know we should be doing something at that point we could log to Sentry
and take necessary action to refresh the subscription.

I don't know the code well, but I think if we flag the push setup as needing recovery similar to what we do on decryption errors it should help.

@rfk
Copy link
Contributor Author

rfk commented Jun 4, 2020

See also #3199 for an appservices-side bug about doing something other than debug logging here.

@rfk
Copy link
Contributor Author

rfk commented Jun 4, 2020

By contrast, Fenix has this periodic "verify subscriptions" behavior as noted in e.g.
mozilla-mobile/android-components#7236. Do we need something similar for iOS?

From conversation in slack, yes, it would be a good idea to do this periodically on iOS as well. Followup bug incoming.

@rfk
Copy link
Contributor Author

rfk commented Jun 4, 2020

We discussed this bug in triage today, and it feels like we've done all the "auditing" that we can. Let's close this out in favour of the identified followup bugs, which I've edited into the issue description and will summarize here as well for completeness:

@rfk rfk closed this as completed Jun 4, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants