-
Notifications
You must be signed in to change notification settings - Fork 226
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Investigate verify_connection incorrectly returning empty list #3314
Comments
@jrconlin I've spent some time this morning auditing the set of Rust functions involved in the |
@eoger The server returns the list of known subscriptions for a given user. These are subscriptions where the UA has not performed an unsubscribe action. Subscriptions do not have an expiration date and should continue to be valid until either the UA sends an unsubscribe or the bridge specifies that the associated bridge token is no longer valid. In terms of the server code, the verify function fetches the channel list from the server, which reduces to calling the The all_channels test is here. |
I wonder if the problem here is that the FxA server comes to believe a subscription has expired, while the push server itself does not. The logic on the FxA server is:
Are there circumstances under which the push server might return a |
Alternately, re-reading mozilla-services/autopush#1397, I wonder if the client is registering a new FCM token and causing its existing subscriptions to "recover" and be useable again, rather than generating a new subscription that it then needs to re-register with the FxA server. I feel like my mental model of the system here is not quite right, but I'm not sure how to debug it. @jonalmeida could you please help us re-invigorate this bug by summarizing the sequence of events you're expecting to see and steps you're expecting to perform in the client code along this recovery path? |
Sure! Our expected flow is below:
We have verified that at point 4, we are getting a new push token after telling FCM to delete it.
I think you're on to something here! In #2490, we noticed that the native push layer does not communicate with the AutoPush servers in In the Android Component, we create a fake subscription on a fresh install to ensure we received the UAID to avoid that particular bug, but maybe we need to always do this hack when we receive a new token to notify the push servers that our token has now changed, and therefore allowing for the condition in point 6 to be true. I'm going to investigate if that works today! |
Unfortunately, this did not work. In my test, I waited for receiving a new push token (similar to point 4 from my comment above), created a fake subscription right after, and then performed a |
Thanks. I feel like there is definitely a server side bug in this too. I've filed mozilla-services/autopush#1442 to track it. |
Thank you! Let me see if I can break some of your steps down into some substeps involving the server and the appservices client lib:
@jrconlin does that sound right to you from the server's perspective? I'm guessing a little bit at the intent and effects of mozilla-services/autopush#1397. (If this is accurate, I have some thoughts on how to untangle it, but I don't want to proceed further based on any misunderstandings) (Edited to clarify that the appservices push component is the thing that list the local channels and provides the list of which have changed) |
➤ Ryan Kelly commented: I'm taking this issue and pulling it into the next sprint because I think we have an reinvigorated interest and leads on this. |
So, some additional musings from pair debugging with @jonalmeida yesterday: There's a potential for a race condition that I'm not quite sure how to handle. Although it's supposed to be very rare, FCM can change the FCM token at any time. It is up to the client to report this to the server, and that's what the When that happens, it's not known if values for the old token are accepted. It could be that there is a race condition between when that new token is provisioned, when the server has been updated to use that token, and when outbound messages are being sent out. If a message is sent out to FCM with an older token, and it's returned with a 404/410, the server presumes that token to no longer be active and all information for that user is dropped (since that user is no longer reachable). Holding messages for accounts that are being actively rejected in hopes that they might be come back soon would add a fair bit of complexity and some cost to the server. It may be worthwhile for the client to presume that when a new FCM token has been provisioned, and that this is not a new install, it might be worth it for the client to check if there are any pending sync actions, and just presume that there are push messages that have been dropped. I'm not saying that this is the exact reason for the bug we're seeing, but it is a possible cause. |
Yes mozilla-services/autopush#1397 was a test to confirm that a token update does not change the value of any existing channels, endpoints or other data that may already be registered. The only value that's changed is the FCM_token passed to FCM and used to identify the recipient process by FCM. This means that all existing endpoints should continue to work. The one caution would be the (hopefully, very low chance) race condition I outlined above, where the server attempts to send a message to the client in the period between the client getting a new FCM token and the server registering it, which would lead to the user data being dropped as invalid. Perhaps a "pause/resume" function might be useful in that case, but it would be a significant change for the push server. |
Since there doesn't appear to be any direct work required by the Autopush service at this time, going to de-prioritize the associated bugs for now. |
@jrconlin could you please link me to the code in the server that handles this case? I had a poke around and I think I was able to find the places in Are there any circumstances under which the autopush server might report a 404/410 when sending a notification, but not purge user data from the db?
@jonalmeida, a similar question to you for the client-side piece of this - could you please link me to the code that receives this notification and tunnels it through to I really want to be able to get my head around all the moving parts here end-to-end. |
Going one step further, we could have the client assume that it needs to update all of its push subscriptions after getting a new token, rather than the current thing where it tries to cleverly update only those that have changed. |
To clarify why I'm asking this: the "autopush server reports a 404/410 when sending a notification" part is definitely happening and happening with some regularity, because this is what causes the FxA server to set the So I want to confirm our shared understanding that the "all information for that user is dropped" part is also happening with similar regularity. |
Aha, I think I found it here: |
@rfk the In that method we save the token to the component's disk cache and also to the native component here. You will notice there is a bit of logic in there for two things:
You can see the use case for the first point in the AutoPushFeature, where we initialize the native layer with cached token on class initialization. |
(Sorry for the stream of consciousness here but it really helps me think to do it out loud). Thinking through what's supposed to happen when there's some problem with the FCM token, IIUC it's something like this:
Does this match your expectations @jrconlin @jonalmeida? Note that step (8) resurrects the uaid that was deleted in step (4), and I'm wondering if this resurrection behavior may have unintended consequences. For example, IIUC the list of active channels for a uaid is stored in the "message" table. Might resurrecting the uaid in the "router" table risk resurrecting its previously-active channels and invalidating the assumption at step (11)? |
At 12:
Probably a bit pedantic, but we do not re-create the subscription in the push component layer, instead this is done in the application layer - for Android Fxa, this is the The push layer only notifies the channel subscribers in order to follow the same web content expectation as in the docs. The Everything else looks correct to me. 🙂 |
Yeah, there are a lot of moving parts at play here. (Honestly, if I had to do it again, I'd absolutely only use the native push system as a "wake up" and then ask the client to create a websocket connection back to our servers whenever possible. There would still be a few issues around how iOS would have to handle things, but we could treat that as the one off rather than the rule. |
While testing my changes for mozilla-mobile/android-components#8846 (comment) I encountered a strange result that may be related to this issue as well. Below is my expected flow:
What I see instead is:
By step 5, the push subscription on the Autopush server should have invalidated the previous endpoint so the second send tab would have failed to send, thus flipping the This seems to be another side-effect from As a result, I'm not sure now how to reproduce a realistic broken state between Autopush and FxA. |
@jonalmeida how does one do this in practice? I'll try to spend some time today setting myself up to reproduce this issue based on the links you shared with me previously. |
My understanding from @jrconlin's comments (e.g. #3314 (comment)) is that |
This is done in AutoPushFeature: override fun renewRegistration() {
logger.warn("Forcing registration renewal by deleting our (cached) token.")
// Remove the cached token we have.
deleteToken(context)
// Tell the service to delete the token as well, which will trigger a new token to be
// retrieved the next time it hits the server.
service.deleteToken()
// Starts the service if needed to trigger a new registration.
service.start(context)
}
Deleting the FCM token instantly triggers the SDK to retrieve a new token from the FCM servers and trigger the
Ah my mistake, in that case ignore my last comment! Although, that makes it harder to verify if we get a new subscription endpoint from Autopush when FxA sets the |
I spent a few quality hours with this bug today, but I think I'm now even more confused. My sequence of events from #3314 (comment) can't play out quite as I imagined, specifically these steps:
When the appservices push client tries to update the token for a user that has been dropped, its request will fail with an auth error because the push server has thrown away the record for the uaid. So step (9) doesn't happen...but then, I don't really know how the client is supposed to recover from this state at all. I don't see any codepaths by which the appservices push client can change its uaid if that record has been dropped by the server in the manner suggested by #3314 (comment). |
Thanks for the long and detailed discussion in this bug folks! However, I think it has served its purpose and is growing beyond the point where we can keep all the threads straight. I've filed several follow-up bugs based on the discussion here:
The bug as titled here has been completed; |
@jrconlin we saw that when we made the autopush server tell us that a subscription was dropped, verify connection still came up with an empty list, right? It was a few weeks ago and my memory doesn't hold. I realized now that we didn't keep a log of that test. |
Oh, interesting, I did not pick up on this in the thread. How did you make it do this? I can try to repro locally. |
Starting from #3127 (comment), we were trying to find the various causes of push messages not being delivered.
In the investigation ticket in android components, we found some cases (that are now fixed) where we were not calling
PushManager.verify_connection
at the expected interval. However, we still were seeing an empty list of subscriptions returned back to us when we should have been seeing a list of the expired subscriptions.With mozilla-services/autopush#1397 resolved, we are now left to investigate the code path in the native Rust push manager.
┆Issue is synchronized with this Jira Task
┆Story Points: 5
┆Epic: FxA Ecosystem (backlog)
┆Sprint: SYNC - end 2020-11-06
The text was updated successfully, but these errors were encountered: