Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Time out initial fetch and go to account-picker screen. #4754

Merged
merged 13 commits into from
Jul 16, 2021

Conversation

chrisbobbe
Copy link
Contributor

@chrisbobbe chrisbobbe commented May 20, 2021

This is meant to follow #4753 Done! I've split #4193 into two PRs, that one and this one, to make it easier to review.

This is the "minimal version" of a fix Greg describes in the issue; as he mentions there, it would be good to make follow-up improvements soon after merging this, or open a new issue for those.

Fixes: #4165

@chrisbobbe chrisbobbe requested review from gnprice and WesleyAC May 20, 2021 21:40
@chrisbobbe chrisbobbe added the blocked on other work To come back to after another related PR, or some other task. label May 20, 2021
@chrisbobbe
Copy link
Contributor Author

chrisbobbe commented May 20, 2021

(edit: this is done)


Also, note to self: I should make it so we do retry the initial fetch on 5xx's if we have stale server data and the UI is visible; otherwise, there's no point in retrying, if we're looking at the full-screen loading indicator where the user can't do anything useful.

@chrisbobbe chrisbobbe force-pushed the pr-initial-fetch-timeout-again branch from 95a4fc6 to 6c8b787 Compare May 21, 2021 16:55
@chrisbobbe
Copy link
Contributor Author

chrisbobbe commented May 21, 2021

Also, note to self: I should make it so we do retry the initial fetch on 5xx's if we have stale server data and the UI is visible; otherwise, there's no point in retrying, if we're looking at the full-screen loading indicator where the user can't do anything useful.

Done; revision pushed!

@chrisbobbe chrisbobbe removed the blocked on other work To come back to after another related PR, or some other task. label May 21, 2021
@chrisbobbe
Copy link
Contributor Author

chrisbobbe commented May 22, 2021

Note to self: Probably a good idea to alert the user (and tell Sentry) which bad outcome occurred: a 5xx error or a timeout.

Copy link
Member

@gnprice gnprice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @chrisbobbe ! Generally this looks good -- small comments below, and one more-significant question about the set of errors we're handling.

As of the end of the branch I believe we're handling 5xx, 4xx, and timeouts appropriately. But there could be other ways a network failure shows up: if the device is unable to make a TCP connection (perhaps because it's in airplane mode), or a TLS connection atop that (perhaps because there's something wrong operationally with the server), then we'll treat that as an unexpected error that represents a bug. Can we work out what those would look like here, and handle them appropriately so that there's no operational issue that could cause an uncaught exception?

Separately: there's a string of 8 commits at the end that convert over from Lolex to Jest's fake timers. Do those need to be at the end, or can they come earlier -- perhaps even as a separate PR?

Comment on lines 283 to 285
// TODO: This should be narrowed to `!isServerError(e)`; we
// should fail early if we encounter unrecognized / unexpected
// errors.
if (isClientError(e)) {
if (!isServerError(e)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. I know this is what the TODO comment said, but it doesn't feel quite right to me.

The issue is: what happens if there's a failure in the network connection? I think we basically want to treat those exactly the same as a 5xx error -- either one can be caused by operational issues, is usually transient, but may not be transient if the server is permanently offline.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this reasoning:

If we get some strange, totally unexpected error caused by a bug
somewhere, we shouldn't continue the retry loop. Not only would that
waste the user's time, but it means the error basically gets
swallowed, and it makes it much harder for us to detect it and debug
effectively. So, fail early on exceptions like that.

So I guess the point is that there's another category of specific, expected errors here, which is a network failure. I'm not sure exactly what those look like at this point in the code.

Comment on lines 268 to 269
* Makes a request that retries until success or a non-5xx error;
* times out after `config.requestLongTimeoutMs`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can shorten the last clause to ", with timeout", to help keep this summary line concise. The detail of what the timeout's value is is good information to put in the prose below.

Comment on lines 23 to 26
/**
* Default should be set with `timers: 'modern'` in Jest config.
*/
assertUsingModernFakeTimers();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this isn't a declaration of anything, so there's nothing the comment is describing the interface of -- so it should be a plain non-jsdoc implementation-comment

Comment on lines +32 to +38
{(() => {
if (this.props.error instanceof TimeoutError) {
return <Label style={styles.text} text="Request timed out." />;
} else {
return <Label style={styles.text} text="Oops! Something went wrong." />;
}
})()}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably simpler as a … ? … : … expression, no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably simpler as a … ? … : … expression, no?

Sure. I used if / else because I thought there might be other cases to consider in the future, and this form would make it easier to expand on.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Yeah, that seems like a fine reason.

@chrisbobbe chrisbobbe force-pushed the pr-initial-fetch-timeout-again branch from 6c8b787 to 118782f Compare June 2, 2021 20:42
@chrisbobbe
Copy link
Contributor Author

Thanks for the review! Revision pushed.

Two (possibly related?) things have occurred to me:

  • Is INITIAL_FETCH_ABORT quite the right name, or should it be more like INITIAL_FETCH_ERROR? (Compare MESSAGE_FETCH_ERROR.)
  • One place in api.registerForEvents where things could conceivably fail is in the step where we transform the data we receive from the server (with a function named transform). Do we need to change our error handling for that at all?

@chrisbobbe chrisbobbe force-pushed the pr-initial-fetch-timeout-again branch from 118782f to 1683743 Compare June 25, 2021 00:16
@chrisbobbe
Copy link
Contributor Author

(Just rebased.)

Copy link
Member

@gnprice gnprice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the revision! Sorry I've let this sit for a bit.

The app-level (i.e., outside src/api/) code looks good, with one comment below.

I think the apiErrors code doesn't quite work together properly with the existing code in that file. That existing code should change, I think, to handle this properly. I'll sketch out a version of how that could look, and either push some draft commits to this branch or send a PR.

Comment on lines 83 to 86
* https://fetch.spec.whatwg.org/#ref-for-concept-network-error%E2%91%A5%E2%93%AA
*/
export const isNetworkRequestFailedError = (e: Error): boolean =>
e instanceof TypeError && e.message === 'Network request failed';
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh, wacky that this is a TypeError.

I worry a bit about conditioning on the message. I don't see that text mentioned in the spec, so I wonder if it might vary between implementations -- or if it might sometimes be translated into the user's chosen language on the device.

I think that probably means:

  • Here inside src/api/, scoped tightly around the fetch, we should catch this error (as just instanceof TypeError, odd as that is), and throw a more specific error we make up.
  • Then app code can look for that.

Comment on lines 76 to 77
export const isServerError = (e: Error): boolean =>
e instanceof ApiError && e.httpStatus >= 500 && e.httpStatus <= 599;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. Does this actually happen in practice?

Looking at makeErrorFromApi above, which is where we construct ApiError objects, it looks like on a typical 5xx response we'll currently throw just an Error object, with some text in the message. (We'd throw an ApiError on a 5xx response if it comes with a body that's valid JSON and decodes to a nice Zulip API error object { result: 'error', msg: string, code?: string | void, ... }. But a 5xx is a server failure, and may not carry nice well-formed JSON like that.)

So I think we want the code in src/api/ to be throwing a more distinctive error in that case, and then to have isServerError detect that.

);

describe('if asked to retry', () => {
test('retries a call if there is a recoverable error', async () => {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fetchActions: Fail early on all non–known-retryable errors in `tryFetch`.

I appreciate your careful use of both an en-dash and a hyphen. Wouldn't want a reader to think we're talking about errors that are retryable in a non-known fashion :-P

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wish I still had access to the Chicago Manual of Style; it looks like this would be covered in Chapter 6: https://www.chicagomanualofstyle.org/16/ch06/ch06_toc.html 😛 (but I no longer have my college-sponsored subscription, alas).

Comment on lines 447 to 464
} else {
logging.warn(e, {
message: 'Unexpected error during initial fetch and serverSettings fetch.',
});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably want to be aborting in this case too, right?

Basically it's an internal error in the app. But the user is probably better off getting kicked back to the account picker than having the loading screen continue indefinitely -- it means at least they can try again and see if the issue recurs, or try another account and see if they can at least use that.

@gnprice
Copy link
Member

gnprice commented Jul 14, 2021

I think the apiErrors code doesn't quite work together properly with the existing code in that file. That existing code should change, I think, to handle this properly. I'll sketch out a version of how that could look, and either push some draft commits to this branch or send a PR.

OK, and sent #4896 for that.

chrisbobbe added a commit to chrisbobbe/zulip-mobile that referenced this pull request Jul 15, 2021
As Greg points out [1]:

> Basically it's an internal error in the app. But the user is
> probably better off getting kicked back to the account picker than
> having the loading screen continue indefinitely -- it means at
> least they can try again and see if the issue recurs, or try
> another account and see if they can at least use that.

[1] zulip#4754 (comment)
@chrisbobbe chrisbobbe force-pushed the pr-initial-fetch-timeout-again branch from 1683743 to b47f617 Compare July 15, 2021 19:29
@chrisbobbe
Copy link
Contributor Author

Thanks for the review, and for #4896! Revision pushed.

We'll use this in `tryFetch`, in an upcoming commit, so that we can
interrupt in-progress attempts to contact a server when they take
too long. See discussion [1].

[1]: https://chat.zulip.org/#narrow/stream/243-mobile-team/topic/Stuck.20on.20loading.20screen/near/907693
To be dispatched when we want to give up on the initial fetch.

It navigates to the 'accounts' screen so a user can try a different
account and server. Logging out wouldn't be good; the credentials
may be perfectly fine, and we'd like to keep them around to try
again later.

It sets `needsInitialFetch` to `false` [1], just like
`INITIAL_FETCH_COMPLETE`, while retaining a different meaning than
that action (i.e., that the fetch was aborted instead of completed).

Setting `needsInitialFetch` to false is necessary to ensure that a
subsequent initial fetch can be triggered when we want it to be. As
also noted in 7caa4d0, `needsInitialFetch` is "edge-triggered".
(That edge-triggering logic seems complex and fragile, and it would
be nice to fix that.)

See also discussion at
  https://chat.zulip.org/#narrow/stream/243-mobile-team/topic/Stuck.20on.20loading.20screen/near/907591.
…ch`.

The TODO doesn't specifically mention that we want to retry on
network-failure errors, but we do, so add that too.

If we get some strange, totally unexpected error caused by a bug
somewhere, we shouldn't continue the retry loop. Not only would that
waste the user's time, but it means the error basically gets
swallowed, and it makes it much harder for us to detect it and debug
effectively. So, fail early on exceptions like that.

See our doc on exception handling:
  https://github.com/zulip/zulip-mobile/blob/master/docs/style.md#catch-at-ui.
So far, `tryFetch`'s only callers are in the initial fetch; so, add
handling for the `TimeoutError` there.

The choice of value for `requestLongTimeoutMs` comes from a
suggestion in zulip#4165's description:

> As a minimal version, if the initial fetch takes a completely
> unreasonable amount of time (maybe one minute), we should give up
> and take you to the account-picker screen so you can try a
> different account and server.

Fixes: zulip#4165
As Greg points out [1]:

> Basically it's an internal error in the app. But the user is
> probably better off getting kicked back to the account picker than
> having the loading screen continue indefinitely -- it means at
> least they can try again and see if the issue recurs, or try
> another account and see if they can at least use that.

[1] zulip#4754 (comment)
As Greg points out [1], this makes the most sense conceptually; it's
happening at the bottom of the loop, just before a new iteration
starts. The `return` in the `try` block is enough to ensure that
this wait won't interfere with a successful fetch.

[1]: zulip#4166 (comment)
Greg points out that we don't always want to retry the initial fetch
on server errors [1]:

> Ah, I think in #M4165 the point is that if the server isn't
> responding, we want to give you the option to go choose some other
> account. The context there is that we're in the initial fetch, so
> showing the loading screen, and as long as we're doing that
> there's no other UI.

> So yeah, I think basically we don't want to do any retrying here.
> Instead we can kick you to the account-picker screen, with a toast
> or something to indicate an error, and then you might manually
> retry a time or two or you might bail and switch to some other
> account.

> And in particular if you didn't even want to be using that account
> anymore -- maybe you even know that it's a server which is
> permanently shut down, but it just happened to be the last one
> you'd been using in the app and so it's the one we tried loading
> data from on startup -- then you can go use whatever other account
> you were actually opening the app to use.

If we're in an initial fetch from an already active, logged-in
account that we have cached data for, though, we might as well take
the time to retry (still with the 60 second timeout) because the
user will be seeing a useful, interactive UI.

[1] https://chat.zulip.org/#narrow/stream/243-mobile-team/topic/Stuck.20on.20loading.20screen/near/1178689
A `TimeoutError` will be handled the same way other errors in
`fetchMessages` are handled; if it's a timeout in the fetch
`ChatScreen` does on mount, `ChatScreen` will show the `FetchError`
component we set up in zulip#4205.

There's also been a passing mention on CZO of doing a timeout like
this [1]:

> After a long time, probably like a minute, we'll want that [...]
> fetch to time out and fail in any case.

[1] https://chat.zulip.org/#narrow/stream/243-mobile-team/topic/.23M4156.20Message.20List.20placeholders/near/950853
And add that message and the existing message to messages_en.json;
looks like we forgot to add the existing one.
@gnprice
Copy link
Member

gnprice commented Jul 16, 2021

Looks good, thanks! Merging.

@gnprice gnprice force-pushed the pr-initial-fetch-timeout-again branch from b47f617 to ab62f73 Compare July 16, 2021 00:24
@gnprice gnprice merged commit ab62f73 into zulip:master Jul 16, 2021
@chrisbobbe
Copy link
Contributor Author

Thanks for the review! 🙂

chrisbobbe added a commit to chrisbobbe/zulip-mobile that referenced this pull request Jul 16, 2021
I got the following error when I ran `tools/tx-sync`:

```
Step 3: Upload strings to translate...
+ tx --quiet push -s
tx ERROR: Error received from server: Duplicate string key ('Oops! Something went wrong\.') in line 247
tx ERROR: Could not upload source file. You can use --skip to ignore this error and continue the execution.
tx ERROR: Error received from server: Duplicate string key ('Oops! Something went wrong\.') in line 247
```

Looks like maybe a rebase error when I was working on zulip#4754. The
string was already added, in zulip#4829.
gnprice pushed a commit to chrisbobbe/zulip-mobile that referenced this pull request Jul 17, 2021
I got the following error when I ran `tools/tx-sync`:

```
Step 3: Upload strings to translate...
+ tx --quiet push -s
tx ERROR: Error received from server: Duplicate string key ('Oops! Something went wrong\.') in line 247
tx ERROR: Could not upload source file. You can use --skip to ignore this error and continue the execution.
tx ERROR: Error received from server: Duplicate string key ('Oops! Something went wrong\.') in line 247
```

Looks like maybe a rebase error when I was working on zulip#4754. The
string was already added, in zulip#4829.
chrisbobbe added a commit to chrisbobbe/zulip-mobile that referenced this pull request Sep 9, 2021
This is our first use of `react-test-renderer`. It piggy-backs on
our incorporation of Jest's "modern" fake-timer implementation in
PRs zulip#4754 and zulip#4931. That was handy!

I haven't yet found any test cases that fail with our
implementation. (And I'd been hoping to, to debug an unexpected
error!)

But I did try pasting in an earlier iteration of the hook's
implementation, from zulip#4940, that Greg had found bugs in by reading
the code. Many of these tests failed on that buggy implementation,
which is a good sign.

Might as well keep these new tests, then, if they're not an
unreasonable maintenance burden.
chrisbobbe added a commit to chrisbobbe/zulip-mobile that referenced this pull request Sep 9, 2021
This is our first use of `react-test-renderer`. It piggy-backs on
our incorporation of Jest's "modern" fake-timer implementation in
PRs zulip#4754 and zulip#4931. That was handy!

I haven't yet found any test cases that fail with our
implementation. (And I'd been hoping to, to debug an unexpected
error!)

But I did try pasting in an earlier iteration of the hook's
implementation, from zulip#4940, that Greg had found bugs in by reading
the code. Many of these tests failed on that buggy implementation,
which is a good sign.

Might as well keep these new tests, then, if they're not an
unreasonable maintenance burden.
chrisbobbe added a commit to chrisbobbe/zulip-mobile that referenced this pull request Sep 10, 2021
This is our first use of `react-test-renderer`. It piggy-backs on
our incorporation of Jest's "modern" fake-timer implementation in
PRs zulip#4754 and zulip#4931. That was handy!

I haven't yet found any test cases that fail with our
implementation. (And I'd been hoping to, to debug an unexpected
error!)

But I did try pasting in an earlier iteration of the hook's
implementation, from zulip#4940, that Greg had found bugs in by reading
the code. Many of these tests failed on that buggy implementation,
which is a good sign.

Might as well keep these new tests, then, if they're not an
unreasonable maintenance burden.
chrisbobbe added a commit to chrisbobbe/zulip-mobile that referenced this pull request Sep 10, 2021
This is our first use of `react-test-renderer`. It piggy-backs on
our incorporation of Jest's "modern" fake-timer implementation in
PRs zulip#4754 and zulip#4931. That was handy!

I haven't yet found any test cases that fail with our
implementation. (And I'd been hoping to, to debug an unexpected
error!)

But I did try pasting in an earlier iteration of the hook's
implementation, from zulip#4940, that Greg had found bugs in by reading
the code. Many of these tests failed on that buggy implementation,
which is a good sign.

Might as well keep these new tests, then, if they're not an
unreasonable maintenance burden.
@chrisbobbe chrisbobbe deleted the pr-initial-fetch-timeout-again branch November 4, 2021 21:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Time out initial fetch, and go to account-picker screen
2 participants