-
-
Notifications
You must be signed in to change notification settings - Fork 660
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Time out initial fetch and go to account-picker screen. #4754
Time out initial fetch and go to account-picker screen. #4754
Conversation
(edit: this is done) Also, note to self: I should make it so we do retry the initial fetch on 5xx's if we have stale server data and the UI is visible; otherwise, there's no point in retrying, if we're looking at the full-screen loading indicator where the user can't do anything useful. |
95a4fc6
to
6c8b787
Compare
Done; revision pushed! |
Note to self: Probably a good idea to alert the user (and tell Sentry) which bad outcome occurred: a 5xx error or a timeout. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @chrisbobbe ! Generally this looks good -- small comments below, and one more-significant question about the set of errors we're handling.
As of the end of the branch I believe we're handling 5xx, 4xx, and timeouts appropriately. But there could be other ways a network failure shows up: if the device is unable to make a TCP connection (perhaps because it's in airplane mode), or a TLS connection atop that (perhaps because there's something wrong operationally with the server), then we'll treat that as an unexpected error that represents a bug. Can we work out what those would look like here, and handle them appropriately so that there's no operational issue that could cause an uncaught exception?
Separately: there's a string of 8 commits at the end that convert over from Lolex to Jest's fake timers. Do those need to be at the end, or can they come earlier -- perhaps even as a separate PR?
src/message/fetchActions.js
Outdated
// TODO: This should be narrowed to `!isServerError(e)`; we | ||
// should fail early if we encounter unrecognized / unexpected | ||
// errors. | ||
if (isClientError(e)) { | ||
if (!isServerError(e)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. I know this is what the TODO comment said, but it doesn't feel quite right to me.
The issue is: what happens if there's a failure in the network connection? I think we basically want to treat those exactly the same as a 5xx error -- either one can be caused by operational issues, is usually transient, but may not be transient if the server is permanently offline.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this reasoning:
If we get some strange, totally unexpected error caused by a bug
somewhere, we shouldn't continue the retry loop. Not only would that
waste the user's time, but it means the error basically gets
swallowed, and it makes it much harder for us to detect it and debug
effectively. So, fail early on exceptions like that.
So I guess the point is that there's another category of specific, expected errors here, which is a network failure. I'm not sure exactly what those look like at this point in the code.
src/message/fetchActions.js
Outdated
* Makes a request that retries until success or a non-5xx error; | ||
* times out after `config.requestLongTimeoutMs`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can shorten the last clause to ", with timeout", to help keep this summary line concise. The detail of what the timeout's value is is good information to put in the prose below.
jest/jestSetup.js
Outdated
/** | ||
* Default should be set with `timers: 'modern'` in Jest config. | ||
*/ | ||
assertUsingModernFakeTimers(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: this isn't a declaration of anything, so there's nothing the comment is describing the interface of -- so it should be a plain non-jsdoc implementation-comment
{(() => { | ||
if (this.props.error instanceof TimeoutError) { | ||
return <Label style={styles.text} text="Request timed out." />; | ||
} else { | ||
return <Label style={styles.text} text="Oops! Something went wrong." />; | ||
} | ||
})()} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably simpler as a … ? … : …
expression, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably simpler as a
… ? … : …
expression, no?
Sure. I used if
/ else
because I thought there might be other cases to consider in the future, and this form would make it easier to expand on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. Yeah, that seems like a fine reason.
6c8b787
to
118782f
Compare
Thanks for the review! Revision pushed. Two (possibly related?) things have occurred to me:
|
118782f
to
1683743
Compare
(Just rebased.) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the revision! Sorry I've let this sit for a bit.
The app-level (i.e., outside src/api/
) code looks good, with one comment below.
I think the apiErrors
code doesn't quite work together properly with the existing code in that file. That existing code should change, I think, to handle this properly. I'll sketch out a version of how that could look, and either push some draft commits to this branch or send a PR.
src/api/apiErrors.js
Outdated
* https://fetch.spec.whatwg.org/#ref-for-concept-network-error%E2%91%A5%E2%93%AA | ||
*/ | ||
export const isNetworkRequestFailedError = (e: Error): boolean => | ||
e instanceof TypeError && e.message === 'Network request failed'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Huh, wacky that this is a TypeError
.
I worry a bit about conditioning on the message
. I don't see that text mentioned in the spec, so I wonder if it might vary between implementations -- or if it might sometimes be translated into the user's chosen language on the device.
I think that probably means:
- Here inside
src/api/
, scoped tightly around thefetch
, we should catch this error (as justinstanceof TypeError
, odd as that is), and throw a more specific error we make up. - Then app code can look for that.
src/api/apiErrors.js
Outdated
export const isServerError = (e: Error): boolean => | ||
e instanceof ApiError && e.httpStatus >= 500 && e.httpStatus <= 599; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. Does this actually happen in practice?
Looking at makeErrorFromApi
above, which is where we construct ApiError
objects, it looks like on a typical 5xx response we'll currently throw just an Error
object, with some text in the message. (We'd throw an ApiError
on a 5xx response if it comes with a body that's valid JSON and decodes to a nice Zulip API error object { result: 'error', msg: string, code?: string | void, ... }
. But a 5xx is a server failure, and may not carry nice well-formed JSON like that.)
So I think we want the code in src/api/
to be throwing a more distinctive error in that case, and then to have isServerError
detect that.
); | ||
|
||
describe('if asked to retry', () => { | ||
test('retries a call if there is a recoverable error', async () => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fetchActions: Fail early on all non–known-retryable errors in `tryFetch`.
I appreciate your careful use of both an en-dash and a hyphen. Wouldn't want a reader to think we're talking about errors that are retryable in a non-known fashion :-P
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wish I still had access to the Chicago Manual of Style; it looks like this would be covered in Chapter 6: https://www.chicagomanualofstyle.org/16/ch06/ch06_toc.html 😛 (but I no longer have my college-sponsored subscription, alas).
src/message/fetchActions.js
Outdated
} else { | ||
logging.warn(e, { | ||
message: 'Unexpected error during initial fetch and serverSettings fetch.', | ||
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably want to be aborting in this case too, right?
Basically it's an internal error in the app. But the user is probably better off getting kicked back to the account picker than having the loading screen continue indefinitely -- it means at least they can try again and see if the issue recurs, or try another account and see if they can at least use that.
OK, and sent #4896 for that. |
As Greg points out [1]: > Basically it's an internal error in the app. But the user is > probably better off getting kicked back to the account picker than > having the loading screen continue indefinitely -- it means at > least they can try again and see if the issue recurs, or try > another account and see if they can at least use that. [1] zulip#4754 (comment)
1683743
to
b47f617
Compare
Thanks for the review, and for #4896! Revision pushed. |
We'll use this in `tryFetch`, in an upcoming commit, so that we can interrupt in-progress attempts to contact a server when they take too long. See discussion [1]. [1]: https://chat.zulip.org/#narrow/stream/243-mobile-team/topic/Stuck.20on.20loading.20screen/near/907693
To be dispatched when we want to give up on the initial fetch. It navigates to the 'accounts' screen so a user can try a different account and server. Logging out wouldn't be good; the credentials may be perfectly fine, and we'd like to keep them around to try again later. It sets `needsInitialFetch` to `false` [1], just like `INITIAL_FETCH_COMPLETE`, while retaining a different meaning than that action (i.e., that the fetch was aborted instead of completed). Setting `needsInitialFetch` to false is necessary to ensure that a subsequent initial fetch can be triggered when we want it to be. As also noted in 7caa4d0, `needsInitialFetch` is "edge-triggered". (That edge-triggering logic seems complex and fragile, and it would be nice to fix that.) See also discussion at https://chat.zulip.org/#narrow/stream/243-mobile-team/topic/Stuck.20on.20loading.20screen/near/907591.
…ch`. The TODO doesn't specifically mention that we want to retry on network-failure errors, but we do, so add that too. If we get some strange, totally unexpected error caused by a bug somewhere, we shouldn't continue the retry loop. Not only would that waste the user's time, but it means the error basically gets swallowed, and it makes it much harder for us to detect it and debug effectively. So, fail early on exceptions like that. See our doc on exception handling: https://github.com/zulip/zulip-mobile/blob/master/docs/style.md#catch-at-ui.
So far, `tryFetch`'s only callers are in the initial fetch; so, add handling for the `TimeoutError` there. The choice of value for `requestLongTimeoutMs` comes from a suggestion in zulip#4165's description: > As a minimal version, if the initial fetch takes a completely > unreasonable amount of time (maybe one minute), we should give up > and take you to the account-picker screen so you can try a > different account and server. Fixes: zulip#4165
As Greg points out [1]: > Basically it's an internal error in the app. But the user is > probably better off getting kicked back to the account picker than > having the loading screen continue indefinitely -- it means at > least they can try again and see if the issue recurs, or try > another account and see if they can at least use that. [1] zulip#4754 (comment)
As Greg points out [1], this makes the most sense conceptually; it's happening at the bottom of the loop, just before a new iteration starts. The `return` in the `try` block is enough to ensure that this wait won't interfere with a successful fetch. [1]: zulip#4166 (comment)
Greg points out that we don't always want to retry the initial fetch on server errors [1]: > Ah, I think in #M4165 the point is that if the server isn't > responding, we want to give you the option to go choose some other > account. The context there is that we're in the initial fetch, so > showing the loading screen, and as long as we're doing that > there's no other UI. > So yeah, I think basically we don't want to do any retrying here. > Instead we can kick you to the account-picker screen, with a toast > or something to indicate an error, and then you might manually > retry a time or two or you might bail and switch to some other > account. > And in particular if you didn't even want to be using that account > anymore -- maybe you even know that it's a server which is > permanently shut down, but it just happened to be the last one > you'd been using in the app and so it's the one we tried loading > data from on startup -- then you can go use whatever other account > you were actually opening the app to use. If we're in an initial fetch from an already active, logged-in account that we have cached data for, though, we might as well take the time to retry (still with the 60 second timeout) because the user will be seeing a useful, interactive UI. [1] https://chat.zulip.org/#narrow/stream/243-mobile-team/topic/Stuck.20on.20loading.20screen/near/1178689
A `TimeoutError` will be handled the same way other errors in `fetchMessages` are handled; if it's a timeout in the fetch `ChatScreen` does on mount, `ChatScreen` will show the `FetchError` component we set up in zulip#4205. There's also been a passing mention on CZO of doing a timeout like this [1]: > After a long time, probably like a minute, we'll want that [...] > fetch to time out and fail in any case. [1] https://chat.zulip.org/#narrow/stream/243-mobile-team/topic/.23M4156.20Message.20List.20placeholders/near/950853
And add that message and the existing message to messages_en.json; looks like we forgot to add the existing one.
Looks good, thanks! Merging. |
b47f617
to
ab62f73
Compare
Thanks for the review! 🙂 |
I got the following error when I ran `tools/tx-sync`: ``` Step 3: Upload strings to translate... + tx --quiet push -s tx ERROR: Error received from server: Duplicate string key ('Oops! Something went wrong\.') in line 247 tx ERROR: Could not upload source file. You can use --skip to ignore this error and continue the execution. tx ERROR: Error received from server: Duplicate string key ('Oops! Something went wrong\.') in line 247 ``` Looks like maybe a rebase error when I was working on zulip#4754. The string was already added, in zulip#4829.
I got the following error when I ran `tools/tx-sync`: ``` Step 3: Upload strings to translate... + tx --quiet push -s tx ERROR: Error received from server: Duplicate string key ('Oops! Something went wrong\.') in line 247 tx ERROR: Could not upload source file. You can use --skip to ignore this error and continue the execution. tx ERROR: Error received from server: Duplicate string key ('Oops! Something went wrong\.') in line 247 ``` Looks like maybe a rebase error when I was working on zulip#4754. The string was already added, in zulip#4829.
This is our first use of `react-test-renderer`. It piggy-backs on our incorporation of Jest's "modern" fake-timer implementation in PRs zulip#4754 and zulip#4931. That was handy! I haven't yet found any test cases that fail with our implementation. (And I'd been hoping to, to debug an unexpected error!) But I did try pasting in an earlier iteration of the hook's implementation, from zulip#4940, that Greg had found bugs in by reading the code. Many of these tests failed on that buggy implementation, which is a good sign. Might as well keep these new tests, then, if they're not an unreasonable maintenance burden.
This is our first use of `react-test-renderer`. It piggy-backs on our incorporation of Jest's "modern" fake-timer implementation in PRs zulip#4754 and zulip#4931. That was handy! I haven't yet found any test cases that fail with our implementation. (And I'd been hoping to, to debug an unexpected error!) But I did try pasting in an earlier iteration of the hook's implementation, from zulip#4940, that Greg had found bugs in by reading the code. Many of these tests failed on that buggy implementation, which is a good sign. Might as well keep these new tests, then, if they're not an unreasonable maintenance burden.
This is our first use of `react-test-renderer`. It piggy-backs on our incorporation of Jest's "modern" fake-timer implementation in PRs zulip#4754 and zulip#4931. That was handy! I haven't yet found any test cases that fail with our implementation. (And I'd been hoping to, to debug an unexpected error!) But I did try pasting in an earlier iteration of the hook's implementation, from zulip#4940, that Greg had found bugs in by reading the code. Many of these tests failed on that buggy implementation, which is a good sign. Might as well keep these new tests, then, if they're not an unreasonable maintenance burden.
This is our first use of `react-test-renderer`. It piggy-backs on our incorporation of Jest's "modern" fake-timer implementation in PRs zulip#4754 and zulip#4931. That was handy! I haven't yet found any test cases that fail with our implementation. (And I'd been hoping to, to debug an unexpected error!) But I did try pasting in an earlier iteration of the hook's implementation, from zulip#4940, that Greg had found bugs in by reading the code. Many of these tests failed on that buggy implementation, which is a good sign. Might as well keep these new tests, then, if they're not an unreasonable maintenance burden.
This is meant to follow #4753Done! I've split #4193 into two PRs, that one and this one, to make it easier to review.This is the "minimal version" of a fix Greg describes in the issue; as he mentions there, it would be good to make follow-up improvements soon after merging this, or open a new issue for those.
Fixes: #4165