
health checker/ auto-retry bg sync + adjusted timeouts #612

Merged: 8 commits merged into master from 610-miti on Dec 9, 2019

Conversation

@sssoleileraaa (Contributor) commented Nov 8, 2019

Description

Resolves #610
Resolves #181

  • I have tested these changes in the appropriate Qubes environment

@eloquence (Member)

We discussed this at sprint planning today; this change does not mitigate the observed problem, so Allie may close the PR, or amend it if increasing the timeout is worth doing regardless.

@sssoleileraaa (Contributor, Author)

"The SecureDrop server cannot be reached" with the Retry link continues to appear intermittently even with this extended timeout. This tells us that the issue is not a RequestTimeoutError, which means it has to be an ApiInaccessibleError (see the exception handling below):

    except (RequestTimeoutError, ApiInaccessibleError) as e:
        logger.debug('Job {} raised an exception: {}: {}'.format(self, type(e).__name__, e))
        self.add_job(PauseQueueJob())
        self.re_add_job(job)

Also, note that if the API were None then we wouldn't see that Retry link, so that is not the issue: https://github.com/freedomofpress/securedrop-client/blob/master/securedrop_client/logic.py#L275

We already automatically retry 5 times in a row to resend a request from the queue when we see an ApiInaccessibleError, but those retries happen back-to-back very quickly. One mitigation might be to space out the retries, or to wait X seconds and auto-retry once more before pausing the queue and asking for user intervention to fix the network issue. This is a bit like a health checker (see #491), but less advanced. We could start off by auto-retrying once after, say, 30 seconds, specifically for ApiInaccessibleErrors, since we're seeing this frequently.
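
A rough sketch of the spaced-out retry described above; the helper names, constants, and the blocking sleep are illustrative assumptions, not the client's actual queue code:

    import time

    class ApiInaccessibleError(Exception):
        """Stand-in for the client's real ApiInaccessibleError."""

    MAX_QUICK_RETRIES = 5     # back-to-back retries already performed today
    RETRY_DELAY_SECONDS = 30  # proposed wait before one more automatic retry

    def run_with_delayed_retry(run_job, pause_queue):
        """Run a queued job, retrying once after a delay before pausing the queue."""
        for _ in range(MAX_QUICK_RETRIES):
            try:
                return run_job()
            except ApiInaccessibleError:
                continue  # immediate retries, roughly what the queue does now

        time.sleep(RETRY_DELAY_SECONDS)  # proposed: give the network a chance to recover
        try:
            return run_job()
        except ApiInaccessibleError:
            pause_queue()  # only now fall back to the manual Retry link

In the real client this would presumably be driven by a timer rather than a blocking sleep, so the GUI thread is never held up.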

@redshiftzero (Contributor)

are you sure it's an ApiInaccessibleError? can you confirm via the debug logging?

@sssoleileraaa changed the title from "mitigation for frequent metadata sync timeouts" to "mitigation for frequent sync errors" on Dec 2, 2019
@sssoleileraaa (Contributor, Author)

are you sure it's an ApiInaccessibleError? can you confirm via the debug logging?

Ah, actually, I just realized that since MetadataSyncJob isn't implemented yet, the errors would have to be coming from MessageDownloadJob or ReplyDownloadJob, which are triggered during a sync. Either of those jobs could be hitting timeout errors.

If we increased the timeouts for those jobs, that would work as a quick mitigation if it is a timeout issue. If we're instead seeing frequent ApiInaccessibleErrors, then I think waiting and auto-retrying makes sense.

I can verify what the issue is once I get to my Qubes station.

@sssoleileraaa (Contributor, Author) commented Dec 2, 2019

It does seem like increasing the default_request_timeout here (which should be applied to both MessageDownloadJob and ReplyDownloadJob) to a very large value doesn't help this issue, which is why I'm thinking the server is just temporarily inaccessible. I will confirm shortly.
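
Purely to illustrate the experiment described above, a hypothetical sketch of bumping the request timeout a download job uses; the attribute location and job structure here are assumptions, not the actual client code:

    VERY_LARGE_TIMEOUT = 120  # seconds, deliberately oversized for the test

    class MessageDownloadJob:  # the same idea would apply to ReplyDownloadJob
        def call_api(self, api_client, session):
            # Raise the per-request timeout before talking to the server.
            api_client.default_request_timeout = VERY_LARGE_TIMEOUT
            ...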

@sssoleileraaa (Contributor, Author)

Resolves #181

@sssoleileraaa (Contributor, Author)

I updated the PR description to reflect recent changes and added tests after rebasing to include the metadata sync job changes, so this is ready for review!

@redshiftzero (Contributor) left a comment

some questions about this approach inline

Inline review threads on securedrop_client/queue.py and securedrop_client/logic.py (resolved).
@sssoleileraaa (Contributor, Author)

Resolves #491

@sssoleileraaa (Contributor, Author)

I just want to mention, when you're ready to re-review, that the default API timeouts are:

  • reply_source: 5 seconds (unchanged)
  • remove_star: 5 seconds (unchanged)
  • add_star: 5 seconds (unchanged)
  • download jobs (except for MetadataSyncJob): adjusted realistic timeout based on download size (unchanged)
  • MetadataSyncJob get_sources: changed from 5 seconds to 20 seconds
  • MetadataSyncJob get_submissions: changed from 5 seconds to 20 seconds
  • MetadataSyncJob get_all_replies: changed from 5 seconds to 20 seconds

Changing the MetadataSyncJob API request timeouts from 5 to 20 seconds means I basically never see a network error. I tested this by running the client for 3 hours. At one point I saw the network error message, did nothing, and later saw that it had resolved itself. This demonstrates that if a user walks away, or is busy reading or editing a file in a different VM, and there are network errors, those errors can potentially resolve themselves. So when the user comes back to the client they don't have to see a red error message that requires a manual retry.

Changing the MetadataSyncJob API request timeouts from 5 to 20 seconds does mean that a refresh will take up to 1 minute (three sequential 20-second requests) before telling the user that the server could not be reached. Other operations will tell the user that the server could not be reached within 25 seconds. That is the tradeoff, and I opened a follow-up issue to adjust all the defaults in a smarter way, similar to what we did for download jobs: #648
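
For reference, an illustrative sketch of the timeout budget described above; the dictionary and names are hypothetical, only the numbers come from the list:

    # Per-request API timeouts after this PR, in seconds.
    API_TIMEOUTS = {
        "reply_source": 5,
        "remove_star": 5,
        "add_star": 5,
        "metadata_sync_get_sources": 20,
        "metadata_sync_get_submissions": 20,
        "metadata_sync_get_all_replies": 20,
    }

    # A full metadata refresh issues the three sync requests sequentially,
    # so its worst case is 3 * 20 = 60 seconds before the error is shown.
    WORST_CASE_REFRESH_SECONDS = 3 * API_TIMEOUTS["metadata_sync_get_sources"]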

@sssoleileraaa (Contributor, Author)

The latest commit removes the extra exception logging, now that metadata sync is a queue job, to avoid repetitive retry messages flooding the logs.

@redshiftzero (Contributor) left a comment

haven't tested yet, two comments on the diff

Inline review threads on securedrop_client/logic.py (resolved).
@redshiftzero (Contributor) left a comment

this is looking good to me, I think we have one more success callback to update and then I'll approve for merge

@sssoleileraaa changed the title from "mitigation for frequent sync errors" to "health checker/ auto-retry bg sync + adjusted timeouts" on Dec 9, 2019
@redshiftzero (Contributor) left a comment

OK this looks good to me now, thanks!

@redshiftzero merged commit 7e3cfca into master on Dec 9, 2019
@redshiftzero deleted the 610-miti branch on December 9, 2019 at 23:20