
health checker/ auto-retry bg sync + adjusted timeouts #612

Merged: 8 commits merged into master from 610-miti on Dec 9, 2019

Conversation

@sssoleileraaa (Contributor) commented Nov 8, 2019

Description

Resolves #610
Resolves #181

  • I have tested these changes in the appropriate Qubes environment

@eloquence (Member)

We discussed this at sprint planning today; this change does not mitigate the observed problem, so Allie may close the PR, or amend it if increasing the timeout is worth doing regardless.

@sssoleileraaa (Contributor, Author)

"The SecureDrop server cannot be reached" with the Retry link continues to appear intermittently even with this extended timeout. This tells us that the issue is not a RequestTimeoutError, which means it has to be an ApiInaccessibleError (see the exception handling below):

    except (RequestTimeoutError, ApiInaccessibleError) as e:
        logger.debug('Job {} raised an exception: {}: {}'.format(self, type(e).__name__, e))
        self.add_job(PauseQueueJob())
        self.re_add_job(job)

Also, note that if the API were None then we wouldn't see that Retry link, so that is not the issue: https://github.com/freedomofpress/securedrop-client/blob/master/securedrop_client/logic.py#L275

We already automatically retry 5 times in a row to resend a request from the queue when we see an ApiInaccessibleError, but those retries happen back-to-back very quickly. One mitigation might be to space out the retries, or to wait X seconds and auto-retry once more before pausing the queue and asking for user intervention to fix the network issue. This is a bit like a health checker (see #491), but less advanced. We could start off by auto-retrying once after, say, 30 seconds, specifically for ApiInaccessibleErrors, since we're seeing this frequently.
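
A rough sketch of the spaced-out retry described above; the helper names, constants, and the blocking sleep are illustrative assumptions, not the client's actual queue code:

    import time

    class ApiInaccessibleError(Exception):
        """Stand-in for the client's real ApiInaccessibleError."""

    MAX_QUICK_RETRIES = 5     # back-to-back retries already performed today
    RETRY_DELAY_SECONDS = 30  # proposed wait before one more automatic retry

    def run_with_delayed_retry(run_job, pause_queue):
        """Run a queued job, retrying once after a delay before pausing the queue."""
        for _ in range(MAX_QUICK_RETRIES):
            try:
                return run_job()
            except ApiInaccessibleError:
                continue  # immediate retries, roughly what the queue does now

        time.sleep(RETRY_DELAY_SECONDS)  # proposed: give the network a chance to recover
        try:
            return run_job()
        except ApiInaccessibleError:
            pause_queue()  # only now fall back to the manual Retry link

In the real client this would presumably be driven by a timer rather than a blocking sleep, so the GUI thread is never held up.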

@redshiftzero (Contributor)

are you sure it's an ApiInaccessibleError? can you confirm via the debug logging?

@sssoleileraaa changed the title from "mitigation for frequent metadata sync timeouts" to "mitigation for frequent sync errors" on Dec 2, 2019
@sssoleileraaa (Contributor, Author)

are you sure it's an ApiInaccessibleError? can you confirm via the debug logging?

Ah, actually, I just realized that since MetadataSyncJob isn't implemented yet, the errors would have to be coming from MessageDownloadJob or ReplyDownloadJob, which are triggered during a sync. Either of those jobs could be hitting timeout errors.

If we increased the timeouts for those jobs, that would work as a quick mitigation if it is a timeout issue. If we're instead seeing frequent ApiInaccessibleErrors, then I think waiting and auto-retrying makes sense.

I can verify what the issue is once I get to my Qubes station.

@sssoleileraaa (Contributor, Author) commented Dec 2, 2019

It does seem like increasing the default_request_timeout here (which should be applied to both MessageDownloadJob and ReplyDownloadJob) to a very large value doesn't help this issue, which is why I'm thinking the server is just temporarily inaccessible. I will confirm shortly.
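
Purely to illustrate the experiment described above, a hypothetical sketch of bumping the request timeout a download job uses; the attribute location and job structure here are assumptions, not the actual client code:

    VERY_LARGE_TIMEOUT = 120  # seconds, deliberately oversized for the test

    class MessageDownloadJob:  # the same idea would apply to ReplyDownloadJob
        def call_api(self, api_client, session):
            # Raise the per-request timeout before talking to the server.
            api_client.default_request_timeout = VERY_LARGE_TIMEOUT
            ...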

@sssoleileraaa (Contributor, Author)

Resolves #181

@sssoleileraaa (Contributor, Author)

I updated the PR description to reflect recent changes and added tests after rebasing to include the metadata sync job changes, so this is ready for review!

@redshiftzero (Contributor) left a comment

some questions about this approach inline

Inline review threads on securedrop_client/queue.py and securedrop_client/logic.py (resolved).
@sssoleileraaa (Contributor, Author)

Resolves #491

@sssoleileraaa (Contributor, Author)

I just want to mention, when you're ready to re-review, that the default API timeouts are:

  • reply_source: 5 seconds (unchanged)
  • remove_star: 5 seconds (unchanged)
  • add_star: 5 seconds (unchanged)
  • download jobs (except for MetadataSyncJob): adjusted realistic timeout based on download size (unchanged)
  • MetadataSyncJob get_sources: changed from 5 seconds to 20 seconds
  • MetadataSyncJob get_submissions: changed from 5 seconds to 20 seconds
  • MetadataSyncJob get_all_replies: changed from 5 seconds to 20 seconds

Changing the MetadataSyncJob API request timeouts from 5 to 20 seconds means I basically never see a network error. I tested this by running the client for 3 hours. At one point I saw the network error message, did nothing, and later saw that it had resolved itself. This demonstrates that if a user walks away, or is busy reading or editing a file in a different VM, and there are network errors, those errors can potentially resolve themselves. So when the user comes back to the client they don't have to see a red error message that requires a manual retry.

Changing the MetadataSyncJob API request timeouts from 5 to 20 seconds does mean that a refresh will take up to 1 minute (three sequential 20-second requests) before telling the user that the server could not be reached. Other operations will tell the user that the server could not be reached within 25 seconds. That is the tradeoff, and I opened a follow-up issue to adjust all the defaults in a smarter way, similar to what we did for download jobs: #648
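
For reference, an illustrative sketch of the timeout budget described above; the dictionary and names are hypothetical, only the numbers come from the list:

    # Per-request API timeouts after this PR, in seconds.
    API_TIMEOUTS = {
        "reply_source": 5,
        "remove_star": 5,
        "add_star": 5,
        "metadata_sync_get_sources": 20,
        "metadata_sync_get_submissions": 20,
        "metadata_sync_get_all_replies": 20,
    }

    # A full metadata refresh issues the three sync requests sequentially,
    # so its worst case is 3 * 20 = 60 seconds before the error is shown.
    WORST_CASE_REFRESH_SECONDS = 3 * API_TIMEOUTS["metadata_sync_get_sources"]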

@sssoleileraaa (Contributor, Author)

The latest commit removes the extra exception logging, now that metadata sync is a queue job, to avoid repetitive retry messages flooding the logs.

@redshiftzero (Contributor) left a comment

haven't tested yet, two comments on the diff

Inline review threads on securedrop_client/logic.py (resolved).
@redshiftzero (Contributor) left a comment

this is looking good to me, I think we have one more success callback to update and then I'll approve for merge

@sssoleileraaa changed the title from "mitigation for frequent sync errors" to "health checker/ auto-retry bg sync + adjusted timeouts" on Dec 9, 2019
@redshiftzero (Contributor) left a comment

OK this looks good to me now, thanks!

@redshiftzero merged commit 7e3cfca into master on Dec 9, 2019
@redshiftzero deleted the 610-miti branch on December 9, 2019 at 23:20