Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sync times out with 1000 sources #1007

Open
sssoleileraaa opened this issue Mar 25, 2020 · 8 comments
Open

Sync times out with 1000 sources #1007

sssoleileraaa opened this issue Mar 25, 2020 · 8 comments
Labels
bug Something isn't working scalability
Milestone

Comments

@sssoleileraaa
Copy link
Contributor

sssoleileraaa commented Mar 25, 2020

Description

Sync times out when there are 1000 sources.

STR

  1. run the client (in Qubes preferrably) after you've populated your server with 1000 sources (see Sync times out with 1000 sources #1007 (comment))
  2. reply to a source and send a message as a source

Expected

You should be able to use the client and see new replies and messages succeed.

Actual

The sync times out on a thousand sources, even after a client restart, the sync times out on a 1000 sources.

@sssoleileraaa sssoleileraaa changed the title Sync times out Sync times out with 1000 sources Mar 25, 2020
@sssoleileraaa sssoleileraaa added the bug Something isn't working label Mar 25, 2020
@sssoleileraaa
Copy link
Contributor Author

Here's a tip on how to easily update your staging server with 1000 sources (for those who are working on reproducing this issue)

  1. in the securedrop virtual environment, run molecule login -s libvirt-staging-xenial -h app-staging
  2. log in and run sudo -u www-data bash
  3. cd /var/www/securedrop
  4. NUM_SOURCES=1000 ./create-dev-data.py

@sssoleileraaa
Copy link
Contributor Author

sssoleileraaa commented Mar 28, 2020

@kushaldas or @redshiftzero - here's an update to fill you both in...

Notes

  • the only place the sync could be failing due to a timeout is here so we know that the proxy is either returning a ConnectTimeout or a ReadTimeout
  • We are passing a timeout value of 40 seconds from the client to the SDK, which passes that value in the request JSON to securedrop-proxy
  • It looks like the proxy is not reading in that timeout value and is instead defaulting to its own timeout value of 10 seconds, which is not enough time to grab 1000 or 500 or even 200 sources reliably, see test results below.

The fix

  • We need to change the proxy code so that it sets the timeout to the timeout value passed in via the sdk
  • We should also make sure 40 seconds is adequate (are these measurements still correct?)

What else would help?

  • We should update the sdk so that it provides logs to make troubleshooting easier
  • We should make it easier to find details in the logs about connection issues by passing the original error messages to new exceptions that the proxy and sdk create.

Test results

I ran the client three times, each with a clean local database.

n=100 sources

The sync did not time out on any of the three runs of the client with a clean local db.

n=200 sources

The sync timed out one out of three times I ran the client with a clean local db.

n=300 sources

The sync timed out all three times I ran the client. I left the client running for a few minutes after the third run and the sync recovered with the following results measured in on_sync_success:

update_sources: 0.007s
total: 0.1s

All subsequent syncs (after the local database had all 300 sources and the source widgets had all been added to the source list):

update_sources: 0.5s
total: 0.6s

n=400+ sources

As n gets higher the syncs time out more and more and are less likely to recover the way one sync did in the n=300 case.

@kushaldas
Copy link
Contributor

We are passing a timeout value of 40 seconds from the client to the SDK, which passes that value in the request JSON to securedrop-proxy

We are actually not passing that timeout to securedrop-proxy to the current implementation. We are just using it to timeout the qrexec process command we are running in the sd-app vm, https://github.com/freedomofpress/securedrop-sdk/blob/master/sdclientapi/__init__.py#L64

@kushaldas
Copy link
Contributor

kushaldas commented Mar 30, 2020

In my tests with 500 sources + 2 submission each source, 60 seconds time worked well everytime, tried 5 times.

@rmol
Copy link
Contributor

rmol commented Mar 30, 2020

It's important to remember (which I wasn't Friday 🙂) that the proxy's timeout works differently than the SDK's. As @kushaldas pointed out, the SDK timeout is a hard limit on the amount of time we'll wait for a response from the qrexec connection. The proxy timeout is a Requests timeout which is either the amount of time we'll wait for a connection to be established, or the longest we'll wait to see any data transmitted from the upstream server.

What we're seeing is that slow API endpoints can stay quiet for longer than ten seconds. Right now with up to 1000 sources a 60-second proxy timeout works. I think we can safely specify an even longer proxy timeout, say 120 or 300 seconds, and not implement the ability for the SDK to change it, because the SDK timeout will allow us to stay responsive to the workstation operator. The only downside I can think of to this is possibly running into problems spawning too many qrexec processes if the proxy end of the connection stays alive after the client disconnects, waiting minutes longer for the server's response. We need to test what happens when the qrexec connection is closed by the client.

@sssoleileraaa
Copy link
Contributor Author

The fix

  • We need to change the proxy code so that it sets the timeout to the timeout value passed in via the sdk
  • We should also make sure 40 seconds is adequate (are these measurements still correct?)

I'll create a separate issue for each of these things. For now, we have increased the proxy timeout to 120 seconds because it's a quick and easy change.

@sssoleileraaa
Copy link
Contributor Author

The 120 seconds fix worked my test with 300 sources this morning, and for @redshiftzero's 505-source test this morning. Just want to mention that we still timeout on 1000 sources so I'm leaving this issue open. Also another reason we should prioritize freedomofpress/securedrop-proxy#145, freedomofpress/securedrop-sdk#114, and #1754

@eloquence eloquence added this to the 0.2.0 milestone Apr 1, 2020
@redshiftzero
Copy link
Contributor

related server-side change in freedomofpress/securedrop#5184

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working scalability
Projects
None yet
Development

No branches or pull requests

5 participants