Obtain realistic timeout data for different API endpoints #648

Closed
sssoleileraaa opened this issue Dec 6, 2019 · 14 comments

@sssoleileraaa (Contributor)

Description

Right now we use a default of 5 seconds for the following API endpoints:

  • get_sources
  • get_submissions
  • get_all_replies

To come up with a more realistic timeout, we should do some latency testing, hitting each endpoint via Tor with variable sizes of data (e.g. 50 sources with 200 submissions and 100 replies).
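A rough command-line harness for this could look like the sketch below (assumptions: the securedrop-sdk's sdclientapi is used, and the server URL, credentials, and TOTP value are placeholders):

```python
import statistics
import time

from sdclientapi import API  # securedrop-sdk

# Placeholder URL/credentials: point this at a populated staging or dev server.
api = API("http://localhost:8081", "journalist", "correct horse battery staple", "123456", proxy=False)
api.authenticate()

def mean_seconds(call, runs=10):
    # Time `runs` invocations of the given endpoint call and return the mean.
    samples = []
    for _ in range(runs):
        start = time.monotonic()
        call()
        samples.append(time.monotonic() - start)
    return statistics.mean(samples)

for name, call in [("get_sources", api.get_sources), ("get_all_replies", api.get_all_replies)]:
    print(f"{name}: {mean_seconds(call):.2f}s")
```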

@sssoleileraaa (Contributor, Author)

Also include:

  • reply_source
  • remove_star
  • add_star

Right now we use a default of 5 seconds.

@eloquence (Member)

Testing against a highly populated server is now tracked in more detail in #656, which leaves us with latency measurement for starring/replies. How to scope this remaining work depends a bit on how we want to do these benchmarks, e.g.:

  1. manual testing of client against a populated server with timing checkpoints for syncs, combined with headless API tests for starring/replies
  2. all testing done via a functional testing framework (see Add functional tests #381)

Any thoughts on this? Option 1 would get us results faster, but I know we're all itching to get some functional tests in, so perhaps it'd be worth using this as an opportunity to test pytest-qt?
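For reference, a pytest-qt test with a timing checkpoint can be quite small; a generic sketch (not wired to the client's widgets):

```python
import time

from PyQt5.QtCore import Qt
from PyQt5.QtWidgets import QPushButton

def test_click_latency(qtbot):
    # Generic example: qtbot drives a plain button; a real test would drive
    # the client's sync and wait on its completion signal instead.
    button = QPushButton("sync")
    qtbot.addWidget(button)

    done = []
    button.clicked.connect(lambda: done.append(True))

    start = time.monotonic()
    qtbot.mouseClick(button, Qt.LeftButton)
    qtbot.waitUntil(lambda: len(done) > 0, timeout=1000)
    print(f"elapsed: {time.monotonic() - start:.3f}s")
```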

@eloquence (Member)

Discussed this a bit w/ Jen today; she suggested that we should start by just doing command line testing of the API endpoints, to get typical Tor latency data w/ a populated server (which is also how I interpret your proposal, @creviera).

This is different from the in-client benchmarking described in #656, which I think would also be useful in case there are implementation bottlenecks in the client. I think both issues deserve to exist, with this one at a higher priority since it will give us some important baseline data relatively quickly.

@eloquence changed the title from "Use realistic timeouts for different API endpoints" to "Obtain realistic timeout data for different API endpoints" on Dec 19, 2019
@eloquence added this to the 0.2.0beta milestone on Dec 19, 2019
@eloquence (Member)

We discussed this at sprint planning today; tentatively, @rmol has offered to take a stab at this. To resolve this issue, a write-up and/or gist is sufficient.

@rmol self-assigned this on Jan 13, 2020
@rmol (Contributor) commented Jan 16, 2020

I have some preliminary measurements for the requested endpoints, with the server populated as described above, going through sd-proxy on Qubes:

| Endpoint | Mean response time, ten requests (seconds) |
| --- | --- |
| get_sources | 4.91 |
| get_submissions | 1.61 |
| get_all_replies | 2.40 |
| reply_source | 1.36 |
| add_star | 1.25 |
| remove_star | 1.30 |

The timing is highly variable, so I'd like to take more measurements over a longer period, but right now it looks like the current five-second timeout probably works for everything but get_sources.

The connection latencies with qrexec and Tor are brutal. The qrexec overhead can be around half a second per request, compared to just making direct HTTP requests. It's too complicated for the pilot, but we should look into ways of maintaining connections to the server or sd-proxy or both. I think persistent connections would eliminate most of the time required for most of these calls.
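The general idea, sketched with plain requests (this is not the actual securedrop-proxy or SDK code; names and the timeout value are illustrative):

```python
import requests

# A single Session reuses the underlying connection (HTTP keep-alive) across
# requests instead of paying connection setup on every endpoint hit, which is
# where much of the per-request overhead goes over Tor/qrexec.
session = requests.Session()

def proxied_get(base_url, path):
    response = session.get(base_url + path, timeout=30)
    response.raise_for_status()
    return response.json()
```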

In the client's MetadataSyncJob, we use get_remote_data, which gets a list of all sources, and then iterates them, fetching the submissions for each source, and then collecting all the submissions in a single list. I think we could change get_remote_data to use the SDK's get_all_submissions to produce the same list with one API call. The mean response time of get_all_submissions was 4.09 seconds -- less than three get_submissions calls.
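As a rough sketch of that change (assuming the SDK method names used elsewhere in this thread; the client's actual get_remote_data differs in detail):

```python
def get_remote_data(api):
    # One get_all_submissions() round trip instead of one get_submissions()
    # call per source.
    sources = api.get_sources()
    submissions = api.get_all_submissions()
    replies = api.get_all_replies()
    return sources, submissions, replies
```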

The SD core qa_loader.py currently produces data that is unusable in the client. For example, it uses random numbers for file counters, which can result in unique constraint violations when the client tries to download the submissions.

@rmol (Contributor) commented Jan 16, 2020

Loaded the staging environment with 250 sources, 1000 submissions, and 500 replies:

| Endpoint | Mean response time, ten requests (seconds) |
| --- | --- |
| get_sources | 20.21 |
| get_submissions | 1.38 |
| get_all_replies | 3.45 |
| get_all_submissions | 4.10 |
| reply_source | 1.47 |
| add_star | 1.35 |
| remove_star | 1.37 |

The problem with get_sources is that we resort to gpg to fetch each source's fingerprint and public key. Perhaps those could be added as columns on Source, but that change will be entangled with the entropy/flag-for-reply changes.

Given the performance of get_sources, I don't think there's any point in testing a larger database. It's already too slow to be used where a person is waiting for it.

@eloquence (Member)

This is very helpful. Do all response times include network transmission over Tor? Can we break them down into Tor transmission time vs. server response time?

@rmol (Contributor) commented Jan 17, 2020

All the measurements above were taken while connecting to a staging environment via our proxy, so include the overhead of both qrexec and Tor.

I have not tried connecting directly to the staging environment without Tor, but it would be useful to see the best case performance of these endpoints. I'll do that.

@rmol (Contributor) commented Jan 18, 2020

I measured in the staging app VM, using the loopback interface, so eliminating qrexec and Tor:

| Endpoint | Mean response time, ten requests (seconds) |
| --- | --- |
| get_sources | 15.24 |
| get_submissions | 0.01 |
| get_all_replies | 0.17 |
| get_all_submissions | 0.13 |
| reply_source | 1.20 |
| add_star | 0.01 |
| remove_star | 0.01 |

The get_all_submissions and get_all_replies endpoints can be sped up by using eager joins in the SQLAlchemy calls, bringing them down to around 0.05 seconds.
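In SQLAlchemy terms that is roughly the following (model and relationship names are assumptions about the server code):

```python
from sqlalchemy.orm import joinedload

from models import Submission  # SecureDrop server model (import path assumed)

# Eager-join each submission's source in the same query so serializing the
# results does not trigger a separate lazy SELECT per row.
submissions = Submission.query.options(joinedload(Submission.source)).all()
```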

So clearly Tor and qrexec add significant overhead. My next step on that front is to try persistent connections in securedrop-proxy and securedrop-sdk. I don't think it will be as involved as I feared.

@redshiftzero and I talked a bit yesterday and think there are several things we can do to speed up get_sources. The simplest is to just add an endpoint that omits the public key and fingerprint, for use in the metadata sync. The client doesn't need those until trying to reply, so they can be retrieved then.

@rmol (Contributor) commented Jan 20, 2020

With securedrop-sdk and securedrop-proxy reworked to use persistent connections:

| Endpoint | Mean response time, ten requests (seconds) |
| --- | --- |
| get_sources | 19.75 |
| get_submissions | 0.66 |
| get_all_replies | 1.82 |
| get_all_submissions | 2.00 |
| reply_source | 0.78 |
| add_star | 0.62 |
| remove_star | 0.65 |

@redshiftzero (Contributor)

thanks for these results @rmol

> The mean response time of get_all_submissions was 4.09 seconds -- less than three get_submissions calls.

this endpoint change is in: #709

Regarding the rest of these results, get_sources is definitely what we need to focus on next. It seems like there is a linear scan over all keys in CryptoUtil.getkey (https://github.com/freedomofpress/securedrop/blob/708e4981693a8dc2d31311b5883ad676cb6ad2cc/securedrop/crypto_util.py#L229), which is then performed for every source, so the get_sources endpoint scales quadratically with the number of sources. Some ideas:

  1. store gpg key and fingerprint in database, i.e. ditch @property (what @rmol notes above),
  2. store a cache (dict) on the crypto_util object for faster gpg lookup, we construct it on application start (makes get_sources scale linearly with the number of sources),
  3. memoize CryptoUtil.getkey using functools.lru_cache (makes get_sources scale linearly with the number of sources) - basically does 2 but with a way smaller diff

@redshiftzero (Contributor)

small diff for 3 - goes from ~20s compute to ~2s compute for 200 sources when the cache is warm (after the first call to get_sources):

diff --git a/securedrop/crypto_util.py b/securedrop/crypto_util.py
index 1d730813a..7039d1e70 100644
--- a/securedrop/crypto_util.py
+++ b/securedrop/crypto_util.py
@@ -5,6 +5,7 @@ import pretty_bad_protocol as gnupg
 import os
 import io
 import scrypt
+from functools import lru_cache
 from random import SystemRandom

 from base64 import b32encode
@@ -225,6 +226,7 @@ class CryptoUtil:
         # The subkeys keyword argument deletes both secret and public keys.
         temp_gpg.delete_keys(key, secret=True, subkeys=True)

+    @lru_cache(maxsize=500)
     def getkey(self, name):
         for key in self.gpg.list_keys():
             for uid in key['uids']:

@sssoleileraaa (Contributor, Author)

We are missing a few endpoints in our reporting:

  • delete_source
  • authenticate
  • logout

@zenmonkeykstop (Contributor)

Closing - will revisit timeout and sync behaviour in other issues. This has also been addressed in code, i.e.:

def _get_realistic_timeout(self, size_in_bytes: int) -> int:
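That method scales the timeout with the expected payload size rather than using a fixed default; roughly like the following sketch (the constants are placeholders, not the client's actual values):

```python
import math

TIMEOUT_BASE = 20                    # assumed floor, in seconds
TIMEOUT_BYTES_PER_SECOND = 100_000   # assumed sustained throughput over Tor

def _get_realistic_timeout(self, size_in_bytes: int) -> int:
    # Larger payloads get proportionally more time before the request is
    # abandoned.
    return math.ceil(size_in_bytes / TIMEOUT_BYTES_PER_SECOND) + TIMEOUT_BASE
```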
