
Ensure UI remains responsive at all times during sync with server. #733

Merged 1 commit into freedomofpress:master on Jan 29, 2020

Conversation

@ntoll (Contributor) commented Jan 27, 2020

Description

Fixes #716

DO NOT MERGE: this is a work in progress while further investigation takes place.

The initial removal of update_sources on message download ensures the UI remains responsive.

However, given a server with 200 sources, while the UI is now responsive, I no longer see the results I was able to see with far fewer sources (50) in previous tests. The server activity looks like this:

172.17.0.1 - - [27/Jan/2020 11:51:51] "POST /api/v1/token HTTP/1.1" 200 -
172.17.0.1 - - [27/Jan/2020 11:52:20] "GET /api/v1/sources HTTP/1.1" 200 -
172.17.0.1 - - [27/Jan/2020 11:52:43] "GET /api/v1/sources HTTP/1.1" 200 -
2020-01-27 11:52:45,411 INFO Clearing shredder
2020-01-27 11:52:45,411 INFO Files to delete: 0

This suggests something is pinging the API every 20 seconds or so, though I'm not sure what. I also notice from top that the client is taking up around 50% of my CPU and spawning lots of gpg2-related processes. After a while the client (responsively!) reports "The SecureDrop server cannot be reached".

As I understand it, we download everything in one go. Is there an API call we can make to get just the sources with their latest message IDs, and download only those instead? Then, upon clicking a source, if its messages are not already downloaded, perhaps only the messages for that specific source could be downloaded and processed?

In any case, the end result is that the client responsively hangs and takes up CPU resources when there are a large number of sources. When there are only 50, the application appears to work fine, except that the latest-message preview in the source list is never updated (which can be coded around; see the screenshot below).

[Screenshot: source list with 50 sources]

My gut feeling is that while the UI is now responsive, this highlights yet another performance problem: the processing of large numbers of sources and their associated conversations.

Thoughts welcome.

Test Plan

Currently: keep poking it with a stick and asking more questions.

Checklist

If these changes modify code paths involving cryptography, the opening of files in VMs, or network traffic (via the RPC service), Qubes testing in the staging environment is required. For fine tuning of the graphical user interface, testing in any environment in Qubes is required. Please check as applicable:

  • I have tested these changes in the appropriate Qubes environment
  • I do not have an appropriate Qubes OS workstation set up (the reviewer will need to test these changes)
  • These changes should not need testing in Qubes

If these changes add or remove files other than client code, packaging logic (e.g., the AppArmor profile) may need to be updated. Please check as applicable:

  • I have submitted a separate PR to the packaging repo
  • No update to the packaging logic (e.g., AppArmor profile) is required for these changes
  • I don't know and would appreciate guidance

@redshiftzero (Contributor) commented:

> the client is taking up around 50% of my CPU and spawning lots of gpg2-related processes

this is almost certainly due to this logic: after sync, we attempt to re-import every key for every source (from the server). If we rework the client-side logic to only import keys for sources that do not yet have keys (to do in a follow-up), that should significantly reduce CPU usage.
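
a rough sketch of that follow-up (helper names are hypothetical stand-ins, not the client's actual GPG wrapper):

    def import_missing_source_keys(gpg_helper, remote_sources):
        """Import keys only for sources we don't already hold locally."""
        for source in remote_sources:
            # Spawning gpg2 for an import is expensive, so skip any source
            # whose fingerprint is already in the local keyring.
            if gpg_helper.has_key(source.fingerprint):  # hypothetical check
                continue
            gpg_helper.import_key(source.public_key)  # hypothetical import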

@sssoleileraaa (Contributor) commented Jan 27, 2020

> The initial removal of update_sources on message download ensures the UI remains responsive.

So currently we call update_sources, which gets the sources from the local database and updates the GUI source list, in several locations:

  • on_message_download_success
  • on_file_download_success
  • on_delete_source_success
  • on_authenticate_success
  • on_sync_success
  • login_offline_mode

Since we want to display new messages and replies and new sources on-the-fly, we refresh the source list every time we download something new that needs to be displayed.

I think we should consider removing this call from more than just the on_message_download_success method. We only need to update the source list after a successful metadata sync, because that's when the local database will have all the new sources. We also don't need to update the source list if there isn't a new source. A sketch of the idea follows.
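
to make that concrete, a sketch (the handler names come from the list above; the conversation-refresh helper is invented for illustration):

    def on_message_download_success(self, uuid):
        # The message's source is already in the source list, so refresh
        # only the open conversation rather than rebuilding the whole list.
        self.gui.refresh_current_conversation()  # hypothetical helper

    def on_sync_success(self):
        # Only the metadata sync guarantees the local database holds every
        # new source, so rebuild the source list here and nowhere else.
        self.update_sources()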

> However, given a server with 200 sources, while the UI is now responsive, I no longer see the results I was able to see with far fewer sources (50) in previous tests. The server activity looks like this:

It makes sense that removing the GUI update of the source list for each new message would speed this up.

> 172.17.0.1 - - [27/Jan/2020 11:51:51] "POST /api/v1/token HTTP/1.1" 200 -
> 172.17.0.1 - - [27/Jan/2020 11:52:20] "GET /api/v1/sources HTTP/1.1" 200 -
> 172.17.0.1 - - [27/Jan/2020 11:52:43] "GET /api/v1/sources HTTP/1.1" 200 -
> 2020-01-27 11:52:45,411 INFO Clearing shredder
> 2020-01-27 11:52:45,411 INFO Files to delete: 0

> This suggests something is pinging the API every 20 seconds or so, though I'm not sure what. I also notice from top that the client is taking up around 50% of my CPU and spawning lots of gpg2-related processes. After a while the client (responsively!) reports "The SecureDrop server cannot be reached".

so currently we call sync_api in a few more places (the big removal of sync_api is not yet finished). I am not seeing 20-second-interval metadata syncs unless I change this timer to be 20 seconds: https://github.com/freedomofpress/securedrop-client/blob/master/securedrop_client/logic.py#L227

One way to see more frequent API calls is if a queue job fails, which triggers a retry. The default timeout is 20 seconds for messages and replies, so perhaps some retrying was happening, roughly along the lines sketched below.
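
schematically, the retry path that would produce that cadence (a sketch: RequestTimeoutError is the API client's timeout exception, and call_api matches the signature in the diff below, but the surrounding names are illustrative):

    from sdclientapi import RequestTimeoutError

    def process(self):
        priority, job = self.queue.get(block=True)
        try:
            job.call_api(self.api_client, self.session)
        except RequestTimeoutError:
            # A sync that takes longer than default_request_timeout (20s for
            # messages/replies) fails here and goes straight back on the
            # queue, so the same request fires again roughly every 20s.
            self.queue.put_nowait((priority, job))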

> As I understand it, we download everything in one go. Is there an API call we can make to get just the sources with their latest message IDs, and download only those instead? Then, upon clicking a source, if its messages are not already downloaded, perhaps only the messages for that specific source could be downloaded and processed?

This is something we have talked about and could do on the server side, which seems ideal. Until then, we could improve the client side of things by only updating the GUI source list after we get all the sources, not whenever there is a new message or reply (because we will already have the source for that message in the source list; recall that we update the source list after a metadata sync).

> In any case, the end result is that the client responsively hangs and takes up CPU resources when there are a large number of sources. When there are only 50, the application appears to work fine, except that the latest-message preview in the source list is never updated (which can be coded around; see the screenshot below).

I do believe changing the code so that we only call update_sources in on_sync_success is a good optimization, so what you're doing makes sense! I think we should just go further and remove it in the other locations, and possibly add logic to only update the source list if there are new sources (we could do this in a separate PR as well).

@sssoleileraaa (Contributor) commented:

> Since we want to display new messages and replies and new sources on-the-fly, we refresh the source list every time we download something new that needs to be displayed.

Note: that was the thinking behind this, but actually it's unnecessary, as you've discovered, @ntoll.

@rmol (Contributor) commented Jan 28, 2020

@ntoll I tried with 200 sources and the GPG imports only took a few seconds, and weren't killing my machine, but this branch skips the import for keys we already have. It's based on this branch (unfreeze-ui) so hopefully it'll be easy to try. If it helps, I'll add tests and finish it up.

@ntoll (Contributor, Author) commented Jan 28, 2020

@rmol @creviera OK... I've merged your skip-gpg branch and removed update_sources from various places. I'm about to embark on testing against a 200-source version of the server and checking the various aspects of the behaviour I've inevitably changed. I also want to update a few other things to make sure that when new messages arrive the source preview is updated properly.

@ntoll (Contributor, Author) commented Jan 28, 2020

This has been an interesting "building a house of cards" kind of exercise to get to this state: work slowly and carefully, but one wrong move and it all falls down. All the tests were done against a server with 200 sources.

This is currently still a WIP, but I feel I've made good progress, both in diagnosing and understanding the problem from the client side of things and in coming up with a fix.

The retries after 20 seconds that I noticed were because the MetadataJob had an API call timeout of 20 seconds, which wasn't long enough for the file to download (on my machine). I'm not sure why this is the case on my machine, but the side effect was that the job would always fail (time out) and then try again, so I'd be stuck in a state of never-completing syncs. This was what was causing the CPU load. I've increased the timeout value and the sync now successfully completes.

Then I noticed that while the UI was responsive in that it wasn't frozen, it didn't look as if it were doing anything while the sources were being downloaded. So I have bodged together an animation (sketched below), which will definitely need the magical super powers of @ninavizz to provide a 16x16 looping animated GIF indicator to show the client is doing something. I also ensured that the status message was updated to indicate things were "in flight". This, too, needs checking in terms of UI / messaging. However, and please remember this is a temporary UI bodge, it vastly improves the UX in terms of responsiveness.
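
For the curious, the bodge boils down to wiring a looping QMovie into a status-bar label, roughly like this (a sketch: the class name and wiring are invented here, not the actual widget code):

    from PyQt5.QtGui import QMovie
    from PyQt5.QtWidgets import QLabel

    class ActivityIndicator(QLabel):
        """Plays a looping GIF while a sync is in flight."""

        def __init__(self):
            super().__init__()
            self.setMovie(QMovie("sync_animation.gif"))

        def start(self):
            self.show()
            self.movie().start()

        def stop(self):
            self.movie().stop()
            self.hide()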

[Animated screenshot: the client syncing, with spinner and status message visible]

As you can see, the UI is always responsive.

In addition, I also managed to crash the client several times, and eventually found that this was due to some faulty logic in the updated queue.py file. Since I've been on a bug hunt, I've also cleaned this problem up (see the diff for details... @creviera I'd love some feedback there please, 'cos you're the expert on this process and I'm painfully aware I might have missed something). Basically, the bug was that when a job was re-added to the metadata_queue, the queue was Full and things broke.

Finally, as messages, files and replies are downloaded from the server, the "preview last message" in the source list also needs updating. This is covered in another ticket (#707), but it feels related to all this work. I won't fix it here, but once this is merged I want to nail it down, now that I've become quite familiar with the code I'll be changing through the work done in this PR. ;-)

I'd like feedback, poking with a stick and comments about the content of this PR (remember it's got @rmol's branch for gpg key import improvements merged in too).

My aim is, assuming folks don't find gotchas, to get this PR revised with comprehensive unit tests (the work you see so far is the result of slow/careful exploration so is a bit like a "spike" at the moment) and incorporate an appropriate spinner from @ninavizz.

Phew.

@ntoll (Contributor, Author) commented Jan 28, 2020

OK... I need to revisit this work in light of #732

@sssoleileraaa (Contributor) left a review:

looks like you just need to add tests and remove code unrelated to the removal of unnecessary GUI updates and the client freeze issue.

other than that, i like this approach

@@ -53,7 +53,7 @@ def call_api(self, api_client: API, session: Session) -> Any:
         # TODO: Once https://github.com/freedomofpress/securedrop-client/issues/648, we will want to
         # pass the default request timeout to api calls instead of setting it on the api object
         # directly.
-        api_client.default_request_timeout = 20
+        api_client.default_request_timeout = 120
Review comment (Contributor):

so you'll notice we have an issue for acquiring realistic timeouts for api calls here: #648

if you read through john's findings you'll see that 20 seconds is cutting it close for get_sources when there are 200 sources, but it looks to be more than enough time for the other API calls made during a sync: get_submissions and get_all_replies

eventually we want to be able to scale these timeouts the way we do with file downloads but for now, using a default of 20 seconds seems like plenty of time for most cases except when there are 200+ sources for get_sources

Unfortunately we don't yet support passing a different default timeout to just the get_sources api method as you can see from my comment above:

        # TODO: Once https://github.com/freedomofpress/securedrop-client/issues/648, we will want to
        # pass the default request timeout to api calls instead of setting it on the api object
        # directly.

i think until we can vary our timeouts per request, and based on the size of the request, we should use the largest measurement from john's profiling and multiply it by 1.5 or 2, so I think 30-40 second timeouts should be enough, right?
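
until then, a padded constant is the pragmatic option; eventually the shape would be a size-scaled timeout along these lines (a sketch: the numbers are illustrative assumptions, not profiled values):

    def sync_request_timeout(source_count):
        # BASE keeps the old 20s default as a floor; PER_SOURCE is an
        # assumed per-source cost, to be replaced by profiling (see #648).
        BASE = 20
        PER_SOURCE = 0.1
        return max(BASE, int(source_count * PER_SOURCE * 2))  # 2x margin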

@eloquence (Member) commented Jan 28, 2020:

I am wondering if @ntoll tested with freedomofpress/securedrop#5100 in place on the server, which should already significantly improve performance (the cache gets primed on app server start).

@@ -215,6 +215,9 @@ def __init__(self):
 
         self.active = False
 
+        self.download_animation = load_movie("sync_animation.gif")
Review comment (Contributor):

as discussed during standup, you can just remove all these widget.py changes and focus on optimizing the UI so that it doesn't freeze when there are many sources: remove all the unnecessary update_sources calls and increase the default timeout a bit.

@@ -103,7 +103,12 @@ def re_add_job(self, job: ApiJob) -> None:
         '''
         job.remaining_attempts = DEFAULT_NUM_ATTEMPTS
         priority = self.JOB_PRIORITIES[type(job)]
-        self.queue.put_nowait((priority, job))
+        try:
Review comment (Contributor):

this is out of scope and can be made into a separate PR, right? or was this needed for optimizing the UI during a sync? also see how we are moving towards a continuous background sync that only ever runs one sync at a time: #739
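
for context, the hunk above is truncated in this view; a sketch of what the complete fix plausibly looks like (the except-branch behaviour is an assumption, not the actual diff):

    from queue import Full

    def re_add_job(self, job):
        # Reset the job's attempts and put it back on the queue.
        job.remaining_attempts = DEFAULT_NUM_ATTEMPTS
        priority = self.JOB_PRIORITIES[type(job)]
        try:
            self.queue.put_nowait((priority, job))
        except Full:
            # The bounded metadata queue already holds an equivalent job;
            # dropping the duplicate avoids the crash seen before.
            pass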

@ntoll (Contributor, Author) commented Jan 29, 2020

OK... I've extracted just the essential aspect of this fix, rebased/reset things, and wrapped it all up into a neat single commit. I'll add another PR to address the queue-related bugfix.

NOTE: this branch no longer includes @rmol's gpg-key branch.

@sssoleileraaa (Contributor) left a review:

lgtm

@sssoleileraaa merged commit 11ad0b1 into freedomofpress:master on Jan 29, 2020
Successfully merging this pull request may close these issues:

  • Application totally frozen with large number of sources (#716)