
fix: limit and track processing of SendReplyJobs around session and network partitions #1486

Merged
merged 9 commits into main on Dec 22, 2022

Conversation

cfm
Member

@cfm cfm commented May 11, 2022

Description

For more details on why all this extra machinery is necessary to address this pair of corner cases, see especially #1486 (comment) and #1457 (comment).

Test Plan

Synthesized from the reproductions of #1420 and #1457:

  1. On your test server, watch /var/log/apache2/journalist-access.log.
  2. Create a new source and submit a message.
  3. Log into the Client and wait for sync.
  4. Send a reply and wait for it to send.
  5. Observe exactly 1 "POST /api/v1/sources/<uuid>/replies HTTP/1.1" 201.
  6. Disconnect sd-whonix from sys-firewall.
  7. Send at least n >= 1 more replies.
  8. Quit the Client.
  9. Reconnect sd-whonix to sys-firewall.
  10. Observe exactly 1 "POST /api/v1/sources/<uuid>/replies HTTP/1.1" 201.
  11. Wait several seconds.
  12. Start the Client again.
  13. Log in again and wait for sync.
  14. Observe only the first message from step (5) has succeeded...
  15. ...and the rest are marked Failed to send.
  16. Repeat steps (1)‒(15), but log out (and back in) rather than quitting and restarting the Client.
  17. Observe the same results.

Bonus: bootstrapping case

  1. Delete your svs.sqlite and reinitialize it with alembic upgrade head.
  2. Log into the Client and wait for sync.
  3. Sources and submissions are retrieved and displayed as usual.

Checklist

If these changes modify code paths involving cryptography, the opening of files in VMs, or network traffic (via the RPC service), Qubes testing in the staging environment is required. For fine-tuning of the graphical user interface, testing in any Qubes environment is required. Please check as applicable:

  • I have tested these changes in the appropriate Qubes environment
  • I do not have an appropriate Qubes OS workstation set up (the reviewer will need to test these changes)
  • These changes should not need testing in Qubes

If these changes add or remove files other than client code, the AppArmor profile may need to be updated. Please check as applicable:

  • I have updated the AppArmor profile
  • No update to the AppArmor profile is required for these changes
  • I don't know and would appreciate guidance

If these changes modify the database schema, you should include a database migration. Please check as applicable:

  • I have written a migration and upgraded a test database based on main and confirmed that the migration is self-contained and applies cleanly
  • I have written a migration but have not upgraded a test database based on main and would like the reviewer to do so
  • I need help writing a database migration
  • No database schema changes are needed

@cfm
Member Author

cfm commented May 16, 2022

@creviera, I'd welcome your early feedback on this overall approach to #1457. The test plan (TK tomorrow) will address whether this also fixes #1420, per your hypothesis that they're actually the same bug only apparently at different layers in the application. Thanks!

@sssoleileraaa
Contributor

sssoleileraaa commented May 17, 2022

@cfm, this change is pleasantly small, yet the issue caused so much confusion during QA last release. The only gotcha I can think of is that we actually want to fail the replies in the queue that are not being processed.

@cfm cfm marked this pull request as ready for review May 19, 2022 00:33
@cfm cfm requested a review from a team as a code owner May 19, 2022 00:33
@cfm
Member Author

cfm commented May 19, 2022

Thanks, @creviera! The test plan I've outlined above gives two pieces of good news:

  1. Client retries failed jobs after logging out/in without restart #1420 and "Failed" replies may actually succeed after client exits #1457 are indeed the same bug: the first SendReplyJob in flight before logout will succeed when network connectivity is restored, regardless of the Client's state. For that job, the Server behaves identically between the two reproductions, which appear to diverge only because the job queue is not persistent across Client invocations.

  2. You wrote that "we actually want to fail the replies in the queue that are not being processed". Yes! In Client retries failed jobs after logging out/in without restart #1420, with this change, we get that behavior for free, because the Server correctly rejects the second through nth SendReplyJobs with HTTP 403.

The bad news: We need a different mechanism to handle this situation in #1457, in which:

  1. we have draft replies still marked as pending from a previous Client invocation;
  2. we have no SendReplyJobs pending in the queue in this Client invocation; and
  3. these replies are not present in the set of replies returned by the Server.

I believe this to be a data-reconciliation problem (storage.py), not a job-processing problem (queue.py). I'll open a follow-up ticket for that problem, so that you can decide whether to review this fix as is or wait to do a stacked review of both.
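To make that reconciliation idea concrete, here is a minimal sketch of what such a storage-layer pass could look like: any reply still marked pending locally but absent from the server's reply listing is presumed to have failed. The names below (DraftReply, ReplySendStatus, reconcile_pending_drafts) are placeholders for illustration, not the actual storage.py API:

```python
# Hypothetical sketch of the data-reconciliation idea, with placeholder models.
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, List


class ReplySendStatus(Enum):
    PENDING = "PENDING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"


@dataclass
class DraftReply:
    uuid: str
    send_status: ReplySendStatus


def reconcile_pending_drafts(
    drafts: Iterable[DraftReply], server_reply_uuids: set
) -> List[DraftReply]:
    """Mark stale pending drafts as failed; return the drafts that changed."""
    changed = []
    for draft in drafts:
        if draft.send_status is ReplySendStatus.PENDING and draft.uuid not in server_reply_uuids:
            draft.send_status = ReplySendStatus.FAILED
            changed.append(draft)
    return changed


if __name__ == "__main__":
    drafts = [
        DraftReply("a", ReplySendStatus.PENDING),  # confirmed by the server
        DraftReply("b", ReplySendStatus.PENDING),  # never made it
    ]
    failed = reconcile_pending_drafts(drafts, server_reply_uuids={"a"})
    assert [d.uuid for d in failed] == ["b"]
```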

@cfm
Member Author

cfm commented May 19, 2022

Visually reviewed (and tested, albeit off-plan :-) with @creviera. The test plan checks out as written; but #1493 represents a regression from the current behavior, not just an impetus for future refactoring. Our new hypothesis, which I'll test next week (realistically):

  1. this branch addresses Client retries failed jobs after logging out/in without restart #1420 in the GUI;
  2. poor old fix: clear RunnableQueues on ApiJobQueue.stop() #1434 addresses Client retries failed jobs after logging out/in without restart #1420 in the job queue; and

—so the two of them, rebased together, will let us solve #1420 at both layers. (That would then leave #1493's refactoring to help us untangle these layers.)

@cfm cfm marked this pull request as draft May 19, 2022 23:27
@cfm cfm self-assigned this May 19, 2022
@cfm cfm force-pushed the 1457-pending-replies branch from d27fd8d to 9ae4f2d Compare May 25, 2022 06:54
@cfm
Member Author

cfm commented May 25, 2022

Our hypothesis in #1486 (comment) is right so far as it goes, but we need a bit more wiring to keep the GUI in sync with both the SendReplyJob set (to which it's reactive) and the DraftReply table (to which it's not, and which currently doubles as a shadow queue of pending replies). That's added as of 9ae4f2d, which by the end of the week I'll have cleaned up and test-planned for review. Some new architectural/refactoring insights will follow in #1493.

@cfm cfm force-pushed the 1457-pending-replies branch 2 times, most recently from 82421c7 to 72cb369 Compare May 27, 2022 06:21
@cfm cfm added the on hold label Jul 12, 2022
@cfm cfm removed the on hold label Jul 21, 2022
@cfm
Member Author

cfm commented Jul 21, 2022

Now that this fix is no longer blocked by #1500, I'm interested in getting it back to ready for review. Whether or not we merge it, it's useful as an illustration of some technical debt I'd like us to have a shared understanding of for future architectural discussions.

Necessary changes:

  1. Rebase from main, resolving conflicts with recent refactoring of securedrop.logic (easy) and securedrop.queue (harder) and their tests
  2. Rebase Alembic migration onto current head d7c8af95bc8e (new base from fix: squash Alembic migrations; eliminate non-self-contained data-migration tests #1517)
  3. Retest
  4. Write new test plan

@eloquence
Member

This is marked as "in development", is that accurate or should we bump it back for now?

@cfm
Member Author

cfm commented Oct 26, 2022

I'm finally resuscitating this branch against current main and on the current project board—and aiming to propose a clear way forward on this proposal by the end of next week.

@gonzalo-bulnes
Contributor

gonzalo-bulnes commented Nov 2, 2022

For the record: @cfm and I walked through this PR and #1493. The vision makes sense to me. I'll summarize it later :P

Edit to add: The gist of the vision beyond this PR is:

  • making a distinction between user intent (primarily a GUI concern) and operations tracking (that is: how much have we acted upon that intent so far?)
  • being intentional in not introducing unnecessary fragmentation (e.g. by having two actors update distinct records of the same event) and actively looking for it
  • working at a technical level on unifying concepts, for example moving toward dealing with queues in a near-standard manner, whichever that is, instead of maintaining a collection of distinct queue implementations and behaviors

@gonzalo-bulnes
Contributor

@cfm It may be obvious to you, but the CI pipeline definition has changed significantly since you opened this PR, so you won't get anything close to a meaningful CI failure until you update onto main. 🍎 🍏

@cfm cfm force-pushed the 1457-pending-replies branch from 72cb369 to d75b875 Compare November 8, 2022 00:31
cfm added a commit that referenced this pull request Dec 8, 2022
Fixes a partition[1][2] between SpeechBubble._update_text(), called when
a reply has been downloaded from the Journalist API, and
ReplyWidget._on_reply_success(), called when a reply has been
successfully sent via the SDK.  In the case where the Client starts with
a reply still pending from a previous session and the reply is
subsequently downloaded, ReplyWidget.status will not be updated from
Reply.send_status unless it too listens for the SpeechBubble's
update_signal, and future calls to ReplyWidget._update_styles() will be
based on a stale ReplyWidget.status.

While conflating these signals is not ideal, in the current data model
we might as well, since a reply downloaded from the server is by
definition one that has been successfully sent to the server.  See
<#1493 (comment)>.

[1]: #1486 (review)
[2]: #1486 (comment)
cfm added a commit that referenced this pull request Dec 9, 2022
@cfm cfm force-pushed the 1457-pending-replies branch from c405d1d to e313fc9 Compare December 9, 2022 00:00
@cfm
Member Author

cfm commented Dec 9, 2022

@gonzalo-bulnes, I've been able to trace this bug to the discrepancy described and fixed in e313fc9. As I note there, this turns out not to be a data race, strictly speaking. Rather, the Reply and the ReplyWidget have different (implicit) state machines, and only certain signals propagate changes from the former to the latter.

This could be fixed in the controller instead, with something like:

--- a/securedrop_client/logic.py
+++ b/securedrop_client/logic.py
@@ -892,6 +892,7 @@ class Controller(QObject):
         """
         self.session.commit()  # Needed to flush stale data.
         reply = storage.get_reply(self.session, uuid)
+        self.reply_succeeded.emit(reply.source.uuid, reply.uuid, reply.content)
         self.reply_ready.emit(reply.source.uuid, reply.uuid, reply.content)
 
     def on_reply_download_failure(self, exception: DownloadException) -> None:

(Emitting this signal in Controller.download_new_replies, where it would arguably make more sense, unfortunately creates the risk of an equal, but different, discrepancy as a side effect.)

I favor the change proposed in e313fc9 simply because the fix and the test are each a single-line change at the GUI level, while this approach would require constructing a dedicated cross-layer test case.

I don't love it, though. More motivation for #1493 (comment), unless you can think of a cleaner way for now!
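For illustration, here's a toy sketch of the GUI-level approach: the reply widget also listens for the bubble's update_signal, so that a reply downloaded from the server advances the widget's status before its styles are recomputed. The class shapes and signal signature below are simplified stand-ins (assuming PyQt5 is available), not the real ReplyWidget/SpeechBubble code:

```python
# Hypothetical, simplified model of the GUI-level fix.
from PyQt5.QtCore import QObject, pyqtSignal


class SpeechBubble(QObject):
    # (source_uuid, reply_uuid, content), emitted when a reply is downloaded
    update_signal = pyqtSignal(str, str, str)


class ReplyWidget(QObject):
    def __init__(self, reply_uuid: str, bubble: SpeechBubble) -> None:
        super().__init__()
        self.reply_uuid = reply_uuid
        self.status = "PENDING"
        # The one-line idea: the widget listens to the bubble's update_signal too.
        bubble.update_signal.connect(self._on_reply_downloaded)

    def _on_reply_downloaded(self, source_uuid: str, uuid: str, content: str) -> None:
        if uuid == self.reply_uuid:
            # A reply downloaded from the server is, by definition, one that
            # was successfully sent, so the widget's status can advance.
            self.status = "SUCCEEDED"
            self._update_styles()

    def _update_styles(self) -> None:
        print(f"reply {self.reply_uuid}: {self.status}")


if __name__ == "__main__":
    bubble = SpeechBubble()
    widget = ReplyWidget("reply-1", bubble)
    bubble.update_signal.emit("source-1", "reply-1", "hello")  # prints SUCCEEDED
```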

@cfm cfm requested a review from gonzalo-bulnes December 9, 2022 00:16
@cfm cfm assigned gonzalo-bulnes and unassigned cfm Dec 9, 2022
@gonzalo-bulnes
Contributor

The explanation in the commit message makes sense to me @cfm, and I agree that, while not ideal (and somewhat unintelligible without context), received_from_the_server => successfully_sent_to_the_server is sound.

I think you're making the right decision in favoring the smaller, simpler change over another one that requires additional design. We'll be in a better place for design when we can take a few steps back, rather than designing a solution to a problem that might not even occur anymore once we look at the process in its entirety. 👍 👍

I'll do a round of testing tomorrow. (And might ask you for a refresher on the onion addresses config, let's see how precisely I remember the procedure!)

@gonzalo-bulnes
Contributor

  1. On your test server, watch /var/log/apache2/journalist-access.log.
  2. Create a new source and submit a message. ℹ️ The source is called ethnological remark.
  3. Log into the Client and wait for sync.
  4. Send a reply and wait for it to send.
  5. Observe exactly 1 "POST /api/v1/sources/<uuid>/replies HTTP/1.1" 201.
  6. Disconnect sd-whonix from sys-firewall.
  7. Send at least n >= 1 more replies. ℹ️ I sent two.
  8. Quit the Client.
  9. Reconnect sd-whonix to sys-firewall.
  10. Observe exactly 1 "POST /api/v1/sources/<uuid>/replies HTTP/1.1" 201. ℹ️ That reply, the first of the two I sent, made it to the server and was visible to the source, as expected.
  11. Wait several seconds.
  12. Start the Client again.
  13. Log in again and wait for sync.
  14. Observe only the first message from step (5) has succeeded... 🍊 As in the previous test, the first message (the one that gets sent) never gets "seen" check marks until the Client is restarted, which, again, I wouldn't worry about in the scope of this PR.
  15. ...and the rest are marked Failed to send.
  16. Repeat steps (1)‒(15), but log out (and back in) rather than quitting and restarting the Client.
  17. Observe the same results.

However: When the network is reconnected, the reply can be shown as failed for some time, until a "sync" turns the body and avatar purple but leaves the "(i) Failed to send" label in place 😞

I'm skipping detailing steps to reproduce because I'm not sure I understand the basic connection behavior. (The GUI messaging around connection state doesn't seem to match what I know of the connection between sd-whonix and sys-firewall. I'm not sure if that's related to this branch, and I suspect it might not be. I think the "sync" sometimes gets stuck until new source messages come in; either way, I see flashes of "failed" replies that then turn "sent".) Let's talk about that, and maybe we can 🍐 pair on characterizing those error scenarios better, @cfm.

cfm added 9 commits December 19, 2022 17:36
This test case currently fails by design, as a reproduction of the
low-level behavior responsible for bug #1420.
…reads

ClearQueueJob has higher priority than all other defined jobs, which it
preempts.  ClearQueueJob clears the queue; then, like PauseQueueJob, it
returns from the RunnableQueue.process() loop.  The processing threads
will then exit without ApiJobQueue.stop() having to block on
QThread.wait().
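As an aside, here is a toy model of that preemption behavior, using Python's standard queue.PriorityQueue rather than the project's actual RunnableQueue, with made-up priority values:

```python
# Toy sketch: a highest-priority "clear" job drains the queue and stops the
# processing loop, so the worker thread exits on its own instead of the
# caller blocking on QThread.wait().  Not the real securedrop-client code.
import itertools
import queue

CLEAR_QUEUE_PRIORITY = 0   # preempts every other job
PAUSE_QUEUE_PRIORITY = 1
SEND_REPLY_PRIORITY = 5

_counter = itertools.count()  # tie-breaker so equal priorities stay FIFO


class RunnableQueue:
    def __init__(self) -> None:
        self._queue: queue.PriorityQueue = queue.PriorityQueue()

    def add_job(self, priority: int, job: str) -> None:
        self._queue.put((priority, next(_counter), job))

    def process(self) -> None:
        """Run jobs until a clear or pause job is seen."""
        while True:
            priority, _, job = self._queue.get()
            if priority == CLEAR_QUEUE_PRIORITY:
                # Drop everything still waiting, then stop processing.
                while not self._queue.empty():
                    self._queue.get_nowait()
                return
            if priority == PAUSE_QUEUE_PRIORITY:
                return
            print(f"processing {job}")


if __name__ == "__main__":
    q = RunnableQueue()
    q.add_job(SEND_REPLY_PRIORITY, "SendReplyJob-1")
    q.add_job(SEND_REPLY_PRIORITY, "SendReplyJob-2")
    q.add_job(CLEAR_QUEUE_PRIORITY, "ClearQueueJob")
    q.process()  # the clear job runs first; nothing else is processed
```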
… state

Since #750, application-level state transitions (logging in, logging
out, and switching into offline mode) call
securedrop_client.storage.mark_all_pending_draft_replies_as_failed().
However, one or more SendReplyJobs (and their underlying POST requests
to "/sources/<source_uuid/replies>") may be in flight on the network at
that time, whether or not the application is connected (or even running)
to receive their responses.  Until we have better (ideally generalized)
logic around upload jobs that have not yet been confirmed by the server,
these application-level events should not make assumptions about the
results of jobs that have already been dispatched.

Individual SendReplyJobs still call their own _set_status_to_failed()
method on non-timeout exceptions.
When a SendReplyJob actually starts to send (POST) a reply, take
ownership of it by marking it with the current PID so that we can update
"pending" versus "failed" states in subsequent Client sessions.
…der previous PIDs

If a DraftReply was marked "pending" by a previous Client session (i.e.,
with a different PID), by now it must have either failed or been sent
successfully.  To avoid leaving it in an inconsistent state, we presume
successfully.  To avoid leaving it an an inconsistent state, we presume
it failed; we'll update it anyway if the Journalist API reports it
received on the next sync.
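A toy sketch of the PID-ownership idea described in the two commits above: a SendReplyJob stamps the draft with the sending process's PID, and a later session marks as failed any draft left pending under a different PID, to be corrected by the server on the next sync if it did arrive. The DraftReply fields and helper names here are illustrative only, not the real schema or migration:

```python
# Hypothetical sketch of PID-based ownership of pending draft replies.
import os
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class DraftReply:
    uuid: str
    send_status: str = "PENDING"
    sending_pid: Optional[int] = None


def take_ownership(draft: DraftReply) -> None:
    """Called when a SendReplyJob actually starts to POST the reply."""
    draft.sending_pid = os.getpid()


def fail_drafts_from_previous_sessions(drafts: Iterable[DraftReply]) -> None:
    """On startup, presume pending drafts owned by other PIDs have failed."""
    for draft in drafts:
        if draft.send_status == "PENDING" and draft.sending_pid not in (None, os.getpid()):
            draft.send_status = "FAILED"


if __name__ == "__main__":
    old = DraftReply("left-over", sending_pid=os.getpid() + 1)  # a different, earlier session
    new = DraftReply("fresh")
    take_ownership(new)
    fail_drafts_from_previous_sessions([old, new])
    assert old.send_status == "FAILED" and new.send_status == "PENDING"
```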
Now that pending replies have their statuses updated by signal (when
sync succeeds and when queue processing stops), testing this behavior
coupled to logout no longer makes sense and is difficult to implement
idiomatically.  These paths are tested adequately here in the controller
layer in test_Controller_update_failed_replies() and in the storage
layer in test_storage.test_mark_pending_replies_as_failed().
Fixes a partition[1][2] between SpeechBubble._update_text(), called when
a reply has been downloaded from the Journalist API, and
ReplyWidget._on_reply_success(), called when a reply has been
successfully sent via the SDK.  In the case where the Client starts with
a reply still pending from a previous session and the reply is
subsequently downloaded, ReplyWidget.status will not be updated from
Reply.send_status unless it too listens for the SpeechBubble's
update_signal, and future calls to ReplyWidget._update_styles() will be
based on a stale ReplyWidget.status.

While conflating these signals is not ideal, in the current data model
we might as well, since a reply downloaded from the server is by
definition one that has been successfully sent to the server.  See
<#1493 (comment)>.

[1]: #1486 (review)
[2]: #1486 (comment)
@cfm
Member Author

cfm commented Dec 20, 2022

#1486 (comment):

However: When the network is reconnected, the reply can be shown as failed for some time, until a "sync" turns the body and avatar purple but leaves the "(i) Failed to send" label in place 😞

Tomorrow @gonzalo-bulnes and I will try to reproduce this case with some instrumentation patched in. Depending on our findings, I have a local branch already rebased against current main that I can force-push here for a quick turnaround.

@cfm cfm force-pushed the 1457-pending-replies branch from e313fc9 to f0040bc Compare December 22, 2022 03:11
@cfm
Member Author

cfm commented Dec 22, 2022

f0040bc is rebased to clear conflicts after #1604, after extensive testing with @gonzalo-bulnes to make sure that e313fc9, from which it does not differ substantively, was actually the head under test.

Contributor

@gonzalo-bulnes gonzalo-bulnes left a comment

The unexpected and inconsistent states are cleared by f0040bc. 💮

I'll approve the PR based on extensive testing with @cfm. :shipit:

@gonzalo-bulnes gonzalo-bulnes merged commit 4ef79d8 into main Dec 22, 2022
@gonzalo-bulnes gonzalo-bulnes deleted the 1457-pending-replies branch December 22, 2022 03:41
@gonzalo-bulnes
Contributor

🎩 @cfm!

cfm added a commit that referenced this pull request Jan 5, 2023
After #1611, specifically 64293d1, this "ignore" pragma is unnecessary
and therefore invalid in strict mode.  It was introduced in #1486, which
was reviewed and merged while #1611 was still in review.
@gonzalo-bulnes gonzalo-bulnes mentioned this pull request Mar 14, 2023
@rocodes rocodes mentioned this pull request Mar 14, 2023