
fix: limit and track processing of SendReplyJobs around session and network partitions #1486

Merged
merged 9 commits into main on Dec 22, 2022

Conversation

cfm
Member

@cfm cfm commented May 11, 2022

Description

For more details on why all this extra machinery is necessary to address this pair of corner cases, see especially #1486 (comment) and #1457 (comment).

Test Plan

Synthesized from the reproductions of #1420 and #1457:

  1. On your test server, watch /var/log/apache2/journalist-access.log.
  2. Create a new source and submit a message.
  3. Log into the Client and wait for sync.
  4. Send a reply and wait for it to send.
  5. Observe exactly 1 "POST /api/v1/sources/<uuid>/replies HTTP/1.1" 201.
  6. Disconnect sd-whonix from sys-firewall.
  7. Send at least n >= 1 more replies.
  8. Quit the Client.
  9. Reconnect sd-whonix to sys-firewall.
  10. Observe exactly 1 "POST /api/v1/sources/<uuid>/replies HTTP/1.1" 201.
  11. Wait several seconds.
  12. Start the Client again.
  13. Log in again and wait for sync.
  14. Observe only the first message from step (5) has succeeded...
  15. ...and the rest are marked Failed to send.
  16. Repeat steps (1)‒(15), but log out (and back in) rather than quitting and restarting the Client.
  17. Observe the same results.

Bonus: bootstrapping case

  1. Delete your svs.sqlite and reinitialize it with alembic upgrade head.
  2. Log into the Client and wait for sync.
  3. Sources and submissions are retrieved and displayed as usual.

Checklist

If these changes modify code paths involving cryptography, the opening of files in VMs, or network traffic (via the RPC service), Qubes testing in the staging environment is required. For fine-tuning of the graphical user interface, testing in any Qubes environment is required. Please check as applicable:

  • I have tested these changes in the appropriate Qubes environment
  • I do not have an appropriate Qubes OS workstation set up (the reviewer will need to test these changes)
  • These changes should not need testing in Qubes

If these changes add or remove files other than client code, the AppArmor profile may need to be updated. Please check as applicable:

  • I have updated the AppArmor profile
  • No update to the AppArmor profile is required for these changes
  • I don't know and would appreciate guidance

If these changes modify the database schema, you should include a database migration. Please check as applicable:

  • I have written a migration and upgraded a test database based on main and confirmed that the migration is self-contained and applies cleanly
  • I have written a migration but have not upgraded a test database based on main and would like the reviewer to do so
  • I need help writing a database migration
  • No database schema changes are needed

@cfm
Member Author

cfm commented May 16, 2022

@creviera, I'd welcome your early feedback on this overall approach to #1457. The test plan (TK tomorrow) will address whether this also fixes #1420, per your hypothesis that they're actually the same bug only apparently at different layers in the application. Thanks!

@sssoleileraaa
Contributor

sssoleileraaa commented May 17, 2022

@cfm, this change is pleasantly small, yet the issue caused so much confusion during QA last release. The only gotcha I can think of is that we actually want to fail the replies in the queue that are not being processed.

@cfm cfm marked this pull request as ready for review May 19, 2022 00:33
@cfm cfm requested a review from a team as a code owner May 19, 2022 00:33
@cfm
Member Author

cfm commented May 19, 2022

Thanks, @creviera! The test plan I've outlined above gives two pieces of good news:

  1. Client retries failed jobs after logging out/in without restart #1420 and "Failed" replies may actually succeed after client exits #1457 are indeed the same bug: the first SendReplyJob in flight before logout will succeed when network connectivity is restored, regardless of the Client's state. For that job, the Server behaves identically between the two reproductions, which appear to diverge only because the job queue is not persistent across Client invocations.

  2. You wrote that "we actually want to fail the replies in the queue that are not being processed". Yes! In Client retries failed jobs after logging out/in without restart #1420, with this change, we get that behavior for free, because the Server correctly rejects the second through nth SendReplyJobs with HTTP 403.

The bad news: We need a different mechanism to handle this situation in #1457, in which:

  1. we have draft replies still marked as pending from a previous Client invocation;
  2. we have no SendReplyJobs pending in the queue in this Client invocation; and
  3. these replies are not present in the set of replies returned by the Server.

I believe this to be a data-reconciliation problem (storage.py), not a job-processing problem (queue.py). I'll open a follow-up ticket for that problem, so that you can decide whether to review this fix as is or wait to do a stacked review of both.
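To make that reconciliation idea concrete, here is a minimal sketch of what such a storage-layer pass could look like: any reply still marked pending locally but absent from the server's reply listing is presumed to have failed. The names below (DraftReply, ReplySendStatus, reconcile_pending_drafts) are placeholders for illustration, not the actual storage.py API:

```python
# Hypothetical sketch of the data-reconciliation idea, with placeholder models.
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, List


class ReplySendStatus(Enum):
    PENDING = "PENDING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"


@dataclass
class DraftReply:
    uuid: str
    send_status: ReplySendStatus


def reconcile_pending_drafts(
    drafts: Iterable[DraftReply], server_reply_uuids: set
) -> List[DraftReply]:
    """Mark stale pending drafts as failed; return the drafts that changed."""
    changed = []
    for draft in drafts:
        if draft.send_status is ReplySendStatus.PENDING and draft.uuid not in server_reply_uuids:
            draft.send_status = ReplySendStatus.FAILED
            changed.append(draft)
    return changed


if __name__ == "__main__":
    drafts = [
        DraftReply("a", ReplySendStatus.PENDING),  # confirmed by the server
        DraftReply("b", ReplySendStatus.PENDING),  # never made it
    ]
    failed = reconcile_pending_drafts(drafts, server_reply_uuids={"a"})
    assert [d.uuid for d in failed] == ["b"]
```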

@cfm
Member Author

cfm commented May 19, 2022

Visually reviewed (and tested, albeit off-plan :-) with @creviera. The test plan checks out as written; but #1493 represents a regression from the current behavior, not just an impetus for future refactoring. Our new hypothesis, which I'll test next week (realistically):

  1. this branch addresses Client retries failed jobs after logging out/in without restart #1420 in the GUI;
  2. poor old fix: clear RunnableQueues on ApiJobQueue.stop() #1434 addresses Client retries failed jobs after logging out/in without restart #1420 in the job queue; and

—so the two of them, rebased together, will let us solve #1420 at both layers. (That would then leave #1493's refactoring to help us untangle these layers.)

@cfm cfm marked this pull request as draft May 19, 2022 23:27
@cfm cfm self-assigned this May 19, 2022
@cfm cfm force-pushed the 1457-pending-replies branch from d27fd8d to 9ae4f2d Compare May 25, 2022 06:54
@cfm
Member Author

cfm commented May 25, 2022

Our hypothesis in #1486 (comment) is right so far as it goes, but we need a bit more wiring to keep the GUI in sync with both the SendReplyJob set (to which it's reactive) and the DraftReply table (to which it's not, and which currently doubles as a shadow queue of pending replies). That's added as of 9ae4f2d, which by the end of the week I'll have cleaned up and test-planned for review. Some new architectural/refactoring insights will follow in #1493.

@cfm cfm force-pushed the 1457-pending-replies branch 2 times, most recently from 82421c7 to 72cb369 Compare May 27, 2022 06:21
@cfm cfm added the on hold label Jul 12, 2022
@cfm cfm removed the on hold label Jul 21, 2022
@cfm
Member Author

cfm commented Jul 21, 2022

Now that this fix is no longer blocked by #1500, I'm interested in getting it back to ready for review. Whether or not we merge it, it's useful as an illustration of some technical debt I'd like us to have a shared understanding of for future architectural discussions.

Necessary changes:

  1. Rebase from main, resolving conflicts with recent refactoring of securedrop.logic (easy) and securedrop.queue (harder) and their tests
  2. Rebase Alembic migration onto current head d7c8af95bc8e (new base from fix: squash Alembic migrations; eliminate non-self-contained data-migration tests #1517)
  3. Retest
  4. Write new test plan

@eloquence
Member

This is marked as "in development", is that accurate or should we bump it back for now?

@cfm
Member Author

cfm commented Oct 26, 2022

I'm finally resuscitating this branch against current main and on the current project board—and aiming to propose a clear way forward on this proposal by the end of next week.

@gonzalo-bulnes
Contributor

gonzalo-bulnes commented Nov 2, 2022

For the record: @cfm and I walked through this PR and #1493. The vision makes sense to me. I'll summarize it later :P

Edit to add: The gist of the vision beyond this PR is:

  • making a distinction between user intent (primarily a GUI concern) and operations tracking (that is: how much have we acted upon that intent so far?)
  • being intentional in not introducing unnecessary fragmentation (e.g. by having two actors update distinct records of the same event) and actively looking for it
  • working at a technical level on unifying concepts, for example moving toward dealing with queues in a near-standard manner, whichever that is, instead of maintaining a collection of distinct queue implementations and behaviors

@gonzalo-bulnes
Contributor

@cfm It may be obvious to you, but the CI pipeline definition has changed significantly since you opened this PR, so you won't get anything close to a meaningful CI failure until you update onto main. 🍎 🍏

@cfm cfm force-pushed the 1457-pending-replies branch from 72cb369 to d75b875 Compare November 8, 2022 00:31
cfm added a commit that referenced this pull request Dec 8, 2022
Fixes a partition[1][2] between SpeechBubble._update_text(), called when
a reply has been downloaded from the Journalist API, and
ReplyWidget._on_reply_success(), called when a reply has been
successfully sent via the SDK.  In the case where the Client starts with
a reply still pending from a previous session and the reply is
subsequently downloaded, ReplyWidget.status will not be updated from
Reply.send_status unless it too listens for the SpeechBubble's
update_signal, and future calls to ReplyWidget._update_styles() will be
based on a stale ReplyWidget.status.

While conflating these signals is not ideal, in the current data model
we might as well, since a reply downloaded from the server is by
definition one that has been successfully sent to the server.  See
<#1493 (comment)>.

[1]: #1486 (review)
[2]: #1486 (comment)
cfm added a commit that referenced this pull request Dec 9, 2022
@cfm cfm force-pushed the 1457-pending-replies branch from c405d1d to e313fc9 Compare December 9, 2022 00:00
@cfm
Member Author

cfm commented Dec 9, 2022

@gonzalo-bulnes, I've been able to trace this bug to the discrepancy described and fixed in e313fc9. As I note there, this turns out not to be a data race, strictly speaking. Rather, the Reply and the ReplyWidget have different (implicit) state machines, and only certain signals propagate changes from the former to the latter.

This could be fixed in the controller instead, with something like:

--- a/securedrop_client/logic.py
+++ b/securedrop_client/logic.py
@@ -892,6 +892,7 @@ class Controller(QObject):
         """
         self.session.commit()  # Needed to flush stale data.
         reply = storage.get_reply(self.session, uuid)
+        self.reply_succeeded.emit(reply.source.uuid, reply.uuid, reply.content)
         self.reply_ready.emit(reply.source.uuid, reply.uuid, reply.content)
 
     def on_reply_download_failure(self, exception: DownloadException) -> None:

(Emitting this signal in Controller.download_new_replies, where it would arguably make more sense, unfortunately creates the risk of an equal, but different, discrepancy as a side effect.)

I favor the change proposed in e313fc9 simply because the fix and the test are each a single-line change at the GUI level, while this approach would require constructing a dedicated cross-layer test case.

I don't love it, though. More motivation for #1493 (comment), unless you can think of a cleaner way for now!
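For illustration, here's a toy sketch of the GUI-level approach: the reply widget also listens for the bubble's update_signal, so that a reply downloaded from the server advances the widget's status before its styles are recomputed. The class shapes and signal signature below are simplified stand-ins (assuming PyQt5 is available), not the real ReplyWidget/SpeechBubble code:

```python
# Hypothetical, simplified model of the GUI-level fix.
from PyQt5.QtCore import QObject, pyqtSignal


class SpeechBubble(QObject):
    # (source_uuid, reply_uuid, content), emitted when a reply is downloaded
    update_signal = pyqtSignal(str, str, str)


class ReplyWidget(QObject):
    def __init__(self, reply_uuid: str, bubble: SpeechBubble) -> None:
        super().__init__()
        self.reply_uuid = reply_uuid
        self.status = "PENDING"
        # The one-line idea: the widget listens to the bubble's update_signal too.
        bubble.update_signal.connect(self._on_reply_downloaded)

    def _on_reply_downloaded(self, source_uuid: str, uuid: str, content: str) -> None:
        if uuid == self.reply_uuid:
            # A reply downloaded from the server is, by definition, one that
            # was successfully sent, so the widget's status can advance.
            self.status = "SUCCEEDED"
            self._update_styles()

    def _update_styles(self) -> None:
        print(f"reply {self.reply_uuid}: {self.status}")


if __name__ == "__main__":
    bubble = SpeechBubble()
    widget = ReplyWidget("reply-1", bubble)
    bubble.update_signal.emit("source-1", "reply-1", "hello")  # prints SUCCEEDED
```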

@cfm cfm requested a review from gonzalo-bulnes December 9, 2022 00:16
@cfm cfm assigned gonzalo-bulnes and unassigned cfm Dec 9, 2022
@gonzalo-bulnes
Contributor

The explanation in the commit message makes sense to me @cfm, and I agree that, while not ideal (and somewhat unintelligible without context), received_from_the_server => successfully_sent_to_the_server is sound.

I think you're making the right decision in favoring the smaller, simpler change over another one that requires additional design. We'll be in a better place for design when we can take a few steps back, rather than designing a solution to a problem that might not even occur anymore once we look at the process in its entirety. 👍 👍

I'll do a round of testing tomorrow. (And might ask you for a refresher on the onion addresses config, let's see how precisely I remember the procedure!)

@gonzalo-bulnes
Contributor

  1. On your test server, watch /var/log/apache2/journalist-access.log.
  2. Create a new source and submit a message. ℹ️ The source is called ethnological remark.
  3. Log into the Client and wait for sync.
  4. Send a reply and wait for it to send.
  5. Observe exactly 1 "POST /api/v1/sources/<uuid>/replies HTTP/1.1" 201.
  6. Disconnect sd-whonix from sys-firewall.
  7. Send at least n >= 1 more replies. ℹ️ I sent two.
  8. Quit the Client.
  9. Reconnect sd-whonix to sys-firewall.
  10. Observe exactly 1 "POST /api/v1/sources/<uuid>/replies HTTP/1.1" 201. ℹ️ That reply, the first of the two I sent, made it to the server and was visible to the source, as expected.
  11. Wait several seconds.
  12. Start the Client again.
  13. Log in again and wait for sync.
  14. Observe only the first message from step (5) has succeeded... 🍊 As in the previous test, the first message (the one that gets sent) never gets "seen" check marks until the Client is restarted, which, again, I wouldn't worry about in the scope of this PR.
  15. ...and the rest are marked Failed to send.
  16. Repeat steps (1)‒(15), but log out (and back in) rather than quitting and restarting the Client.
  17. Observe the same results.

However: When the network is reconnected, the reply can be shown as failed for some time, until a "sync" turns the body and avatar purple but leaves the "(i) Failed to send" label in place 😞

I'm skipping detailing steps to reproduce because I'm not sure I understand the basic connection behavior. (The GUI messaging around connection state doesn't seem to match what I know of the connection between sd-whonix and sys-firewall. I'm not sure if that's related to this branch, and I suspect it might not be. I think the "sync" sometimes gets stuck until new source messages come in; either way, I see flashes of "failed" replies that then turn "sent".) Let's talk about that, and maybe we can 🍐 pair on characterizing those error scenarios better, @cfm.

cfm added 9 commits December 19, 2022 17:36
This test case currently fails by design, as a reproduction of the
low-level behavior responsible for bug #1420.
…reads

ClearQueueJob has higher priority than all other defined jobs, which it
preempts.  ClearQueueJob clears the queue; then, like PauseQueueJob, it
returns from the RunnableQueue.process() loop.  The processing threads
will then exit without ApiJobQueue.stop() having to block on
QThread.wait().
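As an aside, here is a toy model of that preemption behavior, using Python's standard queue.PriorityQueue rather than the project's actual RunnableQueue, with made-up priority values:

```python
# Toy sketch: a highest-priority "clear" job drains the queue and stops the
# processing loop, so the worker thread exits on its own instead of the
# caller blocking on QThread.wait().  Not the real securedrop-client code.
import itertools
import queue

CLEAR_QUEUE_PRIORITY = 0   # preempts every other job
PAUSE_QUEUE_PRIORITY = 1
SEND_REPLY_PRIORITY = 5

_counter = itertools.count()  # tie-breaker so equal priorities stay FIFO


class RunnableQueue:
    def __init__(self) -> None:
        self._queue: queue.PriorityQueue = queue.PriorityQueue()

    def add_job(self, priority: int, job: str) -> None:
        self._queue.put((priority, next(_counter), job))

    def process(self) -> None:
        """Run jobs until a clear or pause job is seen."""
        while True:
            priority, _, job = self._queue.get()
            if priority == CLEAR_QUEUE_PRIORITY:
                # Drop everything still waiting, then stop processing.
                while not self._queue.empty():
                    self._queue.get_nowait()
                return
            if priority == PAUSE_QUEUE_PRIORITY:
                return
            print(f"processing {job}")


if __name__ == "__main__":
    q = RunnableQueue()
    q.add_job(SEND_REPLY_PRIORITY, "SendReplyJob-1")
    q.add_job(SEND_REPLY_PRIORITY, "SendReplyJob-2")
    q.add_job(CLEAR_QUEUE_PRIORITY, "ClearQueueJob")
    q.process()  # the clear job runs first; nothing else is processed
```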
… state

Since #750, application-level state transitions (logging in, logging
out, and switching into offline mode) call
securedrop_client.storage.mark_all_pending_draft_replies_as_failed().
However, one or more SendReplyJobs (and their underlying POST requests
to "/sources/<source_uuid/replies>") may be in flight on the network at
that time, whether or not the application is connected (or even running)
to receive their responses.  Until we have better (ideally generalized)
logic around upload jobs that have not yet been confirmed by the server,
these application-level events should not make assumptions about the
results of jobs that have already been dispatched.

Individual SendReplyJobs still call their own _set_status_to_failed()
method on non-timeout exceptions.
When a SendReplyJob actually starts to send (POST) a reply, take
ownership of it by marking it with the current PID so that we can update
"pending" versus "failed" states in subsequent Client sessions.
…der previous PIDs

If a DraftReply was marked "pending" by a previous Client session (i.e.,
with a different PID), by now it must have either failed or been sent
successfully.  To avoid leaving it in an inconsistent state, we presume
successfully.  To avoid leaving it an an inconsistent state, we presume
it failed; we'll update it anyway if the Journalist API reports it
received on the next sync.
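A toy sketch of the PID-ownership idea described in the two commits above: a SendReplyJob stamps the draft with the sending process's PID, and a later session marks as failed any draft left pending under a different PID, to be corrected by the server on the next sync if it did arrive. The DraftReply fields and helper names here are illustrative only, not the real schema or migration:

```python
# Hypothetical sketch of PID-based ownership of pending draft replies.
import os
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class DraftReply:
    uuid: str
    send_status: str = "PENDING"
    sending_pid: Optional[int] = None


def take_ownership(draft: DraftReply) -> None:
    """Called when a SendReplyJob actually starts to POST the reply."""
    draft.sending_pid = os.getpid()


def fail_drafts_from_previous_sessions(drafts: Iterable[DraftReply]) -> None:
    """On startup, presume pending drafts owned by other PIDs have failed."""
    for draft in drafts:
        if draft.send_status == "PENDING" and draft.sending_pid not in (None, os.getpid()):
            draft.send_status = "FAILED"


if __name__ == "__main__":
    old = DraftReply("left-over", sending_pid=os.getpid() + 1)  # a different, earlier session
    new = DraftReply("fresh")
    take_ownership(new)
    fail_drafts_from_previous_sessions([old, new])
    assert old.send_status == "FAILED" and new.send_status == "PENDING"
```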
Now that pending replies have their statuses updated by signal (when
sync succeeds and when queue processing stops), testing this behavior
coupled to logout no longer makes sense and is difficult to implement
idiomatically.  These paths are tested adequately here in the controller
layer in test_Controller_update_failed_replies() and in the storage
layer in test_storage.test_mark_pending_replies_as_failed().
Fixes a partition[1][2] between SpeechBubble._update_text(), called when
a reply has been downloaded from the Journalist API, and
ReplyWidget._on_reply_success(), called when a reply has been
successfully sent via the SDK.  In the case where the Client starts with
a reply still pending from a previous session and the reply is
subsequently downloaded, ReplyWidget.status will not be updated from
Reply.send_status unless it too listens for the SpeechBubble's
update_signal, and future calls to ReplyWidget._update_styles() will be
based on a stale ReplyWidget.status.

While conflating these signals is not ideal, in the current data model
we might as well, since a reply downloaded from the server is by
definition one that has been successfully sent to the server.  See
<#1493 (comment)>.

[1]: #1486 (review)
[2]: #1486 (comment)
@cfm
Member Author

cfm commented Dec 20, 2022

#1486 (comment):

However: When the network is reconnected, the reply can be shown as failed for some time, until a "sync" turns the body and avatar purple but leaves the "(i) Failed to send" label in place 😞

Tomorrow @gonzalo-bulnes and I will try to reproduce this case with some instrumentation patched in. Depending on our findings, I have a local branch already rebased against current main that I can force-push here for a quick turnaround.

@cfm cfm force-pushed the 1457-pending-replies branch from e313fc9 to f0040bc Compare December 22, 2022 03:11
@cfm
Member Author

cfm commented Dec 22, 2022

f0040bc is rebased to clear conflicts after #1604, after extensive testing with @gonzalo-bulnes to make sure that e313fc9, from which it does not differ substantively, was actually the head under test.

Contributor

@gonzalo-bulnes gonzalo-bulnes left a comment

The unexpected and inconsistent states are cleared by f0040bc. 💮

I'll approve the PR based on extensive testing with @cfm. :shipit:

@gonzalo-bulnes gonzalo-bulnes merged commit 4ef79d8 into main Dec 22, 2022
@gonzalo-bulnes gonzalo-bulnes deleted the 1457-pending-replies branch December 22, 2022 03:41
@gonzalo-bulnes
Contributor

🎩 @cfm!

cfm added a commit that referenced this pull request Jan 5, 2023
After #1611, specifically 64293d1, this "ignore" pragma is unnecessary
and therefore invalid in strict mode.  It was introduced in #1486, which
was reviewed and merged while #1611 was still in review.
@gonzalo-bulnes gonzalo-bulnes mentioned this pull request Mar 14, 2023
@rocodes rocodes mentioned this pull request Mar 14, 2023