fix: limit and track processing of `SendReplyJob`s around session and network partitions #1486
Conversation
@cfm, this change is pleasantly small, yet the issue caused so much confusion during QA last release. The only gotcha I can think of is that we actually want to fail the replies in the queue that are not being processed.
Thanks, @creviera! The test plan I've outlined above gives two pieces of good news:
The bad news: We need a different mechanism to handle this situation in #1457, in which:
I believe this to be a data-reconciliation problem.
Visually reviewed (and tested, albeit off-plan :-) with @creviera. The test plan checks out as written; but #1493 represents a regression from the current behavior, not just an impetus for future refactoring. Our new hypothesis, which I'll test next week (realistically):
—so the two of them, rebased together, will let us solve #1420 at both layers. (That would then leave #1493's refactoring to help us untangle these layers.)
Our hypothesis in #1486 (comment) is right so far as it goes, but we need a bit more wiring to keep the GUI in sync with both the […]
force-pushed from 82421c7 to 72cb369
Now that this fix is no longer blocked by #1500, I'm interested in getting it back to ready for review. Whether or not we merge it, it's useful as an illustration of some technical debt I'd like us to have a shared understanding of for future architectural discussions. Necessary changes:
This is marked as "in development"; is that accurate, or should we bump it back for now?
I'm finally resuscitating this branch against current `main`.
For the record: @cfm and I walked through this PR and #1493. The vision makes sense to me. I'll summarize it later :P Edit to add: The gist of the vision beyond this PR is:
@cfm It may be obvious to you, but the CI pipeline definition has changed significantly since you opened this PR, so you won't get anything close to a meaningful CI failure until you update onto `main`.
force-pushed from 72cb369 to d75b875
Fixes a partition[1][2] between `SpeechBubble._update_text()`, called when a reply has been downloaded from the Journalist API, and `ReplyWidget._on_reply_success()`, called when a reply has been successfully sent via the SDK. In the case where the Client starts with a reply still pending from a previous session and the reply is subsequently downloaded, `ReplyWidget.status` will not be updated from `Reply.send_status` unless it too listens for the `SpeechBubble`'s `update_signal`, and future calls to `ReplyWidget._update_styles()` will be based on a stale `ReplyWidget.status`. While conflating these signals is not ideal, in the current data model we might as well, since a reply downloaded from the server is by definition one that has been successfully sent to the server. See <#1493 (comment)>. [1]: #1486 (review) [2]: #1486 (comment)
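For illustration, here's a minimal, self-contained sketch of the signal wiring described above. `ReplySketch`, its signal signature, and the status value are hypothetical stand-ins, not the Client's actual classes:

```python
from PyQt5.QtCore import QObject, pyqtSignal, pyqtSlot

class ReplySketch(QObject):
    """Hypothetical stand-in for ReplyWidget, showing only the signal wiring."""

    # Stand-in for the SpeechBubble's update_signal: (source_uuid, reply_uuid, content).
    update_signal = pyqtSignal(str, str, str)

    def __init__(self, reply_uuid: str, status: str) -> None:
        super().__init__()
        self.uuid = reply_uuid
        self.status = status
        # The gist of the fix: also treat the signal that delivers a downloaded
        # reply as a send-success signal, since a reply downloaded from the
        # server has by definition been sent successfully.
        self.update_signal.connect(self._on_reply_success)

    @pyqtSlot(str, str, str)
    def _on_reply_success(self, source_uuid: str, reply_uuid: str, content: str) -> None:
        if reply_uuid == self.uuid:
            self.status = "SUCCEEDED"  # so later style updates don't use a stale status
```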
force-pushed from c405d1d to e313fc9
@gonzalo-bulnes, I've been able to trace this bug to the discrepancy described and fixed in e313fc9. As I note there, this turns out not to be a data race, strictly speaking. Rather, the download path (`SpeechBubble._update_text()`) and the send path (`ReplyWidget._on_reply_success()`) never reconcile `ReplyWidget.status`. This could be fixed in the controller instead, with something like:

```diff
--- a/securedrop_client/logic.py
+++ b/securedrop_client/logic.py
@@ -892,6 +892,7 @@ class Controller(QObject):
         """
         self.session.commit()  # Needed to flush stale data.
         reply = storage.get_reply(self.session, uuid)
+        self.reply_succeeded.emit(reply.source.uuid, reply.uuid, reply.content)
         self.reply_ready.emit(reply.source.uuid, reply.uuid, reply.content)

     def on_reply_download_failure(self, exception: DownloadException) -> None:
```

(Emitting this signal in `on_reply_download_success()` conflates download success with send success, as noted in e313fc9's commit message.) I favor the change proposed in e313fc9 simply because the fix and the test are each a single-line change at the GUI level, while this approach would require constructing a dedicated cross-layer test case. I don't love it, though. More motivation for #1493 (comment), unless you can think of a cleaner way for now!
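For comparison, the cross-layer test that the controller-level alternative would need might look something like this sketch, assuming pytest-qt's `qtbot` plus hypothetical `controller` and `reply` fixtures:

```python
def test_on_reply_download_success_emits_both_signals(qtbot, controller, reply):
    # Both the send-success and download-success signals should fire for a
    # downloaded reply, keeping ReplyWidget.status and its styles in sync.
    with qtbot.waitSignals(
        [controller.reply_succeeded, controller.reply_ready], timeout=1000
    ):
        controller.on_reply_download_success(reply.uuid)
```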
The explanation in the commit message makes sense to me @cfm, and I agree that, while not ideal (and somewhat unintelligible without context), you're making the right decision in favoring the smaller, simpler change over another one that requires additional design. We'll be in a better place for design when we can take a few steps back, rather than designing a solution to a problem that might not even occur anymore once we look at the process in its entirety. 👍 👍 I'll do a round of testing tomorrow. (And I might ask you for a refresher on the onion-addresses config; let's see how precisely I remember the procedure!)
However: when the network is reconnected, the reply can be shown as failed for some time, until a "sync" turns the body and avatar purple… but leaves the "(i) Failed to send" label in place 😞 I'm skipping detailed steps to reproduce because I'm not sure I understand the basic connection behavior. (The GUI messaging around connection state doesn't seem to match what I know of the connection between […])
This test case currently fails by design, as a reproduction of the low-level behavior responsible for bug #1420.
…reads

`ClearQueueJob` has higher priority than all other defined jobs, which it preempts. `ClearQueueJob` clears the queue; then, like `PauseQueueJob`, it returns from the `RunnableQueue.process()` loop. The processing threads will then exit without `ApiJobQueue.stop()` having to block on `QThread.wait()`.
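A rough sketch of that preemption behavior, using `queue.PriorityQueue` for illustration; the class names mirror the commit message, but the priority values and loop body are assumptions rather than the actual `ApiJobQueue` code:

```python
import itertools
import queue

CLEAR_QUEUE_PRIORITY = 0  # assumption: lower number = higher priority, preempting all jobs

class ClearQueueJob:
    """Sentinel job: clear the queue, then stop processing."""

class RunnableQueueSketch:
    def __init__(self) -> None:
        self._queue: queue.PriorityQueue = queue.PriorityQueue()
        self._counter = itertools.count()  # tiebreaker so jobs never compare directly

    def add_job(self, priority: int, job: object) -> None:
        self._queue.put((priority, next(self._counter), job))

    def process(self) -> None:
        # Runs on a worker thread. Returning from this loop lets the thread
        # exit without ApiJobQueue.stop() having to block on QThread.wait().
        while True:
            _priority, _order, job = self._queue.get()
            if isinstance(job, ClearQueueJob):
                while not self._queue.empty():  # drop all unprocessed jobs
                    self._queue.get_nowait()
                return
            # ... otherwise, run the job against the API ...
```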
…hether `QThread.quit()` has been called
… state

Since #750, application-level state transitions (logging in, logging out, and switching into offline mode) call `securedrop_client.storage.mark_all_pending_draft_replies_as_failed()`. However, one or more `SendReplyJob`s (and their underlying POST requests to `/sources/<source_uuid>/replies`) may be in flight on the network at that time, whether or not the application is connected (or even running) to receive their responses. Until we have better (ideally generalized) logic around upload jobs that have not yet been confirmed by the server, these application-level events should not make assumptions about the results of jobs that have already been dispatched. Individual `SendReplyJob`s still call their own `_set_status_to_failed()` method on non-timeout exceptions.
When a `SendReplyJob` actually starts to send (POST) a reply, take ownership of it by marking it with the current PID, so that we can update "pending" versus "failed" states in subsequent Client sessions.
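In storage terms, the ownership step might look like this sketch, where the `sending_pid` column and the `send_status` value are hypothetical names for whatever the schema actually uses:

```python
import os

def take_ownership(session, draft_reply) -> None:
    """Mark a DraftReply with this process's PID just before its POST is
    dispatched, so any session can later tell whether the job that owns
    it could still be in flight."""
    draft_reply.send_status = "PENDING"    # hypothetical status value
    draft_reply.sending_pid = os.getpid()  # hypothetical column
    session.add(draft_reply)
    session.commit()
```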
…der previous PIDs

If a `DraftReply` was marked "pending" by a previous Client session (i.e., with a different PID), by now it must have either failed or been sent successfully. To avoid leaving it in an inconsistent state, we presume it failed; we'll update it anyway if the Journalist API reports it as received on the next sync.
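And the bootstrapping side of that contract, as a sketch with the same hypothetical `sending_pid` column, assuming a SQLAlchemy-style session and a `DraftReply` model passed in:

```python
import os

def mark_pending_replies_from_other_pids_as_failed(session, DraftReply) -> None:
    """A reply still 'pending' under a different PID must by now have either
    failed or been sent; presume failure, and let the next sync correct the
    record if the Journalist API reports the reply as received."""
    stale = (
        session.query(DraftReply)
        .filter(DraftReply.send_status == "PENDING")    # hypothetical status value
        .filter(DraftReply.sending_pid != os.getpid())  # hypothetical column
    )
    for draft in stale:
        draft.send_status = "FAILED"
    session.commit()
```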
Now that pending replies have their statuses updated by signal (when sync succeeds and when queue processing stops), testing this behavior coupled to logout no longer makes sense and is difficult to implement idiomatically. These paths are tested adequately here in the controller layer in `test_Controller_update_failed_replies()` and in the storage layer in `test_storage.test_mark_pending_replies_as_failed()`.
Tomorrow @gonzalo-bulnes and I will try to reproduce this case with some instrumentation patched in. Depending on our findings, I have a local branch already rebased against current `main`.
force-pushed from e313fc9 to f0040bc
f0040bc is rebased to clear conflicts after #1604, after extensive testing with @gonzalo-bulnes to make sure that e313fc9, from which it does not differ substantively, was actually the head under test.
🎩 @cfm!
Description
- Clears the `ApiJobQueue` when it stops, so that unprocessed jobs (e.g. `SendReplyJob`s for pending `DraftReply`s) will not continue to be processed under the session that's just logged out.
- Addresses `update_replies()`: purge or mark stale pending ("draft") replies #1493 by having the `SendReplyJob` take "ownership" of a given `DraftReply` by marking it with the current PID, so that both the current and a subsequent Client session can infer whether or not the `SendReplyJob`'s API call could still be in flight.

For more details on why all this extra machinery is necessary to address this pair of corner cases, see especially #1486 (comment) and #1457 (comment).
Test Plan
Synthesized from the reproductions of #1420 and #1457:

1. Watch `/var/log/apache2/journalist-access.log`.
2. Send a reply and confirm that the log shows `"POST /api/v1/sources/<uuid>/replies HTTP/1.1" 201`.
3. Disconnect `sd-whonix` from `sys-firewall`.
4. Send `n >= 1` more replies.
5. Reconnect `sd-whonix` to `sys-firewall`.
6. Confirm that the log shows `"POST /api/v1/sources/<uuid>/replies HTTP/1.1" 201` for each queued reply.
7. Confirm that the replies are no longer marked `Failed to send`.

Bonus: bootstrapping case

1. Delete `svs.sqlite` and reinitialize it with `alembic upgrade head`.
.Checklist
If these changes modify code paths involving cryptography, the opening of files in VMs, or network traffic (via the RPC service), Qubes testing in the staging environment is required. For fine-tuning of the graphical user interface, testing in any environment in Qubes is required. Please check as applicable:
If these changes add or remove files other than client code, the AppArmor profile may need to be updated. Please check as applicable:
If these changes modify the database schema, you should include a database migration. Please check as applicable:
- I have written a migration and upgraded a test database based on `main` and confirmed that the migration is self-contained and applies cleanly
- I have written a migration but have not upgraded a test database based on `main` and would like the reviewer to do so