enumerate {source,submission,reply} UUIDs, then fetch subsets #1549

Open
cfm opened this issue Aug 25, 2022 · 6 comments

cfm commented Aug 25, 2022

Description

securedrop-client's MetadataSyncJob currently fetches in order from the Journalist API's /{sources,submissions,replies} endpoints, then downloads individual submissions and replies. To take just the /sources endpoint as an example, on a development server with 350 sources, this is a significant (~1 MiB) response to fetch in every sync cycle:

cfm@ozymandias ~ % curl http://localhost:8081/api/v1/sources > /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  919k  100  919k    0     0  1206k      0 --:--:-- --:--:-- --:--:-- 1217k

Instead, securedrop-client could fetch from a more-compact (~15 KiB) /source_uuids endpoint:

cfm@ozymandias ~ % curl http://localhost:8081/api/v1/source_uuids | head
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 15771  100 15771    0     0   430k      0 --:--:-- --:--:-- --:--:--  592k
{
  "sources": [
    "d4436ea5-36b5-4158-aaab-3467b583bda2", 
    "8e7659d6-3617-4018-84e7-8cb455742d91", 
    "8a2dc427-49b0-4189-b9b6-b6c4e5f4ff6e", 
    "e93039a4-df68-4332-b5d7-0d8dfdbae3a3", 
    "1acadb6d-2c69-4651-b2da-6c46142acbc3", 
    "ac9ffda4-2967-40de-8256-39afd6405d83", 
    "45b54372-c91a-4c5f-8527-633f3cc84080", 
    "d53f0eb8-7652-4ed9-9692-5e98fc82f441"

securedrop-client could update its local datastore with the current set of source_uuids, then issue follow-up requests to /sources/<source_uuid> for arbitrarily sized subsets of those UUIDs. This would offer a compromise between the pagination pattern, which is challenging to reconcile with our UUID-based keying and with deletion operations initiated from both the server and the client sides; and the request-bundle pattern, which does not scale to large numbers of sources (and submissions and replies). (Thanks to @gonzalo-bulnes for reminding me to check the literature here; other references welcome!)
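
The client side of this could be as simple as the following sketch. (Illustrative only: /source_uuids is the endpoint proposed above, BATCH_SIZE is arbitrary, and an in-memory dict stands in for securedrop-client's local datastore.)

from typing import Dict, Set

import requests

API_ROOT = "http://localhost:8081/api/v1"  # development server, as above
BATCH_SIZE = 50  # arbitrary; tunable to trade request count against response size


def sync_sources(datastore: Dict[str, dict]) -> None:
    # 1. Enumerate: one compact (~15 KiB) request for the full set of UUIDs.
    server_uuids: Set[str] = set(
        requests.get(f"{API_ROOT}/source_uuids").json()["sources"]
    )

    # 2. Reconcile deletions: anything local but absent from the server is gone.
    for uuid in set(datastore) - server_uuids:
        del datastore[uuid]

    # 3. Fetch subsets: only the missing sources, in arbitrarily sized batches.
    missing = sorted(server_uuids - set(datastore))
    for i in range(0, len(missing), BATCH_SIZE):
        for uuid in missing[i : i + BATCH_SIZE]:
            datastore[uuid] = requests.get(f"{API_ROOT}/sources/{uuid}").json()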

In future versions of this API, the /source_uuids endpoint could also indicate a per-record version, so that securedrop-client (or other consumers) would know what subset of source_uuids needed to be refreshed. Or freedomofpress/securedrop#5104 may point towards other, richer synchronization strategies between the Journalist API and its consumers.

How will this impact SecureDrop users?

A lighter-weight metadata-fetching strategy would be faster, especially across Tor; more responsive to new data, especially on heavily loaded instances; and more resilient to network hiccups.

How would this affect the SecureDrop Workstation threat model?

No Workstation-level threat-model implications.

See also

cfm commented Aug 25, 2022

The transcripts and numbers above come from a development server running with this diff.
diff --git a/securedrop/journalist_app/api.py b/securedrop/journalist_app/api.py
index f1b96531f..64d8adf00 100644
--- a/securedrop/journalist_app/api.py
+++ b/securedrop/journalist_app/api.py
@@ -153,11 +153,15 @@ def make_blueprint(config: SDConfig) -> Blueprint:
             return abort(403, "Token authentication failed.")
 
     @api.route("/sources", methods=["GET"])
-    @token_required
     def get_all_sources() -> Tuple[flask.Response, int]:
         sources = Source.query.filter_by(pending=False, deleted_at=None).all()
         return jsonify({"sources": [source.to_json() for source in sources]}), 200
 
+    @api.route("/source_ids", methods=["GET"])
+    def get_all_source_ids() -> Tuple[flask.Response, int]:
+        sources = Source.query.filter_by(pending=False, deleted_at=None).all()  # could SELECT only Source.uuid
+        return jsonify({"sources": [source.uuid for source in sources]}), 200
+
     @api.route("/sources/<source_uuid>", methods=["GET", "DELETE"])
     @token_required
     def single_source(source_uuid: str) -> Tuple[flask.Response, int]:

With this "back-of-the-envelope" sketch done, I won't do any further work on this until the team has had a chance to consider it along with the other architectural questions raised by freedomofpress/securedrop#5104. That said, as we refine our architectural decision-making process, I'll be happy to write up a proposal for formal consideration at any time.

zenmonkeykstop commented Aug 25, 2022

We also have access to 'last_updated' for sources, so we could compare that with the local value to determine which sources have changed server-side and need to be resynced. Initial syncs on large datasets would still suck, but subsequent syncs would be pretty minimal.
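
For instance (a sketch, assuming the enumeration response were extended to carry last_updated per source, which it does not today):

from datetime import datetime
from typing import Dict, List


def stale_uuids(local: Dict[str, datetime], server: Dict[str, datetime]) -> List[str]:
    """UUIDs whose server-side last_updated is newer than, or missing from, the local copy."""
    return [
        uuid
        for uuid, last_updated in server.items()
        if uuid not in local or last_updated > local[uuid]
    ]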

nabla-c0d3 commented Aug 28, 2022

The pattern I recommend for this would be to have:

  • /sources return a paginated list of the "summary view" of each source. Compared to having a specific /source_uuids endpoint, this approach gives you more flexibility, as you can put any source field you need in the response (UUID, last time updated, etc.). Of course the number of fields should be limited in order to keep the response small.

  • /sources/<uuid> return a "detailed view" of a specific source.

For example, the responses could look like this:

from dataclasses import dataclass
from datetime import datetime
from typing import List
from uuid import UUID


# For /sources, which will return a MultipleSourcesResponse

@dataclass
class SourceSummaryView:
    uuid: UUID
    date_last_updated: datetime
    # Any other field needed when we show a list of sources in the UI; but keep it light...


@dataclass
class MultipleSourcesResponse:
    sources: List[SourceSummaryView]


# For /sources/<uuid>, which will return a SingleSourceResponse

@dataclass
class SingleSourceResponse:
    uuid: UUID
    date_last_updated: datetime
    filesystem_id: str
    interaction_count: int
    # Any other fields needed when we show a single source in the UI
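
On the server side, a paginated summary endpoint along these lines could slot into the api.py shown in the diff above. (A sketch under assumptions: the page/per_page query parameters are a convention, not part of the current API; it relies on Flask-SQLAlchemy's paginate() and on the imports already present in api.py, such as flask.request and jsonify.)

@api.route("/sources", methods=["GET"])
@token_required
def get_source_summaries() -> Tuple[flask.Response, int]:
    page = request.args.get("page", default=1, type=int)
    per_page = min(request.args.get("per_page", default=50, type=int), 100)
    # Same base query as get_all_sources(), but paginated and trimmed to a summary.
    results = Source.query.filter_by(pending=False, deleted_at=None).paginate(
        page=page, per_page=per_page, error_out=False
    )
    summaries = [
        {
            "uuid": source.uuid,
            "date_last_updated": source.last_updated.isoformat() if source.last_updated else None,
        }
        for source in results.items
    ]
    return jsonify({"sources": summaries, "total": results.total, "page": page}), 200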

eloquence commented

> We also have access to 'last_updated' for sources, so we could compare that with the local value to determine which sources have changed server-side and need to be resynced.

Currently, last_updated is bumped only when there is Source activity, so relying on last_updated would not be sufficient to pick up new journalist replies, starring, or other metadata changes such as "seen" activity. See also freedomofpress/securedrop#4683

zenmonkeykstop commented

With an individual API call verifying whether a source conversation has changed since a given timestamp, it might be worth looking more at batching API requests; a syncing operation would then consist of:

  1. a call fetching all new sources since a given time, and
  2. multiple batches of individual /sources/<uuid>/since/<time> calls, with responses selecting from ("no changes since <time>", "no such source (delete it!)", "hey, this source changed").

With changed sources flagged as such in the client db, you could then flag them in the list view and make subsequent batch calls to update them. Or, if you were really lazy/bandwidth-conscious, you could defer updating them until they were actually displayed in the conversation view (the downside being a display lag when a user selects an updated source).

This is straight-up request bundling in terms of the patterns mentioned above, but it does have the advantages of a) being easy to reason about, because you can still think in terms of API calls for individual source conversations, and b) being a straightforward operation with minimal data transfer for unchanged or deleted sources.

This is pretty similar to what @cfm originally suggested, tbh, but it also provides tunability (the batch size can be configurable) and reduces the number of follow-up calls to the bare minimum.
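
For instance, the per-source check in step 2 might handle the three responses like this (a sketch: the /sources/<uuid>/since/<time> endpoint is hypothetical, and mapping the three cases onto HTTP status codes is just one possible encoding):

import requests

API_ROOT = "http://localhost:8081/api/v1"


def check_source(datastore: dict, uuid: str, since: str) -> None:
    response = requests.get(f"{API_ROOT}/sources/{uuid}/since/{since}")
    if response.status_code == 304:
        pass  # "no changes since <time>": nothing to transfer
    elif response.status_code == 404:
        del datastore[uuid]  # "no such source": delete it locally
    elif response.status_code == 200:
        datastore[uuid]["changed"] = True  # "hey, this source changed": flag for a batch update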

zenmonkeykstop commented

(...but all of this would, as @rocodes points out elsewhere, be a pretty big change to the current sync behaviour, which basically seems to mirror the sources, replies, submissions, and journalists tables en masse. It is not a quick win.)
