enumerate {source,submission,reply} UUIDs, then fetch subsets #1549

Open
cfm opened this issue Aug 25, 2022 · 6 comments

cfm commented Aug 25, 2022

Description

securedrop-client's MetadataSyncJob currently fetches in order from the Journalist API's /{sources,submissions,replies} endpoints, then downloads individual submissions and replies. To take just the /sources endpoint as an example, on a development server with 350 sources, this is a significant (~1 MiB) response to fetch in every sync cycle:

cfm@ozymandias ~ % curl http://localhost:8081/api/v1/sources > /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  919k  100  919k    0     0  1206k      0 --:--:-- --:--:-- --:--:-- 1217k

Instead, securedrop-client could fetch from a more-compact (~15 KiB) /source_uuids endpoint:

cfm@ozymandias ~ % curl http://localhost:8081/api/v1/source_uuids | head
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 15771  100 15771    0     0   430k      0 --:--:-- --:--:-- --:--:--  592k
{
  "sources": [
    "d4436ea5-36b5-4158-aaab-3467b583bda2", 
    "8e7659d6-3617-4018-84e7-8cb455742d91", 
    "8a2dc427-49b0-4189-b9b6-b6c4e5f4ff6e", 
    "e93039a4-df68-4332-b5d7-0d8dfdbae3a3", 
    "1acadb6d-2c69-4651-b2da-6c46142acbc3", 
    "ac9ffda4-2967-40de-8256-39afd6405d83", 
    "45b54372-c91a-4c5f-8527-633f3cc84080", 
    "d53f0eb8-7652-4ed9-9692-5e98fc82f441"

securedrop-client could update its local datastore with the current set of source_uuids, then issue follow-up requests to /sources/<source_uuid> for arbitrarily sized subsets of those UUIDs. This would offer a compromise between the pagination pattern, which is challenging to reconcile with our UUID-based keying and with deletion operations initiated from both the server and the client sides; and the request-bundle pattern, which does not scale to large numbers of sources (and submissions and replies). (Thanks to @gonzalo-bulnes for reminding me to check the literature here; other references welcome!)
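
The client side of this could be as simple as the following sketch. (Illustrative only: /source_uuids is the endpoint proposed above, BATCH_SIZE is arbitrary, and an in-memory dict stands in for securedrop-client's local datastore.)

from typing import Dict, Set

import requests

API_ROOT = "http://localhost:8081/api/v1"  # development server, as above
BATCH_SIZE = 50  # arbitrary; tunable to trade request count against response size


def sync_sources(datastore: Dict[str, dict]) -> None:
    # 1. Enumerate: one compact (~15 KiB) request for the full set of UUIDs.
    server_uuids: Set[str] = set(
        requests.get(f"{API_ROOT}/source_uuids").json()["sources"]
    )

    # 2. Reconcile deletions: anything local but absent from the server is gone.
    for uuid in set(datastore) - server_uuids:
        del datastore[uuid]

    # 3. Fetch subsets: only the missing sources, in arbitrarily sized batches.
    missing = sorted(server_uuids - set(datastore))
    for i in range(0, len(missing), BATCH_SIZE):
        for uuid in missing[i : i + BATCH_SIZE]:
            datastore[uuid] = requests.get(f"{API_ROOT}/sources/{uuid}").json()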

In future versions of this API, the /source_uuids endpoint could also indicate a per-record version, so that securedrop-client (or other consumers) would know what subset of source_uuids needed to be refreshed. Or freedomofpress/securedrop#5104 may point towards other, richer synchronization strategies between the Journalist API and its consumers.

How will this impact SecureDrop users?

A lighter-weight metadata-fetching strategy would be faster, especially across Tor; more responsive to new data, especially on heavily loaded instances; and more resilient to network hiccups.

How would this affect the SecureDrop Workstation threat model?

No Workstation-level threat-model implications.

See also

cfm commented Aug 25, 2022

The transcripts and numbers above come from a development server running with this diff.
diff --git a/securedrop/journalist_app/api.py b/securedrop/journalist_app/api.py
index f1b96531f..64d8adf00 100644
--- a/securedrop/journalist_app/api.py
+++ b/securedrop/journalist_app/api.py
@@ -153,11 +153,15 @@ def make_blueprint(config: SDConfig) -> Blueprint:
             return abort(403, "Token authentication failed.")
 
     @api.route("/sources", methods=["GET"])
-    @token_required
     def get_all_sources() -> Tuple[flask.Response, int]:
         sources = Source.query.filter_by(pending=False, deleted_at=None).all()
         return jsonify({"sources": [source.to_json() for source in sources]}), 200
 
+    @api.route("/source_ids", methods=["GET"])
+    def get_all_source_ids() -> Tuple[flask.Response, int]:
+        sources = Source.query.filter_by(pending=False, deleted_at=None).all()  # could SELECT only Source.uuid
+        return jsonify({"sources": [source.uuid for source in sources]}), 200
+
     @api.route("/sources/<source_uuid>", methods=["GET", "DELETE"])
     @token_required
     def single_source(source_uuid: str) -> Tuple[flask.Response, int]:

With this "back-of-the-envelope" sketch done, I won't do any further work on this until the team has had a chance to consider it along with the other architectural questions raised by freedomofpress/securedrop#5104. That said, as we refine our architectural decision-making process, I'll be happy to write up a proposal for formal consideration at any time.

zenmonkeykstop commented Aug 25, 2022

We also have access to 'last_updated' for sources, so we could compare that with the local value to determine which sources have changed server-side and need to be resynced. Initial syncs on large datasets would still suck, but subsequent syncs would be pretty minimal.
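
For instance (a sketch, assuming the enumeration response were extended to carry last_updated per source, which it does not today):

from datetime import datetime
from typing import Dict, List


def stale_uuids(local: Dict[str, datetime], server: Dict[str, datetime]) -> List[str]:
    """UUIDs whose server-side last_updated is newer than, or missing from, the local copy."""
    return [
        uuid
        for uuid, last_updated in server.items()
        if uuid not in local or last_updated > local[uuid]
    ]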

nabla-c0d3 commented Aug 28, 2022

The pattern I recommend for this would be to have:

  • /sources return a paginated list of the "summary view" of each source. Compared to having a specific /source_uuids endpoint, this approach gives you more flexibility, as you can put any source field you need in the response (UUID, last time updated, etc.). Of course the number of fields should be limited in order to keep the response small.

  • /sources/<uuid> return a "detailed view" of a specific source.

For example, the responses could look like this:

from dataclasses import dataclass
from datetime import datetime
from typing import List
from uuid import UUID


# For /sources, which will return a MultipleSourcesResponse

@dataclass
class SourceSummaryView:
    uuid: UUID
    date_last_updated: datetime
    # Any other field needed when we show a list of sources in the UI; but keep it light...


@dataclass
class MultipleSourcesResponse:
    sources: List[SourceSummaryView]


# For /sources/<uuid>, which will return a SingleSourceResponse

@dataclass
class SingleSourceResponse:
    uuid: UUID
    date_last_updated: datetime
    filesystem_id: str
    interaction_count: int
    # Any other fields needed when we show a single source in the UI
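
On the server side, a paginated summary endpoint along these lines could slot into the api.py shown in the diff above. (A sketch under assumptions: the page/per_page query parameters are a convention, not part of the current API; it relies on Flask-SQLAlchemy's paginate() and on the imports already present in api.py, such as flask.request and jsonify.)

@api.route("/sources", methods=["GET"])
@token_required
def get_source_summaries() -> Tuple[flask.Response, int]:
    page = request.args.get("page", default=1, type=int)
    per_page = min(request.args.get("per_page", default=50, type=int), 100)
    # Same base query as get_all_sources(), but paginated and trimmed to a summary.
    results = Source.query.filter_by(pending=False, deleted_at=None).paginate(
        page=page, per_page=per_page, error_out=False
    )
    summaries = [
        {
            "uuid": source.uuid,
            "date_last_updated": source.last_updated.isoformat() if source.last_updated else None,
        }
        for source in results.items
    ]
    return jsonify({"sources": summaries, "total": results.total, "page": page}), 200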

eloquence commented

> We also have access to 'last_updated' for sources, so we could compare that with the local value to determine which sources have changed server-side and need to be resynced.

Currently, last_updated is bumped only when there is Source activity, so relying on last_updated would not be sufficient to pick up new journalist replies, starring, or other metadata changes such as "seen" activity. See also freedomofpress/securedrop#4683

zenmonkeykstop commented

With an individual API call verifying whether a source conversation has changed since a given timestamp, it might be worth looking more at batching API requests; a syncing operation would then consist of:

  1. a call fetching all new sources since a given time, and
  2. multiple batches of individual /sources/<uuid>/since/<time> calls, with responses selecting from ("no changes since <time>", "no such source (delete it!)", "hey, this source changed").

With changed sources flagged as such in the client db, you could then flag them in the list view and make subsequent batch calls to update them. Or, if you were really lazy/bandwidth-conscious, you could defer updating them until they were actually displayed in the conversation view (the downside being a display lag when a user selects an updated source).

This is straight-up request bundling in terms of the patterns mentioned above, but it does have the advantages of a) being easy to reason about, because you can still think in terms of API calls for individual source conversations, and b) being a straightforward operation with minimal data transfer for unchanged or deleted sources.

This is pretty similar to what @cfm originally suggested, tbh, but it also provides tunability (the batch size can be configurable) and reduces the number of follow-up calls to the bare minimum.
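
For instance, the per-source check in step 2 might handle the three responses like this (a sketch: the /sources/<uuid>/since/<time> endpoint is hypothetical, and mapping the three cases onto HTTP status codes is just one possible encoding):

import requests

API_ROOT = "http://localhost:8081/api/v1"


def check_source(datastore: dict, uuid: str, since: str) -> None:
    response = requests.get(f"{API_ROOT}/sources/{uuid}/since/{since}")
    if response.status_code == 304:
        pass  # "no changes since <time>": nothing to transfer
    elif response.status_code == 404:
        del datastore[uuid]  # "no such source": delete it locally
    elif response.status_code == 200:
        datastore[uuid]["changed"] = True  # "hey, this source changed": flag for a batch update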

zenmonkeykstop commented

(...but all of this would, as @rocodes points out elsewhere, be a pretty big change to the current sync behaviour, which basically seems to mirror the sources, replies, submissions, and journalists tables en masse. It is not a quick win.)
