Improve download file handling #737

rmol · 2020-01-28T00:36:54Z

Description

Eliminate File.original_file column and hard links to downloaded files. Now as files are downloaded and decrypted, File.filename will be updated.

Instead of downloading directly into the root of the data directory, create a directory therein for each downloadable object, named with its UUID, and process downloaded files in that directory. This should
eliminate collisions if multiple sources upload documents with the same names.

Fixes #714.

Test Plan

Start a SecureDrop development server with make dev
Visit the source interface and upload a file.
Run the client with run.sh. Find your source in the list and:
- download the file
- verify that it can be opened in a disp VM by clicking on it
- verify that it can be exported
- verify that a metadata sync does not change the filename
In the shell, confirm that exactly one file is present under the $SDC_HOME/data_dir/File/{file-uuid}/ directory, with no files hard linked to it.

Checklist

If these changes modify code paths involving cryptography, the opening of files in VMs or network (via the RPC service) traffic, Qubes testing in the staging environment is required. For fine tuning of the graphical user interface, testing in any environment in Qubes is required. Please check as applicable:

I have tested these changes in the appropriate Qubes environment
I do not have an appropriate Qubes OS workstation set up (the reviewer will need to test these changes)
These changes should not need testing in Qubes

If these changes add or remove files other than client code, packaging logic (e.g., the AppArmor profile) may need to be updated. Please check as applicable:

I have submitted a separate PR to the packaging repo
No update to the packaging logic (e.g., AppArmor profile) is required for these changes
I don't know and would appreciate guidance

sssoleileraaa · 2020-01-28T18:10:02Z

looks like ci is failing due to:

Ensure that build-requirements.txt and requirements.txt are in sync.00:00
Exit code: 1

#!/bin/bash -eo pipefail
cd ~/project
# Return 1 if unstaged changes exist (after `make requirements` in the
# previous run step), else return 0.
git diff --quiet


Exited with code exit status 1

sssoleileraaa · 2020-01-28T18:17:25Z

more info:

circleci@4bfc3f883da5:~/project$ git diff
diff --git a/build-requirements.txt b/build-requirements.txt
index fc41b39..4529c32 100644
--- a/build-requirements.txt
+++ b/build-requirements.txt
@@ -1,15 +1,15 @@
-alembic==1.0.2 --hash=sha256:14024bd47f71d8b51920721dcd63248d07d370fbd0f6af
a9bec67b9edaf71f36
-arrow==0.12.1 --hash=sha256:5ef4a593615dc61ed85e62070b1bd27c71f7266233f0f9f
385b651370e8c6760
-certifi==2018.10.15 --hash=sha256:a5471c55b011bd45d6155f5c3629310c1d2f1e1a5
a899b7e438a223343de583d
-chardet==3.0.4 --hash=sha256:9f178988ca4c86e8a319b51aac1185b6fe5192328eb5a1
63c286f4bf50b7b3d8
-idna==2.7 --hash=sha256:954e65e127d0433a352981f43f291a438423d5b385ebf643c70
fd740e0634111
-mako==1.0.7 --hash=sha256:87ee3f74ba3ea544e683a5a22e7e34f4d1cf3ad34414b5f38
58becf00facf1d6
-markupsafe==1.0 --hash=sha256:6a7078a2fb4406458d6ae3579e4eb01a9bdc0a9a0686a
28fa50c19a039e3fcb8 --hash=sha256:3c9bf8fb4c3cf7dd11fd465132156d4c3cddb926d39bdb
d0f0bf5920fd8009a4
-pathlib2==2.3.2 --hash=sha256:8e276e2bf50a9a06c36e20f03b050e59b63dfe0678e37
164333deb87af03b6ad
-python-dateutil==2.7.5 --hash=sha256:56f285e7fad54cde3e31dc68a31a861543bfee
5ada9278da8e85ec20a8f72912
-python-editor==1.0.3 --hash=sha256:44fc57a6db6e04c7922c37a04d0a86d0024a4f0f
06245b6c57638cb322176202
-requests==2.20.0 --hash=sha256:2a539dd6af40a611f3b8eb3f99d3567781352ece1698
b2fab42bf4c2218705b5
-securedrop-sdk==0.0.12 --hash=sha256:274beb38dccd91988b45517e6479863f80bb31
e00953695e61dd8103509f2337
-six==1.11.0 --hash=sha256:4663c7a1dbed033cfb294f2d534bd6151c0698dc12ecabb4e
aa3cb041d758528
-sqlalchemy==1.3.3 --hash=sha256:bd030ff97e7a4f3aa34aafa6b62898c7de6999784c8
b4828b3e8b31cf69dae9c --hash=sha256:680b2ae6a728941c9fe661c85f2309a69408e7ec8ed8
fa39d03e07595259b75b
-urllib3==1.24.3 --hash=sha256:028309393606e28e640e2031edd27eb969c94f9364b08
71912608aaa8e66c96e
+alembic==1.0.2 --hash=sha256:99cc931e11dbef6e41e9376f18be62fc90fe4be9
c541eac1b30a3455b3d655f3
+arrow==0.12.1 --hash=sha256:fc8c8e0b587d00f38986bc161f4496e000acea033
fe2ce25f4f5bffa9ae53a7c
+certifi==2018.10.15 --hash=sha256:173b19dd31ca7faa50d1fcc0eaf30f5e32e
8e99e17d8c7fd4cfc8bc8d94e18a6
+chardet==3.0.4 --hash=sha256:f5632e583a4f61f1e16d0cc98127d241fb11c3c6
ddfddee159307d4215186837
+idna==2.7 --hash=sha256:491f674364ba3232ed1eb4c1eb7407887f62cef6c300a
ad7df6e01acd88ffb25
+mako==1.0.7 --hash=sha256:614c22fe1a5b0a3f46f6c5c43ff2e6795e4e784328d
559ec9dc49db0f06b3a75
+markupsafe==1.0 --hash=sha256:c6b726d2e9d6300a044cf6a37627f10994268d6
ac39464bc0d725126609311a5
+pathlib2==2.3.2 --hash=sha256:460e67b14d0574b0529a0017b1eb05d10d97226
81e303fec7077c2a628de60c1
+python-dateutil==2.7.5 --hash=sha256:f6eb9c17acd5a6954e1a5f2f999a41de
3e7e25b6bc41baf6344bd053ec25ceeb
+python-editor==1.0.3 --hash=sha256:e47dcec4ea883853b8196fbd425b875d7e
c791d4ede2e20cfc70b9a25365c65b
+requests==2.20.0 --hash=sha256:d87b2085783d31d874ac7bc62660e287932aae
e7059e80b41b76462eb18d35cc
+securedrop-sdk==0.0.12 --hash=sha256:d05bb78652c8771e6aa1aefcd76ade1f
ef08c563d2641acbc5ac8e1d635e6a53
+six==1.11.0 --hash=sha256:aa4ad34049ddff178b533062797fd1db9f0038b7c5c
2461a7cde2244300b9f3d
+sqlalchemy==1.3.3 --hash=sha256:bfb4cd0df5802a01acd738a080a19e60ff470
0e030d497de273807f9a8bd4a0a
+urllib3==1.24.3 --hash=sha256:3d440cbb168e2c963d5099232bdb3f7390bf031
b6270dad1bc79751698a1399a
circleci@4bfc3f883da5:~/project$

sssoleileraaa · 2020-01-28T20:56:02Z

securedrop_client/api_jobs/downloads.py

+                    uuid=self.uuid,
+                    session=session,
+                    content=plaintext_file.read())
+            finally:


you can remove this cleanup since we use a temporary file that automatically gets removed (see how when you remove this block of code, there is no decryption file?

The temporary file is gone, but because I'm doing the work in subdirectories under data_dir, instead of at the top level, the folder named after the UUID of the message or reply will be left behind, empty. This cleanup keeps those from accumulating and possibly affecting performance.

ah i see this makes sense.

just out of curiosity, why did you chose not to decrypt to the same uuid directory, e.g. instead of:

Message/baf388ff-dfbd-45e1-a629-bf9e761b1ac8 Reply/baf388ff-dfbd-45e1-a629-bf9e761b1ac8 File/baf388ff-dfbd-45e1-a629-bf9e761b1ac8/example_submission.pdf

why not:

baf388ff-dfbd-45e1-a629-bf9e761b1ac8/example_submission.pdf baf388ff-dfbd-45e1-a629-bf9e761b1ac8/<temporary message and reply files that are removed>

ah maybe it's to avoid having an empty uuid directory if there are only messages and replies and not files?

but i guess you could just decrypt them to the data dir and then there wouldn't need to be any folder creation and deletion cleanup. I prefer this way because it keeps the code simpler, so to be clear we would just have these in the data directory:

baf388ff-dfbd-45e1-a629-bf9e761b1ac8/example_submission.pdf <temporary message and reply files downloded with the server file name `<count>-<jojurnalist designation>-<type of file>` removed once decrypted>

The reason for including the class in the path is that UUIDs are only unique within a table, so if we were writing them into the root of the data directory, we could theoretically see collisions. I know, I know, UUIDs, but this approach uses the database constraints to guarantee it can't happen.

To recap our discussion from standup this morning, there were generally three ideas kicking around (we went with the third idea):

keep storing downloaded messages and replies in the data directory and if they fail to decrypt, that's where we'll find them, and store submission files in the data directory until they are successfully decrypted into a directory with a unique name, either UUID or the server filename.

Pros:

If messages and replies successfully decrypt, we don't have to delete any folders

Cons:

files are not organized under a journalist designation, so files that are part of the same conversation would have to be collected in various places if we want to one day export multiple files and messages.

maintaining two different data directory structures in securedrop-export and securedrop-client, more cognitive overload

store all messages and replies in their respective <Type>/<UUID> directories

Pros:

the UUID never changes, so we don't have to change the name of the directory when the journalist designation changes

if for some reason the journalist designation name change fails when we update the filesystem, we can still easily delete messages and files easily because they will be organized under the UUID (this is relevant removing a source or individual submissions)

Cons:

(same cons as above, also a note about how we already have to change the filename if the journalist designation changes)

maintain the same data directory structure as securedrop-export (see https://github.com/freedomofpress/securedrop-export/blob/master/README.md#export-archive-format)

Pros:

same directory structure as we (plan to) use in the securedrop-export, less to remember, and if we find a problem with the structure we can easily fix it in both places

files are organized under a journalist designation, so files that are part of the same conversation would be easier to collect for a multi-file export or a full-source conversation export

it's more human readable

Cons:

We have to change the name of the folders in addition to the files if the journalist designation changes

One additonal thought I'm having now is that we might want to consider storing messages directly under the journalist designation name, instead of create a subdirectory of the server filename, since they will always be text files and have the same server filename as the subdirecotry.

So instead of:

├── cytotoxic payer │ ├── 1-cytotoxic-payer-msg │ │ └── 1-cytotoxic-payer-msg.txt │ ├── 2-cytotoxic-payer-msg │ │ └── 2-cytotoxic-payer-msg.txt │ └── 3-cytotoxic-payer-doc │ │ └── original-filename.doc ├── grandiloquent pasteboard │ └── 1-grandiloquent-pasteboard-doc │ │ └── original-filename.pdf └── snug seek

I'm proposing:

├── cytotoxic payer │ ├── 1-cytotoxic-payer-msg.txt │ ├── 2-cytotoxic-payer-msg.txt │ └── 3-cytotoxic-payer-doc │ │ └── original-filename.doc ├── grandiloquent pasteboard │ └── 1-grandiloquent-pasteboard-doc │ │ └── original-filename.pdf └── snug seek

But this is a minor detail, and shouldn't hold up this PR since this has yet to be implemented in securdrop-export.

Ok, that's the recap, plus one additional thought!

One of the requirements in #714 was to remove File.original_filename, instead replacing File.filename with the submission's original filename when the submission was unpacked. That's going to make the hierarchy above impossible, since we won't have 1-grandiloquent-pasteboard-doc in the File once it's been processed.

Keeping the original_filename column would enable this structure.

ok so we discussed this a bit and @creviera is gonna pick up freedomofpress/securedrop#4304 to remove the renaming of journalist_designation entirely, we'll pick into the next core/server minor release. One thing we thought about when we discussed this is that the metadata sync (and file rename) and file/message/reply download can occur concurrently, so it's definitely worth removing the rename to avoid racing behavior (i.e. to avoid guarding against such behavior). This way we can just use the securedrop-export structure everywhere.

so: for the remainder of this review, we can assume that this will be tested with a version of securedrop server that lacks journalist_designation renaming

and for those not part of the discussion today, we agreed that it makes sense to stick with idea #3 (see recap of all the three ideas we discussed and their pros and cons from Wednesday's standup: #737 (comment))

(ignore the #3 link... forgot that would link to an issue)

sssoleileraaa

i did a first pass

securedrop_client/api_jobs/downloads.py

sssoleileraaa · 2020-01-28T21:06:19Z

securedrop_client/crypto.py

@@ -124,8 +124,12 @@ def decrypt_submission_or_reply(self,

            # Store the plaintext in the file located at the specified plaintext_filepath
            if is_doc:
-                original_filename = read_gzip_header_filename(out.name)
-                with gzip.open(out.name, 'rb') as infile, open(plaintext_filepath, 'wb') as outfile:
+                original_filename = read_gzip_header_filename(out.name) or plaintext_filepath


since you're using the filepath as the default filename here, in case a file name is missing from the header, i think you'll have to clean up our code elsewhere that has logic to use the server filename, or perhaps you did that? i haven't come across it yet or maybe i missed it.

sssoleileraaa · 2020-01-28T21:08:36Z

securedrop_client/logic.py

@@ -679,39 +641,34 @@ def export_file_to_usb_drive(self, file_uuid: str, passphrase: str) -> None:
        is_downloaded is set to False.
        '''
        file = self.get_file(file_uuid)
-        logger.info('Exporting file {}'.format(file.original_filename))


looks like you accidentally removed a log line

just a nit but i think we should add this log line back

securedrop_client/storage.py

rmol · 2020-01-31T00:41:06Z

Rebased, changed the naming scheme in the data directory to be human readable, eliminated the renaming of files when source journalist designation changes.

Did not change the export archive format, as we decided to do that in a separate issue.

Eliminate File.original_file column and hard links to downloaded files. Now as files are downloaded and decrypted, File.filename will be updated. Instead of downloading directly into the root of the data directory, create a directory therein for each downloadable object, named with its UUID, and process downloaded files in that directory. This should eliminate collisions if multiple sources upload documents with the same names.

Base directory and file names in the data directory on the journalist designation of their sources, plus file counters and extensions, instead of UUIDs. Remove handling of changes to source journalist designations on the server; local files will no longer be renamed. The ability to change journalist designations will soon be removed from SecureDrop core, so this will not be needed.

sssoleileraaa

ran through the test plan and things work as expected

looks like we're not removing directories and files with old journalist designations after a change, but since we're removing this name change feature i think it makes sense not to spend cycles support it here

rmol requested review from sssoleileraaa, kushaldas and redshiftzero as code owners January 28, 2020 00:36

rmol force-pushed the 714-improve-download-handling branch from 32da4a1 to 6813852 Compare January 28, 2020 20:47

sssoleileraaa reviewed Jan 28, 2020

View reviewed changes

sssoleileraaa suggested changes Jan 28, 2020

View reviewed changes

sssoleileraaa mentioned this pull request Jan 30, 2020

Make sync continuous #739

Merged

rmol force-pushed the 714-improve-download-handling branch from 6813852 to bcd4eab Compare January 31, 2020 00:38

rmol added 2 commits January 31, 2020 15:28

rmol force-pushed the 714-improve-download-handling branch from bcd4eab to d6c8320 Compare January 31, 2020 20:30

sssoleileraaa approved these changes Jan 31, 2020

View reviewed changes

sssoleileraaa merged commit 078c3d0 into master Feb 1, 2020

sssoleileraaa deleted the 714-improve-download-handling branch February 1, 2020 00:10

redshiftzero mentioned this pull request Feb 4, 2020

Downloaded files removed when updated #758

Closed

eloquence mentioned this pull request Feb 5, 2020

AppArmor prevents directory creation, causing sync failure #776

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve download file handling #737

Improve download file handling #737

rmol commented Jan 28, 2020

sssoleileraaa commented Jan 28, 2020

sssoleileraaa commented Jan 28, 2020

sssoleileraaa Jan 28, 2020

rmol Jan 28, 2020

sssoleileraaa Jan 28, 2020

sssoleileraaa Jan 28, 2020 •

edited

Loading

rmol Jan 29, 2020

sssoleileraaa Jan 29, 2020 •

edited

Loading

rmol Jan 30, 2020

redshiftzero Jan 30, 2020

sssoleileraaa Jan 30, 2020

sssoleileraaa Jan 30, 2020

sssoleileraaa left a comment

sssoleileraaa Jan 28, 2020

sssoleileraaa Jan 28, 2020

rmol Jan 29, 2020

sssoleileraaa Jan 31, 2020

rmol commented Jan 31, 2020

sssoleileraaa left a comment

Improve download file handling #737

Improve download file handling #737

Conversation

rmol commented Jan 28, 2020

Description

Test Plan

Checklist

sssoleileraaa commented Jan 28, 2020

sssoleileraaa commented Jan 28, 2020

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sssoleileraaa Jan 28, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sssoleileraaa Jan 29, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sssoleileraaa left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rmol commented Jan 31, 2020

sssoleileraaa left a comment

Choose a reason for hiding this comment

sssoleileraaa Jan 28, 2020 •

edited

Loading

sssoleileraaa Jan 29, 2020 •

edited

Loading