-
Notifications
You must be signed in to change notification settings - Fork 42
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve download file handling #737
Conversation
looks like ci is failing due to:
|
more info: circleci@4bfc3f883da5:~/project$ git diff
diff --git a/build-requirements.txt b/build-requirements.txt
index fc41b39..4529c32 100644
--- a/build-requirements.txt
+++ b/build-requirements.txt
@@ -1,15 +1,15 @@
-alembic==1.0.2 --hash=sha256:14024bd47f71d8b51920721dcd63248d07d370fbd0f6af
a9bec67b9edaf71f36
-arrow==0.12.1 --hash=sha256:5ef4a593615dc61ed85e62070b1bd27c71f7266233f0f9f
385b651370e8c6760
-certifi==2018.10.15 --hash=sha256:a5471c55b011bd45d6155f5c3629310c1d2f1e1a5
a899b7e438a223343de583d
-chardet==3.0.4 --hash=sha256:9f178988ca4c86e8a319b51aac1185b6fe5192328eb5a1
63c286f4bf50b7b3d8
-idna==2.7 --hash=sha256:954e65e127d0433a352981f43f291a438423d5b385ebf643c70
fd740e0634111
-mako==1.0.7 --hash=sha256:87ee3f74ba3ea544e683a5a22e7e34f4d1cf3ad34414b5f38
58becf00facf1d6
-markupsafe==1.0 --hash=sha256:6a7078a2fb4406458d6ae3579e4eb01a9bdc0a9a0686a
28fa50c19a039e3fcb8 --hash=sha256:3c9bf8fb4c3cf7dd11fd465132156d4c3cddb926d39bdb
d0f0bf5920fd8009a4
-pathlib2==2.3.2 --hash=sha256:8e276e2bf50a9a06c36e20f03b050e59b63dfe0678e37
164333deb87af03b6ad
-python-dateutil==2.7.5 --hash=sha256:56f285e7fad54cde3e31dc68a31a861543bfee
5ada9278da8e85ec20a8f72912
-python-editor==1.0.3 --hash=sha256:44fc57a6db6e04c7922c37a04d0a86d0024a4f0f
06245b6c57638cb322176202
-requests==2.20.0 --hash=sha256:2a539dd6af40a611f3b8eb3f99d3567781352ece1698
b2fab42bf4c2218705b5
-securedrop-sdk==0.0.12 --hash=sha256:274beb38dccd91988b45517e6479863f80bb31
e00953695e61dd8103509f2337
-six==1.11.0 --hash=sha256:4663c7a1dbed033cfb294f2d534bd6151c0698dc12ecabb4e
aa3cb041d758528
-sqlalchemy==1.3.3 --hash=sha256:bd030ff97e7a4f3aa34aafa6b62898c7de6999784c8
b4828b3e8b31cf69dae9c --hash=sha256:680b2ae6a728941c9fe661c85f2309a69408e7ec8ed8
fa39d03e07595259b75b
-urllib3==1.24.3 --hash=sha256:028309393606e28e640e2031edd27eb969c94f9364b08
71912608aaa8e66c96e
+alembic==1.0.2 --hash=sha256:99cc931e11dbef6e41e9376f18be62fc90fe4be9
c541eac1b30a3455b3d655f3
+arrow==0.12.1 --hash=sha256:fc8c8e0b587d00f38986bc161f4496e000acea033
fe2ce25f4f5bffa9ae53a7c
+certifi==2018.10.15 --hash=sha256:173b19dd31ca7faa50d1fcc0eaf30f5e32e
8e99e17d8c7fd4cfc8bc8d94e18a6
+chardet==3.0.4 --hash=sha256:f5632e583a4f61f1e16d0cc98127d241fb11c3c6
ddfddee159307d4215186837
+idna==2.7 --hash=sha256:491f674364ba3232ed1eb4c1eb7407887f62cef6c300a
ad7df6e01acd88ffb25
+mako==1.0.7 --hash=sha256:614c22fe1a5b0a3f46f6c5c43ff2e6795e4e784328d
559ec9dc49db0f06b3a75
+markupsafe==1.0 --hash=sha256:c6b726d2e9d6300a044cf6a37627f10994268d6
ac39464bc0d725126609311a5
+pathlib2==2.3.2 --hash=sha256:460e67b14d0574b0529a0017b1eb05d10d97226
81e303fec7077c2a628de60c1
+python-dateutil==2.7.5 --hash=sha256:f6eb9c17acd5a6954e1a5f2f999a41de
3e7e25b6bc41baf6344bd053ec25ceeb
+python-editor==1.0.3 --hash=sha256:e47dcec4ea883853b8196fbd425b875d7e
c791d4ede2e20cfc70b9a25365c65b
+requests==2.20.0 --hash=sha256:d87b2085783d31d874ac7bc62660e287932aae
e7059e80b41b76462eb18d35cc
+securedrop-sdk==0.0.12 --hash=sha256:d05bb78652c8771e6aa1aefcd76ade1f
ef08c563d2641acbc5ac8e1d635e6a53
+six==1.11.0 --hash=sha256:aa4ad34049ddff178b533062797fd1db9f0038b7c5c
2461a7cde2244300b9f3d
+sqlalchemy==1.3.3 --hash=sha256:bfb4cd0df5802a01acd738a080a19e60ff470
0e030d497de273807f9a8bd4a0a
+urllib3==1.24.3 --hash=sha256:3d440cbb168e2c963d5099232bdb3f7390bf031
b6270dad1bc79751698a1399a
circleci@4bfc3f883da5:~/project$ |
32da4a1
to
6813852
Compare
uuid=self.uuid, | ||
session=session, | ||
content=plaintext_file.read()) | ||
finally: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can remove this cleanup since we use a temporary file that automatically gets removed (see how when you remove this block of code, there is no decryption file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The temporary file is gone, but because I'm doing the work in subdirectories under data_dir, instead of at the top level, the folder named after the UUID of the message or reply will be left behind, empty. This cleanup keeps those from accumulating and possibly affecting performance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah i see this makes sense.
just out of curiosity, why did you chose not to decrypt to the same uuid directory, e.g. instead of:
Message/baf388ff-dfbd-45e1-a629-bf9e761b1ac8
Reply/baf388ff-dfbd-45e1-a629-bf9e761b1ac8
File/baf388ff-dfbd-45e1-a629-bf9e761b1ac8/example_submission.pdf
why not:
baf388ff-dfbd-45e1-a629-bf9e761b1ac8/example_submission.pdf
baf388ff-dfbd-45e1-a629-bf9e761b1ac8/<temporary message and reply files that are removed>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah maybe it's to avoid having an empty uuid directory if there are only messages and replies and not files?
but i guess you could just decrypt them to the data dir and then there wouldn't need to be any folder creation and deletion cleanup. I prefer this way because it keeps the code simpler, so to be clear we would just have these in the data directory:
baf388ff-dfbd-45e1-a629-bf9e761b1ac8/example_submission.pdf
<temporary message and reply files downloded with the server file name `<count>-<jojurnalist designation>-<type of file>` removed once decrypted>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason for including the class in the path is that UUIDs are only unique within a table, so if we were writing them into the root of the data directory, we could theoretically see collisions. I know, I know, UUIDs, but this approach uses the database constraints to guarantee it can't happen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To recap our discussion from standup this morning, there were generally three ideas kicking around (we went with the third idea):
-
keep storing downloaded messages and replies in the data directory and if they fail to decrypt, that's where we'll find them, and store submission files in the data directory until they are successfully decrypted into a directory with a unique name, either UUID or the server filename.
Pros:
- If messages and replies successfully decrypt, we don't have to delete any folders
Cons:
-
files are not organized under a journalist designation, so files that are part of the same conversation would have to be collected in various places if we want to one day export multiple files and messages.
-
maintaining two different data directory structures in securedrop-export and securedrop-client, more cognitive overload
-
store all messages and replies in their respective
<Type>/<UUID>
directoriesPros:
-
the UUID never changes, so we don't have to change the name of the directory when the journalist designation changes
-
if for some reason the journalist designation name change fails when we update the filesystem, we can still easily delete messages and files easily because they will be organized under the UUID (this is relevant removing a source or individual submissions)
Cons:
- (same cons as above, also a note about how we already have to change the filename if the journalist designation changes)
-
-
maintain the same data directory structure as securedrop-export (see https://github.com/freedomofpress/securedrop-export/blob/master/README.md#export-archive-format)
Pros:
-
same directory structure as we (plan to) use in the securedrop-export, less to remember, and if we find a problem with the structure we can easily fix it in both places
-
files are organized under a journalist designation, so files that are part of the same conversation would be easier to collect for a multi-file export or a full-source conversation export
-
it's more human readable
Cons:
- We have to change the name of the folders in addition to the files if the journalist designation changes
-
One additonal thought I'm having now is that we might want to consider storing messages directly under the journalist designation name, instead of create a subdirectory of the server filename, since they will always be text files and have the same server filename as the subdirecotry.
So instead of:
├── cytotoxic payer
│ ├── 1-cytotoxic-payer-msg
│ │ └── 1-cytotoxic-payer-msg.txt
│ ├── 2-cytotoxic-payer-msg
│ │ └── 2-cytotoxic-payer-msg.txt
│ └── 3-cytotoxic-payer-doc
│ │ └── original-filename.doc
├── grandiloquent pasteboard
│ └── 1-grandiloquent-pasteboard-doc
│ │ └── original-filename.pdf
└── snug seek
I'm proposing:
├── cytotoxic payer
│ ├── 1-cytotoxic-payer-msg.txt
│ ├── 2-cytotoxic-payer-msg.txt
│ └── 3-cytotoxic-payer-doc
│ │ └── original-filename.doc
├── grandiloquent pasteboard
│ └── 1-grandiloquent-pasteboard-doc
│ │ └── original-filename.pdf
└── snug seek
But this is a minor detail, and shouldn't hold up this PR since this has yet to be implemented in securdrop-export.
Ok, that's the recap, plus one additional thought!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One of the requirements in #714 was to remove File.original_filename
, instead replacing File.filename
with the submission's original filename when the submission was unpacked. That's going to make the hierarchy above impossible, since we won't have 1-grandiloquent-pasteboard-doc
in the File
once it's been processed.
Keeping the original_filename
column would enable this structure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok so we discussed this a bit and @creviera is gonna pick up freedomofpress/securedrop#4304 to remove the renaming of journalist_designation
entirely, we'll pick into the next core/server minor release. One thing we thought about when we discussed this is that the metadata sync (and file rename) and file/message/reply download can occur concurrently, so it's definitely worth removing the rename to avoid racing behavior (i.e. to avoid guarding against such behavior). This way we can just use the securedrop-export structure everywhere.
so: for the remainder of this review, we can assume that this will be tested with a version of securedrop server that lacks journalist_designation renaming
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and for those not part of the discussion today, we agreed that it makes sense to stick with idea #3 (see recap of all the three ideas we discussed and their pros and cons from Wednesday's standup: #737 (comment))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(ignore the #3
link... forgot that would link to an issue)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i did a first pass
@@ -124,8 +124,12 @@ def decrypt_submission_or_reply(self, | |||
|
|||
# Store the plaintext in the file located at the specified plaintext_filepath | |||
if is_doc: | |||
original_filename = read_gzip_header_filename(out.name) | |||
with gzip.open(out.name, 'rb') as infile, open(plaintext_filepath, 'wb') as outfile: | |||
original_filename = read_gzip_header_filename(out.name) or plaintext_filepath |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since you're using the filepath as the default filename here, in case a file name is missing from the header, i think you'll have to clean up our code elsewhere that has logic to use the server filename, or perhaps you did that? i haven't come across it yet or maybe i missed it.
@@ -679,39 +641,34 @@ def export_file_to_usb_drive(self, file_uuid: str, passphrase: str) -> None: | |||
is_downloaded is set to False. | |||
''' | |||
file = self.get_file(file_uuid) | |||
logger.info('Exporting file {}'.format(file.original_filename)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like you accidentally removed a log line
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just a nit but i think we should add this log line back
6813852
to
bcd4eab
Compare
Rebased, changed the naming scheme in the data directory to be human readable, eliminated the renaming of files when source journalist designation changes. Did not change the export archive format, as we decided to do that in a separate issue. |
Eliminate File.original_file column and hard links to downloaded files. Now as files are downloaded and decrypted, File.filename will be updated. Instead of downloading directly into the root of the data directory, create a directory therein for each downloadable object, named with its UUID, and process downloaded files in that directory. This should eliminate collisions if multiple sources upload documents with the same names.
Base directory and file names in the data directory on the journalist designation of their sources, plus file counters and extensions, instead of UUIDs. Remove handling of changes to source journalist designations on the server; local files will no longer be renamed. The ability to change journalist designations will soon be removed from SecureDrop core, so this will not be needed.
bcd4eab
to
d6c8320
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ran through the test plan and things work as expected
looks like we're not removing directories and files with old journalist designations after a change, but since we're removing this name change feature i think it makes sense not to spend cycles support it here
Description
Eliminate
File.original_file
column and hard links to downloaded files. Now as files are downloaded and decrypted,File.filename
will be updated.Instead of downloading directly into the root of the data directory, create a directory therein for each downloadable object, named with its UUID, and process downloaded files in that directory. This should
eliminate collisions if multiple sources upload documents with the same names.
Fixes #714.
Test Plan
make dev
run.sh
. Find your source in the list and:$SDC_HOME/data_dir/File/{file-uuid}/
directory, with no files hard linked to it.Checklist
If these changes modify code paths involving cryptography, the opening of files in VMs or network (via the RPC service) traffic, Qubes testing in the staging environment is required. For fine tuning of the graphical user interface, testing in any environment in Qubes is required. Please check as applicable:
If these changes add or remove files other than client code, packaging logic (e.g., the AppArmor profile) may need to be updated. Please check as applicable: