-
Notifications
You must be signed in to change notification settings - Fork 42
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve download file handling #737
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,71 @@ | ||
"""drop File.original_filename | ||
|
||
Revision ID: fb657f2ee8a7 | ||
Revises: 86b01b6290da | ||
Create Date: 2020-01-23 18:55:09.857324 | ||
|
||
""" | ||
from alembic import op | ||
import sqlalchemy as sa | ||
|
||
# revision identifiers, used by Alembic. | ||
revision = "fb657f2ee8a7" | ||
down_revision = "86b01b6290da" | ||
branch_labels = None | ||
depends_on = None | ||
|
||
|
||
def upgrade(): | ||
conn = op.get_bind() | ||
|
||
op.rename_table("files", "original_files") | ||
|
||
conn.execute( | ||
""" | ||
CREATE TABLE files ( | ||
id INTEGER NOT NULL, | ||
uuid VARCHAR(36) NOT NULL, | ||
filename VARCHAR(255) NOT NULL, | ||
file_counter INTEGER NOT NULL, | ||
size INTEGER NOT NULL, | ||
download_url VARCHAR(255) NOT NULL, | ||
is_downloaded BOOLEAN DEFAULT 0 NOT NULL, | ||
is_read BOOLEAN DEFAULT 0 NOT NULL, | ||
is_decrypted BOOLEAN, | ||
source_id INTEGER NOT NULL, | ||
CONSTRAINT pk_files PRIMARY KEY (id), | ||
CONSTRAINT fk_files_source_id_sources FOREIGN KEY(source_id) REFERENCES sources (id), | ||
CONSTRAINT uq_messages_source_id_file_counter UNIQUE (source_id, file_counter), | ||
CONSTRAINT uq_files_uuid UNIQUE (uuid), | ||
CONSTRAINT files_compare_is_downloaded_vs_is_decrypted CHECK (CASE WHEN is_downloaded = 0 THEN is_decrypted IS NULL ELSE 1 END), | ||
CONSTRAINT ck_files_is_downloaded CHECK (is_downloaded IN (0, 1)), | ||
CONSTRAINT ck_files_is_read CHECK (is_read IN (0, 1)), | ||
CONSTRAINT ck_files_is_decrypted CHECK (is_decrypted IN (0, 1)) | ||
) | ||
""" | ||
) | ||
|
||
conn.execute( | ||
""" | ||
INSERT INTO files | ||
(id, uuid, filename, file_counter, size, download_url, is_downloaded, | ||
is_decrypted, is_read, source_id) | ||
SELECT id, uuid, filename, file_counter, size, download_url, is_downloaded, | ||
is_decrypted, is_read, source_id | ||
FROM original_files | ||
""" | ||
) | ||
|
||
op.drop_table("original_files") | ||
|
||
|
||
def downgrade(): | ||
op.add_column( | ||
"files", | ||
sa.Column( | ||
"original_filename", | ||
sa.VARCHAR(length=255), | ||
server_default=sa.text("''"), | ||
nullable=False, | ||
), | ||
) |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -125,8 +125,12 @@ def decrypt_submission_or_reply(self, | |
|
||
# Store the plaintext in the file located at the specified plaintext_filepath | ||
if is_doc: | ||
original_filename = read_gzip_header_filename(out.name) | ||
with gzip.open(out.name, 'rb') as infile, open(plaintext_filepath, 'wb') as outfile: | ||
original_filename = read_gzip_header_filename(out.name) or plaintext_filepath | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. since you're using the filepath as the default filename here, in case a file name is missing from the header, i think you'll have to clean up our code elsewhere that has logic to use the server filename, or perhaps you did that? i haven't come across it yet or maybe i missed it. |
||
decrypt_path = os.path.join( | ||
os.path.dirname(filepath), | ||
os.path.basename(original_filename) | ||
) | ||
with gzip.open(out.name, 'rb') as infile, open(decrypt_path, 'wb') as outfile: | ||
shutil.copyfileobj(infile, outfile) | ||
else: | ||
shutil.copy(out.name, plaintext_filepath) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can remove this cleanup since we use a temporary file that automatically gets removed (see how when you remove this block of code, there is no decryption file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The temporary file is gone, but because I'm doing the work in subdirectories under data_dir, instead of at the top level, the folder named after the UUID of the message or reply will be left behind, empty. This cleanup keeps those from accumulating and possibly affecting performance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah i see this makes sense.
just out of curiosity, why did you chose not to decrypt to the same uuid directory, e.g. instead of:
why not:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah maybe it's to avoid having an empty uuid directory if there are only messages and replies and not files?
but i guess you could just decrypt them to the data dir and then there wouldn't need to be any folder creation and deletion cleanup. I prefer this way because it keeps the code simpler, so to be clear we would just have these in the data directory:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason for including the class in the path is that UUIDs are only unique within a table, so if we were writing them into the root of the data directory, we could theoretically see collisions. I know, I know, UUIDs, but this approach uses the database constraints to guarantee it can't happen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To recap our discussion from standup this morning, there were generally three ideas kicking around (we went with the third idea):
keep storing downloaded messages and replies in the data directory and if they fail to decrypt, that's where we'll find them, and store submission files in the data directory until they are successfully decrypted into a directory with a unique name, either UUID or the server filename.
Pros:
Cons:
files are not organized under a journalist designation, so files that are part of the same conversation would have to be collected in various places if we want to one day export multiple files and messages.
maintaining two different data directory structures in securedrop-export and securedrop-client, more cognitive overload
store all messages and replies in their respective
<Type>/<UUID>
directoriesPros:
the UUID never changes, so we don't have to change the name of the directory when the journalist designation changes
if for some reason the journalist designation name change fails when we update the filesystem, we can still easily delete messages and files easily because they will be organized under the UUID (this is relevant removing a source or individual submissions)
Cons:
maintain the same data directory structure as securedrop-export (see https://github.com/freedomofpress/securedrop-export/blob/master/README.md#export-archive-format)
Pros:
same directory structure as we (plan to) use in the securedrop-export, less to remember, and if we find a problem with the structure we can easily fix it in both places
files are organized under a journalist designation, so files that are part of the same conversation would be easier to collect for a multi-file export or a full-source conversation export
it's more human readable
Cons:
One additonal thought I'm having now is that we might want to consider storing messages directly under the journalist designation name, instead of create a subdirectory of the server filename, since they will always be text files and have the same server filename as the subdirecotry.
So instead of:
I'm proposing:
But this is a minor detail, and shouldn't hold up this PR since this has yet to be implemented in securdrop-export.
Ok, that's the recap, plus one additional thought!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One of the requirements in #714 was to remove
File.original_filename
, instead replacingFile.filename
with the submission's original filename when the submission was unpacked. That's going to make the hierarchy above impossible, since we won't have1-grandiloquent-pasteboard-doc
in theFile
once it's been processed.Keeping the
original_filename
column would enable this structure.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok so we discussed this a bit and @creviera is gonna pick up freedomofpress/securedrop#4304 to remove the renaming of
journalist_designation
entirely, we'll pick into the next core/server minor release. One thing we thought about when we discussed this is that the metadata sync (and file rename) and file/message/reply download can occur concurrently, so it's definitely worth removing the rename to avoid racing behavior (i.e. to avoid guarding against such behavior). This way we can just use the securedrop-export structure everywhere.so: for the remainder of this review, we can assume that this will be tested with a version of securedrop server that lacks journalist_designation renaming
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and for those not part of the discussion today, we agreed that it makes sense to stick with idea #3 (see recap of all the three ideas we discussed and their pros and cons from Wednesday's standup: #737 (comment))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(ignore the
#3
link... forgot that would link to an issue)