fix(doc): fix disk-space leak #3019

Merged
merged 1 commit into from
May 15, 2024

scanny
Collaborator

@scanny scanny commented May 15, 2024

Summary
Remedy disk-space leak where partition_doc() would leave a copy of each .doc file passed as a file-like object on disk.

Additional Context
partition_doc() creates a temporary file in which it writes each source-document provided as a file-like object. This file is not deleted and disk consumption grows without bound.
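As an illustration only, the kind of leak described above can be sketched as follows. The helper name `write_stream_to_temp_file` is hypothetical and this is not the library's actual code; it only assumes a temporary file that is created so it outlives the call and is never removed.

```python
import io
import os
import tempfile
from typing import IO


def write_stream_to_temp_file(file: IO[bytes]) -> str:
    """Hypothetical helper: copy a file-like object to a path on disk."""
    # delete=False keeps the file after close() so an external program can
    # open it by path. Nothing here removes it later, so each call leaves
    # one more file in the system temp directory.
    tmp = tempfile.NamedTemporaryFile(delete=False, suffix=".doc")
    try:
        tmp.write(file.read())
    finally:
        tmp.close()
    return tmp.name
```

Unless the caller remembers to call `os.remove()` on the returned path, the file stays on disk after the function returns.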

The convert_office_doc() function used to convert DOC->DOCX runs a command-line program provided with LibreOffice to do the conversion. Because this program runs in a separate process with its own memory space, the source file cannot be passed as an in-memory object and needs to be on the filesystem. When the DOC file is passed as a file-like object, it is written to disk so the conversion program has access to it. It is not deleted afterward.

Fix this by writing the temporary source DOC file in the TemporaryDirectory already being used to write the conversion-target DOCX file. That directory is automatically removed when partition_doc() completes.
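A minimal sketch of this approach, under stated assumptions: `convert_doc_stream` and `fake_convert` are hypothetical names, and the `convert` callable stands in for the external LibreOffice conversion step so the example runs without LibreOffice installed. It is not the actual `partition_doc()` implementation.

```python
import io
import os
import shutil
import tempfile
from typing import IO, Callable


def convert_doc_stream(file: IO[bytes], convert: Callable[[str, str], str]) -> bytes:
    """Convert a DOC byte stream to DOCX bytes without leaving files behind.

    `convert(source_path, target_dir)` is a hypothetical stand-in for the
    external converter; it must return the path of the file it wrote.
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        # Write the source file into the same directory that receives the
        # conversion target, so both are removed when the block exits.
        source_path = os.path.join(tmp_dir, "document.doc")
        with open(source_path, "wb") as f:
            f.write(file.read())

        target_path = convert(source_path, tmp_dir)

        # Read the result into memory before the directory is deleted.
        with open(target_path, "rb") as f:
            return f.read()


def fake_convert(source_path: str, target_dir: str) -> str:
    """Placeholder converter (assumption): copies the source to a .docx path."""
    target_path = os.path.join(target_dir, "document.docx")
    shutil.copyfile(source_path, target_path)
    return target_path
```

Because the source file and the converted file share one `TemporaryDirectory`, no separate cleanup step is needed for either, including when the converter raises an exception.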

@scanny scanny requested a review from Coniferish May 15, 2024 06:29
Collaborator

@Coniferish Coniferish left a comment

LGTM!

@scanny scanny added this pull request to the merge queue May 15, 2024
Merged via the queue into main with commit b1b8eae May 15, 2024
42 checks passed
@scanny scanny deleted the scanny/fix-doc-disk-space-leak branch May 15, 2024 17:13
2 participants