fix(odt): fix disk-space leak in partition_odt() #3037

Merged (1 commit), May 16, 2024

@scanny (Collaborator) commented May 16, 2024

Remedy disk-space leak where partition_odt() would leave an on-disk copy of each .odt file passed as a file-like object.

partition_odt() creates a temporary file to which it writes each source document provided as a file-like object. This file is never deleted, so disk consumption grows without bound.

The convert_and_partition_docx() function, which performs the ODT->DOCX conversion, uses pandoc (a command-line program) to do the work. Because pandoc runs in a separate process with its own memory space, the source file cannot be passed as an in-memory object and must be on the filesystem. When the ODT source document is passed as a file-like object, it is written to disk so that pandoc can access it, but it is not deleted afterward.

Fix this by writing the temporary source ODT file to a TemporaryDirectory and writing the conversion-target DOCX file to that same location. The directory is removed automatically when partition_odt() completes.
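
For illustration only, here is a minimal sketch of the temporary-directory pattern described above. It is not the code in this PR: the function name `convert_in_temp_dir`, the file names, and the `convert` callback (standing in for the pandoc call) are all hypothetical.

```python
import io
import os
import tempfile
from typing import IO, Callable


def convert_in_temp_dir(file: IO[bytes], convert: Callable[[str, str], None]) -> bytes:
    """Write `file` to a temporary directory, convert it there, return the result.

    `convert(source_path, target_path)` is a hypothetical stand-in for the
    external conversion program; it must read one path and write the other.
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        # Both the source copy and the conversion target live in tmp_dir.
        source_path = os.path.join(tmp_dir, "source.odt")
        target_path = os.path.join(tmp_dir, "target.docx")

        with open(source_path, "wb") as f:
            f.write(file.read())

        convert(source_path, target_path)

        with open(target_path, "rb") as f:
            converted = f.read()

    # Leaving the `with` block removes tmp_dir and everything in it,
    # so neither file is left behind on disk.
    return converted
```

Because both files live inside the same directory, one cleanup step covers them, including when the conversion raises.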

While we're in there, improve the factoring of partition_odt().

  • Move convert_and_partition_docx() out of partition.docx (where it was used only by partition_odt()) and into partition.odt as _convert_odt_to_docx(). Decouple file conversion from the partition_docx() call on the converted file, since making that call is partition_odt()'s natural responsibility.
  • Improve docstrings, typing, and comments.
  • All tests pass both before and after.
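
The decoupling in the first bullet might look roughly like the sketch below, in which the conversion helper only produces a DOCX file and returns its path, leaving partitioning to the caller. This is an assumption-laden illustration, not the PR's implementation: the names `pandoc_command` and `convert_odt_to_docx` are hypothetical, and invoking the `pandoc` CLI through `subprocess` is a choice made here for illustration (the PR does not show how pandoc is called).

```python
import os
import subprocess
from typing import List


def pandoc_command(source_path: str, target_path: str) -> List[str]:
    """Build a pandoc command line that converts an ODT file to DOCX."""
    return ["pandoc", "-f", "odt", "-t", "docx", "-o", target_path, source_path]


def convert_odt_to_docx(target_dir: str, source_path: str) -> str:
    """Convert the ODT at `source_path` to a DOCX in `target_dir`; return its path.

    Hypothetical helper: it performs the conversion only. Partitioning the
    resulting DOCX is left to the caller.
    """
    base_name = os.path.splitext(os.path.basename(source_path))[0]
    target_path = os.path.join(target_dir, f"{base_name}.docx")
    # Requires the pandoc executable on PATH; raises CalledProcessError on failure.
    subprocess.run(pandoc_command(source_path, target_path), check=True)
    return target_path
```

A caller would pass a temporary directory as `target_dir` and then hand the returned path to its own partitioning step.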

@Coniferish (Collaborator) left a comment

lgtm

@scanny scanny added this pull request to the merge queue May 16, 2024
Merged via the queue into main with commit 8644a3b May 16, 2024
42 checks passed
@scanny scanny deleted the scanny/fix-odt-disk-space-leak branch May 16, 2024 20:33