fix(odt): fix disk-space leak in partition_odt() #3037
Merged
Remedy disk-space leak where `partition_odt()` would leave an on-disk copy of each `.odt` file passed as a file-like object.

`partition_odt()` creates a temporary file to which it writes each source document provided as a file-like object. This file is never deleted, so disk consumption grows without bound.

The `convert_and_partition_docx()` function used to convert ODT->DOCX uses `pandoc` (a command-line program) to do the conversion. Because this command-line program runs in a separate process with its own memory space, the source file cannot be passed as an in-memory object and needs to be on the filesystem. When the ODT source document is passed as a file-like object, it is written to disk so the conversion program has access to it. It is not deleted afterward.

Fix this by writing the temporary source ODT file in a `TemporaryDirectory` and using that same location for the conversion-target DOCX file. That directory is automatically removed when `partition_odt()` completes.

While we're in there, improve the factoring of `partition_odt()`. Move `convert_and_partition_docx()` from `partition.docx` (used only by `partition_odt()`) to `_convert_odt_to_docx()` in `partition.odt`, where it is used. Decouple file conversion from calling `partition_docx()` with the converted file, as the `partition_docx()` call is `partition_odt()`'s natural responsibility.
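The approach described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the names `convert_odt_to_docx_bytes`, `run_pandoc`, and the injectable `convert` parameter are hypothetical, and the sketch assumes the `pandoc` executable is available on `PATH`.

```python
import os
import shutil
import subprocess
import tempfile
from typing import IO, Callable


def run_pandoc(source_path: str, target_path: str) -> None:
    """Convert an ODT file to DOCX with the pandoc CLI (assumed to be on PATH)."""
    subprocess.run(
        ["pandoc", "--from=odt", "--to=docx", "-o", target_path, source_path],
        check=True,
    )


def convert_odt_to_docx_bytes(
    file: IO[bytes],
    convert: Callable[[str, str], None] = run_pandoc,  # hypothetical hook, eases testing
) -> bytes:
    """Return DOCX bytes for an ODT document given as a binary file-like object.

    Both the temporary ODT source and the DOCX target live inside one
    TemporaryDirectory, so nothing is left on disk once the block exits.
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        source_path = os.path.join(tmp_dir, "source.odt")
        target_path = os.path.join(tmp_dir, "target.docx")

        # The external converter is a separate process, so the source
        # must exist on the filesystem rather than only in memory.
        with open(source_path, "wb") as f:
            shutil.copyfileobj(file, f)

        convert(source_path, target_path)

        # Read the result before the directory (and both files) is removed.
        with open(target_path, "rb") as f:
            return f.read()
```

Returning the converted bytes keeps the conversion step independent of any later partitioning call, and the caller never has to clean up a path.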