bug/reopen temp file (pdf high_res) #303

Closed
FAbrahamDev opened this issue Dec 13, 2023 · 0 comments · Fixed by #376


Describe the bug
At this line:

with tempfile.NamedTemporaryFile() as tmp_file:

a temporary file is created and its name is passed down the chain process_file_with_model -> DocumentLayout.from_file -> load_pdf -> extract_pages (pdfminer).
extract_pages then opens the file a second time via with open_filename(pdf_file, "rb") as fp:, while the original handle is still open.

On Windows this fails with PermissionError: [Errno 13] Permission denied: 'C:\\Users\\...\\AppData\\Local\\Temp\\tmpf9flca30'.

Same error here:
https://github.com/Unstructured-IO/unstructured/blob/d3a404cfb541dae8e16956f096bac99fc05c985b/unstructured/partition/pdf_image/ocr.py#L79

To Reproduce

import tempfile

# print operating system name
import os
print(os.name)


# Create a temporary file
with tempfile.NamedTemporaryFile() as tmp_file:
    # Write some data to the file
    tmp_file.write(b'Hello, world!')
    tmp_file.flush()  # Flush the buffer to make sure data is written

    # Get the name of the file
    file_name = tmp_file.name

    # Reopen the file by name while tmp_file is still open (we are still inside the
    # with block); on Windows this second open raises PermissionError
    with open(file_name, 'r') as file:
        # Read the data from the file
        content = file.read()
        print("Content of the temp file:", content)

Expected behavior
I expect it not to crash :)

Additional context
Possible solution taken from here: https://stackoverflow.com/questions/39983886/python-writing-and-reading-from-a-temporary-file

def process_data_with_model(
    data: BinaryIO,
    model_name: Optional[str],
    **kwargs,
) -> DocumentLayout:
    """Processes pdf file in the form of a file handler (supporting a read method) into a
    DocumentLayout by using a model identified by model_name."""

    with tempfile.TemporaryDirectory() as td:
        f_name = os.path.join(td, "tmp_file")
        # data.read() returns bytes, so the file must be opened in binary mode
        with open(f_name, "wb") as tmp_file:
            tmp_file.write(data.read())

        layout = process_file_with_model(
            f_name,
            model_name,
            **kwargs,
        )

    return layout

Or another solution, suggested by GPT:

import os
import tempfile

# delete=False keeps the file on disk after the handle is closed
with tempfile.NamedTemporaryFile(delete=False) as tmp_file:
    tmp_file.write(b'Hello, world!')
    # Get the name of the file before closing
    file_name = tmp_file.name

# Now the file is closed, you can open it again
with open(file_name, 'r') as file:
    content = file.read()
    print("Content of the temp file:", content)

# Optionally, delete the file if you don't need it anymore
os.remove(file_name)

Not sure which is better.
The latter one probably needs a try/finally around the processing so that the file is removed even when an error is raised; the error then propagates on its own without an explicit re-raise.
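
For the delete=False variant, the cleanup could look roughly like the sketch below. This is only an illustration, not the fix that was merged: closed_temp_copy is a hypothetical helper name, and the io.BytesIO input stands in for the real PDF data.

```python
import io
import os
import tempfile
from contextlib import contextmanager
from typing import BinaryIO, Iterator


@contextmanager
def closed_temp_copy(data: BinaryIO) -> Iterator[str]:
    """Hypothetical helper: copy `data` into a temp file, close it, yield its path."""
    # delete=False keeps the file on disk once the handle is closed.
    tmp_file = tempfile.NamedTemporaryFile(delete=False)
    try:
        with tmp_file:
            tmp_file.write(data.read())
        # The handle is closed at this point, so other code can reopen the
        # path by name, including on Windows.
        yield tmp_file.name
    finally:
        # Runs on success and on failure; any exception keeps propagating.
        os.remove(tmp_file.name)


if __name__ == "__main__":
    with closed_temp_copy(io.BytesIO(b"Hello, world!")) as file_name:
        with open(file_name, "rb") as file:
            print("Content of the temp file:", file.read())
```

Compared with the TemporaryDirectory version, this keeps NamedTemporaryFile for name generation but makes the caller responsible for removal, which is why the finally block is needed.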
