PPTX, DOCS parsing not working with Python SDK but works when I use the UI due to mime type=None #436

ajpanyteam · 2024-10-11T19:59:43Z

(code is below)

Describe the bug

When using llama-index-core==0.11.17 and llama-parse==0.5.7 to parse PowerPoint and Word files downloaded from a GCP bucket, parsing fails with the error below.

For reference, in the GCP bucket, the powerpoint MIME type is "application/vnd.openxmlformats-officedocument.presentationml.presentation"

Started parsing the file under job_id 95517e7b-b8d5-48f4-888f-dd49336a9f4c
Error while parsing the file '<bytes/buffer>':
Job ID: 95517e7b-b8d5-48f4-888f-dd49336a9f4c failed with status: ERROR, Error code: _UNKOWN_ERROR, Error message: UNKNOWN_ERROR: PDF_IS_BROKEN

However, uploading the same PPTX file through the LlamaParse UI reports that no markdown text was found and suggests using Accurate mode. With Accurate mode, markdown is generated.

Re-running the Python code then retrieves the document, likely due to caching.

Investigation

Adding a print statement in base.py right after the MIME type is guessed shows that the guess returns None:

mime_type = mimetypes.guess_type(file_name)[0]
print('guessing mime type:', mime_type)  # prints None

If I retrieve the file from the bucket, write it to a local file, and then pass that file to load_data, the document is processed successfully and the print statement shows:

guessing mime type: application/vnd.openxmlformats-officedocument.wordprocessingml.document
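The difference between the two runs can be reproduced with the standard library alone. This is a minimal sketch; `guess_mime` and the file names are made up for illustration, and the result for `.pptx` depends on the MIME table available on the machine, so no output is claimed for it.

```python
import mimetypes


def guess_mime(file_name: str):
    """Return the MIME type guessed from the file name alone, or None."""
    return mimetypes.guess_type(file_name)[0]


# A name without an extension gives the guesser nothing to work with.
print(guess_mime("quarterly-deck"))      # None

# With an extension, the lookup succeeds for types in the local MIME table.
print(guess_mime("quarterly-deck.pdf"))  # application/pdf
print(guess_mime("quarterly-deck.pptx"))
```

The guess is based purely on the name, never on the bytes, which is why the same content behaves differently depending on the file_name passed alongside it.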

How can I address this by sending the blob to load_data instead?

Files
Happens with all DOCX and PPTX files

Job ID
If you have it, please provide the ID of the job you ran.
See above.

Client:
Please remove untested options:

Frontend (cloud.llamaindex.ai)
Python Library
Notebook
API

My Code

import os

from google.cloud import storage
from llama_parse import LlamaParse

parser = LlamaParse(
    api_key=os.environ["LLAMA_PARSE_API_KEY"],
    result_type="markdown",
)
storage_client = storage.Client()  # GCP
bucket_name = os.getenv('GCP_FILE_UPLOAD_BUCKET_LOCAL')
bucket = storage_client.bucket(bucket_name)
blob = bucket.blob(fileId)  # fileId: object name in the bucket, defined elsewhere
filename = blob.name
buffer = blob.download_as_string()
# buffer = blob.download_as_bytes()  # I tried both options
extra_info = {'file_name': filename}
documents = parser.load_data(buffer, extra_info=extra_info)
print('documents count:', len(documents))

Additional context

The issue is related to MIME type handling differences between the Python library and the UI. The Python library fails to process the file initially but succeeds after the file is processed through the UI, suggesting a potential caching mechanism or MIME type issue.


ajpanyteam · 2024-10-12T00:03:40Z

Closing. The filename in my code did not include the file extension, so LlamaParse could not determine the file type.
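One way to guard against this, sketched under assumptions: `with_extension` and `OOXML_EXTENSIONS` are hypothetical helpers, not part of llama-parse, and the explicit map covers only the two MIME types mentioned in this issue. The idea is to append an extension derived from the object's content type whenever the name has none.

```python
import os
from typing import Optional

# Explicit map for the two formats in this issue; extend as needed.
OOXML_EXTENSIONS = {
    "application/vnd.openxmlformats-officedocument.presentationml.presentation": ".pptx",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document": ".docx",
}


def with_extension(name: str, content_type: Optional[str]) -> str:
    """Return name unchanged if it has an extension, else append one from content_type."""
    if os.path.splitext(name)[1]:
        return name
    ext = OOXML_EXTENSIONS.get(content_type or "")
    if ext is None:
        raise ValueError(
            f"cannot infer an extension for {name!r} from {content_type!r}"
        )
    return name + ext


# Hypothetical usage with the code from this issue, assuming the blob's
# metadata has been loaded so that blob.content_type is populated:
# extra_info = {"file_name": with_extension(blob.name, blob.content_type)}
```

Raising on an unknown content type is a deliberate choice here: it fails before the upload instead of surfacing later as an opaque parsing error.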

Take this with a pinch of salt...
It looks to me like the magic number is the same for PPTX and DOCX.
OpenXML formats are zipped, so .docx files start with PK, which signals a ZIP archive, and .pptx files start with PK as well.
The leading bytes alone therefore cannot tell the two formats apart.

ajpanyteam added the bug Something isn't working label Oct 11, 2024

ajpanyteam changed the title ~~PowerPoint parsing not working with Python SDK but works when I use the UI~~ PowerPoint parsing not working with Python SDK but works when I use the UI - Error while parsing the file '<bytes/buffer>' Oct 11, 2024

ajpanyteam changed the title ~~PowerPoint parsing not working with Python SDK but works when I use the UI - Error while parsing the file '<bytes/buffer>'~~ PPTX, DOCS parsing not working with Python SDK but works when I use the UI due to mime type=None Oct 11, 2024

ajpanyteam closed this as completed Oct 12, 2024
