When using llama-index-core==0.11.17 and llama-parse==0.5.7 to upload PowerPoint and Word files from a GCP bucket, parsing fails with the error below.
For reference, in the GCP bucket, the PowerPoint MIME type is "application/vnd.openxmlformats-officedocument.presentationml.presentation"
Started parsing the file under job_id 95517e7b-b8d5-48f4-888f-dd49336a9f4c
Error while parsing the file '<bytes/buffer>':
Job ID: 95517e7b-b8d5-48f4-888f-dd49336a9f4c failed with status: ERROR, Error code: _UNKOWN_ERROR, Error message: UNKNOWN_ERROR: PDF_IS_BROKEN
However, uploading the same PPTX file through the LlamaParse UI reports that no markdown text was present and suggests using Accurate mode. When Accurate mode is used, markdown is generated.
Re-running the Python code then retrieves the document, likely due to caching.
The issue is related to MIME type handling differences between the Python library and the UI. The Python library fails to process the file initially but succeeds after the file is processed through the UI, suggesting a potential caching mechanism or MIME type issue.
ajpanyteam changed the title from "PowerPoint parsing not working with Python SDK but works when I use the UI" to "PowerPoint parsing not working with Python SDK but works when I use the UI - Error while parsing the file '<bytes/buffer>'" on Oct 11, 2024
ajpanyteam changed the title from "PowerPoint parsing not working with Python SDK but works when I use the UI - Error while parsing the file '<bytes/buffer>'" to "PPTX, DOCS parsing not working with Python SDK but works when I use the UI due to mime type=None" on Oct 11, 2024
Closing. The filename in my code did not include the file extension, so LlamaParse could not determine the file type.
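For anyone hitting the same error, here is a minimal sketch of that fix, assuming the blob bytes and the bucket's MIME type are already in hand. The helper names `ensure_extension` and `parse_blob` and the `MIME_TO_EXT` table are hypothetical, made up for illustration. The `extra_info={"file_name": ...}` argument reflects how I understand `load_data` to accept raw bytes; check it against the llama-parse version you have installed.

```python
import os

# Hypothetical lookup table for illustration; extend it for other formats.
MIME_TO_EXT = {
    "application/vnd.openxmlformats-officedocument.presentationml.presentation": ".pptx",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document": ".docx",
    "application/pdf": ".pdf",
}


def ensure_extension(name, mime_type):
    """Return name unchanged if it already has an extension,
    otherwise append the extension matching mime_type (if known)."""
    _, ext = os.path.splitext(name)
    if ext:
        return name
    return name + MIME_TO_EXT.get(mime_type, "")


def parse_blob(parser, data, blob_name, mime_type):
    """Send raw bytes to the parser together with a file name that
    carries an extension, so the file type can be recognised.

    Assumption: `parser` is a LlamaParse instance and its load_data
    accepts bytes plus extra_info={"file_name": ...}.
    """
    file_name = ensure_extension(os.path.basename(blob_name), mime_type)
    return parser.load_data(data, extra_info={"file_name": file_name})
```

With a GCS blob, this would be called along the lines of `parse_blob(parser, blob.download_as_bytes(), blob.name, blob.content_type)`, again assuming those attributes are available on your blob object.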
Take this with a pinch of salt...
It looks to me like the magic number is the same for PPTX and DOCX.
.docx files start with PK, which signals a ZIP archive (OpenXML formats are zipped), thus .pptx files also start with PK.
(code is below)
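As a rough illustration of the magic-number point (a sketch only, not LlamaParse's actual detection logic): both formats begin with the ZIP signature, so the leading bytes alone cannot tell them apart. One has to look inside the archive. The function name `sniff_office_type` is hypothetical, and the `ppt/` and `word/` folder prefixes are the usual layout of these packages rather than a guarantee.

```python
import io
import zipfile

ZIP_MAGIC = b"PK\x03\x04"  # local file header signature of a ZIP archive


def sniff_office_type(data):
    """Guess 'pptx' or 'docx' from raw bytes, or return None.

    The first bytes only say "this is a ZIP"; the member names inside
    the archive are what distinguish the two formats (by convention).
    """
    if not data.startswith(ZIP_MAGIC):
        return None
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            names = zf.namelist()
    except zipfile.BadZipFile:
        return None
    if any(n.startswith("ppt/") for n in names):
        return "pptx"
    if any(n.startswith("word/") for n in names):
        return "docx"
    return None
```

This is why a bytes buffer without a file name or extension is ambiguous: the parser would need to open the archive to decide which Office format it is.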
Investigation
load_data
it successfully processes the document.
guessing mime type: application/vnd.openxmlformats-officedocument.wordprocessingml.document
How can I address this by sending the blob to load_data instead?
Files
Happens with all DOCX and PPTX files
Job ID
If you have it, please provide the ID of the job you ran.
See above.
Client:
Please remove untested options:
My Code
Additional context