PPTX, DOCX parsing not working with Python SDK but works when I use the UI due to mime type=None #436

Closed
ajpanyteam opened this issue Oct 11, 2024 · 1 comment
Labels
bug Something isn't working


ajpanyteam commented Oct 11, 2024

(code is below)

Describe the bug

When using llama-index-core==0.11.17 and llama-parse==0.5.7 to upload PowerPoint and Word files from a GCP bucket, parsing fails with the error shown below.

For reference, in the GCP bucket, the powerpoint MIME type is "application/vnd.openxmlformats-officedocument.presentationml.presentation"

Started parsing the file under job_id 95517e7b-b8d5-48f4-888f-dd49336a9f4c
Error while parsing the file '<bytes/buffer>':
Job ID: 95517e7b-b8d5-48f4-888f-dd49336a9f4c failed with status: ERROR, Error code: _UNKOWN_ERROR, Error message: UNKNOWN_ERROR: PDF_IS_BROKEN

However, uploading the same PPTX file through the LlamaParse UI reports that no markdown text was found and suggests using Accurate mode. With Accurate mode enabled, the markdown is generated.

Re-running the Python code then retrieves the document, likely due to caching.

Investigation

  • Adding a print statement in base.py right after the MIME type lookup shows that the lookup returns None:

mime_type = mimetypes.guess_type(file_name)[0]
print('guessing mime type:', mime_type)

  • If I retrieve the file from the bucket, write it to disk and then call load_data on the local file, the document is processed successfully and the same print statement shows:

guessing mime type: application/vnd.openxmlformats-officedocument.wordprocessingml.document
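The difference between the two outputs above can be reproduced without LlamaParse. The following is a minimal sketch, assuming only the standard library: `mimetypes.guess_type` looks at the extension of the name it is given, so a name without an extension yields None. The file names are made up for illustration, and the two `add_type` calls are there only so the result does not depend on the MIME database of the machine running it.

```python
import mimetypes

DOCX = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
PPTX = "application/vnd.openxmlformats-officedocument.presentationml.presentation"

# Register the Office types explicitly so the lookup does not depend on
# whatever MIME database the host system happens to ship.
mimetypes.add_type(DOCX, ".docx")
mimetypes.add_type(PPTX, ".pptx")


def guess(file_name):
    """Mirror the lookup quoted above: the first element of guess_type()."""
    return mimetypes.guess_type(file_name)[0]


# Hypothetical names, chosen only to show the effect of the extension.
print("with extension:   ", guess("quarterly-report.docx"))
print("with extension:   ", guess("slides.pptx"))
print("without extension:", guess("quarterly-report"))
```

The lookup never inspects the bytes in the buffer, which would explain why the same content succeeds from a local file with a full name and fails from a bucket object whose name has no extension.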

How can I address this by sending the blob to load_data instead?

Files
Happens with all DOCX and PPTX files

Job ID
If you have it, please provide the ID of the job you ran.
See above.

Client:
Please remove untested options:

  • Frontend (cloud.llamaindex.ai)
  • Python Library
  • Notebook
  • API

My Code

import os

from google.cloud import storage
from llama_parse import LlamaParse

parser = LlamaParse(
    api_key=os.environ["LLAMA_PARSE_API_KEY"],
    result_type="markdown",
)
storage_client = storage.Client()  # GCP
bucket_name = os.getenv('GCP_FILE_UPLOAD_BUCKET_LOCAL')
bucket = storage_client.bucket(bucket_name)
blob = bucket.blob(fileId)  # fileId is the object name in the bucket, set elsewhere
filename = blob.name
buffer = blob.download_as_string()
# buffer = blob.download_as_bytes()  # I tried both options
extra_info = {'file_name': filename}
documents = parser.load_data(buffer, extra_info=extra_info)
print('documents count:', len(documents))

Additional context

  • The issue is related to MIME type handling differences between the Python library and the UI. The Python library fails to process the file initially but succeeds after the file is processed through the UI, suggesting a potential caching mechanism or MIME type issue.

ajpanyteam commented Oct 12, 2024

Closing. The filename in my code did not include the file extension, so LlamaParse could not determine the file type.
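One way to guard against this is to make sure the name passed in extra_info always carries an extension. The sketch below is only an illustration: `ensure_extension` and the `EXTENSION_BY_CONTENT_TYPE` table are hypothetical names, not part of LlamaParse or the GCP client, and the table lists just the types mentioned in this issue plus PDF.

```python
import os

# Hypothetical lookup table; extend it with the types your bucket holds.
EXTENSION_BY_CONTENT_TYPE = {
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document": ".docx",
    "application/vnd.openxmlformats-officedocument.presentationml.presentation": ".pptx",
    "application/pdf": ".pdf",
}


def ensure_extension(name, content_type):
    """Return name unchanged if it already has an extension.

    Otherwise append the extension that matches content_type, and raise
    ValueError when the content type is missing or not in the table.
    """
    if os.path.splitext(name)[1]:
        return name
    extension = EXTENSION_BY_CONTENT_TYPE.get(content_type or "")
    if extension is None:
        raise ValueError(f"no known extension for content type {content_type!r}")
    return name + extension


# Hypothetical object name without an extension.
print(ensure_extension(
    "a1b2c3",
    "application/vnd.openxmlformats-officedocument.presentationml.presentation",
))
```

In the code from this report, that could look like extra_info = {'file_name': ensure_extension(blob.name, blob.content_type)}. As far as I know, the GCP client only fills in content_type once the blob's metadata has been loaded (for example via bucket.get_blob), so treat that part as an assumption to verify.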

Take this with a pinch of salt, but it looks to me like the magic number is the same for PPTX and DOCX.
.docx files start with PK, which signals a ZIP archive (OpenXML formats are zipped), and .pptx files start with PK as well, so the leading bytes alone cannot tell the two formats apart.
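To illustrate the point about the shared signature, here is a rough sketch using only the standard library. It builds two tiny stand-in archives in memory (they are not real Office documents), shows that both begin with the same ZIP signature, and then tells them apart by looking inside the archive for the main part that each format normally contains. `sniff_ooxml` and `fake_archive` are hypothetical helpers written for this example, not something LlamaParse is known to do.

```python
import io
import zipfile

ZIP_SIGNATURE = b"PK\x03\x04"  # local file header that opens a non-empty ZIP


def fake_archive(member_name):
    """Build a minimal in-memory ZIP holding one member (a stand-in, not a real document)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(member_name, "<xml/>")
    return buffer.getvalue()


def sniff_ooxml(data):
    """Return '.docx', '.pptx', or None based on the archive's member names."""
    if not data.startswith(ZIP_SIGNATURE):
        return None
    if not zipfile.is_zipfile(io.BytesIO(data)):
        return None
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        names = set(archive.namelist())
    if "word/document.xml" in names:
        return ".docx"
    if "ppt/presentation.xml" in names:
        return ".pptx"
    return None


docx_like = fake_archive("word/document.xml")
pptx_like = fake_archive("ppt/presentation.xml")

# Both stand-ins share the same leading bytes...
print(docx_like[:4] == pptx_like[:4] == ZIP_SIGNATURE)
# ...but the member names inside the archive differ.
print(sniff_ooxml(docx_like), sniff_ooxml(pptx_like))
```

So the magic number narrows the file down to "some ZIP container", and the format has to come from the extension or from the archive contents.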
