LlamaParse error parsing MSFT DOCX and PPTX files - suspect LlamaParseReader is applying wrong MIME Type #1313

Closed
ajpanyteam opened this issue Oct 9, 2024 · 3 comments · Fixed by #1340
Labels
bug Something isn't working

Comments

ajpanyteam commented Oct 9, 2024

Describe the bug:

LlamaParse fails for PPTX and DOCX files but works for PDF and XLSX files.

I am using the LlamaIndex TS SDK v0.6.17.

I suspect this is related to #1007

Write a concise description of what the bug is:

Running the following code returns "Error while parsing the file: Failed to parse the file:"

import { LlamaParseReader } from "llamaindex";

// `storage` and `fileId` are defined elsewhere in the reporter's code.
const file = storage.bucket(process.env.GCP_BUCKET).file(fileId);
const [buffer] = await file.download();

// Log the content type stored with the object, for comparison.
const [metadata] = await file.getMetadata();
console.log(metadata.contentType);

const reader = new LlamaParseReader({
  resultType: "markdown",
  skipDiagonalText: true,
  verbose: true,
});

const uint8Array = new Uint8Array(buffer);
// -> Error while parsing the file: Failed to parse the file: c8e8b079-f3aa-4786-bfe3-e9b3981812cf, status: ERROR
const documents = await reader.loadDataAsContent(uint8Array);

Additional context:

  • Digging into the SDK, I set a breakpoint in LlamaParseReader's loadDataAsContent function, which led to createJob. In the debugger (screenshot omitted), the MIME type applied to the PPTX file is application/vnd.oasis.opendocument.spreadsheet, which does not look right: that is the OpenDocument spreadsheet type, not a PowerPoint type.
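
For context on why the reported MIME type looks off: DOCX, PPTX, XLSX, and ODS files are all ZIP containers, so they share the same leading bytes (`PK\x03\x04`), and a detector that only checks the magic number cannot tell them apart. The sketch below is a hypothetical helper (`guessOfficeMimeType` is not part of the llamaindex SDK) that shows one way to disambiguate OOXML files by searching for the conventional main part names, which ZIP stores uncompressed in its headers. It is a heuristic under those assumptions, not a description of how LlamaParse or the SDK detects file types.

```typescript
// Hypothetical helper for illustration only; not part of the llamaindex SDK.
// Conventional OOXML main part names mapped to their registered MIME types.
const OOXML_MARKERS: Array<[string, string]> = [
  ["word/document.xml", "application/vnd.openxmlformats-officedocument.wordprocessingml.document"],
  ["ppt/presentation.xml", "application/vnd.openxmlformats-officedocument.presentationml.presentation"],
  ["xl/workbook.xml", "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"],
];

// Encode an ASCII string as bytes without relying on TextEncoder.
function asciiBytes(text: string): Uint8Array {
  const out = new Uint8Array(text.length);
  for (let i = 0; i < text.length; i++) out[i] = text.charCodeAt(i) & 0x7f;
  return out;
}

// Naive byte-wise substring search; fine for a diagnostic, slow for huge files.
function containsBytes(haystack: Uint8Array, needle: Uint8Array): boolean {
  outer: for (let i = 0; i + needle.length <= haystack.length; i++) {
    for (let j = 0; j < needle.length; j++) {
      if (haystack[i + j] !== needle[j]) continue outer;
    }
    return true;
  }
  return false;
}

// ZIP local file header signature: "PK" followed by 0x03 0x04.
function isZip(data: Uint8Array): boolean {
  return data.length >= 4 && data[0] === 0x50 && data[1] === 0x4b && data[2] === 0x03 && data[3] === 0x04;
}

// Returns a MIME type guess, or undefined when the data is not a ZIP container.
function guessOfficeMimeType(data: Uint8Array): string | undefined {
  if (!isZip(data)) return undefined;
  for (const [entryName, mimeType] of OOXML_MARKERS) {
    if (containsBytes(data, asciiBytes(entryName))) return mimeType;
  }
  // A ZIP container with none of the known OOXML part names.
  return "application/zip";
}
```

Logging `guessOfficeMimeType(new Uint8Array(buffer))` next to the stored `contentType` from the snippet above would show whether the two agree for a given file.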

Files:

  • wget "https://meetings.wmo.int/Cg-19/PublishingImages/SitePages/FINAC-43/7%20-%20EC-77-Doc%205%20Financial%20Statements%20for%202022%20(FINAC).pptx" -O data/presentation.ppt
    PPTX source file from the LlamaParse GitHub repository

Job ID:

  • ac0c1b4a-915f-4d00-9340-25506d085f33
  • f8275764-361f-4fd9-ac6d-634193c556b2
  • c8e8b079-f3aa-4786-bfe3-e9b3981812cf
  • The job IDs above are from my console output. They do not appear in the history tab at https://cloud.llamaindex.ai/parse.
ajpanyteam changed the title to "LlamaParse error parsing MSFT DOCX and PPTX files - suspect LlamaParseReader is applying wrong MIME Type" on Oct 9, 2024 (after two intermediate retitles the same day)
@danielbank

For certain files, this example code does not raise an error but instead produces a garbage parse from data that is entirely unrelated to the original file. See the example repo for a reproduction.

himself65 added the bug label on Oct 18, 2024
@himself65
Member

It seems to be a bug on the LlamaParse side. I also removed the file type detection on the SDK side because LlamaParse will handle that correctly.

@jsmusgrave
Contributor

@himself65 I just validated a DOCX upload via the website. So, assuming both use the same API, it seems unlikely that this is on the LlamaParse side.
