fix: page numbering when only converting specific pages #24

dfdeagle47 · 2024-09-09T08:07:43Z

Context

Follow up of #23 (cf. #23 (comment)) to fix page numbering when not converting all pages.

When we only convert specific pages, the formattedPages.page index will not be correct.

Description

Depending on the use case, we can retrieve the proper index:

If pagesToConvertAsImages === -1: we simply use the aggregatedMarkdown array index, same as before.
If pagesToConvertAsImages is an array, we use the aggregatedMarkdown array index to find the corresponding element in pagesToConvertAsImages
If pagesToConvertAsImages is a number, we simply use that value

Note: pdf2pic will always return the pages in order, regardless the initial order in pagesToConvertAsImages parameter. Therefore, it's better to always sort the pagesToConvertAsImages array for consistency.

Testing notes

Here are a few tests cases (check the output to see if the page number is correct):

import { zerox } from "./src/index";
import { writeFile } from "node:fs/promises";

async function main() {
  // All pages (i.e. no value provided for `pagesToConvertAsImages`)
  const resultAllPages = await zerox({
    filePath: "https://www.sldttc.org/allpdf/21583473018.pdf",
    openaiAPIKey: process.env.OPENAI_API_KEY,
  });
  console.log(
    resultAllPages.pages.map((p) => ({
      ...p,
      content: p.content.slice(0, 100),
    }))
  );

  // Page 2
  const resultPage1 = await zerox({
    filePath: "https://www.sldttc.org/allpdf/21583473018.pdf",
    openaiAPIKey: process.env.OPENAI_API_KEY,
    pagesToConvertAsImages: 2,
  });
  console.log(
    resultPage1.pages.map((p) => ({
      ...p,
      content: p.content.slice(0, 100),
    }))
  );

  // Page 2 and 3
  const resultPage2and3 = await zerox({
    filePath: "https://www.sldttc.org/allpdf/21583473018.pdf",
    openaiAPIKey: process.env.OPENAI_API_KEY,
    pagesToConvertAsImages: [2, 3],
  });
  console.log(
    resultPage2and3.pages.map((p) => ({
      ...p,
      content: p.content.slice(0, 100),
    }))
  );

  // Page 3 and 2
  const resultPage3and2 = await zerox({
    filePath: "https://www.sldttc.org/allpdf/21583473018.pdf",
    openaiAPIKey: process.env.OPENAI_API_KEY,
    pagesToConvertAsImages: [3, 2],
  });
  console.log(
    resultPage3and2.pages.map((p) => ({
      ...p,
      content: p.content.slice(0, 100),
    }))
  );
}

main();

dfdeagle47 · 2024-09-11T17:33:02Z

@annapo23 @tylermaran

This PR is not urgent, and you might be busy with other stuff, but do let me know if I should change something or you need more context on the PR to merge this fix. Happy to iterate on it as needed.

annapo23 · 2024-09-13T00:39:11Z

@dfdeagle47 Works well! I'll go ahead and merge the PR.

pradhyumna85 · 2024-09-13T06:52:27Z

@annapo23, @dfdeagle47 I want to add this functionality in the Python SDK. I am just curious on how should the maintain_format = True should be handled where we have pagesToConvertAsImages an array where the pages might not be in continuity like [2,4,8].
Any suggestions?

dfdeagle47 · 2024-09-13T07:35:56Z

@annapo23, @dfdeagle47 I want to add this functionality in the Python SDK. I am just curious on how should the maintain_format = True should be handled where we have pagesToConvertAsImages an array where the pages might not be in continuity like [2,4,8]. Any suggestions?

@pradhyumna85
That's a good question.

As I currently understand it, this parameter is used to maintain the same output markdown format throughout the document, by feeding the previous output page in the prompt. For instance, to have the same types of headings, a consistent tabular format etc.

While it's true that two consecutive pages might be more likely to be similar with each other, I think it doesn't rely too much on any single pages actually. The source document is likely to be consistent throughout, so if you pick two random pages in the source document, the formatting should still be consistent.

So my intuition is that it doesn't matter too much which output page you feed in the context in most cases, and it should still make sense, even when converting specific pages.

Apart from this thought, I don't see an obvious solution because if we need to feed the actual previous page, we'd end up converting the whole document again.

So perhaps, if formatting is really important for the end-user and it doesn't produce the expected result when selecting a few pages, they should try converting the whole document instead. So it's more of a choice for the user of the lib, rather than something that should be handled by the lib itself.

pradhyumna85 · 2024-09-13T08:05:42Z

@dfdeagle47 that makes sense. I think even if it has to be implemented then in theory maintain_format will only be applicable in continuous subsets of the sorted array like:

[3,4,5] whole array is continuous
[1,3,4,8,10,11] has two continuous subsets [3,4], [10,11]

@annapo23, @tylermaran, I am thinking on just implementing this PR in the python SDK for now and any further extensions to that can be done later.
Let me know your thoughts and how you want this to proceed.

Changes: - Feature added: now specific pages can be processed with the python sdk using "select_pages" param. Incorporates getomni-ai#23, getomni-ai#24 for python sdk - workflow for the above feature: create a new temperory pdf in the tempdir if select_pages is specified and follow the rest of the process as usual and finally map the page number in the formatted markdown to get the actual number instead of index. - raise warning when both select_pages and maintain used. - required adaptations and updates in messages, exceptions, types, processor, utils etc Fixes/improvements: - memory efficient pdf to image conversion, utilizing paths only option to directly get sorted image paths from pdf2image api Misc: - Bump the version tag - documentation updated

#39) * Python SDK Changes: - Feature added: now specific pages can be processed with the python sdk using "select_pages" param. Incorporates #23, #24 for python sdk - workflow for the above feature: create a new temperory pdf in the tempdir if select_pages is specified and follow the rest of the process as usual and finally map the page number in the formatted markdown to get the actual number instead of index. - raise warning when both select_pages and maintain used. - required adaptations and updates in messages, exceptions, types, processor, utils etc Fixes/improvements: - memory efficient pdf to image conversion, utilizing paths only option to directly get sorted image paths from pdf2image api Misc: - Bump the version tag - documentation updated * Minor update in README.md --------- Co-authored-by: Pradyumna Singh Rathore <[email protected]>

…hen-only-converting-specific-pages fix: page numbering when only converting specific pages

getomni-ai#39) * Python SDK Changes: - Feature added: now specific pages can be processed with the python sdk using "select_pages" param. Incorporates getomni-ai#23, getomni-ai#24 for python sdk - workflow for the above feature: create a new temperory pdf in the tempdir if select_pages is specified and follow the rest of the process as usual and finally map the page number in the formatted markdown to get the actual number instead of index. - raise warning when both select_pages and maintain used. - required adaptations and updates in messages, exceptions, types, processor, utils etc Fixes/improvements: - memory efficient pdf to image conversion, utilizing paths only option to directly get sorted image paths from pdf2image api Misc: - Bump the version tag - documentation updated * Minor update in README.md --------- Co-authored-by: Pradyumna Singh Rathore <[email protected]>

fix: page numbering when only converting specific pages

48640be

dfdeagle47 marked this pull request as ready for review September 9, 2024 08:08

annapo23 merged commit 26e87e8 into getomni-ai:main Sep 13, 2024

pradhyumna85 mentioned this pull request Sep 13, 2024

Python SDK: Feat. process specific pages, Other fixes and improvements #39

Merged

namtho7078 pushed a commit to namtho7078/zerox that referenced this pull request Oct 27, 2024

Merge pull request getomni-ai#24 from dfdeagle47/fix/page-numbering-w…

a0999e7

…hen-only-converting-specific-pages fix: page numbering when only converting specific pages

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: page numbering when only converting specific pages #24

fix: page numbering when only converting specific pages #24

dfdeagle47 commented Sep 9, 2024 •

edited

Loading

dfdeagle47 commented Sep 11, 2024

annapo23 commented Sep 13, 2024

pradhyumna85 commented Sep 13, 2024

dfdeagle47 commented Sep 13, 2024

pradhyumna85 commented Sep 13, 2024 •

edited

Loading

fix: page numbering when only converting specific pages #24

fix: page numbering when only converting specific pages #24

Conversation

dfdeagle47 commented Sep 9, 2024 • edited Loading

Context

Description

Testing notes

dfdeagle47 commented Sep 11, 2024

annapo23 commented Sep 13, 2024

pradhyumna85 commented Sep 13, 2024

dfdeagle47 commented Sep 13, 2024

pradhyumna85 commented Sep 13, 2024 • edited Loading

dfdeagle47 commented Sep 9, 2024 •

edited

Loading

pradhyumna85 commented Sep 13, 2024 •

edited

Loading