Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: page numbering when only converting specific pages #24

Conversation

dfdeagle47
Copy link
Contributor

@dfdeagle47 dfdeagle47 commented Sep 9, 2024

Context

Follow up of #23 (cf. #23 (comment)) to fix page numbering when not converting all pages.

When we only convert specific pages, the formattedPages.page index will not be correct.

Description

Depending on the use case, we can retrieve the proper index:

  • If pagesToConvertAsImages === -1: we simply use the aggregatedMarkdown array index, same as before.
  • If pagesToConvertAsImages is an array, we use the aggregatedMarkdown array index to find the corresponding element in pagesToConvertAsImages
  • If pagesToConvertAsImages is a number, we simply use that value

Note: pdf2pic will always return the pages in order, regardless the initial order in pagesToConvertAsImages parameter. Therefore, it's better to always sort the pagesToConvertAsImages array for consistency.

Testing notes

Here are a few tests cases (check the output to see if the page number is correct):

import { zerox } from "./src/index";
import { writeFile } from "node:fs/promises";

async function main() {
  // All pages (i.e. no value provided for `pagesToConvertAsImages`)
  const resultAllPages = await zerox({
    filePath: "https://www.sldttc.org/allpdf/21583473018.pdf",
    openaiAPIKey: process.env.OPENAI_API_KEY,
  });
  console.log(
    resultAllPages.pages.map((p) => ({
      ...p,
      content: p.content.slice(0, 100),
    }))
  );

  // Page 2
  const resultPage1 = await zerox({
    filePath: "https://www.sldttc.org/allpdf/21583473018.pdf",
    openaiAPIKey: process.env.OPENAI_API_KEY,
    pagesToConvertAsImages: 2,
  });
  console.log(
    resultPage1.pages.map((p) => ({
      ...p,
      content: p.content.slice(0, 100),
    }))
  );

  // Page 2 and 3
  const resultPage2and3 = await zerox({
    filePath: "https://www.sldttc.org/allpdf/21583473018.pdf",
    openaiAPIKey: process.env.OPENAI_API_KEY,
    pagesToConvertAsImages: [2, 3],
  });
  console.log(
    resultPage2and3.pages.map((p) => ({
      ...p,
      content: p.content.slice(0, 100),
    }))
  );

  // Page 3 and 2
  const resultPage3and2 = await zerox({
    filePath: "https://www.sldttc.org/allpdf/21583473018.pdf",
    openaiAPIKey: process.env.OPENAI_API_KEY,
    pagesToConvertAsImages: [3, 2],
  });
  console.log(
    resultPage3and2.pages.map((p) => ({
      ...p,
      content: p.content.slice(0, 100),
    }))
  );
}

main();

@dfdeagle47 dfdeagle47 marked this pull request as ready for review September 9, 2024 08:08
@dfdeagle47
Copy link
Contributor Author

@annapo23 @tylermaran

This PR is not urgent, and you might be busy with other stuff, but do let me know if I should change something or you need more context on the PR to merge this fix. Happy to iterate on it as needed.

@annapo23
Copy link
Contributor

@dfdeagle47 Works well! I'll go ahead and merge the PR.

@annapo23 annapo23 merged commit 26e87e8 into getomni-ai:main Sep 13, 2024
@pradhyumna85
Copy link
Contributor

@annapo23, @dfdeagle47 I want to add this functionality in the Python SDK. I am just curious on how should the maintain_format = True should be handled where we have pagesToConvertAsImages an array where the pages might not be in continuity like [2,4,8].
Any suggestions?

@dfdeagle47
Copy link
Contributor Author

@annapo23, @dfdeagle47 I want to add this functionality in the Python SDK. I am just curious on how should the maintain_format = True should be handled where we have pagesToConvertAsImages an array where the pages might not be in continuity like [2,4,8]. Any suggestions?

@pradhyumna85
That's a good question.

As I currently understand it, this parameter is used to maintain the same output markdown format throughout the document, by feeding the previous output page in the prompt. For instance, to have the same types of headings, a consistent tabular format etc.

While it's true that two consecutive pages might be more likely to be similar with each other, I think it doesn't rely too much on any single pages actually. The source document is likely to be consistent throughout, so if you pick two random pages in the source document, the formatting should still be consistent.

So my intuition is that it doesn't matter too much which output page you feed in the context in most cases, and it should still make sense, even when converting specific pages.

Apart from this thought, I don't see an obvious solution because if we need to feed the actual previous page, we'd end up converting the whole document again.

So perhaps, if formatting is really important for the end-user and it doesn't produce the expected result when selecting a few pages, they should try converting the whole document instead. So it's more of a choice for the user of the lib, rather than something that should be handled by the lib itself.

@pradhyumna85
Copy link
Contributor

pradhyumna85 commented Sep 13, 2024

@dfdeagle47 that makes sense. I think even if it has to be implemented then in theory maintain_format will only be applicable in continuous subsets of the sorted array like:

  • [3,4,5] whole array is continuous
  • [1,3,4,8,10,11] has two continuous subsets [3,4], [10,11]

@annapo23, @tylermaran, I am thinking on just implementing this PR in the python SDK for now and any further extensions to that can be done later.
Let me know your thoughts and how you want this to proceed.

pradhyumna85 pushed a commit to pradhyumna85/zerox that referenced this pull request Sep 13, 2024
Changes:
- Feature added: now specific pages can be processed with the python sdk using "select_pages" param. Incorporates getomni-ai#23, getomni-ai#24 for python sdk
- workflow for the above feature: create a new temperory pdf in the tempdir if select_pages is specified and follow the rest of the process as usual and finally map the page number in the formatted markdown to get the actual number instead of index.
- raise warning when both select_pages and maintain used.
- required adaptations and updates in messages, exceptions, types, processor, utils etc

Fixes/improvements:
- memory efficient pdf to image conversion, utilizing paths only option to directly get sorted image paths from pdf2image api

Misc:
- Bump the version tag
- documentation updated
tylermaran pushed a commit that referenced this pull request Sep 17, 2024
#39)

* Python SDK

Changes:
- Feature added: now specific pages can be processed with the python sdk using "select_pages" param. Incorporates #23, #24 for python sdk
- workflow for the above feature: create a new temperory pdf in the tempdir if select_pages is specified and follow the rest of the process as usual and finally map the page number in the formatted markdown to get the actual number instead of index.
- raise warning when both select_pages and maintain used.
- required adaptations and updates in messages, exceptions, types, processor, utils etc

Fixes/improvements:
- memory efficient pdf to image conversion, utilizing paths only option to directly get sorted image paths from pdf2image api

Misc:
- Bump the version tag
- documentation updated

* Minor update in README.md

---------

Co-authored-by: Pradyumna Singh Rathore <[email protected]>
namtho7078 pushed a commit to namtho7078/zerox that referenced this pull request Oct 27, 2024
…hen-only-converting-specific-pages

fix: page numbering when only converting specific pages
namtho7078 pushed a commit to namtho7078/zerox that referenced this pull request Oct 27, 2024
getomni-ai#39)

* Python SDK

Changes:
- Feature added: now specific pages can be processed with the python sdk using "select_pages" param. Incorporates getomni-ai#23, getomni-ai#24 for python sdk
- workflow for the above feature: create a new temperory pdf in the tempdir if select_pages is specified and follow the rest of the process as usual and finally map the page number in the formatted markdown to get the actual number instead of index.
- raise warning when both select_pages and maintain used.
- required adaptations and updates in messages, exceptions, types, processor, utils etc

Fixes/improvements:
- memory efficient pdf to image conversion, utilizing paths only option to directly get sorted image paths from pdf2image api

Misc:
- Bump the version tag
- documentation updated

* Minor update in README.md

---------

Co-authored-by: Pradyumna Singh Rathore <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants