-
Notifications
You must be signed in to change notification settings - Fork 538
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: page numbering when only converting specific pages #24
fix: page numbering when only converting specific pages #24
Conversation
This PR is not urgent, and you might be busy with other stuff, but do let me know if I should change something or you need more context on the PR to merge this fix. Happy to iterate on it as needed. |
@dfdeagle47 Works well! I'll go ahead and merge the PR. |
@annapo23, @dfdeagle47 I want to add this functionality in the Python SDK. I am just curious on how should the maintain_format = True should be handled where we have pagesToConvertAsImages an array where the pages might not be in continuity like [2,4,8]. |
@pradhyumna85 As I currently understand it, this parameter is used to maintain the same output markdown format throughout the document, by feeding the previous output page in the prompt. For instance, to have the same types of headings, a consistent tabular format etc. While it's true that two consecutive pages might be more likely to be similar with each other, I think it doesn't rely too much on any single pages actually. The source document is likely to be consistent throughout, so if you pick two random pages in the source document, the formatting should still be consistent. So my intuition is that it doesn't matter too much which output page you feed in the context in most cases, and it should still make sense, even when converting specific pages. Apart from this thought, I don't see an obvious solution because if we need to feed the actual previous page, we'd end up converting the whole document again. So perhaps, if formatting is really important for the end-user and it doesn't produce the expected result when selecting a few pages, they should try converting the whole document instead. So it's more of a choice for the user of the lib, rather than something that should be handled by the lib itself. |
@dfdeagle47 that makes sense. I think even if it has to be implemented then in theory maintain_format will only be applicable in continuous subsets of the sorted array like:
@annapo23, @tylermaran, I am thinking on just implementing this PR in the python SDK for now and any further extensions to that can be done later. |
Changes: - Feature added: now specific pages can be processed with the python sdk using "select_pages" param. Incorporates getomni-ai#23, getomni-ai#24 for python sdk - workflow for the above feature: create a new temperory pdf in the tempdir if select_pages is specified and follow the rest of the process as usual and finally map the page number in the formatted markdown to get the actual number instead of index. - raise warning when both select_pages and maintain used. - required adaptations and updates in messages, exceptions, types, processor, utils etc Fixes/improvements: - memory efficient pdf to image conversion, utilizing paths only option to directly get sorted image paths from pdf2image api Misc: - Bump the version tag - documentation updated
#39) * Python SDK Changes: - Feature added: now specific pages can be processed with the python sdk using "select_pages" param. Incorporates #23, #24 for python sdk - workflow for the above feature: create a new temperory pdf in the tempdir if select_pages is specified and follow the rest of the process as usual and finally map the page number in the formatted markdown to get the actual number instead of index. - raise warning when both select_pages and maintain used. - required adaptations and updates in messages, exceptions, types, processor, utils etc Fixes/improvements: - memory efficient pdf to image conversion, utilizing paths only option to directly get sorted image paths from pdf2image api Misc: - Bump the version tag - documentation updated * Minor update in README.md --------- Co-authored-by: Pradyumna Singh Rathore <[email protected]>
…hen-only-converting-specific-pages fix: page numbering when only converting specific pages
getomni-ai#39) * Python SDK Changes: - Feature added: now specific pages can be processed with the python sdk using "select_pages" param. Incorporates getomni-ai#23, getomni-ai#24 for python sdk - workflow for the above feature: create a new temperory pdf in the tempdir if select_pages is specified and follow the rest of the process as usual and finally map the page number in the formatted markdown to get the actual number instead of index. - raise warning when both select_pages and maintain used. - required adaptations and updates in messages, exceptions, types, processor, utils etc Fixes/improvements: - memory efficient pdf to image conversion, utilizing paths only option to directly get sorted image paths from pdf2image api Misc: - Bump the version tag - documentation updated * Minor update in README.md --------- Co-authored-by: Pradyumna Singh Rathore <[email protected]>
Context
Follow up of #23 (cf. #23 (comment)) to fix page numbering when not converting all pages.
When we only convert specific pages, the
formattedPages.page
index will not be correct.Description
Depending on the use case, we can retrieve the proper index:
pagesToConvertAsImages === -1
: we simply use theaggregatedMarkdown
array index, same as before.pagesToConvertAsImages
is an array, we use theaggregatedMarkdown
array index to find the corresponding element inpagesToConvertAsImages
pagesToConvertAsImages
is a number, we simply use that valueNote:
pdf2pic
will always return the pages in order, regardless the initial order inpagesToConvertAsImages
parameter. Therefore, it's better to always sort thepagesToConvertAsImages
array for consistency.Testing notes
Here are a few tests cases (check the output to see if the
page
number is correct):