Add option to export_to_markdown to mark page breaks #309

Open
cau-git opened this issue Nov 11, 2024 · 7 comments
@cau-git
Contributor

cau-git commented Nov 11, 2024

As suggested in this discussion, we should add a placeholder feature for page breaks, the same way we support placeholders for pictures.

  • The placeholder text for page breaks should contain the page number by default.
  • The feature should be disabled by default.
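
One possible reading of these two requirements, as a minimal sketch: the helper name `render_page_break`, the `placeholder` parameter, and the `{page_no}` token are all hypothetical and are not part of the docling-core API. Passing no placeholder leaves the feature off, and a placeholder containing `{page_no}` gets the page number substituted in.

```python
from typing import Optional


def render_page_break(page_no: int, placeholder: Optional[str] = None) -> Optional[str]:
    """Return the marker text for a page break, or None when the feature is off.

    Hypothetical helper: `placeholder` defaults to None, which means disabled.
    """
    if placeholder is None:
        return None
    # Plain substitution, so other braces in the placeholder are left untouched.
    return placeholder.replace("{page_no}", str(page_no))


disabled = render_page_break(3)
enabled = render_page_break(3, "<!-- page {page_no} -->")
```

Using `None` as the default keeps existing exports unchanged unless a caller opts in.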
@cau-git cau-git added the enhancement New feature or request label Nov 11, 2024
@chakravarthik27

Hi @cau-git

I am interested in working on this issue.

@cau-git
Contributor Author

cau-git commented Nov 12, 2024

@chakravarthik27 Thanks, we would welcome your contribution to this issue!

As a starting point, it would require extending this method.

@sunwoongc

@chakravarthik27
Hi, I'm also very interested in resolving this issue. I've found a solution that works in my case, though I'm not sure if it’s a general fix. Would you mind if I also worked on this?

@PeterStaar-IBM
Contributor

@chakravarthik27 Absolutely not, go ahead! Just make a PR in https://github.com/DS4SD/docling-core!

@chakravarthik27

I'm a bit confused, and I haven't started yet, so please continue, @sunwoongc.

@PeterStaar-IBM
Contributor

@sunwoongc @chakravarthik27 Please coordinate with each other for the markdown pagebreaks and let us know when you expect it to be done.

@sunwoongc

@chakravarthik27 @PeterStaar-IBM

Thank you for your input!

I noticed a simple but important detail: most items inheriting from DocItem have an attribute called prov, a list of provenance entries that each carry a page_no field. For reference, here's the ProvenanceItem.

However, the GroupItem class lacks this attribute, as it's designated as a container type. See GroupItem.

To handle this in the export_to_markdown function, I've added the following code:

prev_page_no = -1
for ix, (item, level) in enumerate(doc.iterate_items(doc.body, with_groups=True)):
    # GroupItem is a container without prov; also skip items whose prov list is empty
    if not isinstance(item, GroupItem) and item.prov:
        cur_page_no = item.prov[0].page_no

        # Append the page marker only when the page has changed
        if cur_page_no != prev_page_no:
            mdtexts.append(f"Page {cur_page_no}")

        # Update previous page number after handling the change
        prev_page_no = cur_page_no

I'm concerned this solution may encounter edge cases. If you have any suggestions or foresee potential issues, I'd appreciate your feedback!
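
To make some of the likely edge cases concrete, here is a self-contained sketch of the same idea. It uses stand-in classes (`Prov`, `Item`, `Group`) rather than the real docling-core types, and the function name `export_with_page_markers` and its `placeholder` parameter are hypothetical. It covers three cases: container items with no `prov`, items whose `prov` list is empty, and items whose provenance spans more than one page (only the first entry is used, as in the snippet above).

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Prov:
    """Stand-in for ProvenanceItem: only the page number matters here."""
    page_no: int


@dataclass
class Item:
    """Stand-in for a DocItem subclass with a (possibly empty) prov list."""
    text: str
    prov: List[Prov] = field(default_factory=list)


@dataclass
class Group:
    """Stand-in for GroupItem: a container with no prov attribute."""
    name: str


def export_with_page_markers(
    items: List[Union[Item, Group]], placeholder: str = "Page {page_no}"
) -> List[str]:
    lines: List[str] = []
    prev_page_no = None
    for item in items:
        if isinstance(item, Group):
            continue  # containers carry no provenance and no text of their own
        if item.prov:
            # Only the first provenance entry decides which page the item starts on.
            cur_page_no = item.prov[0].page_no
            if cur_page_no != prev_page_no:
                lines.append(placeholder.replace("{page_no}", str(cur_page_no)))
            prev_page_no = cur_page_no
        # Items without provenance are kept but never trigger a marker.
        lines.append(item.text)
    return lines


sample = [
    Group("body"),
    Item("Title", [Prov(1)]),
    Item("No provenance"),
    Item("Intro", [Prov(1)]),
    Item("Spans pages", [Prov(2), Prov(3)]),
    Item("End", [Prov(3)]),
]
result = export_with_page_markers(sample)
```

In this sketch a marker is also emitted before the first item, which matches the snippet above; whether the first page should get a marker at all is a design choice left open here.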
