Add option to export_to_markdown to mark page breaks #309
Comments
Hi @cau-git I am interested in working on this issue. |
@chakravarthik27 Thanks, we would welcome a contribution for this issue! As a starting point, it would require extending this method. |
@chakravarthik27 |
@chakravarthik27 Absolutely not, go ahead! Just make a PR in https://github.com/DS4SD/docling-core! |
I'm a bit confused. I haven't started yet, so please continue, @sunwoongc. |
@sunwoongc @chakravarthik27 Please coordinate with each other for the markdown pagebreaks and let us know when you expect it to be done. |
@chakravarthik27 @PeterStaar-IBM Thank you for your input! I noticed a simple but important detail: most items inheriting from DocItem have an attribute called prov, which includes a page_no field for tracking provenance. For reference, here's the ProvenanceItem. However, the GroupItem class lacks this attribute, as it's designated as a container type. See GroupItem. To handle this in the export_to_markdown function, I've added the following code:

    prev_page_no = -1
    page_change_flag = False
    for ix, (item, level) in enumerate(doc.iterate_items(doc.body, with_groups=True)):
        if not isinstance(item, GroupItem):
            cur_page_no = item.prov[0].page_no
            if prev_page_no != cur_page_no:
                page_change_flag = True
            else:
                page_change_flag = False
            # Append text if page has changed
            if page_change_flag:
                mdtexts.append(f"Page {cur_page_no}")
            # Update previous page number after handling change
            prev_page_no = cur_page_no

I'm concerned this solution may encounter edge cases. If you have any suggestions or foresee potential issues, I'd appreciate your feedback! |
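One edge case worth checking is an item whose prov list is empty, since indexing prov[0] would then raise an IndexError. The following is a minimal, self-contained sketch of how that could be guarded against. It does not use docling-core: `Prov`, `Item`, `page_of`, and `with_page_markers` are hypothetical stand-ins invented for illustration, and the real classes may behave differently.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional


@dataclass
class Prov:
    """Hypothetical stand-in for a provenance entry with a page number."""
    page_no: int


@dataclass
class Item:
    """Hypothetical stand-in for a document item; group-like items have no prov."""
    text: str
    prov: List[Prov] = field(default_factory=list)


def page_of(item: Item) -> Optional[int]:
    """Return the item's first page number, or None if it has no provenance."""
    prov = getattr(item, "prov", None)
    return prov[0].page_no if prov else None


def with_page_markers(items: Iterable[Item]) -> List[str]:
    """Emit a 'Page N' marker whenever the page number changes."""
    out: List[str] = []
    prev_page_no: Optional[int] = None
    for item in items:
        cur_page_no = page_of(item)
        # Items without provenance never trigger a marker and do not
        # reset the remembered page number.
        if cur_page_no is not None and cur_page_no != prev_page_no:
            out.append(f"Page {cur_page_no}")
            prev_page_no = cur_page_no
        if item.text:
            out.append(item.text)
    return out


items = [
    Item("Title", [Prov(1)]),
    Item("", []),  # group-like item without provenance
    Item("Intro", [Prov(1)]),
    Item("Details", [Prov(2)]),
]
result = with_page_markers(items)
```

Treating "no provenance" as "same page as before" is only one possible choice. It avoids both the IndexError and spurious markers, but an item that spans several pages would still be attributed to its first page only.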
As suggested in this discussion, we should add a placeholder feature for page breaks, the same way we support placeholders for pictures.
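As a rough illustration of what such a placeholder option could look like, here is a minimal sketch that works on plain (page number, text) pairs instead of a real document object. The function `join_pages` and the parameter name `page_break_placeholder` are assumptions made for this example, not an existing API.

```python
from typing import List, Optional, Tuple


def join_pages(
    chunks: List[Tuple[int, str]],
    page_break_placeholder: Optional[str] = None,
) -> str:
    """Join (page_no, text) chunks, inserting a placeholder between pages.

    `page_break_placeholder` is a hypothetical parameter; None disables markers.
    """
    parts: List[str] = []
    prev_page: Optional[int] = None
    for page_no, text in chunks:
        # Insert the placeholder only between pages, never before the first one.
        if (
            page_break_placeholder is not None
            and prev_page is not None
            and page_no != prev_page
        ):
            parts.append(page_break_placeholder)
        parts.append(text)
        prev_page = page_no
    return "\n\n".join(parts)


chunks = [(1, "# Title"), (1, "Intro"), (2, "Details")]
md_plain = join_pages(chunks)
md_marked = join_pages(chunks, page_break_placeholder="<!-- page break -->")
```

Defaulting the placeholder to None keeps the existing output unchanged unless the caller opts in, which seems like the least surprising behavior for current users of the export.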