How to extract texts between two coordinates in a page? #3955

StephenZKCurry · 2024-10-17T02:08:39Z

I want to extract the text between two coordinates on a page, using the PDF's underlying flow of characters as the guide for ordering and segmenting the words rather than presorting the characters by x/y position. This should mimic how dragging a cursor highlights text in a PDF. How can I do that?

JorjMcKie · 2024-10-17T10:46:14Z

You can supply an arbitrary rectangle ("clip") that contains the text you want. If you only have top and bottom values, build a full-width rectangle: clip = pymupdf.Rect(0, top, page.rect.width, bottom).
Then execute text = page.get_text(sort=True, clip=clip).
With pymupdf v1.24.11 or later, this extracts the text in reading order.
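
A minimal sketch of the steps above, wrapped in small helpers. The function names (band_clip, extract_band, extract_from_file) are hypothetical and only for illustration; the sketch also assumes that top and bottom are y values in PyMuPDF page coordinates (origin at the top-left, y growing downward) and that passing the clip as a plain (x0, y0, x1, y1) tuple is acceptable, since the clip parameter is documented as rect-like. Use pymupdf.Rect(...) instead if you prefer the explicit type.

```python
def band_clip(page_width, top, bottom):
    """Return a rect-like (x0, y0, x1, y1) spanning the full page width.

    Swaps top and bottom if they arrive in the wrong order.
    """
    if top > bottom:
        top, bottom = bottom, top
    return (0, top, page_width, bottom)


def extract_band(page, top, bottom, sort=True):
    """Extract plain text lying between two y coordinates of a page.

    `page` is expected to be a pymupdf.Page. With sort=True the output is
    ordered by position on the page; with sort=False it is returned in the
    order the text is stored in the file.
    """
    clip = band_clip(page.rect.width, top, bottom)
    return page.get_text("text", clip=clip, sort=sort)


def extract_from_file(path, page_number, top, bottom, sort=True):
    """Open a PDF and extract the band from one page (0-based page number)."""
    import pymupdf  # imported here so the helpers above work without it

    with pymupdf.open(path) as doc:
        return extract_band(doc[page_number], top, bottom, sort=sort)
```

For example, extract_from_file("input.pdf", 0, 100, 300) would return the text found between y=100 and y=300 on the first page of a file named input.pdf (a placeholder path). If you want the file's own character flow rather than position-based ordering, try sort=False and compare the results; which one matches cursor selection best depends on how the PDF was produced.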

pymupdf locked and limited conversation to collaborators Oct 17, 2024

JorjMcKie converted this issue into discussion #3959 Oct 17, 2024
