maybe_is_text() discards valid text due to spaces from titles and tables #710

Open
loesinghaus opened this issue Nov 19, 2024 · 1 comment
Labels
bug Something isn't working

loesinghaus commented Nov 19, 2024

I noticed that the maybe_is_text() check discards quite a few valid, well-parsed publications. The issue is that it checks the entropy of only the first text chunk of a document. Parsing with pymupdf can introduce a lot of spaces, especially when the first few pages contain a title page, tables, or something similar (which they very often do, especially for books). A chunk dominated by spaces has low character entropy, so it can fall below the threshold even when the rest of the document is fine. It might be better to average the entropy across text chunks from the middle of the document.
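
A minimal sketch of the averaging idea, assuming the document is already available as a list of text chunks. The names char_entropy, maybe_is_text_avg, and n_middle are hypothetical and not part of the library; the entropy is computed the same way as in the existing check, over string.printable.

```python
import math
import string


def char_entropy(s: str) -> float:
    """Shannon entropy (bits) of s over string.printable; 0.0 for empty input."""
    if not s:
        return 0.0
    entropy = 0.0
    for c in string.printable:
        p = s.count(c) / len(s)
        if p > 0:
            entropy -= p * math.log2(p)
    return entropy


def maybe_is_text_avg(
    chunks: list[str], thresh: float = 2.5, n_middle: int = 3
) -> bool:
    """Hypothetical variant: average entropy over chunks from the middle."""
    if not chunks:
        return False
    # Take up to n_middle chunks (at least one), centered in the document
    n = max(1, min(n_middle, len(chunks)))
    start = (len(chunks) - n) // 2
    middle = chunks[start : start + n]
    mean_entropy = sum(char_entropy(c) for c in middle) / len(middle)
    return mean_entropy > thresh


# A space-padded title chunk first, ordinary prose in the middle
chunks = [
    "Title" + " " * 200,
    "The quick brown fox jumps over the lazy dog.",
    "Pack my box with five dozen liquor jugs.",
    "Sphinx of black quartz, judge my vow.",
    " " * 100,
]
print(maybe_is_text_avg(chunks))
```

With this layout the padded first and last chunks are skipped, so only the prose in the middle contributes to the average.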

Alternatively, computing the entropy of the text with spaces removed fixed it for my PDFs:

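import math
import string


def maybe_is_text(s: str, thresh: float = 2.5) -> bool:
    # Drop spaces first so padding from title pages and tables does not dilute the entropy
    s_wo_spaces = s.replace(" ", "")
    if not s_wo_spaces:
        # Empty or all-space input; this also avoids dividing by zero below
        return False
    # Calculate the Shannon entropy (in bits) over printable characters
    entropy = 0.0
    for c in string.printable:
        p = s_wo_spaces.count(c) / len(s_wo_spaces)
        if p > 0:
            entropy += -p * math.log2(p)

    return entropy > thresh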
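
To illustrate why stripping spaces helps, here is a small sketch that compares the entropy of a space-padded string with and without its spaces. The helper printable_entropy and the sample string are made up for illustration; the padding only imitates what a parsed title page might look like.

```python
import math
import string


def printable_entropy(s: str) -> float:
    """Shannon entropy (bits) of s over string.printable; 0.0 for empty input."""
    if not s:
        return 0.0
    entropy = 0.0
    for c in string.printable:
        p = s.count(c) / len(s)
        if p > 0:
            entropy -= p * math.log2(p)
    return entropy


# Hypothetical title-page chunk: two short words separated by heavy padding
padded = "Title" + " " * 200 + "Author"

with_spaces = printable_entropy(padded)
without_spaces = printable_entropy(padded.replace(" ", ""))

print(f"with spaces:    {with_spaces:.2f}")
print(f"without spaces: {without_spaces:.2f}")
```

For this sample, the padded string falls below the default threshold of 2.5 while the same string without spaces clears it.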

jamesbraza (Collaborator) commented

I like what you're thinking. Feel free to open a PR and expand test_maybe_is_text in tests/test_paperqa.py.
