Can I use mupdf.js to get the actual font name, and modify text content before it is rendered to HTML elements? #109

Open
longnight opened this issue Aug 22, 2024 · 2 comments

Comments

@longnight

longnight commented Aug 22, 2024

In PyMuPDF I can do this on PDF files that have a text layer:

text_dict = page.get_text("dict")
for bl in text_dict['blocks']:
    for line in bl.get('lines', []):
        for span in line.get('spans', []):
            print(span.get('font'))    # here I get the actual font name

But PDF.js changes the font name to an internal identifier like "g_d0_f18".
In mupdf.js, can I extract these text blocks with the actual font name, as the Python script does?

My second question, still for PDFs with a text layer:

Can I replace or modify some text content before it is rendered into page/HTML elements in the viewer? I need to replace some special symbols (they are set in a special custom font) with other characters, so that when others select and copy the text they get the modified version of the content.

@jamie-lemon
Collaborator

For the first question, have you tried toStructuredText()? See https://mupdfjs.readthedocs.io/en/latest/how-to-guide/node/document/index.html#extracting-document-text

const stext = page.toStructuredText("preserve-whitespace").asJSON()
console.log(`stext=${stext}`)
const json = JSON.parse(stext);
// Log the object itself; a template string would print "[object Object]"
console.log("json=", json)

This gives me reasonable font names for the text objects.
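To mirror the PyMuPDF loop from the question, here is a minimal sketch that walks the parsed JSON and collects the font names. It assumes the parsed object has a `blocks` array whose entries hold `lines`, each with a `font.name` and a `text` field; that shape is an assumption and should be checked against the actual `asJSON()` output. `collectFontNames` and `sampleStext` are hypothetical names used only for illustration.

```javascript
// Collect the distinct font names found in a parsed structured-text object.
// ASSUMPTION: shape is { blocks: [ { lines: [ { font: { name }, text } ] } ] }.
function collectFontNames(stextJson) {
  const names = new Set();
  for (const block of stextJson.blocks ?? []) {
    // Blocks without lines (for example image blocks) are skipped.
    for (const line of block.lines ?? []) {
      const name = line.font?.name;
      if (name) names.add(name);
    }
  }
  return [...names];
}

// Hypothetical stand-in for JSON.parse(page.toStructuredText(...).asJSON()).
const sampleStext = {
  blocks: [
    {
      type: "text",
      lines: [
        { font: { name: "Helvetica", size: 12 }, text: "Hello" },
        { font: { name: "MySymbols", size: 12 }, text: "\uE001" },
        { font: { name: "Helvetica", size: 12 }, text: "world" },
      ],
    },
    { type: "image" },
  ],
};

console.log(collectFontNames(sampleStext));
```

With a real document, the `json` object parsed above would be passed in place of `sampleStext`.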

For the second question, I think you would need to redact the content and then insert your own version of the text. See redaction, https://mupdfjs.readthedocs.io/en/latest/how-to-guide/node/annotations/redactions/index.html , and then adding text, https://mupdfjs.readthedocs.io/en/latest/how-to-guide/node/page/index.html#adding-text-to-pages . The API here is a bit tricky, and we will be working on providing a simpler API for this kind of thing in the future.
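Before redacting anything, it may help to work out which lines need replacing and what the new text should be. The sketch below covers only that planning step and calls no mupdf.js API: it scans the parsed structured-text JSON for lines set in a given font and maps their characters through a replacement table. The JSON shape (`blocks`, `lines`, `font.name`, `text`, `bbox`) is an assumption, and `planReplacements`, `sampleForPlan`, the font name "MySymbols", and the character mapping are all hypothetical.

```javascript
// Build a list of { bbox, original, replacement } entries for lines that use
// the given font and contain at least one character listed in charMap.
// ASSUMPTION: each line looks like { font: { name }, text, bbox }.
function planReplacements(stextJson, fontName, charMap) {
  const plan = [];
  for (const block of stextJson.blocks ?? []) {
    for (const line of block.lines ?? []) {
      if (line.font?.name !== fontName) continue;
      const original = line.text ?? "";
      // Map code point by code point; unmapped characters are kept as they are.
      const replacement = Array.from(original, (ch) => charMap.get(ch) ?? ch).join("");
      if (replacement !== original) {
        plan.push({ bbox: line.bbox, original, replacement });
      }
    }
  }
  return plan;
}

// Hypothetical sample: one line in a custom symbol font, one in a normal font.
const sampleForPlan = {
  blocks: [
    {
      type: "text",
      lines: [
        { font: { name: "MySymbols" }, text: "\uE001 item", bbox: { x: 10, y: 20, w: 50, h: 12 } },
        { font: { name: "Helvetica" }, text: "\uE001 untouched", bbox: { x: 10, y: 40, w: 80, h: 12 } },
      ],
    },
  ],
};

const symbolMap = new Map([["\uE001", "*"]]);
console.log(planReplacements(sampleForPlan, "MySymbols", symbolMap));
```

Each entry's `bbox` would then be the region to redact and `replacement` the text to add back, using the redaction and text-adding guides linked above.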

@longnight
Author

Thank you for your detailed answer. Migrating the PDF library of a custom viewer is not a small project. I will keep an eye on this library until it matures. @jamie-lemon
