Is there any way to remove this textbox using PyMuPDF for a large number of PDF files? Thanks #3527

zwjat · 2024-05-30T04:17:01Z

I have a large number of PDF files with the following textbox as a watermark or advertisement:

In this picture, the textbox has a blue border, is rotated by some angle, and contains semi-transparent text.

I want to remove this textbox without affecting the text underneath it.

The following code snippet doesn't work, because it also removes the text under the textbox:
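```
import fitz  # PyMuPDF

pf = fitz.open(fp)  # fp: path to one of the PDF files
for pg in range(pf.page_count):
    page = pf[pg]

    tbs = page.get_text_blocks()
    for tb in tbs:
        # tb[4] is the block text; skip blocks without the watermark strings
        if not ('bzfxw' in tb[4] or 'biaozhun.org' in tb[4]):
            continue

        # tb[0..3] is the block rectangle; everything inside it gets redacted
        annot = page.add_redact_annot((tb[0], tb[1], tb[2], tb[3]))

    page.apply_redactions(images=fitz.PDF_REDACT_IMAGE_NONE)
```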

The code above produces the following PDF:

What is the right and effective way to do this? I appreciate your help.

JorjMcKie · 2024-05-30T12:31:04Z

Maintainer

This very much depends on how this watermark is implemented technically.
I cannot say until I see the PDF itself.

Except one thing: it is text and also some vector graphics.

4 replies

zwjat May 31, 2024
Author

6J3aJe6zB6BN.pdf

Thank you for your reply. This is the file.

In this PDF file, we can see a textbox with the text 'www.bzfxw.com' covering the PDF content.

JorjMcKie May 31, 2024
Maintainer

Ahhhh, you are looking to remove a link!
I thought some watermark was in your way ...
Links are easy to remove:

for link in page.get_links():
    page.delete_link(link)

zwjat Jun 1, 2024
Author

Why does it not work for me?

page.get_links() returns [] for the PDF file 6J3aJe6zB6BN.pdf.

zwjat Jun 1, 2024
Author

Why does it not work for me?

page.get_links() returns [] for the PDF file 6J3aJe6zB6BN.pdf.

The following is the whole code:
```
import os
import fitz  # PyMuPDF

BASE_DIR = 'pdfs/'
fns = [x for x in os.listdir(BASE_DIR) if '.pdf' in x]

for fn in fns:
    fp = BASE_DIR + fn
    pf = fitz.open(fp)

    for pg in range(pf.page_count):
        page = pf[pg]

        print(' ------------- ', page.get_links())  # -------------  []
        for link in page.get_links():
            print(' ------------- ', link)
            page.delete_link(link)

    pf.save('jllNav/' + fn)
    pf.close()

    print(' ============= ', fn, ' ======= FINISHED ')
```

JorjMcKie · 2024-06-01T08:00:43Z

Maintainer

Sorry - I took the wrong road.
This link is not defined on the pages. It is defined in a special object which is referenced by every page.

It cannot be detected and removed by the official API.
But once one knows that it is there, it is possible to use low-level functions to remove it.

for page in doc:
    print(doc.xref_get_key(page.xref,"Resources/XObject"))

    
('dict', '<</Im0 89 0 R/Fm0 94 0 R>>')
('dict', '<</Im0 3 0 R/Fm0 94 0 R>>')
('dict', '<</Im0 6 0 R/Fm0 94 0 R>>')
('dict', '<</Im0 9 0 R/Fm0 94 0 R>>')
('dict', '<</Im0 12 0 R/Fm0 94 0 R>>')
('dict', '<</Im0 15 0 R/Fm0 94 0 R>>')
('dict', '<</Im0 18 0 R/Fm0 94 0 R>>')
('dict', '<</Im0 21 0 R/Fm0 94 0 R>>')
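Spotting the shared object by eye works for a handful of pages. For many files, the same comparison could be automated. The following is only a sketch: the helper names `shared_xobject_xrefs` and `collect_xobject_dicts` are made up for illustration, and it assumes every page stores its `/XObject` resources as an inline dictionary, as in the output above. Pages that use an indirect reference or inherit their resources are skipped.

```python
import re
from collections import Counter

# Matches entries such as "/Fm0 94 0 R" and captures the name and xref number.
XOBJECT_ENTRY = re.compile(r"/(\w+)\s+(\d+)\s+\d+\s+R")


def shared_xobject_xrefs(xobject_dicts):
    """Return the xrefs that occur in every given /XObject dictionary string."""
    counts = Counter()
    for text in xobject_dicts:
        # A set ensures an xref listed twice on one page is counted once.
        counts.update({int(xref) for _name, xref in XOBJECT_ENTRY.findall(text)})
    return sorted(x for x, n in counts.items() if n == len(xobject_dicts))


def collect_xobject_dicts(doc):
    """Collect the /XObject dictionary strings of an open PyMuPDF document."""
    result = []
    for page in doc:
        kind, value = doc.xref_get_key(page.xref, "Resources/XObject")
        if kind == "dict":  # assumption: inline dictionaries only
            result.append(value)
    return result


# The first three values printed above:
sample = [
    "<</Im0 89 0 R/Fm0 94 0 R>>",
    "<</Im0 3 0 R/Fm0 94 0 R>>",
    "<</Im0 6 0 R/Fm0 94 0 R>>",
]
print(shared_xobject_xrefs(sample))  # [94]
```

With an open document, `shared_xobject_xrefs(collect_xobject_dicts(doc))` would give the candidates to inspect with `doc.xref_object()`. An object shared by all pages is not necessarily a watermark, so its source should still be checked before anything is removed.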

From the above, we see that object /Fm0 is referenced by all pages (xref 94). Looking at the source of xref 94:

 print(doc.xref_object(94))
<<
  /Subtype /Form
  /Length 96
  /OC 96 0 R
  /PieceInfo <<
    /ADBE_CompoundType <<
      /Private /Watermark
      /LastModified (D:20080804204943+08'00')
    >>
  >>
  /Matrix [ 1 0 0 1 0 0 ]
  /Resources <<
    /Font <<
      /TT0 91 0 R
    >>
    /ProcSet [ /PDF /Text ]
  >>
  /BBox [ 0 -87.1205 516.025 -5.61621 ]
  /LastModified (D:20080804204943+08'00')
  /FormType 1
>>

The /PieceInfo entry (/Private /Watermark) shows that this seems to be a watermark. We can remove the watermark by setting object 94 to empty:

doc.update_object(94, "<<>>")

When saving we will see that the watermark is gone.
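Since the original question was about a large number of files, the xref number will probably differ from file to file, so 94 cannot be hard-coded. Below is a rough sketch of one way to generalize the step. It assumes the other files mark their watermark the same way as this one (a Form XObject whose /PieceInfo contains /Private /Watermark), which I have not verified. The function names `is_watermark_form` and `remove_watermark_forms` are made up for illustration.

```python
import re


def is_watermark_form(obj_source):
    """True if an object definition looks like the Form XObject shown above."""
    is_form = re.search(r"/Subtype\s*/Form\b", obj_source) is not None
    is_marked = re.search(r"/Private\s*/Watermark\b", obj_source) is not None
    return is_form and is_marked


def remove_watermark_forms(src_path, dst_path):
    """Blank every matching object in one PDF and return the xrefs changed."""
    import fitz  # PyMuPDF; imported here so the check above runs without it

    doc = fitz.open(src_path)
    changed = []
    for xref in range(1, doc.xref_length()):
        if is_watermark_form(doc.xref_object(xref)):
            doc.update_object(xref, "<<>>")  # same call as used for xref 94
            changed.append(xref)
    doc.save(dst_path)
    doc.close()
    return changed


# Shortened copy of the object source printed above:
sample = """<<
  /Subtype /Form
  /PieceInfo <<
    /ADBE_CompoundType <<
      /Private /Watermark
    >>
  >>
  /FormType 1
>>"""
print(is_watermark_form(sample))                   # True
print(is_watermark_form("<< /Subtype /Image >>"))  # False
```

Calling `remove_watermark_forms()` once per file in a loop over the input folder would cover the batch case. It is worth checking the returned list for a few files first, because an empty list means the file uses a different watermark technique and was saved unchanged.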
