Can we replace/blur some text in a PDF without breaking any layout or formatting? #31

priyanshujain · 2018-04-24T07:59:09Z

No description provided.

gettalong · 2018-04-24T18:19:39Z

Replacing text depends on whether the replacement text fits into the place of the replaced text. If so, then yes, this is possible.

As for blurring text: This may be possible by using PDF transparency and overlaying suitable graphics. However, this would only blur the text visually. Text selection or text extraction would still find the text underneath.

priyanshujain · 2018-06-07T08:10:16Z

@gettalong I tried this for blurring but didn't understand how to replace text using hexapdf.

require 'hexapdf'

class ShowTextProcessor < HexaPDF::Content::Processor

  def initialize(page, to_hide_arr)
    super()
    @canvas = page.canvas(type: :overlay)
    @to_hide_arr = to_hide_arr
  end

  def show_text(str)
    boxes = decode_text_with_positioning(str)
    return if boxes.string.empty?
    if @to_hide_arr.include? boxes.string
      # The rectangles are filled, so set the fill color (not the stroke color).
      @canvas.fill_color(0, 0, 0)

      boxes.each do |box|
        x, y = *box.lower_left
        tx, ty = *box.upper_right
        @canvas.rectangle(x, y, tx - x, ty - y).fill
      end
    end
  end
  alias :show_text_with_positioning :show_text

end

file_name = ARGV[0]
strings_to_black = ARGV[1].split("|")

doc = HexaPDF::Document.open(file_name)
puts "Blacken strings [#{strings_to_black}], inside [#{file_name}]."
doc.pages.each.with_index do |page, index|
  processor = ShowTextProcessor.new(page, strings_to_black)
  page.process_contents(processor)
end

new_file_name = "#{file_name.split('.').first}_updated.pdf"
doc.write(new_file_name, optimize: true)

puts "Writing updated file [#{new_file_name}]."

gettalong · 2018-06-08T04:35:09Z

The check for the boxes in #show_text is invalid; you need to use something like if @to_hide_arr.any? {|str| boxes.string.include?(str) }.
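To illustrate the difference between the two checks, here is a minimal sketch in plain Ruby (no HexaPDF needed). The strings are made-up stand-ins for what boxes.string might return for three text-showing operators; they are assumptions for illustration, not output from a real PDF.

```ruby
# Hypothetical decoded strings, standing in for boxes.string.
decoded_strings = ["Invoice for John Smith", "John Smith", "Total: 42"]
to_hide = ["John Smith"]

# Original check: true only when the whole decoded string equals an entry.
exact_hits = decoded_strings.select { |s| to_hide.include?(s) }

# Suggested check: true when any entry occurs somewhere inside the string.
substring_hits = decoded_strings.select do |s|
  to_hide.any? { |str| s.include?(str) }
end

puts "exact:     #{exact_hits.inspect}"
puts "substring: #{substring_hits.inspect}"
```

Note that a substring hit only tells you the string contains the text. Covering just the matching glyphs, rather than the whole string, still requires locating their indices within boxes.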

If you want to black out a whole boxes instance, you don't need to iterate over its characters; you can just use boxes.lower_left and boxes.upper_right, but only if the text is laid out in a straight line.

Generally, text may be rotated, skewed, etc., so if you want to be sure to get the correct area, you need to iterate over the GlyphBoxes and use their #points method.
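A rough sketch of why two corners are not enough for rotated text, in plain Ruby with a stand-in for a glyph box. It assumes the corner points come as a flat array [llx, lly, lrx, lry, urx, ury, ulx, uly], counterclockwise from the lower-left corner; the helper names rotated_box_points and enclosing_rectangle are hypothetical and not part of HexaPDF.

```ruby
# Corner points of a width x height box rotated around the origin.
# Assumed order: [llx, lly, lrx, lry, urx, ury, ulx, uly].
def rotated_box_points(width, height, degrees)
  rad = degrees * Math::PI / 180
  cos = Math.cos(rad)
  sin = Math.sin(rad)
  corners = [[0, 0], [width, 0], [width, height], [0, height]]
  corners.flat_map { |x, y| [x * cos - y * sin, x * sin + y * cos] }
end

# Axis-aligned rectangle [x, y, width, height] enclosing all corner points.
def enclosing_rectangle(points)
  xs, ys = points.each_slice(2).to_a.transpose
  [xs.min, ys.min, xs.max - xs.min, ys.max - ys.min]
end

points = rotated_box_points(10, 10, 45)

# Rectangle built from the lower-left and upper-right corners only.
llx, lly, _, _, urx, ury = points
naive = [llx, lly, urx - llx, ury - lly]

safe = enclosing_rectangle(points)

puts "naive width: #{naive[2].round(3)}, enclosing width: #{safe[2].round(3)}"
```

For a box rotated by 45 degrees the two-corner rectangle collapses to (almost) zero width, while the enclosing rectangle covers the glyph. Filling the four corners as a closed path instead would cover exactly the glyph area.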

Also note that the decoded string boxes.string may contain a single character, a whole word, or a sentence. If the PDF you are processing is reasonably well behaved, meaning its writer tried to minimize the number of text drawing instructions, your code will work fine (except where a word is broken across lines). However, it won't work in the general case, since text instructions in a PDF may appear in any order, including backwards, upwards, or random. An analysis of all text on a page therefore has to be done, reconstructing the normal flow of text and extrapolating words and sentences, to achieve somewhat consistent behaviour.
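As a very simplified sketch of such an analysis, the following plain Ruby groups glyphs into lines by their baseline and sorts each line left to right. It assumes horizontal, unrotated text and PDF-style coordinates (y grows upwards); the Glyph struct and reading_order helper are hypothetical stand-ins, not HexaPDF classes.

```ruby
# Stand-in for a positioned glyph: its text and lower-left corner.
Glyph = Struct.new(:string, :x, :y)

# Groups glyphs into lines by baseline (within `tolerance` units), orders the
# lines top to bottom and the glyphs within each line left to right.
def reading_order(glyphs, tolerance: 2.0)
  lines = []
  glyphs.sort_by { |g| -g.y }.each do |glyph|
    line = lines.last
    if line && (line.first.y - glyph.y).abs <= tolerance
      line << glyph
    else
      lines << [glyph]
    end
  end
  lines.map { |line| line.sort_by(&:x) }
end

# Glyphs emitted in scrambled order, as a PDF writer is free to do.
scrambled = [
  Glyph.new("b", 10, 700), Glyph.new("d", 10, 680),
  Glyph.new("a", 0, 700),  Glyph.new("c", 0, 680)
]
text = reading_order(scrambled).map { |line| line.map(&:string).join }

puts text.inspect
```

Real documents need more than this (columns, rotated text, varying font sizes), so treat it as a starting point only.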

priyanshujain · 2018-06-08T08:02:41Z

@gettalong Can we do something to black out strings when the decoded string boxes.string contains only a single character, or how can we iterate through all boxes?

Or, if we could treat the whole page content as the string of a single boxes element, we could iterate through the boxes and black out specific ones.

priyanshujain · 2018-06-08T12:28:10Z

@gettalong I found a temporary solution with:

require 'hexapdf'

class ShowTextProcessor < HexaPDF::Content::Processor

  def initialize(page, to_hide_arr)
    super()
    @canvas = page.canvas(type: :overlay)
    @to_hide_arr = to_hide_arr
    @boxeslist = []
  end

  def show_text(str)
    boxes = decode_text_with_positioning(str)
    boxes.each do |box|
        @boxeslist << box
    end
  end

  # Covers every run of consecutive glyph boxes whose concatenated text
  # equals one of the strings to hide.
  def blackout_text
    @to_hide_arr.each do |hide_item|
      @boxeslist.each_index do |index|
        if hide_item == sum_string(index, hide_item.length)
          blackout_array(index, hide_item.length)
        end
      end
    end
  end

  # Draws a filled rectangle over `length` glyph boxes starting at start_ind.
  # The fill is white, which hides the text on a white page; use (0, 0, 0)
  # for a black bar.
  def blackout_array(start_ind, length)
    @canvas.fill_color(255, 255, 255)
    (@boxeslist[start_ind, length] || []).each do |box|
      x, y = *box.lower_left
      tx, ty = *box.upper_right
      @canvas.rectangle(x, y, tx - x, ty - y).fill
    end
  end

  # Concatenates the text of up to `length` glyph boxes starting at start_ind.
  def sum_string(start_ind, length)
    (@boxeslist[start_ind, length] || []).map(&:string).join
  end

  alias :show_text_with_positioning :show_text

end


file_name = ARGV[0]
strings_to_black = ARGV[1].split("|")

doc = HexaPDF::Document.open(file_name)
puts "Blacken strings [#{strings_to_black}], inside [#{file_name}]."
doc.pages.each.with_index do |page, index|
  processor = ShowTextProcessor.new(page, strings_to_black)
  page.process_contents(processor)
  processor.blackout_text()
end

new_file_name = "#{file_name.split('.').first}_updated.pdf"
doc.write(new_file_name, optimize: true)

puts "Writing updated file [#{new_file_name}]."

gettalong · 2018-06-09T03:59:36Z

This code might work as long as the text was laid out in a linear order by the PDF writer. However, since you don't make use of the positional information while concatenating strings, there will be cases where it won't work.

One example is non-linear text output by the PDF writer, another is non-output of spaces between words.

In most cases, however, it should work fine because linear text output is one easy way to save space when writing a content stream.
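For the missing-spaces case, positional information can be used to guess word boundaries. The sketch below is plain Ruby with a hypothetical GapBox struct (text plus left and right edge) and an assumed gap threshold; a real implementation would probably derive the threshold from the font size.

```ruby
# Stand-in for a glyph box on a single horizontal line.
GapBox = Struct.new(:string, :left, :right)

# Joins glyph strings, inserting a space wherever the horizontal gap between
# neighbouring boxes exceeds `min_gap`.
def join_with_gaps(boxes, min_gap: 2.0)
  sorted = boxes.sort_by(&:left)
  sorted.each_with_index.map do |box, i|
    gap = i.zero? ? 0 : box.left - sorted[i - 1].right
    (gap > min_gap ? " " : "") + box.string
  end.join
end

# No space glyph was written between "o" and "S", only a positional jump.
boxes = [
  GapBox.new("J", 0, 5),   GapBox.new("o", 5, 10),
  GapBox.new("S", 14, 19), GapBox.new("m", 19, 26)
]
line = join_with_gaps(boxes)

puts line
```

Plain concatenation would yield "JoSm" here, so a search for "Jo Sm" would fail unless the gap is taken into account.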

priyanshujain · 2018-06-20T18:25:58Z

@gettalong Thanks for the detailed explanation.

gettalong self-assigned this Apr 24, 2018

gettalong added the question label Apr 24, 2018

priyanshujain closed this as completed Jun 20, 2018

nigeljonez mentioned this issue Jul 17, 2019

Corruption of PDF mysociety/alaveteli#3446

Closed

joshco mentioned this issue Jun 6, 2020

How to remove a section of text? #37

Closed
