Can we replace or blur some text in a PDF without breaking the layout or formatting? #31

Closed
priyanshujain opened this issue Apr 24, 2018 · 7 comments

@priyanshujain

No description provided.

@gettalong gettalong self-assigned this Apr 24, 2018
@gettalong
Owner

Replacing text depends on whether the replacement text fits into the place of the replaced text. If so, then yes, this is possible.

As for blurring text: this may be possible by using PDF transparency and overlaying suitable graphics. However, this would only blur the text visually. Text selection or text extraction would still find the text underneath.

@priyanshujain
Author

priyanshujain commented Jun 7, 2018

@gettalong I tried this for blurring, but I didn't understand how to replace text using HexaPDF.

require 'hexapdf'

class ShowTextProcessor < HexaPDF::Content::Processor

  def initialize(page, to_hide_arr)
    super()
    @canvas = page.canvas(type: :overlay)
    @to_hide_arr = to_hide_arr
  end

  def show_text(str)
    boxes = decode_text_with_positioning(str)
    return if boxes.string.empty?
    if @to_hide_arr.include? boxes.string
      # Rectangles are filled, so the fill color (not the stroke color) applies.
      @canvas.fill_color(0, 0, 0)

      boxes.each do |box|
        x, y = *box.lower_left
        tx, ty = *box.upper_right
        @canvas.rectangle(x, y, tx - x, ty - y).fill
      end
    end
  end
  alias :show_text_with_positioning :show_text

end

file_name = ARGV[0]
strings_to_black = ARGV[1].split("|")

doc = HexaPDF::Document.open(file_name)
puts "Blacken strings [#{strings_to_black}], inside [#{file_name}]."
doc.pages.each.with_index do |page, index|
  processor = ShowTextProcessor.new(page, strings_to_black)
  page.process_contents(processor)
end

new_file_name = "#{file_name.split('.').first}_updated.pdf"
doc.write(new_file_name, optimize: true)

puts "Writing updated file [#{new_file_name}]."

@gettalong
Owner

gettalong commented Jun 8, 2018

The check for the boxes in #show_text is invalid: it only matches when a whole decoded string is exactly equal to one of the search strings. You need to use something like if @to_hide_arr.any? {|str| boxes.string.include?(str) }.
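To illustrate the difference between the two checks, here is a minimal sketch in plain Ruby. The `to_hide` list and the `run` string are made-up stand-ins for `@to_hide_arr` and `boxes.string`; no HexaPDF objects are involved.

```ruby
# Stand-in for the list of strings to hide (hypothetical values).
to_hide = ["secret", "4111"]

# Exact match: true only if the whole decoded run equals a search string.
def exact_match?(to_hide, run)
  to_hide.include?(run)
end

# Substring match: true if any search string occurs somewhere inside the run.
def substring_match?(to_hide, run)
  to_hide.any? { |str| run.include?(str) }
end

# Stand-in for one value of boxes.string.
run = "The secret code"
puts exact_match?(to_hide, run)      # false
puts substring_match?(to_hide, run)  # true
```

Note that the substring check only says that the run contains a search string somewhere; blacking out the whole run would then also cover the surrounding characters.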

If you want to black out a whole boxes instance, you don't need to iterate over its characters; you can just use boxes.lower_left and boxes.upper_right, but only if the text is laid out in a straight line.

Generally, text may be rotated, skewed, etc., so if you want to make sure to get the correct area, you need to iterate over the GlyphBoxes and use the #points method.
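As a rough sketch of why all four corners matter, the following plain-Ruby helper computes the axis-aligned rectangle that encloses a rotated box. It assumes the corners are available as a flat array of eight coordinates; `enclosing_rect` is a hypothetical helper, not part of HexaPDF.

```ruby
# Returns [x, y, width, height] of the axis-aligned rectangle that encloses
# all corners. `points` is assumed to be a flat array of coordinates:
# [x1, y1, x2, y2, x3, y3, x4, y4].
def enclosing_rect(points)
  xs, ys = points.each_slice(2).to_a.transpose
  [xs.min, ys.min, xs.max - xs.min, ys.max - ys.min]
end

# Corners of a box rotated by 45 degrees:
# lower left, lower right, upper right, upper left.
rotated = [0, 0, 3, 3, 1, 5, -2, 2]
p enclosing_rect(rotated)  # [-2, 0, 5, 5]
```

For this box, a rectangle spanned only by the lower-left corner (0, 0) and the upper-right corner (1, 5) would miss most of the glyph area. The enclosing rectangle covers more than the glyph itself, so filling the four-point path directly would be the tighter option.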

Also note that the decoded string boxes.string may contain a single character, a whole word or a sentence. If the PDF you are processing was created by a well-behaved PDF writer, the writer tries to minimize the number of text drawing instructions and your code will work fine (except in the case where a word is broken across lines). However, it won't work in the general case since text instructions in a PDF may appear in any order, including backwards, upwards or random. Therefore all text on a page has to be analyzed, reconstructing the normal flow of text and extrapolating words and sentences, to achieve somewhat consistent behaviour.
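A very simplified sketch of such a reconstruction, in plain Ruby: collect all glyphs first, group them into lines by their vertical position and sort each line from left to right. `PlacedGlyph`, `reading_order` and the `tolerance` value are hypothetical, and the sketch assumes horizontal, unrotated, left-to-right text.

```ruby
# Hypothetical stand-in for a glyph box: one character plus the lower-left
# corner of the place where it was drawn.
PlacedGlyph = Struct.new(:string, :x, :y)

# Groups glyphs into lines (y within `tolerance` of the first glyph of the
# line) and returns the lines top to bottom, each sorted left to right.
def reading_order(glyphs, tolerance: 2.0)
  lines = []
  glyphs.sort_by { |g| -g.y }.each do |g|
    line = lines.last
    if line && (line.first.y - g.y).abs <= tolerance
      line << g
    else
      lines << [g]
    end
  end
  lines.map { |line| line.sort_by(&:x) }
end

# Glyphs emitted in scrambled order, as a PDF writer is free to do.
glyphs = [
  PlacedGlyph.new("e", 10, 80), PlacedGlyph.new("i", 10, 100),
  PlacedGlyph.new("H", 0, 100), PlacedGlyph.new("s", 20, 80.5),
  PlacedGlyph.new("y", 0, 80)
]
text = reading_order(glyphs).map { |line| line.map(&:string).join }
p text  # ["Hi", "yes"]
```

Real pages need more than this (columns, rotated text, words broken across lines), so this only shows the basic idea of ordering by position instead of by instruction order.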

@priyanshujain
Author

priyanshujain commented Jun 8, 2018

@gettalong Can we black out strings when the decoded string boxes.string contains only a single character, or how can we iterate through all boxes?

Alternatively, if we could treat the whole page content as a single boxes string, we could iterate through the boxes and black out specific ones.

@priyanshujain
Author

@gettalong I found a temporary solution:

require 'hexapdf'

class ShowTextProcessor < HexaPDF::Content::Processor

  def initialize(page, to_hide_arr)
    super()
    @canvas = page.canvas(type: :overlay)
    @to_hide_arr = to_hide_arr
    @boxeslist = []
  end

  def show_text(str)
    boxes = decode_text_with_positioning(str)
    boxes.each do |box|
        @boxeslist << box
    end
  end

  def blackout_text()
    @to_hide_arr.each do |hide_item|
      @boxeslist.each_with_index do |box, index|
        #puts sum_string(index, hide_item.length)
        if hide_item == sum_string(index, hide_item.length)
          blackout_array(index, hide_item.length)
        end
      end
    end
  end

  # Covers `length` glyph boxes, starting at index `start_ind`.
  def blackout_array(start_ind, length)
    @canvas.fill_color(255, 255, 255) # white fill, i.e. the text is whited out
    @boxeslist[start_ind, length].each do |box|
      x, y = *box.lower_left
      tx, ty = *box.upper_right
      @canvas.rectangle(x, y, tx - x, ty - y).fill
    end
  end

  # Concatenates the strings of up to `length` glyph boxes, starting at
  # index `start_ind`; a slice that runs past the end is simply shorter.
  def sum_string(start_ind, length)
    @boxeslist[start_ind, length].map(&:string).join
  end

  alias :show_text_with_positioning :show_text

end


file_name = ARGV[0]
strings_to_black = ARGV[1].split("|")

doc = HexaPDF::Document.open(file_name)
puts "Blacken strings [#{strings_to_black}], inside [#{file_name}]."
doc.pages.each.with_index do |page, index|
  processor = ShowTextProcessor.new(page, strings_to_black)
  page.process_contents(processor)
  processor.blackout_text()
end

new_file_name = "#{file_name.split('.').first}_updated.pdf"
doc.write(new_file_name, optimize: true)

puts "Writing updated file [#{new_file_name}]."

@gettalong
Owner

This code might work as long as the text was laid out in a linear order by the PDF writer. However, since you don't make use of the positional information while concatenating strings, there will be cases where it won't work.

One example is non-linear text output by the PDF writer, another is non-output of spaces between words.
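To sketch how positional information could help with the missing-spaces case, the plain-Ruby example below inserts a space wherever the horizontal gap between two glyphs is large and remembers which glyph each character came from. `LineGlyph`, `join_with_gaps`, `glyphs_for` and the `gap` threshold are hypothetical, and the glyphs are assumed to be on one line and already sorted left to right.

```ruby
# Hypothetical stand-in for a glyph box on a single line of text.
LineGlyph = Struct.new(:string, :x, :width)

# Joins the glyphs in the given order and inserts a space wherever the gap to
# the previous glyph exceeds `gap`. Returns the text plus, for every
# character, the index of the glyph it came from (nil for inserted spaces).
def join_with_gaps(glyphs, gap: 1.5)
  text = +""
  owners = []
  prev = nil
  glyphs.each_with_index do |g, i|
    if prev && g.x - (prev.x + prev.width) > gap
      text << " "
      owners << nil
    end
    text << g.string
    g.string.each_char { owners << i }
    prev = g
  end
  [text, owners]
end

# Indices of the glyphs covered by the first occurrence of `needle`.
def glyphs_for(needle, text, owners)
  start = text.index(needle)
  return [] unless start
  owners[start, needle.length].compact.uniq
end

# "to be" drawn without a space glyph; the words are separated by position only.
glyphs = [
  LineGlyph.new("t", 0, 5), LineGlyph.new("o", 5, 5),
  LineGlyph.new("b", 14, 5), LineGlyph.new("e", 19, 5)
]
text, owners = join_with_gaps(glyphs)
p text                               # "to be"
p glyphs_for("to be", text, owners)  # [0, 1, 2, 3]
p glyphs_for("be", text, owners)     # [2, 3]
```

Plain concatenation would yield "tobe" here, so a search for "to be" would never match. A suitable threshold depends on the font size and would have to be derived from the glyph widths in practice.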

However, in the common case it should work fine, because linear text output is one easy way to save space when writing a content stream.

@priyanshujain
Author

@gettalong Thanks for the detailed explanation.
