PDF with multiple columns doesn’t extract text properly
When I try to extract text from a PDF with a two-column layout, the text is read row by row across both columns instead of one column at a time.
For example, I have a PDF like so:
the reader will extract the text like so:
How can I configure it to extract text column by column?
I had the same problem. This function splits each extracted line into its columns and then joins the columns one after another. It probably won't work in every case (especially if there are runs of multiple spaces between words inside the same column), but you can try it and see if it works for you. In my case it was a two-column PDF, and the text seems to be extracted mostly fine.
require "open-uri"
require "pdf-reader"

def parse_pdf_columns(pdf_url, numcols = 2)
  io = URI.open(pdf_url)
  reader = PDF::Reader.new(io)
  parsed_text = ""
  reader.pages.each_with_index do |page, page_index|
    columns_text = Array.new(numcols) { "" }
    lines_in_columns = []
    puts "Processing page #{page_index}"
    page.text.split("\n").each do |line|
      # First remove up to 50 leading spaces, otherwise a padded heading
      # might be mistaken for text in another column, then split on runs
      # of 5 or more spaces.
      lines_in_columns << line.sub(/^\s{0,50}/, '').split(/\s{5,}/)
    end
    lines_in_columns.each do |line|
      (0...numcols).each do |col_index|
        cell = line[col_index]
        # Plain-Ruby stand-in for ActiveSupport's `present?`, so this runs outside Rails.
        next if cell.nil? || cell.strip.empty?
        columns_text[col_index] += cell + "\n"
      end
    end
    parsed_text += columns_text.join("\n")
  end
  parsed_text
end
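To try the splitting heuristic without a PDF at hand, here is a minimal sketch that applies the same idea to a plain string. The helper name `split_text_columns` and its `numcols:` and `min_gap:` keywords are made up for illustration and are not part of pdf-reader. It assumes columns are separated by at least `min_gap` spaces.

```ruby
# Hypothetical helper (not part of pdf-reader): regroups row-by-row text
# into column-by-column text, assuming columns are separated by a run of
# at least `min_gap` whitespace characters.
def split_text_columns(text, numcols: 2, min_gap: 5)
  columns = Array.new(numcols) { [] }
  text.each_line(chomp: true) do |line|
    # Drop leading padding, then split on wide gaps between columns.
    cells = line.lstrip.split(/\s{#{min_gap},}/)
    cells.first(numcols).each_with_index do |cell, i|
      columns[i] << cell unless cell.empty?
    end
  end
  # Emit all lines of column 0, then all lines of column 1, and so on.
  columns.map { |col| col.join("\n") }.join("\n")
end

sample = "Left one     Right one\nLeft two     Right two\n"
puts split_text_columns(sample)
```

One limitation of this approach: a row whose left column is blank has its right-hand text assigned to the first column, because the leading padding is stripped before splitting.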