Extract text with whitespace and all new lines #383
Hello. I have a report in a PDF, and I would like to extract its text with the line spacing and whitespace preserved.
Replies: 2 comments
I did it slightly differently. I used the Y coordinate of each letter (which is the bottom of the letter) and simply compared each letter with the one before it. If y(1) > y(2), the second letter must be lower on the page, i.e. it starts a new line. Here's the C# code: using Microsoft.Extensions.Configuration; using (var stream = File.OpenRead("newpdf.pdf")) // Hope this helps someone.
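The Y-coordinate comparison described in this reply can be sketched roughly as follows. This is not the poster's code, only a minimal illustration of the idea. The `Glyph` record, the `LineSplitter` class, and the `tolerance` parameter are all hypothetical names introduced here; the sketch assumes PDF-style coordinates, where Y grows upward, so a smaller Y means lower on the page.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for a PDF letter: its text plus the Y of its bottom edge.
// Assumes PDF-style coordinates, where a smaller Y is lower on the page.
public record Glyph(string Value, double Y);

public static class LineSplitter
{
    // Appends glyphs in reading order and emits a newline whenever the next
    // glyph sits lower than the previous one by more than `tolerance`.
    // `tolerance` is an assumed fudge factor that absorbs small baseline jitter.
    public static string JoinWithLineBreaks(IEnumerable<Glyph> glyphs, double tolerance = 1.0)
    {
        var sb = new StringBuilder();
        double? previousY = null;

        foreach (var g in glyphs)
        {
            // y(previous) > y(current): the current glyph is lower, so start a new line.
            if (previousY.HasValue && previousY.Value - g.Y > tolerance)
            {
                sb.Append('\n');
            }

            sb.Append(g.Value);
            previousY = g.Y;
        }

        return sb.ToString();
    }
}
```

With PdfPig, something like `page.Letters.Select(l => new Glyph(l.Value, l.StartBaseLine.Y))` could probably feed this method, assuming the letters arrive in reading order. Multi-column pages would likely need extra handling, since the Y value jumps back up at the top of each new column.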
Hi, this gets me most of the way. I usually use a GAP of 0.3 with OrigRow set to true. GAP is basically used to add space between characters, so you can play with it depending on the font size. Things get a bit messy if the font size changes a lot.
Let me know if you find a better way.
Oh, and you can switch the output to either spool or lines if you want an XML dump of the text data.
using System;
using System.Text;
using System.IO;
using Org.BouncyCastle.Cms;
using System.Collections.Generic;
using System.Xml;
using System.Xml.Linq;
using System.Linq;
using System.Collections;
using UglyToad.PdfPig;
using UglyToad.PdfPig.DocumentLayoutAnalysis.TextExtractor;
namespace PDFTools
{
…
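As a rough illustration of the GAP idea from this reply (and not the poster's actual implementation), the sketch below inserts spaces based on the horizontal distance between neighbouring letters on one line. Everything in it is an assumption: the `PlacedGlyph` record, the `GapSpacer` class, and in particular the reading of GAP as a fraction of the font size, with one space emitted per `gap * FontSize` points of empty room.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical glyph on a single line: text, left edge, right edge, and font size (in points).
public record PlacedGlyph(string Value, double Left, double Right, double FontSize);

public static class GapSpacer
{
    // Joins the glyphs of one line, padding with spaces where the horizontal
    // distance between two glyphs is at least `gap * FontSize`.
    // Assumption: one space is emitted per full `gap * FontSize` of empty room.
    public static string JoinWithSpaces(IEnumerable<PlacedGlyph> line, double gap = 0.3)
    {
        var sb = new StringBuilder();
        double? previousRight = null;

        foreach (var g in line)
        {
            if (previousRight.HasValue)
            {
                double distance = g.Left - previousRight.Value;
                double unit = gap * g.FontSize;

                if (unit > 0 && distance >= unit)
                {
                    // Wider gaps produce more spaces, which approximates the original layout.
                    sb.Append(' ', (int)Math.Floor(distance / unit));
                }
            }

            sb.Append(g.Value);
            previousRight = g.Right;
        }

        return sb.ToString();
    }
}
```

Scaling the threshold by font size is one way to cope with mixed font sizes, though output can still look uneven when sizes change within a line, which matches the messiness mentioned in the reply.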