Extract text with whitespace and all new lines #383

EspressoWillie · 2021-10-22T20:15:07Z

Hello. I have a report in a PDF, and I would like to extract its text with the line spacing and whitespace preserved.
I tried extracting with ContentOrderTextExtractor.GetText(page, True), but it removes the whitespace and the extra new lines.
Thoughts?
Thank you.

Answered by topcat30

topcat30 · 2022-02-22T02:20:33Z

Hi, this gets me most of the way. I usually call it with a gap of 0.3 and OrigRow set to true. The gap does two jobs: a letter is padded only when its distance from the previous letter is greater than gap, and the letter's X position multiplied by gap is used to work out how many spaces to pad with, so you can play with it depending on the font size. Things get a bit messy if the fonts change size quite a lot.
Let me know if you find a better way.

Oh, and you can return lines instead of spool if you want an XML dump of the text data.

using System;
using System.IO;
using System.Linq;
using System.Text;
using System.Xml.Linq;
using UglyToad.PdfPig;
using UglyToad.PdfPig.DocumentLayoutAnalysis.TextExtractor;
namespace PDFTools
{
public class Pdf2Text
{

    public string Encoded(string EncDocin, double gap, bool OrigRow)
    {
        StringBuilder text = new StringBuilder();


        string lines = "";
        string spool = "";
        byte[] str2 = Convert.FromBase64String(EncDocin);
        MemoryStream mstream = new MemoryStream(str2);

        using (var r = PdfDocument.Open(mstream))
        {
            string pageText;
            string pagetext2;
            XElement xmlElements = new XElement("Root");
            for (int i = 1; i <= r.NumberOfPages; i++)

            {
                //ContentOrderTextExtractor ct = new ContentOrderTextExtractor(GetText)

                pageText = "";
                pagetext2 = "";
                var page = r.GetPage(i);
              

                xmlElements.Add(new XElement("Page" + i));
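                // With OrigRow, baselines are rounded to the nearest unit; otherwise they are rounded to the nearest 10 units, which merges nearby rows.
                int groupby = OrigRow ? 1 : 10;
                // Sort top to bottom (PDF Y grows upward), then left to right within a row.
                var sorted = page.Letters.OrderByDescending(x => Math.Round((x.StartBaseLine.Y) / groupby) * groupby).ThenBy(x => x.GlyphRectangle.Left).ToList();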
                //var sorted = page.Letters.OrderByDescending(x => (x.StartBaseLine.Y) / groupby * groupby).ThenBy(x => x.GlyphRectangle.Left).ToList();



                //  var grouped = sorted.GroupBy(x => Math.Round((x.StartBaseLine.Y) / 10) * 10).ToList();
                var grouped = sorted.GroupBy(x => Math.Round((x.StartBaseLine.Y) / groupby) * groupby).ToList();
                //var grouped = sorted.GroupBy(x => x.StartBaseLine.Y*1).ToList();
                //var grouped = sorted.GroupBy(x => x.StartBaseLine.Y).ToList();

                int n = 1;
                string Row = "";
                foreach (var group in grouped)
                {
                    //double gap = .3;
                    Row = "";
                    int LoopCount = 0;
                    int LineCount = 0;
                    double endloop = 0;
                    xmlElements.Element("Page" + i).Add(new XElement("ROW" + n));
                    string varText = "";
                    double previousRight = 0;
                    double previousLeft = 0;
                    string previousText = "";
                    int rowLength = 0;

                    foreach (var p in group)
                    {
                        endloop = 0;
                        LoopCount = 1;
                        // varText = p.Value.PadLeft((int)p.GlyphRectangle.Left - rowLength, ' ');
                        double chunkDif = (p.StartBaseLine.X) - previousRight;
                        //var diff = Math.Abs(LineCount - (p.GlyphRectangle.Left/.3));
                        xmlElements.Element("Page" + i).Element("ROW" + n).Add(new XElement("Chunk",
                            new XElement("Value", p.Value),
                            new XElement("leftPos", p.GlyphRectangle.Left),
                            new XElement("bottomPos", p.GlyphRectangle.Bottom),
                            new XElement("topPosOrig", p.GlyphRectangle.Top),
                            new XElement("topPos", Math.Round((p.GlyphRectangle.Top) / 10) * 10),
                            new XElement("rightPos", p.GlyphRectangle.Right),
                            new XElement("Length", Math.Round(p.GlyphRectangle.Width, 0, MidpointRounding.AwayFromZero)),
                            new XElement("Height", Math.Round(p.GlyphRectangle.Height)),
                            new XElement("Type", p.GlyphRectangle.Rotation),
                            new XElement("Diff", chunkDif)));

                        if (LineCount == 1)
                        {
                            endloop = (p.StartBaseLine.X * gap);
                        }
                        else
                        {
                            //endloop = (diff)/2;
                            if (LineCount <= (p.StartBaseLine.X * gap))
                            { endloop = (((p.StartBaseLine.X * gap)) - LineCount) + p.Value.Length; }
                            else { endloop = (p.StartBaseLine.X * gap) + p.Value.Length; }

                        }
                        if (chunkDif > gap)
                        {
                            //varText = p.Value;
                            varText = p.Value.PadLeft((int)endloop, ' ');
                        }
                        else
                        { varText = p.Value; }



                        Row = Row + varText.Replace("\r\n", "");
                        rowLength = Row.Length;
                        LineCount = Row.Length;//(LoopCount + p.Value.Replace("\r\n", "").Length) + LineCount;
                        previousRight = p.EndBaseLine.X;
                        previousLeft = (p.StartBaseLine.X);
                        previousText = varText;
                    }
                    previousRight = 0;
                    //Console.Write(Row);
                    // Console.ReadLine();
                    if (n == 1)
                    { pageText = pageText + Row; }
                    else { pageText = pageText + "\r\n" + Row; }
                    n++;
                }
                if (i == 1)
                {
                    spool = spool + pageText;
                }
                else { spool = spool + "\f" + pageText; }

                page = null;
            }
            text.Append(xmlElements);
            lines = text.ToString();
            text.Length = 0;
            text.Capacity = 0;
        }

    //    File.WriteAllText(@"C:\Temp\csc2.txt", spool);
    //    File.WriteAllText(@"C:\Temp\csc.xml", lines);
        //Console.WriteLine(lines);
        //Console.ReadLine();
        return spool;
    }
}

}
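
For anyone who wants to experiment with the two ideas in the code above (bucketing letters into rows by rounded Y, then padding each letter out to a column derived from its X position) without a PDF at hand, here is a minimal sketch. It is a simplified take, not the exact logic of the answer: LayoutSketch, Render, and the tuple fields are hypothetical names, the input is plain tuples rather than PdfPig letters, and rowHeight plays the role of the 1-or-10 grouping value.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class LayoutSketch
{
    // Each tuple is a hypothetical stand-in for a PDF letter:
    // its text, the start and end X of its baseline, and its baseline Y.
    public static string Render(
        IEnumerable<(string Text, double StartX, double EndX, double Y)> glyphs,
        double gap,
        double rowHeight)
    {
        // Bucket letters into rows by rounding Y to the nearest multiple of rowHeight.
        // PDF Y grows upward, so the largest Y is the first line of text.
        var rows = glyphs
            .GroupBy(g => Math.Round(g.Y / rowHeight) * rowHeight)
            .OrderByDescending(r => r.Key);

        var lines = new List<string>();
        foreach (var row in rows)
        {
            var line = new StringBuilder();
            double previousEnd = 0;
            foreach (var g in row.OrderBy(g => g.StartX))
            {
                // Scale the X position into a character column.
                int targetColumn = (int)(g.StartX * gap);

                // Pad only across a real horizontal gap, and never move backwards.
                if (g.StartX - previousEnd > gap && targetColumn > line.Length)
                {
                    line.Append(' ', targetColumn - line.Length);
                }

                line.Append(g.Text);
                previousEnd = g.EndX;
            }
            lines.Add(line.ToString());
        }

        return string.Join("\r\n", lines);
    }
}
```

Calling LayoutSketch.Render(glyphs, 0.5, 1) on a handful of hand-written tuples makes it easy to see how changing gap stretches or squeezes the horizontal spacing before trying it on a real page.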

briang84 · 2024-01-19T21:37:21Z

I did it slightly differently. I used each letter's Y coordinate (the bottom of the letter) and simply compared each letter with the previous one. If the previous letter's Y is greater than the current letter's Y, the current letter must be lower on the page, i.e. on a new line. Here's the C# code:

using System;
using System.IO;
using System.Text;
using UglyToad.PdfPig;
using UglyToad.PdfPig.Content;

using (var stream = File.OpenRead("newpdf.pdf"))
using (PdfDocument document = PdfDocument.Open(stream))
{
    var output = new StringBuilder();
    var page = document.GetPage(1);
    double previousY = 0;

    foreach (Letter letter in page.Letters)
    {
        // Location.Y is the bottom of the letter; a smaller Y is lower on the page.
        double y = letter.Location.Y;
        if (previousY > y)
        {
            // The letter sits below the previous one, so start a new line.
            output.Append("\r\n\r\n");
        }
        previousY = y;
        output.Append(letter.Value);
    }

    Console.WriteLine(output);
}

// Hope this helps someone.
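
In case it helps, the Y-comparison idea can also be tried on its own, without PdfPig. The sketch below is only an illustration under a few assumptions: LineBreakSketch, JoinWithLineBreaks, and the tolerance parameter are hypothetical names that are not part of the code above, the input is plain (text, Y) tuples in reading order, and it emits a single line break per drop in Y rather than two.

```csharp
using System.Collections.Generic;
using System.Text;

public static class LineBreakSketch
{
    // Joins letters in reading order and starts a new line whenever Y drops
    // by more than tolerance. The tolerance is an assumption added here so that
    // small baseline jitter is not treated as a new line.
    public static string JoinWithLineBreaks(
        IEnumerable<(string Text, double Y)> letters,
        double tolerance = 0.0)
    {
        var output = new StringBuilder();
        double? previousY = null;

        foreach (var (text, y) in letters)
        {
            // PDF Y grows upward, so a smaller Y means the letter is lower on the page.
            if (previousY.HasValue && previousY.Value - y > tolerance)
            {
                output.Append("\r\n");
            }

            output.Append(text);
            previousY = y;
        }

        return output.ToString();
    }
}
```

A non-zero tolerance is one possible way to keep slightly lowered glyphs on the same line; whether it is needed depends on the document.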
