Text extraction code for columns. #366

peterwilliams97 · 2020-06-05T04:34:05Z

This is a major update to the text extraction code that works with text arranged in columns.

extractor/text.go is now split across multiple text_*.go files.
The new design is summarised in the extractor README.

Here are new PDFs and text extraction reference files for extractor/text_test.go.

reference.zip + eu.page005.txt + [Productivity.page001.txt](https://github.com/unidoc/unipdf/files/4735832/Productivity.page001.txt) + we-dms.page001.txt + radar-eng.page002.txt + Nuance.page001.txt
pdfs.zip + eu.pdf + Productivity.pdf + we-dms.pdf + radar-eng.pdf + Nuance.pdf

You can also run pdf_extract_text.go to see the extraction. There is an updated version of this test here that makes it easier to test a corpus of PDFs.


peterwilliams97 · 2020-06-27T01:44:03Z

extractor/text.go, line 101 at r6 (raw file):

Previously, gunnsth (Gunnsteinn Hall) wrote…

Should this use Trace, or do you specifically want to look at this output? Trace logs a lot of other stuff.

I have needed to look at the operators a few times while developing the columns code. Eventually this code will be known to be bug free, but I am not 100% sure that it is now.

gunnsth

Reviewed 2 of 23 files at r1, 1 of 7 files at r4, 1 of 6 files at r8, 6 of 14 files at r9, 1 of 1 files at r10, 2 of 2 files at r11, 2 of 3 files at r12, 3 of 4 files at r13, 5 of 5 files at r14.
Reviewable status: all files reviewed, 11 unresolved discussions (waiting on @adrg and @peterwilliams97)

extractor/extractor.go, line 50 at r14 (raw file):

	mediaBox, err := page.GetMediaBox()
	if err != nil {
		return nil, fmt.Errorf("extractor requires mediaBox. %w", err)

We need to hold off on this until we drop support for Go 1.12 (the `%w` verb requires Go 1.13).

extractor/text_test.go, line 84 at r14 (raw file):

		0 1 -1 0 0 0 Tm
		(Hello World!)Tj
		0 -25 Td

Any reason for changing from -10 to -25?

extractor/text_test.go, line 602 at r14 (raw file):

	// XXX(peterwilliams97): The new text extraction changes TextMark contents. From now on we
	// only test their behaviour, not their implementation.

In that case, should we remove the commented-out test code?

extractor/text_test.go, line 653 at r14 (raw file):

	}
	for i, filename := range pathList {
		// 4865ab395ed664c3ee17.pdf is a corrupted file in the test corpus.

Should we remove it if it's corrupt?

extractor/text_utils.go, line 41 at r14 (raw file):

// addNeighbours fills out the below and right fields of the paras in `paras`.
// For each para `a`:
//    a.below is the unique highest para completely below `a` that overlaps it in the x-direction

Is the x-direction the same as the reading direction and the y-direction the depth direction, or is this purely x/y at this level?

internal/textencoding/simple.go, line 58 at r14 (raw file):

	if !ok {
		common.Log.Debug("ERROR: NewSimpleTextEncoder. Unknown encoding %q", baseName)
		return nil, fmt.Errorf("unsupported font encoding: %q (%w)", baseName, core.ErrNotSupported)

Needs to work with go 1.12

model/const.go, line 24 at r14 (raw file):

	ErrEncrypted                = errors.New("file needs to be decrypted first")
	ErrNoFont                   = errors.New("font not defined")
	ErrFontNotSupported         = fmt.Errorf("unsupported font (%w)", core.ErrNotSupported)

needs to work with 1.12

model/internal/fonts/ttfparser.go, line 212 at r14 (raw file):

	if version == "OTTO" {
		// See https://docs.microsoft.com/en-us/typography/opentype/spec/otff
		return TtfType{}, fmt.Errorf("fonts based on PostScript outlines are not supported (%w)",

check

model/internal/fonts/ttfparser.go, line 380 at r14 (raw file):

	format := t.ReadUShort()
	if format != 4 {
		return fmt.Errorf("unexpected subtable format: %d (%w)", format, core.ErrNotSupported)

check

gunnsth

Looking great. The biggest remaining comment is about error handling. We want to keep Go 1.12 compatibility, so we need to stick with it for now.
Could we use https://godoc.org/golang.org/x/xerrors in the meantime? It provides some of the functionality that was added in Go 1.13.

gunnsth · 2020-06-29T08:42:04Z

extractor/text.go

 	cstreamParser := contentstream.NewContentStreamParser(contents)
 	operations, err := cstreamParser.Parse()
 	if err != nil {
-		common.Log.Debug("ERROR: extractPageText parse failed. err=%v", err)
+		common.Log.Debug("ERROR: extractPageText parse failed. err=%w", err)


check for go 1.12 compatibility

gunnsth · 2020-06-29T08:44:29Z

extractor/text.go

@@ -240,7 +245,8 @@ func (e *Extractor) extractPageText(contents string, resources *model.PdfPageRes
 					return err
 				}
 				err = to.setFont(name, size)
-				if err != nil {
+				to.invalidFont = errors.Is(err, core.ErrNotSupported)


Go 1.12 compatibility: we need to stick with it.
Could we use https://godoc.org/golang.org/x/xerrors in the meantime? It provides some of this functionality.

gunnsth · 2020-06-29T08:45:41Z

extractor/text.go

-			(*to.fontStack)[len(*to.fontStack)-1] = font
+	if err != nil {
+		if err == model.ErrFontNotSupported {
+			// TODO(peterwilliams97): Do we need to handle this case in a special way?


If the font is not supported, is there anything that makes sense to do?
We probably need to collect such cases and look at them.

peterwilliams97 · 2020-06-29T11:04:57Z

extractor/text.go, line 492 at r14 (raw file):

Previously, gunnsth (Gunnsteinn Hall) wrote…

If the font is not supported, is there anything that makes sense to do?
We probably need to collect such cases and look at them.

This case doesn't happen.

peterwilliams97 · 2020-06-29T11:09:22Z

extractor/text_test.go, line 84 at r14 (raw file):

Previously, gunnsth (Gunnsteinn Hall) wrote…

Any reason for changing from -10 to -25?

It should always have been -25 to match the unrotated case. I can't recall why I set it to -10 for the old text extraction code. The new text extraction code correctly treats the -10 case as overlapping text and the test is expecting non-overlapping text.
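
The arithmetic behind this can be sketched as follows. The font size of 24 is an assumption (the excerpt does not show it), and modelling each line as a band one font size tall is a simplification, not the extractor's actual overlap test.

```go
package main

import "fmt"

// linesOverlap reports whether two consecutive lines overlap in the depth
// direction when each line is modelled as a band fontSize tall and the
// second line is offset from the first by leading.
func linesOverlap(fontSize, leading float64) bool {
	return leading < fontSize
}

func main() {
	const fontSize = 24.0 // assumed value, not taken from the test

	fmt.Println(linesOverlap(fontSize, 10)) // true: the bands overlap
	fmt.Println(linesOverlap(fontSize, 25)) // false: the bands are separate
}
```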

peterwilliams97 · 2020-06-29T11:14:02Z

extractor/text_test.go, line 653 at r14 (raw file):

Previously, gunnsth (Gunnsteinn Hall) wrote…

Should we remove it if it's corrupt?

Done

peterwilliams97 · 2020-06-29T11:18:09Z

extractor/text_utils.go, line 41 at r14 (raw file):

Previously, gunnsth (Gunnsteinn Hall) wrote…

Is the x-direction the same as the reading direction and the y-direction the depth direction, or is this purely x/y at this level?

This only gets used for table cell detection so it is x/y.

peterwilliams97 · 2020-06-29T11:21:24Z

model/internal/fonts/ttfparser.go, line 212 at r14 (raw file):

Previously, gunnsth (Gunnsteinn Hall) wrote…

check

Sorry. I don't understand that.

peterwilliams97 · 2020-06-29T22:25:45Z

extractor/text_mark.go, line 26 at r6 (raw file):

Previously, gunnsth (Gunnsteinn Hall) wrote…

Yes this can be useful

Done.

peterwilliams97 · 2020-06-29T22:25:55Z

extractor/text_test.go, line 84 at r14 (raw file):

Previously, peterwilliams97 (Peter Williams) wrote…

It should always have been -25 to match the unrotated case. I can't recall why I set it to -10 for the old text extraction code. The new text extraction code correctly treats the -10 case as overlapping text and the test is expecting non-overlapping text.

Done.

peterwilliams97 · 2020-06-29T22:26:04Z

extractor/text_test.go, line 602 at r14 (raw file):

Previously, gunnsth (Gunnsteinn Hall) wrote…

In that case, should we remove the commented-out test code?

Done.

peterwilliams97 · 2020-06-29T22:26:24Z

internal/textencoding/simple.go, line 58 at r14 (raw file):

Previously, gunnsth (Gunnsteinn Hall) wrote…

Needs to work with go 1.12

Done.

peterwilliams97 · 2020-06-29T22:26:29Z

model/const.go, line 24 at r14 (raw file):

Previously, gunnsth (Gunnsteinn Hall) wrote…

needs to work with 1.12

Done.

gunnsth

Reviewed 7 of 7 files at r15, 1 of 1 files at r16.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @adrg)

adrg

Looks good.

gunnsth

LGTM

peterwilliams97 added 23 commits May 19, 2020 11:46

Fixed filename:page in logging

6fe0d20

Got CMap working for multi-rune entries

22680be

Treat CMap entries as strings instead of runes to handle multi-byte e…

a9910e7

…ncodings.

Added a test for multibyte encoding.

0c54cec

Merge branch 'development' of https://github.com/unidoc/unipdf into cmap

6103fb8

Merge branch 'cmap' into columns

e9c46fa

First version of text extraction that recognizes columns

6b13a99

Added an expanation of the text columns code to README.md.

a5c538f

fixed typos

8303318

Abstracted textWord depth calculation. This required change textMark …

c515472

…to *textMark in a lot of code.

Added function comments.

603b5ff

Fixed text state save/restore.

fad1552

Adjusted inter-word search distance to make paragrah division work fo…

6b4314f

…r thanh.pdf

Got text_test.go passing.

d21e2f8

Reinstated hyphen suppression

418f859

Handle more cases of fonts not being set in text extraction code.

2260e24

Fixed typo

a14d8e7

More verbose logging

49bbef0

Adding tables to text extractor.

40806d7

Merge branch 'development' of https://github.com/unidoc/unipdf into c…

29f2d9b

…olumns

Added tests for columns extraction.

af9508c

Removed commented code

16b3c1c

Check for textParas that are on the same line when writing out extrac…

30fc953

…ted text.

gunnsth marked this pull request as draft June 5, 2020 09:43

peterwilliams97 added 6 commits June 5, 2020 21:43

Absorb text to the left of paras into paras e.g. Footnote numbers

b4d90b6

Removed funny character from text_test.go

975e038

Merge branch 'development' of https://github.com/unidoc/unipdf into c…

e6be021

…olumns

Merge branch 'development' of https://github.com/unidoc/unipdf into c…

a7779a3

…olumns

Commented out a creator_test.go test that was broken by my text extra…

5d7e4aa

…ction changes.

Big changes to columns text extraction code for PR.

acb5caa

Performance improvements in several places. Commented code.

peterwilliams97 added 7 commits June 25, 2020 14:34

Merge branch 'development' of https://github.com/unidoc/unipdf into c…

b39f205

…olumns

Added color fields to TextMark

d5c344d

Updated README

fe6afef

Reinstated the disabled tests I missed before.

8be2607

Tightened definition for tables to prevent detection of tables where …

a5e21a7

…there weren't any.

Compute line splitting search range based on fontsize of first word i…

8f64966

…n word bag.

Use errors.Is(err, core.ErrNotSupported) to distinguish unsupported f…

25414d4

…ont errorrs. See https://blog.golang.org/go1.13-errors

peterwilliams97 added 2 commits June 27, 2020 12:04

Fixed some naming and added some comments.

cf91ad6

Merge branch 'development' of https://github.com/unidoc/unipdf into c…

9caa40e

…olumns

gunnsth requested a review from adrg June 29, 2020 00:36

gunnsth reviewed Jun 29, 2020


gunnsth requested changes Jun 29, 2020


peterwilliams97 added 2 commits June 29, 2020 20:53

errors.Is -> xerrors.Is and %w -> %v for go 1.12 compatibility

b7f91fd

Removed code that doesn't ever get called.

d3deac8

Removed unused test

fe35826

gunnsth approved these changes Jun 30, 2020


adrg approved these changes Jun 30, 2020


gunnsth approved these changes Jun 30, 2020


gunnsth merged commit 88fda44 into unidoc:development Jun 30, 2020
