Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Text extraction code for columns. #366

Merged
merged 54 commits into from
Jun 30, 2020
Merged

Conversation

peterwilliams97
Copy link
Contributor

@peterwilliams97 peterwilliams97 commented Jun 5, 2020

This is a major update to the text extraction code that works with text arranged in columns.

  • extractor/text.go is now split across multiple text_*.go files.
  • the new design is summarised in the extractor README.

Here are new PDFs and text extraction references files for extractor/text_test.go.

You can also run pdf_extract_text.go to see the extraction. There is an updated version of this test here that makes it easier to test a corpus of PDFs.


This change is Reviewable

@gunnsth gunnsth marked this pull request as draft June 5, 2020 09:43
@peterwilliams97
Copy link
Contributor Author


extractor/text.go, line 101 at r6 (raw file):

Previously, gunnsth (Gunnsteinn Hall) wrote…

Should use Trace for this? or specifically want to look at this? Trace has a lot of stuff

I have needed to look at the operators a few times while developing the columns code. Eventually this code will be known to be bug free, but I am not 100% sure that it is now.

@gunnsth gunnsth requested a review from adrg June 29, 2020 00:36
Copy link
Contributor

@gunnsth gunnsth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 23 files at r1, 1 of 7 files at r4, 1 of 6 files at r8, 6 of 14 files at r9, 1 of 1 files at r10, 2 of 2 files at r11, 2 of 3 files at r12, 3 of 4 files at r13, 5 of 5 files at r14.
Reviewable status: all files reviewed, 11 unresolved discussions (waiting on @adrg and @peterwilliams97)


extractor/extractor.go, line 50 at r14 (raw file):

	mediaBox, err := page.GetMediaBox()
	if err != nil {
		return nil, fmt.Errorf("extractor requires mediaBox. %w", err)

We need to wait with this until we drop support for 1.12


extractor/text_test.go, line 84 at r14 (raw file):

		0 1 -1 0 0 0 Tm
		(Hello World!)Tj
		0 -25 Td

Any reason for changing from -10 to -25?


extractor/text_test.go, line 602 at r14 (raw file):

	// XXX(peterwilliams97): The new text extraction changes TextMark contents. From now on we
	// only test their behaviour, not their implementation.

In that case should we remove the commented test codes?


extractor/text_test.go, line 653 at r14 (raw file):

	}
	for i, filename := range pathList {
		// 4865ab395ed664c3ee17.pdf is a corrupted file in the test corpus.

should we remove it, if its corrupt?


extractor/text_utils.go, line 41 at r14 (raw file):

// addNeighbours fills out the below and right fields of the paras in `paras`.
// For each para `a`:
//    a.below is the unique highest para completely below `a` that overlaps it in the x-direction

x-direction, same as reading direction, and y-direction depth direction? Or purely x/y at this level?


internal/textencoding/simple.go, line 58 at r14 (raw file):

	if !ok {
		common.Log.Debug("ERROR: NewSimpleTextEncoder. Unknown encoding %q", baseName)
		return nil, fmt.Errorf("unsupported font encoding: %q (%w)", baseName, core.ErrNotSupported)

Needs to work with go 1.12


model/const.go, line 24 at r14 (raw file):

	ErrEncrypted                = errors.New("file needs to be decrypted first")
	ErrNoFont                   = errors.New("font not defined")
	ErrFontNotSupported         = fmt.Errorf("unsupported font (%w)", core.ErrNotSupported)

needs to work with 1.12


model/internal/fonts/ttfparser.go, line 212 at r14 (raw file):

	if version == "OTTO" {
		// See https://docs.microsoft.com/en-us/typography/opentype/spec/otff
		return TtfType{}, fmt.Errorf("fonts based on PostScript outlines are not supported (%w)",

check


model/internal/fonts/ttfparser.go, line 380 at r14 (raw file):

	format := t.ReadUShort()
	if format != 4 {
		return fmt.Errorf("unexpected subtable format: %d (%w)", format, core.ErrNotSupported)

check

Copy link
Contributor

@gunnsth gunnsth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking great. Biggest comment now regarding error handling. We want to keep go 1.12 compatibility
so need to stick with it
We could use
https://godoc.org/golang.org/x/xerrors
in the meantime which provides some of this functionality needed which was added in go 1.13?

cstreamParser := contentstream.NewContentStreamParser(contents)
operations, err := cstreamParser.Parse()
if err != nil {
common.Log.Debug("ERROR: extractPageText parse failed. err=%v", err)
common.Log.Debug("ERROR: extractPageText parse failed. err=%w", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check for go 1.12 compatibility

@@ -240,7 +245,8 @@ func (e *Extractor) extractPageText(contents string, resources *model.PdfPageRes
return err
}
err = to.setFont(name, size)
if err != nil {
to.invalidFont = errors.Is(err, core.ErrNotSupported)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

go 1.12 compatibility.. need to stick with it
Could use
https://godoc.org/golang.org/x/xerrors
in the meantime which provides some of this functionality?

(*to.fontStack)[len(*to.fontStack)-1] = font
if err != nil {
if err == model.ErrFontNotSupported {
// TODO(peterwilliams97): Do we need to handle this case in a special way?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if font is not supported, is there anything that makes sense to do?
Probably need to collect such cases and look at.

@peterwilliams97
Copy link
Contributor Author


extractor/text.go, line 492 at r14 (raw file):

Previously, gunnsth (Gunnsteinn Hall) wrote…

if font is not supported, is there anything that makes sense to do?
Probably need to collect such cases and look at.

This case doesn't happen.

@peterwilliams97
Copy link
Contributor Author


extractor/text_test.go, line 84 at r14 (raw file):

Previously, gunnsth (Gunnsteinn Hall) wrote…

Any reason for changing from -10 to -25?

It should always have been -25 to match the unrotated case. I can't recall why I set it to -10 for the old text extraction code. The new text extraction code correctly treats the -10 case as overlapping text and the test is expecting non-overlapping text.

@peterwilliams97
Copy link
Contributor Author


extractor/text_test.go, line 653 at r14 (raw file):

Previously, gunnsth (Gunnsteinn Hall) wrote…

should we remove it, if its corrupt?

Done

@peterwilliams97
Copy link
Contributor Author


extractor/text_utils.go, line 41 at r14 (raw file):

Previously, gunnsth (Gunnsteinn Hall) wrote…

x-direction, same as reading direction, and y-direction depth direction? Or purely x/y at this level?

This only gets used for table cell detection so it is x/y.

@peterwilliams97
Copy link
Contributor Author


model/internal/fonts/ttfparser.go, line 212 at r14 (raw file):

Previously, gunnsth (Gunnsteinn Hall) wrote…

check

Sorry. I don't understand that.

@peterwilliams97
Copy link
Contributor Author


extractor/text_mark.go, line 26 at r6 (raw file):

Previously, gunnsth (Gunnsteinn Hall) wrote…

Yes this can be useful

Done.

@peterwilliams97
Copy link
Contributor Author


extractor/text_test.go, line 84 at r14 (raw file):

Previously, peterwilliams97 (Peter Williams) wrote…

It should always have been -25 to match the unrotated case. I can't recall why I set it to -10 for the old text extraction code. The new text extraction code correctly treats the -10 case as overlapping text and the test is expecting non-overlapping text.

Done.

@peterwilliams97
Copy link
Contributor Author


extractor/text_test.go, line 602 at r14 (raw file):

Previously, gunnsth (Gunnsteinn Hall) wrote…

In that case should we remove the commented test codes?

Done.

@peterwilliams97
Copy link
Contributor Author


internal/textencoding/simple.go, line 58 at r14 (raw file):

Previously, gunnsth (Gunnsteinn Hall) wrote…

Needs to work with go 1.12

Done.

@peterwilliams97
Copy link
Contributor Author


model/const.go, line 24 at r14 (raw file):

Previously, gunnsth (Gunnsteinn Hall) wrote…

needs to work with 1.12

Done.

Copy link
Contributor

@gunnsth gunnsth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 7 of 7 files at r15, 1 of 1 files at r16.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @adrg)

Copy link
Collaborator

@adrg adrg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

Copy link
Contributor

@gunnsth gunnsth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@gunnsth gunnsth merged commit 88fda44 into unidoc:development Jun 30, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants