-
Notifications
You must be signed in to change notification settings - Fork 254
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Text extraction code for columns. #366
Conversation
…to *textMark in a lot of code.
Performance improvements in several places. Commented code.
…there weren't any.
extractor/text.go, line 101 at r6 (raw file): Previously, gunnsth (Gunnsteinn Hall) wrote…
I have needed to look at the operators a few times while developing the columns code. Eventually this code will be known to be bug free, but I am not 100% sure that it is now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 23 files at r1, 1 of 7 files at r4, 1 of 6 files at r8, 6 of 14 files at r9, 1 of 1 files at r10, 2 of 2 files at r11, 2 of 3 files at r12, 3 of 4 files at r13, 5 of 5 files at r14.
Reviewable status: all files reviewed, 11 unresolved discussions (waiting on @adrg and @peterwilliams97)
extractor/extractor.go, line 50 at r14 (raw file):
mediaBox, err := page.GetMediaBox() if err != nil { return nil, fmt.Errorf("extractor requires mediaBox. %w", err)
We need to wait with this until we drop support for 1.12
extractor/text_test.go, line 84 at r14 (raw file):
0 1 -1 0 0 0 Tm (Hello World!)Tj 0 -25 Td
Any reason for changing from -10 to -25?
extractor/text_test.go, line 602 at r14 (raw file):
// XXX(peterwilliams97): The new text extraction changes TextMark contents. From now on we // only test their behaviour, not their implementation.
In that case should we remove the commented test codes?
extractor/text_test.go, line 653 at r14 (raw file):
} for i, filename := range pathList { // 4865ab395ed664c3ee17.pdf is a corrupted file in the test corpus.
should we remove it, if its corrupt?
extractor/text_utils.go, line 41 at r14 (raw file):
// addNeighbours fills out the below and right fields of the paras in `paras`. // For each para `a`: // a.below is the unique highest para completely below `a` that overlaps it in the x-direction
x-direction, same as reading direction, and y-direction depth direction? Or purely x/y at this level?
internal/textencoding/simple.go, line 58 at r14 (raw file):
if !ok { common.Log.Debug("ERROR: NewSimpleTextEncoder. Unknown encoding %q", baseName) return nil, fmt.Errorf("unsupported font encoding: %q (%w)", baseName, core.ErrNotSupported)
Needs to work with go 1.12
model/const.go, line 24 at r14 (raw file):
ErrEncrypted = errors.New("file needs to be decrypted first") ErrNoFont = errors.New("font not defined") ErrFontNotSupported = fmt.Errorf("unsupported font (%w)", core.ErrNotSupported)
needs to work with 1.12
model/internal/fonts/ttfparser.go, line 212 at r14 (raw file):
if version == "OTTO" { // See https://docs.microsoft.com/en-us/typography/opentype/spec/otff return TtfType{}, fmt.Errorf("fonts based on PostScript outlines are not supported (%w)",
check
model/internal/fonts/ttfparser.go, line 380 at r14 (raw file):
format := t.ReadUShort() if format != 4 { return fmt.Errorf("unexpected subtable format: %d (%w)", format, core.ErrNotSupported)
check
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking great. Biggest comment now regarding error handling. We want to keep go 1.12 compatibility
so need to stick with it
We could use
https://godoc.org/golang.org/x/xerrors
in the meantime which provides some of this functionality needed which was added in go 1.13?
extractor/text.go
Outdated
cstreamParser := contentstream.NewContentStreamParser(contents) | ||
operations, err := cstreamParser.Parse() | ||
if err != nil { | ||
common.Log.Debug("ERROR: extractPageText parse failed. err=%v", err) | ||
common.Log.Debug("ERROR: extractPageText parse failed. err=%w", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
check for go 1.12 compatibility
extractor/text.go
Outdated
@@ -240,7 +245,8 @@ func (e *Extractor) extractPageText(contents string, resources *model.PdfPageRes | |||
return err | |||
} | |||
err = to.setFont(name, size) | |||
if err != nil { | |||
to.invalidFont = errors.Is(err, core.ErrNotSupported) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
go 1.12 compatibility.. need to stick with it
Could use
https://godoc.org/golang.org/x/xerrors
in the meantime which provides some of this functionality?
extractor/text.go
Outdated
(*to.fontStack)[len(*to.fontStack)-1] = font | ||
if err != nil { | ||
if err == model.ErrFontNotSupported { | ||
// TODO(peterwilliams97): Do we need to handle this case in a special way? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if font is not supported, is there anything that makes sense to do?
Probably need to collect such cases and look at.
extractor/text.go, line 492 at r14 (raw file): Previously, gunnsth (Gunnsteinn Hall) wrote…
This case doesn't happen. |
extractor/text_test.go, line 84 at r14 (raw file): Previously, gunnsth (Gunnsteinn Hall) wrote…
It should always have been -25 to match the unrotated case. I can't recall why I set it to -10 for the old text extraction code. The new text extraction code correctly treats the -10 case as overlapping text and the test is expecting non-overlapping text. |
extractor/text_test.go, line 653 at r14 (raw file): Previously, gunnsth (Gunnsteinn Hall) wrote…
Done |
extractor/text_utils.go, line 41 at r14 (raw file): Previously, gunnsth (Gunnsteinn Hall) wrote…
This only gets used for table cell detection so it is x/y. |
model/internal/fonts/ttfparser.go, line 212 at r14 (raw file): Previously, gunnsth (Gunnsteinn Hall) wrote…
Sorry. I don't understand that. |
extractor/text_mark.go, line 26 at r6 (raw file): Previously, gunnsth (Gunnsteinn Hall) wrote…
Done. |
extractor/text_test.go, line 84 at r14 (raw file): Previously, peterwilliams97 (Peter Williams) wrote…
Done. |
extractor/text_test.go, line 602 at r14 (raw file): Previously, gunnsth (Gunnsteinn Hall) wrote…
Done. |
internal/textencoding/simple.go, line 58 at r14 (raw file): Previously, gunnsth (Gunnsteinn Hall) wrote…
Done. |
model/const.go, line 24 at r14 (raw file): Previously, gunnsth (Gunnsteinn Hall) wrote…
Done. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 7 of 7 files at r15, 1 of 1 files at r16.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @adrg)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
This is a major update to the text extraction code that works with text arranged in columns.
Here are new PDFs and text extraction references files for extractor/text_test.go.
reference.zip + eu.page005.txt +[Productivity.page001.txt] (https://github.com/unidoc/unipdf/files/4735832/Productivity.page001.txt) + we-dms.page001.txt + radar-eng.page002.txt + Nuance.page001.txt
pdfs.zip + eu.pdf + Productivity.pdf + we-dms.pdf + radar-eng.pdf +Nuance.pdf
You can also run pdf_extract_text.go to see the extraction. There is an updated version of this test here that makes it easier to test a corpus of PDFs.
This change is