Visualized region of figure or table by pdfbox #528

Closed
Sunnycheey opened this issue Dec 20, 2019 · 8 comments
Labels
question There's no such thing as a stupid question

Comments

@Sunnycheey

Is there any demo code for marking out content by its coordinates (i.e., p, x, y, w, h) with pdfbox? I would rather not go deep into PDF internals.

Any help will be appreciated.

@kermitt2
Owner

Hello @Sunnycheey

Yes, you can have a look at the class FigureTableVisualizer.java under grobid/grobid-core/src/main/java/org/grobid/core/visualization/. The other classes in that package are also good examples of using pdfbox 2.* to add extra annotations from the Grobid results.
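
Before drawing anything, the coordinates have to be read into a usable structure. The following is a minimal sketch, not code from Grobid itself: it assumes the coordinates arrive as a string of boxes separated by `;`, each box being `p,x,y,w,h`. The function name `parseCoords` is hypothetical.

```javascript
// Parse a coordinate string into box objects.
// ASSUMPTION: boxes are separated by ';' and each box is "p,x,y,w,h"
// (page number, left, top, width, height).
function parseCoords(coords) {
  return coords
    .split(';')
    .map((part) => part.trim())
    .filter((part) => part.length > 0)
    .map((part) => {
      const tokens = part.split(',').map((token) => token.trim());
      const values = tokens.map(Number);
      const malformed =
        tokens.length !== 5 ||
        tokens.some((token) => token === '') ||
        values.some((value) => !Number.isFinite(value));
      if (malformed) {
        throw new Error(`Malformed box: "${part}"`);
      }
      const [p, x, y, w, h] = values;
      return { p, x, y, w, h };
    });
}

// Example with a single box on page 6.
const exampleBoxes = parseCoords('6,177.28,358.04,305.92,8.42');
console.log(exampleBoxes);
```

Rejecting malformed boxes early makes a later misplaced annotation easier to attribute to scaling rather than to parsing.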

@kermitt2 kermitt2 added the question There's no such thing as a stupid question label Dec 21, 2019
@ehapmgs

ehapmgs commented Dec 3, 2020

Any luck in doing so?

I believe it should look something like this based on the code sample

// grobid coords
const [page, x, y, w, h] = [6, 177.28, 358.04, 305.92, 8.42]
// page media box
const mediaBox = { lowerX: 0, lowerY: 0, width: 667.28, height: 913.89 }
// grobid's y grows downward from the top of the page; PDF's y grows upward from the bottom
const annX = x + mediaBox.lowerX
const annY = mediaBox.height - (y + h) + mediaBox.lowerY
// horizontal offset is lowerX (the original used lowerY here)
const annRightX = x + w + mediaBox.lowerX
const annTopY = mediaBox.height - y + mediaBox.lowerY
console.log(annX, annY, annRightX, annTopY)

but it totally points to something else in my case

@kermitt2
Owner

kermitt2 commented Dec 3, 2020

@ehapmgs hello!

Do you want to annotate a PDF with pdfbox (so producing a new PDF with annotations) or annotate a PDF with PDF.js in a browser?

If the former, there are examples under grobid/grobid-core/src/main/java/org/grobid/core/visualization/ as indicated above.

If the latter, you are in the wrong issue, but look at the web demo:
https://github.com/kermitt2/grobid/blob/master/grobid-service/src/main/resources/web/grobid/grobid.js#L332 (PDF display)
https://github.com/kermitt2/grobid/blob/master/grobid-service/src/main/resources/web/grobid/grobid.js#L646 (annotation layer on top of the PDF)

Basically you need to scale the annotations (which have no unit) according to the canvas where the PDF is displayed.

		var page = thePos.p;
		var pageDiv = $('#page-' + page);
		var canvas = pageDiv.children('canvas').eq(0);

		// page_width and page_height are the page dimensions, in the same units as thePos
		var canvasHeight = canvas.height();
		var canvasWidth = canvas.width();
		var scale_x = canvasWidth / page_width;   // horizontal scale
		var scale_y = canvasHeight / page_height; // vertical scale

		var x = thePos.x * scale_x;
		var y = thePos.y * scale_y;
		var width = thePos.w * scale_x;
		var height = thePos.h * scale_y;
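
The scaling step can be isolated from jQuery and the DOM, which makes it easier to check by hand. This is only a sketch under stated assumptions: `scaleBoxToCanvas` is a hypothetical helper, `page` holds the page dimensions in the same units as the box, and `canvas` holds the displayed size in pixels. Horizontal values are scaled by the width ratio and vertical values by the height ratio.

```javascript
// Scale a box given in page units to canvas pixels.
// ASSUMPTION: box, page and canvas are plain objects; none of these names
// come from grobid.js.
function scaleBoxToCanvas(box, page, canvas) {
  const scaleX = canvas.width / page.width;   // horizontal scale
  const scaleY = canvas.height / page.height; // vertical scale
  return {
    left: box.x * scaleX,
    top: box.y * scaleY,
    width: box.w * scaleX,
    height: box.h * scaleY,
  };
}

// A 600 x 800 page shown on a 1200 x 1600 canvas doubles every value.
const scaledExample = scaleBoxToCanvas(
  { p: 1, x: 150, y: 200, w: 300, h: 20 },
  { width: 600, height: 800 },
  { width: 1200, height: 1600 }
);
console.log(scaledExample);
```

When the canvas keeps the page's aspect ratio, the two scale factors are equal, so an error that swaps them would go unnoticed until a page is displayed stretched.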

@ehapmgs

ehapmgs commented Dec 3, 2020

Hey @kermitt2

Thanks for the help. Yes, it is the first case, but something seems to be wrong with the coords. It is probably a bug that I need to open a new issue for.

Using the example in grobid/grobid-core/src/main/java/org/grobid/core/visualization/, you can see that the annotations are not located correctly:

Screen Shot 2020-12-03 at 6 59 51 PM

I believe it used to work in version 0.5.4. I can see that the coords for the same figure are different between the master branch and 0.5.4.

@kermitt2
Owner

kermitt2 commented Dec 3, 2020

Indeed, thanks! I haven't looked at this part in a very long time, so I am rediscovering it :)

There are actually web services to get the annotated PDF:

curl -v --form input=@./s41523-020-00198-1.pdf --form type=2 localhost:8070/api/annotatePDF > s41523-020-00198-1-annot.pdf

s41523-020-00198-1-annot.pdf is the PDF with annotations. The type attribute indicates which annotations to add (0: citations, 1: blocks, 2: figure).
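
The same request can be sent from a script instead of curl. The sketch below assumes Node 18 or later (for the global `fetch`, `FormData` and `Blob`) and a service reachable at a base URL such as `http://localhost:8070`. The endpoint path, the `input` and `type` form fields and the type codes are taken from the curl command and description above; the function names and the string keys in `ANNOTATION_TYPES` are hypothetical.

```javascript
// Type codes as described above: 0 citations, 1 blocks, 2 figure.
const ANNOTATION_TYPES = { citations: 0, blocks: 1, figure: 2 };

// Build the URL and the `type` form value for an annotatePDF request.
function buildAnnotateRequest(baseUrl, annotation) {
  if (!Object.prototype.hasOwnProperty.call(ANNOTATION_TYPES, annotation)) {
    throw new Error(`Unknown annotation type: ${annotation}`);
  }
  return {
    url: `${baseUrl.replace(/\/+$/, '')}/api/annotatePDF`,
    type: String(ANNOTATION_TYPES[annotation]),
  };
}

// Send the PDF bytes and return the annotated PDF bytes.
// ASSUMPTION: Node 18+ globals; not called here because it needs a running service.
async function annotatePdf(baseUrl, pdfBytes, fileName, annotation) {
  const { url, type } = buildAnnotateRequest(baseUrl, annotation);
  const form = new FormData();
  form.append('input', new Blob([pdfBytes], { type: 'application/pdf' }), fileName);
  form.append('type', type);
  const response = await fetch(url, { method: 'POST', body: form });
  if (!response.ok) {
    throw new Error(`annotatePDF failed: HTTP ${response.status}`);
  }
  return new Uint8Array(await response.arrayBuffer());
}

console.log(buildAnnotateRequest('http://localhost:8070', 'figure'));
```

A caller would read the PDF into a buffer, pass it to `annotatePdf`, and write the returned bytes to a new file, mirroring the shell redirection in the curl command.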

This works on some PDFs:

Screenshot from 2020-12-03 18-46-24

But for the one you tested, there is a scaling problem. I think Achraf fixed this problem in pdfalto with kermitt2/pdfalto#43 - see #330

So we just need to do the same with pdfbox here... mmm

@kermitt2
Owner

kermitt2 commented Dec 3, 2020

For reference, the failing PDF: https://onlinelibrary.wiley.com/doi/pdf/10.1087/20100308

@kermitt2
Owner

kermitt2 commented Dec 3, 2020

This should be fixed with 33b50b7

curl -v --form input=@./20100308.pdf --form type=2 localhost:8070/api/annotatePDF > 20100308-annot.pdf

Screenshot from 2020-12-03 19-07-59

@ehapmgs

ehapmgs commented Dec 3, 2020

@kermitt2 nice! It is working as expected now, thanks.
