Error case for the segmentation model #632

lfoppiano · 2020-09-01T08:45:11Z

I have this PDF (https://www.nature.com/articles/s41598-020-58065-9.pdf) and I noticed some issues when processing it.

It seems that the segmentation parser puts everything in the body section:
segmentation.txt

This is the output from pdfalto:
3SHt4RC7GX.xml.txt

I also noticed that this PDF contains a lot of characters that are not encoded correctly; maybe it can be used as a test case for pdfalto?
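
For anyone who wants to turn this into a pdfalto test case, here is a rough sketch of how the badly encoded tokens could be listed. It assumes the pdfalto output is ALTO XML where each token is a `String` element with a `CONTENT` attribute. The helper names (`is_suspicious`, `suspicious_tokens`) are made up for illustration, and the heuristic (private-use, unassigned, or replacement characters) is only a guess at what "not encoded correctly" looks like in this file.

```python
import unicodedata
import xml.etree.ElementTree as ET


def is_suspicious(ch):
    """Heuristic (an assumption): flag characters that often signal a failed font-to-Unicode mapping."""
    if ch == "\ufffd":  # Unicode replacement character
        return True
    # Co = private use area, Cn = unassigned code point
    return unicodedata.category(ch) in {"Co", "Cn"}


def suspicious_tokens(alto_xml):
    """Return (token, suspicious code points) for each affected String element."""
    root = ET.fromstring(alto_xml)
    hits = []
    for elem in root.iter():
        # Compare the local tag name so the ALTO namespace version does not matter.
        if elem.tag.rsplit("}", 1)[-1] != "String":
            continue
        content = elem.get("CONTENT", "")
        bad = [f"U+{ord(c):04X}" for c in content if is_suspicious(c)]
        if bad:
            hits.append((content, bad))
    return hits


# Tiny hand-written sample, not taken from the attached file.
sample = (
    '<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#"><Layout><Page>'
    '<String CONTENT="segmentation"/>'
    '<String CONTENT="\ue000x"/>'
    '</Page></Layout></alto>'
)
print(suspicious_tokens(sample))
```

Running something like this over the attached XML should give a quick count of affected tokens, which could help decide whether the file is worth keeping as a regression case.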

kermitt2 · 2020-09-01T20:19:40Z

It looks like it only concerns the header zone recognition; the rest of the segmentation labels are more or less good. But as it's the header, it is very visible of course. Normally Nature open access publications are CC-BY, so we could add some training data and fix the error case that way. At some point, the segmentation model will need more work with better features, since it works very badly when special characters or equations are around, but we need more training data to do a meaningful job on this.

lfoppiano · 2022-10-18T21:37:16Z

This was implemented in #951

lfoppiano added the "error cases" label (Some error/test case for future improvements) Sep 1, 2020

lfoppiano changed the title ~~Error case - Nature~~ Error case for the segmentation model Sep 17, 2020

lfoppiano self-assigned this Jul 28, 2021

This was referenced Aug 11, 2021

Missing (very few tokens) in the generated segmentation training data #812

Open

Add training data for one special error case #813

Closed

lfoppiano closed this as completed Oct 18, 2022
