The problem seems to be limited to header zone recognition; the rest of the segmentation labels are more or less correct. But since it is the header, the error is of course very visible. Nature open access publications are normally CC-BY, so we could add some of them to the training data and fix this error case that way. At some point the segmentation model will need more work with better features, since it performs very badly when special characters or equations are around, but we need more training data available to do meaningful work on this.
lfoppiano changed the title from "Error case - Nature" to "Error case for the segmentation model" on Sep 17, 2020
I have this PDF (https://www.nature.com/articles/s41598-020-58065-9.pdf) and I noticed some issues when processing it.
The segmentation parser seems to put everything in the body section:
segmentation.txt
This is the output from pdfalto:
3SHt4RC7GX.xml.txt
I also noticed that this PDF contains a lot of characters that are not encoded correctly. Maybe it could be used as a test case for pdfalto?