Error case for the segmentation model #632

Closed
lfoppiano opened this issue Sep 1, 2020 · 2 comments
@lfoppiano
Collaborator

I have this PDF (https://www.nature.com/articles/s41598-020-58065-9.pdf) and I noticed some issues when processing it.

The segmentation parser seems to put everything into the body section:
segmentation.txt

This is the output from pdfalto:
3SHt4RC7GX.xml.txt

I also noticed that this PDF contains a lot of characters that are not encoded correctly; maybe it could be used as a test case for pdfalto?
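
For anyone who wants to quantify the encoding problem, here is a minimal sketch that flags suspicious characters in an ALTO file. It assumes the pdfalto output stores each token in the `CONTENT` attribute of a `String` element, and it treats Unicode private-use characters, unassigned code points and U+FFFD as "not encoded correctly". Both the heuristic and the helper names (`suspicious_chars`, `scan_alto`) are hypothetical and not part of pdfalto or GROBID.

```python
import unicodedata
import xml.etree.ElementTree as ET


def suspicious_chars(text):
    """Return characters that are private-use (Co), unassigned (Cn), or U+FFFD."""
    return [c for c in text
            if c == "\ufffd" or unicodedata.category(c) in ("Co", "Cn")]


def scan_alto(xml_text):
    """Collect (token, suspicious characters) pairs from ALTO String elements."""
    root = ET.fromstring(xml_text)
    hits = []
    for elem in root.iter():
        # Compare the local tag name so any ALTO namespace version is accepted.
        if elem.tag.rsplit("}", 1)[-1] == "String":
            content = elem.get("CONTENT", "")
            bad = suspicious_chars(content)
            if bad:
                hits.append((content, bad))
    return hits


# Tiny made-up ALTO fragment: the second token holds a private-use character.
sample = ('<alto><TextLine><String CONTENT="ok"/>'
          '<String CONTENT="a\ue000b"/></TextLine></alto>')

for token, bad in scan_alto(sample):
    print(repr(token), [hex(ord(c)) for c in bad])
```

On a real file, the XML text would be read from disk and passed to `scan_alto`; a high hit count would support using the document as a pdfalto test case.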

@lfoppiano lfoppiano added the error cases Some error/test case for future improvements label Sep 1, 2020
@kermitt2
Owner

kermitt2 commented Sep 1, 2020

It looks like this only concerns the header zone recognition; the rest of the segmentation labels are more or less good. But as it's the header, the error is very visible, of course. Nature open access publications are normally CC-BY, so we could add some training data and fix the error case that way. At some point the segmentation model will need more work with better features: it performs very badly when special characters or equations are around, but we need more training data before that work can be meaningful.

@lfoppiano lfoppiano changed the title from "Error case - Nature" to "Error case for the segmentation model" Sep 17, 2020
@lfoppiano lfoppiano self-assigned this Jul 28, 2021
@lfoppiano
Collaborator Author

This was implemented in #951
