Removing irrelevant cruft from publisher HTML #32
Comments
Made good progress yesterday on Taylor and Francis (whose HTML is full of cruft and repeated text). Here's a brief stylesheet:
That's great - thanks! How do I go about applying this template to the HTML? Is there a method already built into one of the ContentMine tools (e.g. norma), or do I need to do this separately?
It's built into Norma. I think the production version works. It needs two passes: the first to create XHTML, the second to strip cruft (and a third to normalize the XHTML to formal SHTML).
Here's my test:
That SHOULD transform into:
You then need the updated symbol file
and the stylesheet
See https://github.com/ContentMine/norma/blob/master/docs/TRANSFORM.md - please check whether this works and comment. This can be done for other publishers.
Much, if not most, publisher HTML includes large amounts of material not relevant to the scholarly narrative. These include:
Much of this can be managed by XSLT stylesheets which "snip off" this cruft. I don't think there is a simple way of tackling this - it has to be a per-publisher or per-journal solution. That means we need a way of locating and using stylesheets from the command line.
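To make the per-publisher idea concrete, here is a minimal sketch in Python (standard library only), not the XSLT route proposed here and not Norma's actual implementation. It keeps one list of element selectors per publisher and drops every match from the XHTML. The publisher key `tandf`, the class names, and the function names are all hypothetical, and the input is assumed to be well-formed XHTML without a default namespace.

```python
import xml.etree.ElementTree as ET

# Hypothetical per-publisher rules: each path selects elements to snip off.
# The key "tandf" and the class names are invented for illustration.
CRUFT_RULES = {
    "tandf": [
        ".//div[@class='social-share']",
        ".//div[@class='cookie-banner']",
        ".//script",
    ],
}


def _detach(parent, node):
    """Remove node from parent, keeping any text that followed it."""
    tail = node.tail or ""
    siblings = list(parent)
    idx = siblings.index(node)
    if idx > 0:
        # Hand the trailing text to the previous sibling.
        prev = siblings[idx - 1]
        prev.tail = (prev.tail or "") + tail
    else:
        # No previous sibling: the text belongs to the parent's leading text.
        parent.text = (parent.text or "") + tail
    parent.remove(node)


def strip_cruft(xhtml, publisher):
    """Return the XHTML string with the publisher's cruft elements removed."""
    root = ET.fromstring(xhtml)
    for path in CRUFT_RULES[publisher]:
        # ElementTree has no parent pointers, so build a child -> parent map.
        # It is rebuilt per rule because earlier rules change the tree.
        parents = {child: parent for parent in root.iter() for child in parent}
        for node in root.findall(path):
            parent = parents.get(node)
            if parent is not None:  # the root itself is never removed
                _detach(parent, node)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    page = (
        "<html><body><div class='social-share'>Share</div>"
        "<p>Real text.</p><script>x()</script> tail</body></html>"
    )
    print(strip_cruft(page, "tandf"))
```

Keeping the rules in a table keyed by publisher mirrors the need to locate the right stylesheet by name; an XSLT version would express the same thing as an identity transform plus empty templates for the unwanted elements.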
Ideally we need:
into
I propose XSLT and XPath for the first two. It's possible that the restructuring can also tackle the third; for that we'd need XSLT 2.0 with Saxon.