ami:section
Splits a document in a CTree into sections, based on:
- tags from JATS, etc.
- text labels in the document
- @class attributes
Will extract tables and figures.
Input can be:
- fulltext.xml (in JATS)
- scholarly.html (if semantic and structured).
Sections are difficult to extract from PDF or from many raw HTML files.
Uses default heuristics and common section names.
ami -p <cproject> section
produces the output shown below.
Description
===========
Splits XML files into sections using XPath.
Creates names from titles of sections (or 'elem<num>.xml' if cannot)
optionally writes HTML (slow) using specified stylesheet
examples:
--sections ALL --html nlm2html
//not sure this works --sections ABSTRACT ACK_FUND --write false
--forcemake --extract table fig --summary figure table // this seems to create sections OK, use this?
Options
=======
--boldsections convert paras with bold first sentence/phrase into subsections.
e.g. <sec id='s2.1'><p><bold>Extraction of Oils.</bold>. more text...</p></sec>
=> <sec id='s2.1'><sec id='s2.1.1'><title>Extraction of Oils.</title>. <p>more text...
</p></sec>
--extract[=<extractList>...]
extract float elements to subdirectory,default table, fig, supplementary)
Default: [table, fig, supplementary]
-h, --help Show this help message and exit.
--html=<xsltName> convert sections to HTML using stylesheet (convention as in --transform). recommend:
nlm2html; if omitted defaults to no HTML currently 201909 very slow since XSLT seems to be
slow, seems to be size related (references can take 1 sec)
--sections[=<sectionTagList>...]
sections to extract (uses JATSSectionTagger)
if none, lists Tagger tags
ALL selects all tags in Tagger
AUTO creates hierchical tree based on JATS and heuristics (default)
,
Default: [AUTO]
--sectiontype=<sectionType>
Type of section (XML or HTML) default XML. Probably only used in development
Default: XML
--summary=<summaryList>...
create summary files for sections
Default: []
-V, --version Print version information and exit.
--write write section files (may be customised later);
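The --boldsections transformation described in the help text can be illustrated with a short Python sketch. This is not the ami implementation: the function name promote_bold_paragraphs is hypothetical, the sub-section id scheme (parent id plus a running count) is assumed from the single example in the help text, and any text after the bold phrase is simply left in the paragraph.

```python
import xml.etree.ElementTree as ET


def promote_bold_paragraphs(sec):
    """Wrap each <p> that starts with <bold> in a new child <sec> with a <title>.

    Hypothetical helper, not part of ami. Modifies `sec` in place and returns it.
    """
    count = 0
    for index, child in enumerate(list(sec)):
        if child.tag != "p" or len(child) == 0:
            continue
        first = child[0]
        # Only promote when <bold> is the very first thing in the paragraph.
        if first.tag != "bold" or (child.text or "").strip():
            continue
        count += 1
        # Assumed id scheme: parent id + "." + running number (s2.1 -> s2.1.1).
        subsec = ET.Element("sec", {"id": f"{sec.get('id')}.{count}"})
        title = ET.SubElement(subsec, "title")
        title.text = first.text
        # Text that followed the bold phrase becomes the paragraph's leading text.
        child.text = first.tail
        child.remove(first)
        subsec.append(child)
        # Swap the original paragraph for the new sub-section at the same position.
        sec.remove(child)
        sec.insert(index, subsec)
    return sec


example = ET.fromstring(
    "<sec id='s2.1'><p><bold>Extraction of Oils.</bold>. more text...</p></sec>"
)
promote_bold_paragraphs(example)
print(ET.tostring(example, encoding="unicode"))
```

Paragraphs without a leading bold phrase are left untouched, so the sketch can be applied to every sec element of a document.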
tree -h -L 2 PMC6808808/sections/
PMC6808808/sections/
├── [ 128]  0_front
│   ├── [ 256]  0_journal-meta
│   └── [ 736]  1_article-meta
├── [  96]  1_back
│   └── [ 197]  0_notes.xml
└── [  64]  2_floats-group
- All sections are pre-numbered to avoid collisions (e.g. later there are several pub-date records). Numbers reflect the reading/document order. This document has no body (it's an abstract).
- The [ddd] values show size in bytes.
PMC6808808/section/
0_front
├── [ 256]  0_journal-meta
│   ├── [ 113]  0_journal-id.xml
│   ├── [ 117]  1_journal-id.xml
│   ├── [ 102]  2_journal-id.xml
│   ├── [ 151]  3_journal-title-group.xml
│   ├── [  80]  4_issn.xml
│   └── [ 162]  5_publisher.xml
└── [ 736]  1_article-meta
    ├── [  94]  0_article-id.xml
    ├── [ 164]  10_pub-date.xml
    ├── [ 144]  11_pub-date.xml
    ├── [  60]  12_volume.xml
    ├── [  64]  13_issue.xml
    ├── [  90]  14_issue-title.xml
    ├── [  61]  15_fpage.xml
    ├── [  61]  16_lpage.xml
    ├── [1005]  17_permissions.xml
    ├── [ 125]  18_self-uri.xml
    ├── [2.8K]  19_abstract.xml
    ├── [ 109]  1_article-id.xml
    ├── [  87]  20_counts.xml
    ├── [ 105]  2_article-id.xml
    ├── [ 286]  3_article-categories.xml
    ├── [ 209]  4_title-group.xml
    ├── [1.3K]  5_contrib-group.xml
    ├── [ 159]  6_aff.xml
    ├── [ 180]  7_aff.xml
    ├── [ 182]  8_aff.xml
    └── [ 127]  9_pub-date.xml
These are all the tagged sections in the front partition. 0_front has two children:
- Metadata about the journal (its id, publisher, journal title, etc.)
- Metadata about the article (its dates, findability, ids, abstract, etc.). Authors are in contrib-group.
Note that repeated tags (e.g. journal-id) have unique pre-numbers.
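As a rough illustration of the naming scheme above, the following Python sketch splits a pre-numbered name into its number and tag, then groups repeated tags. The helper names split_name and group_by_tag are hypothetical, and the pattern (digits, an underscore, the tag, an optional .xml suffix) is inferred from the listings on this page rather than taken from the ami source.

```python
import re
from collections import defaultdict

# Assumed pattern: "<number>_<tag>" with an optional ".xml" suffix.
PRENUMBERED = re.compile(r"^(\d+)_(.+?)(\.xml)?$")


def split_name(name):
    """Return (pre-number, tag) for a section file or directory name."""
    match = PRENUMBERED.match(name)
    if match is None:
        raise ValueError(f"not a pre-numbered section name: {name!r}")
    return int(match.group(1)), match.group(2)


def group_by_tag(names):
    """Map each tag to the sorted list of pre-numbers under which it occurs."""
    groups = defaultdict(list)
    for name in names:
        number, tag = split_name(name)
        groups[tag].append(number)
    return {tag: sorted(numbers) for tag, numbers in groups.items()}


names = ["0_journal-id.xml", "1_journal-id.xml", "2_journal-id.xml", "4_issn.xml"]
print(group_by_tag(names))
```

A tag that maps to more than one number is a repeated element, such as journal-id in the listing above.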
Using a later article:
tree -h -L 2 PMC6994851/sections/
PMC6994851/sections/
├── [ 128]  0_front
│   ├── [ 224]  0_journal-meta
│   └── [ 736]  1_article-meta
├── [ 256]  1_body
│   ├── [ 256]  0_introduction
│   ├── [ 256]  1_methods_and_methodology
│   ├── [ 224]  2_results
│   ├── [ 320]  3_discussion
│   ├── [ 192]  4_conclusion_and_recommenda
│   └── [ 224]  5_declarations
├── [ 128]  2_back
│   ├── [1.1K]  0_ref-list
│   └── [ 403]  1_ack.xml
└── [ 288]  3_floats-group
    ├── [1.8K]  0_table 1.xml
    ├── [ 323]  1_figure 1.xml
    ├── [ 316]  2_figure 2.xml
    ├── [ 317]  3_figure 3.xml
    ├── [1.3K]  4_table 2.xml
    ├── [ 314]  5_figure 4.xml
    └── [ 316]  6_figure 5.xml
The body has non-standard sections, but they clearly map onto our proposed:
- introduction
- methods
- results
- discussion
There's another section ("conclusions") which is often conflated with discussion. The final section is complex:
└── [ 224] 5_declarations
│ ├── [ 69] 0_title.xml
│ ├── [ 224] 1_author_contribution_state
│ │ ├── [ 86] 0_title.xml
│ │ ├── [ 146] 1_p.xml
│ │ ├── [ 153] 2_p.xml
│ │ ├── [ 161] 3_p.xml
│ │ └── [ 196] 4_p.xml
│ ├── [ 128] 2_funding_statement
│ │ ├── [ 74] 0_title.xml
│ │ └── [ 184] 1_p.xml
│ ├── [ 128] 3_competing_interest_statem
│ │ ├── [ 85] 0_title.xml
│ │ └── [ 104] 1_p.xml
│ └── [ 128] 4_additional_information
│ ├── [ 79] 0_title.xml
│ └── [ 114] 1_p.xml
Some are clearly classifiable ("funding statement"); others are not ("additional information").
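The mapping from non-standard section titles onto a small set of proposed names could be sketched with a keyword lookup like the one below. This is only an illustration: the keyword table and the function canonical_section are hypothetical, and the heuristics in ami's own tagger are likely to differ.

```python
import re

# Hypothetical keyword table, built only from titles that appear on this page.
CANONICAL_KEYWORDS = [
    ("introduction", ("introduction",)),
    ("methods", ("method",)),
    ("results", ("result",)),
    ("discussion", ("discussion",)),
    ("conclusion", ("conclusion",)),
    ("funding", ("funding",)),
]


def canonical_section(dirname):
    """Return a canonical label for a pre-numbered section name, or None."""
    # Drop the pre-number ("1_methods_and_methodology" -> "methods_and_methodology").
    title = re.sub(r"^\d+_", "", dirname).lower()
    for canonical, keywords in CANONICAL_KEYWORDS:
        if any(keyword in title for keyword in keywords):
            return canonical
    return None  # not classifiable with this table


for name in ["1_methods_and_methodology", "2_funding_statement", "4_additional_information"]:
    print(name, "->", canonical_section(name))
```

Titles that match no keyword return None, which corresponds to the unclassifiable cases such as "additional information".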
├── [ 128] 2_back
│ ├── [1.1K] 0_ref-list
│ │ ├── [ 67] 0_title.xml
│ │ ├── [ 997] 10_ref.xml
│ │ ├── [1.1K] 11_ref.xml
...
│ │ ├── [ 642] 6_ref.xml
│ │ ├── [1.0K] 7_ref.xml
│ │ ├── [1.1K] 8_ref.xml
│ │ └── [ 786] 9_ref.xml
│ └── [ 403] 1_ack.xml
(Note that tree sorts them in lexical order, so 10_ref.xml is listed before 6_ref.xml.) ack is thanks and may have funder information.
"Floats" are chunks that don't fit into the reading order, normally tables and figures. ami section will move floats to a special area, floats-group, although this is often provided by JATS.
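Because tree lists names in lexical order, the document order has to be recovered from the numeric pre-numbers. A minimal Python sketch, assuming every name starts with digits followed by an underscore (the function name document_order is hypothetical):

```python
import re


def document_order(names):
    """Sort pre-numbered section names by their numeric prefix."""

    def key(name):
        match = re.match(r"(\d+)_", name)
        # Names without a pre-number sort last, then alphabetically.
        number = int(match.group(1)) if match else float("inf")
        return (number, name)

    return sorted(names, key=key)


listing = ["0_title.xml", "10_ref.xml", "11_ref.xml", "6_ref.xml", "9_ref.xml"]
print(document_order(listing))
```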
Sectioning of the dataset is usually done for greater precision.
- To download a corpus of 50 articles in XML format into the directory project, open the Command Prompt and enter:
getpapers -q "Viral epidemics" -o project -f mycorpus/log.txt -k 50 -x -p
- To divide the content of your papers into sections, open the Command Prompt again and enter:
ami -p project section
- This creates a sections subfolder in the folder of each scientific paper in your directory.
- Open the sections folder and you will find subfolders such as front, body, back and floats-group.
- This completes the sectioning part of your CProject.
- Make sure that you have no spaces in your directory name, as this will break the path of your command, e.g. it can be My_project but not My project.
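After running the steps above, a quick check of the result can be scripted. The Python sketch below assumes the layout described on this page (one subdirectory per paper inside the project, each gaining a sections subfolder); the function name check_project is hypothetical and not part of ami or getpapers.

```python
from pathlib import Path


def check_project(project_dir):
    """Return (sectioned, unsectioned) lists of paper directory names."""
    project = Path(project_dir)
    # Spaces in the directory name are reported to break the command path.
    if " " in project.name:
        raise ValueError(f"directory name contains spaces: {project.name!r}")
    sectioned, unsectioned = [], []
    for paper in sorted(p for p in project.iterdir() if p.is_dir()):
        if (paper / "sections").is_dir():
            sectioned.append(paper.name)
        else:
            unsectioned.append(paper.name)
    return sectioned, unsectioned
```

For example, check_project("project") would list which papers in the project directory have a sections subfolder and which do not.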
Beta tester: Ambreen H
An attempt was made to split all full-length papers (in XML format) within a directory into sections.
- 20 papers were downloaded into an output directory in XML format using getpapers:
getpapers -q "viral epidemics" -o sectioning\project -k 20 -x -p
- 17 full-length articles could be retrieved using this query.
- The directory project was next used as the input directory for ami section:
ami -p sectioning\project section
- The query executed successfully with a few warnings. These warnings were generated for all papers where clear element tags were unavailable, e.g. papers with no subsections for introduction, methodology etc.
Generic values (AMISectionTool)
================================
-v to see generic values
Specific values (AMISectionTool)
================================
xslt null
boldSections false
extract [table, fig, supplementary]
sectionList [AUTO]
sectiontype XML
summaryList []
write true
AMISectionTool cTree: PMC3561042
AMISectionTool cTree: PMC6517453
no class for: journal-subtitle
0 [main] WARN org.contentmine.norma.sections.JATSFactory - Unknown JATS Span journal-subtitle
0 [main] WARN org.contentmine.norma.sections.JATSFactory - Unknown JATS Span journal-subtitle
JATSElement untagged element: isbn
JATSElement untagged element: isbn
AMISectionTool cTree: PMC7120695
JATSElement untagged element: isbn
AMISectionTool cTree: PMC7300792 ...
- Query results:
- Each full-length research paper successfully sectioned as indicated above
- Review articles sectioned as per subheadings within the article
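To see which papers triggered the warnings in the log above, the output can be grouped per CTree. The following Python sketch assumes that each warning belongs to the most recent "AMISectionTool cTree:" line before it, which matches the excerpt but is not confirmed for ami in general. The function name warnings_by_ctree is hypothetical.

```python
import re
from collections import defaultdict

CTREE_LINE = re.compile(r"AMISectionTool cTree: (\S+)")
# Message formats copied from the log excerpt on this page.
WARNING_PATTERNS = {
    "untagged element": re.compile(r"JATSElement untagged element: (\S+)"),
    "unknown span": re.compile(r"Unknown JATS Span (\S+)"),
}


def warnings_by_ctree(log_text):
    """Map each CTree name to a sorted list of (kind, element) warnings."""
    found = defaultdict(set)
    current = None
    for line in log_text.splitlines():
        ctree = CTREE_LINE.search(line)
        if ctree:
            current = ctree.group(1)
            continue
        if current is None:
            continue  # warning before any CTree line: cannot be attributed
        for kind, pattern in WARNING_PATTERNS.items():
            match = pattern.search(line)
            if match:
                found[current].add((kind, match.group(1)))
    return {name: sorted(items) for name, items in found.items()}
```

CTrees that produced no warnings do not appear in the result.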
ami -p <project> section --hypertree mincount=2
generates the files hypertree.html and hypertree.xml as CProject children. This aggregates all the sections with common titles and adds counts. Preliminary results for 950 "viral epidemics" (without the documents) are at
https://github.com/petermr/openVirus/tree/master/viral_epidemics
(some common titles follow; see the link for a better display)
cTree 950
eupmc_result.json 950
results 950
search 950
country 950
drugs 950
funders 950
word 806
frequencies 806
sections 806
floats-group 806
front 806
article-meta 806
UNKspan 4
journal-meta 806
body 782
introduction 337
protein__and_peptide_base 4
discussion 206
limitations 4
results 184
contribution_of_all_virus 3
deficiency_of_hif_1α_in_a 3
phylogenetic_analysis_of 3
symbolic_transfer_entropy 3
The hypertree creation was tested on the disease corpus using:
ami -v -p disease/1-part section --summary all --hypertree mincount=2
The files hypertree.html and hypertree.xml were created.
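The idea of aggregating sections with common titles and keeping only those that reach a minimum count can be approximated in a few lines of Python. This is a sketch, not a description of how ami builds the hypertree: it ignores nesting, counts each title once per CTree, and assumes directory names carry a numeric pre-number. The function name common_section_titles is hypothetical.

```python
import re
from collections import Counter
from pathlib import Path


def common_section_titles(project_dir, mincount=2):
    """Count in how many CTrees each section directory title occurs."""
    counts = Counter()
    for ctree in Path(project_dir).iterdir():
        sections = ctree / "sections"
        if not sections.is_dir():
            continue
        titles = set()
        for path in sections.rglob("*"):
            if path.is_dir():
                # Strip the pre-number so "0_introduction" and "1_introduction" agree.
                titles.add(re.sub(r"^\d+_", "", path.name))
        counts.update(titles)
    # Keep only titles found in at least `mincount` CTrees.
    return {title: n for title, n in counts.items() if n >= mincount}
```

Raising mincount drops rare titles, in the same spirit as the mincount=2 setting in the commands above.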