Lakshmi Devi Priya edited this page Aug 14, 2020 · 11 revisions

ami section

purpose

Splits a document in a CTree into sections, based on:

  • tags from JATS, etc.
  • text labels in document
  • @class attributes

Will extract tables and figures.

prerequisite:

  • fulltext.xml (in JATS)
  • scholarly.html (if semantic and structured).

constraints

  • sections are difficult to extract from PDF and from many raw HTML files.

syntax

simple

Uses default heuristics and common section names

 ami -p <cproject> section

This produces the output shown in the "output" section below.

help

Description
===========
Splits XML files into sections using XPath.
Creates names from titles of sections (or 'elem<num>.xml' if cannot)
optionally writes HTML (slow) using specified stylesheet
examples:
    --sections ALL --html nlm2html
         //not sure this works    --sections ABSTRACT ACK_FUND --write false

    --forcemake --extract table fig --summary figure table         // this seems to create sections OK, use this?
Options
=======
      --boldsections      convert paras with bold first sentence/phrase into subsections.
                          e.g. <sec id='s2.1'><p><bold>Extraction of Oils.</bold>. more text...</p></sec>
                          =>  <sec id='s2.1'><sec id='s2.1.1'><title>Extraction of Oils.</title>. <p>more text...
                            </p></sec>

      --extract[=<extractList>...]
                          extract float elements to subdirectory,default table, fig, supplementary)
                            Default: [table, fig, supplementary]
  -h, --help              Show this help message and exit.
      --html=<xsltName>   convert sections to HTML using stylesheet (convention as in --transform). recommend:
                            nlm2html; if omitted defaults to no HTML currently 201909 very slow since XSLT seems to be
                            slow,  seems to be size related (references can take 1 sec)
      --sections[=<sectionTagList>...]
                          sections to extract (uses JATSSectionTagger)
                          if none, lists Tagger tags
                          ALL selects all tags in Tagger
                          AUTO creates hierchical tree based on JATS and heuristics (default)
                          ,
                            Default: [AUTO]
      --sectiontype=<sectionType>
                          Type of section (XML or HTML) default XML. Probably only used in development
                            Default: XML
      --summary=<summaryList>...
                          create summary files for sections
                            Default: []
  -V, --version           Print version information and exit.
      --write             write section files (may be customised later);
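
When ami section is driven from a script, the options above can be assembled into an argument list. The following is a minimal sketch: the helper name build_section_command is hypothetical, only flags that appear in the help text are used, and actually running the command (for example with subprocess.run) assumes that ami is installed and on the PATH.

```python
import shlex
from typing import List, Optional, Sequence


def build_section_command(
    project: str,
    extract: Optional[Sequence[str]] = None,
    summary: Optional[Sequence[str]] = None,
) -> List[str]:
    """Assemble an argument list for `ami -p <cproject> section`.

    Hypothetical helper: it only builds the list and does not run ami.
    """
    cmd = ["ami", "-p", project, "section"]
    if extract:
        # Space-separated values, as in the help example "--extract table fig"
        cmd += ["--extract", *extract]
    if summary:
        cmd += ["--summary", *summary]
    return cmd


cmd = build_section_command(
    "project", extract=["table", "fig"], summary=["figure", "table"]
)
print(shlex.join(cmd))
# ami -p project section --extract table fig --summary figure table
```

Passing the list form to subprocess.run avoids shell word-splitting, which matters if a project path ever contains spaces.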

output

 tree -h -L 2 PMC6808808/sections/
PMC6808808/sections/
├── [ 128]  0_front
│   ├── [ 256]  0_journal-meta
│   └── [ 736]  1_article-meta
├── [  96]  1_back
│   └── [ 197]  0_notes.xml
└── [  64]  2_floats-group

Notes:

  • all sections are pre-numbered to avoid name collisions (e.g. article-meta, shown later, contains several pub-date records). Numbers reflect the reading/document order. This document has no body (it's an abstract).
  • the [ddd] before each entry shows its size in bytes

front section

PMC6808808/sections/
0_front
├── [ 256]  0_journal-meta
│   ├── [ 113]  0_journal-id.xml
│   ├── [ 117]  1_journal-id.xml
│   ├── [ 102]  2_journal-id.xml
│   ├── [ 151]  3_journal-title-group.xml
│   ├── [  80]  4_issn.xml
│   └── [ 162]  5_publisher.xml
└── [ 736]  1_article-meta
    ├── [  94]  0_article-id.xml
    ├── [ 164]  10_pub-date.xml
    ├── [ 144]  11_pub-date.xml
    ├── [  60]  12_volume.xml
    ├── [  64]  13_issue.xml
    ├── [  90]  14_issue-title.xml
    ├── [  61]  15_fpage.xml
    ├── [  61]  16_lpage.xml
    ├── [1005]  17_permissions.xml
    ├── [ 125]  18_self-uri.xml
    ├── [2.8K]  19_abstract.xml
    ├── [ 109]  1_article-id.xml
    ├── [  87]  20_counts.xml
    ├── [ 105]  2_article-id.xml
    ├── [ 286]  3_article-categories.xml
    ├── [ 209]  4_title-group.xml
    ├── [1.3K]  5_contrib-group.xml
    ├── [ 159]  6_aff.xml
    ├── [ 180]  7_aff.xml
    ├── [ 182]  8_aff.xml
    └── [ 127]  9_pub-date.xml

These are all the tagged sections in the front partition. 0_front has two children:

journal-meta

Metadata about the journal (its id, publisher, journal title, etc.)

article-meta

Metadata about the article (its dates, findability, ids, abstract, etc.). Authors are in contrib-group.

Note that repeated tags (e.g. journal-id) have unique pre-numbers.
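
To see which tags repeat inside a section directory, the pre-number and the .xml suffix can be stripped and the remaining tag names counted. This is an illustrative sketch; the helper tag_of is hypothetical and the file names are copied from the 0_journal-meta listing above.

```python
import re
from collections import Counter

# File names copied from the 0_journal-meta listing above.
names = [
    "0_journal-id.xml",
    "1_journal-id.xml",
    "2_journal-id.xml",
    "3_journal-title-group.xml",
    "4_issn.xml",
    "5_publisher.xml",
]


def tag_of(name: str) -> str:
    """Strip the leading '<n>_' pre-number and a trailing '.xml'."""
    stem = re.sub(r"^\d+_", "", name)
    return stem[:-4] if stem.endswith(".xml") else stem


tag_counts = Counter(tag_of(n) for n in names)
# Keep only the tags that occur more than once.
repeated = {tag: n for tag, n in tag_counts.items() if n > 1}
print(repeated)  # {'journal-id': 3}
```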

body section

Using a later article:

tree -h -L 2 PMC6994851/sections/
PMC6994851/sections/
├── [ 128]  0_front
│   ├── [ 224]  0_journal-meta
│   └── [ 736]  1_article-meta
├── [ 256]  1_body
│   ├── [ 256]  0_introduction
│   ├── [ 256]  1_methods_and_methodology
│   ├── [ 224]  2_results
│   ├── [ 320]  3_discussion
│   ├── [ 192]  4_conclusion_and_recommenda
│   └── [ 224]  5_declarations
├── [ 128]  2_back
│   ├── [1.1K]  0_ref-list
│   └── [ 403]  1_ack.xml
└── [ 288]  3_floats-group
    ├── [1.8K]  0_table 1.xml
    ├── [ 323]  1_figure 1.xml
    ├── [ 316]  2_figure 2.xml
    ├── [ 317]  3_figure 3.xml
    ├── [1.3K]  4_table 2.xml
    ├── [ 314]  5_figure 4.xml
    └── [ 316]  6_figure 5.xml

The body has non-standard section names, but they clearly map onto our proposed standard sections:

  • introduction
  • methods
  • results
  • discussion

There is another section ("conclusions"), which is often conflated with discussion. The final section is complex:

│   └── [ 224]  5_declarations
│       ├── [  69]  0_title.xml
│       ├── [ 224]  1_author_contribution_state
│       │   ├── [  86]  0_title.xml
│       │   ├── [ 146]  1_p.xml
│       │   ├── [ 153]  2_p.xml
│       │   ├── [ 161]  3_p.xml
│       │   └── [ 196]  4_p.xml
│       ├── [ 128]  2_funding_statement
│       │   ├── [  74]  0_title.xml
│       │   └── [ 184]  1_p.xml
│       ├── [ 128]  3_competing_interest_statem
│       │   ├── [  85]  0_title.xml
│       │   └── [ 104]  1_p.xml
│       └── [ 128]  4_additional_information
│           ├── [  79]  0_title.xml
│           └── [ 114]  1_p.xml

Some of these subsections are clearly classifiable ("funding statement"); others are not ("additional information").
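
The mapping from directory names to a small set of labels can be sketched with a keyword table. This is an assumption for illustration only: the KEYWORDS table and the classify function are hypothetical and are not the rules used by ami's JATSSectionTagger.

```python
import re
from typing import Optional

# Hypothetical keyword table, for illustration only.
KEYWORDS = {
    "introduction": ("introduction",),
    "methods": ("method",),
    "results": ("result",),
    "discussion": ("discussion",),
    "conclusion": ("conclusion",),
    "funding": ("funding",),
}


def classify(dirname: str) -> Optional[str]:
    """Return a label for a pre-numbered section directory, or None."""
    name = re.sub(r"^\d+_", "", dirname).lower()
    for label, words in KEYWORDS.items():
        if any(word in name for word in words):
            return label
    return None


for d in ["1_methods_and_methodology", "2_funding_statement", "4_additional_information"]:
    print(d, "->", classify(d))
```

A name such as 4_additional_information matches no keyword and returns None, which mirrors the unclassifiable case described above.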

back

├── [ 128]  2_back
│   ├── [1.1K]  0_ref-list
│   │   ├── [  67]  0_title.xml
│   │   ├── [ 997]  10_ref.xml
│   │   ├── [1.1K]  11_ref.xml
...
│   │   ├── [ 642]  6_ref.xml
│   │   ├── [1.0K]  7_ref.xml
│   │   ├── [1.1K]  8_ref.xml
│   │   └── [ 786]  9_ref.xml
│   └── [ 403]  1_ack.xml

(Note that tree sorts entries in lexical order, so 10_ref.xml is listed before 6_ref.xml.) The ack section holds acknowledgements and may include funder information.
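
To recover document order from a lexically sorted listing, the names can be sorted on the integer value of the pre-number. A minimal sketch follows; document_order_key is a hypothetical helper name and the file names come from the ref-list listing above.

```python
import re
from typing import Tuple


def document_order_key(name: str) -> Tuple[int, int, str]:
    """Sort key: numeric pre-number first; names without one go last."""
    match = re.match(r"(\d+)_", name)
    if match:
        return (0, int(match.group(1)), name)
    return (1, 0, name)


# Lexical order, as printed by tree.
names = [
    "0_title.xml",
    "10_ref.xml",
    "11_ref.xml",
    "6_ref.xml",
    "7_ref.xml",
    "8_ref.xml",
    "9_ref.xml",
]

ordered = sorted(names, key=document_order_key)
print(ordered)
# ['0_title.xml', '6_ref.xml', '7_ref.xml', '8_ref.xml', '9_ref.xml', '10_ref.xml', '11_ref.xml']
```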

floats-group

"Floats" are chunks that don't fit into the reading order, normally tables and figures. ami section moves floats into a special area, floats-group, although JATS often provides this already.

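The files in a floats-group directory can be grouped by kind using the word that follows the pre-number. The sketch below is illustrative: group_floats is a hypothetical helper, and the demo builds a throwaway directory with placeholder files named like those in the PMC6994851 listing rather than reading real ami output.

```python
import re
import tempfile
from collections import defaultdict
from pathlib import Path
from typing import Dict, List


def group_floats(floats_dir: Path) -> Dict[str, List[str]]:
    """Group '<n>_<kind> <m>.xml' files by <kind> (e.g. table, figure)."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for path in floats_dir.glob("*.xml"):
        match = re.match(r"\d+_([A-Za-z]+)", path.name)
        kind = match.group(1).lower() if match else "other"
        groups[kind].append(path.name)
    return {kind: sorted(files) for kind, files in groups.items()}


# Demo on a temporary directory with placeholder content.
with tempfile.TemporaryDirectory() as tmp:
    floats_dir = Path(tmp) / "3_floats-group"
    floats_dir.mkdir()
    for name in ["0_table 1.xml", "1_figure 1.xml", "2_figure 2.xml", "4_table 2.xml"]:
        (floats_dir / name).write_text("<placeholder/>")
    floats = group_floats(floats_dir)

print(floats)
```
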
Tester 1: Vaishali Arora

  • Sectioning a dataset lets later analysis target specific parts of each paper, which gives greater precision.

  • First, download a corpus of 50 articles in XML format into the directory project.

  • Open the Command Prompt and run:

    getpapers -q "Viral epidemics" -o project -f mycorpus/log.txt -k 50 -x -p

  • To divide the content of your papers into sections, open the Command Prompt again and run:

    ami -p project section

  • This creates a sections subfolder inside the folder of each paper in your directory.

  • Open the sections folder; it contains pre-numbered subfolders for front, body, back and floats-group (e.g. 0_front).

  • This completes the sectioning of your CProject.

Issues:

  • Make sure that your directory name contains no spaces, as a space will break the path in your command. E.g. use My_project, not My project.
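
A directory name can be checked for spaces before it is passed to ami. This is a small sketch; both helper names are hypothetical.

```python
def has_whitespace(path: str) -> bool:
    """True if the path contains any whitespace character."""
    return any(ch.isspace() for ch in path)


def safe_project_name(name: str) -> str:
    """Replace each run of whitespace with a single underscore."""
    return "_".join(name.split())


name = "My project"
if has_whitespace(name):
    name = safe_project_name(name)
print(name)  # My_project
```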

Test 2

Beta tester: Ambreen H

An attempt was made to split all full-length papers (in XML format) within a directory into sections

  1. A download of 20 papers in XML format into an output directory was requested using getpapers: getpapers -q "viral epidemics" -o sectioning\project -k 20 -x -p

  2. This query retrieved 17 full-length articles.

  3. The directory project was then used as the input directory for ami section: ami -p sectioning\project section

  4. The command ran successfully with a few warnings. These warnings were generated for all papers where clear element tags were unavailable, e.g. papers with no subsections for introduction, methodology etc.

Generic values (AMISectionTool)
================================
-v to see generic values

Specific values (AMISectionTool)
================================
xslt                    null
boldSections            false
extract                 [table, fig, supplementary]
sectionList             [AUTO]
sectiontype             XML
summaryList             []
write                   true

AMISectionTool cTree: PMC3561042
AMISectionTool cTree: PMC6517453

no class for: journal-subtitle
0    [main] WARN  org.contentmine.norma.sections.JATSFactory  - Unknown JATS Span journal-subtitle
0 [main] WARN org.contentmine.norma.sections.JATSFactory  - Unknown JATS Span journal-subtitle
JATSElement untagged element: isbn
JATSElement untagged element: isbn
AMISectionTool cTree: PMC7120695
JATSElement untagged element: isbn
AMISectionTool cTree: PMC7300792 ...

  5. Results:
  • Each full-length research paper successfully sectioned as indicated above
  • Review articles sectioned as per subheadings within the article
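
The warnings in the log above can be tallied to see which JATS elements are not yet handled. The sketch below is an assumption-laden illustration: the regular expression is fitted only to the two message forms visible in this excerpt ("Unknown JATS Span" and "untagged element:"), and unhandled_elements is a hypothetical helper.

```python
import re
from collections import Counter

# Lines copied from the log excerpt above.
log = """\
AMISectionTool cTree: PMC6517453
no class for: journal-subtitle
0    [main] WARN  org.contentmine.norma.sections.JATSFactory  - Unknown JATS Span journal-subtitle
0 [main] WARN org.contentmine.norma.sections.JATSFactory  - Unknown JATS Span journal-subtitle
JATSElement untagged element: isbn
JATSElement untagged element: isbn
AMISectionTool cTree: PMC7120695
JATSElement untagged element: isbn
"""

# Matches only the two message forms seen in this excerpt.
PATTERN = re.compile(r"(?:Unknown JATS Span|untagged element:)\s+(\S+)")


def unhandled_elements(text: str) -> Counter:
    """Count element names reported as unknown or untagged."""
    return Counter(m.group(1) for m in PATTERN.finditer(text))


counts = unhandled_elements(log)
print(dict(counts))  # {'journal-subtitle': 2, 'isbn': 3}
```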

hypertree

ami -p <project> section --hypertree mincount=2

This generates the files hypertree.html and hypertree.xml as children of the CProject. It aggregates all sections with common titles and adds counts. Preliminary results for 950 "viral epidemics" papers (without the documents) are at:

https://github.com/petermr/openVirus/tree/master/viral_epidemics

Some common titles with their counts (see the link above for a better display):

cTree 950
eupmc_result.json 950
results 950
search 950
country 950
drugs 950
funders 950
word 806
frequencies 806
sections 806
floats-group 806
front 806
article-meta 806
UNKspan 4
journal-meta 806
body 782
introduction 337
protein__and_peptide_base 4
discussion 206
limitations 4
results 184
contribution_of_all_virus 3
deficiency_of_hif_1α_in_a 3
phylogenetic_analysis_of 3
symbolic_transfer_entropy 3
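
The aggregation behind these counts can be sketched as a tally of section titles across CTrees with a minimum-count filter. This is a sketch under stated assumptions, not the ami implementation: the function count_titles is hypothetical, the input is toy data, each title is counted at most once per CTree, and mincount is read as "keep titles that occur at least this many times".

```python
from collections import Counter
from typing import Dict, Iterable


def count_titles(
    sections_by_ctree: Dict[str, Iterable[str]], mincount: int = 2
) -> Dict[str, int]:
    """Count in how many CTrees each title occurs; drop rare titles."""
    counts: Counter = Counter()
    for titles in sections_by_ctree.values():
        counts.update(set(titles))  # assumption: once per CTree
    return {title: n for title, n in counts.items() if n >= mincount}


# Toy input with hypothetical CTree names.
toy = {
    "ctree_a": ["introduction", "results", "discussion"],
    "ctree_b": ["introduction", "discussion", "limitations"],
    "ctree_c": ["introduction", "results"],
}

common = count_titles(toy, mincount=2)
print(common)
```

With this toy input, "limitations" occurs in only one CTree and is filtered out.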

Tester: Lakshmi Devi Priya

Hypertree creation was tested on the disease corpus using the command:

ami -v -p disease/1-part section --summary all --hypertree mincount=2

The files hypertree.html and hypertree.xml were created.
