Merge pull request #77 from omicsNLP/dev-docs
Add some developer documentation
Thomas-Rowlands authored Nov 1, 2024
2 parents 4acd8e6 + 3211c5d commit 0ba1bbd
Showing 8 changed files with 86 additions and 56 deletions.
2 changes: 1 addition & 1 deletion .github/ISSUE_TEMPLATE/bug_report.md
@@ -4,7 +4,7 @@ about: Create a report to help us improve
labels: bug
---

## Describe the bug
## Describe the bug <!-- markdownlint-disable-line MD041 -->

A clear and concise description of what the bug is, including error messages.

2 changes: 1 addition & 1 deletion .github/ISSUE_TEMPLATE/feature_request.md
@@ -4,7 +4,7 @@ about: Suggest an idea for this project
labels: enhancement
---

## Is your feature request related to a problem? Please describe
## Is your feature request related to a problem? Please describe <!-- markdownlint-disable-line MD041 -->

A clear and concise description of what the problem is. E.g. I'm always frustrated when [...]

4 changes: 0 additions & 4 deletions .markdownlint.yaml
@@ -1,6 +1,2 @@
# For list of markdownlint rules, see: https://github.com/markdownlint/markdownlint/blob/main/docs/RULES.md
MD013: false
MD033: false
MD036: false
MD040: false
MD041: false
2 changes: 1 addition & 1 deletion CONTRIBUTING.md
@@ -149,7 +149,7 @@ Please follow these steps to have your contribution considered by the maintainer
the new feature.
4. Follow the [styleguides](#styleguides)
5. After you submit your pull request, verify that all [status
checks](https://help.github.com/articles/about-status-checks/) are passing <details><summary>What
checks](https://help.github.com/articles/about-status-checks/) are passing <details><summary>What <!-- markdownlint-disable-line MD033 -->
if the status checks are failing?</summary>If a status check is failing, and you believe that the
failure is unrelated to your change, please leave a comment on the pull request explaining why
you believe the failure is unrelated. A maintainer will re-run the status check for you. If we
110 changes: 72 additions & 38 deletions README.md
@@ -1,16 +1,47 @@
# Auto-CORPus

[![DOI:10.1101/2021.01.08.425887](http://img.shields.io/badge/DOI-10.1101/2021.01.08.425887-BE2536.svg)](https://doi.org/10.1101/2021.01.08.425887)
[![DOI:10.3389/fdgth.2022.788124](http://img.shields.io/badge/DOI-10.3389/fdgth.2022.788124-70286A.svg)](https://doi.org/10.3389/fdgth.2022.788124)
[![License: GPL v3](https://img.shields.io/badge/License-GPLv3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0)

# Auto-CORPus

*Requires Python 3.10+*
*Requires Python 3.10+* <!-- markdownlint-disable-line MD036 -->

The Automated pipeline for Consistent Outputs from Research Publications (Auto-CORPus) is a tool for the standardisation and conversion of publication HTML to three convenient machine-interpretable outputs to support biomedical text analytics. Firstly, Auto-CORPus can be configured to convert HTML from various publication sources to [BioC format](http://bioc.sourceforge.net/). Secondly, Auto-CORPus transforms publication tables to a JSON format to store, exchange and annotate table data between text analytics systems. Finally, Auto-CORPus extracts abbreviations declared within publication text and provides an abbreviations JSON output that relates an abbreviation with the full definition.

We present a JSON format for sharing table content and metadata that is based on the BioC format. The [JSON schema](keyFiles/table_schema.json) for the tables JSON can be found within the [keyfiles](keyFiles) directory.

**Config files**
## Installation

Install with pip

```sh
pip install autocorpus
```

## Usage

To process a single file, run:

```sh
auto-corpus -c "configs/config_pmc.json" -t "output" -f "path/to/html/file" -o JSON
```

To process a directory of files, run:

```sh
auto-corpus -c "configs/config_pmc.json" -t "output" -f "path/to/directory/of/html/files" -o JSON
```

### Available arguments

| Flag | Name | Description |
| -------- | ------- | ------- |
| `-f` | Input File Path | File or directory to run Auto-CORPus on |
| `-t` | Output File Path | Directory path where Auto-CORPus should save output files |
| `-c` | Config | Which config file to use |
| `-o` | Output Format | Either `JSON` or `XML` (defaults to `JSON`) |
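
For scripted runs, the same invocation can be assembled from Python. The sketch below is illustrative only: `build_command` and `run_auto_corpus` are hypothetical helpers, not part of Auto-CORPus, and they use only the four flags listed in the table above. It assumes the `auto-corpus` script is installed and on the `PATH`.

```python
import subprocess


def build_command(config, output_dir, input_path, output_format="JSON"):
    """Assemble the auto-corpus argument list (hypothetical helper)."""
    if output_format not in ("JSON", "XML"):
        raise ValueError("output format must be 'JSON' or 'XML'")
    return [
        "auto-corpus",
        "-c", config,         # which config file to use
        "-t", output_dir,     # directory for the output files
        "-f", input_path,     # a single file or a directory of files
        "-o", output_format,  # JSON (default) or XML
    ]


def run_auto_corpus(config, output_dir, input_path, output_format="JSON"):
    """Run the CLI and raise CalledProcessError on a non-zero exit status."""
    cmd = build_command(config, output_dir, input_path, output_format)
    return subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(" ".join(build_command("configs/config_pmc.json", "output", "path/to/html/file")))
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.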

## Config files

If you wish to contribute or edit a config file then please follow the instructions in the [config guide](docs/config_tutorial.md).

@@ -21,7 +52,7 @@ Auto-CORPus is able to parse HTML from different publishers, which utilise diffe
- Full text HTML documents covering the entire article
- HTML files which describe a single table

Current work in progress is extending this to include images of tables. See the [Alpha Testing](#alpha) section below.
Current work in progress is extending this to include images of tables. See the [Alpha Testing](#alpha-testing) section below.

Auto-CORPus does not provide functionality to retrieve input files directly from the publisher. Input file retrieval must be completed by the user in a way which the publisher permits.

@@ -40,7 +71,7 @@ Auto-CORPus will first group files based on common elements in their file name {

**Input:**

```
```sh
PMC1.html
PMC1_table_1.html
PMC1_table_2.html
@@ -51,7 +82,7 @@ PMC1_table_2.html

**Output:**

```
```sh
PMC1_bioc.json
PMC1_abbreviations.json
PMC1_tables.json (contains table 1 & 2 and any tables described within the main text)
@@ -62,48 +93,50 @@ PMC1_tables.json (contains table 1 & 2 and any tables described within the main
A log file is produced in the output directory providing details of the day/time Auto-CORPus was run,
the arguments used and information about which files were successfully/unsuccessfully processed with a relevant error message.
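
The grouping of input files by shared name, shown in the input/output example earlier, can be pictured with a short sketch. This is not the Auto-CORPus implementation: `group_inputs` and `expected_outputs` are hypothetical functions, and the `_table_N` suffix pattern is assumed from the example file names.

```python
import re
from collections import defaultdict

# Assumed naming convention, taken from the example:
# <name>.html for the full text and <name>_table_<N>.html for single tables.
TABLE_SUFFIX = re.compile(r"^(?P<stem>.+)_table_\d+$")


def group_inputs(filenames):
    """Group file names by the stem they share (hypothetical helper)."""
    groups = defaultdict(lambda: {"main": None, "tables": []})
    for name in filenames:
        stem = name.rsplit(".", 1)[0]  # drop the file extension
        match = TABLE_SUFFIX.match(stem)
        if match:
            groups[match.group("stem")]["tables"].append(name)
        else:
            groups[stem]["main"] = name
    return dict(groups)


def expected_outputs(stem):
    """Output file names for one group, following the example above."""
    return [f"{stem}_bioc.json", f"{stem}_abbreviations.json", f"{stem}_tables.json"]


if __name__ == "__main__":
    files = ["PMC1.html", "PMC1_table_1.html", "PMC1_table_2.html"]
    for stem, members in group_inputs(files).items():
        print(stem, members, expected_outputs(stem))
```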

**Getting started:**
## For developers

Clone the repo, e.g.:
This is a Python application that uses [poetry](https://python-poetry.org) for packaging
and dependency management. It also provides [pre-commit](https://pre-commit.com/) hooks
for various linters and formatters and automated tests using
[pytest](https://pytest.org/) and [GitHub Actions](https://github.com/features/actions).

```
git clone git@github.com:omicsNLP/Auto-CORPus.git # (using SSH)
git clone https://github.com/omicsNLP/Auto-CORPus.git # (using HTTPS)
```

```
cd Auto-CORPus
```
To get started:

```
poetry install
```
1. [Download and install Poetry](https://python-poetry.org/docs/#installation) following the instructions for your OS.
1. Clone this repository and make it your working directory
1. Set up the virtual environment:

Run the below command for a single file example
```sh
poetry install
```

```
auto-corpus -c "configs/config_pmc.json" -t "output" -f "path/to/html/file" -o JSON
```
1. Activate the virtual environment (alternatively, ensure any Python-related command is preceded by `poetry run`):

Run the below command for a directory of files example
```sh
poetry shell
```

```
auto-corpus -c "configs/config_pmc.json" -t "output" -f "path/to/directory/of/html/files" -o JSON
```
1. Install the git hooks:

**Note:** `python -m autocorpus` can be used instead of `auto-corpus`
```sh
pre-commit install
```

**Available arguments:**
1. Run the main app on a single file:

`-f` (input file path) - file or directory to run Auto-CORPus on
```sh
python -m autocorpus -c "configs/config_pmc.json" -t "output" -f "path/to/html/file" -o JSON
```

`-t` (output file path) - file path where Auto-CORPus should output files
1. Run the main app on a directory of files:

`-c` (config) - which config file to use
```sh
python -m autocorpus -c "configs/config_pmc.json" -t "output" -f "path/to/directory/of/html/files" -o JSON
```

`-o`(output format) - either JSON or XML (defaults to JSON)
**Note:** The `auto-corpus` command-line script is also available and behaves the same as `python -m autocorpus`.

<h3><a name="alpha">Alpha testing</a></h3>
## Alpha testing

We are developing an Auto-CORPus plugin to process images of tables and we include an alpha version of this
functionality. Table image files can be processed in either .png or .jpeg/jpg formats. We are working on improving the accuracy of both the table layout and character recognition aspects, and we will update this repo as the plugin advances.
@@ -120,7 +153,8 @@ Table image file: {any_name_you_want}_table_X.png/jpg/jpeg

- {any_name_you_want} must be identical to the name given to the full text file, followed by `_table_X`, where X is the table number

**Additional argument:**
### Additional argument

`-s` (trained dataset) - trained dataset to use for pytesseract OCR. Value should be given in a format
recognised by pytesseract with a "+" between each datafile, such as "eng+all".
| Flag | Name | Description |
| -------- | ------- | ------- |
| `-s` | Trained Dataset | Trained dataset to use for pytesseract OCR. Value should be given in a format recognised by pytesseract with a "+" between each datafile, such as "eng+all" |
18 changes: 9 additions & 9 deletions docs/config_tutorial.md
@@ -1,6 +1,6 @@
**How to create/edit a config file**
# How to create/edit a config file

For instructions on how to submit your own configs or make changes to existing config files please see [submit/update](#submit)
For instructions on how to submit your own configs or make changes to existing config files please see [submit/update](#submitting-and-editing-config-files)

*The [config_pmc.json](https://github.com/omicsNLP/Auto-CORPus/blob/main/configs/config_pmc.json) file is used as the example in this tutorial.*

@@ -10,7 +10,7 @@ so we recommend starting from the template config file or a working config file

For each section in a publication, the config declares `data` and `defined-by` entities.

```
```json
{
"section":{
"defined-by":[],
@@ -22,7 +22,7 @@ For each section in a publication, the config declares `data` and `defined-by` e
The `defined-by` entity provides a list of HTML tags and attributes which Auto-CORPus can utilise to find occurrences of the
section within the source HTML. Each section must contain a `defined-by` entity.

```
```json
{
"section":{
"defined-by":[
@@ -40,7 +40,7 @@ The `data` entity allows HTML tags and attributes for areas of interest within
the defined section to be defined. This could be the title of a table or the heading of a section. Some of these `data` elements
are required to allow Auto-CORPus to accurately parse the source HTML, whereas others are optional and let the user parse extra information from certain sections if provided by an HTML source. Further details about the `data` elements can be found in [data_elements.md](data_elements.md).

```
```json
{
"section":{
"defined-by":[],
@@ -66,7 +66,7 @@ one of the provided `tag`/`attrs` pairs.
Each `defined-by` or `data` element list object can contain a `tag` and an `attrs` entry. The `tag` entry defines the HTML tag used to denote the section. The `attrs` entry is used to pass in HTML attributes which can uniquely identify
this section from others.

```
```json
{
"tag": "div",
"attrs": {"class": ["ref-cit-blk"]}
@@ -79,7 +79,7 @@ Regular expressions can be used within the `tag` entry value and `attrs` entry v
Auto-CORPus automatically encloses every `tag` and `attrs` entry in the regex start (`^`) and end (`$`) anchors to prevent
erroneous matches. In [config_pmc.json](https://github.com/omicsNLP/Auto-CORPus/blob/main/configs/config_pmc.json), article sections are defined using the `attrs` below:

```
```json
"attrs": {"class": "sec"}
```
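
The effect of the anchors can be reproduced with Python's `re` module. This is a sketch of the behaviour described above, not the Auto-CORPus source: `anchored` is a hypothetical helper, and the class names other than `sec` are made up for illustration.

```python
import re


def anchored(pattern):
    """Wrap a config pattern in start/end anchors (hypothetical helper)."""
    return f"^{pattern}$"


# Made-up class values; only "sec" comes from the config example.
classes = ["sec", "sec-type", "subsec"]

# Unanchored: the pattern matches anywhere inside the value.
loose = [c for c in classes if re.search("sec", c)]

# Anchored: only a value that is exactly "sec" matches.
strict = [c for c in classes if re.search(anchored("sec"), c)]

if __name__ == "__main__":
    print("unanchored:", loose)
    print("anchored:  ", strict)
```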

@@ -89,7 +89,7 @@ Without the inclusion of the start and end anchors, Auto-CORPus would also find

Below are two further examples of how this regex approach can be used:

```
```json
{
"tag": "p",
"attrs": {"id": "_{0,2}p\\d+"}
@@ -107,7 +107,7 @@ headers at the same time.

Within the first example, notice the use of "\\\d" instead of the usual "\d" for identifying any digit. This is due to the regex pattern being defined within the config which is a JSON file. For further information about escaping special characters within JSON have a look at [this guide by tutorials point](https://www.tutorialspoint.com/json_simple/json_simple_escape_characters.htm).
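
The double backslash can be checked with Python's standard `json` and `re` modules. The snippet is a minimal sketch using the first example entry; the sample `id` values are made up for illustration.

```python
import json
import re

# Raw string, so both backslashes reach the JSON parser unchanged.
config_text = r'{"tag": "p", "attrs": {"id": "_{0,2}p\\d+"}}'

entry = json.loads(config_text)
# The JSON parser turns the escaped "\\" into a single backslash,
# leaving the regex a Python user would normally write.
pattern = entry["attrs"]["id"]

# Made-up id values to exercise the pattern.
matches = {value: bool(re.fullmatch(pattern, value)) for value in ["p3", "__p12", "px"]}

# A single backslash before "d" is not a valid JSON escape sequence.
try:
    json.loads(r'{"id": "p\d+"}')
    single_backslash_ok = True
except json.JSONDecodeError:
    single_backslash_ok = False

if __name__ == "__main__":
    print(pattern)
    print(matches)
    print("single backslash accepted:", single_backslash_ok)
```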

<h3><a name="submit">Submitting/editing config files</a></h3>
## Submitting and editing config files

To submit a new config file or edit an existing one, please follow these instructions:

2 changes: 1 addition & 1 deletion docs/data_elements.md
@@ -1,4 +1,4 @@
**Use of data elements**
# Use of data elements

The `data` entities within the config file sections allow Auto-CORPus to parse details out of the sections, such as the section
header, table title or footer. Some `data` entity elements are required to allow Auto-CORPus to parse source HTML files
2 changes: 1 addition & 1 deletion docs/index.md
@@ -1 +1 @@
--8<-- "README.md"
--8<-- "README.md" <!-- markdownlint-disable-line MD041 -->
