Merge pull request #58 from tsmbland/pre-commit
Add pre-commit-config and apply hooks
AdrianDAlessandro authored Oct 31, 2024
2 parents a46a546 + f8ede21 commit e8bf730
Showing 29 changed files with 4,432 additions and 3,989 deletions.
1 change: 1 addition & 0 deletions .codespell_ignore.txt
@@ -0,0 +1 @@
tabl
37 changes: 37 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,37 @@
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.6.0
    hooks:
      - id: check-merge-conflict
      - id: debug-statements
      - id: trailing-whitespace
        exclude: ^src/IAO_dicts/
      - id: end-of-file-fixer
      - id: pretty-format-json
        args: [--autofix, --indent, '4', --no-sort]
  - repo: https://github.com/macisamuele/language-formatters-pre-commit-hooks
    rev: v2.14.0
    hooks:
      - id: pretty-format-yaml
        args: [--autofix, --indent, '2', --offset, '2']
  - repo: https://github.com/python-jsonschema/check-jsonschema
    rev: 0.28.3
    hooks:
      - id: check-github-workflows
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.4.4
    hooks:
      - id: ruff
        args: [--fix, --exit-non-zero-on-fix]
      - id: ruff-format
  - repo: https://github.com/igorshubovych/markdownlint-cli
    rev: v0.41.0
    hooks:
      - id: markdownlint-fix
        args: [--disable, MD013, MD033, MD036, MD041, MD040, --]
  - repo: https://github.com/codespell-project/codespell
    rev: v2.3.0
    hooks:
      - id: codespell
        args: [-I, .codespell_ignore.txt]
        exclude: ^tests/data/
10 changes: 5 additions & 5 deletions CODE_OF_CONDUCT.md
@@ -5,7 +5,7 @@
We as members, contributors, and leaders pledge to make participation in our
community a harassment-free experience for everyone, regardless of age, body
size, visible or invisible disability, ethnicity, sex characteristics, gender
identity and expression, level of experience, education, socio-economic status,
identity and expression, level of experience, education, socioeconomic status,
nationality, personal appearance, race, religion, or sexual identity
and orientation.

@@ -60,7 +60,7 @@ representative at an online or offline event.

Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported to the community leaders responsible for enforcement at
[email protected].
<[email protected]>.
All complaints will be reviewed and investigated promptly and fairly.

All community leaders are obligated to respect the privacy and security of the
@@ -116,13 +116,13 @@ the community.

This Code of Conduct is adapted from the [Contributor Covenant][homepage],
version 2.0, available at
https://www.contributor-covenant.org/version/2/0/code_of_conduct.html.
<https://www.contributor-covenant.org/version/2/0/code_of_conduct.html>.

Community Impact Guidelines were inspired by [Mozilla's code of conduct
enforcement ladder](https://github.com/mozilla/diversity).

[homepage]: https://www.contributor-covenant.org

For answers to common questions about this code of conduct, see the FAQ at
https://www.contributor-covenant.org/faq. Translations are available at
https://www.contributor-covenant.org/translations.
<https://www.contributor-covenant.org/faq>. Translations are available at
<https://www.contributor-covenant.org/translations>.
29 changes: 14 additions & 15 deletions README.md
@@ -28,12 +28,14 @@ Auto-CORPus does not provide functionality to retrieve input files directly from
Auto-CORPus relies on a standard naming convention to recognise the files and identify the correct order of tables. The naming convention can be seen below:

Full article HTML: {any_name_you_want}.html

- {any_name_you_want} is how Auto-CORPus will group articles and linked tables/image files

Linked table HTML: {any_name_you_want}_table_X.html
- {any_name_you_want} must be identical to the name given to the full text file followed by _table_X where X is the table number

If passing a single file via the file path then that file will be processed in the most suitable manner, if a directory is passed then
- {any_name_you_want} must be identical to the name given to the full text file followed by_table_X where X is the table number

If passing a single file via the file path then that file will be processed in the most suitable manner, if a directory is passed then
Auto-CORPus will first group files based on common elements in their file name {any_name_you_want} and process all related files at once. Related files in separate directories will not be processed at the same time. Files processed at the same time will be output into the same files, an example input and output directory can be seen below:

**Input:**
@@ -52,16 +54,15 @@ Auto-CORPus will first group files based on common elements in their file name {
PMC1_tables.json (contains table 1 & 2 and any tables described within the main text)
/subdir
PMC1_tables.json (contains tables 3 & 4 only)

A log file is produced in the output directory providing details of the day/time Auto-CORPus was run,
the arguments used and information about which files were successfully/unsuccessfully processed with a relevant error message.


**Getting started:**

Clone the repo, e.g.:

$ git clone [email protected]:omicsNLP/Auto-CORPus.git or (using HTTPS) git clone https://github.com/omicsNLP/Auto-CORPus.git
$ git clone <[email protected]>:omicsNLP/Auto-CORPus.git or (using HTTPS) git clone <https://github.com/omicsNLP/Auto-CORPus.git>

$ cd Auto-CORPus

@@ -71,12 +72,12 @@ $ source env/bin/activate or (for Windows users) path/to/env/Scripts/activate.ba

$ pip install .

You might get an error here `ModuleNotFoundError: No module named 'skbuild'` if you do then run
You might get an error here `ModuleNotFoundError: No module named 'skbuild'` if you do then run

$ pip install --upgrade pip
$ pip install --upgrade pip

Or you might need to install the Microsoft Build Tools for Visual Studio
(see https://www.scivision.dev/python-windows-visual-c-14-required for minimal installation requirements so that python-Levenshtein package can be installed)
Or you might need to install the Microsoft Build Tools for Visual Studio
(see <https://www.scivision.dev/python-windows-visual-c-14-required> for minimal installation requirements so that python-Levenshtein package can be installed)
first and then re-run

$ pip install .
@@ -99,26 +100,24 @@ $ python run_app.py -c "configs/config_pmc.json" -t "output" -f "path/to/direct

`-o`(output format) - either JSON or XML (defaults to JSON)



<h3><a name="alpha">Alpha testing</a></h3>

We are developing an Auto-CORPus plugin to process images of tables and we include an alpha version of this
We are developing an Auto-CORPus plugin to process images of tables and we include an alpha version of this
functionality. Table image files can be processed in either .png or .jpeg/jpg formats. We are working on improving the accuracy of both the table layout and character recognition aspects, and we will update this repo as the plugin advances.

We utilise [opencv](https://pypi.org/project/opencv-python/) for cell detection and [tesseract](https://github.com/tesseract-ocr/tesseract) for optical character recognition. Tesseract will need to be installed separately onto your system for the table image recognition aspect of Auto-CORPus to work. Please follow the guidance given by tesseract on how to do this.

We have made trained datasets available for use with this feature, but we will continue to train these datasets to
We have made trained datasets available for use with this feature, but we will continue to train these datasets to
increase their accuracy, and it is very likely that the trained datasets we offer will be updated frequently during
active development periods.

As with HTML input files, the image input files should be retrieved by the user in a way which the publisher permits. The naming convention is:

Table image file: {any_name_you_want}_table_X.png/jpg/jpeg
- {any_name_you_want} must be identical to the name given to the full text file followed by _table_X where X is the table number

- {any_name_you_want} must be identical to the name given to the full text file followed by_table_X where X is the table number

**Additional argument:**

`-s` (trained dataset) - trained dataset to use for pytesseract OCR. Value should be given in a format
recognised by pytesseract with a "+" between each datafile, such as "eng+all".
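
The naming convention described in the README hunks above (`{any_name_you_want}.html` for the full article, `{any_name_you_want}_table_X.html` or `.png/.jpg/.jpeg` for linked tables, with grouping done per directory) could be sketched roughly as follows. This is only an illustration of the convention, not Auto-CORPus's actual implementation: the function name `group_files`, the regular expression, and the shape of the returned dictionary are all assumptions made for this example.

```python
import re
from collections import defaultdict
from pathlib import PurePosixPath

# Hypothetical pattern for the stem of a linked table file: {name}_table_X
TABLE_RE = re.compile(r"^(?P<base>.+)_table_(?P<num>\d+)$")
# Extensions the README lists for table files (HTML or image)
TABLE_SUFFIXES = {".html", ".png", ".jpg", ".jpeg"}


def group_files(paths):
    """Group input paths by (directory, base name); illustrative only."""
    groups = defaultdict(lambda: {"main": None, "tables": []})
    for raw in paths:
        path = PurePosixPath(raw)
        suffix = path.suffix.lower()
        match = TABLE_RE.match(path.stem)
        if match and suffix in TABLE_SUFFIXES:
            # Linked table: keyed by its directory and the shared base name
            key = (str(path.parent), match["base"])
            groups[key]["tables"].append((int(match["num"]), path.name))
        elif suffix == ".html":
            # Full article HTML
            key = (str(path.parent), path.stem)
            groups[key]["main"] = path.name
        # Anything else is ignored in this sketch
    # Order tables by their number so that table 10 follows table 9
    for group in groups.values():
        group["tables"] = [name for _, name in sorted(group["tables"])]
    return dict(groups)
```

Keying on the directory as well as the base name mirrors the README's statement that related files in separate directories are not processed together.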
