Merge pull request #58 from tsmbland/pre-commit

Add pre-commit-config and apply hooks
omicsNLP · Oct 31, 2024 · e8bf730
2 parents a46a546 + f8ede21
Showing 29 changed files with 4,432 additions and 3,989 deletions.
diff --git a/.codespell_ignore.txt b/.codespell_ignore.txt
@@ -0,0 +1 @@
+tabl
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
@@ -0,0 +1,37 @@
+repos:
+  - repo: https://github.com/pre-commit/pre-commit-hooks
+    rev: v4.6.0
+    hooks:
+      - id: check-merge-conflict
+      - id: debug-statements
+      - id: trailing-whitespace
+        exclude: ^src/IAO_dicts/
+      - id: end-of-file-fixer
+      - id: pretty-format-json
+        args: [--autofix, --indent, '4', --no-sort]
+  - repo: https://github.com/macisamuele/language-formatters-pre-commit-hooks
+    rev: v2.14.0
+    hooks:
+      - id: pretty-format-yaml
+        args: [--autofix, --indent, '2', --offset, '2']
+  - repo: https://github.com/python-jsonschema/check-jsonschema
+    rev: 0.28.3
+    hooks:
+      - id: check-github-workflows
+  - repo: https://github.com/astral-sh/ruff-pre-commit
+    rev: v0.4.4
+    hooks:
+      - id: ruff
+        args: [--fix, --exit-non-zero-on-fix]
+      - id: ruff-format
+  - repo: https://github.com/igorshubovych/markdownlint-cli
+    rev: v0.41.0
+    hooks:
+      - id: markdownlint-fix
+        args: [--disable, MD013, MD033, MD036, MD041, MD040, --]
+  - repo: https://github.com/codespell-project/codespell
+    rev: v2.3.0
+    hooks:
+      - id: codespell
+        args: [-I, .codespell_ignore.txt]
+        exclude: ^tests/data/
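
Review note: the two `exclude` entries in this config are Python regular expressions. As far as I understand pre-commit's behaviour, they are searched against each file path relative to the repository root, so the leading `^` anchors them to top-level directories. The sketch below only illustrates that matching rule; the `EXCLUDES` table, the `is_excluded` helper and the example paths are hypothetical and are not part of pre-commit or of this repository.

```python
import re

# Exclude patterns copied from the config in this diff, keyed by hook id.
# The dictionary itself is a hypothetical structure for illustration only.
EXCLUDES = {
    "trailing-whitespace": r"^src/IAO_dicts/",
    "codespell": r"^tests/data/",
}


def is_excluded(hook_id: str, path: str) -> bool:
    """Return True if `path` would be skipped for `hook_id`.

    Assumes the pattern is applied with re.search to a path that is
    relative to the repository root.
    """
    pattern = EXCLUDES.get(hook_id)
    if pattern is None:
        # Hooks without an exclude pattern skip nothing.
        return False
    return re.search(pattern, path) is not None


if __name__ == "__main__":
    # Example paths are made up for this sketch.
    for hook, path in [
        ("codespell", "tests/data/example.html"),
        ("codespell", "src/tests/data/example.html"),
        ("trailing-whitespace", "src/IAO_dicts/example.json"),
        ("ruff", "tests/data/example.py"),
    ]:
        print(hook, path, is_excluded(hook, path))
```

Because of the `^` anchor, a nested directory such as `src/tests/data/` would still be checked by codespell under this reading.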
diff --git a/CODE_OF_CONDUCT.md b/CODE_OF_CONDUCT.md
@@ -5,7 +5,7 @@
 We as members, contributors, and leaders pledge to make participation in our
 community a harassment-free experience for everyone, regardless of age, body
 size, visible or invisible disability, ethnicity, sex characteristics, gender
-identity and expression, level of experience, education, socio-economic status,
+identity and expression, level of experience, education, socioeconomic status,
 nationality, personal appearance, race, religion, or sexual identity
 and orientation.
 
@@ -60,7 +60,7 @@ representative at an online or offline event.
 
 Instances of abusive, harassing, or otherwise unacceptable behavior may be
 reported to the community leaders responsible for enforcement at
-[email protected].
+<[email protected]>.
 All complaints will be reviewed and investigated promptly and fairly.
 
 All community leaders are obligated to respect the privacy and security of the
@@ -116,13 +116,13 @@ the community.
 
 This Code of Conduct is adapted from the [Contributor Covenant][homepage],
 version 2.0, available at
-https://www.contributor-covenant.org/version/2/0/code_of_conduct.html.
+<https://www.contributor-covenant.org/version/2/0/code_of_conduct.html>.
 
 Community Impact Guidelines were inspired by [Mozilla's code of conduct
 enforcement ladder](https://github.com/mozilla/diversity).
 
 [homepage]: https://www.contributor-covenant.org
 
 For answers to common questions about this code of conduct, see the FAQ at
-https://www.contributor-covenant.org/faq. Translations are available at
-https://www.contributor-covenant.org/translations.
+<https://www.contributor-covenant.org/faq>. Translations are available at
+<https://www.contributor-covenant.org/translations>.
diff --git a/README.md b/README.md
@@ -28,12 +28,14 @@ Auto-CORPus does not provide functionality to retrieve input files directly from
 Auto-CORPus relies on a standard naming convention to recognise the files and identify the correct order of tables. The naming convention can be seen below:
 
 Full article HTML: {any_name_you_want}.html
+
 - {any_name_you_want} is how Auto-CORPus will group articles and linked tables/image files
 
 Linked table HTML: {any_name_you_want}_table_X.html
-- {any_name_you_want} must be identical to the name given to the full text file followed by _table_X where X is the table number
 
-If passing a single file via the file path then that file will be processed in the most suitable manner, if a directory is passed then 
+- {any_name_you_want} must be identical to the name given to the full text file followed by_table_X where X is the table number
+
+If passing a single file via the file path then that file will be processed in the most suitable manner, if a directory is passed then
 Auto-CORPus will first group files based on common elements in their file name {any_name_you_want} and process all related files at once. Related files in separate directories will not be processed at the same time. Files processed at the same time will be output into the same files, an example input and output directory can be seen below:
 
 **Input:**
@@ -52,16 +54,15 @@ Auto-CORPus will first group files based on common elements in their file name {
     PMC1_tables.json (contains table 1 & 2 and any tables described within the main text)
     /subdir
         PMC1_tables.json (contains tables 3 & 4 only)
-   
+
 A log file is produced in the output directory providing details of the day/time Auto-CORPus was run,
 the arguments used and information about which files were successfully/unsuccessfully processed with a relevant error message.
 
-
 **Getting started:**
 
 Clone the repo, e.g.:
 
-$ git clone [email protected]:omicsNLP/Auto-CORPus.git or (using HTTPS) git clone https://github.com/omicsNLP/Auto-CORPus.git
+$ git clone <[email protected]>:omicsNLP/Auto-CORPus.git or (using HTTPS) git clone <https://github.com/omicsNLP/Auto-CORPus.git>
 
 $ cd Auto-CORPus
 
@@ -71,12 +72,12 @@ $ source env/bin/activate or (for Windows users) path/to/env/Scripts/activate.ba
 
 $ pip install .
 
-You might get an error here `ModuleNotFoundError: No module named 'skbuild'` if you do then run 
+You might get an error here `ModuleNotFoundError: No module named 'skbuild'` if you do then run
 
-$ pip install --upgrade pip 
+$ pip install --upgrade pip
 
-Or you might need to install the Microsoft Build Tools for Visual Studio 
-(see https://www.scivision.dev/python-windows-visual-c-14-required for minimal installation requirements so that python-Levenshtein package can be installed)
+Or you might need to install the Microsoft Build Tools for Visual Studio
+(see <https://www.scivision.dev/python-windows-visual-c-14-required> for minimal installation requirements so that python-Levenshtein package can be installed)
 first and then re-run
 
 $ pip install .
@@ -99,26 +100,24 @@ $  python run_app.py -c "configs/config_pmc.json" -t "output" -f "path/to/direct
 
 `-o`(output format) - either JSON or XML (defaults to JSON)
 
-
-
 <h3><a name="alpha">Alpha testing</a></h3>
 
-We are developing an Auto-CORPus plugin to process images of tables and we include an alpha version of this 
+We are developing an Auto-CORPus plugin to process images of tables and we include an alpha version of this
 functionality. Table image files can be processed in either .png or .jpeg/jpg formats. We are working on improving the accuracy of both the table layout and character recognition aspects, and we will update this repo as the plugin advances.
 
 We utilise [opencv](https://pypi.org/project/opencv-python/) for cell detection and [tesseract](https://github.com/tesseract-ocr/tesseract) for optical character recognition. Tesseract will need to be installed separately onto your system for the table image recognition aspect of Auto-CORPus to work. Please follow the guidance given by tesseract on how to do this.
 
-We have made trained datasets available for use with this feature, but we will continue to train these datasets to 
+We have made trained datasets available for use with this feature, but we will continue to train these datasets to
 increase their accuracy, and it is very likely that the trained datasets we offer will be updated frequently during
 active development periods.
 
 As with HTML input files, the image input files should be retrieved by the user in a way which the publisher permits. The naming convention is:
 
 Table image file: {any_name_you_want}_table_X.png/jpg/jpeg
-- {any_name_you_want} must be identical to the name given to the full text file followed by _table_X where X is the table number
+
+- {any_name_you_want} must be identical to the name given to the full text file followed by_table_X where X is the table number
 
 **Additional argument:**
 
 `-s` (trained dataset) - trained dataset to use for pytesseract OCR. Value should be given in a format
     recognised by pytesseract with a "+" between each datafile, such as "eng+all".
-
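
Review note: the naming convention described in the README hunks above (`{any_name_you_want}.html` for the article, `{any_name_you_want}_table_X.html` or `.png/jpg/jpeg` for linked tables) can be sketched as a small grouping function. This is a minimal illustration under stated assumptions, not Auto-CORPus's actual implementation: the `TABLE_RE` pattern and the `group_files` helper are hypothetical, and the sketch assumes X is a run of digits.

```python
import re
from collections import defaultdict

# Hypothetical pattern for this sketch: <base>_table_<digits>.<ext>
TABLE_RE = re.compile(r"^(?P<base>.+)_table_(?P<num>\d+)\.(?:html|png|jpe?g)$")


def group_files(filenames):
    """Group article and table files by their shared base name."""
    groups = defaultdict(lambda: {"article": None, "tables": []})
    for name in filenames:
        match = TABLE_RE.match(name)
        if match:
            # Linked table: remember its number so tables can be ordered.
            groups[match["base"]]["tables"].append((int(match["num"]), name))
        elif name.endswith(".html"):
            # Full article HTML: the base name is everything before ".html".
            groups[name[: -len(".html")]]["article"] = name
    # Return plain dicts with tables sorted by table number.
    return {
        base: {
            "article": group["article"],
            "tables": [table for _, table in sorted(group["tables"])],
        }
        for base, group in groups.items()
    }


if __name__ == "__main__":
    files = ["PMC1.html", "PMC1_table_2.html", "PMC1_table_1.png", "PMC2.html"]
    for base, group in group_files(files).items():
        print(base, group)
```

Files that match neither form are ignored in this sketch; how Auto-CORPus itself treats them is not shown in this diff.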