Restructure package (cont.) (#36)

* fix: fix wrong PIL import
* feat: add cast for better typing
* feat: clean `CustomCollator` (mostly style edits)
* style: clean colpali_processing_utils and add better typing
* feat: factorize the ColPali processing utils in CustomCollator
* feat: factorize the ColIdefics processing utils in CustomCollator
* feat: restructure the `models` module
* feat: big refacto of the collator classes
* style: tweak bi-encoder losses
* feat: add ColPaliConfig
* doc: tweaks
* build: remove all `import *`
* feat: deprecate `TextRetrieverCollator`
* feat: remove redundant `tokenizer` attribute from `BaseVisualRetrieverProcessor`
* fix: address Manu's comments
* fix: fix typos in `ColIdefics2Processor`
* fix: fix HardNegCollator + style tweaks
* doc: tweak
* feat: deprecate HardNegDocmatixCollator
* feat: revert removing abstract attribute `tokenizer` from BaseVisualRetrieverProcessor
* doc: fix typos
* feat: update `__init__.py` files
* feat: fix typing for `ColPaliProcessor.from_pretrained`
* feat: add better typing and remove prints from CustomEvaluator
* feat: rename CustomEvaluator to CustomRetrievalEvaluator
* feat: tweak `get_torch_device`
* feat: turn `main_input_name` into ClassVar in ColPali
* feat: better `from_pretrained` methods
* feat: use PaliGemma tokenizer in `process_queries`
* feat: modify the processor classes
* feat: deprecate ColPaliConfig
* feat: rename ColPaliProcessor init arg
* feat: better `CustomRetrievalEvaluator`
* feat: move `CustomRetrievalEvaluator` in `evaluation` module
* feat: add input length guardrail in `CustomRetrievalEvaluator`
* feat: add tests for ColPali
* feat: add `hf_token` arg to `ColPaliProcessor`
* Revert "feat: use PaliGemma tokenizer in `process_queries`"
  This reverts commit 7ec95cb.
* feat: reduce mock images's size
* build: remove `.vscode/`
* feat: revert `embedding_dim` attribute to `dim` in ColPali
* feat: put all model directories in 1st level of `models` module
* build: update module path for models in config files
* feat: sort models module by vlm backbone
* fix: fix imports in tests
* feat: rename all Idefics* classes to Idefics2*
* feat: add missing processors for Bi* models
* untested: processor is inherited directly
* feat: inherit processor directly in ColIdefics2Processor
* doc: update docstrings in processor classes
* build: loosen dev deps
* fix: add missing casts in processor tests
* feat: restructure test file structure
* fix: fix wrong init in Bi* processors
* rename
* fix: add texts query to list
* fix: ruff
* feat: remove unused __future__ imports
* build: move pytest conifg to pyproject
* feat: add logging in `get_torch_device`
* feat: set default device to cpu in `test_retrieval_evaluator.py`
* build: add "Ruff" and "Test" CI pipelines
* build: add missing `pillow` dep
* build: update ruff config in pyproject
* build: move `mteb` to compulsory deps + format pyproject
* build: tweak project details in pyproject
* build: remove black and use ruff formatter instead
* build: add missing HF_TOKEN secret in test CI
* feat: remove all `|` for python 3.9 compatibility
* feat: tweak ColPaliProcessor test
* feat: add test for ColPali collator
* build: remove `.python-version`
* fix: fix typo in `compute_hardnegs.py`
* build: unfreeze the numpy dep and make it compulsory
* feat: deprecate `mteb` metrics and remove `mteb` dep
* feat: tweak `CustomRetrievalEvaluator.evaluate`
* feat: rename `CustomRetrievalEvaluator` to `RetrievalScorer` + tweaks
* feat: add `CustomRetrievalEvaluator` as a `mteb` wrapper + update `ColModelTraining`
* chore: update CHANGELOG
* Add scorer in processor (#46)
* add: scorer in processor
* fix: lint
* fix: tests
* fix: bugs
* fix: tests pass
* fix: lint
* fix: tony's coms
* style: lint
* fix: fix wrong typing in processor classes
* fix: fix wrong `score` method override in processors

---------

Co-authored-by: ManuelFay <[email protected]>
Co-authored-by: Manuel Faysse <[email protected]>
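One item in the commit message, "feat: remove all `|` for python 3.9 compatibility", refers to the PEP 604 union syntax (`X | Y`), which is only supported in evaluated annotations from Python 3.10 onwards. A minimal sketch of the 3.9-compatible spelling is shown here; `normalize_queries` is a hypothetical helper written for illustration and is not taken from the package.

```python
from typing import List, Optional, Union


def normalize_queries(
    queries: Union[str, List[str]],  # on 3.10+ this could be written `str | list[str]`
    prefix: Optional[str] = None,    # on 3.10+ this could be written `str | None`
) -> List[str]:
    """Hypothetical helper: accept one query or several and return a list of strings."""
    items = [queries] if isinstance(queries, str) else list(queries)
    if prefix is None:
        return items
    # Prepend the optional prefix to every query.
    return [prefix + item for item in items]
```

The `typing.Optional` and `typing.Union` forms behave the same on every Python version in the CI matrix (3.9 to 3.12), which is presumably why the `|` forms were dropped.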
illuin-tech · Sep 10, 2024 · 2c75550
1 parent 0eb0878
Showing 60 changed files with 981 additions and 586 deletions.
diff --git a/.github/workflows/ruff.yml b/.github/workflows/ruff.yml
@@ -0,0 +1,13 @@
+name: Ruff
+on:
+  push:
+    branches:
+      - main
+  pull_request:
+jobs:
+  ruff:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v3
+      - name: "Linting & Flaking"
+        uses: chartboost/ruff-action@v1
diff --git a/.github/workflows/test.yml b/.github/workflows/test.yml
@@ -0,0 +1,33 @@
+name: Test
+
+on:
+  push:
+    branches:
+      - main
+  pull_request:
+
+jobs:
+  test:
+    runs-on: ubuntu-latest
+    strategy:
+      matrix:
+        python-version: ["3.9", "3.10", "3.11", "3.12"]
+
+    steps:
+      - uses: actions/checkout@v4
+
+      - name: Set up Python ${{ matrix.python-version }}
+        uses: actions/setup-python@v5
+        with:
+          python-version: ${{ matrix.python-version }}
+
+      - name: Install dependencies
+        run: |
+          python -m pip install --upgrade pip
+          pip install -e ".[dev]"
+
+      - name: Run tests with pytest (except "slow" tests)
+        run: |
+          pytest -m "not slow"
+        env:
+          HF_TOKEN: ${{ secrets.HF_TOKEN }}
diff --git a/.gitignore b/.gitignore
@@ -1,6 +1,7 @@
 # Custom
 !*/configs/data/
 .DS_Store
+/.vscode/
 /data/
 /logs/
 /models/

diff --git a/.python-version b/.python-version
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -5,36 +5,60 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](http://keepachangelog.com/)
 and this project adheres to [Semantic Versioning](http://semver.org/).
 
-## Unreleased
+## [0.3.0] - 2024-09-10
+
+✨ This release is an exhaustive package refacto, making ColPali more modular and easier to use.
+
+🚨 It is **NOT** backward-compatible with previous versions.
 
 ### Added
 
-- feat: Deprecate `interpretability` and `eval_manager` modules
-- feat: Deprecate unused util modules
-- feat: Revamp module organization
-- feat: Restructure the `utils` module
-- feat: Move `ColModelTraining` module
-- feat: Lint code + tweaks
-- feat: deprecated a lot of unused modules and legacy code
+- Restructure the `utils` module
+- Restructure the model training code
+- Add custom `Processor` classes to easily process images and/or queries
+- Enable module-level imports
+- Add scoring to processor
+- Add `CustomRetrievalEvaluator`
+- Add missing typing
+- Add tests for model, processor, scorer, and collator
+- Lint `Changelog`
+- Add missing docstrings
+- Add "Ruff" and "Test" CI pipelines
 
 ### Changed
 
-- doc: Lint Changelog
-- doc: Tweak README
-- feat: The processing function in `colpali_engine.utils.processing_utils.colpali_processing_utils` `process_queries` has a changed API and does not require a Mock Image anymore.
+- Restructure all modules to closely follow the [`transformers`](https://github.com/huggingface/transformers) architecture
+- Hugely simplify the collator implementation to make it model-agnostic
+- `ColPaliProcessor`'s `process_queries` doesn't need a mock image input anymore
+- Clean `pyproject.toml`
+- Loosen the required dependencies
+- Replace `black` with the `ruff` linter
+
+### Removed
+
+- Deprecate `interpretability` and `eval_manager` modules
+- Deprecate unused utils
+- Deprecate `TextRetrieverCollator`
+- Deprecate `HardNegDocmatixCollator`
+
+### Fixed
+
+- Fix wrong PIL import
+- Fix dependency issues
 
 ## [0.2.2] - 2024-09-06
 
 ### Fixed
+
 - Remove forced "cuda" usage in Retrieval Evaluator
 
 ## [0.2.1] - 2024-09-02
- 
+
 Patch query preprocessing helper function disalignement with training scheme.
 
 ### Fixed
-- Add 10 extra pad token by default to the query to act as reasoning buffers. This was added in the collator but not the external helper function for inference purposes.
 
+- Add 10 extra pad token by default to the query to act as reasoning buffers. This was added in the collator but not the external helper function for inference purposes.
 
 ## [0.2.0] - 2024-08-29
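The 0.2.1 entry above describes appending 10 extra pad tokens to each query so that they can act as buffer tokens. A rough sketch of that idea at the string level follows. The `<pad>` token string, the function name, and its signature are all assumptions made for illustration; the package's own helper may work differently (for example on token ids rather than text).

```python
def add_query_buffer(query: str, pad_token: str = "<pad>", n_pad: int = 10) -> str:
    """Hypothetical helper: append `n_pad` copies of `pad_token` to a query string."""
    if n_pad < 0:
        raise ValueError("n_pad must be non-negative")
    # The appended tokens carry no text; they only extend the query sequence.
    return query + pad_token * n_pad


buffered = add_query_buffer("What is the revenue in 2023?")
```

The point of the fix was consistency: the same augmentation has to be applied at inference time as in the training collator, otherwise the query embeddings are computed on differently shaped inputs.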
 

diff --git a/colpali_engine/__init__.py b/colpali_engine/__init__.py
@@ -0,0 +1,9 @@
+from .models import (
+    BiIdefics2,
+    BiPali,
+    BiPaliProj,
+    ColIdefics2,
+    ColIdefics2Processor,
+    ColPali,
+    ColPaliProcessor,
+)
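The new top-level `__init__.py` above re-exports seven names, so they can be imported directly from `colpali_engine`. A small smoke check for those exports could look like the sketch below. `missing_exports` is a hypothetical helper, not part of the package, and it returns `None` rather than failing when the package is not installed.

```python
import importlib
import importlib.util
from typing import List, Optional

# Names re-exported by the `__init__.py` shown in the diff above.
EXPECTED_EXPORTS = [
    "BiIdefics2",
    "BiPali",
    "BiPaliProj",
    "ColIdefics2",
    "ColIdefics2Processor",
    "ColPali",
    "ColPaliProcessor",
]


def missing_exports(package_name: str = "colpali_engine") -> Optional[List[str]]:
    """Hypothetical check: list expected names that the package does not expose.

    Returns None when the package cannot be found at all.
    """
    if importlib.util.find_spec(package_name) is None:
        return None
    module = importlib.import_module(package_name)
    return [name for name in EXPECTED_EXPORTS if not hasattr(module, name)]
```

With version 0.3.0 installed, `missing_exports()` would be expected to return an empty list; other versions may expose a different set of names.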
diff --git a/colpali_engine/collators/custom_collator.py b/colpali_engine/collators/custom_collator.py