
[Refactor] Modular package organisation, pre-commit linting suite #58

Closed · wants to merge 10 commits

Conversation


@lmmx lmmx commented Jan 1, 2024

I wanted to review the code here to see how it worked, and I did some refactoring so I could read it more easily.

  • ba55b97 introduces an explicit namespace (using __all__ variables) and splits apart classes which do not need to share a module, for ease of reading individual components.
    • This is purely a reorganisation, with no modification to the interface (in case anyone is relying on the current one).
    • I checked this work with ruff and followed its guidance to avoid `*` imports, so the imports are also explicit.
    • I also switched to relative imports, which are more concise when spanning multiple package levels:
$ tree src/uform
src/uform
├── chat.py
├── gen_model.py
├── __init__.py
├── models
│   ├── encoders
│   │   ├── __init__.py
│   │   ├── network_layers.py
│   │   ├── text
│   │   │   ├── block.py
│   │   │   ├── encoder.py
│   │   │   └── __init__.py
│   │   └── visual
│   │       ├── block.py
│   │       ├── encoder.py
│   │       └── __init__.py
│   ├── image_utils.py
│   ├── __init__.py
│   ├── triton.py
│   └── vlm.py
└── setup_model.py

4 directories, 16 files
  • aca18fe adds a pre-commit config with ruff, toml-sort and pyupgrade (set to 3.7+). The two lines at the top of the config enable it on CI (via https://pre-commit.ci) for the repo, which avoids requiring contributors to set the tooling up locally.
  • 6dd0e3f runs the pre-commit suite, which made some very minor adjustments (newlines at end of file, etc.).
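To illustrate the `__all__` plus relative-import pattern the refactor uses, here is a minimal, self-contained sketch that builds a tiny throwaway package on disk and imports it. The names (`demo_pkg`, `TextEncoder`) are hypothetical, not the actual uform modules:

```python
import sys
import tempfile
from pathlib import Path

# Build a tiny package mirroring the layout in the tree above (hypothetical names).
pkg_root = Path(tempfile.mkdtemp())
pkg = pkg_root / "demo_pkg"
(pkg / "encoders").mkdir(parents=True)

(pkg / "encoders" / "encoder.py").write_text(
    "class TextEncoder:\n    pass\n"
)
(pkg / "encoders" / "__init__.py").write_text(
    "from .encoder import TextEncoder\n"   # relative import, one level down
    "__all__ = ['TextEncoder']\n"          # explicit public namespace
)
(pkg / "__init__.py").write_text(
    "from .encoders import TextEncoder\n"  # re-export at the package top level
    "__all__ = ['TextEncoder']\n"
)

sys.path.insert(0, str(pkg_root))
import demo_pkg

print(demo_pkg.__all__)      # ['TextEncoder']
print(demo_pkg.TextEncoder)  # the class re-exported from demo_pkg.encoders.encoder
```

The explicit `__all__` makes the public surface greppable and keeps `from demo_pkg import *` from leaking internals, while the relative imports stay short as the package nests deeper.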

@ashvardanian

Looks very nice, thank you @lmmx! I am pretty sure there will be collisions with #57, so give us a bit of time to find the optimal way to merge it 🤗

In the meantime, if you have any other recommendations about this or other repos I maintain (like SimSIMD), please let us know!

@lmmx lmmx commented Jan 2, 2024

Ah, I just got the benchmarks to run and it looks like I have made a mistake: I'm getting a warning from the bench script (whereas there is no warning from the PyPI-packaged version).

Click to show warning
(uform) louis 🌟 ~/lab/uform/uform $ python scripts/bench.py 
UForm-Gen
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  3.86it/s]
Some weights of the model checkpoint at unum-cloud/uform-gen were not used when initializing VLMForCausalLM: ['image_encoder.blocks.1.ls2.weight', 'image_encoder.blocks.9.ls1.weight', 'image_encoder.blocks.4.ls2.weight', 'image_encoder.blocks.4.ls1.weight', 'image_encoder.blocks.1.ls1.weight', 'image_encoder.blocks.7.ls1.weight', 'image_encoder.blocks.6.ls2.weight', 'image_encoder.blocks.10.ls2.weight', 'image_encoder.blocks.3.ls1.weight', 'image_encoder.blocks.0.ls1.weight', 'image_encoder.blocks.5.ls1.weight', 'image_encoder.blocks.6.ls1.weight', 'image_encoder.blocks.8.ls1.weight', 'image_encoder.blocks.2.ls2.weight', 'image_encoder.blocks.11.ls2.weight', 'image_encoder.blocks.3.ls2.weight', 'image_encoder.blocks.2.ls1.weight', 'image_encoder.blocks.7.ls2.weight', 'image_encoder.blocks.11.ls1.weight', 'image_encoder.blocks.10.ls1.weight', 'image_encoder.blocks.0.ls2.weight', 'image_encoder.blocks.9.ls2.weight', 'image_encoder.blocks.5.ls2.weight', 'image_encoder.blocks.8.ls2.weight']
- This IS expected if you are initializing VLMForCausalLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing VLMForCausalLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of VLMForCausalLM were not initialized from the model checkpoint at unum-cloud/uform-gen and are newly initialized: ['image_encoder.blocks.0.ls1.gamma', 'image_encoder.blocks.1.ls2.gamma', 'image_encoder.blocks.5.ls2.gamma', 'image_encoder.blocks.2.ls1.gamma', 'image_encoder.blocks.7.ls2.gamma', 'image_encoder.blocks.1.ls1.gamma', 'image_encoder.blocks.3.ls2.gamma', 'image_encoder.blocks.4.ls2.gamma', 'image_encoder.blocks.3.ls1.gamma', 'image_encoder.blocks.10.ls2.gamma', 'image_encoder.blocks.2.ls2.gamma', 'image_encoder.blocks.0.ls2.gamma', 'image_encoder.blocks.6.ls1.gamma', 'image_encoder.blocks.7.ls1.gamma', 'image_encoder.blocks.10.ls1.gamma', 'image_encoder.blocks.6.ls2.gamma', 'image_encoder.blocks.5.ls1.gamma', 'image_encoder.blocks.8.ls2.gamma', 'image_encoder.blocks.9.ls1.gamma', 'image_encoder.blocks.4.ls1.gamma', 'image_encoder.blocks.8.ls1.gamma', 'image_encoder.blocks.9.ls2.gamma', 'image_encoder.blocks.11.ls2.gamma', 'image_encoder.blocks.11.ls1.gamma']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

Will put this back to draft until I find the cause. I'll take a look, thanks Ash!

Update: it was the LayerScale class in gen_model.py. I presumed it could be substituted for the other definition, but seemingly not: the checkpoint at unum-cloud/uform-gen expects a parameter named weight rather than gamma (fixed in ab47953). I put it in the same dataclass form in 79872fd so at least the two classes are similar (maybe worth renaming them to distinguish them explicitly). Returned this PR to ready for review.
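The failure mode can be sketched without PyTorch: checkpoint loading is essentially key matching between the saved state dict and the module's parameter names, so renaming `weight` to `gamma` orphans both sides. A simplified illustration (the real loading goes through torch's `load_state_dict`; `match_checkpoint` is a made-up helper that just models the key comparison):

```python
def match_checkpoint(checkpoint_keys, model_param_names):
    """Return (unused_in_checkpoint, newly_initialized_in_model)."""
    ckpt = set(checkpoint_keys)
    model = set(model_param_names)
    return sorted(ckpt - model), sorted(model - ckpt)

# The uform-gen checkpoint stores LayerScale parameters under ".weight" ...
checkpoint = [
    "image_encoder.blocks.0.ls1.weight",
    "image_encoder.blocks.0.ls2.weight",
]

# ... but a LayerScale class whose parameter is named "gamma" exposes these:
renamed_model = [
    "image_encoder.blocks.0.ls1.gamma",
    "image_encoder.blocks.0.ls2.gamma",
]

unused, newly_init = match_checkpoint(checkpoint, renamed_model)
print(unused)      # checkpoint tensors with no home -> the "not used" warning
print(newly_init)  # model params left randomly initialized -> the second warning
```

Every mismatched key shows up in both lists at once, which is exactly the paired "were not used" / "newly initialized" warnings in the log above.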

@lmmx lmmx marked this pull request as draft January 2, 2024 01:38
@lmmx lmmx marked this pull request as ready for review January 2, 2024 10:33
@ashvardanian

Hi @lmmx! Thank you for the contribution!

We've looked into it with the team and can't merge it in its entirety right now, as it would introduce structural differences between the public UForm repo and our private training repositories.

That said, the CI upgrades look interesting! Would you be open to reverting the structural changes and keeping just the CI?

Thanks again!

@lmmx lmmx commented Jan 11, 2024

Fair enough; sure, I will close this PR and open a fresh one (#62) 📥
