
v1.0 merge #910

Merged
merged 120 commits into release
Sep 2, 2024

Conversation

@bghira bghira commented Sep 1, 2024

Here's a summary of the major changes in version 1.0. These are sweeping changes, so it would be helpful if others could pull the main branch, run some old configurations, and confirm things still look as expected.

Refactoring and Enhancements:

  1. Refactor train.py into a Trainer Class:

    • The core logic of train.py has been restructured into a Trainer class, improving modularity and maintainability.
    • Exposes an SDK for reuse elsewhere.
  2. Model Family Unification:

    • References to specific model types (--sd3, --flux, etc.) have been replaced with a unified --model_family argument, streamlining model specification and reducing clutter in configurations.
  3. Configuration System Overhaul:

    • Switched from .env configuration files to JSON (config.json), with multiple backends supporting JSON configuration loading. This allows more flexible and readable configuration management.
    • Updated the configuration loader to auto-detect the best backend when launching.
  4. Enhanced Argument Handling:

    • Deprecated old argument references and moved argument parsing to helpers/configuration/cmd_args.py for better organization.
    • Introduced support for new arguments such as --model_card_safe_for_work, --flux_schedule_shift, and --disable_bucket_pruning.
  5. Improved Hugging Face Integration:

    • Modified configure.py to avoid asking for Hugging Face model name details unless required.
    • Added the ability to pass the SFW (safe-for-work) argument into the training script.
  6. Optimizations and Bug Fixes:

    • Fixed several references to learning rate (lr) initialization and corrected --optimizer usage.
    • Addressed issues with attention mask swapping and fixed text encoders persisting in RAM after the refactor.
  7. Training and Validation Enhancements:

    • Added better dataset examples with support for multiple resolutions and mixed configurations.
    • Configured training scripts to disable gradient accumulation steps by default and provided better control over training options via the updated config files.
  8. Enhanced Logging and Monitoring:

    • Improved the handling of Weights & Biases (wandb) logs and updated tracker argument references.
  9. Documentation Updates:

    • Revised documentation to reflect changes in model family handling, argument updates, and configuration management.
    • Added guidance on setting up the new configuration files and examples for multi-resolution datasets.
  10. Miscellaneous Improvements:

    • Support for NSFW tags in model cards, now enabled by default.
    • Updated train.sh to minimal requirements, reducing complexity and streamlining the training process.
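To make the Trainer refactor (item 1) and the unified `--model_family` / JSON configuration (items 2 and 3) concrete, here is a minimal sketch of what driving the trainer from the SDK might look like. The class shape, method names, and config keys other than those mentioned in this PR are illustrative assumptions, not the exact SimpleTuner API.

```python
# Illustrative sketch only: class and method names are hypothetical,
# not the exact SimpleTuner v1.0 SDK surface.
class Trainer:
    def __init__(self, config: dict):
        # In v1.0, this configuration would typically be loaded from
        # config.json rather than the old .env files.
        self.config = config

    def run(self) -> str:
        # The unified "model_family" key replaces the old per-model
        # flags such as --sd3 and --flux.
        return f"training model family: {self.config['model_family']}"


config = {
    "model_family": "flux",            # replaces --flux / --sd3 etc.
    "model_card_safe_for_work": False, # new argument from this PR
    "disable_bucket_pruning": True,    # new argument from this PR
}
print(Trainer(config).run())
```

The point of the refactor, per the summary above, is exactly this kind of reuse: the training loop becomes an importable object rather than logic locked inside `train.py`.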

bghira and others added 30 commits August 28, 2024 17:53
…ushing it there; allow user to configure multidatabackend using multi-res and crop by default
The docs show `instance_prompt` when the value should be `instanceprompt`.

```
2024-08-29 15:28:24,677 [ERROR] (__main__) Unsupported caption strategy: instance_prompt. Supported: 'filename', 'textfile', 'parquet', 'instanceprompt', traceback: Traceback (most recent call last):
```
Dataloader Docs - Correct caption strategy for instance prompt
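For reference, a dataset entry using the corrected strategy might look like the fragment below. Only `caption_strategy` is confirmed by the error message above; the other field names are assumptions based on a typical multidatabackend entry.

```json
[
  {
    "id": "my-dataset",
    "type": "local",
    "instance_data_dir": "/path/to/images",
    "caption_strategy": "instanceprompt"
  }
]
```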
bghira and others added 28 commits September 1, 2024 14:07
fix constant_with_warmup not being so constant or warming up
fix multigpu schedule issue with LR on resume
multiply the resume state step by the number of GPUs in an attempt to overcome accelerate v0.33 issue
default to json/toml before the env file in case multigpu is configured
bypass some "helpful" diffusers logic that makes random decisions to run on CPU
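The scheduler-related commits above can be illustrated with a small sketch. The `constant_with_warmup` schedule ramps the learning-rate multiplier linearly during warmup and then holds it constant; the resume-step scaling mirrors the described workaround for accelerate v0.33, with the function names here being illustrative rather than the actual code.

```python
def constant_with_warmup(step: int, warmup_steps: int) -> float:
    """LR multiplier: linear ramp 0 -> 1 over warmup_steps, then flat 1.0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return 1.0


def scaled_resume_step(saved_step: int, num_gpus: int) -> int:
    # Per the commit above, the saved step is multiplied by the GPU
    # count to work around an accelerate v0.33 resume issue
    # (illustrative, not the exact implementation).
    return saved_step * num_gpus
```

A broken implementation of the first function is what the "not being so constant or warming up" commit refers to fixing: the multiplier should never dip after warmup completes.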
@bghira bghira marked this pull request as ready for review September 2, 2024 20:53
@bghira bghira merged commit bf4951d into release Sep 2, 2024
1 check passed