fix(deps): update dependency lightning to v2.1.0 #975
Merged
This PR contains the following updates:
| Package | Change |
|---|---|
| lightning | `2.0.9` -> `2.1.0` |
Release Notes
Lightning-AI/lightning (lightning)
v2.1.0: Lightning 2.1: Train Bigger, Better, Faster (Compare Source)
Lightning AI is excited to announce the release of Lightning 2.1 ⚡ It is the culmination of work from 79 contributors who have contributed features, bug fixes, and documentation across more than 750 commits since v2.0.
The theme of 2.1 is "bigger, better, faster": bigger, because training large multi-billion-parameter models has become even more efficient thanks to FSDP, efficient initialization, and sharded-checkpointing improvements; better, because it is easier than ever to scale models without substantial code changes or third-party packages; and faster, because it leverages the latest hardware features to speed up training in low-bit precision, thanks to new precision plugins such as bitsandbytes and Transformer Engine.
And of course, as the name implies, this release fully leverages the latest features in PyTorch 2.1 🎉
Highlights
Improvements To Large-Scale Training With FSDP
The FSDP strategy for training large billion-parameter models gets substantial improvements and new features in Lightning 2.1, both in Trainer and Fabric (in case you didn't know, Fabric is the latest addition to the Lightning family of tools to scale models without the boilerplate code).
FSDP is now more user-friendly to configure, has memory management and speed improvements, and we have a brand new end-to-end user guide with best practices (Trainer, Fabric).
Efficient Saving and Loading of Large Checkpoints
When training large billion-parameter models with FSDP, saving and resuming training, or even just loading model parameters for finetuning, can be challenging, as users are often plagued by out-of-memory errors and speed bottlenecks.
In 2.1, we made several improvements. Starting with saving checkpoints, we added support for distributed/sharded checkpoints, enabled through the `state_dict_type` setting in the strategy (#18364, #18358); a sketch for both Trainer and Fabric follows.
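A minimal sketch, assuming the `FSDPStrategy` import paths shown below; the new setting itself is `state_dict_type="sharded"`:

Trainer:

```python
from lightning.pytorch import Trainer
from lightning.pytorch.strategies import FSDPStrategy

# "sharded" writes one checkpoint shard per process; "full" keeps a single consolidated file
trainer = Trainer(strategy=FSDPStrategy(state_dict_type="sharded"), accelerator="cuda", devices=2)
```

Fabric:

```python
from lightning.fabric import Fabric
from lightning.fabric.strategies import FSDPStrategy

fabric = Fabric(strategy=FSDPStrategy(state_dict_type="sharded"), accelerator="cuda", devices=2)
```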
Distributed checkpoints are the fastest and most memory efficient way to save the state of very large models.
The distributed checkpoint format also makes it efficient to load these checkpoints back for resuming training in parallel, and it reduces the impact on CPU memory usage significantly. Furthermore, we've also introduced lazy-loading for non-distributed checkpoints (#18150, #18379), which greatly reduces the impact on CPU memory usage when loading a consolidated (single-file) checkpoint (e.g. for finetuning). Learn more about these features in our FSDP guides (Trainer, Fabric).
Fast and Memory-Optimized Initialization
A major challenge that users face when working with large models such as LLMs is dealing with the extreme memory requirements. Even something as simple as instantiating a model becomes non-trivial if the model is so large that it won't fit on a single GPU or even a single machine. In Lightning 2.1, we are introducing empty-weights initialization through the `Fabric.init_module()` (#17462, #17627) and `Trainer.init_module()` / `LightningModule.configure_model()` (#18004, #18385) methods; a sketch for both follows.
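A minimal sketch of both paths (the layer sizes and the FSDP setup are illustrative):

Trainer:

```python
import torch
import lightning as L

class LitModel(L.LightningModule):
    def configure_model(self):
        # Called by the Trainer once the strategy (e.g. FSDP) is set up, so large
        # layers can be created sharded/on device instead of fully on CPU first
        self.model = torch.nn.TransformerEncoder(
            torch.nn.TransformerEncoderLayer(d_model=1024, nhead=16), num_layers=12
        )

trainer = L.Trainer(strategy="fsdp", accelerator="cuda", devices=2)
```

Fabric:

```python
import torch
from lightning.fabric import Fabric

fabric = Fabric(strategy="fsdp", accelerator="cuda", devices=2)
fabric.launch()
with fabric.init_module(empty_init=True):
    # Weights are materialized directly on device and left uninitialized
    # (useful when a checkpoint will be loaded right afterwards)
    model = torch.nn.Linear(4096, 4096)
```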
Read more about this new feature and its other benefits in our docs (Trainer, Fabric).
User-Friendly Configuration
We made it super easy to configure the sharding- and activation-checkpointing policy when you want to auto-wrap particular layers of your model for advanced control (#18045, #18084).
Furthermore, the sharding strategy can now be conveniently set with a string value (#18087), so you no longer need to remember the long PyTorch imports. Fabric supports all of these improvements as well. A sketch:
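A minimal sketch, assuming the set-of-layer-classes form of the wrapping policies; `torch.nn.TransformerEncoderLayer` stands in for your own block class:

```python
import torch
from lightning.pytorch.strategies import FSDPStrategy

strategy = FSDPStrategy(
    # string instead of torch.distributed.fsdp.ShardingStrategy.SHARD_GRAD_OP
    sharding_strategy="SHARD_GRAD_OP",
    # wrap and activation-checkpoint these layer types (set-based policy form assumed)
    auto_wrap_policy={torch.nn.TransformerEncoderLayer},
    activation_checkpointing_policy={torch.nn.TransformerEncoderLayer},
)
```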
True Half-Precision
Lightning now supports true half-precision for training and inference with all built-in strategies (#18193, #18217, #18213, #18219). With this setting, the memory required to store the model weights is only half of what is normally needed when running with float32, and you get the same speed benefits as mixed-precision training (`precision="16-mixed"`). The same settings are also available in Fabric! We recommend trying bfloat16 training (`precision="bf16-true"`), as it is often more numerically stable than regular 16-bit precision (`precision="16-true"`).
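A minimal sketch:

```python
import lightning as L

# The model weights themselves are stored in bfloat16, halving weight memory vs. float32
trainer = L.Trainer(precision="bf16-true")
fabric = L.Fabric(precision="bf16-true")
```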
Bitsandbytes Quantization
With the new Bitsandbytes precision plugin (#18655), you can now quantize your model for significant memory savings during training, finetuning, or inference, with a selection of several state-of-the-art quantization algorithms (int8, fp4, nf4, and more). For the first time, Trainer and Fabric make bitsandbytes easy to use for general models. A sketch for both follows.
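A minimal sketch, assuming the plugin accepts a quantization mode string such as `"nf4"`:

Trainer:

```python
from lightning.pytorch import Trainer
from lightning.pytorch.plugins import BitsandbytesPrecision

# "nf4" is one quantization mode; int8 and fp4 variants are also available
trainer = Trainer(plugins=BitsandbytesPrecision(mode="nf4"))
```

Fabric:

```python
from lightning.fabric import Fabric
from lightning.fabric.plugins import BitsandbytesPrecision

fabric = Fabric(plugins=BitsandbytesPrecision(mode="nf4"))
```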
Learn more!
Transformer Engine
The Transformer Engine by NVIDIA is a library for accelerating transformer layers on the new Hopper (H100) generation of GPUs. With the integration in Lightning Trainer and Fabric (#17597, #18459), you have easy access to 8-bit mixed precision for significant speed-ups. A sketch:
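A minimal sketch (requires Hopper-class GPUs and the `transformer_engine` package):

```python
import lightning as L

trainer = L.Trainer(precision="transformer-engine")
fabric = L.Fabric(precision="transformer-engine")
```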
More configuration options are available through the respective plugins in Trainer and Fabric.
Lightning on TPU Goes Brrr
Lightning 2.1 runs on the latest generation of TPU hardware on Google Cloud! TPU-v4 and TPU-v5 (#17227) are now fully supported in both Fabric and Trainer and run using the new PjRT runtime by default (#17352). PjRT is the runtime used by JAX and has shown an average improvement of 35% on benchmarks. A sketch for both Trainer and Fabric:
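A minimal sketch (8 devices is an illustrative pod-slice size):

```python
import lightning as L

trainer = L.Trainer(accelerator="tpu", devices=8)
fabric = L.Fabric(accelerator="tpu", devices=8)
```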
And what's even more exciting, you can now scale massive multi-billion parameter models on TPUs using FSDP (#17421).
You can find a full end-to-end finetuning example script in our Lit-GPT repository. The new XLA-FSDP strategy is experimental and currently only available in Fabric. Support in the Trainer will follow in the future.
Granular Control Over Checkpoints in Fabric
Several improvements for checkpoint saving and loading have landed in Fabric, enabling more fine-grained control over what is saved/loaded while reducing boilerplate code:
- There is a new `Fabric.load_raw()` method with which you can load model or optimizer state dicts saved externally by a non-Fabric application, e.g. a model weights file saved with raw PyTorch by a friend who doesn't use Fabric (#18049). It is equivalent to calling `model.load_state_dict(torch.load("path/to/model.pt"))` on the raw file.
- A new parameter `Fabric.load(..., strict=True|False)` to disable strict loading (#17645); strict loading remains the default.
- A new parameter `Fabric.save(..., filter=...)` that enables you to exclude certain parameters of your model, e.g. saving only the weights that match a pattern, without writing boilerplate code for it (#17845). All three options are sketched below.
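A minimal sketch of these options; the paths are illustrative, and the `(name, value) -> bool` form of the filter callable is an assumption:

```python
import torch
from lightning.fabric import Fabric

fabric = Fabric()
model = fabric.setup(torch.nn.Linear(32, 2))

# Load a plain PyTorch state dict file into the (wrapped) model;
# equivalent to model.load_state_dict(torch.load("path/to/model.pt"))
fabric.load_raw("path/to/model.pt", model)

# Non-strict loading of a partial checkpoint (strict loading is the default)
fabric.load("path/to/checkpoint.ckpt", {"model": model}, strict=False)

# Save only the parameters whose names match a pattern
fabric.save("out.ckpt", {"model": model}, filter={"model": lambda name, value: "weight" in name})
```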
You can read more about the new options in our checkpoint guide.
Backward Incompatible Changes
The release of PyTorch Lightning 2.0 was a big step into a new chapter: it brought a more polished API and removed a lot of legacy code as well as outdated and experimental features, at the cost of a long list of breaking changes that made upgrading from 1.9 to 2.0 more work than usual. Moving forward, we promised to maintain full backward compatibility of our public core APIs to guarantee a smooth upgrade experience for everyone, and with 2.1 we are happy to deliver on this promise. A few exceptions were made where a change was justified because it significantly improves the user experience, improves performance, or fixes the correctness of a feature. These changes will likely not impact most users.
PyTorch Lightning
TPU/XLA Changes
When selecting device indices via `devices=[i]`, the Trainer now selects the i-th TPU core (0-based; previously it was 1-based) (#17227).
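A minimal sketch of the new indexing:

```python
import lightning as L

# Selects the first TPU core; in 2.0 the same core was addressed as devices=[1]
trainer = L.Trainer(accelerator="tpu", devices=[0])
```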
Multi-GPU in Jupyter Notebooks
Due to a lack of reliability, the Trainer now only runs on one GPU instead of all GPUs in a Jupyter notebook when `devices="auto"` (the default) (#18291).
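A sketch of the new default; the `devices=-1` form is an assumed way to explicitly request all GPUs in a notebook:

```python
import lightning as L

trainer = L.Trainer(accelerator="cuda", devices="auto")  # one GPU in notebooks (previously: all)
trainer = L.Trainer(accelerator="cuda", devices=-1)      # explicitly request all available GPUs
```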
Device Access in Setup Hook
In `LightningModule.setup()`, `self.device` now returns the device the module will be placed on, instead of `cpu` (#18021).
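A minimal sketch of the new behavior:

```python
import lightning as L

class LitModel(L.LightningModule):
    def setup(self, stage):
        print(self.device)  # e.g. cuda:0 in 2.1; in 2.0 this was always cpu
```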
Miscellaneous Changes
- `self.log`ed tensors are now kept on the original device to reduce unnecessary host-to-device synchronizations (#17334)
- `FSDPStrategy` now loads checkpoints after the `configure_model`/`configure_sharded_model` hook (#18358)
- `FSDPStrategy.load_optimizer_state_dict` and `FSDPStrategy.load_model_state_dict` are now a no-op (#18358)
- Removed experimental support for `torchdistx` due to a lack of project maintenance (#17995)

Lightning Fabric
We thank the community for the amazing feedback we got for Fabric so far - keep it coming. The list of breaking changes is short and won't affect the vast majority of users.
Sharding Context Manager in Fabric.run()
We removed automatic sharding support with `Fabric.run` or using `fabric.launch(fn)`. This only impacts FSDP and DeepSpeed strategy users who use this way of launching. Please note that `Fabric.run` is a legacy construct from the `LightningLite` days and is not recommended today. Please instantiate your large FSDP or DeepSpeed model under the newly added `fabric.init_module` context manager (#17832), as sketched below.
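A sketch of the migration (the `torch.nn.Linear` stands in for a large model):

```python
import torch
from lightning.fabric import Fabric

fabric = Fabric(strategy="fsdp", accelerator="cuda", devices=2)
fabric.launch()

# Before (2.0): a model created inside a function passed to fabric.launch(fn)
# was sharded automatically.
# Now (2.1): create the model under init_module so FSDP/DeepSpeed shard it at creation time.
with fabric.init_module():
    model = torch.nn.Linear(4096, 4096)
model = fabric.setup(model)
```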
Multi-GPU in Jupyter Notebooks
Due to a lack of reliability, Fabric now only runs on one GPU instead of all GPUs in a Jupyter notebook when `devices="auto"` (the default) (#18291).
CHANGELOG
PyTorch Lightning
Added
- `metrics_format` attribute to `RichProgressBarTheme` class (#18373)
- `CHECKPOINT_EQUALS_CHAR` attribute to `ModelCheckpoint` class (#17999)
- `**summarize_kwargs` to `ModelSummary` and `RichModelSummary` callbacks (#16788)
- `max_size_cycle|max_size|min_size` iteration modes during evaluation (#17163)
- `XLAStrategy(sync_module_states=bool)` to control whether to broadcast the parameters to all devices (#17522)
- `FSDPStrategy` (#16558)
- `LightningDataModule.from_datasets` to support arbitrary iterables (#17402)
- `SaveConfigCallback.save_config` to ease use cases such as saving the config to a logger (#17475)
- `FSDPStrategy(timeout=...)` for the FSDP strategy (#17274)
- `FSDPStrategy(activation_checkpointing_policy=...)` to customize the layer policy for automatic activation checkpointing (requires torch>=2.1) (#18045)
- `--map-to-cpu` to the checkpoint upgrade script to enable converting GPU checkpoints on a CPU-only machine (#17527)
- `LearningRateMonitor` to log monitored values to `trainer.callback_metrics` (#17626)
- `log_weight_decay` argument to `LearningRateMonitor` callback (#18439)
- `Trainer.print()` to print on local rank zero only (#17980)
- `Trainer.init_module()` context manager to instantiate large models efficiently directly on device, dtype (#18004)
  - (`torch.float32`, `torch.float64`) depending on the 'true' precision choice in `Trainer(precision='32-true'|'64-true')`
- `LightningModule.configure_model()` hook to instantiate large models efficiently directly on device, dtype, and with sharding support (#18004)
- `Trainer.init_module(empty_init=True)` in FSDP (#18385)
- `lightning.pytorch.plugins.PrecisionPlugin.module_init_context()` and `lightning.pytorch.strategies.Strategy.tensor_init_context()` context managers to control model and tensor instantiation (#18004)
- `xla_model.mark_step()` before saving checkpoints with XLA (#17882)
- `torch.distributed.fsdp.ShardingStrategy` via string in `FSDPStrategy` (#18087)
- `XLAStrategy` (#18194)
- `Trainer(precision="16-true"|"bf16-true")` (#18193, #18217, #18213, #18219)
- `devices` and `num_nodes` when running with `SLURM` or `TorchElastic` (#18292)
- `FSDPStrategy(state_dict_type="full"|"sharded")` (#18364)
- `Trainer(precision="transformer-engine")` using Nvidia's Transformer Engine (#18459)
- `Trainer(plugins=BitsandbytesPrecision())` using bitsandbytes (#18655)
- `FSDPStrategy` (#18583)
- `lightning.pytorch.utilities.suggested_max_num_workers` to assist with setting a good value in distributed settings (#18591)
- `num_workers` warning to give a more accurate upper limit on the `num_workers` suggestion (#18591)
- `lightning.pytorch.utilities.is_shared_filesystem` utility function to automatically check whether the filesystem is shared between machines (#18586)
- `Mapping` from `LightningModule.training_step()` (#18657)
- `LightningModule.on_validation_model_zero_grad()` to allow overriding the behavior of zeroing the gradients before entering the validation loop (#18710)

Changed
- `round(..., 3)` to `".3f"` format string in `MetricsTextColumn` class (#18483)
- `self.trainer.model.parameters()` in `LightningModule.configure_optimizers()` (#17309)
- `Trainer(accelerator="tpu", devices=[i])` now selects the i-th TPU core (0-based, previously it was 1-based) (#17227)
- `self.log`ed tensors are now kept on the original device to reduce unnecessary host-to-device synchronizations (#17334)
- `WandbLogger` lazy to avoid creating artifacts when the CLI is used (#17573)
- `*_step` methods in strategies by removing the `_LightningModuleWrapperBase` wrapper module (#17531)
- `wandb` versions older than 0.12.0 in `WandbLogger` (#17876)
- In `LightningModule.setup()`, `self.device` now returns the device the module will be placed on instead of `cpu` (#18021)
- `wandb` version for `WandbLogger` from 0.12.0 to 0.12.10 (#18171)
- `FSDPStrategy` now loads checkpoints after the `configure_model`/`configure_sharded_model` hook (#18358)
- `FSDPStrategy.load_optimizer_state_dict` and `FSDPStrategy.load_model_state_dict` are now a no-op (#18358)
- `Trainer.num_val_batches`, `Trainer.num_test_batches` and `Trainer.num_sanity_val_batches` now return a list of sizes per dataloader instead of a single integer (#18441)
- The `*_step(dataloader_iter)` flavor now no longer takes the `batch_idx` in the signature (#18390)
- `next(dataloader_iter)` now returns a triplet `(batch, batch_idx, dataloader_idx)` (#18390)
- `next(combined_loader)` now returns a triplet `(batch, batch_idx, dataloader_idx)` (#18390)
- The Trainer now only runs on one GPU instead of all GPUs in a Jupyter notebook if `devices="auto"` (default) (#18291)
- The `batch_idx` argument is now optional in `validation_step`, `test_step` and `predict_step` to maintain consistency with `training_step` (#18512)
- `TQDMProgressBar` now consistently shows it/s for the speed even when the iteration time becomes larger than one second (#18593)
- The `LightningDataModule.load_from_checkpoint` and `LightningModule.load_from_checkpoint` methods now raise an error if they are called on an instance instead of the class (#18432)
- `torchrun` in a SLURM environment; the `TorchElasticEnvironment` now gets chosen over the `SLURMEnvironment` if both are detected (#18618)
- `OMP_NUM_THREADS` to `num_cpus / num_processes` when launching subprocesses (e.g. when DDP is used) to avoid system overload for CPU-intensive tasks (#18677)
- `ModelCheckpoint` no longer deletes files under the save-top-k mechanism when resuming from a folder that is not the same as the current checkpoint folder (#18750)
- `ModelCheckpoint` no longer deletes the file that was passed to `Trainer.fit(ckpt_path=...)` (#18750)
- Calling `trainer.fit()` twice now raises an error with strategies that spawn subprocesses through `multiprocessing` (ddp_spawn, xla) (#18776)
- `ModelCheckpoint` now saves a symbolic link if `save_last=True` and `save_top_k != 0` (#18748)

Deprecated
- Deprecated the `SingleTPUStrategy` (`strategy="single_tpu"`) in favor of `SingleDeviceXLAStrategy` (`strategy="single_xla"`) (#17383)
- Deprecated the `TPUAccelerator` in favor of `XLAAccelerator` (#17383)
- Deprecated the `TPUPrecisionPlugin` in favor of `XLAPrecisionPlugin` (#17383)
- Deprecated the `TPUBf16PrecisionPlugin` in favor of `XLABf16PrecisionPlugin` (#17383)
- Deprecated the `Strategy.post_training_step` method (#17531)
- Deprecated the `LightningModule.configure_sharded_model` hook in favor of `LightningModule.configure_model` (#18004)
- Deprecated the `LightningDoublePrecisionModule` wrapper in favor of calling `Trainer.precision_plugin.convert_input()` (#18209)

Removed
- Removed the `XLAStrategy.is_distributed` property. It is always True (#17381)
- Removed the `SingleTPUStrategy.is_distributed` property. It is always False (#17381)
- Removed experimental support for `torchdistx` due to a lack of project maintenance (#17995)
- Removed support for PyTorch 1.11 (#18691)
Fixed
- `DeepSpeedStrategy` (#17531)
- `batch_idx` argument in the `training_step` would disable gradient accumulation (#18619)
- `LightningModule.configure_callbacks` when the callback was a subclass of an existing Trainer callback (#18508)
- `Trainer.log_dir` not returning the correct directory for the `CSVLogger` (#18548)
- `self.log` (#18686)
- `CSVLogger` (#18567)

Lightning Fabric
Added
- `XLAStrategy(sync_module_states=bool)` to control whether to broadcast the parameters to all devices (#17522)
- `FSDPStrategy` (#17323)
- `_FabricModule` that bypass the strategy-specific wrappers (#17424)
- `Fabric.init_tensor()` context manager to instantiate tensors efficiently directly on device and dtype (#17488)
- `Fabric.init_module()` context manager to instantiate large models efficiently directly on device, dtype, and with sharding support (#17462)
  - (`torch.float32`, `torch.float64`, `torch.float16`, or `torch.bfloat16`) depending on the 'true' precision choice in `Fabric(precision='32-true'|'64-true'|'16-true'|'bf16-true')`
- `Fabric.init_module(empty_init=True)` for checkpoint loading (#17627)
- `Fabric.init_module(empty_init=True)` in FSDP (#18122)
- `lightning.fabric.plugins.Precision.module_init_context()` and `lightning.fabric.strategies.Strategy.module_init_context()` context managers to control model and tensor instantiation (#17462)
- `lightning.fabric.strategies.Strategy.tensor_init_context()` context manager to instantiate tensors efficiently directly on device and dtype (#17607)
- `Fabric(precision="16-true"|"bf16-true")` (#17287)
- `Fabric(precision="transformer-engine")` using Nvidia's Transformer Engine (#17597)
- `Fabric(plugins=BitsandbytesPrecision())` using bitsandbytes (#18655)
- `.launch()` when it is required (#17570)
- `FSDPStrategy(state_dict_type="full"|"sharded")` (#17526)
- `Fabric.call` (#17874)
- `Fabric.load(..., strict=True|False)` to enable non-strict loading of partial checkpoint state (#17645)
- `Fabric.save(..., filter=...)` to enable saving a partial checkpoint state (#17845)
- `xla_model.mark_step()` before saving checkpoints with XLA (#17882)
- `xla_model.mark_step()` after `optimizer.step()` with XLA (#17883)
- `FSDPStrategy(activation_checkpointing_policy=...)` to customize the layer policy for automatic activation checkpointing (requires torch>=2.1) (#18045)
- `torch.distributed.fsdp.ShardingStrategy` via string in `FSDPStrategy` (#18087)
- `Fabric.load_raw()` for loading raw PyTorch state dict checkpoints for model or optimizer objects (#18049)
- `XLAStrategy` (#18194)
- `devices` and `num_nodes` when running with `SLURM` or `TorchElastic` (#18292)
- `lightning.fabric.utilities.suggested_max_num_workers` to assist with setting a good value in distributed settings (#18591)
- `lightning.fabric.utilities.is_shared_filesystem` utility function to automatically check whether the filesystem is shared between machines (#18586)
- `.load_state_dict(..., assign=True|False)` on Fabric-wrapped modules in PyTorch 2.1 or newer (#18690)

Changed
devices="auto"
(default) (#18291)torchrun
in a SLURM environment; theTorchElasticEnvironment
now gets chosen over theSLURMEnvironment
if both are detected (#18618)OMP_NUM_THREADS
tonum_cpus / num_processes
when launching subprocesses (e.g. when DDP is used) to avoid system overload for CPU-intensive tasks (#18677)Deprecated
- Deprecated the `DDPStrategy.is_distributed` property. This strategy is distributed by definition (#17381)
- Deprecated the `SingleTPUStrategy` (`strategy="single_tpu"`) in favor of `SingleDeviceXLAStrategy` (`strategy="single_xla"`) (#17383)
- Deprecated the `TPUAccelerator` in favor of `XLAAccelerator` (#17383)
- Deprecated the `TPUPrecision` in favor of `XLAPrecision` (#17383)
- Deprecated the `TPUBf16Precision` in favor of `XLABf16Precision` (#17383)

Removed
- Removed automatic sharding support with `Fabric.run` or using `fabric.launch(fn)`. This only impacts FSDP and DeepSpeed strategy users. Please instantiate your module under the newly added `fabric.init_module` context manager (#17832)
- Removed the `checkpoint_io` argument from the `FSDPStrategy` (#18192)

Fixed
- `.launch()` when using the DP-strategy (`strategy="dp"`) (#17931)
- `find_usable_cuda_devices(0)` incorrectly returning a list of devices (#18722)
- `CSVLogger` (#18567)

Lightning App
Added
- `gradio` components with lightning colors (#17054)

Changed

- `LocalSourceCodeDir` cache_location to not use home in certain cases (#17491)

Removed
Full commit list: Lightning-AI/pytorch-lightning@2.0.0...2.1.0
Contributors
Veteran
@adamjstewart @akreuzer @ethanwharris @dmitsf @lantiga @nicolai86 @pl-ghost @carmocca @awaelchli @justusschock @edenlightning @belerico @lightningforever @nisheethlahoti @tchaton @yurijmikhalevich @mauvilsa @rlizzo @rusmux @yhl48 @Liyang90 @jerome-habana @JustinGoheen @Borda @speediedan @SkafteNicki @dcfidalgo
New
@saryazdi @parambharat @kshitij12345 @woqidaideshi @colehawkins @md-121 @gkroiz @idc9 @BoringDonut @OmerShubi @ishandutta0098 @ryan597 @leng-yue @alicanb @One-sixth @santurini @SpirinEgor @KogaiIrina @shanmugamr1992 @janeyx99 @asmith26 @dingusagar @AleksanderWWW @strawberrypie @solyaH @kaczmarj @voidful @water-vapor @bkiat1123 @rhiga2 @baskrahmer @felipewhitaker @mukhery @Quasar-Kim @robieta @one-matrix @jere357 @schmidt-ai @schuhschuh @anio @rjarun8 @callumhay @minhlong94 @klieret @giorgioskij @shihaoyin @JonathanRayner @NripeshN @marcimarc1 @bilelomrani1 @NikolasWolke @0x404 @quintenroets @Borodin @amorehead @SebastianGer @ioangatop @Tribhuvan0 @f0k @sameertantry @kwsp @nik777 @matsumotosan
Did you know?
When Chuck Norris trains a neural network, it not only learns, but it also gains the ability to defend itself from adversarial attacks by roundhouse kicking them into submission.
Configuration
📅 Schedule: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).
🚦 Automerge: Enabled.
♻ Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.
🔕 Ignore: Close this PR and you won't be reminded about this update again.
This PR has been generated by Mend Renovate. View repository job log here.