============================== Release Notes: v0.104 ==============================
C++ API:
Support for new training algorithms:
Support for new network structures:
- Added GPT-3 transformers and training recipes
Support for new layers:
- Select operator (set tensor value based on predicate)
- Model parallelism for channel-wise fully-connected layers
Python front-end:
- Support for PyTorch Module conversion to LBANN graphs (requires PyTorch 2.0
or newer, compiled with PyTorch Dynamo)
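
A minimal sketch of driving this conversion is shown below; the lbann.torch
module name and the compile() entry point are assumptions for illustration
and may not match the actual converter interface.

    # Hypothetical sketch: converting a PyTorch Module into an LBANN layer
    # graph. Only the torch code is standard; the lbann.torch.compile()
    # call is an assumed name for the converter entry point.
    import torch
    import lbann.torch  # assumed module name for the Dynamo-based converter

    class TinyNet(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.fc = torch.nn.Linear(16, 4)

        def forward(self, x):
            return torch.relu(self.fc(x))

    # Dynamo traces the module; the converter emits the equivalent LBANN graph.
    lbann_graph = lbann.torch.compile(TinyNet())  # assumed entry point
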
Performance optimizations:
- Support in-place computations for capable layers as a memory optimization
- Allow distconv-enabled convolution and batchnorm layers to reuse their
input activations as error signals as a memory optimization if the parent
layer does not need its activations in the backward pass. This optimization
can be disabled by setting the environment variable
DISTCONV_DISABLE_MEM_OPT=1.
- Added support for selective weight sharding (also known as
Fully-Sharded Data Parallelism, or FSDP). To enable, set sharded=true
on weight objects (see the sketch after this list).
- Allow distconv to be disabled at runtime with LBANN_DISABLE_DISTCONV=1.
- Activations are now deallocated when no longer needed via a reference
counter; disable with LBANN_DISABLE_ACT_GC=1.
- Added an option for LBANN to set the number of OMP threads to a modest
default (4) if the environment doesn't specify anything.
- Save memory on backpropagation by not replicating gradients between
GradientManager and data_type_optimizer.
- Save more memory in FSDP by synchronizing previously outstanding
async communication calls and freeing up local gradient contributions.
- FSDP: release full weight views after backprop.
- Batching heads in multi-head attention into single operations
instead of on a per-head basis.
- Stacking the weights and biases for queries/keys/values in
self-attention.
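
Below is a minimal, hedged sketch of using these memory options from the
Python front-end. Passing sharded=True to lbann.Weights is assumed to mirror
the new protobuf field, and the environment variables are the runtime
switches named above.

    # Sketch only: the sharded keyword is assumed to mirror the new
    # protobuf field on weights; check the front-end docs for the real API.
    import lbann

    # Shard this weight object across ranks (FSDP-style sharding).
    w = lbann.Weights(initializer=lbann.ConstantInitializer(value=0.0),
                      sharded=True)  # sharded=True assumed keyword
    fc = lbann.FullyConnected(lbann.Input(data_field='samples'),
                              num_neurons=1024, weights=[w])

    # Runtime switches from the notes above are read from the job
    # environment, e.g. (in the shell that launches LBANN):
    #   export DISTCONV_DISABLE_MEM_OPT=1  # keep distconv inputs and error signals separate
    #   export LBANN_DISABLE_DISTCONV=1    # turn distconv off entirely
    #   export LBANN_DISABLE_ACT_GC=1      # keep activations allocated
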
Model portability & usability:
- Added support for profiling with Caliper
Experiments & Applications:
- Updated CosmoFlow model to automatically scale the model
architecture and parallelism with input size.
- Added a PyTorch reference implementation of CosmoFlow.
Internal features:
- Removed the mini_batch_size parameter from the following functions
in the layer class hierarchy: fp_setup_inputs, fp_setup_outputs, bp_setup_gradient_wrt_inputs
and the distconv_adapter class: fp_setup, bp_setup
- Support global and local gradient norm clipping with the clip_gradient_norm callback
- Interactive progress bar with the progress_bar callback
- Evaluate progress callback allows for periodic monitoring during
training with an independent data set (intra-epoch evaluation)
- Detailed memory usage profiling with the memory_profiler callback (a usage
sketch for the new callbacks follows this list)
- Refactored subgraph parallelism
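
The sketch below shows attaching the new callbacks when building a model with
the Python front-end; the Callback* class names and their fields are assumed
to follow LBANN's usual protobuf-generated naming and may differ.

    # Sketch only: the callback class and field names are assumptions based
    # on the callback names listed above.
    import lbann

    # A tiny layer graph to attach the callbacks to.
    images = lbann.Input(data_field='samples')
    labels = lbann.Input(data_field='labels')
    preds = lbann.Softmax(lbann.FullyConnected(images, num_neurons=10))
    loss = lbann.CrossEntropy(preds, labels)

    callbacks = [
        lbann.CallbackClipGradientNorm(global_norm=True, value=1.0),  # assumed fields
        lbann.CallbackProgressBar(),
        lbann.CallbackMemoryProfiler(),
    ]

    model = lbann.Model(
        epochs=10,
        layers=lbann.traverse_layer_graph(loss),
        objective_function=lbann.ObjectiveFunction([loss]),
        callbacks=callbacks,
    )
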
I/O & data readers:
- Renamed percent_of_data_to_use to the more accurate fraction_of_data_to_use
(see the sketch after this list).
- DataReaderMetaData, training_dr_linearized_data_size, and num_parallel_readers
were removed from the model and layer API, and instead reside in the data
ingestion pipeline.
- Fixed the implementation of background I/O to achieve better decoupling
of the background data fetch; it can be enabled or disabled with a runtime
flag.
- Set the default number of I/O threads to 4
- Changed the I/O and transform pipeline to use a bank of RNGs that
is now indexed by the sample ID in the load sequence, rather than the
I/O thread ID. This eliminates variability when using different
numbers of I/O threads.
- Moved the state that tracks the current position in a data set from the data
reader to the dataset class.
- Split the I/O RNGs into two banks: one for training and one for all
other execution modes.
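
For the renamed data-use field, a hedged sketch of a data reader definition
using the generated protobuf bindings follows; the import path and the
neighboring field names are assumptions for illustration.

    # Sketch only: the reader_pb2 import path and neighboring fields are
    # assumptions; only fraction_of_data_to_use comes from the note above.
    from lbann import reader_pb2

    message = reader_pb2.DataReader()
    reader = message.reader.add()
    reader.name = 'mnist'
    reader.role = 'train'
    reader.fraction_of_data_to_use = 1.0  # formerly percent_of_data_to_use
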
Build system:
- Updated build script to use CachedCMakeProject mode, which should
simplify the overall workflow
- Set a default time limit for CI tests to avoid unnecessary stalls
Bug fixes:
- Fixed a bug where in-place layers sometimes attached a locked view
of a matrix to a mutable view.
- Fixed a bug when trying to use the legacy HDF5 data reader without the data store.
- Fixed concurrency bugs in the data store
- Fixed DistConv memory optimization bug
Retired features:
- Removed support for the autoencoder strategy in the summarize images callback
- Removed deprecated Layer protobuf fields: weight_data,
num_neurons_from_data_reader
- Removed support for calculating a global mini-batch across multiple
models using the imcomm callback or multiple trainers. The
mini-batch is now strictly contained within a single model in a single
trainer. This retires an unused (and old) multi-model
execution mode using the imcomm callback that predated LTFB.
- Removed the notion of effective mini-batch size versus current mini-batch size.
- Removed the world master mini-batch adjustment.
- Removed the model offset field; it is no longer necessary since data sets do not span models.
- Removed the cached value of the current mini-batch size from the SGD
execution context. It is now cached only in the model.
- Removed the imcomm "inter-model" callback
- Removed the num-parallel-readers parameter to the I/O subsystem.
This eliminates an older version of I/O parallelism that relied on
a non-data-parallel I/O buffer and had different ranks fetching
entire mini-batches. It is superseded by standard data-parallel I/O.