============================== Release Notes: v0.104 ==============================
C++ API:

Support for new training algorithms:

Support for new network structures:

  • Added GPT-3 transformers and training recipes

Support for new layers:

  • Select operator (set tensor value based on predicate; illustrated below)
  • Model parallelism for channel-wise fully-connected layers
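
  One plausible reading of the Select operator's semantics (setting tensor
  values based on a predicate) can be illustrated outside of LBANN with a
  small NumPy sketch; the function below is only an illustration, not the
  LBANN API:

    import numpy as np

    def select(predicate, value, if_true, if_false):
        """Illustrative stand-in for a predicate-based select: where the
        predicate tensor equals `value`, take entries from `if_true`,
        otherwise from `if_false`."""
        return np.where(predicate == value, if_true, if_false)

    mask = np.array([1.0, 0.0, 1.0, 0.0])
    print(select(mask, 1.0, np.full(4, 10.0), np.zeros(4)))  # [10. 0. 10. 0.]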

Python front-end:

  • Support for PyTorch Module conversion to LBANN graphs (requires PyTorch 2.0
    or newer, compiled with PyTorch Dynamo); see the sketch below
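
  As a rough illustration of the mechanism this conversion builds on (the
  LBANN-side entry point is not shown here), PyTorch Dynamo hands the
  captured graph of a Module to a backend callable, which is the form in
  which a network can be translated into another graph representation:

    import torch
    import torch.nn as nn

    class TinyNet(nn.Module):
        def __init__(self):
            super().__init__()
            self.fc = nn.Linear(16, 4)

        def forward(self, x):
            return torch.relu(self.fc(x))

    def inspect_backend(gm, example_inputs):
        # A Dynamo backend receives the traced FX GraphModule; a converter
        # (such as LBANN's) would walk this graph and emit its own IR.
        print(gm.graph)
        return gm.forward  # fall back to eager execution

    model = torch.compile(TinyNet(), backend=inspect_backend)  # PyTorch >= 2.0
    model(torch.randn(2, 16))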

Performance optimizations:

  • Support in-place computations for capable layers as a memory optimization
  • Allow distconv-enabled convolution and batchnorm layers to reuse their
    input activations as error signals as a memory optimization if the parent
    layer does not need its activations in the backward pass. This optimization
    can be disabled by setting the environment variable
    DISTCONV_DISABLE_MEM_OPT=1.
  • Added support for selective weight sharding (also known as
    Fully-Sharded Data Parallelism, or FSDP). To enable, set sharded=true
    on weight objects (see the sketch after this list).
  • Allow distconv to be disabled at runtime with LBANN_DISABLE_DISTCONV=1.
  • Activations are now deallocated via a reference counter as soon as they
    are no longer needed; disable with LBANN_DISABLE_ACT_GC=1.
  • Added an option for LBANN to set the number of OMP threads to a modest
    default (4) if the environment doesn't specify anything.
  • Save memory during backpropagation by not replicating gradients between
    GradientManager and data_type_optimizer
  • Save more memory in FSDP by synchronizing previous outstanding
    async communication calls and freeing up local gradient contributions
  • FSDP: release full weight views after backprop
  • Batching heads in multi-head attention into single operations
    instead of on a per-head basis
  • Stacking the weights and biases for queries/keys/values in
    self-attention
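
  A minimal Python front-end sketch of the sharding and runtime toggles
  described above; the sharded field on Weights follows the release note,
  while the initializer and layer names are assumed from the front-end's
  protobuf-derived API:

    import lbann

    # Runtime toggles for the memory optimizations above (set these in the
    # job environment for the LBANN run; launcher plumbing omitted here):
    #   DISTCONV_DISABLE_MEM_OPT=1  keep separate distconv error-signal buffers
    #   LBANN_DISABLE_DISTCONV=1    disable distconv entirely at runtime
    #   LBANN_DISABLE_ACT_GC=1      keep activations until the end of the step

    # Shard a layer's weights across ranks (FSDP-style).
    w = lbann.Weights(initializer=lbann.GlorotUniformInitializer(),
                      sharded=True)
    x = lbann.Input(data_field="samples")
    y = lbann.FullyConnected(x, num_neurons=1024, weights=[w], has_bias=False)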

Model portability & usability:

  • Added support for profiling with Caliper

Experiments & Applications:

  • Updated the CosmoFlow model to automatically scale the model
    architecture and parallelism with the input size.
  • Added a PyTorch reference implementation of CosmoFlow.

Internal features:

  • Removed the mini_batch_size parameter from the following functions in
    the layer class hierarchy: fp_setup_inputs, fp_setup_outputs, and
    bp_setup_gradient_wrt_inputs; and from the distconv_adapter class:
    fp_setup and bp_setup
  • Support global and local gradient norm clipping with the
    clip_gradient_norm callback (see the sketch after this list)
  • Interactive progress bar with the progress_bar callback
  • The evaluate progress callback allows for periodic monitoring during
    training with an independent data set (intra-epoch evaluation)
  • Detailed memory usage profiling with the memory_profiler callback
  • Refactored subgraph parallelism
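
  A hedged sketch of enabling the new callbacks from the Python front-end;
  the callback class names and arguments below are assumed to follow the
  front-end's usual protobuf-derived naming and may differ:

    import lbann

    # Minimal graph just to have a model to attach the callbacks to.
    x = lbann.Input(data_field="samples")
    y = lbann.FullyConnected(x, num_neurons=16)
    obj = lbann.ObjectiveFunction([lbann.LayerTerm(lbann.L2Norm2(y))])

    callbacks = [
        lbann.CallbackProgressBar(),                       # interactive progress bar
        lbann.CallbackClipGradientNorm(global_norm=True),  # gradient norm clipping
        lbann.CallbackMemoryProfiler(),                    # detailed memory usage
    ]

    model = lbann.Model(epochs=5,
                        layers=lbann.traverse_layer_graph(x),
                        objective_function=obj,
                        callbacks=callbacks)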

I/O & data readers:

  • Renamed percent_of_data_to_use to the more accurate
    fraction_of_data_to_use.
  • DataReaderMetaData, training_dr_linearized_data_size, and num_parallel_readers
    were removed from the model and layer API, and instead reside in the data
    ingestion pipeline.
  • Fixed the implementation of background I/O to achieve better decoupling
    of background data fetching. It can be enabled or disabled with a
    runtime flag.
  • Set the default number of I/O threads to 4
  • Changed the I/O and transform pipeline to use a bank of RNGs that
    is now indexed by the sample ID in the load sequence, rather than the
    I/O thread ID. This eliminates variability when using different
    numbers of I/O threads (illustrated after this list).
  • Moved the state that tracks the current position in a data set from
    the data reader to the dataset class.
  • Split the I/O RNGs into two banks: one for training and one for all
    other execution modes.
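
  The motivation for indexing the transform RNGs by sample ID rather than
  by I/O thread ID can be seen in a small sketch: with per-sample
  generators, the random numbers applied to each sample do not depend on
  how samples are distributed across I/O threads (the seeding scheme below
  is illustrative only):

    import numpy as np

    def per_sample_value(sample_id, base_seed=2023):
        # One RNG per sample in the load sequence.
        return np.random.default_rng(base_seed + sample_id).normal()

    def run_epoch(num_io_threads, num_samples=8):
        # Partition samples across I/O threads; each thread draws values for
        # its own samples, but each draw depends only on the sample ID.
        out = {}
        for t in range(num_io_threads):
            for s in range(t, num_samples, num_io_threads):
                out[s] = per_sample_value(s)
        return out

    # Same per-sample randomness regardless of the number of I/O threads.
    assert run_epoch(num_io_threads=2) == run_epoch(num_io_threads=4)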

Build system:

  • Updated the build script to use CachedCMakeProject mode, which should
    simplify the overall workflow
  • Set a default time limit for CI tests to avoid unnecessary stalls

Bug fixes:

  • Fixed a bug where in-place layers sometimes attached a locked view
    of a matrix to a mutable view.
  • Fixed a bug when trying to use the legacy HDF5 data reader without the
    data store.
  • Fixed concurrency bugs in the data store
  • Fixed a DistConv memory optimization bug

Retired features:

  • Removed support for the autoencoder strategy in the summarize images
    callback
  • Removed deprecated Layer protobuf fields: weight_data,
    num_neurons_from_data_reader
  • Removed support for calculating a global mini-batch across multiple
    models using the imcomm callback or multiple trainers. The
    mini-batch is now strictly contained to a single model in a single
    trainer. This deprecates an old, unused multi-model execution mode
    using the imcomm callback that predated LTFB.
  • Removed the notion of effective mini-batch size versus current mini-batch size.
  • Removed the world-master mini-batch adjustment.
  • Removed the model offset field. It is no longer necessary since data
    sets do not span models.
  • Removed the cached value of the current mini-batch size from the SGD
    execution context. It is now only cached in the model.
  • Removed the imcomm "inter-model" callback
  • Removed the num-parallel-readers parameter from the I/O subsystem.
    This eliminates an older version of I/O parallelism that relied on
    a non-data-parallel I/O buffer and had different ranks fetching
    entire mini-batches. It is superseded by standard data-parallel I/O.