- change `mixins` from `ModuleList` to `ModuleDict`
- return tokens and mems in `fill_sequence`, and mems becomes a tensor. `CachedAutoRegressiveMixin`
Example:

    import torch

    old = torch.load('xxxxx/mp_rank_00_model_states.pt.old', map_location='cpu')

    # replace names, mixins index to keys
    oldm = old['module']
    for k in list(oldm.keys()):
        if k.startswith('mixins.0'):
            new_k = k.replace('mixins.0', 'mixins.extra_position_embedding')
        elif k.startswith('mixins.1'):
            new_k = k.replace('mixins.1', 'mixins.attention_plus')
        else:
            continue
        oldm[new_k] = oldm[k]
        del oldm[k]

    # save to destination
    torch.save(old, 'xxxxx/mp_rank_00_model_states.pt')

For the older framework, you also need the following (apply it before saving):

    old['module']['transformer.word_embeddings.weight'] = old['module']['word_embeddings.weight']
    del old['module']['word_embeddings.weight']

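The renaming steps above can also be sketched as a small helper that works on any plain dictionary of keys, so the mapping can be checked before touching a real checkpoint. This is only an illustration: `rename_prefixes` and its arguments are hypothetical names, not part of the package. Unlike the `startswith('mixins.1')` test above, it matches whole dotted segments, so a key such as `mixins.10.w` is left alone.

```python
def rename_prefixes(state, prefix_map):
    """Return a copy of `state` with key prefixes renamed.

    `state` is any dict keyed by dotted parameter names (hypothetical helper).
    `prefix_map` maps an old prefix to its replacement.
    """
    renamed = {}
    for key, value in state.items():
        new_key = key
        for old, new in prefix_map.items():
            # Match only on a full dotted segment, not on a raw string prefix.
            if key == old or key.startswith(old + '.'):
                new_key = new + key[len(old):]
                break
        if new_key in renamed:
            # Refuse to silently overwrite another entry.
            raise KeyError(f'duplicate key after renaming: {new_key}')
        renamed[new_key] = value
    return renamed


prefix_map = {
    'mixins.0': 'mixins.extra_position_embedding',
    'mixins.1': 'mixins.attention_plus',
    'word_embeddings': 'transformer.word_embeddings',  # older framework only
}
```

With a loaded checkpoint, this would be used as `old['module'] = rename_prefixes(old['module'], prefix_map)` before calling `torch.save`.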
- Add `generation.autoregressive_sampling.evaluate_perplexity`
- Fix RuntimeError when skipping NaN loss
- Add non_conflict attention_fn
- Add Prefix-Tuning
- Now, you can use `kw_args['output_this_layer']` (in any hooks in the transformer layers) to return values to final outputs, and `kw_args['output_cross_layer']` to pass values to `kw_args` in the next layer.

Examples:

    def attention_fn(...some_args, **kw_args):
        ...
        kw_args['output_this_layer']['mem_kv'] = cache_kv
        ...

This will let the key `'mem_kv'` appear in the `outputs_per_layers[i]` of `logits, *outputs_per_layers = model(...)`.

    def attention_fn(...some_args, **kw_args):
        ...
        kw_args['output_cross_layer']['last_attention_map'] = attention_map
        ...

This will let the key `'last_attention_map'` appear in the next layer's `kw_args` (all hooks).

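To make the two channels concrete, here is a toy simulation in plain Python. It is not the library's forward loop: `run_layers` and `layer_hook` are hypothetical names, and the "layers" are simple functions on integers. It only mirrors the behaviour described above, where `output_this_layer` ends up in the per-layer outputs and `output_cross_layer` is merged into the next layer's `kw_args`.

```python
def run_layers(hooks, x):
    """Toy stand-in for a stack of transformer layers (hypothetical)."""
    outputs_per_layer = []
    cross_layer = {}  # values handed from one layer to the next
    for hook in hooks:
        kw_args = dict(cross_layer)        # keys set by the previous layer
        kw_args['output_this_layer'] = {}  # collected into the final outputs
        kw_args['output_cross_layer'] = {} # forwarded to the next layer
        x = hook(x, **kw_args)
        outputs_per_layer.append(kw_args['output_this_layer'])
        cross_layer = kw_args['output_cross_layer']
    return (x, *outputs_per_layer)


def layer_hook(x, **kw_args):
    """Hypothetical hook: report what it received, pass its input onward."""
    seen = kw_args.get('last_value')  # None in the first layer
    kw_args['output_this_layer']['seen_from_previous'] = seen
    kw_args['output_cross_layer']['last_value'] = x
    return x + 1


logits, *outputs_per_layers = run_layers([layer_hook, layer_hook], 0)
```

Here the second layer sees `last_value` in its `kw_args` because the first layer wrote it to `output_cross_layer`, while each layer's `output_this_layer` dictionary is returned alongside the final value.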
- Ensure enough training data, no longer always 200 times
- You can use `kw_args['cross_layer_output']['new_key']=xxx` to pass other results to each layer in `position/word_embedding_forward`.
- Add `--train-data-weights`.
- Add Vit
- Fix evaluation all_reduce bug
- split all the default hooks out
- change the hook order: model hooks no longer override everything. They now behave like mixin hooks added in front of all the mixins.
- `from_pretrained` now auto downloads models. There are two kinds of usages: `SomeModel.from_pretrained(args, name)` will load the weights of the `name` model into a `SomeModel` with the same model arch hyper-params as `name`; `AutoModel.from_pretrained(args, name)` will return an official model (`model_class` Class) with the pretrained weights.
- ENV `SAT_HOME` is where we put the models in. Set it in your shell file.
- don't necessarily need `deepspeed_config`, or pass model arch hyper-params for `from_pretrained`. Use `zero-stage 0/1/2`.
- Fix *flat_output bug.
- fix default mpu init_method bug.
Large update v.0.3.0
- delete `--sandwich-ln`
- `from_pretrained(args, name) => from_pretrained(name, args=None)`
- MODEL_URLS fix typo
- enable model-only mode
v.0.3.1 refactor SwissArmyTransformer as sat (package name SwissArmyTransformer)
v 0.3.2
- fix model-only "create then inference" bug
- support deepspeed 0.8.x & 0.9.x
- model register first try
v 0.3.3
- change the fp16 & to cuda order in `get_model`.
v 0.3.4
- add example for nested transformer models
- move all print to logging, set `SAT_LOGLEVEL` to control
v. 0.3.5
- add repetition penalty
- add quantization
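
The entry above does not say how the repetition penalty is computed. A common formulation, shown here as a minimal sketch, rescales the logits of tokens that were already generated: positive logits are divided by the penalty and negative ones multiplied by it, so repeated tokens always become less likely. Whether the package uses exactly this rule is an assumption, and `apply_repetition_penalty` is a hypothetical name.

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Return a penalized copy of `logits` (a list of floats).

    `generated_ids` are the token ids produced so far; `penalty` > 1
    discourages repeating them. Hypothetical helper, not the sat API.
    """
    out = list(logits)
    for tok in set(generated_ids):
        if out[tok] > 0:
            out[tok] /= penalty  # shrink positive scores
        else:
            out[tok] *= penalty  # push non-positive scores further down
    return out


penalized = apply_repetition_penalty([2.0, -1.0, 0.5], [0, 1, 1], penalty=2.0)
```

Tokens that never appeared (index 2 here) keep their original logit, and a token repeated several times is penalized once.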
v. 0.3.6
- support no deepspeed model-only
- test cpu inference
- test windows
v. 0.3.7
- update vit
- add qlora/lora2
v. 0.4.0
- add xformers memory efficient attention.
- pytorch 2.0 auto fast attention, attention_fn dispatch via version.
- add llama and chatglm2.
- add split model for model-parallel in inference mode.
- add r2 download
v. 0.4.1
- better model parallel support (training mode split)
- better default zero 1/2 config
- test bf16 training
- change qkv order of chatglm1
- only use pytorch 2.0 attention when full / causal.
v. 0.4.6
- add droppath and checkpoint last layer skip
- support multiple webdataset weighting
- fix lora merging
- add different lr in different parts, add a 'lr' attr for parameters in the `disable_untrainable_params`.
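
One way a per-parameter `lr` attribute could be consumed is by grouping parameters into optimizer parameter groups, the list-of-dicts form that `torch.optim` optimizers accept. The sketch below is an assumption about usage, not the package's implementation: `build_param_groups` is a hypothetical name, and `SimpleNamespace` objects stand in for real parameters so the example runs without PyTorch.

```python
from collections import defaultdict
from types import SimpleNamespace


def build_param_groups(params):
    """Group parameters by their optional `lr` attribute (hypothetical helper).

    Parameters without an `lr` attribute go into a group with no 'lr' key,
    so the optimizer's default learning rate applies to them.
    """
    by_lr = defaultdict(list)
    for p in params:
        by_lr[getattr(p, 'lr', None)].append(p)

    groups = []
    for lr, members in by_lr.items():
        group = {'params': members}
        if lr is not None:
            group['lr'] = lr
        groups.append(group)
    return groups


# Stand-ins for parameters; in practice the attribute would be set on
# real tensors, e.g. inside `disable_untrainable_params`.
encoder_w = SimpleNamespace(name='encoder_w', lr=1e-3)
head_w = SimpleNamespace(name='head_w')  # no custom lr
encoder_b = SimpleNamespace(name='encoder_b', lr=1e-3)

groups = build_param_groups([encoder_w, head_w, encoder_b])
```

The resulting list has one group per distinct `lr` value and could be passed as the first argument to an optimizer constructor.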