Deepspeed initialization AttributeError: 'EncoderDecoderConfig' object has no attribute 'hidden_size' #22176

Closed
ksopyla opened this issue Mar 15, 2023 · 8 comments

Comments

@ksopyla

ksopyla commented Mar 15, 2023

System Info

System: Ubuntu 22.04

  • transformers version: 4.26.1
  • Platform: Linux-5.4.0-144-generic-x86_64-with-glibc2.29
  • Python version: 3.8.13
  • Huggingface_hub version: 0.12.1
  • PyTorch version (GPU?): 1.13.1+cu117 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: Yes, 4x RTX 3090
  • Using distributed or parallel set-up in script?: yes, deepspeed
Packages (pip list):
Package                   Version
------------------------- ------------
absl-py                   1.4.0
abydos                    0.5.0
accelerate                0.17.0
aiohttp                   3.8.4
aiosignal                 1.3.1
alembic                   1.9.4
antlr4-python3-runtime    4.9.3
anyio                     3.6.2
appdirs                   1.4.4
argon2-cffi               21.3.0
argon2-cffi-bindings      21.2.0
arrow                     1.2.3
astroid                   2.14.2
asttokens                 2.2.1
async-timeout             4.0.2
attrs                     22.2.0
Babel                     2.12.1
backcall                  0.2.0
beautifulsoup4            4.11.2
black                     23.1.0
bleach                    6.0.0
cachetools                5.3.0
certifi                   2022.12.7
cffi                      1.15.1
charset-normalizer        3.0.1
click                     8.1.3
clldutils                 3.19.0
cloudpickle               2.2.1
codecov                   2.0.22
colorama                  0.4.6
coloredlogs               10.0
colorlog                  6.7.0
comm                      0.1.2
contourpy                 1.0.7
coverage                  5.5
csvw                      3.1.3
cycler                    0.11.0
Cython                    0.29.33
databricks-cli            0.17.4
dataclasses               0.6
datasets                  2.10.1
debugpy                   1.6.6
decorator                 5.1.1
deepspeed                 0.8.2
defusedxml                0.7.1
deprecation               2.1.0
dill                      0.3.6
docker                    6.0.1
docker-pycreds            0.4.0
editdistance              0.6.2
entrypoints               0.4
evaluate                  0.4.0
executing                 1.2.0
fairseq                   0.10.0
fastjsonschema            2.16.3
filelock                  3.9.0
Flask                     2.2.3
fonttools                 4.38.0
fqdn                      1.5.1
frozenlist                1.3.3
fsspec                    2023.1.0
fuzzywuzzy                0.17.0
gitdb                     4.0.10
GitPython                 3.1.31
google-auth               2.16.1
google-auth-oauthlib      0.4.6
greenlet                  2.0.2
grpcio                    1.51.3
gunicorn                  20.1.0
hjson                     3.1.0
huggingface-hub           0.12.1
humanfriendly             10.0
hydra-core                1.3.2
idna                      3.4
importlib-metadata        6.0.0
importlib-resources       5.12.0
ipykernel                 6.21.2
ipython                   8.11.0
ipython-genutils          0.2.0
ipywidgets                8.0.4
isodate                   0.6.1
isoduration               20.11.0
isort                     5.12.0
itsdangerous              2.1.2
jedi                      0.18.2
jellyfish                 0.7.2
Jinja2                    3.1.2
jiwer                     2.5.1
jmespath                  1.0.1
joblib                    1.2.0
jsonlines                 1.2.0
jsonpointer               2.3
jsonschema                4.17.3
jupyter                   1.0.0
jupyter_client            8.0.3
jupyter-console           6.6.2
jupyter_core              5.2.0
jupyter-events            0.6.3
jupyter_server            2.3.0
jupyter_server_terminals  0.4.4
jupyterlab-pygments       0.2.2
jupyterlab-widgets        3.0.5
kiwisolver                1.4.4
language-tags             1.2.0
latexcodec                2.0.1
lazy-object-proxy         1.9.0
Levenshtein               0.20.2
lightning-utilities       0.7.1
lingpy                    2.6.9
lxml                      4.9.2
Mako                      1.2.4
many-stop-words           0.2.2
Markdown                  3.4.1
MarkupSafe                2.1.2
matplotlib                3.7.0
matplotlib-inline         0.1.6
mccabe                    0.7.0
mistune                   2.0.5
mlflow                    1.27.0
more-itertools            9.1.0
multidict                 6.0.4
multiprocess              0.70.14
mypy-extensions           1.0.0
nbclassic                 0.5.2
nbclient                  0.7.2
nbconvert                 7.2.9
nbformat                  5.7.3
nest-asyncio              1.5.6
networkx                  3.0
newick                    1.7.0
ninja                     1.11.1
nltk                      3.8.1
notebook                  6.5.2
notebook_shim             0.2.2
numpy                     1.24.2
oauthlib                  3.2.2
omegaconf                 2.3.0
packaging                 23.0
pandas                    1.5.3
pandocfilters             1.5.0
parso                     0.8.3
pathspec                  0.11.0
pathtools                 0.1.2
pexpect                   4.8.0
pickleshare               0.7.5
Pillow                    9.4.0
pip                       23.0.1
pkgutil_resolve_name      1.3.10
platformdirs              3.0.0
pluggy                    0.13.1
portalocker               2.7.0
progress                  1.6
prometheus-client         0.16.0
prometheus-flask-exporter 0.22.2
prompt-toolkit            3.0.38
protobuf                  3.20.3
psutil                    5.9.4
ptyprocess                0.7.0
pure-eval                 0.2.2
py                        1.11.0
py-cpuinfo                9.0.0
pyarrow                   11.0.0
pyasn1                    0.4.8
pyasn1-modules            0.2.8
pybtex                    0.24.0
pycldf                    1.34.0
pycparser                 2.21
pydantic                  1.10.6
Pygments                  2.14.0
PyJWT                     2.6.0
pylatexenc                2.10
pylint                    2.16.2
pyparsing                 3.0.9
pyrsistent                0.19.3
pytest                    5.4.3
pytest-cov                2.8.1
python-dateutil           2.8.2
python-docx               0.8.11
python-frontmatter        1.0.0
python-json-logger        2.0.7
python-Levenshtein        0.12.2
python-nexus              2.9.0
pytorch-lightning         1.8.6
pytz                      2022.7.1
pyxDamerauLevenshtein     1.7.1
PyYAML                    6.0
pyzmq                     25.0.0
qtconsole                 5.4.0
QtPy                      2.3.0
querystring-parser        1.2.4
rapidfuzz                 2.13.7
rdflib                    6.2.0
regex                     2022.10.31
requests                  2.28.2
requests-oauthlib         1.3.1
responses                 0.18.0
rfc3339-validator         0.1.4
rfc3986                   1.5.0
rfc3986-validator         0.1.1
rope                      0.14.0
rsa                       4.9
sacrebleu                 2.3.1
sacremoses                0.0.53
scikit-learn              0.22.2.post1
scipy                     1.10.1
seaborn                   0.11.2
Send2Trash                1.8.0
sentencepiece             0.1.97
sentry-sdk                1.16.0
setproctitle              1.3.2
setuptools                67.4.0
six                       1.16.0
smmap                     5.0.0
sniffio                   1.3.0
soupsieve                 2.4
SQLAlchemy                2.0.4
sqlparse                  0.4.3
sru                       3.0.0.dev6
stack-data                0.6.2
symspellpy                0.1.0
tabulate                  0.9.0
tensorboard               2.12.0
tensorboard-data-server   0.7.0
tensorboard-plugin-wit    1.8.1
tensorboardX              2.6
termcolor                 2.2.0
terminado                 0.17.1
textdistance              4.5.0
tinycss2                  1.2.1
tokenizers                0.13.2
tomli                     2.0.1
tomlkit                   0.11.6
torch                     1.13.1+cu117
torchmetrics              0.11.3
tornado                   6.2
tqdm                      4.64.1
traitlets                 5.9.0
transformers              4.26.1
typing_extensions         4.5.0
Unidecode                 1.3.6
uri-template              1.2.0
uritemplate               4.1.1
urllib3                   1.26.14
wandb                     0.13.10
wcwidth                   0.2.6
webcolors                 1.12
webencodings              0.5.1
websocket-client          1.5.1
weighted-levenshtein      0.2.2
Werkzeug                  2.2.3
wheel                     0.38.4
widgetsnbextension        4.0.5
wrapt                     1.15.0
xlrd                      1.2.0
xxhash                    3.2.0
yarl                      1.8.2
zipp                      3.15.0

Who can help?

HF Trainer: @stas00, Accelerate: @pacman100

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

deepspeed --num_gpus=4 training_enc_dec_model_from_scratch.py \
--output_dir="./hf_output/" \
--per_device_train_batch_size=128 \
--dataloader_num_workers=8 \
--gradient_accumulation_steps=1 \
--gradient_checkpointing=False \
--fp16 \
--logging_steps=500 \
--eval_steps=5000 \
--save_steps=50000 \
--num_train_epochs=2 \
--learning_rate=0.001 \
--warmup_steps=5000 \
--logging_first_step=True \
--eval_accumulation_steps=100 \
--log_level=warning \
--deepspeed deepspeed_zero2.json

deepspeed_zero2.json:

{
    "wandb": {
        "enabled": true,
        "project": "Project"
    },
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": 3e-7
        }
    },
    "scheduler": {
        "type": "WarmupDecayLR",
        "params": {
            "warmup_min_lr": 2e-6,
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto",
            "total_num_steps": "auto"
        }
    },
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "allgather_partitions": true,
        "allgather_bucket_size": 3e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 3e8,
        "contiguous_gradients": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "steps_per_print": 500
}

The training script

from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    BartForConditionalGeneration,
    HfArgumentParser,
    BertConfig,
    EncoderDecoderConfig,
    EncoderDecoderModel,
    BartConfig,
    ReformerConfig,
    LEDConfig,
    LEDForConditionalGeneration,
)

from transformers.data.data_collator import DataCollatorForSeq2Seq

from encoder_decoder_utils import (
    DataCollatorForEncoderDecoder,
    Seq2SeqTrainerForEncoderDecoder,
)

import torch
import torch.distributed
import transformers
import datasets
import os
import sys
import logging
import socket
from datetime import datetime, date

logger = logging.getLogger(__name__)


if __name__ == "__main__":
    parser = HfArgumentParser(Seq2SeqTrainingArguments)
    if len(sys.argv) == 2 and sys.argv[1].endswith(".json"):
        # If we pass only one argument to the script and it's the path to a json file,
        # let's parse it to get our arguments.
        training_args = parser.parse_json_file(json_file=os.path.abspath(sys.argv[1]))
    else:
        training_args = parser.parse_args_into_dataclasses()

    # parse_args_into_dataclasses / parse_json_file return a tuple of dataclasses; take the first element
    training_args = training_args[0]

    # Setup logging
    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        handlers=[logging.StreamHandler(sys.stdout)],
    )
    log_level = training_args.get_process_log_level()
    logger.setLevel(log_level)
    datasets.utils.logging.set_verbosity(log_level)
    transformers.utils.logging.set_verbosity(log_level)
    transformers.utils.logging.enable_default_handler()
    transformers.utils.logging.enable_explicit_format()

    # Log on each process the small summary:
    logger.warning(
        f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}"
        + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}"
    )
    logger.info(f"Training/evaluation parameters {training_args}")

    experiment_name = f"hf_enc_dec_custom"

    # just for loading tokenizer
    model_name = "allegro/herbert-base-cased"

    # %% define training parameters
    batch_size = training_args.per_device_train_batch_size

    output_dir = f"{training_args.output_dir}/"

    path_datatime = datetime.now().strftime("%Y_%m_%d-%I_%M_%S")

    training_args.run_name = f"{experiment_name}-{model_name}-{path_datatime}"
    training_args.predict_with_generate = True
    training_args.do_train = True
    training_args.do_eval = True
    training_args.evaluation_strategy = (
        transformers.trainer_utils.IntervalStrategy.STEPS
    )
    training_args.logging_strategy = (
        transformers.trainer_utils.IntervalStrategy.STEPS
    )  # "steps"
   
    training_args.save_total_limit = 5
 
    training_args.seed = 123
    training_args.report_to = ["wandb"]
   
    logger.info(f"After set new values Training/evaluation parameters {training_args}")

    #! data local machine
    data_file = "gec_data_file.jsonl" # 1M json line file
    # gec_data_file.jsonl content:  
    # {"correct": "Ciasne, koronkowe, podniecające.", "incorrect": "Ciasne, koronkowe, podniwcające."}
    # {"correct": "Ślinka cieknie, serce rwie żebra, szabla w dłoń.", "incorrect": "Ślinka ciekni4, srve  rwie żebra, sszabla w dloń."}
    num_proc = 1

    data_file = os.path.abspath(data_file)
    dataset_name = os.path.basename(data_file)
    hf_cache_dir = f"{training_args.output_dir}/{experiment_name}/data/{dataset_name}/"
    hf_cache_dir = os.path.abspath(hf_cache_dir)

    output_dir = f"{training_args.output_dir}/{experiment_name}/{dataset_name}/{path_datatime}"
    training_args.output_dir = f"{output_dir}/checkpoints/"

    dataset = datasets.load_dataset(
        "json",
        data_files=data_file,
        cache_dir=hf_cache_dir,
    )
    # load_dataset returns a DatasetDict with a single "train" split; work on the Dataset itself
    dataset = dataset["train"]

    test_size = 2000
    train_size = len(dataset) - test_size

    dataset = dataset.train_test_split(test_size=test_size, seed=123)

    train_data = dataset["train"]

    train_data = train_data.select(range(train_size))

    val_data = dataset["test"]

    logger.info(f"\n\n*********\nTrain={len(train_data)} val={len(val_data)}")
    # %%

    # %%
    # %% load tokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)

    tokenizer.model_max_length = 512
    tokenizer.bos_token = tokenizer.cls_token
    tokenizer.eos_token = tokenizer.sep_token

    # %% initialize the Model

    # all the parameters could be found here
    #  https://huggingface.co/docs/transformers/v4.26.1/en/model_doc/bert#transformers.BertConfig
    config_encoder = BertConfig()
    config_decoder = BertConfig()


    config_encoder.hidden_size = 512
    config_encoder.num_hidden_layers = 2
    config_encoder.num_attention_heads = 4
    config_encoder.intermediate_size = 1024

    config_encoder.decoder_start_token_id = tokenizer.cls_token_id
    config_encoder.bos_token_id = tokenizer.bos_token_id
    config_encoder.eos_token_id = tokenizer.sep_token_id
    config_encoder.pad_token_id = tokenizer.pad_token_id
    config_encoder.vocab_size = tokenizer.vocab_size

    config_decoder.hidden_size = 512
    config_decoder.intermediate_size = 1024
    config_decoder.num_hidden_layers = 2
    config_decoder.num_attention_heads = 4
    config_decoder.is_decoder = True
    config_decoder.add_cross_attention = True

    config_decoder.decoder_start_token_id = tokenizer.cls_token_id
    config_decoder.bos_token_id = tokenizer.bos_token_id
    config_decoder.eos_token_id = tokenizer.sep_token_id
    config_decoder.pad_token_id = tokenizer.pad_token_id
    config_decoder.vocab_size = tokenizer.vocab_size


    config = EncoderDecoderConfig.from_encoder_decoder_configs(
        config_encoder, config_decoder
    )


    # https://huggingface.co/blog/how-to-generate
    config.max_length = 512
    config.min_length = 0
    config.no_repeat_ngram_size = 3
    config.early_stopping = True
    config.length_penalty = 2.0
    config.num_beams = 5

    # config.tie_word_embeddings = True
    config.tie_encoder_decoder = False

    config.decoder_start_token_id = tokenizer.cls_token_id
    config.eos_token_id = tokenizer.sep_token_id
    config.pad_token_id = tokenizer.pad_token_id
    config.vocab_size = config.encoder.vocab_size

    enc_dec = EncoderDecoderModel(config=config)

    model_file_name = f"{model_name}-custom"
    # Saving the model, including its configuration
    enc_dec.save_pretrained(model_file_name)

    # loading model and config from pretrained folder
    encoder_decoder_config = EncoderDecoderConfig.from_pretrained(model_file_name)
    model = EncoderDecoderModel.from_pretrained(
        model_file_name, config=encoder_decoder_config
    )


    # set the wandb project where this run will be logged
    os.environ["WANDB_PROJECT"] = "Project"
    # do not upload model checkpoints to wandb
    os.environ["WANDB_LOG_MODEL"] = "false"
    # turn off watch to log faster
    os.environ["WANDB_WATCH"] = "false"



    logger.info(f"\n\nNum Params: {model_size}")

    # %%### process data, tokenize and prepare for training
    logger.info(f"process train data (tokenization)")

    def process_data_to_model_inputs(batch, max_len=512):
        """Map function transforming text to token ids:
        tokenizes the inputs and labels.
        """
        # Tokenizer will automatically set [BOS] <text> [EOS]
        inputs = batch["incorrect"]
        targets = batch["correct"]

        # tokenize the inputs and labels
        # without padding, the data collator will pad
        model_inputs = tokenizer(inputs, max_length=max_len, truncation=True)
        labels = tokenizer(text_target=targets, max_length=max_len, truncation=True)
        model_inputs["labels"] = labels.input_ids
        
        return model_inputs



    process_batch = 5000

    train_data_tok = train_data.map(
        process_data_to_model_inputs,
        batched=True,
        batch_size=process_batch,
        remove_columns=["incorrect", "correct"],
        num_proc=num_proc
    )

    logger.info(f"process val data (tokenization)")

    val_data_tok = val_data.map(
        process_data_to_model_inputs,
        batched=True,
        batch_size=process_batch,
        remove_columns=["incorrect", "correct"],
        num_proc=num_proc,
        cache_file_name=f"{hf_cache_dir}/val_mapped_{test_size}.arrow",
        # keep_in_memory=True
    )


    del train_data
    del val_data
    del dataset

    logger.info(f"done process data (tokenization)")


    data_collator = DataCollatorForSeq2Seq(
        tokenizer=tokenizer, model=model, max_length=512, pad_to_multiple_of=8
    )

    trainer = Seq2SeqTrainerForEncoderDecoder(
        args=training_args,
        model=model,
        tokenizer=tokenizer,
        data_collator=data_collator,
        compute_metrics=None,
        train_dataset=train_data_tok,
        eval_dataset=val_data_tok,
    )

    logger.info(f"start training")

    trainer.train()
    # %%
    trainer.save_model(f"{output_dir}/final")

Expected behavior

Start training without error.

Traceback

[2023-03-16 05:54:45,026] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}                                                                                 
[2023-03-16 05:54:45,026] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=4, node_rank=0                                                                                     
[2023-03-16 05:54:45,026] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})                                                 
[2023-03-16 05:54:45,026] [INFO] [launch.py:162:main] dist_world_size=4                                                                                                            
[2023-03-16 05:54:45,026] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3                                                                                         
[2023-03-16 05:54:48,465] [INFO] [comm.py:661:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl                                                           

03/16/2023 05:54:49 - WARNING - __main__ - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: True                                             
03/16/2023 05:54:49 - WARNING - __main__ - Process rank: 3, device: cuda:3, n_gpu: 1distributed training: True, 16-bits training: True                                            
03/16/2023 05:54:49 - WARNING - __main__ - Process rank: 2, device: cuda:2, n_gpu: 1distributed training: True, 16-bits training: True                                             
03/16/2023 05:54:49 - WARNING - __main__ - Process rank: 1, device: cuda:1, n_gpu: 1distributed training: True, 16-bits training: True                                             
03/16/2023 05:54:50 - WARNING - datasets.builder - Found cached dataset json (/home/ksopyla/dev/ml/hf_output/hf_enc_dec_custom/data/gec_data_file.jsonl/json/default-6447d29028c8f08e/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
                                                                  0%|                                                                                                                                                        | 0/1 [00:00<?, ?it/s]
03/16/2023 05:54:50 - WARNING - datasets.builder - Found cached dataset json (/home/ksopyla/dev/ml/hf_output/hf_enc_dec_custom/data/gec_data_file.jsonl/json/default-6447d29028c8f08e/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51) 
                                                                 0%|                                                                                                                                                        | 0/1 [00:00<?, ?it/s]
03/16/2023 05:54:50 - WARNING - datasets.builder - Found cached dataset json (/home/ksopyla/dev/ml/hf_output/hf_enc_dec_custom/data/gec_data_file.jsonl/json/default-6447d29028c8f08e/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)  
                                                                0%|                                                                                                                                                        | 0/1 [00:00<?, ?it/s]
03/16/2023 05:54:50 - WARNING - datasets.builder - Found cached dataset json (/home/ksopyla/dev/ml/hf_output/hf_enc_dec_custom/data/gec_data_file.jsonl/json/default-6447d29028c8f08e/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51) 
                                                               100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.90it/s]100%|
████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.90it/s]100%|
████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.89it/s]100%|
████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.87it/s]
03/16/2023 05:54:51 - WARNING - datasets.arrow_dataset - Loading cached processed dataset at /home/ksopyla/dev/ml/hf_output/hf_enc_dec_custom/data/gec_data_file.jsonl/json/default-6447d29028c8f08e/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-1488b483c5004ed7_*_of_00008.arrow         
03/16/2023 05:54:51 - WARNING - datasets.arrow_dataset - Loading cached split indices for dataset at /home/ksopyla/dev/ml/hf_output/hf_enc_dec_custom/data/gec_data_file.jsonl/json/default-6447d29028c8f08e/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-d86219c9d32c5215.arrow and /home/ksopyla/dev/ml/hf_output/hf_enc_dec_custom/data/gec_data_file.jsonl/json/default-6447d29028c8f08e/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-b4b18a39600bbc9f.arrow                                                                                                                         
03/16/2023 05:54:55 - WARNING - datasets.arrow_dataset - Loading cached processed dataset at /home/ksopyla/dev/ml/hf_output/hf_enc_dec_custom/data/gec_data_file.jsonl/train_mapped_29636272_*_of_00008.arrow                                                                                                                
03/16/2023 05:54:56 - WARNING - datasets.arrow_dataset - Loading cached processed dataset at /home/ksopyla/dev/ml/hf_output/hf_enc_dec_custom/datagec_data_file.jsonl/val_mapped_10000_*_of_00008.arrow

Loading results from main process
Traceback (most recent call last):
  File "playground/hf_transformers/training_enc_dec_model_from_scratch.py", line 458, in <module>
    trainer.train()
  File "/home/ksopyla/.cache/pypoetry/virtualenvs/ml-A9X51t2i-py3.8/lib/python3.8/site-packages/transformers/trainer.py", line 1543, in train
    return inner_training_loop(
  File "/home/ksopyla/.cache/pypoetry/virtualenvs/ml-A9X51t2i-py3.8/lib/python3.8/site-packages/transformers/trainer.py", line 1612, in _inner_training_loop
    deepspeed_engine, optimizer, lr_scheduler = deepspeed_init(
  File "/home/ksopyla/.cache/pypoetry/virtualenvs/ml-A9X51t2i-py3.8/lib/python3.8/site-packages/transformers/deepspeed.py", line 312, in deepspeed_init
    hf_deepspeed_config.trainer_config_finalize(args, model, num_training_steps)
  File "/home/ksopyla/.cache/pypoetry/virtualenvs/ml-A9X51t2i-py3.8/lib/python3.8/site-packages/transformers/deepspeed.py", line 174, in trainer_config_finalize
    hidden_size = model.config.hidden_size
  File "/home/ksopyla/.cache/pypoetry/virtualenvs/ml-A9X51t2i-py3.8/lib/python3.8/site-packages/transformers/configuration_utils.py", line 260, in __getattribute__
    return super().__getattribute__(key)
AttributeError: 'EncoderDecoderConfig' object has no attribute 'hidden_size'
03/16/2023 05:54:58 - WARNING - datasets.arrow_dataset - Loading cached processed dataset at /home/ksopyla/dev/ml/hf_output/hf_enc_dec_custom/data/gec_data_file.jsonl/json/default-6447d29028c8f08e/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-1488b483c5004ed7_*_of_00008.arrow
03/16/2023 05:54:58 - WARNING - datasets.arrow_dataset - Loading cached processed dataset at /home/ksopyla/dev/ml/hf_output/hf_enc_dec_custom/data/gec_data_file.jsonl/json/default-6447d29028c8f08e/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-1488b483c5004ed7_*_of_00008.arrow
03/16/2023 05:54:58 - WARNING - datasets.arrow_dataset - Loading cached processed dataset at /home/ksopyla/dev/ml/hf_output/hf_enc_dec_custom/data/gec_data_file.jsonl/json/default-6447d29028c8f08e/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-1488b483c5004ed7_*_of_00008.arrow
03/16/2023 05:54:58 - WARNING - datasets.arrow_dataset - Loading cached split indices for dataset at /home/ksopyla/dev/ml/hf_output/hf_enc_dec_custom/data/gec_data_file.jsonl/json/default-6447d29028c8f08e/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-d86219c9d32c5215.arrow and /home/ksopyla/dev/ml/hf_output/hf_enc_dec_custom/data/gec_data_file.jsonl/json/default-6447d29028c8f08e/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-b4b18a39600bbc9f.arrow
03/16/2023 05:54:58 - WARNING - datasets.arrow_dataset - Loading cached split indices for dataset at /home/ksopyla/dev/ml/hf_output/hf_enc_dec_custom/data/gec_data_file.jsonl/json/default-6447d29028c8f08e/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-d86219c9d32c5215.arrow and /home/ksopyla/dev/ml/hf_output/hf_enc_dec_custom/data/gec_data_file.jsonl/json/default-6447d29028c8f08e/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-b4b18a39600bbc9f.arrow
03/16/2023 05:54:58 - WARNING - datasets.arrow_dataset - Loading cached split indices for dataset at /home/ksopyla/dev/ml/hf_output/hf_enc_dec_custom/datagec_data_file.jsonl/json/default-6447d29028c8f08e/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-d86219c9d32c5215.arrow and /home/ksopyla/dev/ml/hf_output/hf_enc_dec_custom/data/gec_data_file.jsonl/json/default-6447d29028c8f08e/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-b4b18a39600bbc9f.arrow
[2023-03-16 05:54:59,072] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 526261
[2023-03-16 05:54:59,073] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 526262
[2023-03-16 05:54:59,218] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 526263
[2023-03-16 05:54:59,362] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 526264
[2023-03-16 05:54:59,545] [ERROR] [launch.py:324:sigkill_handler] ['/home/ksopyla/.cache/pypoetry/virtualenvs/ml-A9X51t2i-py3.8/bin/python', '-u', 'playground/hf_transformers/training_enc_dec_model_from_scratch.py', '--local_rank=3', '--output_dir=./hf_output/', '--per_device_train_batch_size=128', '--dataloader_num_workers=8', '--gradient_accumulation_steps=1', '--gradient_checkpointing=False', '--fp16', '--logging_steps=500', '--eval_steps=5000', '--save_steps=50000', '--num_train_epochs=2', '--learning_rate=0.001', '--warmup_steps=5000', '--logging_first_step=True', '--eval_accumulation_steps=100', '--deepspeed', 'playground/hf_transformers/deepspeed_zero2.json', '--log_level=warning'] exits with return code = 1
@amyeroberts
Collaborator

Hi @ksopyla Thanks for raising this issue and for providing the script and environment details. Could you share the full traceback of the error encountered?

Although I'm not immediately sure where the error is being raised, this error is expected if hidden_size is being referenced from the model's config, i.e. model.config.hidden_size, as only the encoder and decoder sub-configs have this parameter.
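For illustration, a minimal standalone sketch of what that means (not taken from the script above): hidden_size lives on the encoder and decoder sub-configs, not on the composite EncoderDecoderConfig itself.

from transformers import BertConfig, EncoderDecoderConfig

enc = BertConfig(hidden_size=512)
dec = BertConfig(hidden_size=512, is_decoder=True, add_cross_attention=True)
cfg = EncoderDecoderConfig.from_encoder_decoder_configs(enc, dec)

print(cfg.encoder.hidden_size)       # 512 -- the sub-configs carry hidden_size
print(cfg.decoder.hidden_size)       # 512
print(hasattr(cfg, "hidden_size"))   # False -- the composite config does not,
                                     # hence the AttributeError in the traceback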

@ksopyla
Author

ksopyla commented Mar 16, 2023

Hi @amyeroberts, I have updated the issue and added the traceback. I hope it helps.
Yes, you are right: the problem occurs when the script tries to get model.config.hidden_size.

I would add that the encoder and decoder could have different sizes in terms of the number of layers and hidden_size.

@stas00
Contributor

stas00 commented Mar 16, 2023

Thank you for the full traceback, @ksopyla. Now it's easy to support you.

Please try again with the latest version of transformers. You can see here that this situation was dealt with on Feb 10th, so this assert shouldn't happen again, as it now carefully checks the different scenarios:

def trainer_config_finalize(self, args, model, num_training_steps):
    """
    This stage is run after we have the model and know num_training_steps.
    Now we can complete the configuration process.
    """
    # zero
    # deal with config keys that use `auto` value and rely on model's hidden_size
    hidden_size_based_keys = [
        "zero_optimization.reduce_bucket_size",
        "zero_optimization.stage3_prefetch_bucket_size",
        "zero_optimization.stage3_param_persistence_threshold",
    ]
    hidden_size_auto_keys = [x for x in hidden_size_based_keys if self.is_auto(x)]
    if len(hidden_size_auto_keys) > 0:
        if hasattr(model.config, "hidden_size"):
            hidden_size = model.config.hidden_size
        elif hasattr(model.config, "hidden_sizes"):
            # if there are many hidden sizes pick the largest one
            hidden_size = max(model.config.hidden_sizes)
        else:
            raise ValueError(
                "The model's config file has neither `hidden_size` nor `hidden_sizes` entry, "
                "therefore it's not possible to automatically fill out the following `auto` entries "
                f"in the DeepSpeed config file: {hidden_size_auto_keys}. You can fix that by replacing "
                "`auto` values for these keys with an integer value of your choice."
            )
        self.fill_only("zero_optimization.reduce_bucket_size", hidden_size * hidden_size)
        if self.is_zero3():
            # automatically assign the optimal config values based on model config
            self.fill_only("zero_optimization.stage3_prefetch_bucket_size", 0.9 * hidden_size * hidden_size)
            self.fill_only("zero_optimization.stage3_param_persistence_threshold", 10 * hidden_size)

However, if you don't set hidden_size, then please don't use auto values in the zero_optimization section. This is what the proper assert in the latest version will tell you to do.

This is just an automatic optimization: you can remove these entries completely and DeepSpeed will use its defaults, or you can work out what those values should be and set them yourself, as explained here:
https://huggingface.co/docs/transformers/main/main_classes/deepspeed#zero3-config
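As a rough sketch of what the auto fill computes (not part of this thread's config, which sets the bucket sizes explicitly; it simply restates the formulas from the code quoted above, using the hidden_size of 512 from the script):

# values the `auto` fill would derive from hidden_size = 512
hidden_size = 512

zero_auto_values = {
    "zero_optimization": {
        "reduce_bucket_size": hidden_size * hidden_size,                      # 262144
        # the next two are only consulted for ZeRO stage 3
        "stage3_prefetch_bucket_size": int(0.9 * hidden_size * hidden_size),  # 235929
        "stage3_param_persistence_threshold": 10 * hidden_size,               # 5120
    }
}
print(zero_auto_values)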

@ksopyla
Author

ksopyla commented Mar 16, 2023

Sure, I will check and let you know.
Meanwhile, could you explain what you mean by "then please don't use auto values for zero configuration section"?
I use ZeRO-2 (https://huggingface.co/docs/transformers/main/main_classes/deepspeed#zero2-config), not ZeRO-3.

I infer you are talking about these parameters, which should be set if I use ZeRO-3:

hidden_size_based_keys = [
            "zero_optimization.reduce_bucket_size",
            "zero_optimization.stage3_prefetch_bucket_size",
            "zero_optimization.stage3_param_persistence_threshold",
        ]

Correct me if I am wrong. Or maybe I should also set those for ZeRO-2?

@stas00
Contributor

stas00 commented Mar 16, 2023

Ah, ok, thank you for clarifying the situation - that's even simpler then. Just upgrade transformers, change nothing in your setup, and it should just work.

The original code did model.config.hidden_size regardless of the config type, and that is why it fails for you.
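For anyone stuck on an older transformers release, one possible workaround (an untested sketch, not something suggested in this thread) would be to mirror the encoder's hidden size onto the composite config before building the Trainer, since the old code reads model.config.hidden_size directly:

# hypothetical workaround for older transformers versions: expose hidden_size
# on the composite EncoderDecoderConfig (both sub-models use 512 here anyway)
model.config.hidden_size = model.config.encoder.hidden_size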

@ksopyla
Author

ksopyla commented Mar 20, 2023

I have updated transformers to 4.27 and PyTorch to 2.0, and it works :)
However, I see that ZeRO-2 is slower than the plain PyTorch distributed approach; I will try to investigate it further.
Meanwhile, thank you for your help.

@stas00
Contributor

stas00 commented Mar 20, 2023

Best to discuss a new issue in a new Issue, but if we can wrap it up quickly: it's absolutely normal that the speed will progressively drop as you enable stages 1, 2 and 3, as each stage creates additional overhead.

If you can fit everything onto a single GPU, do not use DeepSpeed. It's a scalability solution for when one can't fit the training or inference components onto a single GPU. If you can, always use straight DDP.
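For comparison, a plain DDP run of the same script (a hypothetical launch line, simply dropping the DeepSpeed pieces and using torchrun, which ships with recent PyTorch) would look something like:

torchrun --nproc_per_node=4 training_enc_dec_model_from_scratch.py \
  --output_dir="./hf_output/" \
  --per_device_train_batch_size=128 \
  --fp16 \
  --num_train_epochs=2 \
  --learning_rate=0.001
  # ... same remaining Trainer arguments as above, minus --deepspeed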

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
