Unet - inference issues #42

Open
kkurzacz-intel opened this issue Jun 24, 2024 · 1 comment

kkurzacz-intel commented Jun 24, 2024

I'm getting an error when running UNet2D inference:

root@ip-172-31-0-126:/Model-References/PyTorch/computer_vision/segmentation/Unet# python main.py --exec_mode predict --task 01 --hpus 1 --fold 3 --seed 123 --val_batch_size 64 --dim 2 --data=/data/pytorch/unet/01_2d --results=/tmp/Unet/results/fold_3 --autocast --inference_mode lazy --ckpt_path pretrained_checkpoint/pretrained_checkpoint.pt
Namespace(framework='pytorch-lightning', exec_mode='predict', data='/data/pytorch/unet/01_2d', results='/tmp/Unet/results/fold_3', logname=None, task='01', gpus=0, hpus=1, learning_rate=0.001, gradient_clip_val=0, negative_slope=0.01, tta=False, gradient_clip=False, gradient_clip_norm=12, amp=False, benchmark=False, deep_supervision=False, drop_block=False, attention=False, residual=False, focal=False, sync_batchnorm=False, save_ckpt=False, nfolds=5, seed=123, skip_first_n_eval=0, ckpt_path='pretrained_checkpoint/pretrained_checkpoint.pt', fold=3, patience=100, lr_patience=70, batch_size=2, val_batch_size=64, steps=None, profile=False, profile_steps='90:95', momentum=0.99, weight_decay=0.0001, save_preds=False, dim=2, resume_training=False, factor=0.3, num_workers=8, min_epochs=30, max_epochs=10000, warmup=5, norm='instance', nvol=1, run_lazy_mode=True, inference_mode='lazy', is_autocast=True, hpu_graphs=True, habana_loader=False, bucket_cap_mb=130, data2d_dim=3, oversampling=0.33, overlap=0.5, affinity='disabled', scheduler='none', optimizer='adamw', blend='gaussian', train_batches=0, test_batches=0, progress_bar_refresh_rate=25, set_aug_seed=False, augment=True, measurement_type='throughput', use_torch_compile=False, enable_tensorboard_logging=False)
Seed set to 123
Seed set to 123
Seed set to 123
Seed set to 773630
Number of test examples: 266
Seed set to 28030
Traceback (most recent call last):
  File "/Model-References/PyTorch/computer_vision/segmentation/Unet/main.py", line 218, in <module>
    main()
  File "/Model-References/PyTorch/computer_vision/segmentation/Unet/main.py", line 209, in main
    ptlrun(args)
  File "/Model-References/PyTorch/computer_vision/segmentation/Unet/lightning_trainer/ptl.py", line 211, in ptlrun
    model = NNUnet.load_from_checkpoint(ckpt_path)
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/utilities/model_helpers.py", line 125, in wrapper
    return self.method(cls, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/core/module.py", line 1581, in load_from_checkpoint
    loaded = _load_from_checkpoint(
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/core/saving.py", line 91, in _load_from_checkpoint
    model = _load_state(cls, checkpoint, strict=strict, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/core/saving.py", line 158, in _load_state
    obj = cls(**_cls_kwargs)
  File "/Model-References/PyTorch/computer_vision/segmentation/Unet/models/nn_unet.py", line 72, in __init__
    self.build_nnunet()
  File "/Model-References/PyTorch/computer_vision/segmentation/Unet/models/nn_unet.py", line 189, in build_nnunet
    in_channels, n_class, kernels, strides, self.patch_size = get_unet_params(self.args)
  File "/Model-References/PyTorch/computer_vision/segmentation/Unet/utils/utils.py", line 132, in get_unet_params
    config = get_config_file(args)
  File "/Model-References/PyTorch/computer_vision/segmentation/Unet/utils/utils.py", line 102, in get_config_file
    return pickle.load(open(path, "rb"))
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/weka/data/pytorch/unet/01_2d/config.pkl'

This command comes from the README examples (Single Card Inference Examples / Inference / UNet2D, Lazy mode, BF16 mixed precision, batch size 64, 1 HPU on a single server).
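One detail in the traceback stands out: the missing file is under /mnt/weka/data/pytorch/unet/01_2d, while the command passes --data=/data/pytorch/unet/01_2d. Since the failure happens inside NNUnet.load_from_checkpoint, one possible explanation (an assumption, not confirmed in this thread) is that a data path stored in the checkpoint's hyperparameters is used instead of the one given on the command line. The sketch below is one way to check this. It relies on Lightning saving hyperparameters under the "hyper_parameters" key of the checkpoint. The helper names (iter_hparams, missing_paths, load_hparams) are hypothetical, and the exact layout of the hyperparameters in this particular checkpoint is assumed.

```python
import argparse
import os


def iter_hparams(obj, prefix=""):
    """Yield (dotted_name, value) for every leaf in nested dicts/Namespaces."""
    if isinstance(obj, argparse.Namespace):
        obj = vars(obj)
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from iter_hparams(value, f"{prefix}{key}.")
    else:
        yield prefix.rstrip("."), obj


def missing_paths(hparams):
    """Return absolute-path string values that do not exist on this machine."""
    return {
        name: value
        for name, value in iter_hparams(hparams)
        if isinstance(value, str)
        and os.path.isabs(value)
        and not os.path.exists(value)
    }


def load_hparams(ckpt_path):
    """Read the saved hyperparameters from a Lightning checkpoint file."""
    import torch  # imported lazily so the helpers above work without torch

    # weights_only=False because the hyperparameters may contain pickled
    # Python objects such as argparse.Namespace (assumption for this repo).
    # Only do this for checkpoints from a trusted source.
    checkpoint = torch.load(ckpt_path, map_location="cpu", weights_only=False)
    return checkpoint.get("hyper_parameters", {})


# Possible usage inside the container:
#   print(missing_paths(load_hparams(
#       "pretrained_checkpoint/pretrained_checkpoint.pt")))
```

If the output lists an entry pointing at /mnt/weka/..., that would support the explanation above. If it is empty, the path comes from somewhere else.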

Environment:

  • AWS DL1 instance + suggested system image
  • Ubuntu 22.04.4
  • Python 3.10.12

I followed the Gaudi AWS quickstart to start the instance and run the Habana Docker runtime environment.

The command for benchmark inference, by contrast, works without errors:

$PYTHON main.py --exec_mode predict --task 01 --hpus 1 --fold 3 --val_batch_size 64 --dim 2 --data=/data/pytorch/unet/01_2d --results=/tmp/Unet/results/fold_3 --autocast --inference_mode lazy --benchmark --test_batches 150
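To make the difference between the failing and the working command explicit, here is a small sketch that compares their flags. Both command strings are copied from this report. The parse_flags helper is hypothetical and only handles the "--flag value", "--flag=value" and bare "--flag" forms used here.

```python
import shlex

FAILING = (
    "python main.py --exec_mode predict --task 01 --hpus 1 --fold 3 "
    "--seed 123 --val_batch_size 64 --dim 2 --data=/data/pytorch/unet/01_2d "
    "--results=/tmp/Unet/results/fold_3 --autocast --inference_mode lazy "
    "--ckpt_path pretrained_checkpoint/pretrained_checkpoint.pt"
)
WORKING = (
    "$PYTHON main.py --exec_mode predict --task 01 --hpus 1 --fold 3 "
    "--val_batch_size 64 --dim 2 --data=/data/pytorch/unet/01_2d "
    "--results=/tmp/Unet/results/fold_3 --autocast --inference_mode lazy "
    "--benchmark --test_batches 150"
)


def parse_flags(command):
    """Map each --flag to its value, or to True for a bare switch."""
    flags = {}
    tokens = shlex.split(command)
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token.startswith("--"):
            if "=" in token:
                name, value = token.split("=", 1)
            elif i + 1 < len(tokens) and not tokens[i + 1].startswith("--"):
                name, value = token, tokens[i + 1]
                i += 1  # skip the value we just consumed
            else:
                name, value = token, True
            flags[name] = value
        i += 1
    return flags


failing, working = parse_flags(FAILING), parse_flags(WORKING)
only_in_failing = sorted(set(failing) - set(working))
only_in_working = sorted(set(working) - set(failing))
changed = sorted(k for k in set(failing) & set(working) if failing[k] != working[k])

print("only in failing command:", only_in_failing)
print("only in working command:", only_in_working)
print("same flag, different value:", changed)
```

The flags that appear only in the failing command are --seed and --ckpt_path, which is consistent with the traceback: the error is raised while the checkpoint is being loaded.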

@Alberto-Villarreal

@kkurzacz-intel Could you please tell us which command from https://github.com/HabanaAI/Model-References/tree/master/PyTorch/computer_vision/segmentation/Unet#single-card-inference-examples you used, i.e. the one that produced the error above?
