No inf checks were recorded for this optimizer. #7

Open
johnny0234 opened this issue May 7, 2024 · 2 comments

@johnny0234

(botsort) PS C:\Users\user\AICUP_Baseline_BoT-SORT>

  • History restored

UP\bagtricks_R50-ibn.yml MODEL.DEVICE "cuda:0"
Command Line Args: Namespace(config_file='C:\Users\user\AICUP_Baseline_BoT-SORT\fast_reid\configs\AICUP\bagtricks_R50-ibn.yml', resume=False, eval_only=False, num_gpus=1, num_machines=1, machine_rank=0, dist_url='tcp://127.0.0.1:49153', opts=['MODEL.DEVICE', 'cuda:0'])
[05/07 13:18:32 fastreid]: Rank of current process: 0. World size: 1
[05/07 13:18:34 fastreid]: Environment info:


sys.platform win32
Python 3.9.19 (main, Mar 21 2024, 17:21:27) [MSC v.1916 64 bit (AMD64)]
numpy 1.26.4
fastreid failed to import
FASTREID_ENV_MODULE
PyTorch 2.3.0+cu118 @C:\Users\user\anaconda3\envs\botsort\lib\site-packages\torch
PyTorch debug build False
GPU available True
GPU 0 NVIDIA GeForce GTX 1650
CUDA_HOME C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4
Pillow 10.3.0
torchvision 0.18.0+cpu @C:\Users\user\anaconda3\envs\botsort\lib\site-packages\torchvision
torchvision arch flags C:\Users\user\anaconda3\envs\botsort\lib\site-packages\torchvision_C.pyd; cannot find cuobjdump
cv2 4.9.0


PyTorch built with:

  • C++ Version: 201703
  • MSVC 192930151
  • Intel(R) oneAPI Math Kernel Library Version 2021.4-Product Build 20210904 for Intel(R) 64 architecture applications
  • Intel(R) MKL-DNN v3.3.6 (Git Hash 86e6af5974177e513fd3fee58425e1063e7f1361)
  • OpenMP 2019
  • LAPACK is enabled (usually provided by MKL)
  • CPU capability usage: AVX2
  • CUDA Runtime 11.8
  • NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90;-gencode;arch=compute_37,code=compute_37
  • CuDNN 8.7
  • Magma 2.5.4
  • Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.8, CUDNN_VERSION=8.7.0, CXX_COMPILER=C:/actions-runner/_work/pytorch/pytorch/builder/windows/tmp_bin/sccache-cl.exe, CXX_FLAGS=/DWIN32 /D_WINDOWS /GR /EHsc /Zc:__cplusplus /bigobj /FS /utf-8 -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI
    -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE /wd4624 /wd4068 /wd4067 /wd4267 /wd4661 /wd4717 /wd4244 /wd4804 /wd4273, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.3.0, USE_CUDA=ON, USE_CUDNN=ON, USE_CUSPARSELT=OFF, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_GLOO=ON, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=OFF, USE_NNPACK=OFF, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF,

[05/07 13:18:34 fastreid]: Command line arguments: Namespace(config_file='C:\Users\user\AICUP_Baseline_BoT-SORT\fast_reid\configs\AICUP\bagtricks_R50-ibn.yml', resume=False, eval_only=False, num_gpus=1, num_machines=1, machine_rank=0, dist_url='tcp://127.0.0.1:49153', opts=['MODEL.DEVICE', 'cuda:0'])
[05/07 13:18:34 fastreid]: Contents of args.config_file=C:\Users\user\AICUP_Baseline_BoT-SORT\fast_reid\configs\AICUP\bagtricks_R50-ibn.yml:
_BASE_: ../Base-bagtricks.yml

INPUT:
  SIZE_TRAIN: [256, 256]
  SIZE_TEST: [256, 256]

MODEL:
  BACKBONE:
    WITH_IBN: True
  HEADS:
    POOL_LAYER: GeneralizedMeanPooling

  LOSSES:
    TRI:
      HARD_MINING: False
      MARGIN: 0.0

DATASETS:
  NAMES: ("AICUP",)
  TESTS: ("AICUP",)

SOLVER:
  BIAS_LR_FACTOR: 1.

  IMS_PER_BATCH: 32
  MAX_EPOCH: 60
  STEPS: [30, 50]
  WARMUP_ITERS: 2000

  CHECKPOINT_PERIOD: 1

TEST:
  EVAL_PERIOD: 60 # We didn't provide eval dataset
  IMS_PER_BATCH: 256

OUTPUT_DIR: logs/AICUP_115/bagtricks_R50-ibn

[05/07 13:18:34 fastreid]: Running with full config:
CUDNN_BENCHMARK: True
DATALOADER:
  NUM_INSTANCE: 4
  NUM_WORKERS: 8
  SAMPLER_TRAIN: NaiveIdentitySampler
  SET_WEIGHT: []
DATASETS:
  COMBINEALL: False
  NAMES: ('AICUP',)
  TESTS: ('AICUP',)
INPUT:
  AFFINE:
    ENABLED: False
  AUGMIX:
    ENABLED: False
    PROB: 0.0
  AUTOAUG:
    ENABLED: False
    PROB: 0.0
  CJ:
    BRIGHTNESS: 0.15
    CONTRAST: 0.15
    ENABLED: False
    HUE: 0.1
    PROB: 0.5
    SATURATION: 0.1
  CROP:
    ENABLED: False
    RATIO: [0.75, 1.3333333333333333]
    SCALE: [0.16, 1]
    SIZE: [224, 224]
  FLIP:
    ENABLED: True
    PROB: 0.5
  PADDING:
    ENABLED: True
    MODE: constant
    SIZE: 10
  REA:
    ENABLED: True
    PROB: 0.5
    VALUE: [123.675, 116.28, 103.53]
  RPT:
    ENABLED: False
    PROB: 0.5
  SIZE_TEST: [256, 256]
  SIZE_TRAIN: [256, 256]
KD:
  EMA:
    ENABLED: False
    MOMENTUM: 0.999
  MODEL_CONFIG: []
  MODEL_WEIGHTS: []
MODEL:
  BACKBONE:
    ATT_DROP_RATE: 0.0
    DEPTH: 50x
    DROP_PATH_RATIO: 0.1
    DROP_RATIO: 0.0
    FEAT_DIM: 2048
    LAST_STRIDE: 1
    NAME: build_resnet_backbone
    NORM: BN
    PRETRAIN: True
    PRETRAIN_PATH:
    SIE_COE: 3.0
    STRIDE_SIZE: (16, 16)
    WITH_IBN: True
    WITH_NL: False
    WITH_SE: False
  DEVICE: cuda:0
  FREEZE_LAYERS: []
  HEADS:
    CLS_LAYER: Linear
    EMBEDDING_DIM: 0
    MARGIN: 0.0
    NAME: EmbeddingHead
    NECK_FEAT: before
    NORM: BN
    NUM_CLASSES: 0
    POOL_LAYER: GeneralizedMeanPooling
    SCALE: 1
    WITH_BNNECK: True
  LOSSES:
    CE:
      ALPHA: 0.2
      EPSILON: 0.1
      SCALE: 1.0
    CIRCLE:
      GAMMA: 128
      MARGIN: 0.25
      SCALE: 1.0
    COSFACE:
      GAMMA: 128
      MARGIN: 0.25
      SCALE: 1.0
    FL:
      ALPHA: 0.25
      GAMMA: 2
      SCALE: 1.0
    NAME: ('CrossEntropyLoss', 'TripletLoss')
    TRI:
      HARD_MINING: False
      MARGIN: 0.0
      NORM_FEAT: False
      SCALE: 1.0
  META_ARCHITECTURE: Baseline
  PIXEL_MEAN: [123.675, 116.28, 103.53]
  PIXEL_STD: [58.395, 57.120000000000005, 57.375]
  QUEUE_SIZE: 8192
  WEIGHTS:
OUTPUT_DIR: logs/AICUP_115/bagtricks_R50-ibn
SOLVER:
  AMP:
    ENABLED: True
  BASE_LR: 0.00035
  BIAS_LR_FACTOR: 1.0
  CHECKPOINT_PERIOD: 1
  CLIP_GRADIENTS:
    CLIP_TYPE: norm
    CLIP_VALUE: 5.0
    ENABLED: False
    NORM_TYPE: 2.0
  DELAY_EPOCHS: 0
  ETA_MIN_LR: 1e-07
  FREEZE_ITERS: 0
  GAMMA: 0.1
  HEADS_LR_FACTOR: 1.0
  IMS_PER_BATCH: 32
  MAX_EPOCH: 60
  MOMENTUM: 0.9
  NESTEROV: False
  OPT: Adam
  SCHED: MultiStepLR
  STEPS: [30, 50]
  WARMUP_FACTOR: 0.1
  WARMUP_ITERS: 2000
  WARMUP_METHOD: linear
  WEIGHT_DECAY: 0.0005
  WEIGHT_DECAY_BIAS: 0.0005
  WEIGHT_DECAY_NORM: 0.0005
TEST:
  AQE:
    ALPHA: 3.0
    ENABLED: False
    QE_K: 5
    QE_TIME: 1
  EVAL_PERIOD: 60
  FLIP:
    ENABLED: False
  IMS_PER_BATCH: 256
  METRIC: cosine
  PRECISE_BN:
    DATASET: Market1501
    ENABLED: False
    NUM_ITER: 300
  RERANK:
    ENABLED: False
    K1: 20
    K2: 6
    LAMBDA: 0.3
  ROC:
    ENABLED: False
[05/07 13:18:34 fastreid]: Full config saved to C:\Users\user\AICUP_Baseline_BoT-SORT\logs\AICUP_115\bagtricks_R50-ibn\config.yaml
C:\Users\user\AICUP_Baseline_BoT-SORT.\fast_reid\fastreid\data\transforms\functional.py:46: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
img = torch.ByteTensor(torch.ByteStorage.from_buffer(pic.tobytes()))
C:\Users\user\AICUP_Baseline_BoT-SORT.\fast_reid\fastreid\data\transforms\functional.py:46: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
img = torch.ByteTensor(torch.ByteStorage.from_buffer(pic.tobytes()))
C:\Users\user\AICUP_Baseline_BoT-SORT.\fast_reid\fastreid\data\transforms\functional.py:46: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
img = torch.ByteTensor(torch.ByteStorage.from_buffer(pic.tobytes()))
C:\Users\user\AICUP_Baseline_BoT-SORT.\fast_reid\fastreid\data\transforms\functional.py:46: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
img = torch.ByteTensor(torch.ByteStorage.from_buffer(pic.tobytes()))
C:\Users\user\AICUP_Baseline_BoT-SORT.\fast_reid\fastreid\data\transforms\functional.py:46: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
img = torch.ByteTensor(torch.ByteStorage.from_buffer(pic.tobytes()))
C:\Users\user\AICUP_Baseline_BoT-SORT.\fast_reid\fastreid\data\transforms\functional.py:46: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
img = torch.ByteTensor(torch.ByteStorage.from_buffer(pic.tobytes()))
C:\Users\user\AICUP_Baseline_BoT-SORT.\fast_reid\fastreid\data\transforms\functional.py:46: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
img = torch.ByteTensor(torch.ByteStorage.from_buffer(pic.tobytes()))
C:\Users\user\AICUP_Baseline_BoT-SORT.\fast_reid\fastreid\data\transforms\functional.py:46: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
img = torch.ByteTensor(torch.ByteStorage.from_buffer(pic.tobytes()))
start epoch 0
Exception during training:
Traceback (most recent call last):
File "C:\Users\user\AICUP_Baseline_BoT-SORT.\fast_reid\fastreid\engine\train_loop.py", line 146, in train
self.run_step()
File "C:\Users\user\AICUP_Baseline_BoT-SORT.\fast_reid\fastreid\engine\defaults.py", line 359, in run_step
self._trainer.run_step()
File "C:\Users\user\AICUP_Baseline_BoT-SORT.\fast_reid\fastreid\engine\train_loop.py", line 354, in run_step
self.grad_scaler.step(self.optimizer)
File "C:\Users\user\anaconda3\envs\botsort\lib\site-packages\torch\amp\grad_scaler.py", line 449, in step
assert (
AssertionError: No inf checks were recorded for this optimizer.
Traceback (most recent call last):
File "C:\Users\user\AICUP_Baseline_BoT-SORT\fast_reid\tools\train_net.py", line 54, in <module>
launch(
File "C:\Users\user\AICUP_Baseline_BoT-SORT.\fast_reid\fastreid\engine\launch.py", line 71, in launch
main_func(*args)
File "C:\Users\user\AICUP_Baseline_BoT-SORT\fast_reid\tools\train_net.py", line 47, in main
return trainer.train()
File "C:\Users\user\AICUP_Baseline_BoT-SORT.\fast_reid\fastreid\engine\defaults.py", line 350, in train
super().train(self.start_epoch, self.max_epoch, self.iters_per_epoch)
File "C:\Users\user\AICUP_Baseline_BoT-SORT.\fast_reid\fastreid\engine\train_loop.py", line 146, in train
self.run_step()
File "C:\Users\user\AICUP_Baseline_BoT-SORT.\fast_reid\fastreid\engine\defaults.py", line 359, in run_step
self._trainer.run_step()
File "C:\Users\user\AICUP_Baseline_BoT-SORT.\fast_reid\fastreid\engine\train_loop.py", line 354, in run_step
self.grad_scaler.step(self.optimizer)
File "C:\Users\user\anaconda3\envs\botsort\lib\site-packages\torch\amp\grad_scaler.py", line 449, in step
assert (
AssertionError: No inf checks were recorded for this optimizer.

No inf checks were recorded for this optimizer. How can I resolve this?

@MuennL

MuennL commented May 9, 2024

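Are you following the baseline tutorial, or have you changed anything in the configs? It looks like the optimizer of some pretrained model you are using is incompatible with the train loop in fast_reid. The fast_reid default trainer presets quite a few training conditions that are not clearly documented and appear only in code comments, so that may be a direction worth investigating. Just for your reference.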
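
One way to investigate along these lines: in PyTorch's `GradScaler.step`, the assertion "No inf checks were recorded for this optimizer" typically fires when none of the parameters held by the optimizer has a gradient at the moment `step` is called. The helper below is a minimal diagnostic sketch, not part of fast_reid; the function name `count_params_without_grad` is hypothetical. It relies only on the standard `optimizer.param_groups` layout (a list of dicts with a `"params"` entry).

```python
def count_params_without_grad(optimizer):
    """Return (missing, total): how many optimizer parameters have no gradient.

    Hypothetical diagnostic helper. It only assumes the object exposes
    `param_groups`, a list of dicts holding a "params" list, as
    torch.optim optimizers do.
    """
    missing = 0
    total = 0
    for group in optimizer.param_groups:
        for param in group["params"]:
            total += 1
            # A parameter that backward() never reached still has grad=None.
            if getattr(param, "grad", None) is None:
                missing += 1
    return missing, total
```

Calling it right after the backward pass and just before `self.grad_scaler.step(self.optimizer)` in `train_loop.py` would show whether every parameter is missing a gradient (`missing == total`), which would point at how the optimizer's parameters are built rather than at the data or the loss.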

@ricky-696
Owner

ricky-696 commented May 9, 2024

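@johnny0234 Hello,

The error you are seeing is an optimizer problem inside the torch package. I noticed that you are using the latest torch 2.3.x release. I think that version differs too much from what fast_reid expects, so some functions are not implemented, and that is what triggers the AssertionError.

The original author used torch 1.11.0+cu113 with torchvision 0.12.0 (as mentioned here).

I installed torch 1.13.1+cu117 with torchvision 0.14.1 myself, for your reference.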
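
To make the version comparison concrete, here is a small sketch that checks a torch/torchvision pair against the two combinations reported in this thread. The names `REPORTED_WORKING`, `split_version`, and `check_environment` are hypothetical, and the set is only what the commenters said they used, not an official compatibility matrix. It also flags differing build tags, since the environment info in the original report lists torch as `+cu118` but torchvision as `+cpu`.

```python
# Version pairs reported in this thread (not an exhaustive compatibility list).
REPORTED_WORKING = {
    ("1.11.0", "0.12.0"),  # original author's environment
    ("1.13.1", "0.14.1"),  # repository owner's environment
}


def split_version(version):
    """Split '2.3.0+cu118' into ('2.3.0', 'cu118'); the tag is '' if absent."""
    base, _, tag = version.partition("+")
    return base, tag


def check_environment(torch_version, torchvision_version):
    """Return a list of warnings for the given version strings."""
    warnings = []
    torch_base, torch_tag = split_version(torch_version)
    tv_base, tv_tag = split_version(torchvision_version)

    if (torch_base, tv_base) not in REPORTED_WORKING:
        warnings.append(
            f"torch {torch_base} / torchvision {tv_base} is not one of the "
            "pairs reported as working in this thread"
        )
    # Only compare build tags when both packages report one.
    if torch_tag and tv_tag and torch_tag != tv_tag:
        warnings.append(
            f"build tags differ: torch is '{torch_tag}', torchvision is '{tv_tag}'"
        )
    return warnings
```

In a live environment it could be called as `check_environment(torch.__version__, torchvision.__version__)`. For the versions in the report, `2.3.0+cu118` and `0.18.0+cpu`, it returns two warnings.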
