
[Bug] Poor training results when trying to configure for camera-only BEVFusion #3024

Open
3 tasks done
abubake opened this issue Aug 20, 2024 · 8 comments

Comments

@abubake

abubake commented Aug 20, 2024

Prerequisite

Task

I have modified the scripts/configs, or I'm working on my own tasks/models/datasets.

Branch

main branch https://github.com/open-mmlab/mmdetection3d

Environment

sys.platform: linux
Python: 3.10.14 (main, Jul 8 2024, 14:50:49) [GCC 12.3.0]
CUDA available: True
numpy_random_seed: 2147483648
GPU 0,1,2,3: NVIDIA GeForce GTX 1080 Ti
CUDA_HOME: /usr/local/cuda-12.1
NVCC: Cuda compilation tools, release 12.1, V12.1.66
GCC: gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
PyTorch: 2.1.2+cu121
PyTorch compiling details: PyTorch built with:

  • GCC 9.3
  • C++ Version: 201703
  • Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  • Intel(R) MKL-DNN v3.1.1 (Git Hash 64f6bcbcbab628e96f33a62c3e975f8535a7bde4)
  • OpenMP 201511 (a.k.a. OpenMP 4.5)
  • LAPACK is enabled (usually provided by MKL)
  • NNPACK is enabled
  • CPU capability usage: AVX512
  • CUDA Runtime 12.1
  • NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
  • CuDNN 8.9.2
  • Magma 2.6.1
  • Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=8.9.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-invalid-partial-specialization -Wno-unused-private-field -Wno-aligned-allocation-unavailable -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.1.2, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,

TorchVision: 0.16.2+cu121
OpenCV: 4.9.0
MMEngine: 0.10.2
MMDetection: 3.3.0
MMDetection3D: 1.4.0+161d091
spconv2.0: False

Reproduces the problem - code sample

'''
_base_ points to the base configuration file. Config files follow a system of inheritance: just as when you
inherit from a class, this config contains everything defined in default_runtime.py. The same ideas that apply
to class inheritance apply here; for example, to change something from default_runtime you can redefine it in
this file, much like overriding a method of a parent class.

custom_imports imports the modules within the BEVFusion project that are needed to run the code.
'''
_base_ = ['../../../configs/_base_/default_runtime.py']
custom_imports = dict(
    imports=['projects.BEVFusion.bevfusion'], allow_failed_imports=False)
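
# Illustrative note: any key defined in the inherited default_runtime.py can be overridden simply by
# redefining it here; the child value wins. For example, default_hooks, load_from and resume further down
# in this file override values inherited from default_runtime.py, and a hypothetical line such as
#   log_level = 'DEBUG'
# would likewise replace whatever log level the base file sets.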

'''
The point cloud range specifies the geometric space (in meters) that point clouds are allowed to occupy.
voxel_size indicates the size in meters, per dimension, of the cells that make up the BEV grid
(the map on which BEVFusion's predictions are made).
'''
point_cloud_range = [-51.2, -51.2, -5.0, 51.2, 51.2, 3.0] # TODO: step through for more info
# point_cloud_range = [-54.0, -54.0, -5.0, 54.0, 54.0, 3.0]
# voxel_size = [0.075, 0.075, 0.2] # this voxel size made it actually have a mAP of 0!
voxel_size = [0.1, 0.1, 0.2]
# image_size = [256, 704]
# post_center_range = [-64.0, -64.0, -10.0, 64.0, 64.0, 10.0]
post_center_range = [-61.2, -61.2, -10.0, 61.2, 61.2, 10.0] # this matches what I see for det in MIT # TODO: step through for more info
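# Consistency check (my own arithmetic, worth verifying): (51.2 - (-51.2)) / 0.1 = 1024 BEV cells per axis,
# which matches grid_size=[1024, 1024, 1] in the bbox_head train_cfg below; with out_size_factor=8 the head
# then works on a 1024 / 8 = 128 x 128 map. The LSS xbound/ybound step of 0.4 gives 102.4 / 0.4 = 256 cells
# before the view transform's downsample=2 stage.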

'''
Class names used for all object detection tasks. Using nuScenes, we train and evaluate on 6 object detection
tasks, where the combination of object classes in each task varies. For example, task 0 may contain car,
truck, and bus, while task 1 may contain motorcycle, bicycle, and barrier.
'''
class_names = [
    'car', 'truck', 'construction_vehicle', 'bus', 'trailer', 'barrier',
    'motorcycle', 'bicycle', 'pedestrian', 'traffic_cone'
]
'''
metainfo passes the class names from the config in the format the code expects.
The dataset type and root specify 1. the dataset class being used (for other datasets such as KITTI, a
dataset class is defined in the same way) and 2. the relative path to the nuScenes data.

data_prefix tells the NuScenesDataset object which sensors are being used. This can include camera and lidar
sensors; in this case, we include only the 6 cameras available in the nuScenes dataset.
'''
metainfo = dict(classes=class_names)    #, version='v1.0-mini')
dataset_type = 'NuScenesDataset'
data_root = 'data/nuscenes/'

data_prefix = dict(
    CAM_FRONT='samples/CAM_FRONT',
    CAM_FRONT_LEFT='samples/CAM_FRONT_LEFT',
    CAM_FRONT_RIGHT='samples/CAM_FRONT_RIGHT',
    CAM_BACK='samples/CAM_BACK',
    CAM_BACK_RIGHT='samples/CAM_BACK_RIGHT',
    CAM_BACK_LEFT='samples/CAM_BACK_LEFT'
    )

'''
input_modality specifies which sensors are used; with use_lidar=False the dataset loads only camera data
for each sample.
'''
input_modality = dict(use_lidar=False, use_camera=True) # TODO: determine the full effect of use_lidar=False
backend_args = None # file-backend options for data loading (None = the default local backend)

'''
MODEL DEFINITION
- MMLab's way of defining deep learning models.

- type: specifies the model (the BEVFusion project class) being used
- data_preprocessor: Det3DDataPreprocessor is a general mmdetection3d preprocessing class that works for
  lidar, vision-only, and multi-modal inputs.
- img_backbone: the model that performs the initial transformation from image data into features
    * mmdet.SwinTransformer
- img_neck: the component that takes the backbone outputs and further refines the features (an FPN)
- view_transform: lifts the refined image features into the BEV grid (LSS-style transform)
- pts_backbone / pts_neck: process the BEV feature map
- bbox_head: predicts 3D boxes from the BEV features (CenterHead)
'''
model = dict(
    type='BEVFusion',
    data_preprocessor=dict(
        type='Det3DDataPreprocessor',
        pad_size_divisor=32,
        # voxelize_cfg=dict(
        #     max_num_points=10,
        #     point_cloud_range=point_cloud_range,
        #     voxel_size=voxel_size,
        #     max_voxels=[120000, 160000],
        #     voxelize_reduce=True),
        mean=[123.675, 116.28, 103.53],
        std=[58.395, 57.12, 57.375],
        bgr_to_rgb=False),
    img_backbone=dict(
        type='mmdet.SwinTransformer',
        embed_dims=96,
        depths=[2, 2, 6, 2],
        num_heads=[3, 6, 12, 24],
        window_size=7,
        mlp_ratio=4,
        qkv_bias=True,
        qk_scale=None,
        drop_rate=0.0,
        attn_drop_rate=0.0,
        drop_path_rate=0.2,
        patch_norm=True,
        out_indices=[1, 2, 3],
        with_cp=False,
        convert_weights=True,
        init_cfg=dict(
            type='Pretrained',
            checkpoint=  # noqa: E251
            'https://github.com/SwinTransformer/storage/releases/download/v1.0.0/swin_tiny_patch4_window7_224.pth'  # noqa: E501
        )),
    img_neck=dict(
        type='GeneralizedLSSFPN',
        in_channels=[192, 384, 768],
        out_channels=256,
        start_level=0,
        num_outs=3,
        norm_cfg=dict(type='BN2d', requires_grad=True),
        act_cfg=dict(type='ReLU', inplace=True),
        upsample_cfg=dict(mode='bilinear', align_corners=False)),
    view_transform=dict(
        type='LSSTransform',
        in_channels=256,
        out_channels=80,
        image_size=[256, 704],
        feature_size=[32, 88],
        # xbound=[-54.0, 54.0, 0.3],
        xbound=[-51.2, 51.2, 0.4],
        ybound=[-51.2, 51.2, 0.4],
        # ybound=[-54.0, 54.0, 0.3],
        zbound=[-10.0, 10.0, 20.0],
        dbound=[1.0, 60.0, 0.5],
        downsample=2),
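    # Consistency note (my own assumption, worth verifying): feature_size is expected to equal
    # image_size divided by the stride of the image features fed to the view transform (stride 8
    # for the finest FPN level here), i.e. [256 / 8, 704 / 8] = [32, 88], matching the values above.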
    pts_backbone=dict(
        type='GeneralizedResNet',
        in_channels=80,
        blocks=[[2, 128, 2],
                [2, 256, 2],
                [2, 512, 1]]),
    pts_neck=dict(
        type='LSSFPN',
        in_indices=[-1,0],
        in_channels=[512, 128],
        out_channels=256,
        scale_factor=2),
    bbox_head=dict(
        type='CenterHead', # changed back to CenterHead from CustomCenterHead
        in_channels=256,
        tasks=[
            dict(num_class=1, class_names=['car']),
            dict(num_class=2, class_names=['truck', 'construction_vehicle']),
            dict(num_class=2, class_names=['bus', 'trailer']),
            dict(num_class=1, class_names=['barrier']),
            dict(num_class=2, class_names=['motorcycle', 'bicycle']),
            dict(num_class=2, class_names=['pedestrian', 'traffic_cone']),
        ],
        common_heads=dict(
            reg=(2, 2), height=(1, 2), dim=(3, 2), rot=(2, 2), vel=(2, 2)),
        share_conv_channel=64,
        bbox_coder=dict(
            type='CenterPointBBoxCoder', # modified from CustomCenterPointBBoxCoder
            post_center_range=post_center_range,
            pc_range=point_cloud_range,
            max_num=500,
            score_threshold=0.1,
            out_size_factor=8,            
            voxel_size=voxel_size[:2],
            code_size=9),
        separate_head=dict(
            type='SeparateHead', init_bias=-2.19, final_kernel=3),
        loss_cls=dict(type='mmdet.GaussianFocalLoss', reduction='mean'),
        loss_bbox=dict(
            type='mmdet.L1Loss', reduction='mean', loss_weight=0.25),
        norm_bbox=True,
        train_cfg=dict(
            dataset='nuScenes',
            point_cloud_range=point_cloud_range,
            grid_size=[1024, 1024, 1],
            # grid_size=[1440, 1440, 41],
            voxel_size=voxel_size,
            out_size_factor=8,
            dense_reg=1,
            gaussian_overlap=0.1,
            max_objs=500,
            min_radius=2,
            code_weights=[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.2, 0.2]
        ),
        test_cfg=dict(
            dataset='nuScenes',
            post_center_limit_range=post_center_range,
            max_per_img=500,
            max_pool_nms=False,
            min_radius=[4, 12, 10, 1, 0.85, 0.175],
            score_threshold=0.1,
            pc_range=point_cloud_range[:2], # reference config used [0:2], which is equivalent
            out_size_factor=8,
            voxel_size=voxel_size[:2],
            nms_type='circle', # a per-task list ['circle'] * 6 was also tried; changed back to a single 'circle'
            pre_max_size=1000,
            post_max_size=83,
            nms_thr=0.2)
    )
)

train_pipeline = [
    dict(
        type='BEVLoadMultiViewImageFromFiles',
        to_float32=False, # was fp32 (True); what happens if we change it?
        color_type='color',
        backend_args=backend_args),
    dict(
        type='LoadAnnotations3D',
        with_bbox_3d=True,
        with_label_3d=True,
        with_attr_label=False),
    # dict(type='ObjectSample', db_sampler=db_sampler),
    dict(
        type='ImageAug3D',
        final_dim=[256, 704],
        resize_lim=[0.38, 0.55],
        bot_pct_lim=[0.0, 0.0],
        rot_lim=[-5.4, 5.4],
        rand_flip=True,
        is_train=True),
    dict(type='BEVFusionRandomFlip3D'), # was temporarily commented out
    dict(type='ObjectRangeFilter', point_cloud_range=point_cloud_range),
    dict(
        type='ObjectNameFilter',
        classes=[
            'car', 'truck', 'construction_vehicle', 'bus', 'trailer',
            'barrier', 'motorcycle', 'bicycle', 'pedestrian', 'traffic_cone'
        ]),
    dict(
        type='GridMask',
        use_h=True,
        use_w=True,
        rotate=1,
        offset=False,
        ratio=0.5,
        mode=1,
        prob=0,
        max_epoch=20,
    ),
    # dict(type='PointShuffle'),
    dict(
        type='Pack3DDetInputs',
        keys=[
            'points', 'img', 'gt_bboxes_3d', 'gt_labels_3d', 'gt_bboxes',
            'gt_labels'
        ],
        meta_keys=[
            'cam2img', 'ori_cam2img', 'lidar2cam', 'lidar2img', 'cam2lidar',
            'ori_lidar2img', 'img_aug_matrix', 'box_type_3d', 'sample_idx',
            'lidar_path', 'img_path', 'transformation_3d_flow',
            #'pcd_rotation','pcd_scale_factor', 'pcd_trans', 
            'img_aug_matrix',
            #'lidar_aug_matrix', 'num_pts_feats'
        ])
]
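
# Note: pipeline transforms are applied in order to each sample dict; Pack3DDetInputs at the end selects
# which keys and meta_keys are actually handed to the model.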

test_pipeline = [
    dict(
        type='BEVLoadMultiViewImageFromFiles', # no BEV prefix in MIT
        to_float32=True,
        color_type='color',
        backend_args=backend_args), # what are the backend args being used??
    dict( # MIT includes another type here, LoadAnnotations3D
        type='ImageAug3D',
        final_dim=[256, 704],
        resize_lim=[0.48, 0.48],
        bot_pct_lim=[0.0, 0.0],
        rot_lim=[0.0, 0.0],
        rand_flip=False,
        is_train=False),
    # dict(
    #     type='PointsRangeFilter',
    #     point_cloud_range=point_cloud_range),
    dict(
        type='Pack3DDetInputs',
        keys=['img', 'points', 'gt_bboxes_3d', 'gt_labels_3d'],
        meta_keys=[
            'cam2img', 'ori_cam2img', 'lidar2cam', 'lidar2img', 'cam2lidar',
            'ori_lidar2img', 'img_aug_matrix', 'box_type_3d', 'sample_idx',
            'lidar_path', 'img_path', 'num_pts_feats', 'num_views'
        ])
]

train_dataloader = dict(
    batch_size=1, # changed from 2 to 1
    num_workers=1, # changed from 4 back to 1
    persistent_workers=True,
    sampler=dict(type='DefaultSampler', shuffle=True), #shuffle
    dataset=dict(
        type='CBGSDataset',
        dataset=dict(
            type=dataset_type,
            data_root=data_root,
            ann_file='nuscenes_infos_train.pkl',
            pipeline=train_pipeline,
            metainfo=metainfo,
            modality=input_modality,
            test_mode=False,
            data_prefix=data_prefix,
            use_valid_flag=True,
            # we use box_type_3d='LiDAR' in kitti and nuscenes dataset
            # and box_type_3d='Depth' in sunrgbd and scannet dataset.
            box_type_3d='LiDAR')))
val_dataloader = dict(
    batch_size=1,
    num_workers=1,
    persistent_workers=True,
    drop_last=False,
    sampler=dict(type='DefaultSampler', shuffle=False),
    dataset=dict(
        type=dataset_type,
        data_root=data_root,
        ann_file='nuscenes_infos_val.pkl',
        pipeline=test_pipeline,
        metainfo=metainfo,
        modality=input_modality,
        data_prefix=data_prefix,
        test_mode=True, # test_mode was True; perhaps that does not make sense for a val_dataloader?
        box_type_3d='LiDAR',
        backend_args=backend_args))
test_dataloader = val_dataloader

val_evaluator = dict(
    type='NuScenesMetric',
    data_root=data_root,
    ann_file=data_root + 'nuscenes_infos_val.pkl',
    metric='bbox',
    backend_args=backend_args)
test_evaluator = val_evaluator

vis_backends = [dict(type='LocalVisBackend')]
visualizer = dict(
    type='Det3DLocalVisualizer', vis_backends=vis_backends, name='visualizer')

# learning rate
# lr = 0.0001
lr = 2e-5 # changed from 2e-4
param_scheduler = [
    # learning rate scheduler
    # During the first 8 epochs, the learning rate is cosine-annealed from lr up to lr * 6
    # (the reference config used lr * 10); during the next 12 epochs it is annealed down to
    # lr * 1e-2 (the reference used lr * 1e-4).
    dict(
        type='CosineAnnealingLR',
        T_max=8,
        eta_min=lr * 6, # changed from 10
        begin=0,
        end=8,
        by_epoch=True,
        convert_to_iter_based=True),
    dict(
        type='CosineAnnealingLR',
        T_max=12,
        eta_min=lr * 1e-2, # changed from -4
        begin=8,
        end=20,
        by_epoch=True,
        convert_to_iter_based=True),
    # momentum scheduler
    # During the first 8 epochs, momentum increases from 0 to 0.85 / 0.95
    # during the next 12 epochs, momentum increases from 0.85 / 0.95 to 1
    dict(
        type='CosineAnnealingMomentum',
        T_max=8,
        eta_min=0.85 / 0.95,
        begin=0,
        end=8,
        by_epoch=True,
        convert_to_iter_based=True),
    dict(
        type='CosineAnnealingMomentum',
        T_max=12,
        eta_min=1,
        begin=8,
        end=20,
        by_epoch=True,
        convert_to_iter_based=True)
]

# runtime settings
train_cfg = dict(by_epoch=True, max_epochs=20, val_interval=1) # Do Kyoung had changed this to 10
val_cfg = dict()
test_cfg = dict()

'''
load_from and resume:

load_from: specifies the path of a pretrained (or partially trained) checkpoint whose weights you want to
            continue training from. Setting load_from to None trains from scratch.

            Here is an example of how you might use load_from to start training from a pretrained model:

            load_from = "/home/a0271391/code/edgeai-mmdetection3d/projects/BEVFusion/models/camera-only-det_converted_copy.pth"

resume: be aware that resume=True means you want to resume training from the specific epoch and step at which
        the checkpoint in load_from was saved (e.g. epoch 7, step 19200/30000). If you only want to initialize
        the weights and do not care about resuming from exactly where training stopped, leave it False.
'''
load_from = None
resume = False # resume from the checkpoint defined in load_from

optim_wrapper = dict(
    type='OptimWrapper',
    optimizer=dict(type='AdamW', lr=lr, weight_decay=0.01),
    clip_grad=dict(max_norm=35, norm_type=2))

# Default setting for scaling LR automatically
#   - `enable` means enable scaling LR automatically
#       or not by default.
#   - `base_batch_size` = (8 GPUs) x (4 samples per GPU).
auto_scale_lr = dict(enable=False, base_batch_size=1)
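# Note (my own observation, not part of the original recipe): with enable=False the base_batch_size value is
# unused. If enable=True, MMEngine scales lr linearly by (total actual batch size) / base_batch_size; the
# reference recipe described above corresponds to base_batch_size = 32 (8 GPUs x 4 samples per GPU), while
# 4 GPUs at batch_size=1 gives an effective batch size of only 4.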
log_processor = dict(window_size=50)

'''
HOOKS

Hooks are objects that run alongside actively running code, for example logging information at the end of an
epoch. They are defined in mmdet3d/engine. Hooks are typically used to add new behavior on top of a
predefined module without modifying it.

EX: Suppose you want your dataloader to pick up additional data every 3 epochs while training. You could
modify the training source code, or you could write a hook that adds that behavior on top of the base code.
You then enable the hook in the config when you want the extra functionality, or leave it out for the base
behavior.

Here, hooks are used for logging information such as the time taken to train an epoch.
The DisableObjectSampleHook simply stops the ObjectSample augmentation after a specified epoch (epoch 15).
'''
default_hooks = dict(
    logger=dict(type='LoggerHook', interval=50),
    checkpoint=dict(type='CheckpointHook', interval=1))
custom_hooks = [dict(type='DisableObjectSampleHook', disable_after_epoch=15)]
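
# A minimal custom-hook sketch (illustrative only; PrintEpochHook and its behavior are hypothetical, but the
# registration pattern follows MMEngine). Kept commented out so it does not affect this config; in practice
# the class would live in a module listed in custom_imports rather than in the config file itself:
#
# from mmengine.hooks import Hook
# from mmdet3d.registry import HOOKS
#
# @HOOKS.register_module()
# class PrintEpochHook(Hook):
#     """Log a short message at the end of every training epoch."""
#
#     def after_train_epoch(self, runner):
#         runner.logger.info(f'Finished epoch {runner.epoch + 1}')
#
# custom_hooks = [dict(type='DisableObjectSampleHook', disable_after_epoch=15),
#                 dict(type='PrintEpochHook')]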

Reproduces the problem - command or script

bash tools/dist_train.sh projects/BEVFusion/configs/bevfusion_cam_swint_centerpoint_nus-3d.py 4

Reproduces the problem - error message

No error message; the issue is that even after 20 epochs the mAP and NDS are extremely poor. The loss gets down to about 6.x.

Additional information

  1. I expected training results to be similar to MIT's camera-only results.
  2. I used the nuScenes dataset
  3. I suspect there is an issue in my setup in the configuration file. I have included the configuration I have been using for image-only BEVFusion.
@ymlab

ymlab commented Sep 21, 2024

Same problem.

@gorkemguzeler

Hi @ymlab @abubake,

I have a question regarding the training:

I am curious how much time the training takes per epoch and how many GPUs you use. I am particularly interested in lidar-only training, if you have any experience with that.

@abubake
Author

abubake commented Oct 9, 2024 via email

@gorkemguzeler

Thanks a lot for sharing your experience @abubake, it helps! Were you able to reproduce good results (comparable to the paper) with lidar-only training?

I plan to work with this repository for my thesis and don't want to waste time if the code is not working as expected, so any feedback is valuable to me :)

@mdessl

mdessl commented Oct 20, 2024

@gorkemguzeler the repo is working as expected for me. I haven't trained lidar-only, but I got 65 mAP after 3 epochs of training the BEVFusion model with the lidar-only base. Oh, and it took about 2 h per epoch on 8x 3090 with bs 2 and LR scaling enabled.

Btw we are in the same boat. I am also doing my thesis on multimodal learning :)

@curiosity654

@mdessl Hi, I'm also working on multimodal 3D detection. I'm curious: by bs 2, do you mean 2 samples per GPU or 2 for the whole 8 GPUs? The 3080 seems to have only 12 GB of memory. I have trained the BEVFusion of this repo on 2x A5000 with a batch size of 4 (with LR scaling) and cannot match the reported result of 71.4 NDS. After using gradient accumulation to simulate a batch size of 32, the performance is much better, at approximately 70.9 NDS.

For the multimodal model, my concern is that the camera branch of this repo is too dependent on LiDAR, since it uses DepthLSS instead of the original LSS transform.
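
(For reference, a minimal way to express this kind of gradient accumulation in an MMEngine config is via accumulative_counts on the optimizer wrapper; the values below are only an illustration, not the exact settings used above:)

optim_wrapper = dict(
    type='OptimWrapper',
    optimizer=dict(type='AdamW', lr=2e-4, weight_decay=0.01),
    clip_grad=dict(max_norm=35, norm_type=2),
    # accumulate gradients over N iterations before each optimizer step,
    # so the effective batch size becomes N x (per-GPU batch) x (num GPUs)
    accumulative_counts=4)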

@mdessl

mdessl commented Oct 21, 2024

@curiosity654 ohh sry it was a typo. I meant 3090 (24G RAM), so bs 2 per GPU.

Do you think the issue could have to do with the batchnorm layers? I think BN is not so compatible with gradient accumulation and I am not sure what you could do about it.

@gorkemguzeler

@mdessl , thanks a lot for the feedback 👍

oh, good luck on your thesis :)
