
Waymo training loss decreases very slowly or even not at all. #8

Open
zppppppx opened this issue Jan 26, 2024 · 2 comments

@zppppppx

Hi, very good work! It is fast at inference and very lightweight.

May I know how the training went on your machine? I finished training, but according to the training logs, the loss basically stays the same. Since I had made a few changes to the code, I am now trying to train with the completely original code. The loss starts to decrease, but very slowly.

I did not change much of the code; I made two major changes:

  1. When processing the Waymo dataset with your code, an error was raised because the file client has no `_map_path` method. I changed the code to:

```python
def process_single_sequence(sequence_file, save_path, sampled_interval, client, has_label=True, use_two_returns=True):
    sequence_name = os.path.splitext(os.path.basename(sequence_file))[0]

    # print('Load record (sampled_interval=%d): %s' % (sampled_interval, sequence_name))
    if not client.exists(sequence_file):
        print('NotFoundError: %s' % sequence_file)
        return []

    # dataset = tf.data.TFRecordDataset(client._map_path(sequence_file), compression_type='')
    dataset = tf.data.TFRecordDataset(str(sequence_file), compression_type='')
    cur_save_dir = save_path / sequence_name
    cur_save_dir.mkdir(parents=True, exist_ok=True)
    ........
```
  2. For the first training run, I changed dist_train.sh to the OpenPCDet format, which is:

```bash
#!/usr/bin/env bash
set -x
NGPUS=$1
PY_ARGS=${@:2}

echo "#######################################" $PY_ARGS

while true
do
    PORT=$(( ((RANDOM<<15)|RANDOM) % 49152 + 10000 ))
    status="$(nc -z 127.0.0.1 $PORT < /dev/null &>/dev/null; echo $?)"
    if [ "${status}" != "0" ]; then
        break;
    fi
done
echo $PORT

python3 -m torch.distributed.launch --nproc_per_node=${NGPUS} --master_port $PORT train.py --launcher pytorch ${PY_ARGS}
```
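For reference, the intent of the nc-based port-probing loop above can be sketched in Python by simply asking the OS for an unused port (a generic alternative, not code from this repo):

```python
import socket

def find_free_port():
    """Return an unused TCP port, mirroring the intent of the
    while/nc loop in dist_train.sh: binding to port 0 makes the
    kernel assign a port that is guaranteed to be free right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))  # port 0 -> OS picks a free port
        return s.getsockname()[1]
```

Note the usual caveat for both approaches: the port could in principle be taken by another process between the check and the launch.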

The first training log is attached as well.
train-waymo-pvt-ssd.log
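As an aside, the `_map_path` change described above can be made version-agnostic with a small helper that uses `_map_path` when the file client provides it and falls back to the plain path otherwise (the helper name `resolve_record_path` is hypothetical; `client` follows the snippet above):

```python
def resolve_record_path(client, sequence_file):
    """Return a path string for tf.data.TFRecordDataset.

    Some file-client implementations expose a private _map_path helper
    that rewrites paths (e.g. for remote storage); others do not.
    Use it when present, otherwise fall back to the raw path."""
    map_path = getattr(client, "_map_path", None)
    if callable(map_path):
        return str(map_path(sequence_file))
    return str(sequence_file)
```

This avoids hard-coding either branch, so the same preprocessing script runs against both client versions.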

@zppppppx
Author

BTW, to speed things up a bit, I used only 20% of the data, but the behavior is the same when I train on the whole dataset.

@Nightmare-n
Owner

It may be caused by the version of spconv; see #6 .
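A quick way to check which spconv generation is installed (a generic sketch; the exact version requirement is discussed in #6 and not reproduced here):

```python
def spconv_major_version(version_string):
    """Parse the major version out of spconv.__version__
    (e.g. '2.3.6' -> 2).  spconv 1.x and 2.x have different
    sparse-conv kernels, so a mismatch with the codebase can
    change training behavior without raising an error."""
    return int(version_string.split(".")[0])

# Usage (with spconv installed):
# import spconv
# print(spconv_major_version(spconv.__version__))
```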
