
Waymo training loss decreases very slowly or even not at all. #8

Open
zppppppx opened this issue Jan 26, 2024 · 2 comments

@zppppppx

Hi, very good work! It is fast at inference and very lightweight.

May I know how the training went on your machine? I finished training, but according to the training logs, the loss basically stays the same. Since I had made a few changes to the code, I am now trying to train with the completely original code. The loss starts to decrease, but very slowly.

I did not change much of the code; I made two major changes:

  1. When processing the Waymo dataset with your code, an error was raised because the file client has no `_map_path` method. I changed the code to:

```python
def process_single_sequence(sequence_file, save_path, sampled_interval, client, has_label=True, use_two_returns=True):
    sequence_name = os.path.splitext(os.path.basename(sequence_file))[0]

    # print('Load record (sampled_interval=%d): %s' % (sampled_interval, sequence_name))
    if not client.exists(sequence_file):
        print('NotFoundError: %s' % sequence_file)
        return []

    # dataset = tf.data.TFRecordDataset(client._map_path(sequence_file), compression_type='')
    dataset = tf.data.TFRecordDataset(str(sequence_file), compression_type='')
    cur_save_dir = save_path / sequence_name
    cur_save_dir.mkdir(parents=True, exist_ok=True)
    ........
```
  2. For the first training run, I changed dist_train.sh to the OpenPCDet format, which is:

```bash
#!/usr/bin/env bash
set -x
NGPUS=$1
PY_ARGS=${@:2}

echo "#######################################" $PY_ARGS

while true
do
    PORT=$(( ((RANDOM<<15)|RANDOM) % 49152 + 10000 ))
    status="$(nc -z 127.0.0.1 $PORT < /dev/null &>/dev/null; echo $?)"
    if [ "${status}" != "0" ]; then
        break;
    fi
done
echo $PORT

python3 -m torch.distributed.launch --nproc_per_node=${NGPUS} --master_port $PORT train.py --launcher pytorch ${PY_ARGS}
```
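For reference, the intent of the nc-based port-probing loop above can be sketched in Python by simply asking the OS for an unused port (a generic alternative, not code from this repo):

```python
import socket

def find_free_port():
    """Return an unused TCP port, mirroring the intent of the
    while/nc loop in dist_train.sh: binding to port 0 makes the
    kernel assign a port that is guaranteed to be free right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))  # port 0 -> OS picks a free port
        return s.getsockname()[1]
```

Note the usual caveat for both approaches: the port could in principle be taken by another process between the check and the launch.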

The first training log is attached as well.
train-waymo-pvt-ssd.log
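As an aside, the `_map_path` change described above can be made version-agnostic with a small helper that uses `_map_path` when the file client provides it and falls back to the plain path otherwise (the helper name `resolve_record_path` is hypothetical; `client` follows the snippet above):

```python
def resolve_record_path(client, sequence_file):
    """Return a path string for tf.data.TFRecordDataset.

    Some file-client implementations expose a private _map_path helper
    that rewrites paths (e.g. for remote storage); others do not.
    Use it when present, otherwise fall back to the raw path."""
    map_path = getattr(client, "_map_path", None)
    if callable(map_path):
        return str(map_path(sequence_file))
    return str(sequence_file)
```

This avoids hard-coding either branch, so the same preprocessing script runs against both client versions.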

@zppppppx
Author

BTW, to speed things up a bit, I used only 20% of the data, but the behavior is the same when I train on the whole dataset.

@Nightmare-n
Owner

It may be caused by the version of spconv; see #6 .
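A quick way to check which spconv generation is installed (a generic sketch; the exact version requirement is discussed in #6 and not reproduced here):

```python
def spconv_major_version(version_string):
    """Parse the major version out of spconv.__version__
    (e.g. '2.3.6' -> 2).  spconv 1.x and 2.x have different
    sparse-conv kernels, so a mismatch with the codebase can
    change training behavior without raising an error."""
    return int(version_string.split(".")[0])

# Usage (with spconv installed):
# import spconv
# print(spconv_major_version(spconv.__version__))
```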
