Setting start_batch seems to have no effect #1783

alanshaoTT · 2024-10-26T13:39:29Z

My training run was interrupted at batch_16000.pt and I want to resume it. My script is set up like this:
./train.py \
  --world-size 8 \
  --num-epochs 30 \
  --start-batch 16000 \
  --max-duration 40 \
  --num-buckets 100 \
  --on-the-fly-feats true \
  --exp-dir ./exp \
  --bpe-model data/lang_bpe_2000/bpe.model
But the model still seems to resume training from batch 0.

Do I need to add code to train_one_epoch to skip batches? Does anything else need to be modified? The recipe I am using is librispeech/ASR/pruned_transducer_stateless7.

csukuangfj · 2024-10-28T04:07:00Z

Which .pt files are in your exp directory?

"My training run was interrupted at batch_16000.pt"

The code looks for checkpoint-16000.pt, not batch_16000.pt.
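As a quick way to check this before relaunching, here is a minimal sketch that looks for the file name mentioned above. It assumes only the `checkpoint-<N>.pt` naming pattern from this reply; the helper names `expected_checkpoint`, `saved_batches`, and `check_resume` are hypothetical and are not part of icefall.

```python
import re
from pathlib import Path
from typing import List


def expected_checkpoint(exp_dir: Path, start_batch: int) -> Path:
    # File name pattern taken from the reply above: checkpoint-<N>.pt
    return exp_dir / f"checkpoint-{start_batch}.pt"


def saved_batches(exp_dir: Path) -> List[int]:
    """Return the sorted batch indices of all checkpoint-<N>.pt files."""
    pattern = re.compile(r"checkpoint-(\d+)\.pt")
    found = []
    for path in exp_dir.glob("checkpoint-*.pt"):
        match = pattern.fullmatch(path.name)
        if match:
            found.append(int(match.group(1)))
    return sorted(found)


def check_resume(exp_dir: Path, start_batch: int) -> bool:
    """Report whether the checkpoint needed for --start-batch exists."""
    target = expected_checkpoint(exp_dir, start_batch)
    if target.is_file():
        print(f"Found {target}")
        return True
    print(f"Missing {target}; available batches: {saved_batches(exp_dir)}")
    return False
```

Calling `check_resume(Path("./exp"), 16000)` with the settings from the original post would print whether `./exp/checkpoint-16000.pt` is present and, if it is not, list the batch indices that are.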

alanshaoTT · 2024-10-29T15:45:43Z

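That was a typo on my part. The exp directory does contain checkpoint-16000.pt and others; a checkpoint is saved every 1000 steps. I modified the training loop like this:
for batch_idx, batch in enumerate(train_dl):
    if params.batch_idx_train >= batch_idx:
        if batch_idx % 100 == 0:
            logging.info(f"Batch index {batch_idx} is reached.")
        continue
However, the log shows that the model had already been loading data for a long time before "Batch index 0 is reached" appeared, and it printed many messages about data filtered out by the remove_short_and_long_utt(c: Cut) function. So I suspect that, based on --start-batch, the data before the start batch has already been skipped in train_dl, and the skipping code I added is not needed. Is that right?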
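For reference, here is a toy sketch of the skip-by-`continue` approach described above. It uses a plain generator as a hypothetical stand-in for `train_dl` (not the icefall or Lhotse data loader) and a simplified condition `batch_idx < resume_idx`, which differs from the condition in the post. It only illustrates one general point: batches skipped this way are still produced by the loader, so the skip itself does not avoid the loading and filtering work.

```python
from typing import Iterator, List, Tuple


def toy_loader(num_batches: int, produced: List[int]) -> Iterator[int]:
    """Hypothetical stand-in for train_dl: records every batch it builds."""
    for i in range(num_batches):
        produced.append(i)  # loading and filtering work would happen here
        yield i


def train_with_skip(num_batches: int, resume_idx: int) -> Tuple[List[int], List[int]]:
    """Skip batches before resume_idx and return (produced, trained) indices."""
    produced: List[int] = []
    trained: List[int] = []
    for batch_idx, batch in enumerate(toy_loader(num_batches, produced)):
        if batch_idx < resume_idx:
            continue  # skipped, but the loader has already built this batch
        trained.append(batch)
    return produced, trained


produced, trained = train_with_skip(num_batches=6, resume_idx=4)
print(produced)  # every batch index, including the skipped ones
print(trained)   # only the indices from resume_idx onward
```

Whether the recipe already fast-forwards the data when `--start-batch` is given is not answered in this thread, so this sketch makes no claim about that.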
