Setting start_batch seems to have no effect #1783

Open
alanshaoTT opened this issue Oct 26, 2024 · 2 comments

Comments

@alanshaoTT

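My model's training was interrupted at batch_16000.pt and I want to resume it. My script is set up like this:
./train.py \
  --world-size 8 \
  --num-epochs 30 \
  --start-batch 16000 \
  --max-duration 40 \
  --num-buckets 100 \
  --on-the-fly-feats true \
  --exp-dir ./exp \
  --bpe-model data/lang_bpe_2000/bpe.model
But the model still seems to start training from batch 0.
[image]
Do I need to add code to train_one_epoch to skip batches? Does anything else need to be changed? The recipe I am using is librispeech/ASR/pruned_transducer_stateless7.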

@csukuangfj
Collaborator

Which .pt files are in your exp directory?

"My model's training was interrupted at batch_16000.pt"

The code looks for checkpoint-16000.pt, not batch_16000.pt.
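
The check suggested above can be sketched in a few lines of Python. This is only an illustration, not code from the recipe: the helper names `list_pt_files` and `find_resume_checkpoint` are hypothetical, and the only assumption taken from this thread is the file name pattern `checkpoint-<start_batch>.pt` inside the exp directory.

```python
from pathlib import Path
from typing import List, Optional


def list_pt_files(exp_dir: str) -> List[str]:
    """Return the sorted names of all .pt files directly inside exp_dir."""
    return sorted(p.name for p in Path(exp_dir).glob("*.pt"))


def find_resume_checkpoint(exp_dir: str, start_batch: int) -> Optional[Path]:
    """Return exp_dir/checkpoint-<start_batch>.pt if that file exists, else None.

    The naming pattern is the one mentioned in this thread; the helper itself
    is hypothetical and not part of train.py.
    """
    candidate = Path(exp_dir) / f"checkpoint-{start_batch}.pt"
    return candidate if candidate.is_file() else None
```

For example, `find_resume_checkpoint("./exp", 16000)` returning `None` would indicate that the file expected for `--start-batch 16000` is missing from `./exp`.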

@alanshaoTT
Author

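That was a typo on my part. The exp directory does contain checkpoint-16000.pt and others; a checkpoint is saved every 1000 steps. I made the following change:
for batch_idx, batch in enumerate(train_dl):
    if params.batch_idx_train >= batch_idx:
        if batch_idx % 100 == 0:
            logging.info(f"Batch index {batch_idx} is reached.")
        continue
However, the log shows that before "Batch index 0 is reached" was printed, a lot of data had already been loaded, and many messages were printed for utterances filtered out by the remove_short_and_long_utt(c: Cut) function. So I suspect that train_dl has already skipped the data up to --start-batch based on that option, and that the skipping code I added is not needed. Is that right?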
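
One general point that may explain the log order: skipping batches with `continue` inside an `enumerate` loop does not avoid loading them, because each skipped batch still has to be produced by the data pipeline first. The toy sketch below shows this with plain Python iterators. It is not icefall or lhotse code; `lazy_filter`, `resume_loop`, and the filter rule are made up for illustration, and it says nothing about what `--start-batch` does to `train_dl` in the real recipe.

```python
from typing import Iterable, Iterator, List, Tuple


def lazy_filter(items: Iterable[int], seen: List[int]) -> Iterator[int]:
    """Toy stand-in for a lazily filtered data pipeline.

    Every item that is pulled from the source is recorded in `seen`,
    whether or not it passes the filter.
    """
    for item in items:
        seen.append(item)      # the "loading" cost is paid here
        if item % 5 != 0:      # arbitrary toy rule, drops multiples of 5
            yield item


def resume_loop(items: Iterable[int], start_batch: int) -> Tuple[List[int], List[int]]:
    """Skip the first `start_batch` batches, then "train" on the rest."""
    seen: List[int] = []
    trained: List[int] = []
    for batch_idx, batch in enumerate(lazy_filter(items, seen)):
        if batch_idx < start_batch:
            continue           # skipped, but already loaded and filtered
        trained.append(batch)
    return seen, trained
```

Calling `resume_loop(range(10), 3)` trains only on the batches after the first three, yet `seen` still contains all ten source items. In a real data loader the same effect would show up as filter messages appearing in the log before the first non-skipped batch is reached.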
