Ran a simple example training with merge_sample.jsonw; the resulting checkpoints test poorly #94

grantchenhuarong · 2023-04-20T04:05:25Z

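Hi, I ran a simple training with bash finetune.sh, starting from checkpoint-11600 and using the example program, and it produced the 11800 adapter as expected.
I then launched bash generate.sh for a quick test. Compared with the responses from the original 11600, the newly generated LoRA model shows the following problems: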

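1. Most answers now come back in English. Why is that?
2. The summarization task always ends in OOM. For example, with this prompt (an instruction to summarize the passage in 30 characters, followed by a paragraph listing common NLP tasks): “请用30字总结下文：在我们介绍Transformers之前，我们先了解下NLP主要解决的问题是什么。下面就列出一些常见的NLP任务：句子分类：例如影评的情感分析，检测一封电子邮件是否为垃圾邮件，确定一个句子是否在语法上正确，或者两个句子是否在逻辑上相关。给句子里每个词分类：例如识别句子的语法成分（名词、动词、形容词）或命名实体（人、地点、组织）。内容生成：例如自动写诗，填充句子中的空白。答案抽取：例如给定一个问题和上下文，根据上下文提供的信息提取问题的答案。根据输入生成一个新的句子：例如机器翻译，文本摘要。” By contrast, checkpoint-11600 responds to it normally.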

grantchenhuarong · 2023-04-20T04:06:52Z

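So I'd like to ask: what corpus size and how many training steps are needed to avoid problems like these? This is presumably hard-won engineering experience, so any guidance would be much appreciated. I'm here to learn from the experts.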

grantchenhuarong · 2023-04-20T04:10:14Z

Additional info: CentOS 7, python==3.8.16, single machine with a 2080 Ti (11 GB).

grantchenhuarong · 2023-04-20T04:20:29Z

I tested again. For the summarization task above, the provided checkpoint-final model fails in the same way:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 16.00 MiB (GPU 0; 10.75 GiB total capacity; 9.23 GiB already allocated; 13.50 MiB free; 9.96 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
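
The error message above suggests max_split_size_mb only when reserved memory is much larger than allocated memory. As a rough check, the sketch below parses the figures from that exact message and compares them. The regular expressions are written against this one message format and are an assumption, not a general parser for every PyTorch version.

```python
import re

# The message text is copied from the traceback above.
OOM_MESSAGE = (
    "CUDA out of memory. Tried to allocate 16.00 MiB (GPU 0; 10.75 GiB total capacity; "
    "9.23 GiB already allocated; 13.50 MiB free; 9.96 GiB reserved in total by PyTorch)"
)

UNIT_TO_GIB = {"MiB": 1 / 1024, "GiB": 1.0}


def parse_oom(message):
    """Extract the memory figures (converted to GiB) from a CUDA OOM message."""
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) (MiB|GiB)",
        "total": r"([\d.]+) (MiB|GiB) total capacity",
        "allocated": r"([\d.]+) (MiB|GiB) already allocated",
        "free": r"([\d.]+) (MiB|GiB) free",
        "reserved": r"([\d.]+) (MiB|GiB) reserved",
    }
    figures = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, message)
        if match:
            figures[name] = float(match.group(1)) * UNIT_TO_GIB[match.group(2)]
    return figures


figures = parse_oom(OOM_MESSAGE)
slack = figures["reserved"] - figures["allocated"]
print(f"reserved - allocated = {slack:.2f} GiB")
print(f"allocated / total    = {figures['allocated'] / figures['total']:.0%}")
```

Here the gap between reserved and allocated memory is about 0.73 GiB, while roughly 86% of the card is already allocated. That pattern looks more like the GPU simply being full than like fragmentation, so tuning max_split_size_mb alone seems unlikely to help much.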

grantchenhuarong · 2023-04-20T04:32:20Z

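I also tested Chinese-Vicuna-lora-7b-3epoch-belle-and-guanaco, the public model downloaded from Hugging Face. It behaves the same as checkpoint-final: on the handful of questions I prepared, the results are identical, and the final summarization task also goes OOM.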

grantchenhuarong · 2023-04-20T04:34:08Z

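To summarize, these are the questions I'd like to ask:
1. How can I keep the model from frequently replying in English after my experimental continued training?
2. How can the summarization task avoid OOM? In my tests, only the 11600 model was able to produce a summary.
3. Inference with my own continue-trained model is also a good two times slower by comparison. Why is that?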

grantchenhuarong · 2023-04-20T06:16:22Z

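With checkpoint-11600 on the summarization task, OOM also occurs whenever the model fails to summarize the text well. I can watch GPU memory usage climb steadily, which is quite a headache.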
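
Memory that climbs steadily during generation is consistent with the attention key/value cache growing by one entry per generated token. The sketch below gives a back-of-the-envelope estimate. It assumes the LLaMA-7B shape (32 layers, hidden size 4096) and a 16-bit cache; the function name kv_cache_bytes is hypothetical, and real usage will be higher because of activations and beam-search bookkeeping.

```python
MIB = 1024 ** 2


def kv_cache_bytes(seq_len, num_beams=1, num_layers=32, hidden_size=4096,
                   bytes_per_value=2):
    """Estimate the key/value cache size for a decoder-only transformer.

    Each layer stores one key and one value tensor of shape
    [num_beams, seq_len, hidden_size]. The defaults are assumptions:
    LLaMA-7B dimensions and 2 bytes per value (fp16).
    """
    return 2 * num_layers * num_beams * seq_len * hidden_size * bytes_per_value


# seq_len counts prompt tokens plus generated tokens.
for seq_len in (256, 512, 1024):
    for num_beams in (1, 4):
        size_mib = kv_cache_bytes(seq_len, num_beams) / MIB
        print(f"seq_len={seq_len:5d} num_beams={num_beams} -> {size_mib:7.0f} MiB")
```

Under these assumptions the cache costs about 0.5 MiB per token per beam, so a 512-token sequence with 4 beams needs around 1 GiB for the cache alone. On an 11 GB card that is already more than 9 GiB full, a long prompt plus a generation that never terminates can plausibly exhaust the remaining memory.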

Facico · 2023-04-20T10:32:41Z

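1. I suspect your continue_finetune settings are wrong. You can refer to this document.
2. OOM should not depend on which model you use; it depends on the input and output lengths (GPU memory keeps rising because the model is still generating). Try limiting max_new_token and reducing beam_num.
3. I don't know how you configured this continued training. You could check whether the file sizes are the same.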
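
For the file-size check in point 3, a minimal sketch is shown below. It lists the files in two checkpoint directories and reports those whose sizes differ. The helper names are hypothetical, and the sketch makes no assumption about which files a given checkpoint contains.

```python
import os


def checkpoint_file_sizes(checkpoint_dir):
    """Map each regular file in a checkpoint directory to its size in bytes."""
    sizes = {}
    for name in sorted(os.listdir(checkpoint_dir)):
        path = os.path.join(checkpoint_dir, name)
        if os.path.isfile(path):
            sizes[name] = os.path.getsize(path)
    return sizes


def compare_checkpoints(reference_dir, candidate_dir):
    """Return {file name: (reference size, candidate size)} for mismatches.

    A size of None means the file is missing from that directory.
    """
    reference = checkpoint_file_sizes(reference_dir)
    candidate = checkpoint_file_sizes(candidate_dir)
    return {
        name: (reference.get(name), candidate.get(name))
        for name in sorted(set(reference) | set(candidate))
        if reference.get(name) != candidate.get(name)
    }
```

Calling compare_checkpoints with the paths of the original checkpoint-11600 directory and the newly produced one would show at a glance whether the new adapter weights file is much smaller or larger than the reference, which would hint that the adapter was not saved or loaded as intended.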

grantchenhuarong · 2023-04-21T00:34:44Z

Thanks for the guidance. I'll refer to your medical Q&A example and try training a knowledge service for classical Chinese poetry.

grantchenhuarong closed this as completed Apr 21, 2023
