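Pretraining data is concatenated first and then split into block_size chunks, which can easily leave a single sample with unrelated context #724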
Unanswered
sameul-yuan
asked this question in
Q&A
Replies: 0 comments
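In the pretraining data processing, the whole pt_sample_data.txt is first concatenated and then split into chunks of block_size. This can put completely unrelated content into the same sample for autoregressive training, which could easily make the model produce nonsense. Is this how pretraining data is usually processed?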
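To make the question concrete, here is a minimal sketch of the concatenate-then-chunk step as I understand it. It is not the repository's actual code: `concat_and_chunk`, `count_mixed_blocks`, `EOS_ID`, and the toy token ids are all hypothetical names and values chosen for illustration, and I am assuming an end-of-document token is appended after each document before concatenation.

```python
from typing import List

EOS_ID = 0  # hypothetical end-of-document token id (assumption, not from the repo)


def concat_and_chunk(docs: List[List[int]], block_size: int,
                     eos_id: int = EOS_ID) -> List[List[int]]:
    """Concatenate tokenized documents into one stream, then cut fixed-size blocks."""
    stream: List[int] = []
    for doc in docs:
        stream.extend(doc)
        stream.append(eos_id)  # mark the document boundary
    # Drop the tail that does not fill a whole block.
    usable = (len(stream) // block_size) * block_size
    return [stream[i:i + block_size] for i in range(0, usable, block_size)]


def count_mixed_blocks(blocks: List[List[int]], eos_id: int = EOS_ID) -> int:
    """Count blocks where a document boundary falls before the last position,
    i.e. blocks that contain tokens from more than one document."""
    return sum(1 for block in blocks if eos_id in block[:-1])


# Three toy "documents" that are already tokenized.
docs = [[11, 12, 13], [21, 22], [31, 32, 33, 34]]
blocks = concat_and_chunk(docs, block_size=4)
mixed = count_mixed_blocks(blocks)
print(blocks)  # [[11, 12, 13, 0], [21, 22, 0, 31], [32, 33, 34, 0]]
print(mixed)   # 1
```

In this toy run the second block holds the end of one document and the start of the next, which is exactly the situation I am asking about.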