
About the synthetic data in create_synthetic_data.py #3

Open
Tough-Stone opened this issue May 30, 2023 · 8 comments
Comments

@Tough-Stone

Hello, I am studying this code. In this script that creates the synthetic data, what do the three lists that are produced — incorrect_input_ids_list, label_ids_list, target_ids_list — mean, and where do they correspond to in the paper? Thanks.

@NLPCode
Owner

NLPCode commented May 30, 2023

incorrect_input_ids_list: encoder input
label_ids_list: encoder labels
target_ids_list: decoder labels
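To make the relationship between the three lists concrete, here is a minimal sketch of how such a triple could be built from one correct sentence. The label scheme (COPY / REPLACE / INSERT) and the corruption probabilities are assumptions for illustration only; the repo's actual `create_synthetic_data.py` may use a different scheme.

```python
import random

# Hypothetical per-token label scheme (assumption, not necessarily the repo's):
COPY, REPLACE, INSERT = 0, 1, 2

def make_synthetic_example(token_ids, vocab_size,
                           drop_prob=0.2, replace_prob=0.1, seed=None):
    """Corrupt a correct sentence into (encoder input, encoder labels).
    The decoder label is simply the original sentence."""
    rng = random.Random(seed)
    incorrect_input_ids, label_ids = [], []
    pending_insert = False  # set when the previous token was dropped
    for tok in token_ids:
        if rng.random() < drop_prob and len(token_ids) > 1:
            pending_insert = True  # the model must re-insert a token here
            continue
        if rng.random() < replace_prob:
            # Swap in a random wrong token; the encoder should flag it.
            incorrect_input_ids.append(rng.randrange(vocab_size))
            label_ids.append(REPLACE)
        else:
            incorrect_input_ids.append(tok)
            label_ids.append(INSERT if pending_insert else COPY)
        pending_insert = False
    # A trailing deletion would need an end-of-sequence insert label;
    # omitted here for brevity.
    target_ids = list(token_ids)  # decoder label: the original sentence
    return incorrect_input_ids, label_ids, target_ids
```

With no corruption, the encoder input equals the target and every label is COPY, which makes the roles of the three lists easy to check.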

@Tough-Stone
Author

Thank you.
I don't quite understand the meaning of the name "incorrect"... Do these correspond to X, Y, and Ym in the paper?
Looking at the sample data in the code, Dev.txt has 10 lines of text, but the generated incorrect_input_ids_list, label_ids_list, and target_ids_list each have 50 entries, because the construction uses "for i in range(5):". Why is that?

@Tough-Stone
Author

A follow-up question: does the data in keywork.txt participate in training, and what is the role of the ground-truth it contains? Can the ground-truth of this task be understood as extracting some of the words from the original sentence and then recovering the original sentence?

@NLPCode
Owner

NLPCode commented May 31, 2023

When constructing the data, a random sampling operation is used, so one example can be turned into multiple pseudo-examples.
keywork.txt is essentially the test set; its data does not participate in training. The ground-truth is used to evaluate the quality of the generated text.
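The "one example yields multiple pseudo-examples" point also explains the 10-vs-50 count above: each source line passes through the sampling loop several times. A minimal sketch, with a toy corruption function standing in for the repo's actual sampling:

```python
import random

def corrupt(token_ids, rng):
    """Toy corruption: drop each token with probability 0.3.
    A stand-in for the repo's real sampling; illustration only."""
    kept = [t for t in token_ids if rng.random() > 0.3]
    return kept or token_ids[:1]  # never return an empty input

def build_pseudo_data(sentences, copies=5, seed=0):
    """Each sentence is sampled `copies` times, so 10 source lines
    produce 50 (incorrect input, target) pairs."""
    rng = random.Random(seed)
    incorrect_input_ids_list, target_ids_list = [], []
    for token_ids in sentences:
        for _ in range(copies):  # mirrors "for i in range(5):"
            incorrect_input_ids_list.append(corrupt(token_ids, rng))
            target_ids_list.append(list(token_ids))
    return incorrect_input_ids_list, target_ids_list
```

Because the corruption is random, the five copies of a sentence are (usually) different pseudo-examples, which is the point of resampling.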

@Tough-Stone
Author

Thanks for the answer. So the training objective of this task is to extract part of the original sentence and then recover the original sentence?

@NLPCode
Owner

NLPCode commented May 31, 2023

Yes.
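The objective confirmed above can be sketched in a few lines: sample a subset of words from a sentence as the constrained input, and use the full sentence as the reconstruction target. The keep ratio and sampling strategy here are hypothetical, not taken from the repo.

```python
import random

def make_training_pair(sentence, keep_ratio=0.4, seed=None):
    """Sample some words as keywords (model input); the target is the
    full original sentence. A hypothetical sketch of the objective."""
    rng = random.Random(seed)
    words = sentence.split()
    k = max(1, int(len(words) * keep_ratio))
    keep_idx = sorted(rng.sample(range(len(words)), k))  # preserve word order
    keywords = [words[i] for i in keep_idx]
    return keywords, sentence  # (constrained input, reconstruction target)
```

At test time the keywords come from keywork.txt instead of being sampled, and the ground-truth sentence is used only for evaluation, matching the earlier answer.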

@Tough-Stone
Author

Thanks. I now have two more questions:
When constructing the synthetic data, what strategy is used to obtain the encoder labels?
If I want to generate longer sentences from fewer keywords at inference time, what should I modify? In some other test cases I tried, almost no new words were inserted between the keywords; instead, everything was appended at the end of the sentence.

@Tough-Stone
Author

When using Chinese, each keyword is more than one token, so at inference time indicate_labels contains many 0s in the middle, and the newly inserted words all end up at the end of the sentence. What is the cause of this?
