
About the synthetic data in create_synthetic_data.py #3

Open
Tough-Stone opened this issue May 30, 2023 · 8 comments
Comments

@Tough-Stone

Hello, I am studying this code. In this script that creates the synthetic data, what do the three lists that are produced — incorrect_input_ids_list, label_ids_list, target_ids_list — mean, and where do they correspond to in the paper? Thanks.

@NLPCode
Owner

NLPCode commented May 30, 2023

incorrect_input_ids_list: encoder input
label_ids_list: encoder labels
target_ids_list: decoder labels
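To make the relationship between the three lists concrete, here is a minimal sketch of how such a triple could be built from one correct sentence. The label scheme (COPY / REPLACE / INSERT) and the corruption probabilities are assumptions for illustration only; the repo's actual `create_synthetic_data.py` may use a different scheme.

```python
import random

# Hypothetical per-token label scheme (assumption, not necessarily the repo's):
COPY, REPLACE, INSERT = 0, 1, 2

def make_synthetic_example(token_ids, vocab_size,
                           drop_prob=0.2, replace_prob=0.1, seed=None):
    """Corrupt a correct sentence into (encoder input, encoder labels).
    The decoder label is simply the original sentence."""
    rng = random.Random(seed)
    incorrect_input_ids, label_ids = [], []
    pending_insert = False  # set when the previous token was dropped
    for tok in token_ids:
        if rng.random() < drop_prob and len(token_ids) > 1:
            pending_insert = True  # the model must re-insert a token here
            continue
        if rng.random() < replace_prob:
            # Swap in a random wrong token; the encoder should flag it.
            incorrect_input_ids.append(rng.randrange(vocab_size))
            label_ids.append(REPLACE)
        else:
            incorrect_input_ids.append(tok)
            label_ids.append(INSERT if pending_insert else COPY)
        pending_insert = False
    # A trailing deletion would need an end-of-sequence insert label;
    # omitted here for brevity.
    target_ids = list(token_ids)  # decoder label: the original sentence
    return incorrect_input_ids, label_ids, target_ids
```

With no corruption, the encoder input equals the target and every label is COPY, which makes the roles of the three lists easy to check.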

@Tough-Stone
Author

Thank you.
I don't quite understand the meaning of the name "incorrect"... Do these correspond to X, Y, and Ym in the paper?
Looking at the sample data in the code, Dev.txt has 10 lines of text, but the generated incorrect_input_ids_list, label_ids_list, and target_ids_list each have 50 entries, because the construction uses "for i in range(5):". Why is that?

@Tough-Stone
Author

A follow-up question: does the data in keywork.txt participate in training, and what is the role of the ground-truth it contains? Can the ground-truth of this task be understood as extracting some of the words from the original sentence and then recovering the original sentence?

@NLPCode
Owner

NLPCode commented May 31, 2023

When constructing the data, a random sampling operation is used, so one example can be turned into multiple pseudo-examples.
keywork.txt is essentially the test set; its data does not participate in training. The ground-truth is used to evaluate the quality of the generated text.
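The "one example yields multiple pseudo-examples" point also explains the 10-vs-50 count above: each source line passes through the sampling loop several times. A minimal sketch, with a toy corruption function standing in for the repo's actual sampling:

```python
import random

def corrupt(token_ids, rng):
    """Toy corruption: drop each token with probability 0.3.
    A stand-in for the repo's real sampling; illustration only."""
    kept = [t for t in token_ids if rng.random() > 0.3]
    return kept or token_ids[:1]  # never return an empty input

def build_pseudo_data(sentences, copies=5, seed=0):
    """Each sentence is sampled `copies` times, so 10 source lines
    produce 50 (incorrect input, target) pairs."""
    rng = random.Random(seed)
    incorrect_input_ids_list, target_ids_list = [], []
    for token_ids in sentences:
        for _ in range(copies):  # mirrors "for i in range(5):"
            incorrect_input_ids_list.append(corrupt(token_ids, rng))
            target_ids_list.append(list(token_ids))
    return incorrect_input_ids_list, target_ids_list
```

Because the corruption is random, the five copies of a sentence are (usually) different pseudo-examples, which is the point of resampling.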

@Tough-Stone
Author

Thanks for the answer. So the training objective of this task is to extract part of the original sentence and then recover the original sentence?

@NLPCode
Owner

NLPCode commented May 31, 2023

Yes.
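The objective confirmed above can be sketched in a few lines: sample a subset of words from a sentence as the constrained input, and use the full sentence as the reconstruction target. The keep ratio and sampling strategy here are hypothetical, not taken from the repo.

```python
import random

def make_training_pair(sentence, keep_ratio=0.4, seed=None):
    """Sample some words as keywords (model input); the target is the
    full original sentence. A hypothetical sketch of the objective."""
    rng = random.Random(seed)
    words = sentence.split()
    k = max(1, int(len(words) * keep_ratio))
    keep_idx = sorted(rng.sample(range(len(words)), k))  # preserve word order
    keywords = [words[i] for i in keep_idx]
    return keywords, sentence  # (constrained input, reconstruction target)
```

At test time the keywords come from keywork.txt instead of being sampled, and the ground-truth sentence is used only for evaluation, matching the earlier answer.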

@Tough-Stone
Author

Thanks. I now have two more questions:
When constructing the synthetic data, what strategy is used to obtain the encoder labels?
If I want to generate longer sentences from fewer keywords at inference time, what should I modify? In some other test cases I tried, almost no new words were inserted between the keywords; instead, everything was appended at the end of the sentence.

@Tough-Stone
Author

When using Chinese, each keyword is more than one token, so at inference time indicate_labels contains many 0s in the middle, and the newly inserted words all end up at the end of the sentence. What is the cause of this?
