If I want to create a new dataset (not CNN/DailyMail or XSum), what should I prepare for it? #8

lxianl455 · 2021-12-29T09:02:22Z

I noticed that the unprocessed data should be in the following format:

If I want to create a new dataset (not CNN/DailyMail or XSum), what should I prepare for it?
Should the file without the ".tokenized" suffix contain the original sentences?
And which tokenizer should be used to tokenize the sentences in the file with the ".tokenized" suffix?

yixinL7 · 2022-01-03T20:35:51Z

Hi,

> Should the file without the ".tokenized" suffix contain the original sentences?

Yes.

> Which tokenizer should be used to tokenize the sentences in the file with the ".tokenized" suffix?

I used the PTBTokenizer from CoreNLP, but any other tokenizer should be fine. The tokenized data is only used for evaluation, so it does not affect training.
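
As a rough illustration of the point above, here is a minimal sketch of writing a raw file together with its ".tokenized" counterpart. Everything in it is an assumption made for illustration, not code from this repository: the helper names `simple_tokenize` and `write_raw_and_tokenized`, the one-example-per-line layout, and the regex tokenizer, which is only a crude stand-in for PTBTokenizer.

```python
import re
from pathlib import Path
from typing import Callable, Iterable, List


def simple_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in tokenizer: splits words from punctuation.

    This only approximates PTB-style tokenization; swap in CoreNLP's
    PTBTokenizer or any other tokenizer you prefer.
    """
    return re.findall(r"\w+|[^\w\s]", text)


def write_raw_and_tokenized(
    lines: Iterable[str],
    raw_path: Path,
    tokenize: Callable[[str], List[str]] = simple_tokenize,
) -> Path:
    """Write the original text to raw_path and a tokenized copy next to it.

    The tokenized copy is saved as raw_path plus the ".tokenized" suffix,
    with tokens joined by single spaces. One example per line is an
    assumption of this sketch. Returns the path of the tokenized file.
    """
    tok_path = raw_path.with_name(raw_path.name + ".tokenized")
    with raw_path.open("w", encoding="utf-8") as raw_f, \
            tok_path.open("w", encoding="utf-8") as tok_f:
        for line in lines:
            # Collapse internal whitespace so each example stays on one line.
            text = " ".join(line.split())
            raw_f.write(text + "\n")
            tok_f.write(" ".join(tokenize(text)) + "\n")
    return tok_path
```

For example, `write_raw_and_tokenized(summaries, Path("test.target"))` would produce `test.target` and `test.target.tokenized`; the file name here is a placeholder. Passing the tokenizer in as a function makes it easy to replace the regex with a real PTB tokenizer later.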

Let me know if you have more questions.
