I noticed that the unprocessed data should be in the following format:
If I want to create a new dataset (not CNN/DailyMail or XSum), what should I prepare for it?
Should the file (without the ".tokenized" suffix) be filled with the original sentences?
And which tokenizer should be used to tokenize the sentences in the file with the ".tokenized" suffix?
Should the file (without the ".tokenized" suffix) be filled with the original sentences?
Yes.
Which tokenizer should be used to tokenize the sentences in the file with the ".tokenized" suffix?
I used the PTBTokenizer from CoreNLP, but it should be fine if you use another one. The tokenized data is only used for evaluation, so it does not affect training.
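To illustrate the file pairing described above, here is a minimal sketch that writes a ".tokenized" file next to a raw file. The regex tokenizer is a crude hypothetical stand-in (it splits off punctuation but does not handle contractions or quotes the way CoreNLP's PTBTokenizer does); the file names and helper are assumptions for illustration only.

```python
import re

def tokenize(sentence):
    # Crude PTB-style stand-in: keep word runs, split off punctuation.
    # Does NOT cover contractions, quotes, etc. like the real PTBTokenizer.
    return re.findall(r"\w+|[^\w\s]", sentence)

def tokenize_file(src_path, dst_path):
    # One sentence per line in the raw file, one tokenized line out.
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            dst.write(" ".join(tokenize(line.strip())) + "\n")

# e.g. tokenize_file("test.source", "test.source.tokenized")
```

Since the tokenized files only feed the evaluation (e.g. ROUGE scoring), any tokenizer with consistent behavior across reference and hypothesis files should work.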