If I want to create a new dataset (not CNN/DailyMail or XSum), what should I prepare for it? #8

Open
lxianl455 opened this issue Dec 29, 2021 · 1 comment

@lxianl455

I noticed that the unprocessed data should be in the following format:
image
If I want to create a new dataset (not CNN/DailyMail or XSum), what should I prepare for it?
Should the file without the ".tokenized" suffix contain the original sentences?
And which tokenizer should be used to tokenize the sentences in the file with the ".tokenized" suffix?

@yixinL7
Owner

yixinL7 commented Jan 3, 2022

Hi,

  • Should the file without the ".tokenized" suffix contain the original sentences?
    Yes.

  • Which tokenizer should be used to tokenize the sentences in the file with the ".tokenized" suffix?
    I used the PTBTokenizer from CoreNLP, but it should be okay to use another one. The tokenized data is only used for evaluation, so it does not affect training.
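
To make the two answers above concrete, here is a minimal sketch of writing one untokenized file and its ".tokenized" counterpart. Several things in it are assumptions rather than anything confirmed in this thread: the file name `example.txt`, the one-example-per-line layout, and the helper names `simple_tokenize` and `write_pair` are all hypothetical, and the regex tokenizer is only a stand-in for the PTBTokenizer (it splits contractions differently, for instance).

```python
import re
from pathlib import Path

# Stand-in tokenizer (NOT the CoreNLP PTBTokenizer): a run of word
# characters is one token, and each punctuation character is its own token.
_TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def simple_tokenize(line: str) -> str:
    """Return the line with its tokens separated by single spaces."""
    return " ".join(_TOKEN_RE.findall(line))


def write_pair(lines, raw_path):
    """Write the original lines to raw_path and a tokenized copy next to it.

    The tokenized copy gets the ".tokenized" suffix appended to the full
    file name. One example per line is an assumption of this sketch.
    """
    raw_path = Path(raw_path)
    tok_path = raw_path.with_name(raw_path.name + ".tokenized")
    raw_path.write_text("".join(line + "\n" for line in lines), encoding="utf-8")
    tok_path.write_text(
        "".join(simple_tokenize(line) + "\n" for line in lines), encoding="utf-8"
    )
    return raw_path, tok_path


if __name__ == "__main__":
    # Hypothetical file name, used for illustration only.
    write_pair(["Hello, world!", "A second example sentence."], "example.txt")
```

Since the tokenized file is only used for evaluation, swapping `simple_tokenize` for a real PTB-style tokenizer should change only the reported scores, not training.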

Let me know if you have more questions.
