Bad dataset #65
Comments
Hmm, I wonder if we could get better results using a higher-quality dataset like https://github.com/allenai/natural-instructions/tree/master/splits/default |
Yeah, I'm currently playing with LAION OIG small-chip2, but I'm really looking forward to the result of their Open-Assistant project, which will create a fully human-generated, refined, and curated instruction dataset. |
The dataset is the problem, I guess. Here are just two examples of inputs/outputs that teach the model nothing but to hallucinate.
I'm sure there are a lot of such bad instructions in there, and they cause more problems than they contribute to proper learning. |
There is definitely a lot of bad data in the dataset. I went through and manually cleaned hundreds of issues resulting in the current "cleaned" dataset, but there are likely many more issues I missed. I came across and fixed many issues referencing data on the internet that would likely just cause hallucinations, for example:
There were hundreds of similar examples in the original dataset. @abacaj Are the training results you posted from the original dataset or the cleaned dataset? |
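The kind of check described above (instructions that point at content on the internet) can be roughly approximated with a URL heuristic. This is a minimal sketch, assuming records use the instruction/input/output fields shown elsewhere in this thread. The pattern and function names are hypothetical and are not the rules used to build the cleaned dataset.

```python
import re

# Heuristic pattern for links (an assumption for illustration only;
# manual review would still be needed to catch other kinds of bad samples).
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)


def references_external_content(record):
    """Return True if the instruction or input field contains a URL."""
    text = f"{record.get('instruction', '')} {record.get('input', '')}"
    return bool(URL_PATTERN.search(text))


def split_records(records):
    """Split records into (kept, flagged) lists using the URL heuristic."""
    kept, flagged = [], []
    for record in records:
        if references_external_content(record):
            flagged.append(record)
        else:
            kept.append(record)
    return kept, flagged
```

Flagged records are returned rather than dropped, so they can be reviewed by hand before anything is removed.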
The runs posted are from the original; unfortunately, I only discovered this dataset after posting on Twitter. But I can probably do a smaller model run to see if it improves the loss. Personally, I don't know whether it will help to remove what appear to be bad samples. To me it seems like there are very similar answers in the labels of the dataset, which are used to compute the loss, and that could lead to overfitting and reduce the ability to generalize. Here are 3 examples I found - maybe not an issue, but they appear to contribute to the flat loss, i.e. the dataset is not diverse:
{
"instruction": "Generate a list of 5 American cities with populations higher than 1 million.",
"input": "",
"output": "New York City, Los Angeles, Chicago, Houston, Phoenix."
},
{
"instruction": "Name 5 US cities with populations over 1 million.",
"input": "",
"output": "The 5 US cities with populations over 1 million are: New York City, Los Angeles, Chicago, Houston, and Phoenix."
},
{
"instruction": "Generate a list of four US cities with over one million population.",
"input": "",
"output": "New York City, Los Angeles, Chicago, Houston."
}, |
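Near-duplicate answers like the three above can be surfaced with a simple token-overlap check. This is a minimal sketch using the overlap coefficient (shared tokens divided by the size of the smaller token set), which is my choice for illustration and not something used in the runs discussed here. The function names and the 0.8 threshold are assumptions.

```python
import re
from itertools import combinations


def token_set(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_coefficient(a, b):
    """Shared tokens divided by the size of the smaller token set."""
    set_a, set_b = token_set(a), token_set(b)
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / min(len(set_a), len(set_b))


def near_duplicate_pairs(records, threshold=0.8):
    """Return (i, j, score) for every pair of outputs at or above threshold."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        score = overlap_coefficient(a["output"], b["output"])
        if score >= threshold:
            pairs.append((i, j, score))
    return pairs
```

On the three records above, every pair scores 1.0, because each shorter answer is fully contained in the longer one. The pairwise loop is quadratic in the number of records, so for the full dataset it would need to be replaced by something like hashing or blocking.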
Following our discussion on Twitter, here is a screenshot of my current alpaca-lora training run (losses are a bit higher because I'm masking out the instruction in the loss): I'm starting to drift towards the idea that we should probably abandon the Alpaca dataset entirely once we get a suitable SFT dataset from the Open-Assistant project, or at least diversify the seed prompts in the original repo. |
Looks better. We could probably improve quality by filtering duplicate instruction/answer pairs out of the dataset and keeping only the best ones. I'm curious how you did the masking, because I did something similar in my run by applying IGNORE_INDEX to the labels up to the instruction prompt length. I just realized your loss is still a bit of a flatline, like my previous run; I think validation loss will show that it is overfitting. |
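For readers unfamiliar with the masking mentioned above, here is a minimal sketch of the idea, assuming labels start as a copy of the input token IDs and the prompt length is measured in tokens with the same tokenizer. The function name is hypothetical and the token IDs in the usage line are made up. The value -100 is the default `ignore_index` of PyTorch's `CrossEntropyLoss`, so positions set to it do not contribute to the loss.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_prompt_labels(input_ids, prompt_len):
    """Build labels from input_ids with the first prompt_len positions masked.

    Masked positions are set to IGNORE_INDEX, so only the response tokens
    after the instruction prompt are used when computing the loss.
    """
    if not 0 <= prompt_len <= len(input_ids):
        raise ValueError("prompt_len must be between 0 and len(input_ids)")
    return [IGNORE_INDEX] * prompt_len + list(input_ids[prompt_len:])


# Made-up token IDs: three prompt tokens followed by two response tokens.
labels = mask_prompt_labels([11, 12, 13, 21, 22], prompt_len=3)
```

Because the easy-to-predict prompt tokens no longer count toward the average, the reported loss is typically higher than in an unmasked run, which matches the remark in the previous comment.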
Maybe tangentially related, but @tloen curious why you might want to leave typos in the dataset (per #32 (comment)) |
Not my place to respond, but I would say leaving typos in the prompt teaches the model to read a typo as the word it was meant to be and respond accordingly. |
Makes sense to me as well for the prompt; the output side of the dataset should aim to be correct. |
I agree with that for sure. |
LAION's dataset can be found here https://github.com/LAION-AI/Anh/tree/main/data in case anyone wants to give it a try in training! |
Interesting - it looks like 100K lines of |
I started a new effort to try and clean up the current alpaca dataset |
I am working on putting together a FLAN dataset as well to upload to the HF hub. I'm training 7B and 13B LLaMA models on OIG in bf16 without LoRA. Will have those out soon. |
My intuition is that we should keep the training data scoped and focused. Correct all typos in training data that is not meant to cover the skill of correcting wrong spellings. Create more training prompts (there are some already) specifically focused on understanding the transition from:
|
If anyone is curious, here is my run on the Alpaca dataset using another decoder model (codegen-16B-nl). It appears the dataset isn't diverse: it contains multiple closely related answers. I believe a model trained on this dataset will not generalize well to new data.
The loss in the original Alpaca training script follows a pattern similar to the one used in OPT-IML, computing the loss based on the label.
My run on codegen-16B-nl
Another user's run on LLaMA 7B
Some more discussion: https://twitter.com/abacaj/status/1637310768780648448