Fix few issues with the dataset #32
Very interesting. I hadn't realized there were so many holes in the data. Fixing them could improve the model quality significantly. Out of curiosity, how many examples did you view, and was there any method to your approach?
alpaca_data.json
Outdated
@@ -121612,7 +121567,7 @@
{
"instruction": "Calculate the area of the following shape in square centimeters.",
"input": "rectangle of size 4 cm x 5 cm",
"output": ""
"output": "The area the rectangle is 20 square centimeters."
note to self: typo
alpaca_data.json
Outdated
"input": "",
"output": ""
"output": "Over the last 5 years, there have been advancements in deep learning, increased used of AI in the industry, a greater emphasis on explainability, development of new hardware and increased importance placed on ethical and social considerations of AI."
note to self: typo (used)
{
"instruction": "Create a graphic or logo that visually represents the word \"courage\".",
"input": "",
"output": "<No Output>"
we should probably standardize an output for these instead of removing all of them completely. i liked the one above:
As a large language model, I am unable to generate visual data.
alpaca_data.json
Outdated
"instruction": "Given a list of materials, come up with a plan to build a table.",
"input": "Materials: Wood, paint, glue, nails, screws",
"output": "No Output"
"instruction": "Come up with a motto for a robot company.",
"input": "",
"output": "Machines that make a difference."
Might be worth running this one through GPT again.
alpaca_data.json
Outdated
"instruction": "Is the following statement true or false? People in China make the most of their allowances.",
"input": "None",
"input": "",
"output": "False"
Huh.
lol. yeah, there are a few of these floating around. Seem to make no sense.
I noticed that several tasks expect the LLM to use data from URLs, many of which don't even exist. I've substituted equivalent data where available.
I only gave it a cursory look and fixed the very obvious issues (i.e., inconsistent empty inputs, obviously wrong answers, blank outputs, etc.). I probably went through a few hundred examples manually. I think I got most of the low-hanging fruit by searching for empty inputs and blank outputs. I did notice that many instructions ask the LLM to reference online data to answer a question. These should probably be addressed in some manner.
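A search for empty inputs and blank outputs like this can be sketched in a few lines of Python. This is only an illustration, not the script used for this PR: the placeholder sets are assumptions based on the values visible in the diff snippets ("", "None", "No Output", "<No Output>") and are probably not exhaustive, and `load_records` and `find_suspect_records` are hypothetical helper names.

```python
import json

# Assumed placeholder values, taken from the diff snippets in this thread.
# Comparison is done on stripped, lowercased strings.
BLANK_OUTPUTS = {"", "no output", "<no output>"}
PLACEHOLDER_INPUTS = {"none"}


def load_records(path):
    """Load a list of {"instruction", "input", "output"} records from JSON."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def find_suspect_records(records):
    """Return (index, reason) pairs for records that look incomplete."""
    suspects = []
    for index, record in enumerate(records):
        output = record.get("output", "").strip().lower()
        model_input = record.get("input", "").strip().lower()
        if output in BLANK_OUTPUTS:
            suspects.append((index, "blank output"))
        if model_input in PLACEHOLDER_INPUTS:
            suspects.append((index, "placeholder input"))
    return suspects


if __name__ == "__main__":
    # Small in-memory sample built from records quoted in this thread.
    sample = [
        {
            "instruction": "Calculate the area of the following shape in square centimeters.",
            "input": "rectangle of size 4 cm x 5 cm",
            "output": "",
        },
        {
            "instruction": "Come up with a motto for a robot company.",
            "input": "None",
            "output": "Machines that make a difference.",
        },
    ]
    for index, reason in find_suspect_records(sample):
        print(index, reason)
```

To scan the real file, something like `find_suspect_records(load_records("alpaca_data.json"))` should work, assuming the file is a single JSON array of records. A check like this only catches the mechanical cases; wrong answers still need a manual pass.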
I’m not sure if this is the right place to ask, but I was thinking of crowdsourcing updates to each response in the training dataset, with functions to review and approve each line.
Instead of providing generic answers like "As a large language model, I am unable to...", we could introduce a standardized set of tools that could improve the accuracy of certain types of responses, such as calculations, image generation, or code compilation. The model should propose tools and use their output instead of relying solely on its internal capabilities (which could be a big limitation considering the model size). One can still detect the tool usage and replace it with a generic answer if necessary.
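As a rough sketch of the "detect the tool usage and replace it with a generic answer" step: the `<tool name="...">...</tool>` markup below is purely hypothetical (nothing in this thread or the dataset defines a tool-call format), and `find_tool_calls` and `replace_tool_calls` are illustrative names.

```python
import re

# Hypothetical tool-call markup, e.g. <tool name="calculator">4 * 5</tool>.
# This format is an assumption made for illustration only.
TOOL_CALL = re.compile(r'<tool name="(?P<name>\w+)">(?P<args>.*?)</tool>', re.DOTALL)

# Generic answer quoted earlier in this thread.
FALLBACK = "As a large language model, I am unable to generate visual data."


def find_tool_calls(text):
    """Return (tool_name, arguments) pairs for every tool call in the text."""
    return [(m.group("name"), m.group("args")) for m in TOOL_CALL.finditer(text)]


def replace_tool_calls(text, fallback):
    """Replace every tool call with a fixed fallback string."""
    # A function is used as the replacement so that backslashes in the
    # fallback string are not interpreted as regex escapes.
    return TOOL_CALL.sub(lambda _match: fallback, text)


if __name__ == "__main__":
    output = '<tool name="image">logo representing courage</tool>'
    print(find_tool_calls(output))
    print(replace_tool_calls(output, FALLBACK))
```

With markup like this, a pipeline without tools could fall back to the generic answer, while one with tools could dispatch on the tool name instead.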
To assist with this, I made an embedding space explorer (running the data through a transformer) for visualizing the instructions and outputs.
Training Data Instructions Latent Space: https://atlas.nomic.ai/map/alpaca_instructions
For example, here is a link to a bunch of bad data points in the outputs: https://atlas.nomic.ai/map/d2139cc3-bc1c-441c-8d6f-3e6ffbbc2eda/838019ff-8fe2-42ba-809a-d86d2b98cd50/-18.11668742841587/-11.348087116836096/-20.88850316347706/-17.680468640801223/774455612
The original Stanford dataset is full of mistakes and holes. Another large issue I found was that many of the instructions hallucinated references to article URLs. I made a best-effort first pass through the dataset to clean it up:
The patched dataset is much more consistent and no longer assumes the LLM can access the internet or view/generate visual data. It also now has a few CoT training examples.
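For anyone who wants to look for the remaining URL-dependent instructions, a simple pattern match is one possible starting point. This is a minimal sketch, not the method used for this patch: the regular expression is a deliberately loose assumption, and `references_url` is a hypothetical helper name. It only flags that a URL appears in the instruction or input; it does not check whether the page exists.

```python
import re

# Loose pattern for http(s) links and bare "www." hosts. It is an
# assumption for illustration and may over- or under-match.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)


def references_url(record):
    """Return True if the instruction or input of a record contains a URL."""
    text = f'{record.get("instruction", "")} {record.get("input", "")}'
    return URL_PATTERN.search(text) is not None


if __name__ == "__main__":
    records = [
        {"instruction": "Summarize the article.", "input": "https://example.com/article"},
        {"instruction": "Come up with a motto for a robot company.", "input": ""},
    ]
    flagged = [i for i, record in enumerate(records) if references_url(record)]
    print(flagged)
```

Flagged records would still need manual review, since some instructions may mention a URL without requiring the model to fetch it.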
I spent some time thinking about how to crowdsource dataset cleaning with minimal tooling. One way to do this is to create a separate repo with the following structure:
I suppose the utility of such an approach would depend on how many bad data points remain. In the meantime, I'll review the changes made so far and save a new "cleaned" dataset alongside the existing one.
Would the dataset benefit from multi-turn prompt:response chains rather than just a single prompt:response pair, i.e. Question:Answer:FollowupQ:FollowupA?
That's a lot of work to build. I'd hold out for that 22k dataset that LAION used to train SFT-1.
Folded into f704404. Thanks for your work!
Looks like this was closed just as I was typing, but there is a typo not too far into the file, and I'm not sure whether it's intentional.
https://github.com/tloen/alpaca-lora/blob/main/alpaca_data_cleaned.json#L23
Although honestly we might want to leave typos in the instructions.
Yeah it might be worth it idk.
for prompts it seems a good idea to keep typos
People should really support LAION's open-assistant.io project: every person who helps there improves a fully curated, crowdsourced, open-source instruction fine-tuning dataset, which in turn can be used for Alpaca fine-tuning.
FYI, the dataset cleaning is ongoing. The latest cleaned dataset can be accessed here.
Good idea. Meta is already working on it with Toolformer, and there are a few other efforts too, for example getting a model to control a web browser. They help, but not as much as you would expect at the moment (red is baseline, blue is with a calculator). Since it's a WIP, I would guess it's outside the scope of this repo for now.
Since the training dataset was generated with GPT-3, I noticed several issues when going through it. I have manually fixed the following issues:
Hoping this slightly curated dataset will help produce better training results.