Any chance we could improve the dataset beyond fixing? #8
Yeah, that's the plan, I think. At least that's what I will focus on shortly.
Awesome. I hope to contribute.
One way you could get started right now is to write a script similar to
Start a new file containing the newly rewritten prompts, to help keep things organized.
My thought is the following:
We could then provide extensions (patches) to the BASE dataset that would add additional instructions or features that people could try.
The extensions could be in the form of a new dataset, or in the form of a patch applied to the base dataset along with a small Python script that applies the patch and generates a new dataset. This way, anyone who just wants a cleaned base alpaca dataset will not have any issues with possibly unwanted new instructions, while those who want new features can apply the patch that implements the data they want (a rough sketch of such a script follows).
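For illustration only, here is a minimal sketch of what such a patch script could look like; the file layout, field names, and merge-by-instruction de-duplication are assumptions for this sketch, not an agreed format:

```python
import json
import sys

def apply_extension(base_path, patch_path, out_path):
    """Merge an extension (patch) dataset into the base alpaca dataset.

    Assumes both files are JSON arrays of objects sharing the alpaca
    fields; keying de-duplication on "instruction" is an assumption.
    """
    with open(base_path, encoding="utf-8") as f:
        base = json.load(f)
    with open(patch_path, encoding="utf-8") as f:
        patch = json.load(f)

    # Skip patch entries whose instruction already exists in the base,
    # so applying the same patch twice doesn't duplicate data.
    seen = {item["instruction"] for item in base}
    merged = base + [item for item in patch if item["instruction"] not in seen]

    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(merged, f, ensure_ascii=False, indent=2)

if __name__ == "__main__":
    # e.g. python apply_patch.py alpaca_data_cleaned.json my_extension.json alpaca_extended.json
    apply_extension(*sys.argv[1:4])
```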
Good idea, that seems like a reasonable implementation, given the original intention of the upstream
As for me, I will probably not use the base cleaned alpaca dataset, as it is honestly plain garbage, and will use only more advanced training prompts. Should those go in this repo? If so, should we rename this repo to something better suited, like AlpacaTrainingExtensions? FYI, I've opened a PR in
Agree completely. Then model fine-tuners can mix and match base + X, Y, Z, or all of them if they would like. Two new categories for dataset shards:
Also, another one to add to the list is Toolformer capabilities. I'm writing some prompts for all 4 sets that should generate the training data with GPT-4, based on a modified version of the self-instruct prompt Stanford used for their original dataset. Here is one to generate data similar to the original dataset, but with GPT-4 responses:
Regarding the above, when running that prompt through the GPT-4 API, you get data such as this, already formatted into JSON like the original training set, afaik:

```json
[{"instruction": "Rewrite the given paragraph in a more concise and professional manner.",
  "input": "The manager was super upset and yelled a lot at the team because, like, everyone was late and didn't finish the project on time. So, the company lost the deal with, you know, that really important client. Oh, and the boss was, like, really mad too and said there might be, like, consequences for everyone involved.",
  "response": "The manager was extremely displeased and reprimanded the team due to their tardiness and failure to complete the project on time, resulting in the company losing a significant business deal. The supervisor expressed their anger and indicated potential repercussions for those involved."},
```
Nice, here are some generated examples that might mesh well with yours:
In general, I think it's bad practice to include artificial limitations regarding abilities in training prompts; it skews the distributions too much.
While they don't seem to be enforcing it (yet), do keep in mind that training another model using the OpenAI API is against the OpenAI API ToS, and doing so might put your account at risk:
Anything is possible when a transformer can webdrive; it's inevitable.
Because the "base" cleaned dataset could change over time, it might be difficult to host 'patches' that may not cleanly apply to a changed base. Perhaps a folder where we can host the extensions (extended datasets), something along the lines of:
I think it might be interesting to combine some RLHF datasets (such as Anthropic's Human Preferences dataset) with the cleaned alpaca dataset.
The one issue I have with this is that I think all new datasets should conform to the alpaca dataset's format, i.e., with just an instruction, input, and response field, as sketched below:
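For reference, the shape would look like this; the example entry is illustrative, and note that the original Stanford release names the third field "output", while the GPT-4 sample above uses "response":

```python
# Alpaca-format dataset: a JSON array of objects with exactly these fields.
example = [
    {
        "instruction": "Rewrite the following sentence in the passive voice.",
        "input": "The cat chased the mouse.",  # empty string when no context is needed
        "response": "The mouse was chased by the cat.",
    },
]
```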
You could ask GPT-4 to generate the tasks in the given JSON format, i.e., it is possible to get your tasks preformatted with additional context.
Yep indeed, that's what my sample task above does; it makes it easier to dump to a JSON file (a minimal sketch of that step follows):
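A minimal sketch of that dump step, assuming the model's reply comes back as a valid JSON array (the literal below stands in for the real API response):

```python
import json

# Stand-in for the raw text returned by the model; in practice this
# would be the content of the GPT-4 chat completion.
reply = '[{"instruction": "Translate to French.", "input": "Hello", "response": "Bonjour"}]'

tasks = json.loads(reply)  # raises a ValueError if the model broke the JSON format
with open("generated_tasks.json", "w", encoding="utf-8") as f:
    json.dump(tasks, f, ensure_ascii=False, indent=2)
```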
Please do take note that sometimes GPT-4 will continuously generate tasks that rely on some form of internet access. I have no idea why it happens, but it can generate all 25 tasks as some form of URL-accessing and summarising task.
Another dataset has been produced for code-generating instruct:
And the guanaco dataset here, which is basically the alpaca set rebuilt with gpt-3.5 instead of davinci:
Interesting. Do you know if the same Stanford
Here is more info on the guanaco dataset from HF and their github.io website: The dataset is far larger than alpaca, but mainly focuses on recreating the alpaca set in other languages. Also, for more context: I've tried both an alpaca full fine-tune 7B and a guanaco LoRA 7B, and I find the guanaco LoRA to be far worse. It could be the dataset, or it could be that it's trained with a LoRA, but I figured I should mention that for even more context.
Yet another generated dataset to keep an eye on: it is a dataset that used gpt-3.5 (I believe) to critique each response from the alpaca dataset.
Another idea I've been toying with is extending the dataset so that alpaca performs better with langchain. The current dataset only gets about a 60-70% pass rate on the LLM Math Chain. I haven't tested VectorDBQA or the other chains; however, I've heard others claim it did not do so well on them. (A rough sketch of that kind of pass-rate check follows.)
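Not the actual harness behind those numbers, but a rough sketch of how such a check could look, assuming the early-2023 langchain API; in practice `llm` would wrap the fine-tuned alpaca model rather than OpenAI, and the test cases here are made up:

```python
from langchain.llms import OpenAI        # stand-in; swap in the alpaca wrapper
from langchain.chains import LLMMathChain

llm = OpenAI(temperature=0)
chain = LLMMathChain(llm=llm)

# Hypothetical test cases; a real harness would use a much larger suite.
cases = [
    ("What is 13 raised to the 0.3432 power?", 2.4116),
    ("What is 7 times 8?", 56.0),
]

passed = 0
for question, expected in cases:
    answer = chain.run(question)  # the chain answers as a string like "Answer: 2.4116"
    try:
        value = float(answer.split("Answer:")[-1].strip())
        passed += abs(value - expected) < 1e-2
    except ValueError:
        pass  # unparseable output counts as a failure

print(f"pass rate: {passed}/{len(cases)}")
```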
Here's an 800k 3.5-turbo dataset (and LoRA)
A few notable instruction datasets not mentioned here:
Personally, I feel the datasets we work on here should be limited to self-instruct datasets, i.e. those generated by LLMs, since this is about improving a synthetically generated dataset. Also, to keep everyone up to date: GPT4All updated their dataset to remove all objects where gpt-3.5 refused a request.
I just added some code to the tools directory that allows one to generate outputs using gpt-3.5-turbo (i.e. ChatGPT). Here's some example output:
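The example output itself isn't reproduced here. For context, a minimal sketch of the kind of call such a tool makes, assuming the pre-1.0 `openai` Python package that was current at the time (this is not the actual code from the tools directory):

```python
import openai

openai.api_key = "sk-..."  # placeholder

def generate_output(instruction: str, input_text: str = "") -> str:
    """Ask gpt-3.5-turbo to produce an output for one alpaca-style task."""
    prompt = f"{instruction}\n\n{input_text}" if input_text else instruction
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    return resp["choices"][0]["message"]["content"]

print(generate_output("Give three tips for staying healthy."))
```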
I guess we have to look at the comparative advantage of this project. From an ML perspective, it doesn't really matter if the data is synthetic, augmented, or manual. What matters more is the diversity, size, and quality. Synthetic, if anything, has negative associations in ML, since it's often low quality and high quantity. If you care about size, OIG has 43 million (!) instructions (mostly synthetic), so our little dataset cannot compete on size. Perhaps on quality? Well, smaller datasets like HH-RLHF, open-assistant, SHP, or natural instructions are (I guess) also high quality. But more high-quality data is better, so we can add them together 🤗 if we are sure about the quality. I would say the comparative advantage here is clean, high-quality data that has been reviewed in detail. Maybe at the end, a knowledgeable community member can summarise the cleaning process; it's gone past me, tbh. It would be interesting to have a case study on how to clean large amounts of knowledge-distillation data, i.e. how to clone but also improve a model.
btw, it might be mildly interesting to see LAION's WIP approach to cleaning their data. Things are moving so fast.
Would like to suggest https://sharegpt.com/ data to augment the dataset.
Hi, so, I just uploaded a GPT-4 generated dataset some friends of mine made here: https://github.com/teknium1/GPTeacher There's a set for instruct-roleplay, general-instruct, instruct-code (soon), and Toolformer.
Here's another synthetic dataset: https://github.com/project-baize/baize/tree/main/data
If you haven't, you might enjoy checking out this lit review on Toolformer & TALM
It seems that instead of using self-instruct to make the LM become versions of
I cannot think of any task that is better suited for
A task that requires reasoning and such would be far better for GPT-4; but also, gpt-3.5-turbo uses the same ChatML format as GPT-4. Also, GPT4All is 400k examples from 3.5-turbo, afaik.
I've already made a dataset for Toolformer with GPT-4; it's in my repo listed above.
I just merged GPT-4 results for all the non-curated items using the Microsoft GPT-4 dataset.
Several chain-of-thought links: GPT-4 answers dataset: https://github.com/instruction-tuning-with-gpt-4/gpt-4-llm
Just added a dataset_extensions folder with two datasets I've converted to Alpaca JSON format (a sketch of the kind of conversion involved follows):
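The two converted datasets aren't listed here; as a hedged sketch, a conversion of this kind usually just remaps field names onto the alpaca schema (the source field names below are made up for illustration):

```python
import json

def to_alpaca(records, instruction_key, output_key, input_key=None):
    """Map a list of dicts from some other schema onto Alpaca JSON format."""
    return [
        {
            "instruction": r[instruction_key],
            "input": r.get(input_key, "") if input_key else "",
            "output": r[output_key],
        }
        for r in records
    ]

# Made-up source record for illustration:
source = [{"question": "What is 2 + 2?", "answer": "4"}]
print(json.dumps(to_alpaca(source, "question", "answer"), indent=2))
```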
The open-assistant.io folks are apparently releasing a first drop of volunteer-provided QAs on the 15th too.
Would that be relevant in the scope of this project? Adding a few sorts of task examples could improve its generalized capabilities, for instance:
Longer responses
GPT-4 Generated Responses for similar tasks it already has
Roleplaying
Chain of Thought
etc.