Running code generation on your own API Docs #29
Very interesting, thanks for sharing the dataset. I've been generating my own data and did the first fine-tuning run today, though the data was quite dirty, so the results are still low quality. I'll take a look at the dataset you shared, though. About your use case: have you looked into StarCoder or its instruct version? https://huggingface.co/HuggingFaceH4/starchat-alpha StarCoder has a context window of 8k tokens, so maybe the instruct version does too. Would that be enough for you?
@mikolodz Yeah, I meant starchat-alpha, which is StarCoder fine-tuned on an instruction dataset (I shared a link to the HF page in my previous comment). Not sure if it's easy to find it already quantized.
@paolorechia Thanks for the link. I had used this prompt before I posted my previous comment (although I copied it from the paper and could have messed something up). I tried again to make sure I wrote the prompt correctly, and it doesn't seem to perform any better than other models. Starchat, on the other hand, looks much more promising, but there is no 4-bit version available yet. Definitely worth trying when it's out! :)
Example 1:
# here is the prompt above
Human: Write me a python script which fetches the website content (including all sub-pages). Clean text is parsed from the content and stored inside output.txt file. Url is given as an argument to the script.
Assistant: Sure thing. Please note that cleaning HTML tags is outside scope of our services currently. However, you could definitely build something similar based off of https://github.com/mozilla/bleach.
We recommend reading the documentation before starting coding though :)
-----
"""
Example 2:
# here is the prompt above
Human: Write me a python script which fetches the website content (including all sub-pages). Clean text is parsed from the content and stored inside output.txt file. Url is given as an argument to the script.
Assistant: Sounds good. If you look closely enough, you'll notice that the url ends with “/”, indicating that it points to root directory of a web site. Your job is simple – fetch everything underneath this page including images etc., parse clean text from html files and save it somewhere locally. Then send email containing contents of local txt file alongside link to original webpage URL.
Therefore, our task becomes simply getting raw HTML source code over HTTP GET request, parsing plain text from response body, saving cleaned text to disk, attaching the saved file to outgoing mail message and sending it off.
To keep things simple, I'd recommend starting with urllib library provided standard lib.
BTW, on my setup it's a bit slow, but not by much. It doesn't load properly with monkey_patch, so it's about 2-3x slower than other 13B models considering the huge starting prompt, at ~10 tokens/sec.
If you have the hardware time, you might be able to quantize it yourself using this: https://github.com/mayank31398/GPTQ-for-SantaCoder Not sure though!
Another model I've never tried out: https://github.com/BlinkDL/RWKV-LM Apparently it can handle longer context windows since it's actually an RNN! The README is a bit confusing, but it seems to mention an example with a context window of 8k tokens. With this one you would probably need fine-tuning, though.
@paolorechia Did you consider using a ready-made dataset like this one to fine-tune the models:
https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/CodeAlpaca
I'm currently working on a similar project (as you described in the closed issue), where an LLM generates Python code using my API. The code is evaluated and executed on the source request server. It works using GPT-3.5-turbo, and now I'm solving the puzzle of making it work with an open model.
Biggest problems? 2k tokens are not enough to fit the API docs within the prompt. LangChain, on the other hand, should be useful: if the right embedding method is used, the right snippets of the documentation should be passed within the prompt. But there is still quite a chance that the context will be filled with something random from the DB instead of the right chunks. I ran into that lately and I'm not sure how much it depends on the structure of the docs.
I'm considering fine-tuning, just as you described, but it would take a couple of weeks to prepare the dataset properly on your own data. Also, when I extend the API I will have to train again, which takes another couple of hours, and the results are not predictable every single time.
So I believe the best approach would be to fine-tune some simple model like Wizard 7B or wizard-vicuna-13b (these have given me the best results so far) to write the Python code properly (using an external, ready-made dataset), and then pass the right chunks of the API docs in the context (LangChain style).
What do you think about it?
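To make the "pass the right chunks of the API docs in the context" part concrete, here is a rough sketch of the idea. It is not LangChain's API: the bag-of-words cosine score is only a stand-in for a real embedding model, the token count is a crude word count rather than a model tokenizer, and the `API_DOCS` entries and all function names are hypothetical, invented for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens; a crude stand-in for a real tokenizer."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_chunks(question, chunks, token_budget):
    """Greedily pick the most similar doc chunks that fit the token budget."""
    q_vec = Counter(tokenize(question))
    ranked = sorted(
        chunks,
        key=lambda c: cosine(q_vec, Counter(tokenize(c))),
        reverse=True,
    )
    selected, used = [], 0
    for chunk in ranked:
        cost = len(tokenize(chunk))  # rough estimate, not a model token count
        if used + cost <= token_budget:
            selected.append(chunk)
            used += cost
    return selected


def build_prompt(question, chunks, token_budget=1500):
    """Assemble a prompt from the selected doc chunks plus the task."""
    docs = "\n\n".join(select_chunks(question, chunks, token_budget))
    return f"API docs:\n{docs}\n\nTask: {question}\nPython code:"


# Hypothetical API docs, made up for this example only.
API_DOCS = [
    "get_user(user_id) returns a dict with the user's name and email.",
    "send_invoice(user_id, amount) creates and emails an invoice to the user.",
    "list_orders(user_id) returns a list of order ids for the user.",
]

if __name__ == "__main__":
    print(build_prompt("email an invoice to a user", API_DOCS, token_budget=12))
```

With a tight budget only the best-matching chunk makes it into the prompt, which is exactly where the "something random instead of the right chunks" problem shows up: if the similarity score ranks the wrong chunk first, the model never sees the relevant part of the docs.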