This repository contains slides and code for a session on Instruction Tuned Language Models at the Technical University of Applied Sciences Augsburg, Germany. The session was given as an invited lecture within the NLP course taught by Prof. Alessandra Zarcone.
Note: The code samples and descriptions here are only meant to help students understand instruction tuned language models. They are not intended for use in any projects. There is no guarantee of correctness or usability.
There has been a raging debate about the potential and risks of AI, with even 'existential risks for humanity' being mentioned. Recently, even the most prominent researchers in the field have started taking sides in this debate. Understanding the basis of these researchers' opinions is not easy for most students.
It is undeniable that ChatGPT was the trigger for this debate. We do not engage here in the discussion of whether Large Language Models have the potential to understand the world or whether they are just stochastic parrots. We leave that to the more knowledgeable people 😏
We try to address much simpler problems for a student at the University of Applied Sciences:
- How can I better understand how models like ChatGPT are developed?
- Is there a possibility that I could develop a smaller version of such a model and understand whether or not a small model can perform some tasks on par with a much larger model?
- What is the nature of the data used to train such models, and what is the impact of data quality on the performance of such models?
- If asking the question in the right form is so important, can I experiment with sophisticated techniques like Chain of Thought or ReAct using my small model?
We needed a model that can satisfy the following needs:
- The model weights can be downloaded without constraints
- An instruction tuned version of the model is available, so we can check whether we can make progress by finetuning on another dataset
- The model can be finetuned on a GPU with 24GB VRAM (assuming some gaming GPUs can be used for the betterment of science and humanity 😁)
Amongst the few options available, we decided to go with the Falcon large language model. The model has been shown to do well on multiple benchmarks. It can easily be used with the Hugging Face Transformers library together with Parameter-Efficient Finetuning (PEFT) and quantization (bitsandbytes); a short loading example follows the list of model variants below. However, students can experiment with other models. The two model variants used:
- The base model: falcon-7b
- The instruction tuned model: falcon-7b-instruct
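
As a quick sanity check, the instruction-tuned variant can be loaded and queried roughly as follows. This is a hedged sketch, not part of the repository's code; it assumes the transformers, accelerate and bitsandbytes packages are installed, and argument names may differ slightly between library versions.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "tiiuae/falcon-7b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,  # Falcon ships custom modelling code
    load_in_8bit=True,       # bitsandbytes quantization to fit into 24GB VRAM
    device_map="auto",
)

prompt = "What is instruction tuning?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```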
We did not want to start from scratch with the finetuning code. The Alpaca model has already triggered the creation of multiple open source projects. However, these depend on using the LLaMA model. We can use one such project, Alpaca-LoRA, and adapt it for use with Falcon. We used a version of the databricks-dolly-15k dataset adapted to the Alpaca format required by Alpaca-LoRA.
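
For illustration, a single training record in the Alpaca format has an instruction, an optional input (context), and the expected output. The record below is made up for this description and is not an actual entry from the dataset.

```python
# A made-up example record in the Alpaca instruction format
example = {
    "instruction": "Summarize the paragraph below in one sentence.",
    "input": "Falcon is a family of causal decoder-only language models "
             "released by TII and trained largely on web data.",
    "output": "Falcon is a family of decoder-only language models trained mostly on web data.",
}
```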
These instructions are not meant to let you copy and paste commands into a terminal. Use the descriptions below to figure out the intermediate steps (you should be able to do this successfully). This is part of the learning process of working with such models.
(I could not resist putting some comments here. So, just for fun: when the LLM can itself decide to read these instructions, understand them, generate datasets and execute the code, it might be able to create offspring to suit its goals 😈 ).
- Set up the environment (Python dependencies) for working with Falcon. Remember to set up the required PyTorch version. Follow the instructions here. You will need additional packages for Hugging Face Transformers, Datasets, PEFT and quantization (bitsandbytes).
- Clone the Alpaca-LoRA repository
After cloning the repository, you need to make the following adaptations to the code to make it work with Falcon (a rough sketch of the main changes follows the list):
- In the train function in finetune.py, specify base_model as "tiiuae/falcon-7b", data_path as "c-s-ale/dolly-15k-instruction-alpaca-format", and the list for lora_target_modules as ["query","value"]. Set output_dir to something like "./falcon-lora".
- Replace the use of LlamaForCausalLM with AutoModelForCausalLM and LlamaTokenizer with AutoTokenizer.
- Something to explore further: the Alpaca-LoRA code tries to use the unknown token (unk_token) as the padding token. The Falcon tokenizer does not have this token, so a dedicated padding token has to be added instead. Adapt the code accordingly:
After loading the tokenizer:
tokenizer.add_special_tokens({'pad_token': '<PAD>'})
After loading the model:
model.resize_token_embeddings(len(tokenizer))
Additionally, the original Alpaca code padded to the right, while the Alpaca-LoRA code pads to the left (with the argument that this allows batched inference). Look into the effects of these choices.
- There is a bug in the Alpaca-LoRA code that prevents the adapter weights from being saved. Read the comments on the issue that has already been logged and fix it (the fix involves commenting out some code).
- In utils/prompter.py, modify the get_response function to use the "<|endoftext|>" token to find the end of the generated text, and keep only the text up to the first occurrence of that token.
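
Putting the adaptations above together, the relevant parts of finetune.py and utils/prompter.py could look roughly like the sketch below. This is a hedged illustration using the values from these notes, not a drop-in replacement; the actual variable names and surrounding code in Alpaca-LoRA may differ between versions of the repository.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model = "tiiuae/falcon-7b"
output_dir = "./falcon-lora"
lora_target_modules = ["query", "value"]  # the modules suggested in these notes

# LlamaTokenizer -> AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model)
# Falcon has no unk_token to reuse as padding, so add a dedicated pad token
tokenizer.add_special_tokens({"pad_token": "<PAD>"})

# LlamaForCausalLM -> AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    trust_remote_code=True,  # Falcon ships custom modelling code
    load_in_8bit=True,
    device_map="auto",
)
# grow the embedding matrix to cover the newly added pad token
model.resize_token_embeddings(len(tokenizer))

# wrap the base model with LoRA adapters (default hyperparameters from Alpaca-LoRA)
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=lora_target_modules,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# In utils/prompter.py, the Prompter.get_response method can cut the
# generated text at the first Falcon end-of-text token:
def get_response(self, output: str) -> str:
    response = output.split(self.template["response_split"])[1].strip()
    return response.split("<|endoftext|>")[0].strip()
```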
Perform the finetuning as described in the instructions in the Alpaca-LoRA repository (remember to use the right values for the arguments: base model, rank, output path, etc.).
Important: If you are logged in remotely via ssh at the university (for accessing a workstation or server), use nohup and append an & to the finetuning command so that it keeps running when your ssh session is terminated. You can run tail -f nohup.out in the directory where you started the finetuning script to follow the progress (finetuning may take hours or days depending on the GPU and the number of epochs).
After finetuning is complete, modify generate.py as required for use with Falcon: set the correct base model, load the PEFT weights from the folder where you saved the finetuned weights, and use AutoModelForCausalLM and AutoTokenizer. (Hint: you do not necessarily need the Gradio setup provided in the Alpaca-LoRA code. Just use the parts needed for text generation: base model loading, PEFT parameter loading, GenerationConfig, and the prompter modified for use with Falcon.)
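
For orientation, a minimal generation script along these lines could look like the following hedged sketch. It reuses the base model and adapter path from above and builds an Alpaca-style prompt inline; the exact prompt template and generation parameters are up to you, and API details may vary across library versions.

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

base_model = "tiiuae/falcon-7b"
adapter_dir = "./falcon-lora"  # folder where the finetuned LoRA weights were saved

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    trust_remote_code=True,
    load_in_8bit=True,
    device_map="auto",
)
model = PeftModel.from_pretrained(model, adapter_dir)  # attach the LoRA adapter
model.eval()

# Alpaca-style prompt (no separate input/context in this example)
prompt = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\nName three countries in Europe.\n\n### Response:\n"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(
        **inputs,
        generation_config=GenerationConfig(max_new_tokens=64),
    )

# decode without skipping special tokens so the text can be cut at
# Falcon's end-of-text token, as described above
text = tokenizer.decode(output[0])
print(text.split("### Response:")[1].split("<|endoftext|>")[0].strip())
```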