Fine-tuning takes a general-purpose model like GPT-3 and specializes it into something like ChatGPT, the specific chat use case, to make it chat well; or takes GPT-4 and turns it into a specialized GitHub Copilot use case to auto-complete code.
- Makes it possible to give the model a lot more data than fits into the prompt, so your model can learn from that data rather than just have access to it.
e.g. a model fine-tuned on dermatology data might take in the same symptoms and be able to give you a much clearer, more specific diagnosis.
- Helps steer the model to more consistent outputs or more consistent behavior.
For example, when you ask the base model "What's your first name?", it might respond with "What's your last name?", because it has seen so much survey data out there full of different questions.
It doesn't even know that it's supposed to answer that question. A fine-tuned model, by contrast, when you ask it "What's your first name?", would be able to respond clearly: "My first name is Sharon."
- Reduces hallucinations
- Customizes the model to a specific use case
- Process is similar to the model's earlier training
- Prompting has a smaller upfront cost, so you don't really need to think about cost; each time you ping the model, it's not that expensive.
- RAG can connect more of your data to the model, selectively choosing what data goes into the prompt (see the sketch after this list).
- Often, when you do try to fit a ton of data into the prompt, the model will forget a lot of it.
- Hallucination: the model makes things up, and it's hard to correct incorrect information it has already learned.
- RAG can also miss the right data or retrieve incorrect data, causing the model to output the wrong thing.
- With fine-tuning, you can correct incorrect information the model may have learned before, or even put in recent information it hadn't learned about previously.
- There's less compute cost afterwards if you fine-tune a smaller model, which is particularly relevant if you expect to hit the model many times.
- RAG can still connect a fine-tuned model with far more data, even after it has learned all this information.
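For context, a minimal sketch of the RAG idea mentioned above: retrieve the most relevant documents by embedding similarity and put only those into the prompt. The embedding model, documents, and prompt wording here are illustrative assumptions, not part of the course code.
# Minimal RAG sketch (assumptions: sentence-transformers is installed; the
# embedding model, documents, and prompt wording are illustrative).
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "Fine-tuning updates the model's weights on curated examples.",
    "RAG retrieves relevant documents and adds them to the prompt.",
    "Prompt engineering changes only the input text, not the weights.",
]
doc_vectors = embedder.encode(documents)  # shape: (num_docs, dim)

def retrieve(query, k=2):
    # Rank documents by cosine similarity to the query embedding.
    q = embedder.encode([query])[0]
    scores = doc_vectors @ q / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q))
    top = np.argsort(scores)[::-1][:k]
    return [documents[i] for i in top]

query = "How does RAG choose what goes into the prompt?"
context = "\n".join(retrieve(query))
prompt = f"Answer using the context below.\n\nContext:\n{context}\n\nQuestion: {query}\nAnswer:"
print(prompt)  # this prompt would then go to the (unchanged) base model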
Performance
- Stops the LLM from making things up, especially around your domain.
- It can be far more consistent.
- Better at moderation: reduces unwanted information, especially about your company.
Privacy
- Runs in your VPC or on-premise
- Prevents data leakage and data breaches that might happen with off-the-shelf, third-party solutions. This is one way to keep the data you've been collecting for a while safe.
Cost
- cost transparency.
- Lower cost per request: fine-tuning a smaller LLM can actually help you do that.
- Greater control over costs and a couple of other factors as well.
Reliability
- control uptime
- lower latency. You can greatly reduce the latency for certain applications like autocomplete.
- Moderation: if you want the model to say "I'm sorry" to certain things, to say "I don't know" to certain things, or even to give a custom response, this is one way to actually provide those guardrails to the model.
Pre-training: the first step, before fine-tuning even happens.
Model at the start
- zero knowledge about the world: weights are completely random
- can't form English words: no language skills
- Next token prediction is the objective
- Trained on a giant corpus of text data, often scraped from the internet: "unlabeled". Even after many cleaning processes, a lot of manual work goes into making this dataset effective for model pre-training.
- Often called self-supervised learning, because the model is essentially supervising itself with next-token prediction (see the sketch below).
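To make "self-supervised next-token prediction" concrete, here is a minimal sketch (assumption: it re-uses the EleutherAI/pythia-70m model that appears later in these notes). Passing the input ids as labels makes the model compute the shifted next-token cross-entropy loss itself, with no human labels.
# Sketch: the pre-training objective is plain next-token prediction on raw text.
# (Assumption: EleutherAI/pythia-70m, the small model used later in these notes.)
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-70m")

text = "Fine-tuning specializes a general-purpose model."
inputs = tokenizer(text, return_tensors="pt")

# labels = input_ids: the model internally shifts them by one position and
# computes cross-entropy between its predictions and the actual next tokens.
with torch.no_grad():
    outputs = model(input_ids=inputs["input_ids"], labels=inputs["input_ids"])
print("Next-token prediction loss:", outputs.loss.item())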
After training
- Learns language
- Learns knowledge
It is just trying to predict the next token and it's reading the entire Internet's worth of data to do so.
- However, the datasets behind the closed-source models from large companies are not very public.
- Open source effort by EleutherAI to create a dataset called The Pile.
- The pre-training step is expensive and time-consuming, because the model has to go through all of this data and go from absolute randomness to understanding some of the text.
Pre-training is really the first step that gets you the base model; that base model isn't useful on its own yet.
You can then use fine-tuning to get a fine-tuned model, and even a fine-tuned model can keep going through further fine-tuning steps afterwards. So fine-tuning really is the step that comes after pre-training.
- Can also be self-supervised unlabeled data
- Can be "labeled" data you curated to make it much more structured for the model to learn about
- same training obj: next token prediction
One key thing that differentiates fine-tuning from pre-training is that much less data is needed.
Note: the definition of fine-tuning here is updating the weights of the entire model, not just part of it (vs. fine-tuning on ImageNet).
In summary, all we're doing is changing up the data so that it's more structured in a way, and the model can be more consistent in outputting and mimicking that structure.
Fine-tuning does behavior change, knowledge gain, or both:
- one giant category is just behavior change. You're changing the behavior of the model.
- learn to respond much more consistently
- learn to focus, e.g. moderation
- teasing out capability, e.g. better at conversation; before, we would have to do a lot of prompt engineering in order to tease that information out.
- gain knowledge
- Increase knowledge of new specific topics that are not in that base pre-trained model
- Correct old incorrect information
- just text in, text out for LLMs
- extracting text: put text in and you get less text out.
- reading: extracting keywords, topics
- route the chat, for example, to some API or otherwise
- agent capabilities: planning, reasoning, self-critique, tool use, etc.
- expansion: you put text in, and you get more text out
- writing: chatting, writing emails/code
- Task clarity is a key indicator of success: clarity really means knowing what good output looks like, what bad output looks like, and also what better output looks like.
- Identify tasks by just prompt-engineering a large LLM, and that could be ChatGPT.
- Find a task you see an LLM doing OK at.
- Pick one task.
- Get ~1,000 (the golden number) inputs and outputs for that task. Make sure these inputs and outputs are better than the OK results from that LLM before.
- Fine-tune a small LLM on this data.
Various ways of formatting your data
filename = "lamini_docs.jsonl"
instruction_dataset_df = pd.read_json(filename, lines=True)
instruction_dataset_df
examples = instruction_dataset_df.to_dict()
text = examples["question"][0] + examples["answer"][0]
text
if "question" in examples and "answer" in examples:
text = examples["question"][0] + examples["answer"][0]
elif "instruction" in examples and "response" in examples:
text = examples["instruction"][0] + examples["response"][0]
elif "input" in examples and "output" in examples:
text = examples["input"][0] + examples["output"][0]
else:
text = examples["text"][0]
prompt_template_qa = """### Question:
{question}
### Answer:
{answer}"""
question = examples["question"][0]
answer = examples["answer"][0]
text_with_prompt_template = prompt_template_qa.format(question=question, answer=answer)
text_with_prompt_template
prompt_template_q = """### Question:
{question}
### Answer:"""
num_examples = len(examples["question"])
finetuning_dataset_text_only = []
finetuning_dataset_question_answer = []
for i in range(num_examples):
    question = examples["question"][i]
    answer = examples["answer"][i]
    text_with_prompt_template_qa = prompt_template_qa.format(question=question, answer=answer)
    finetuning_dataset_text_only.append({"text": text_with_prompt_template_qa})
    text_with_prompt_template_q = prompt_template_q.format(question=question)
    finetuning_dataset_question_answer.append({"question": text_with_prompt_template_q, "answer": answer})
pprint(finetuning_dataset_text_only[0])
pprint(finetuning_dataset_question_answer[0])
Common ways of storing your data
import jsonlines
from datasets import load_dataset

with jsonlines.open('lamini_docs_processed.jsonl', 'w') as writer:
    writer.write_all(finetuning_dataset_question_answer)

finetuning_dataset_name = "lamini/lamini_docs"
finetuning_dataset = load_dataset(finetuning_dataset_name)
print(finetuning_dataset)
Instruction fine-tuning (a.k.a. instruction tuning, or instruction-following LLMs) is one type of fine-tuning. Other types include reasoning, routing, copilot (which is writing code), chat, and different agents.
It teaches the model to follow instructions and behave more like a chatbot, like going from GPT-3 to ChatGPT.
Some existing data is ready as-is, online:
- FAQs
- Customer support conversations
- Slack messages
It's really this dialogue dataset, or just instruction-response datasets.
if you don't have data, no problem:
- You can convert your existing data into something more like a question-answer or instruction-following format by using a prompt template, e.g. a README might be converted into question-answer pairs.
- You can also use another LLM to do this for you; a technique called Alpaca, from Stanford, uses ChatGPT to do this (see the sketch after these bullets).
- You can use a pipeline of different open-source models to do this as well.
- Instruction fine-tuning teaches this new behavior to the model, e.g. compare the base model's answer about the capital of France with the fine-tuned model's answer.
- The model can access its pre-existing knowledge (learned in the earlier pre-training step) and generalize instructions to other data not in the fine-tuning dataset, e.g. code. This is actually a finding from the ChatGPT paper: the model can answer questions about code even though there were no question-answer pairs about code in its instruction fine-tuning, because it's really expensive to get programmers to label datasets.
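As a rough sketch of the "use another LLM to create the data" idea: the ask_llm helper and the prompt wording below are hypothetical stand-ins (not the actual Alpaca pipeline), showing how documentation chunks could be turned into question-answer pairs.
# Hypothetical sketch: turn raw documentation chunks into Q&A pairs with an LLM.
# `ask_llm(prompt) -> str` stands in for whatever LLM API you have (assumption).
generation_prompt = """Read the documentation below and write one question a user
might ask about it, then answer it.

Documentation:
{chunk}

### Question:"""

def make_qa_pairs(doc_chunks, ask_llm):
    pairs = []
    for chunk in doc_chunks:
        completion = ask_llm(generation_prompt.format(chunk=chunk))
        # Expect the LLM to continue the template with "... ### Answer: ..."
        if "### Answer:" in completion:
            question, answer = completion.split("### Answer:", 1)
            pairs.append({"question": question.strip(), "answer": answer.strip()})
    return pairs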
It's a very iterative process to improve the model: data prep - training - evaluation. After you evaluate the model, you need to prep the data again to improve it.
For different types of fine-tuning, data prep is really where the differences are; training and evaluation are very similar.
import os
import lamini
lamini.api_url = os.getenv("POWERML__PRODUCTION__URL")
lamini.api_key = os.getenv("POWERML__PRODUCTION__KEY")
import itertools
import jsonlines
from datasets import load_dataset
from pprint import pprint
from llama import BasicModelRunner
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
instruction_tuned_dataset = load_dataset("tatsu-lab/alpaca", split="train", streaming=True)
m = 5
print("Instruction-tuned dataset:")
top_m = list(itertools.islice(instruction_tuned_dataset, m))
for j in top_m:
    print(j)
prompt_template_with_input = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction:
{instruction}
### Input:
{input}
### Response:"""
prompt_template_without_input = """Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
{instruction}
### Response:"""
processed_data = []
for j in top_m:
    if not j["input"]:
        processed_prompt = prompt_template_without_input.format(instruction=j["instruction"])
    else:
        processed_prompt = prompt_template_with_input.format(instruction=j["instruction"], input=j["input"])
    processed_data.append({"input": processed_prompt, "output": j["output"]})
pprint(processed_data[0])
with jsonlines.open('alpaca_processed.jsonl', 'w') as writer:
    writer.write_all(processed_data)
dataset_path_hf = "lamini/alpaca"
dataset_hf = load_dataset(dataset_path_hf)
print(dataset_hf)
non_instruct_model = BasicModelRunner("meta-llama/Llama-2-7b-hf")
non_instruct_output = non_instruct_model("Tell me how to train my dog to sit")
print("Not instruction-tuned output (Llama 2 Base):", non_instruct_output)
instruct_model = BasicModelRunner("meta-llama/Llama-2-7b-chat-hf")
instruct_output = instruct_model("Tell me how to train my dog to sit")
print("Instruction-tuned output (Llama 2): ", instruct_output)
chatgpt = BasicModelRunner("chat-gpt")
instruct_output_chatgpt = chatgpt("Tell me how to train my dog to sit")
print("Instruction-tuned output (ChatGPT): ", instruct_output_chatgpt)
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-70m")
def inference(text, model, tokenizer, max_input_tokens=1000, max_output_tokens=100):
    # Tokenize
    input_ids = tokenizer.encode(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=max_input_tokens
    )

    # Generate
    device = model.device
    generated_tokens_with_prompt = model.generate(
        input_ids=input_ids.to(device),
        max_length=max_output_tokens
    )

    # Decode
    generated_text_with_prompt = tokenizer.batch_decode(generated_tokens_with_prompt, skip_special_tokens=True)

    # Strip the prompt
    generated_text_answer = generated_text_with_prompt[0][len(text):]

    return generated_text_answer
#---------------------
finetuning_dataset_path = "lamini/lamini_docs"
finetuning_dataset = load_dataset(finetuning_dataset_path)
print(finetuning_dataset)
test_sample = finetuning_dataset["test"][0]
print(test_sample)
print(inference(test_sample["question"], model, tokenizer))
instruction_model = AutoModelForCausalLM.from_pretrained("lamini/lamini_docs_finetuned")
print(inference(test_sample["question"], instruction_model, tokenizer))
- Most important is high-quality data. Avoid garbage-in, garbage-out.
- Diversity: the data should cover a lot of aspects of your use case. If all your inputs and outputs are the same, the model can start to memorize them, and then it will just spout the same thing over and over again.
- Real vs. generated: generated data is easy to produce, but real data is very effective and helpful, especially for writing tasks. Generated data has certain patterns that can be detected, and training on the same patterns means the model doesn't learn new patterns or new ways of framing things.
- More data matters, but less than the above: pre-training has already handled much of that problem by learning from the internet.
- Collect instruction-response pairs
- Concatenate pairs(add prompt template, if applicable)
- Tokenize: pad and truncate. Padding gets everything to the right size going into the model: everything in a batch must be the same length, because you're operating with fixed-size tensors. Truncation is a strategy to make the encoded text shorter so that it actually fits into the model.
- Split into train/test
Take your text data and turn that into numbers that represent each of those pieces of text. It's not actually necessarily by word. It's based on the frequency of common character occurrences.
The "ing" token is very common in tokenizers; it happens in every single gerund, e.g. "ing" -> 278.
Encode and decode with the same tokenizer.
A tokenizer is really associated with a specific model, because that model was trained with it.
import pandas as pd
import datasets
from pprint import pprint
from transformers import AutoTokenizer
### AutoTokenizer, from Hugging Face's Transformers library, automatically finds the right tokenizer for your model when you just specify what the model is.
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")
text = "Hi, how are you?"
encoded_text = tokenizer(text)["input_ids"]
encoded_text
decoded_text = tokenizer.decode(encoded_text)
print("Decoded tokens back into text: ", decoded_text)
list_texts = ["Hi, how are you?", "I'm good", "Yes"]
encoded_texts = tokenizer(list_texts)
print("Encoded several texts: ", encoded_texts["input_ids"])
Padding: something that's really important for models is that everything in a batch is the same length, because you're operating with fixed-size tensors (the fundamental data structure for storing, representing, and changing data in ML). Truncation is a strategy to make the encoded text shorter so that it actually fits into the model.
Realistically, you want to use both padding and truncation, so set both.
tokenizer.pad_token = tokenizer.eos_token
encoded_texts_longest = tokenizer(list_texts, padding=True)
print("Using padding: ", encoded_texts_longest["input_ids"])
encoded_texts_truncation = tokenizer(list_texts, max_length=3, truncation=True)
print("Using truncation: ", encoded_texts_truncation["input_ids"])
tokenizer.truncation_side = "left"
encoded_texts_truncation_left = tokenizer(list_texts, max_length=3, truncation=True)
print("Using left-side truncation: ", encoded_texts_truncation_left["input_ids"])
encoded_texts_both = tokenizer(list_texts, max_length=3, truncation=True, padding=True)
print("Using both padding and truncation: ", encoded_texts_both["input_ids"])
import pandas as pd
filename = "lamini_docs.jsonl"
instruction_dataset_df = pd.read_json(filename, lines=True)
examples = instruction_dataset_df.to_dict()
if "question" in examples and "answer" in examples:
text = examples["question"][0] + examples["answer"][0]
elif "instruction" in examples and "response" in examples:
text = examples["instruction"][0] + examples["response"][0]
elif "input" in examples and "output" in examples:
text = examples["input"][0] + examples["output"][0]
else:
text = examples["text"][0]
prompt_template = """### Question:
{question}
### Answer:"""
num_examples = len(examples["question"])
finetuning_dataset = []
for i in range(num_examples):
    question = examples["question"][i]
    answer = examples["answer"][i]
    text_with_prompt_template = prompt_template.format(question=question)
    finetuning_dataset.append({"question": text_with_prompt_template, "answer": answer})
from pprint import pprint
print("One datapoint in the finetuning dataset:")
pprint(finetuning_dataset[0])
First, concatenate the question with the answer and run it through the tokenizer. The tensors are returned as a NumPy array here just to keep things simple, and padding is used because we don't yet know how long those tokens will actually be.
text = finetuning_dataset[0]["question"] + finetuning_dataset[0]["answer"]
tokenized_inputs = tokenizer(
    text,
    return_tensors="np",
    padding=True
)
print(tokenized_inputs["input_ids"])
Then figure out the minimum of the max length and the tokenized input length. Of course, you could always just pad to the longest sequence, or pad to the max length; that's what this handles.
max_length = 2048
max_length = min(
    tokenized_inputs["input_ids"].shape[1],
    max_length,
)
Then tokenize again, with truncation up to that max length.
tokenized_inputs = tokenizer(
    text,
    return_tensors="np",
    truncation=True,
    max_length=max_length
)
tokenized_inputs["input_ids"]
Turn above into a function
def tokenize_function(examples):
    if "question" in examples and "answer" in examples:
        text = examples["question"][0] + examples["answer"][0]
    elif "input" in examples and "output" in examples:
        text = examples["input"][0] + examples["output"][0]
    else:
        text = examples["text"][0]

    tokenizer.pad_token = tokenizer.eos_token
    tokenized_inputs = tokenizer(
        text,
        return_tensors="np",
        padding=True,
    )

    max_length = min(
        tokenized_inputs["input_ids"].shape[1],
        2048
    )
    tokenizer.truncation_side = "left"
    tokenized_inputs = tokenizer(
        text,
        return_tensors="np",
        truncation=True,
        max_length=max_length
    )
    return tokenized_inputs
Set the batch size to one to keep it simple, with batched=True and drop_last_batch=True. That's often done to help with mixed-size inputs, since the last batch might be a different size.
finetuning_dataset_loaded = datasets.load_dataset("json", data_files=filename, split="train")
tokenized_dataset = finetuning_dataset_loaded.map(
    tokenize_function,
    batched=True,
    batch_size=1,
    drop_last_batch=True
)
print(tokenized_dataset)
Add a labels column so that Hugging Face can handle it:
tokenized_dataset = tokenized_dataset.add_column("labels", tokenized_dataset["input_ids"])
Specify the test size as 10% of the data (you can change this depending on how big your dataset is). With shuffle=True, the order of the dataset is randomized.
split_dataset = tokenized_dataset.train_test_split(test_size=0.1, shuffle=True, seed=123)
print(split_dataset)
finetuning_dataset_path = "lamini/lamini_docs"
finetuning_dataset = datasets.load_dataset(finetuning_dataset_path)
print(finetuning_dataset)
taylor_swift_dataset = "lamini/taylor_swift"
bts_dataset = "lamini/bts"
open_llms = "lamini/open_llms"
dataset_swiftie = datasets.load_dataset(taylor_swift_dataset)
print(dataset_swiftie["train"][1])
# This is how to push your own dataset to your Huggingface hub
# !pip install huggingface_hub
# !huggingface-cli login
# split_dataset.push_to_hub(dataset_path_hf)
Training process is actually quite similar to other neural networks.
- Add the training data in at the top
- Calculate the loss: the model predicts something totally off at the beginning, and you compute the loss against the actual response it was supposed to give
- Backpropagate through the model
- Update the weights to improve the model
Hyperparameters: learning rate, learning-rate scheduler, various optimizer hyperparameters, etc.
Key words:
- PyTorch
- An epoch is a pass over your entire data set.
You might go over your entire dataset multiple times. You load it up in batches (the batches you saw when tokenizing the data), put each batch through your model to get outputs, compute the loss from your model, take a backwards step, and update your optimizer.
General chunks of the training process in PyTorch:
for epoch in range(num_epochs):
    for batch in train_dataloader:
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()  # reset gradients before the next batch
from llama import BasicModelRunner
model = BasicModelRunner("EleutherAI/pythia-410m")
model.load_data_from_jsonlines("lamini_docs.jsonl", input_key="question", output_key="answer")
model.train(is_public=True)
- Choose base model.
- Load data.
- Train it. Returns a model ID, dashboard, and playground interface.
Let's look under the hood at the core code running this! This is the open core of Lamini's llama library :)
import os
import lamini
lamini.api_url = os.getenv("POWERML__PRODUCTION__URL")
lamini.api_key = os.getenv("POWERML__PRODUCTION__KEY")
import datasets
import tempfile
import logging
import random
import config
import os
import yaml
import time
import torch
import transformers
import pandas as pd
import jsonlines
from utilities import *
from transformers import AutoTokenizer
from transformers import AutoModelForCausalLM
from transformers import TrainingArguments
from llama import BasicModelRunner
logger = logging.getLogger(__name__)
global_config = None
# local
dataset_name = "lamini_docs.jsonl"
dataset_path = f"/content/{dataset_name}"
use_hf = False
# huggingface
dataset_path = "lamini/lamini_docs"
use_hf = True
model_name = "EleutherAI/pythia-70m"
training_config = {
    "model": {
        "pretrained_name": model_name,
        "max_length": 2048
    },
    "datasets": {
        "use_hf": use_hf,
        "path": dataset_path
    },
    "verbose": True
}
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
train_dataset, test_dataset = tokenize_and_split_data(training_config, tokenizer)
print(train_dataset)
print(test_dataset)
base_model = AutoModelForCausalLM.from_pretrained(model_name)
device_count = torch.cuda.device_count()
if device_count > 0:
    logger.debug("Select GPU device")
    device = torch.device("cuda")
else:
    logger.debug("Select CPU device")
    device = torch.device("cpu")
base_model.to(device)
def inference(text, model, tokenizer, max_input_tokens=1000, max_output_tokens=100):
    # Tokenize
    input_ids = tokenizer.encode(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=max_input_tokens
    )

    # Generate based on tokens
    device = model.device
    generated_tokens_with_prompt = model.generate(
        # e.g. on GPU so the model can see it
        input_ids=input_ids.to(device),
        max_length=max_output_tokens
    )

    # Decode
    generated_text_with_prompt = tokenizer.batch_decode(generated_tokens_with_prompt, skip_special_tokens=True)

    # Strip the prompt
    generated_text_answer = generated_text_with_prompt[0][len(text):]

    return generated_text_answer
test_text = test_dataset[0]['question']
print("Question input (test):", test_text)
print(f"Correct answer from Lamini docs: {test_dataset[0]['answer']}")
print("Model's answer: ")
print(inference(test_text, base_model, tokenizer))
# A step is a batch of training data. And so if your batch size is one, it's just one data point. If your batch size is 2,000, it's 2,000 data points.
max_steps = 3
# best practice is to add a timestamp
trained_model_name = f"lamini_docs_{max_steps}_steps"
output_dir = trained_model_name
training_args = TrainingArguments(
    # Learning rate
    learning_rate=1.0e-5,
    # Number of training epochs
    num_train_epochs=1,
    # Max steps to train for (each step is a batch of data)
    # Overrides num_train_epochs, if not -1
    max_steps=max_steps,
    # Batch size for training
    per_device_train_batch_size=1,
    # Directory to save model checkpoints
    output_dir=output_dir,
    # Other arguments (most are default)
    overwrite_output_dir=False,  # Overwrite the content of the output directory
    disable_tqdm=False,  # Disable progress bars
    eval_steps=120,  # Number of update steps between two evaluations
    save_steps=120,  # After # steps model is saved
    warmup_steps=1,  # Number of warmup steps for learning rate scheduler
    per_device_eval_batch_size=1,  # Batch size for evaluation
    evaluation_strategy="steps",
    logging_strategy="steps",
    logging_steps=1,
    optim="adafactor",
    gradient_accumulation_steps=4,
    gradient_checkpointing=False,
    # Parameters for early stopping
    load_best_model_at_end=True,
    save_total_limit=1,
    metric_for_best_model="eval_loss",
    greater_is_better=False
)
model_flops = (
    base_model.floating_point_ops(
        {
            "input_ids": torch.zeros(
                (1, training_config["model"]["max_length"])
            )
        }
    )
    * training_args.gradient_accumulation_steps
)
print(base_model)
print("Memory footprint", base_model.get_memory_footprint() / 1e9, "GB")
print("Flops", model_flops / 1e9, "GFLOPs")
# Trainer here comes from the utilities module imported above (it accepts
# model_flops and total_steps on top of the standard Hugging Face arguments).
trainer = Trainer(
    model=base_model,
    model_flops=model_flops,
    total_steps=max_steps,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)
training_output = trainer.train()
The loss will go down in later steps.
save_dir = f'{output_dir}/final'
trainer.save_model(save_dir)
print("Saved model to:", save_dir)
finetuned_slightly_model = AutoModelForCausalLM.from_pretrained(save_dir, local_files_only=True)
finetuned_slightly_model.to(device)
test_question = test_dataset[0]['question']
print("Question input (test):", test_question)
print("Finetuned slightly model's answer: ")
print(inference(test_question, finetuned_slightly_model, tokenizer))
test_answer = test_dataset[0]['answer']
print("Target answer output (test):", test_answer)
Note that 3 data points (out of the 1,000+ data points in the training data) do not make the model better.
finetuned_longer_model = AutoModelForCausalLM.from_pretrained("lamini/lamini_docs_finetuned")
tokenizer = AutoTokenizer.from_pretrained("lamini/lamini_docs_finetuned")
finetuned_longer_model.to(device)
print("Finetuned longer model's answer: ")
print(inference(test_question, finetuned_longer_model, tokenizer))
model_name_to_id
bigger_finetuned_model = BasicModelRunner(model_name_to_id["bigger_model_name"])
bigger_finetuned_output = bigger_finetuned_model(test_question)
print("Bigger (2.8B) finetuned model (test): ", bigger_finetuned_output)
This is similar to the moderation API in ChatGPT.
count = 0
for i in range(len(train_dataset)):
    if "keep the discussion relevant to Lamini" in train_dataset[i]["answer"]:
        print(i, train_dataset[i]["question"], train_dataset[i]["answer"])
        count += 1
print(count)
First, try the non-finetuned base model:
base_tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")
base_model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-70m")
print(inference("What do you think of Mars?", base_model, base_tokenizer))
print(inference("What do you think of Mars?", finetuned_longer_model, tokenizer))
model = BasicModelRunner("EleutherAI/pythia-410m") # largest in free tier
model.load_data_from_jsonlines("lamini_docs.jsonl", input_key="question", output_key="answer")
model.train(is_public=True)
out = model.evaluate()
lofd = []
for e in out['eval_results']:
    q = f"{e['input']}"
    at = f"{e['outputs'][0]['output']}"
    ab = f"{e['outputs'][1]['output']}"
    di = {'question': q, 'trained model': at, 'Base Model': ab}
    lofd.append(di)
df = pd.DataFrame.from_dict(lofd)
style_df = df.style.set_properties(**{'text-align': 'left'})
style_df = style_df.set_properties(**{"vertical-align": "text-top"})
style_df
This is a really important step, because AI is all about iteration; it helps you improve your model over time. Evaluating generative models is notoriously very difficult: metrics have trouble keeping up with the improved performance of these models.
- As a result, human evaluation is often the most reliable way: have experts who understand the domain actually assess the outputs.
- Good test data is crucial
- high-quality
- accurate
- generalized: covers a lot of different test cases
- not seen in training data
- Another popular way that is emerging is Elo comparison: almost like an A/B test between two models, or a tournament across multiple models (see the sketch below).
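For reference, a minimal sketch of an Elo-style rating update for those pairwise comparisons (the K-factor and starting ratings are arbitrary assumptions):
# Elo-style update after one head-to-head comparison between two models.
def elo_update(rating_a, rating_b, a_won, k=32):
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1 - score_a) - (1 - expected_a))
    return new_a, new_b

# e.g. model A wins a human preference vote against model B
print(elo_update(1000, 1000, a_won=True))  # -> (1016.0, 984.0)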
Another framework takes a bunch of different possible evaluation methods and averages them all together to rank models. This one is developed by EleutherAI, and it's a set of different benchmarks put together:
- ARC. It's a set of grade school questions.
- HellaSwag is a test of common sense.
- MMLU covers a lot of elementary school subjects.
- TruthfulQA measures the model's propensity to reproduce falsehoods that you can commonly find online.
- Understand base model behavior before fine-tuning.
- Categorize errors: iterate on the data to fix these problems in data space.
Understand the types of errors that are very common, and go after the very common errors and the very catastrophic errors first.
With the pre-trained base model, you can already perform error analysis before you even fine-tune the model. This helps you understand and characterize how the base model is doing, so that you know what kind of data will give it the biggest lift from fine-tuning.
And speaking of repetition, these models do tend to be very repetitive. One way to fix that is with more explicit stop tokens and the prompt templates you saw, but of course also by making sure your dataset includes examples that don't have as much repetition and do have diversity (a decoding-time sketch follows below).
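As an aside, some repetition can also be suppressed at decoding time. The snippet below is a sketch using standard Hugging Face generate() options (assuming a causal LM and tokenizer like the pythia pair loaded elsewhere in these notes); it is not the course's fix, which is to improve the data.
# Decoding-time mitigations for repetition (sketch; assumes `model` and `tokenizer`
# are a Hugging Face causal LM setup like the pythia examples above).
input_ids = tokenizer("What do you think of Mars?", return_tensors="pt").input_ids
generated = model.generate(
    input_ids=input_ids,
    max_new_tokens=100,
    no_repeat_ngram_size=3,                # block verbatim 3-gram repeats
    repetition_penalty=1.2,                # down-weight already-generated tokens
    eos_token_id=tokenizer.eos_token_id,   # stop at the end-of-sequence token
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])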
Run your model on your entire test dataset in a batched way that's very efficient on GPUs: load up your model, instantiate it, and have it run on a list containing your entire test dataset; it's then automatically batched on GPUs really quickly.
Technically, there are very few steps to run it on GPUs elsewhere (i.e. on Lamini).
# Let's look again under the hood! This is the open core code of Lamini's llama library :)
finetuned_model = BasicModelRunner(
    "lamini/lamini_docs_finetuned"
)
finetuned_output = finetuned_model(
    test_dataset_list  # batched!
)
Print a question-answer pair:
import datasets
import tempfile
import logging
import random
import config
import os
import yaml
import difflib
import pandas as pd
import transformers
import torch
from tqdm import tqdm
from utilities import *
from transformers import AutoTokenizer, AutoModelForCausalLM
logger = logging.getLogger(__name__)
global_config = None
dataset = datasets.load_dataset("lamini/lamini_docs")
test_dataset = dataset["test"]
print(test_dataset[0]["question"])
print(test_dataset[0]["answer"])
Load up the model to run it over this entire dataset. This is the same as before: pull the actual fine-tuned model from Hugging Face.
model_name = "lamini/lamini_docs_finetuned"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
# just for testing purposes
def is_exact_match(a, b):
    return a.strip() == b.strip()
# An important thing when you're running a model in evaluation mode is to call model.eval() to make sure things like dropout are disabled.
model.eval()
# to generate output.
def inference(text, model, tokenizer, max_input_tokens=1000, max_output_tokens=100):
    # Tokenize
    tokenizer.pad_token = tokenizer.eos_token
    input_ids = tokenizer.encode(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=max_input_tokens
    )

    # Generate
    device = model.device
    generated_tokens_with_prompt = model.generate(
        input_ids=input_ids.to(device),
        max_length=max_output_tokens
    )

    # Decode
    generated_text_with_prompt = tokenizer.batch_decode(generated_tokens_with_prompt, skip_special_tokens=True)

    # Strip the prompt
    generated_text_answer = generated_text_with_prompt[0][len(text):]

    return generated_text_answer
Note that "dropout" refers to the practice of disregarding certain nodes in a layer at random during training. Dropout is a regularization approach that prevents overfitting by ensuring that no units are co-dependent on one another. In large language models, dropout randomly drops out (i.e., sets to zero) some of the units in the hidden layers during each training iteration. This helps prevent the model from relying too heavily on any one feature or unit and encourages more robust representations. Dropout has been shown to be effective in improving the performance of large language models and is commonly used in conjunction with other regularization techniques such as L1 and L2 regularization.
test_question = test_dataset[0]["question"]
generated_answer = inference(test_question, model, tokenizer)
print(test_question)
print(generated_answer)
answer = test_dataset[0]["answer"]
print(answer)
exact_match = is_exact_match(generated_answer, answer)
print(exact_match)
Sometimes people will also take these outputs and put them into another LLM, asking it to grade how close the output really is.
You can also use embeddings: embed the actual answer and the generated answer and see how close they are in distance (see the sketch below).
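A minimal sketch of that embedding-based check, re-using the generated_answer and answer from above (assumption: the sentence-transformers library and model name are illustrative choices, not part of the course code):
# Compare a generated answer to the target answer by embedding cosine similarity.
# (Assumption: sentence-transformers is installed; the model choice is illustrative.)
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def embedding_similarity(generated, target):
    vectors = embedder.encode([generated, target], convert_to_tensor=True)
    return util.cos_sim(vectors[0], vectors[1]).item()

print(embedding_similarity(generated_answer, answer))  # closer to 1.0 = more similar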
n = 10
metrics = {'exact_matches': []}
predictions = []
for i, item in tqdm(enumerate(test_dataset)):
    print("i Evaluating: " + str(item))
    question = item['question']
    answer = item['answer']

    try:
        predicted_answer = inference(question, model, tokenizer)
    except:
        continue
    predictions.append([predicted_answer, answer])

    # fixed: exact_match = is_exact_match(generated_answer, answer)
    exact_match = is_exact_match(predicted_answer, answer)
    metrics['exact_matches'].append(exact_match)

    if i > n and n != -1:
        break
print('Number of exact matches: ', sum(metrics['exact_matches']))
df = pd.DataFrame(predictions, columns=["predicted_answer", "target_answer"])
print(df)
What's been found to be significantly more effective, by a large margin, is manual inspection of a very curated test set.
evaluation_dataset_path = "lamini/lamini_docs_evaluation"
evaluation_dataset = datasets.load_dataset(evaluation_dataset_path)
pd.DataFrame(evaluation_dataset)
This can take several minutes
!python lm-evaluation-harness/main.py --model hf-causal --model_args pretrained=lamini/lamini_docs_finetuned --tasks arc_easy --device cpu
Notes on the practical checklist steps: #3, generate some data or use a prompt template to create more; #4, to get a sense of where the performance is at with this model; #5, to understand how much data actually influences where the model is going; #9, on that more complex task.
For the tasks you fine-tune on, there are:
- Extraction (reading) tasks
- Expansion (writing) tasks, which are a lot harder and more expensive: chatting, writing emails, writing code, because more tokens are produced by the model.
Harder tasks tend to need larger models to handle them.
A combination of tasks is harder than one task; it could mean having an agent be flexible enough to do several things at once, or in one step.
Model size determines the compute requirements.
Training needs far more memory than running a model for inference, because it has to store the gradients and the optimizer states in addition to the weights (see the rough estimate below).
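A rough back-of-the-envelope using standard rule-of-thumb byte counts (these numbers are assumptions, not figures from the course): fp16 inference needs about 2 bytes per parameter, while full fine-tuning with Adam in mixed precision needs roughly 16 bytes per parameter once gradients and optimizer states are included.
# Rough memory estimate: inference vs. full fine-tuning with Adam.
# (Rule-of-thumb byte counts; activations and framework overhead not included.)
def memory_gb(num_params, bytes_per_param):
    return num_params * bytes_per_param / 1e9

params_7b = 7e9  # e.g. a 7B-parameter model
print(f"inference, fp16 weights:            {memory_gb(params_7b, 2):.0f} GB")
print(f"training, weights+grads+Adam state: {memory_gb(params_7b, 16):.0f} GB")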
PEFT (parameter-efficient fine-tuning) is a set of different methods that help you be much more efficient in how you use your parameters and train your models. https://arxiv.org/abs/2303.15647
I like LoRA, which stands for low-rank adaptation. What LoRA does is reduce the number of parameters and weights you have to train by a huge amount.
LoRa: low-rank adaptation of LLMs
- Fewer trainable parameters: For GPT-3, for example, they found that they could reduce it by 10,000x, which resulted in 3x less memory needed from the GPU.
- Less GPU memory
- Accuracy slightly below full fine-tuning
- Same inference latency
Train new weights in some layers and freeze the main pre-trained weights:
- New weights: rank-decomposition matrices of the change to the original weights.
- At inference time, merge them back with the main weights to get the fine-tuned model more efficiently.
Use LoRA for adapting to new, different tasks: you could train one LoRA adapter on one customer's data and another on a different customer's data, then merge each in at inference time when you need it (see the sketch below).
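As a sketch of what LoRA looks like in code, assuming the Hugging Face peft library (the rank, alpha, and target modules below are illustrative choices, not the course's settings):
# LoRA sketch with Hugging Face peft (rank/alpha/target modules are illustrative).
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-70m")

lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank decomposition matrices
    lora_alpha=16,                        # scaling applied to the low-rank update
    target_modules=["query_key_value"],   # attention projections in Pythia (GPT-NeoX)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()  # only the small A/B matrices are trainable

# After training, merge_and_unload() folds the LoRA weights back into the base
# model, so inference latency is the same as the original model.
# merged_model = peft_model.merge_and_unload()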