Project Members: Sanjay Sharma (sgs2185) and George Tamer (gyt2107)
- Original Llama code: llama/
- Our code: high_perf
The Llama model can be downloaded from Facebook's official repo: https://github.com/facebookresearch/llama. We downloaded the model from there by filling out Facebook's request form, which gave us access to download it. We then created our own repo (which includes the Llama model) at this link: https://github.com/gtamer2/hpml_llama
In order to download the model weights and tokenizer, please visit the Meta website and accept the Meta License.
Once your request is approved, you will receive a signed URL over email. Then run the download.sh script, passing the URL provided when prompted to start the download.
Pre-requisites: make sure you have `wget` and `md5sum` installed. Then run the script: `./download.sh`
Keep in mind that the links expire after 24 hours and a certain number of downloads. If you start seeing errors such as `403: Forbidden`, you can always re-request a link.
Please follow the steps below to get up and running with our project.
- If needed, provision a Virtual Machine with an Nvidia GPU which has at least 16GB of memory and a host machine with at least 30GB of memory. We provisioned a VM from GCP Compute Engine with 8 Intel CPUs, 30GB of RAM, and an Nvidia P100 GPU with 16GB of memory.
- SSH into your machine.
- Activate a conda environment or Python virtual environment with PyTorch and CUDA available (a quick sanity check is sketched after this list).
- Clone this repository, or a fork of it.
- Before your first time running any scripts, install dependencies by running:
  - `pip install -e .`
  - `pip install -r requirements.txt`
- Follow the details below to run the benchmarks and optimizations:
  - Run the inference benchmarks for latency and truthfulness
  - Run the model pruning script
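Before running the benchmarks, it can help to confirm that PyTorch sees the GPU and that it has roughly the 16GB of memory assumed above. The snippet below is a minimal sanity-check sketch, not one of this repo's scripts:

```python
# Hypothetical sanity check (not part of this repo): confirm that PyTorch
# can see a CUDA device and report how much memory it has.
import torch

def main():
    assert torch.cuda.is_available(), "CUDA is not available in this environment"
    props = torch.cuda.get_device_properties(torch.cuda.current_device())
    total_gib = props.total_memory / 1024**3
    print(f"GPU: {props.name}, total memory: {total_gib:.1f} GiB")
    if total_gib < 15:
        print("Warning: a 16GB GPU (e.g. a P100) is the assumed minimum for llama-2-7b inference here")

if __name__ == "__main__":
    main()
```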
Run the inference benchmarks for latency and truthfulness
The file inference_benchmark.py benchmarks the code and profiles per-epoch and average performance for data-loading time, inference time, and total epoch time. We also profile CUDA memory usage.
You can run this from the root of the repository with the following:
torchrun inference_benchmark.py --ckpt_dir llama-2-7b/ --tokenizer_path tokenizer.model --max_seq_len 512 --max_batch_size 6
In this command, you run inference_benchmark.py with command-line arguments specifying the checkpoint directory of the llama-2-7b model, the path to the tokenizer file, the max sequence length, and the max batch size.
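As a rough illustration of the measurements described above (inference time and peak CUDA memory), the pattern below shows how such timing is typically done with `torch.cuda.synchronize` and the CUDA memory counters. It assumes the `generator` object returned by the reference `Llama.build(...)` call and is only a sketch, not the actual contents of inference_benchmark.py:

```python
# Sketch of a timed inference step; the real inference_benchmark.py in this
# repo may be structured differently.
import time
import torch

def timed_completion(generator, prompts, max_gen_len=64):
    """Time one batch of text completion and report peak CUDA memory."""
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    start = time.perf_counter()
    results = generator.text_completion(prompts, max_gen_len=max_gen_len)
    torch.cuda.synchronize()
    inference_time = time.perf_counter() - start
    peak_mem_gib = torch.cuda.max_memory_allocated() / 1024**3
    print(f"inference time: {inference_time:.3f}s, peak CUDA memory: {peak_mem_gib:.2f} GiB")
    return results
```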
We used the TruthfulQA language model benchmark to compare the model's output quality before and after applying latency optimizations.
The script TODO runs the benchmark and outputs a score.
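For reference, the sketch below shows one way TruthfulQA questions could be fed to the model. It assumes the Hugging Face `datasets` package and the `generator.text_completion(...)` API from the reference Llama code, and it only collects completions; the actual scoring used by our script (e.g. judge-model or multiple-choice evaluation) may differ:

```python
# Illustration only: pull TruthfulQA questions and collect model completions.
# Real TruthfulQA scoring uses a judge model or multiple-choice likelihoods,
# which is omitted here.
from datasets import load_dataset

def collect_truthfulqa_answers(generator, limit=20, max_gen_len=64):
    # The "generation" config provides free-form questions plus reference answers.
    ds = load_dataset("truthful_qa", "generation", split="validation")
    answers = []
    for row in ds.select(range(limit)):
        out = generator.text_completion([row["question"]], max_gen_len=max_gen_len)
        answers.append((row["question"], out[0]["generation"], row["best_answer"]))
    return answers
```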
The file prune_model.py prunes one transformer block of the Llama transformer at a time. You can run it with: torchrun prune_model.py <layer index to prune>
Alternatively, you can prune the entire model by running the script
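To illustrate the idea of pruning a single transformer block, here is a minimal sketch using PyTorch's built-in `torch.nn.utils.prune` utilities. The attribute name `model.layers` follows the reference Llama code; the actual prune_model.py may select and prune modules differently:

```python
# Sketch of per-block magnitude pruning with torch.nn.utils.prune.
import torch.nn as nn
import torch.nn.utils.prune as prune

def prune_block(model, layer_idx, amount=0.3):
    """L1-unstructured-prune every 2-D weight matrix in one transformer block."""
    block = model.layers[layer_idx]
    for module in block.modules():
        weight = getattr(module, "weight", None)
        if isinstance(weight, nn.Parameter) and weight.dim() == 2:
            prune.l1_unstructured(module, name="weight", amount=amount)
            prune.remove(module, "weight")  # fold the pruning mask into the weight
    return model
```

With the reference Llama wrapper, the module to pass in would be `generator.model`, e.g. `prune_block(generator.model, 5)`.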
If you would like to quantize the model
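One possible approach (an assumption on our part, not necessarily what this repo implements) is PyTorch's post-training dynamic quantization, sketched below. It converts standard nn.Linear layers to int8 and runs on CPU, so it would not directly apply to the fairscale parallel layers used in the reference Llama code without first converting them:

```python
# Sketch of post-training dynamic quantization with PyTorch's built-in
# utilities; not necessarily the method used in this repo.
import torch

def quantize_linear_layers(model):
    """Replace nn.Linear layers with dynamically quantized int8 versions (CPU only)."""
    model = model.cpu().eval()
    return torch.ao.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8
    )
```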