Statistical Calibrated Activation Pruning (SCAP)

This repo contains the reference code for "Post-Training Statistical Calibration for Higher Activation Sparsity".

Abstract

We present Statistical Calibrated Activation Pruning (SCAP), a post-training activation pruning framework that (1) generalizes sparsification by input activations of Fully-Connected layers for generic and flexible application across Transformers, and (2) features a simple Mode-Centering technique to pre-calibrate activation distributions for maximizing post-training sparsity. Our results demonstrate robust Pareto efficiency compared to prior methods, translating to a 1.5× additional LLM decoding speedup against CATS at iso model quality. SCAP effectiveness is empirically verified across a wide range of models, including recent Transformer Decoders, MoE, Mamba2, Encoding Transformer, and pre-quantized models, highlighting its practicality and scalability.
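The idea of pruning the input activations of a Fully-Connected layer around a pre-calibrated mode can be sketched in a few lines. This is a minimal pure-Python illustration, not the repo's implementation: the function names are hypothetical, and folding the constant shift into the bias is an assumption about how mode centering keeps the layer output unchanged when nothing is pruned.

```python
def prune_centered(x, mode, threshold):
    """Shift x by the mode, then zero entries whose magnitude is at most threshold."""
    return [xi - mode if abs(xi - mode) > threshold else 0.0 for xi in x]


def linear(weight, x, bias):
    """Dense y = W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def linear_mode_centered(weight, x, bias, mode, threshold):
    """FC layer whose input is pruned around `mode` instead of around zero.

    Uses W x = W (x - mode) + mode * rowsum(W), so the constant term can be
    folded into the bias ahead of time (assumption for this sketch).
    """
    shifted_bias = [b + mode * sum(row) for row, b in zip(weight, bias)]
    return linear(weight, prune_centered(x, mode, threshold), shifted_bias)


W = [[1.0, 2.0], [3.0, 4.0]]
b = [0.0, 0.0]
x = [0.5, 2.0]

dense = linear(W, x, b)
centered = linear_mode_centered(W, x, b, mode=0.5, threshold=0.0)
```

With a threshold of zero the centered layer reproduces the dense output; raising the threshold zeroes more of the shifted input, trading accuracy for sparsity.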

Setup

Please follow the steps below.

# recommended python version: 3.10.13
python -m venv ./scap_env
source ./scap_env/bin/activate

# install torch
pip install torch==2.3.1 --index-url https://download.pytorch.org/whl/cu121

# install dependencies
pip install transformers==4.44.0 datasets==2.21.0 accelerate tqdm rich seaborn matplotlib wheel \
    git+https://github.com/EleutherAI/lm-evaluation-harness.git@906ef948dc8dbb4c84e1bb0f2861b1aba30ab533

# install gemv kernel
pip install triton "git+https://github.com/ScalingIntelligence/CATS.git@0bda7708b835f20c59f4dd59d3d32b0c5f2f6376#egg=flash_gemv&subdirectory=flash_gemv"

Reproducing the results

1. Run calibration

Get the calibrated thresholds of SCAP for each model and sparsity config.

bash scripts/01.calibration.bash

You can skip this calibration step, because the configs for the following models are already uploaded to the repo.

| Model ID | Config in the bash | Up/gate sparsity | Down sparsity |
| --- | --- | --- | --- |
| meta-llama/Llama-2-7b-hf | up,zero,0.35,gate,zero,0.35,down,zero,0.55 | 35% without mode centering | 55% without mode centering |
| mistralai/Mistral-7B-v0.1 | up,zero,0.3,gate,zero,0.3,down,zero,0.7 | 30% without mode centering | 70% without mode centering |
| mosaicml/mpt-7b | down,kde,0.5 | / | 50% with kde peak as mode |
| tiiuae/falcon-7b | down,median,0.5 | / | 50% with median as mode |
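The config strings in the table above appear to be flat lists of (layer, mode method, target sparsity) triples. The helper below is a hypothetical sketch of reading them under that assumption; `parse_sparsity_config` is not a function from the repo, and the scripts may parse the string differently.

```python
def parse_sparsity_config(config):
    """Split 'layer,method,sparsity,...' into a dict keyed by layer name.

    Assumes the string is a sequence of triples, as the table suggests.
    """
    fields = config.split(",")
    if len(fields) % 3 != 0:
        raise ValueError(f"expected triples of fields, got {len(fields)} fields")
    parsed = {}
    for i in range(0, len(fields), 3):
        layer, mode_method, sparsity = fields[i:i + 3]
        parsed[layer] = {"mode": mode_method, "sparsity": float(sparsity)}
    return parsed


llama_cfg = parse_sparsity_config("up,zero,0.35,gate,zero,0.35,down,zero,0.55")
mpt_cfg = parse_sparsity_config("down,kde,0.5")
```

Read this way, the Llama-2 entry targets 35% sparsity on the up and gate inputs and 55% on the down input, all with the mode fixed at zero, which matches the "without mode centering" columns.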

The resulting calibrated_thresholds.json file in the results/scap/ folder lists the mode and threshold for each FFN layer specified in the config.
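How a mode and threshold pair might be derived from calibration samples can be sketched as follows. This is an illustration under stated assumptions, not the calibration script: the threshold is taken as an empirical quantile of the absolute deviation from the mode, only the `zero` and `median` mode methods are shown (the `kde` option is left out for brevity), and both function names are hypothetical.

```python
import statistics


def estimate_mode(samples, method):
    """Return the centering point for the given method name."""
    if method == "zero":
        return 0.0
    if method == "median":
        return statistics.median(samples)
    raise ValueError(f"method not covered in this sketch: {method}")


def calibrate_threshold(samples, method, target_sparsity):
    """Pick a threshold so that roughly target_sparsity of samples fall within it.

    Entries with |x - mode| <= threshold would be pruned; ties at the
    threshold can push the actual sparsity slightly above the target.
    """
    mode = estimate_mode(samples, method)
    deviations = sorted(abs(s - mode) for s in samples)
    k = round(target_sparsity * len(deviations))  # number of entries to prune
    threshold = deviations[k - 1] if k > 0 else 0.0
    return mode, threshold


samples = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
zero_mode, zero_thr = calibrate_threshold(samples, "zero", 0.5)
med_mode, med_thr = calibrate_threshold(samples, "median", 0.5)
```

On this toy data, centering on the median reaches the same target sparsity with a smaller threshold than centering on zero, which is the intuition behind calibrating the mode first.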

2. Evaluation on zero-shot tasks

Evaluate the model on the zero-shot tasks listed in the paper: winogrande, piqa, sciq, hellaswag, boolq, arc_easy, and arc_challenge. Results are saved in the results/scap/ folder.

bash scripts/02.evaluate_zero_shot_tasks.bash

The resulting evaluation_results.json file contains (1) the evaluation metrics for each task and (2) the average actual input sparsity for each layer.
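As a rough illustration of what "actual input sparsity" means, the sketch below counts the fraction of exactly-zero entries in each pruned input and averages it over several inputs. The function names are hypothetical and the evaluation script may aggregate differently, for example per token or per batch.

```python
def sparsity(vector):
    """Fraction of entries that are exactly zero."""
    return sum(1 for v in vector if v == 0.0) / len(vector)


def average_sparsity(vectors):
    """Mean of the per-vector sparsity over all observed inputs of one layer."""
    values = [sparsity(v) for v in vectors]
    return sum(values) / len(values)


layer_inputs = [
    [0.0, 1.5, 0.0, -2.0],  # 2 of 4 entries pruned
    [0.0, 0.0, 0.0, 0.7],   # 3 of 4 entries pruned
]
layer_sparsity = average_sparsity(layer_inputs)
```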

3. Inference with sparse kernel

This step demonstrates inference with SCAP-optimized models using the sparse GEMV kernel.

bash scripts/03.inference_demo.bash
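Why a sparse input speeds up a matrix-vector product can be shown with a small pure-Python sketch. It is not the Triton kernel installed during setup; `sparse_gemv` is a hypothetical name, and the point is only that columns matching zero activations never need to be read.

```python
def sparse_gemv(weight, x):
    """Compute y = W x while visiting only the columns j where x[j] != 0.

    Returns the result and the number of columns actually used.
    """
    active = [j for j, xj in enumerate(x) if xj != 0.0]
    y = [sum(row[j] * x[j] for j in active) for row in weight]
    return y, len(active)


W = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
x = [0.0, 2.0, 0.0]  # two of three activations were pruned

y, columns_used = sparse_gemv(W, x)
```

Here only one of three weight columns is touched, so the work and memory traffic shrink roughly in proportion to the input sparsity.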

Acknowledgement

This work builds on CATS, which we believe in turn extends DejaVu. Credit goes to the original authors of these projects.
