Statistical Calibrated Activation Pruning (SCAP)

This repo contains the reference code for "Post-Training Statistical Calibration for Higher Activation Sparsity".

Abstract

We present Statistical Calibrated Activation Pruning (SCAP), a post-training activation pruning framework that (1) generalizes sparsification by input activations of Fully-Connected layers for generic and flexible application across Transformers, and (2) features a simple Mode-Centering technique to pre-calibrate activation distributions for maximizing post-training sparsity. Our results demonstrate robust Pareto efficiency compared to prior methods, translating to a 1.5× additional LLM decoding speedup against CATS at iso model quality. SCAP effectiveness is empirically verified across a wide range of models, including recent Transformer Decoders, MoE, Mamba2, Encoding Transformer, and pre-quantized models, highlighting its practicality and scalability.
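The calibration idea in the abstract can be illustrated with a small, dependency-free sketch. It assumes the pruning threshold is picked as a quantile of the absolute deviation from a chosen center (zero, or the median as a stand-in for the mode). The function names `calibrate` and `achieved_sparsity` are hypothetical and are not taken from `calibrate.py`; the repo's actual procedure may differ.

```python
import random
import statistics


def calibrate(samples, target_sparsity, mode="zero"):
    """Return (center, threshold) for one layer input.

    Assumption: the threshold is the `target_sparsity` quantile of
    |x - center| over the calibration samples.
    """
    if not 0.0 <= target_sparsity < 1.0:
        raise ValueError("target_sparsity must be in [0, 1)")
    if mode == "zero":
        center = 0.0
    elif mode == "median":
        center = statistics.median(samples)
    else:
        raise ValueError(f"unsupported mode in this sketch: {mode}")
    deviations = sorted(abs(v - center) for v in samples)
    k = min(int(target_sparsity * len(deviations)), len(deviations) - 1)
    return center, deviations[k]


def achieved_sparsity(samples, center, threshold):
    """Fraction of samples that would be pruned (|x - center| < threshold)."""
    pruned = sum(1 for v in samples if abs(v - center) < threshold)
    return pruned / len(samples)


# Synthetic activations whose distribution is not centered at zero.
rng = random.Random(0)
calibration_set = [rng.gauss(1.0, 0.5) for _ in range(10_000)]

zero_center, zero_thr = calibrate(calibration_set, 0.5, mode="zero")
med_center, med_thr = calibrate(calibration_set, 0.5, mode="median")

print("zero-centered threshold:", round(zero_thr, 3))
print("median-centered threshold:", round(med_thr, 3))
```

On this synthetic data, both settings prune about half of the entries, but centering on the median needs a much smaller threshold, so the pruned values sit closer to the value they are replaced with. The `kde` option that appears later in this README is omitted here.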

Setup

Create a virtual environment and install the dependencies with the steps below.

# recommended python version: 3.10.13
python -m venv ./scap_env
source ./scap_env/bin/activate

# install torch
pip install torch==2.3.1 --index-url https://download.pytorch.org/whl/cu121

# install dependencies
pip install transformers==4.44.0 datasets==2.21.0 accelerate tqdm rich seaborn matplotlib wheel \
    git+https://github.com/EleutherAI/lm-evaluation-harness.git@906ef948dc8dbb4c84e1bb0f2861b1aba30ab533

# install gemv kernel
pip install triton "git+https://github.com/ScalingIntelligence/CATS.git@0bda7708b835f20c59f4dd59d3d32b0c5f2f6376#egg=flash_gemv&subdirectory=flash_gemv"
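After installation, a quick check that the pinned versions were picked up can save a failed run later. The snippet below is an optional sketch using only the standard library; `check_pins` and the `PINNED` dictionary are hypothetical helpers, with version numbers copied from the commands above. A prefix match is used because CUDA wheels typically report a local suffix such as `+cu121`.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions copied from the install commands above.
PINNED = {"torch": "2.3.1", "transformers": "4.44.0", "datasets": "2.21.0"}


def check_pins(pins):
    """Map each package to (wanted, found, ok); found is None if missing."""
    report = {}
    for name, wanted in pins.items():
        try:
            found = version(name)
        except PackageNotFoundError:
            found = None
        ok = found is not None and found.startswith(wanted)
        report[name] = (wanted, found, ok)
    return report


for name, (wanted, found, ok) in check_pins(PINNED).items():
    status = "ok" if ok else "MISMATCH"
    print(f"{name}: wanted {wanted}, found {found} [{status}]")
```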

Reproducing the results

1. Run calibration

Compute the calibrated SCAP thresholds for each model and sparsity config.

bash scripts/01.calibration.bash

You can skip this calibration step, as the results for the following model configs are already uploaded to the repo.

Model ID	Config in the bash script	Up/gate sparsity	Down sparsity
meta-llama/Llama-2-7b-hf	up,zero,0.35,gate,zero,0.35,down,zero,0.55	35% without mode centering	55% without mode centering
mistralai/Mistral-7B-v0.1	up,zero,0.3,gate,zero,0.3,down,zero,0.7	30% without mode centering	70% without mode centering
mosaicml/mpt-7b	down,kde,0.5	/	50% with kde peak as mode
tiiuae/falcon-7b	down,median,0.5	/	50% with median as mode

The resulting calibrated_thresholds.json file in the results/scap/ folder records the mode and threshold for each FFN layer specified in the config.
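The config strings in the table above appear to be comma-separated triples of (projection, mode method, target sparsity). The sketch below parses them under that assumption, which is inferred from the table rather than from the repo's code; `LayerSpec` and `parse_config` are hypothetical names.

```python
from typing import NamedTuple


class LayerSpec(NamedTuple):
    layer: str       # e.g. "up", "gate", "down"
    mode: str        # e.g. "zero", "median", "kde"
    sparsity: float  # target sparsity as a fraction


def parse_config(config: str) -> list[LayerSpec]:
    """Split 'layer,mode,sparsity,...' into one LayerSpec per triple."""
    fields = [f.strip() for f in config.split(",")]
    if len(fields) % 3 != 0:
        raise ValueError(f"expected triples, got {len(fields)} fields")
    return [
        LayerSpec(fields[i], fields[i + 1], float(fields[i + 2]))
        for i in range(0, len(fields), 3)
    ]


for spec in parse_config("up,zero,0.35,gate,zero,0.35,down,zero,0.55"):
    print(spec)
```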

2. Evaluation on zero-shot tasks

Evaluate the zero-shot tasks listed in the paper: winogrande, piqa, sciq, hellaswag, boolq, arc_easy, and arc_challenge. Results are written to the results/scap/ folder.

bash scripts/02.evaluate_zero_shot_tasks.bash

The resulting evaluation_results.json file contains: (1) evaluation metrics for each task; (2) averaged actual input sparsity for each layer.
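To make "averaged actual input sparsity" concrete, the following sketch shows one way such a number could be accumulated: count the zeroed entries in each layer's input over all batches and divide by the total. `SparsityMeter` is a hypothetical helper for illustration and is not claimed to match how `evaluate.py` computes the reported value.

```python
from collections import defaultdict


class SparsityMeter:
    """Running fraction of zero-valued inputs, tracked per layer name."""

    def __init__(self):
        self._zeros = defaultdict(int)
        self._total = defaultdict(int)

    def update(self, layer, activations):
        # Count exact zeros in this batch of (already pruned) inputs.
        self._zeros[layer] += sum(1 for v in activations if v == 0.0)
        self._total[layer] += len(activations)

    def average(self):
        return {
            layer: self._zeros[layer] / self._total[layer]
            for layer in self._total
        }


meter = SparsityMeter()
meter.update("layer0.down", [0.0, 0.0, 1.5, -2.0])
meter.update("layer0.down", [0.0, 3.0, 4.0, 5.0])
print(meter.average())
```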

3. Inference with sparse kernel

This step demonstrates inference of SCAP-optimized models with the sparse GEMV kernel.

bash scripts/03.inference_demo.bash
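The reason input sparsity speeds up decoding can be seen in a plain-Python sketch of a matrix-vector product: every zero in the input lets the kernel skip a whole column of the weight matrix. The second function assumes that a nonzero center (mode) is compensated by a precomputable offset term, since W x = W (x - c) + c * rowsum(W). All function names here are hypothetical, and this is not the Triton kernel used by the demo.

```python
def dense_gemv(W, x):
    """Reference y = W @ x with W as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def sparse_gemv(W, x):
    """Same product, touching only columns where x is nonzero."""
    active = [j for j, v in enumerate(x) if v != 0.0]
    return [sum(row[j] * x[j] for j in active) for row in W]


def mode_centered_gemv(W, x, center, threshold):
    """Prune around `center`, then add back the constant offset.

    Assumption: the offset center * rowsum(W) is computed once offline,
    for example folded into a bias term.
    """
    shifted = [v - center for v in x]
    pruned = [0.0 if abs(v) < threshold else v for v in shifted]
    offset = [center * sum(row) for row in W]
    return [y + b for y, b in zip(sparse_gemv(W, pruned), offset)]


W = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
x = [1.1, 3.0, 0.9]  # two entries sit close to the center 1.0

print("dense :", dense_gemv(W, x))
print("sparse:", mode_centered_gemv(W, x, center=1.0, threshold=0.5))
```

With the threshold set to zero nothing is pruned and the result matches the dense product; with a positive threshold the two near-center entries are skipped and the output is an approximation.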

Acknowledgement

This work builds on CATS, which we believe in turn extends DejaVu. Credit goes to the original authors of both projects.
