Feature request
This task is to experiment with running quantized HuggingFace models with ExecuTorch out of the box.
The heavy-lifting quantization work will be done through the `quantize_` API from torchao, for example `quantize_(model, int4_weight_only())`.
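As a rough sketch of that quantization step (the checkpoint name below is only a placeholder; any decoder model covered by the export path would work):

```python
import torch
from transformers import AutoModelForCausalLM
from torchao.quantization import quantize_, int4_weight_only

# Load a decoder model in eager mode; the checkpoint is a placeholder.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",
    torch_dtype=torch.bfloat16,
)
model.eval()

# torchao rewrites the linear layers in place with int4 weight-only quantization.
quantize_(model, int4_weight_only())
```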
The quantization API can be integrated at the existing ExecuTorch integration points in `transformers.integrations.executorch`, expanding the export workflow with a new option of "exporting with quantization". In eager mode, users can verify the numerical accuracy of the quantized exported artifact, e.g. with the llama eval script (here). In ExecuTorch, users can simply load the quantized `.pte` files into the ExecuTorch runner for inference.
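A minimal sketch of the rest of the flow, assuming the existing `convert_and_export_with_cache` entry point in transformers and the ExecuTorch Python bindings for a quick smoke test; the exact lowering and runtime calls, the cache configuration, and the file name are illustrative and may differ across versions, and on device the `.pte` would be loaded by the ExecuTorch (llama) runner instead:

```python
import torch
from transformers import AutoModelForCausalLM, GenerationConfig
from transformers.integrations.executorch import convert_and_export_with_cache
from torchao.quantization import quantize_, int4_weight_only
from executorch.exir import to_edge

# Same placeholder checkpoint as above, configured with a static KV cache,
# which the ExecuTorch integration in transformers expects.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",
    torch_dtype=torch.bfloat16,
    generation_config=GenerationConfig(
        use_cache=True,
        cache_implementation="static",
        cache_config={"batch_size": 1, "max_cache_len": 128},
    ),
).eval()

# Quantize in eager mode with torchao.
quantize_(model, int4_weight_only())

# Export the quantized model through the existing integration point.
exported_program = convert_and_export_with_cache(model)

# Lower to an ExecuTorch program and serialize it as a .pte artifact.
et_program = to_edge(exported_program).to_executorch()
with open("model_int4.pte", "wb") as f:
    f.write(et_program.buffer)

# Quick smoke test with the ExecuTorch Python bindings.
from executorch.extension.pybindings.portable_lib import _load_for_executorch

et_module = _load_for_executorch("model_int4.pte")
tokens = torch.tensor([[1]], dtype=torch.long)
cache_position = torch.tensor([0], dtype=torch.long)
logits = et_module.forward((tokens, cache_position))[0]
```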
Motivation
Experiment with the quantization workflow: `transformers` + `torchao` + `executorch`.
Your contribution
Direct contribution, or providing guidance to anyone who is interested in this work.