From 6895a18b994bf910c3d6d6c9d55c93504448ec90 Mon Sep 17 00:00:00 2001 From: Joe Bowser Date: Fri, 22 Nov 2024 21:28:24 -0800 Subject: [PATCH 1/4] Changing the referenced AAR so that it uses the AAR from the docs (#1390) --- .gitignore | 4 ++++ torchchat/edge/android/torchchat/app/build.gradle.kts | 2 +- 2 files changed, 5 insertions(+), 1 deletion(-) diff --git a/.gitignore b/.gitignore index 74d0a28fa..61ab1ee4d 100644 --- a/.gitignore +++ b/.gitignore @@ -19,6 +19,10 @@ runner-et/cmake-out/* runner-aoti/cmake-out/* cmake-out/ +# Example project Android Studio ignore +torchchat/edge/android/torchchat/.idea/* + + # pte files *.pte diff --git a/torchchat/edge/android/torchchat/app/build.gradle.kts b/torchchat/edge/android/torchchat/app/build.gradle.kts index e0c9c196b..a98a70cab 100644 --- a/torchchat/edge/android/torchchat/app/build.gradle.kts +++ b/torchchat/edge/android/torchchat/app/build.gradle.kts @@ -57,7 +57,7 @@ dependencies { implementation("androidx.constraintlayout:constraintlayout:2.2.0-alpha12") implementation("com.facebook.fbjni:fbjni:0.5.1") implementation("com.google.code.gson:gson:2.8.6") - implementation(files("libs/executorch-llama.aar")) + implementation(files("libs/executorch.aar")) implementation("com.google.android.material:material:1.12.0") implementation("androidx.activity:activity:1.9.0") testImplementation("junit:junit:4.13.2") From f8211638a3423d35d4f4740323a0e6f295f39ba2 Mon Sep 17 00:00:00 2001 From: Michael Gschwind <61328285+mikekgfb@users.noreply.github.com> Date: Tue, 26 Nov 2024 12:00:57 -0800 Subject: [PATCH 2/4] Typo fixes in native-execution.md (#1394) Typo fixes in native-execution.md --- docs/native-execution.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/native-execution.md b/docs/native-execution.md index 790547e21..c22d3c3ba 100644 --- a/docs/native-execution.md +++ b/docs/native-execution.md @@ -16,14 +16,14 @@ The 'llama runner' is a native standalone application capable of running a model exported and compiled ahead-of-time with either Executorch (ET) or AOT Inductor (AOTI). Which model format to use depends on your requirements and preferences. Executorch models are -optimized for portability across a range of decices, including mobile +optimized for portability across a range of devices, including mobile and edge devices. AOT Inductor models are optimized for a particular target architecture, which may result in better performance and efficiency. Building the runners is straightforward with the included cmake build files and is covered in the next sections. We will showcase the -runners using ~~stories15M~~ llama2 7B and llama3. +runners using llama2 7B and llama3. ## What can you do with torchchat's llama runner for native execution? @@ -160,7 +160,7 @@ and native execution environments, respectively. After exporting a model, you will want to verify that the model delivers output of high quality, and works as expected. Both can be -achieved with the Python environment. All torchchat Python comands +achieved with the Python environment. All torchchat Python commands can work with exported models. Instead of loading the model from a checkpoint or GGUF file, use the `--dso-path model.so` and `--pte-path model.pte` for loading both types of exported models. 
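For reviewers who want to exercise the `--dso-path`/`--pte-path` loading flow that PATCH 2/4 documents, a minimal sketch follows; it assumes a model has already been exported per docs/native-execution.md and that `${MODEL_PATH}`/`${MODEL_NAME}` are set as in that guide:

```
# Load the exported artifacts instead of a checkpoint (paths follow the guide's conventions)
python3 torchchat.py generate --checkpoint-path ${MODEL_PATH} --pte-path ${MODEL_NAME}.pte --device cpu --prompt "Once upon a time"
python3 torchchat.py generate --checkpoint-path ${MODEL_PATH} --dso-path ${MODEL_NAME}.so --device cpu --prompt "Once upon a time"
```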
From 5eac3292988d0227991628a8373982dfdc9f6bb0 Mon Sep 17 00:00:00 2001
From: Michael Gschwind <61328285+mikekgfb@users.noreply.github.com>
Date: Tue, 26 Nov 2024 12:02:30 -0800
Subject: [PATCH 3/4] Improvements for readability in ADVANCED-USERS.md
 (#1393)

* Various spelling corrections

* Remove empty performance tables

* Remove CONTRIBUTING section that is covered in the project root README
---
 docs/ADVANCED-USERS.md | 90 +++++++++--------------------------------
 1 file changed, 18 insertions(+), 72 deletions(-)

diff --git a/docs/ADVANCED-USERS.md b/docs/ADVANCED-USERS.md
index 417a823f8..8f66b8a29 100644
--- a/docs/ADVANCED-USERS.md
+++ b/docs/ADVANCED-USERS.md
@@ -18,10 +18,10 @@ Torchchat is currently in a pre-release state and under extensive development.
 
 [shell default]: TORCHCHAT_ROOT=${PWD} ./torchchat/utils/scripts/install_et.sh
 
-This is the advanced users guide, if you're looking to get started
+This is the advanced users' guide, if you're looking to get started
 with LLMs, please refer to the README at the root directory of the
 torchchat distro. This is an advanced user guide, so we will have
-many more concepts and options to discuss and taking advantage of them
+many more concepts and options to discuss and take advantage of them
 may take some effort.
 
 We welcome community contributions of all kinds. If you find
@@ -41,7 +41,7 @@ While we strive to support a broad range of models, we can't test
 them all. We classify supported models as tested ✅,
 work in progress 🚧 or some restrictions ❹.
 
-We invite community contributions of new model suport and test results!
+We invite community contributions of new model support and test results!
 
 | Model | Tested | Eager | torch.compile | AOT Inductor | ExecuTorch | Fits on Mobile |
 |-----|--------|-------|-----|-----|-----|-----|
@@ -86,7 +86,7 @@ Server C++ runtime | n/a | run.cpp model.pte | ✅ |
 Mobile C++ runtime | n/a | app model.pte | ✅ |
 Mobile C++ runtime | n/a | app + AOTI | 🚧 |
 
-**Getting help:** Each command implements the --help option to give addititonal information about available options:
+**Getting help:** Each command implements the --help option to give additional information about available options:
 
 [skip default]: begin
 ```
 python3 torchchat.py [ export | generate | chat | eval | ... ] --help
 ```
 [skip default]: end
 
 Exported models can be loaded back into torchchat for chat or text
 generation, letting you experiment with the exported model and valid
-model quality. The python interface is the same in all cases and is
-used for testing nad test harnesses too.
+model quality. The Python interface is the same in all cases and is
+used for testing and test harnesses, too.
 
 Torchchat comes with server C++ runtimes to execute AOT Inductor and
 ExecuTorch models. A mobile C++ runtimes allow you to deploy
@@ -115,7 +115,7 @@ Some common models are recognized by torchchat based on their filename
 through `Model.from_name()` to perform a fuzzy match against a
 table of known model architectures. Alternatively, you can specify the
 index into that table with the option `--params-table ${INDEX}` where
-the index is the lookup key key in the [the list of known
+the index is the lookup key in the [the list of known
 pconfigurations](https://github.com/pytorch/torchchat/tree/main/torchchat/model_params)
 For example, for the stories15M model, this would be expressed as
 `--params-table stories15M`. (We use the model constructor
@@ -237,7 +237,7 @@ which chooses the best 16-bit floating point type. 
The virtual device fast and virtual floating point data types fast and fast16 are best used for eager/torch.compiled execution. For export, -specify the your device choice for the target system with --device for +specify your device choice for the target system with --device for AOTI-exported DSO models, and using ExecuTorch delegate selection for ExecuTorch-exported PTE models. @@ -250,8 +250,7 @@ python3 torchchat.py generate [--compile] --checkpoint-path ${MODEL_PATH} --prom To improve performance, you can compile the model with `--compile` trading off the time to first token processed with time per token. To improve performance further, you may also compile the prefill with -`--compile_prefill`. This will increase further compilation times though. The -`--compile-prefill` option is not compatible with `--prefill-prefill`. +`--compile-prefill`. This will increase further compilation times though. Parallel prefill is not yet supported by exported models, and may be supported in a future release. @@ -265,7 +264,7 @@ the introductory README. In addition to running eval on models in eager mode and JIT-compiled mode with `torch.compile()`, you can also load dso and pte models back into the PyTorch to evaluate the accuracy of exported model objects -(e.g., after applying quantization or other traqnsformations to +(e.g., after applying quantization or other transformations to improve speed or reduce model size). Loading exported models back into a Python-based Pytorch allows you to @@ -297,14 +296,14 @@ for ExecuTorch.) We export the stories15M model with the following command for execution with the ExecuTorch runtime (and enabling execution on a -wide range of community and vendor supported backends): +wide range of community and vendor-supported backends): ``` python3 torchchat.py export --checkpoint-path ${MODEL_PATH} --output-pte-path ${MODEL_NAME}.pte ``` Alternatively, we may generate a native instruction stream binary -using AOT Inductor for CPU oor GPUs (the latter using Triton for +using AOT Inductor for CPU or GPUs (the latter using Triton for optimizations such as operator fusion): ``` @@ -319,10 +318,10 @@ the exported model artifact back into a model container with a compatible API surface for the `model.forward()` function. This enables users to test, evaluate and exercise the exported model artifact with familiar interfaces, and in conjunction with -pre-exiisting Python model unit tests and common environments such as +pre-existing Python model unit tests and common environments such as Jupyter notebooks and/or Google colab. -Here is how to load an exported model into the python environment on the example of using an exported model with `generate.oy`. +Here is how to load an exported model into the Python environment using an exported model with the `generate` command. ``` python3 torchchat.py generate --checkpoint-path ${MODEL_PATH} --pte-path ${MODEL_NAME}.pte --device cpu --prompt "Once upon a time" @@ -452,7 +451,7 @@ strategies: You can find instructions for quantizing models in [docs/quantization.md](file:///./quantization.md). Advantageously, quantization is available in eager mode as well as during export, -enabling you to do an early exploration of your quantization setttings +enabling you to do an early exploration of your quantization settings in eager mode. 
However, final accuracy should always be confirmed on the actual
execution target, since all targets have different build processes, compilers, and kernel
implementations with potentially
significant impact on accuracy.

@@ -464,9 +463,8 @@ significant impact on accuracy.
 
 ## Native (Stand-Alone) Execution of Exported Models
 
-Refer to the [README](README.md] for an introduction toNative
-execution on servers, desktops and laptops is described under
-[runner-build.md]. Mobile and Edge executipon for Android and iOS are
+Refer to the [README](README.md] for an introduction to native
+execution on servers, desktops, and laptops. Mobile and Edge execution for Android and iOS are
 described under [torchchat/edge/docs/Android.md] and [torchchat/edge/docs/iOS.md],
 respectively.
 
@@ -475,7 +473,7 @@ described under [torchchat/edge/docs/Android.md] and [torchchat/edge/docs/iOS.md
 
 PyTorch and ExecuTorch support a broad range of devices for running
 PyTorch with python (using either eager or eager + `torch.compile`) or
-in a python-free environment with AOT Inductor and ExecuTorch.
+in a Python-free environment with AOT Inductor and ExecuTorch.
 
 
 | Hardware | OS | Eager | Eager + Compile | AOT Compile | ET Runtime |
 |-----|------|-----|-----|-----|-----|
@@ -497,58 +495,6 @@
 
 *Key*: n/t -- not tested
 
-
-## Runtime performance with Llama 7B, in tokens per second (4b quantization)
-
-| Hardware | OS | eager | eager + compile | AOT compile | ET Runtime |
-|-----|------|-----|-----|-----|-----|
-| x86 | Linux | ? | ? | ? | ? |
-| x86 | macOS | ? | ? | ? | ? |
-| aarch64 | Linux | ? | ? | ? | ? |
-| aarch64 | macOS | ? | ? | ? | ? |
-| AMD GPU | Linux | ? | ? | ? | ? |
-| Nvidia GPU | Linux | ? | ? | ? | ? |
-| MPS | macOS | ? | ? | ? | ? |
-| MPS | iOS | ? | ? | ? | ? |
-| aarch64 | Android | ? | ? | ? | ? |
-| Mobile GPU (Vulkan) | Android | ? | ? | ? | ? |
-| CoreML | iOS | | ? | ? | ? | ? |
-| Hexagon DSP | Android | | ? | ? | ? | ? |
-| Raspberry Pi 4/5 | Raspbian | ? | ? | ? | ? |
-| Raspberry Pi 4/5 | Android | ? | ? | ? | ? |
-| ARM 32b (up to v7) | any | | ? | ? | ? | ? |
-
-
-## Runtime performance with Llama3, in tokens per second (4b quantization)
-
-| Hardware | OS | eager | eager + compile | AOT compile | ET Runtime |
-|-----|------|-----|-----|-----|-----|
-| x86 | Linux | ? | ? | ? | ? |
-| x86 | macOS | ? | ? | ? | ? |
-| aarch64 | Linux | ? | ? | ? | ? |
-| aarch64 | macOS | ? | ? | ? | ? |
-| AMD GPU | Linux | ? | ? | ? | ? |
-| Nvidia GPU | Linux | ? | ? | ? | ? |
-| MPS | macOS | ? | ? | ? | ? |
-| MPS | iOS | ? | ? | ? | ? |
-| aarch64 | Android | ? | ? | ? | ? |
-| Mobile GPU (Vulkan) | Android | ? | ? | ? | ? |
-| CoreML | iOS | | ? | ? | ? | ? |
-| Hexagon DSP | Android | | ? | ? | ? | ? |
-| Raspberry Pi 4/5 | Raspbian | ? | ? | ? | ? |
-| Raspberry Pi 4/5 | Android | ? | ? | ? | ? |
-| ARM 32b (up to v7) | any | | ? | ? | ? | ? |
-
-
-
-
-# CONTRIBUTING to torchchat
-
-We welcome any feature requests, bug reports, or pull requests from
-the community. See the [CONTRIBUTING](CONTRIBUTING.md) for
-instructions how to contribute to torchchat.
-
-
 # LICENSE
 
 Torchchat is released under the [BSD 3 license](./LICENSE). 
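PATCH 3/4 also touches the eval discussion; the round trip it describes — export, then load the artifact back into the Python harness to check accuracy after quantization or other transformations — would look roughly like this. This is a sketch: the export flags appear in ADVANCED-USERS.md itself, while passing `--pte-path` to `eval` is an assumption based on the guide's statement that all torchchat Python commands accept exported models:

```
# Export once, then evaluate the exported artifact rather than the checkpoint
python3 torchchat.py export --checkpoint-path ${MODEL_PATH} --output-pte-path ${MODEL_NAME}.pte
python3 torchchat.py eval --checkpoint-path ${MODEL_PATH} --pte-path ${MODEL_NAME}.pte
```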
From de2507b63ed8af7410a30ea1982d1a41b5ae4271 Mon Sep 17 00:00:00 2001
From: Michael Gschwind <61328285+mikekgfb@users.noreply.github.com>
Date: Tue, 26 Nov 2024 12:03:59 -0800
Subject: [PATCH 4/4] Update multimodal.md to exercise server as part of test
 (#1391)

Similar to #1384 to exercise the server , but for multimodal

1 - Run server:
1a - in background
1b - capture server_pid
2 - enable query using curl
3 - shutdown server with server pid captured in server_pid
---
 docs/multimodal.md | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/docs/multimodal.md b/docs/multimodal.md
index 6a3cb2be8..cd249a1fb 100644
--- a/docs/multimodal.md
+++ b/docs/multimodal.md
@@ -41,6 +41,9 @@ python3 torchchat.py server llama3.2-11B
 ```
 [skip default]: end
 
+[shell default]: python3 torchchat.py server llama3.2-11B & server_pid=$!
+
+
 In another terminal, query the server using `curl`. This query might take a few minutes to respond.
@@ -50,7 +53,6 @@ Setting `stream` to "true" in the request emits a response in chunks. If `stream **Example Input + Output** -[skip default]: begin ``` curl http://127.0.0.1:5000/v1/chat/completions \ -H "Content-Type: application/json" \ @@ -74,12 +76,14 @@ curl http://127.0.0.1:5000/v1/chat/completions \ "max_tokens": 300 }' ``` - +[skip default]: begin ``` {"id": "chatcmpl-cb7b39af-a22e-4f71-94a8-17753fa0d00c", "choices": [{"message": {"role": "assistant", "content": "The image depicts a simple black and white cartoon-style drawing of an animal face. It features a profile view, complete with two ears, expressive eyes, and a partial snout. The animal looks to the left, with its eye and mouth implied, suggesting that the drawn face might belong to a rabbit, dog, or pig. The graphic face has a bold black outline and a smaller, solid black nose. A small circle, forming part of the face, has a white background with two black quirkly short and long curved lines forming an outline of what was likely a mouth, complete with two teeth. The presence of the curve lines give the impression that the animal is smiling or speaking. Grey and black shadows behind the right ear and mouth suggest that this face is looking left and upwards. Given the prominent outline of the head and the outline of the nose, it appears that the depicted face is most likely from the side profile of a pig, although the ears make it seem like a dog and the shape of the nose makes it seem like a rabbit. Overall, it seems that this image, possibly part of a character illustration, is conveying a playful or expressive mood through its design and positioning."}, "finish_reason": "stop"}], "created": 1727487574, "model": "llama3.2", "system_fingerprint": "cpu_torch.float16", "object": "chat.completion"}% ``` [skip default]: end +[shell default]: kill ${server_pid} +
## Browser
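To try the series locally, all four patches are standard `git format-patch` output and can be applied to a torchchat checkout with `git am` — a sketch, assuming the series was saved as a single mbox file (the filename is illustrative):

```
# Apply the whole series, preserving authors, dates, and commit messages
git am torchchat-fixes.mbox
```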