[sgemm] fp32 temporary storage logic added #47
Conversation
- Match requestMemory arguments with memory_pool - Added override keyword Signed-off-by: hyeonseok lee <[email protected]>
- To support scaled dot product in the attention layer, as described in the paper "Attention Is All You Need", add a scaled dot product property Signed-off-by: hyeonseok lee <[email protected]>
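The scaled dot product referenced above follows Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. Below is a minimal pure-Python sketch of that formula; the function names and list-of-lists layout are illustrative only, not the NNTrainer API:

```python
import math

def softmax(row):
    # Numerically stable softmax over one row of scores.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, on plain lists."""
    d_k = len(Q[0])
    # scores[i][j] = dot(Q[i], K[j]) / sqrt(d_k)
    scores = [[sum(q * k for q, k in zip(qr, kr)) / math.sqrt(d_k) for kr in K]
              for qr in Q]
    weights = [softmax(r) for r in scores]
    # output[i] = sum_j weights[i][j] * V[j]
    return [[sum(w * v[c] for w, v in zip(wr, V)) for c in range(len(V[0]))]
            for wr in weights]
```

The 1/sqrt(d_k) scaling keeps the score variance roughly constant as the head dimension grows, so the softmax does not saturate.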
- To support dynamic input dimensions, implement a reinitialize function - This commit is a PoC of reinitialize, so much of the code is copied from the initialize function; this commit needs further refinement. Signed-off-by: hyeonseok lee <[email protected]>
- Added causal mask to the attention layer - Implements PicoGPT Signed-off-by: hyeonseok lee <[email protected]>
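A causal mask adds -inf to the attention scores of future positions before the softmax, so each token attends only to itself and earlier tokens. A small illustrative sketch (function names are hypothetical, not from the NNTrainer code):

```python
import math

NEG_INF = float("-inf")

def causal_mask(n):
    # mask[i][j] = 0 if position j is visible from i (j <= i), -inf otherwise.
    return [[0.0 if j <= i else NEG_INF for j in range(n)] for i in range(n)]

def masked_softmax(row):
    # Softmax that sends masked (-inf) entries to exactly zero weight.
    m = max(v for v in row if v != NEG_INF)
    exps = [0.0 if v == NEG_INF else math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]
```

Adding the mask to the raw scores before the softmax is equivalent to zeroing the attention weights of future tokens, while keeping the remaining weights normalized.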
Implements picoGPT/GPT-2's encoder in C++ using nlohmann/json.hpp, so a path must be added to compile the JSON parser. Signed-off-by: Donghak PARK <[email protected]>
Add PicoGPT user input handling and add comments in encoder.hpp Signed-off-by: Donghak PARK <[email protected]>
This PR includes the PicoGPT (https://github.com/jaymody/picoGPT) Android application with NNTrainer. We use only the PicoGPT model binary and provide the NNTrainer implementation in nnstreamer#2212; this is the Android application implementation for that PR. **Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: jijoong.moon <[email protected]>
- PoC of incremental inference - Only works if batch and channel size are 1 - For the concat layer, the inference step only works if the concat axis is the width axis Signed-off-by: hyeonseok lee <[email protected]>
- Each thread copies the data in the batch direction Signed-off-by: hyeonseok lee <[email protected]>
- Apply incremental inference to PicoGPT Signed-off-by: hyeonseok lee <[email protected]>
This PR includes fixes for running GPT. Signed-off-by: jijoong.moon <[email protected]>
This PR includes some fixes to run PicoGPT with W16A16 on Android using NEON. **Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: jijoong.moon <[email protected]>
This PR includes: - fixes to enable memory optimization - removal of an unnecessary memory buffer **Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: jijoong.moon <[email protected]>
Added the LLaMA v2 application. This implementation is based on Meta's LLaMA. ref url: "https://github.com/facebookresearch/llama/" It contains: - implementations of SwiGLU, RMSNorm, and rotary embedding - loading weights from the PyTorch implementation (official GitHub) To do: - label encoding - load weights from the Hugging Face format - refactor scripts Signed-off-by: Seungbaek Hong <[email protected]>
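For context on the SwiGLU item above: LLaMA's feed-forward computes W_down(SiLU(W_gate x) * (W_up x)). The sketch below is a scalar Python model under that assumption; the weight-matrix names are hypothetical and not the NNTrainer API:

```python
import math

def silu(x):
    # SiLU (swish): x * sigmoid(x)
    return x / (1.0 + math.exp(-x))

def swiglu_ffn(x, W_gate, W_up, W_down):
    """FFN(x) = W_down @ (SiLU(W_gate @ x) * (W_up @ x)), on plain lists."""
    def matvec(W, v):
        return [sum(w * vi for w, vi in zip(row, v)) for row in W]
    gate = [silu(g) for g in matvec(W_gate, x)]   # gated branch
    up = matvec(W_up, x)                          # linear branch
    hidden = [g * u for g, u in zip(gate, up)]    # elementwise gating
    return matvec(W_down, hidden)
```

The gating branch multiplies the linear branch elementwise, which is why LLaMA's FFN carries three weight matrices instead of the usual two.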
- temp commit Signed-off-by: hyeonseok lee <[email protected]>
Describe the commit content (up to 80 columns per line) in detail ASAP. **Changes proposed in this PR:** - Added TOC generator for README.md Resolves: **Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: jijoong.moon <[email protected]>
This PR includes some fixes to run LLaMA2 with W16A16. Resolves: **Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped
- make freqs_cis as static Signed-off-by: hyeonseok lee <[email protected]>
This PR fixes the 32-bit computation issue in the RMSNorm layer. **Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: jijoong.moon <[email protected]>
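For reference, RMSNorm computes y_i = g_i * x_i / sqrt(mean(x^2) + eps); the class of 32-bit issue this fix targets typically stems from the mean-of-squares reduction being done in reduced precision. A minimal sketch (not the NNTrainer implementation; Python floats stand in for a wide accumulator):

```python
import math

def rms_norm(x, gamma, eps=1e-6):
    """RMSNorm: y_i = gamma_i * x_i / sqrt(mean(x^2) + eps).
    The mean-of-squares reduction is kept in full precision (Python float),
    mirroring an fp32 accumulator when inputs are fp16."""
    ms = sum(v * v for v in x) / len(x)
    inv = 1.0 / math.sqrt(ms + eps)
    return [g * v * inv for g, v in zip(gamma, x)]
```

Unlike LayerNorm, RMSNorm skips the mean subtraction, so only the squared-sum reduction needs precision care.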
This PR enables cache sliding when the sequence exceeds the maximum cache length. **Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: jijoong.moon <[email protected]>
This PR includes cache shifting when the sequence length is greater than the maximum sequence length. **Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: jijoong.moon <[email protected]>
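The cache shifting described above amounts to a sliding window over the K/V cache: once the cache reaches its maximum length, the oldest entries are dropped to make room for new ones. A minimal illustration (hypothetical helper, not the NNTrainer API):

```python
def append_with_slide(cache, new_entry, max_len):
    """Append one K/V entry; if the cache exceeds max_len, drop the oldest
    entries so only the most recent max_len remain (sliding window)."""
    cache.append(new_entry)
    if len(cache) > max_len:
        del cache[:len(cache) - max_len]
    return cache
```

Note that after a slide, cache index no longer equals absolute token position, which is why the rotary-embedding frequencies also need adjustment (see the commit below on extending the frequency table).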
This PR includes fixes to correct the tensor buffer sharing in cache_tensor. Resolves: **Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: jijoong.moon <[email protected]>
This PR enables FP16 computation for LLaMA. **Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: jijoong.moon <[email protected]>
This PR adds the rotary embedding frequency values for when the number of sequence positions is greater than the maximum sequence length. We also need to shift the window for the K and V caches. **Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: jijoong.moon <[email protected]>
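Rotary embedding angles are theta(p, i) = p * base^(-2i/d), so positions beyond the original maximum only require extending p and appending new frequency rows. An illustrative sketch (function names are hypothetical, not the NNTrainer API):

```python
import math

def rope_freqs(positions, dim, base=10000.0):
    """Angle table theta[p][i] = p * base^(-2i/dim) for each position p.
    New positions past the precomputed range just extend this table."""
    inv = [base ** (-2.0 * i / dim) for i in range(dim // 2)]
    return [[p * f for f in inv] for p in positions]

def apply_rope(pair, angle):
    # Rotate one (x0, x1) feature pair by `angle` in its 2-D plane.
    x0, x1 = pair
    c, s = math.cos(angle), math.sin(angle)
    return (x0 * c - x1 * s, x0 * s + x1 * c)
```

Because the rotation depends on the absolute position p, sliding the K/V cache window means cached entries must be consistent with the positions they were rotated at.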
This PR adds a LLaMA2 summarization Android application. It is based on the LLaMA2 application example, with only the Android interface added. **Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: jijoong.moon <[email protected]>
This PR enables the temperature generator for logits. Resolves: **Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: jijoong.moon <[email protected]>
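Temperature generation divides the logits by T before the softmax: T < 1 sharpens the distribution toward the argmax, T > 1 flattens it toward uniform. A sketch of the idea (not the NNTrainer implementation):

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Scale logits by 1/T, then apply a numerically stable softmax."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    s = sum(exps)
    return [e / s for e in exps]
```

Sampling from the tempered distribution (instead of greedy argmax) is what gives the generator its diversity knob.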
To compute Q, K, and V for the initial sentences, multiple cores are used. Resolves: **Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: jijoong.moon <[email protected]>
This PR includes: - enable a packed (bool) property for the layer node. It applies only to the weight: if false, the weight follows the global activation datatype; if true, it follows the global weight datatype. - add an output axis to the weight spec and set a private variable in weight, so the right direction for multiplying scales and zero points can be found. **Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: jijoong.moon <[email protected]>
This PR adds an output axis parameter to the dequantize API. Resolves: **Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: jijoong.moon <[email protected]>
**Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: jijoong.moon <[email protected]>
This PR tests QINT4. **Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: jijoong.moon <[email protected]>
This patch adds functionality to read a quantized tensor from a binary file. Details are as follows. - Read the tensor in the following order: axis, scale factors, zero points, and values. - Tensor::read takes an extra datatype argument to identify the datatype of the scale factors and read the exact number of bytes. - The dequantize function takes the output axis as a parameter instead of the tensor holding an axis variable. - Fix a QINT4 tensor print segfault. **Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: Donghyeon Jeong <[email protected]>
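The read order described above (axis, scale factors, zero points, values) can be illustrated with a hypothetical binary layout. The field widths below (int32 axis, fp32 scales, int8 zero points and values) are assumptions for the sketch, not the NNTrainer file format:

```python
import struct
from io import BytesIO

def write_qtensor(buf, axis, scales, zero_points, values):
    """Serialize in the described order: axis, scales, zero points, values."""
    buf.write(struct.pack("<i", axis))                      # int32 axis
    buf.write(struct.pack(f"<{len(scales)}f", *scales))     # fp32 scales
    buf.write(struct.pack(f"<{len(zero_points)}b", *zero_points))  # int8 zps
    buf.write(struct.pack(f"<{len(values)}b", *values))     # int8 values

def read_qtensor(buf, n_scales, n_values):
    """Read back in the same order; counts must be known from the tensor dim."""
    axis = struct.unpack("<i", buf.read(4))[0]
    scales = list(struct.unpack(f"<{n_scales}f", buf.read(4 * n_scales)))
    zps = list(struct.unpack(f"<{n_scales}b", buf.read(n_scales)))
    values = list(struct.unpack(f"<{n_values}b", buf.read(n_values)))
    return axis, scales, zps, values
```

Passing the scale-factor datatype explicitly (as the patch does) matters because the byte count of the scales block cannot be inferred from the quantized values alone.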
This patch optimizes dequantize by utilizing tensor operations. **Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: Donghyeon Jeong <[email protected]>
- scopy_INT4: convert int4 to a float16 tensor - ewvm_fp16: elementwise vector multiplication Resolves: **Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: skykongkong8 <[email protected]>
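A routine like scopy_INT4 has to unpack two 4-bit values per byte before widening them to float16. Below is a scalar Python model of the unpacking step; the high-nibble-first ordering and sign-extension convention are assumptions for illustration:

```python
def unpack_int4(packed):
    """Unpack bytes holding two signed 4-bit values each (high nibble first)
    into a list of floats, as a scalar model of an int4-to-float copy."""
    out = []
    for byte in packed:
        for nibble in ((byte >> 4) & 0x0F, byte & 0x0F):
            # Sign-extend the 4-bit value into the range [-8, 7].
            out.append(float(nibble - 16 if nibble >= 8 else nibble))
    return out
```

The NEON version does the same nibble masking and shifting on whole vectors at once, which is why a separate scalar fallback path (see the scopy_INT4_loop fix below) is needed on non-Android builds.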
This PR enables int4 **Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: jijoong.moon <[email protected]>
- Previously, on non-Android builds, the wrong code block was called. - Now, if the target is not Android, scopy_INT4_loop is called. Resolves: **Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: skykongkong8 <[email protected]>
- ditto **Changes proposed in this PR:** - Resolves: **Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: skykongkong8 <[email protected]>
Signed-off-by: Debadri Samaddar <[email protected]>
Signed-off-by: Debadri Samaddar <[email protected]>
Signed-off-by: Debadri Samaddar <[email protected]>
This PR fixes the output of incremental inference. Resolves: **Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: jijoong.moon <[email protected]>
Added temporary fp32 storage along with fp32 intrinsics for all scenarios. Signed-off-by: Debadri Samaddar <[email protected]>
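The point of temporary fp32 storage in an fp16 SGEMM path is that rounding the running sum to fp16 after every add discards low-order bits, while a wider accumulator rounds only once at the end. The following scalar Python model of the two strategies uses struct's IEEE half-precision 'e' format to emulate fp16 rounding; it illustrates the idea, not the NEON implementation:

```python
import struct

def to_fp16(x):
    # Round a Python float to the nearest IEEE 754 half-precision value.
    return struct.unpack("<e", struct.pack("<e", x))[0]

def dot_fp16_acc(a, b):
    # Accumulate in fp16: the running sum is rounded after every add,
    # so it stalls once additions fall below the fp16 spacing.
    acc = to_fp16(0.0)
    for x, y in zip(a, b):
        acc = to_fp16(acc + to_fp16(to_fp16(x) * to_fp16(y)))
    return acc

def dot_fp32_acc(a, b):
    # Keep a wide temporary for the accumulation; round once at the end.
    acc = 0.0
    for x, y in zip(a, b):
        acc += to_fp16(x) * to_fp16(y)
    return to_fp16(acc)
```

With 3000 products of 1.0, the fp16 accumulator stalls at 2048 (the next integer is no longer representable at that exponent), while the wide accumulator returns the exact sum: a compact demonstration of why the temporary fp32 storage preserves accuracy.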
LGTM!
Now we have to determine how delicate our accuracy preservation should be.
For example, using full fp32 to preserve decimal digits is the most conservative extreme, and vice versa (the previous implementation was the least conservative).
I think we should observe:
- how the model output and weights diverge as the NEON implementation changes (especially when we proceed to the backward-pass implementation of fp16)
- how other fp16 frameworks handle this (to follow the standard)
Signed-off-by: Debadri Samaddar <[email protected]>
SGEMM changes added.
Signed-off-by: s-debadri [email protected]