
[sgemm] fp32 temporary storage logic added #47

Closed
wants to merge 57 commits into from

Conversation

s-debadri

SGEMM changes added:

  • Use a temporary fp32 array to reduce rounding errors at large dimensions.
  • Use fp32 NEON intrinsics to perform multiply/add operations.

Signed-off-by: s-debadri [email protected]
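The benefit of the temporary wide accumulator can be illustrated with a minimal sketch. This is not the PR's actual NEON code: float/double stand in for fp16/fp32 (portable fp16 support is compiler-specific), and the intrinsics are omitted; the names are illustrative only.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Analogue of the old path: the running sum stays in the narrow type,
// so precision is lost once the accumulator grows large.
float dot_narrow(const std::vector<float> &a, const std::vector<float> &b) {
  float acc = 0.0f;
  for (std::size_t i = 0; i < a.size(); ++i)
    acc += a[i] * b[i];
  return acc;
}

// Analogue of this PR's temporary fp32 storage: accumulate in a wider
// temporary and convert back to the narrow type only once at the end.
float dot_wide(const std::vector<float> &a, const std::vector<float> &b) {
  double acc = 0.0; // temporary wide accumulator
  for (std::size_t i = 0; i < a.size(); ++i)
    acc += static_cast<double>(a[i]) * b[i];
  return static_cast<float>(acc);
}
```

For long dot products (the inner loops of SGEMM), `dot_wide` stays close to the exact result while `dot_narrow` drifts as rounding errors accumulate.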

lhs8928 and others added 30 commits September 12, 2023 16:50
 - Match requestMemory arguments with memory_pool
 - Added override keyword

Signed-off-by: hyeonseok lee <[email protected]>
 - To support scaled dot product on the attention layer as described in the paper "Attention Is All You Need", add a scaled dot product property

Signed-off-by: hyeonseok lee <[email protected]>
 - To support dynamic input dimensions, implement a reinitialize function
 - This commit is a PoC of reinitialize, so much of the code is copied from the initialize function.
   This commit needs further refinement.

Signed-off-by: hyeonseok lee <[email protected]>
 - Added causal mask in attention layer
 - Implements PicoGPT

Signed-off-by: hyeonseok lee <[email protected]>
Implement picoGPT/GPT-2's encoder in C++.
It uses nlohmann/json.hpp, so a path needs to be added to compile the JSON parser.

Signed-off-by: Donghak PARK <[email protected]>
Add PicoGPT's user input
Add Comment in encoder.hpp

Signed-off-by: Donghak PARK <[email protected]>
This PR includes the PicoGPT (https://github.com/jaymody/picoGPT)
Android application with NNTrainer.
We only use the PicoGPT model binary; the NNTrainer implementation is
provided in nnstreamer#2212. This is the Android application implementation
for that PR.

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: jijoong.moon <[email protected]>
 - PoC of incremental inference
 - Only works if batch and channel size are 1
 - For the concat layer, the inference step only works if the concat axis is the width axis

Signed-off-by: hyeonseok lee <[email protected]>
 - Each thread will copy the data in the batch-wise direction

Signed-off-by: hyeonseok lee <[email protected]>
 - Apply incremental inference to pico gpt

Signed-off-by: hyeonseok lee <[email protected]>
This PR includes fixes for running GPT.

Signed-off-by: jijoong.moon <[email protected]>
This PR includes some fixes to run PicoGPT with W16A16 on Android
using NEON.

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: jijoong.moon <[email protected]>
This PR includes:
  - Fixes to enable memory optimization
  - Removal of an unnecessary memory buffer

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: jijoong.moon <[email protected]>
Added LLaMA v2 application.

This implementation is based on Meta's LLaMA.
ref url: "https://github.com/facebookresearch/llama/"

It contains...
- implementations of SwiGLU, RMS norm, and rotary embedding
- load weights from the PyTorch implementation (official GitHub)

To do...
- label encoding
- load weights from huggingface format
- refactoring scripts

Signed-off-by: Seungbaek Hong <[email protected]>
 - temp commit

Signed-off-by: hyeonseok lee <[email protected]>
Added LLaMA v2 application.

This implementation is based on Meta's LLaMA.
ref url: "https://github.com/facebookresearch/llama/"

It contains...
- implementations of SwiGLU, RMS norm, and rotary embedding
- load weights from the PyTorch implementation (official GitHub)

To do...
- label encoding
- load weights from huggingface format
- refactoring scripts

Signed-off-by: Seungbaek Hong <[email protected]>
 - To support scaled dot product on the attention layer as described in the paper "Attention Is All You Need", add a scaled dot product property

Signed-off-by: hyeonseok lee <[email protected]>
 - Added causal mask in attention layer
 - Implements PicoGPT

Signed-off-by: hyeonseok lee <[email protected]>
 - PoC of incremental inference
 - Only works if batch and channel size are 1
 - For the concat layer, the inference step only works if the concat axis is the width axis

Signed-off-by: hyeonseok lee <[email protected]>
Describe a commit content (Until 80 columns per line) in detail ASAP.

**Changes proposed in this PR:**
- Added TOC generator for README.md

Resolves:

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: jijoong.moon <[email protected]>
This PR includes some fixes to run LLaMA2 with W16A16.

Resolves:

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped
 - make freqs_cis static

Signed-off-by: hyeonseok lee <[email protected]>
Describe a commit content (Until 80 columns per line) in detail ASAP.

**Changes proposed in this PR:**
- Added TOC generator for README.md

Resolves:

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: jijoong.moon <[email protected]>
This PR fixes the 32-bit computation issue in the RMS norm layer

**Changes proposed in this PR:**
- Added TOC generator for README.md

Resolves:

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: jijoong.moon <[email protected]>
This PR enables cache sliding when the sequence exceeds the max length of
the cache.

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: jijoong.moon <[email protected]>
This PR includes cache shifting when the sequence length is greater
than the max sequence length.

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: jijoong.moon <[email protected]>
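The cache shifting described above amounts to a sliding window over the stored steps. A hypothetical minimal sketch (not NNTrainer's cache_tensor implementation; the class and method names are illustrative):

```cpp
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical sliding K/V cache: once max_len steps are stored, pushing
// a new step evicts the oldest one, so the cache always holds the most
// recent window of the sequence.
class SlidingCache {
public:
  explicit SlidingCache(std::size_t max_len) : max_len_(max_len) {}

  void push(const std::vector<float> &step_kv) {
    if (entries_.size() == max_len_)
      entries_.pop_front(); // shift the window: drop the oldest step
    entries_.push_back(step_kv);
  }

  std::size_t size() const { return entries_.size(); }
  const std::vector<float> &at(std::size_t i) const { return entries_[i]; }

private:
  std::size_t max_len_;
  std::deque<std::vector<float>> entries_;
};
```

With `max_len` set to the model's max sequence length, attention always sees at most `max_len` cached K/V steps.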
This PR includes fixes to correct the tensor buffer sharing in
cache_tensor.

Resolves:

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: jijoong.moon <[email protected]>
This PR enables FP16 computation for LLaMA

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: jijoong.moon <[email protected]>
This PR adds the rotary embedding frequency values when the sequence
length is greater than the max sequence length.

We also need to shift the windows for cache of K and V.

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: jijoong.moon <[email protected]>
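Extending the rotary frequencies past the original window only requires evaluating the same angle formula at the larger positions. A sketch of the standard RoPE formulation (a generic illustration, not NNTrainer's exact code):

```cpp
#include <cmath>
#include <utility>
#include <vector>

// Standard RoPE angles: channel pair i at position pos rotates by
// pos * base^(-2i/dim). Positions beyond the original max sequence
// length just evaluate the same formula at the larger pos value.
std::vector<std::pair<double, double>>
rope_cos_sin(int pos, int dim, double base = 10000.0) {
  std::vector<std::pair<double, double>> out;
  for (int i = 0; i < dim / 2; ++i) {
    double freq = std::pow(base, -2.0 * i / dim);
    out.emplace_back(std::cos(pos * freq), std::sin(pos * freq));
  }
  return out;
}
```

Since the angle depends only on the absolute position, precomputed tables can simply be grown on demand as the sequence exceeds the cached range.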
This PR adds the LLaMA2 summarization Android application.
It is based on the LLaMA2 application example; only the
Android interface is added.

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: jijoong.moon <[email protected]>
jijoongmoon and others added 23 commits September 12, 2023 16:50
This PR enables the temperature generator for logits

Resolves:

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: jijoong.moon <[email protected]>
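Temperature-based generation scales the logits before softmax. A hypothetical minimal sketch of the idea (not NNTrainer's generator code; the function name is illustrative):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Temperature scaling for sampling: dividing the logits by a temperature
// T before softmax flattens the distribution for T > 1 (more diverse
// samples) and sharpens it for T < 1 (closer to greedy decoding).
std::vector<double> softmax_with_temperature(std::vector<double> logits,
                                             double temperature) {
  for (double &v : logits)
    v /= temperature;
  double max_logit = *std::max_element(logits.begin(), logits.end());
  double sum = 0.0;
  for (double &v : logits) {
    v = std::exp(v - max_logit); // subtract the max for numerical stability
    sum += v;
  }
  for (double &v : logits)
    v /= sum;
  return logits;
}
```

The next token is then sampled from the returned distribution rather than taken as the argmax.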
To compute Q, K, and V for the initial sentences, multiple cores are used

Resolves:

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: jijoong.moon <[email protected]>
This PR includes:

- Enable a packed (bool) property for the layer node. It applies only to the
weight. If it is false, the weight follows the global activation datatype;
if it is true, it follows the global weight datatype.

- Add an output axis in the Weight Spec and set a private variable in weight.
  It is used to find the right direction for multiplying scales and zero points.

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: jijoong.moon <[email protected]>
This PR adds an output axis parameter to the dequantize API

Resolves:

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: jijoong.moon <[email protected]>
**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: jijoong.moon <[email protected]>
This PR tests QINT4

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: jijoong.moon <[email protected]>
This patch adds functionality to read quantized tensor from a binary file. Details are as follows.

- Read the tensor in the following order (axis, scale factors, zero points, and values).
- Tensor::read takes an extra argument of datatype to identify the datatype of scale factors and read exact bytes.
- Dequantize function takes the output axis as a parameter instead of the tensor having an axis variable.
- Fix QINT4 tensor print segfault issue.

**Self evaluation:**
1. Build test:   [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: Donghyeon Jeong <[email protected]>
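Once the scale factors and zero points are read, dequantization along the output axis is a per-channel affine mapping. A hypothetical sketch using a simplified 2-D (channels x elements) view rather than NNTrainer's Tensor API:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-channel affine dequantization along the output axis:
// out = scale[c] * (q - zero_point[c]), where c indexes the channels of
// the output axis that the scale factors and zero points were read for.
std::vector<std::vector<float>>
dequantize(const std::vector<std::vector<int8_t>> &q,
           const std::vector<float> &scales,
           const std::vector<int8_t> &zero_points) {
  std::vector<std::vector<float>> out(q.size());
  for (std::size_t c = 0; c < q.size(); ++c) {
    out[c].reserve(q[c].size());
    for (int8_t v : q[c])
      out[c].push_back(scales[c] * static_cast<float>(v - zero_points[c]));
  }
  return out;
}
```

Passing the output axis as a parameter (rather than storing it on the tensor) means the same quantized buffer can be dequantized against whichever axis the consumer needs.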
Describe a commit content (Until 80 columns per line) in detail ASAP.

**Changes proposed in this PR:**
- Added TOC generator for README.md

Resolves:

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: jijoong.moon <[email protected]>
This patch optimizes dequantize by utilizing tensor operations.

**Self evaluation:**
1. Build test:   [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: Donghyeon Jeong <[email protected]>
- scopy_INT4 : convert int4 to float16 Tensor
- ewvm_fp16 : elementwise vector multiplication

Resolves:

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
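A conversion like scopy_INT4 has to unpack two 4-bit values from each stored byte before widening them. A hypothetical sketch (the real kernel targets float16 with NEON on Android; plain float and the high-nibble-first order here are assumptions for illustration, with values treated as unsigned 0..15):

```cpp
#include <cstdint>
#include <vector>

// Hypothetical unpacking of two packed 4-bit values per byte into floats.
// Nibble order (high nibble first) is an assumption; the widened values
// would subsequently be dequantized with their scales/zero points.
std::vector<float> unpack_int4(const std::vector<uint8_t> &packed) {
  std::vector<float> out;
  out.reserve(packed.size() * 2);
  for (uint8_t b : packed) {
    out.push_back(static_cast<float>(b >> 4));   // high nibble
    out.push_back(static_cast<float>(b & 0x0F)); // low nibble
  }
  return out;
}
```

An elementwise vector multiplication such as ewvm_fp16 can then run directly on the widened buffer.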
This PR enables int4

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: jijoong.moon <[email protected]>
Describe a commit content (Until 80 columns per line) in detail ASAP.

**Changes proposed in this PR:**
- Added TOC generator for README.md

Resolves:

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: jijoong.moon <[email protected]>
Describe a commit content (Until 80 columns per line) in detail ASAP.

**Changes proposed in this PR:**
- Added TOC generator for README.md

Resolves:

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: jijoong.moon <[email protected]>
Describe a commit content (Until 80 columns per line) in detail ASAP.

**Changes proposed in this PR:**
- Added TOC generator for README.md

Resolves:

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: jijoong.moon <[email protected]>
Describe a commit content (Until 80 columns per line) in detail ASAP.

**Changes proposed in this PR:**
- Added TOC generator for README.md

Resolves:

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: jijoong.moon <[email protected]>
- Previously, the wrong code block was called on non-Android builds.
- Now, scopy_INT4_loop is called when the target is not Android

Resolves:

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
- ditto

**Changes proposed in this PR:**
-

Resolves:

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
Signed-off-by: Debadri Samaddar <[email protected]>
Describe a commit content (Until 80 columns per line) in detail ASAP.

**Changes proposed in this PR:**
- Added TOC generator for README.md

Resolves:

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: jijoong.moon <[email protected]>
This PR fixes the output of incremental inference.

Resolves:

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: jijoong.moon <[email protected]>
Added temporary fp32 storage along with fp32 intrinsics for all scenarios.

Signed-off-by: Debadri Samaddar <[email protected]>

@skykongkong8 skykongkong8 left a comment

LGTM!

Now we have to decide how strict our accuracy preservation should be.
For example, using full fp32 to preserve decimal digits is the most extreme case, and the previous implementation is the least strict.
I think we should observe:
- how the model output and weights diverge with the NEON implementation change (especially when we proceed to the backward implementation of fp16)
- how other fp16 frameworks handle this (to follow the standard)

7 participants