
[sgemm] fp32 temporary storage logic added #47

Closed
wants to merge 57 commits into from

Conversation

s-debadri

SGEMM changes added:

  • Use a temporary fp32 array to reduce rounding errors at large dimensions.
  • Use fp32 NEON intrinsics to perform multiply/add operations.

Signed-off-by: s-debadri [email protected]
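The benefit of the temporary wide accumulator can be illustrated with a minimal sketch. This is not the PR's actual NEON code: float/double stand in for fp16/fp32 (portable fp16 support is compiler-specific), and the intrinsics are omitted; the names are illustrative only.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Analogue of the old path: the running sum stays in the narrow type,
// so precision is lost once the accumulator grows large.
float dot_narrow(const std::vector<float> &a, const std::vector<float> &b) {
  float acc = 0.0f;
  for (std::size_t i = 0; i < a.size(); ++i)
    acc += a[i] * b[i];
  return acc;
}

// Analogue of this PR's temporary fp32 storage: accumulate in a wider
// temporary and convert back to the narrow type only once at the end.
float dot_wide(const std::vector<float> &a, const std::vector<float> &b) {
  double acc = 0.0; // temporary wide accumulator
  for (std::size_t i = 0; i < a.size(); ++i)
    acc += static_cast<double>(a[i]) * b[i];
  return static_cast<float>(acc);
}
```

For long dot products (the inner loops of SGEMM), `dot_wide` stays close to the exact result while `dot_narrow` drifts as rounding errors accumulate.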

lhs8928 and others added 30 commits September 12, 2023 16:50
 - Match requestMemory arguments with memory_pool
 - Added override keyword

Signed-off-by: hyeonseok lee <[email protected]>
 - To support scaled dot product on the attention layer as described in the paper "Attention Is All You Need", add a scaled dot product property

Signed-off-by: hyeonseok lee <[email protected]>
 - To support dynamic input dimensions, implement a reinitialize function
 - This commit is a PoC of reinitialize, so much of the code is copied from the initialize function.
   This commit needs further refinement.

Signed-off-by: hyeonseok lee <[email protected]>
 - Added causal mask in attention layer
 - Implements PicoGPT

Signed-off-by: hyeonseok lee <[email protected]>
Implement picoGPT/GPT-2's encoder in C++.
It uses nlohmann/json.hpp, so a path needs to be added to compile the JSON parser.

Signed-off-by: Donghak PARK <[email protected]>
Add PicoGPT's user input
Add Comment in encoder.hpp

Signed-off-by: Donghak PARK <[email protected]>
This PR includes the PicoGPT (https://github.com/jaymody/picoGPT)
Android application with NNTrainer.
We only use the PicoGPT model binary; the NNTrainer implementation is
provided in nnstreamer#2212. This is the Android application implementation
for that PR.

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: jijoong.moon <[email protected]>
 - PoC of incremental inference
 - Only works if batch and channel size are 1
 - For the concat layer, the inference step only works if the concat axis is the width axis

Signed-off-by: hyeonseok lee <[email protected]>
 - Each thread will copy the data in the batch-wise direction

Signed-off-by: hyeonseok lee <[email protected]>
 - Apply incremental inference to pico gpt

Signed-off-by: hyeonseok lee <[email protected]>
This PR includes fixes for running GPT.

Signed-off-by: jijoong.moon <[email protected]>
This PR includes some fixes to run PicoGPT with W16A16 on Android
using NEON.

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: jijoong.moon <[email protected]>
This PR includes:
  - Fixes to enable memory optimization
  - Removal of an unnecessary memory buffer

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: jijoong.moon <[email protected]>
Added LLaMA v2 application.

This implementation is based on Meta's LLaMA.
ref url: "https://github.com/facebookresearch/llama/"

It contains...
- implementations of SwiGLU, RMS norm, and rotary embedding
- load weights from the PyTorch implementation (official GitHub)

To do...
- label encoding
- load weights from huggingface format
- refactoring scripts

Signed-off-by: Seungbaek Hong <[email protected]>
 - temp commit

Signed-off-by: hyeonseok lee <[email protected]>
Added LLaMA v2 application.

This implementation is based on Meta's LLaMA.
ref url: "https://github.com/facebookresearch/llama/"

It contains...
- implementations of SwiGLU, RMS norm, and rotary embedding
- load weights from the PyTorch implementation (official GitHub)

To do...
- label encoding
- load weights from huggingface format
- refactoring scripts

Signed-off-by: Seungbaek Hong <[email protected]>
 - To support scaled dot product on the attention layer as described in the paper "Attention Is All You Need", add a scaled dot product property

Signed-off-by: hyeonseok lee <[email protected]>
 - Added causal mask in attention layer
 - Implements PicoGPT

Signed-off-by: hyeonseok lee <[email protected]>
 - PoC of incremental inference
 - Only works if batch and channel size are 1
 - For the concat layer, the inference step only works if the concat axis is the width axis

Signed-off-by: hyeonseok lee <[email protected]>
Describe a commit content (Until 80 columns per line) in detail ASAP.

**Changes proposed in this PR:**
- Added TOC generator for README.md

Resolves:

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: jijoong.moon <[email protected]>
This PR includes some fixes to run LLaMA2 with W16A16.

Resolves:

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped
 - make freqs_cis static

Signed-off-by: hyeonseok lee <[email protected]>
Describe a commit content (Until 80 columns per line) in detail ASAP.

**Changes proposed in this PR:**
- Added TOC generator for README.md

Resolves:

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: jijoong.moon <[email protected]>
This PR fixes the 32-bit computation issue in the RMS norm layer

**Changes proposed in this PR:**
- Added TOC generator for README.md

Resolves:

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: jijoong.moon <[email protected]>
This PR enables cache sliding when the sequence exceeds the max length of
the cache.

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: jijoong.moon <[email protected]>
This PR includes cache shifting when the sequence length is greater
than the max sequence length.

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: jijoong.moon <[email protected]>
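The cache shifting described above amounts to a sliding window over the stored steps. A hypothetical minimal sketch (not NNTrainer's cache_tensor implementation; the class and method names are illustrative):

```cpp
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical sliding K/V cache: once max_len steps are stored, pushing
// a new step evicts the oldest one, so the cache always holds the most
// recent window of the sequence.
class SlidingCache {
public:
  explicit SlidingCache(std::size_t max_len) : max_len_(max_len) {}

  void push(const std::vector<float> &step_kv) {
    if (entries_.size() == max_len_)
      entries_.pop_front(); // shift the window: drop the oldest step
    entries_.push_back(step_kv);
  }

  std::size_t size() const { return entries_.size(); }
  const std::vector<float> &at(std::size_t i) const { return entries_[i]; }

private:
  std::size_t max_len_;
  std::deque<std::vector<float>> entries_;
};
```

With `max_len` set to the model's max sequence length, attention always sees at most `max_len` cached K/V steps.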
This PR includes fixes to correct the tensor buffer sharing in
cache_tensor.

Resolves:

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: jijoong.moon <[email protected]>
This PR enables FP16 computation for LLaMA

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: jijoong.moon <[email protected]>
This PR adds the rotary embedding frequency values when the sequence
length is greater than the max sequence length.

We also need to shift the windows for cache of K and V.

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: jijoong.moon <[email protected]>
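Extending the rotary frequencies past the original window only requires evaluating the same angle formula at the larger positions. A sketch of the standard RoPE formulation (a generic illustration, not NNTrainer's exact code):

```cpp
#include <cmath>
#include <utility>
#include <vector>

// Standard RoPE angles: channel pair i at position pos rotates by
// pos * base^(-2i/dim). Positions beyond the original max sequence
// length just evaluate the same formula at the larger pos value.
std::vector<std::pair<double, double>>
rope_cos_sin(int pos, int dim, double base = 10000.0) {
  std::vector<std::pair<double, double>> out;
  for (int i = 0; i < dim / 2; ++i) {
    double freq = std::pow(base, -2.0 * i / dim);
    out.emplace_back(std::cos(pos * freq), std::sin(pos * freq));
  }
  return out;
}
```

Since the angle depends only on the absolute position, precomputed tables can simply be grown on demand as the sequence exceeds the cached range.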
This PR adds the LLaMA2 summarization Android application.
It is based on the LLaMA2 application example; only the
Android interface is added.

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: jijoong.moon <[email protected]>
jijoongmoon and others added 23 commits September 12, 2023 16:50
This PR enables the temperature generator for logits

Resolves:

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: jijoong.moon <[email protected]>
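Temperature-based generation scales the logits before softmax. A hypothetical minimal sketch of the idea (not NNTrainer's generator code; the function name is illustrative):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Temperature scaling for sampling: dividing the logits by a temperature
// T before softmax flattens the distribution for T > 1 (more diverse
// samples) and sharpens it for T < 1 (closer to greedy decoding).
std::vector<double> softmax_with_temperature(std::vector<double> logits,
                                             double temperature) {
  for (double &v : logits)
    v /= temperature;
  double max_logit = *std::max_element(logits.begin(), logits.end());
  double sum = 0.0;
  for (double &v : logits) {
    v = std::exp(v - max_logit); // subtract the max for numerical stability
    sum += v;
  }
  for (double &v : logits)
    v /= sum;
  return logits;
}
```

The next token is then sampled from the returned distribution rather than taken as the argmax.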
To compute Q, K, and V for the initial sentences, multiple cores are used

Resolves:

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: jijoong.moon <[email protected]>
This PR includes:

- Enable a packed (bool) property for the layer node. It applies only to the
weight. If it is false, the weight follows the global activation datatype;
if it is true, it follows the global weight datatype.

- Add an output axis in the Weight Spec and set a private variable in weight.
  It is used to find the right direction for multiplying scales and zero points.

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: jijoong.moon <[email protected]>
This PR adds an output axis parameter to the dequantize API

Resolves:

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: jijoong.moon <[email protected]>
**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: jijoong.moon <[email protected]>
This PR tests QINT4

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: jijoong.moon <[email protected]>
This patch adds functionality to read quantized tensor from a binary file. Details are as follows.

- Read the tensor in the following order (axis, scale factors, zero points, and values).
- Tensor::read takes an extra argument of datatype to identify the datatype of scale factors and read exact bytes.
- Dequantize function takes the output axis as a parameter instead of the tensor having an axis variable.
- Fix QINT4 tensor print segfault issue.

**Self evaluation:**
1. Build test:   [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: Donghyeon Jeong <[email protected]>
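Once the scale factors and zero points are read, dequantization along the output axis is a per-channel affine mapping. A hypothetical sketch using a simplified 2-D (channels x elements) view rather than NNTrainer's Tensor API:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-channel affine dequantization along the output axis:
// out = scale[c] * (q - zero_point[c]), where c indexes the channels of
// the output axis that the scale factors and zero points were read for.
std::vector<std::vector<float>>
dequantize(const std::vector<std::vector<int8_t>> &q,
           const std::vector<float> &scales,
           const std::vector<int8_t> &zero_points) {
  std::vector<std::vector<float>> out(q.size());
  for (std::size_t c = 0; c < q.size(); ++c) {
    out[c].reserve(q[c].size());
    for (int8_t v : q[c])
      out[c].push_back(scales[c] * static_cast<float>(v - zero_points[c]));
  }
  return out;
}
```

Passing the output axis as a parameter (rather than storing it on the tensor) means the same quantized buffer can be dequantized against whichever axis the consumer needs.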
Describe a commit content (Until 80 columns per line) in detail ASAP.

**Changes proposed in this PR:**
- Added TOC generator for README.md

Resolves:

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: jijoong.moon <[email protected]>
This patch optimizes dequantize by utilizing tensor operations.

**Self evaluation:**
1. Build test:   [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: Donghyeon Jeong <[email protected]>
- scopy_INT4 : convert int4 to float16 Tensor
- ewvm_fp16 : elementwise vector multiplication

Resolves:

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
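A conversion like scopy_INT4 has to unpack two 4-bit values from each stored byte before widening them. A hypothetical sketch (the real kernel targets float16 with NEON on Android; plain float and the high-nibble-first order here are assumptions for illustration, with values treated as unsigned 0..15):

```cpp
#include <cstdint>
#include <vector>

// Hypothetical unpacking of two packed 4-bit values per byte into floats.
// Nibble order (high nibble first) is an assumption; the widened values
// would subsequently be dequantized with their scales/zero points.
std::vector<float> unpack_int4(const std::vector<uint8_t> &packed) {
  std::vector<float> out;
  out.reserve(packed.size() * 2);
  for (uint8_t b : packed) {
    out.push_back(static_cast<float>(b >> 4));   // high nibble
    out.push_back(static_cast<float>(b & 0x0F)); // low nibble
  }
  return out;
}
```

An elementwise vector multiplication such as ewvm_fp16 can then run directly on the widened buffer.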
This PR enables int4

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: jijoong.moon <[email protected]>
Describe a commit content (Until 80 columns per line) in detail ASAP.

**Changes proposed in this PR:**
- Added TOC generator for README.md

Resolves:

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: jijoong.moon <[email protected]>
Describe a commit content (Until 80 columns per line) in detail ASAP.

**Changes proposed in this PR:**
- Added TOC generator for README.md

Resolves:

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: jijoong.moon <[email protected]>
Describe a commit content (Until 80 columns per line) in detail ASAP.

**Changes proposed in this PR:**
- Added TOC generator for README.md

Resolves:

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: jijoong.moon <[email protected]>
Describe a commit content (Until 80 columns per line) in detail ASAP.

**Changes proposed in this PR:**
- Added TOC generator for README.md

Resolves:

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: jijoong.moon <[email protected]>
- Previously, the wrong code block was called on non-Android builds.
- Now, scopy_INT4_loop is called when the target is not Android

Resolves:

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
- ditto

**Changes proposed in this PR:**
-

Resolves:

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
Signed-off-by: Debadri Samaddar <[email protected]>
Describe a commit content (Until 80 columns per line) in detail ASAP.

**Changes proposed in this PR:**
- Added TOC generator for README.md

Resolves:

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: jijoong.moon <[email protected]>
This PR fixes the output of incremental inference.

Resolves:

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: jijoong.moon <[email protected]>
Added temporary fp32 storage along with fp32 intrinsics for all scenarios.

Signed-off-by: Debadri Samaddar <[email protected]>

@skykongkong8 skykongkong8 left a comment

LGTM!

Now we have to decide how strict our accuracy preservation should be.
For example, using full fp32 to preserve decimal digits is the most extreme case, and the previous implementation is the least strict.
I think we should observe:
- how the model output and weights diverge with the NEON implementation change (especially when we proceed to the backward implementation of fp16)
- how other fp16 frameworks handle this (to follow the standard)

7 participants