Accelerate first token gen with BF16-gemm MHA and concat-Silu MLP #106

Merged: 1 commit into intel:main on Dec 6, 2023

Conversation

@abenmao (Contributor) commented Nov 30, 2023

  1. BF16-based flash attention, enabled when prompt_len >= 1024.
  2. Concatenated-SiLU gate + up projection (see the sketch after this list).
  3. Added the intel-mkl library to cmake.
  4. In addition, the environment variable ENABLE_CBLAS_MLP can be set to 1 to switch downProj to the cblas kernel when prompt_len >= 2048; the default is 0.
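
As a rough illustration of item 2, the sketch below shows the concatenated gate/up idea: one GEMM over a fused [gate | up] weight, then SiLU(gate) * up element-wise. All names and layouts here are illustrative assumptions, not the PR's actual code.

#include <cmath>
#include <vector>

// Illustrative only: input is M x K, catWeight is K x 2N laid out as [gate | up],
// output is M x N. A real kernel would use a tuned BF16 GEMM instead of these loops.
static void concatSiluMlp(const std::vector<float> &input, const std::vector<float> &catWeight,
        std::vector<float> &output, int M, int K, int N) {
    std::vector<float> catOut(static_cast<size_t>(M) * 2 * N, 0.0f);
    for (int i = 0; i < M; ++i)                      // one GEMM produces gate and up side by side
        for (int k = 0; k < K; ++k)
            for (int j = 0; j < 2 * N; ++j)
                catOut[i * 2 * N + j] += input[i * K + k] * catWeight[k * 2 * N + j];
    for (int i = 0; i < M; ++i) {                    // SiLU(gate) * up, element-wise
        for (int j = 0; j < N; ++j) {
            float g = catOut[i * 2 * N + j];
            float u = catOut[i * 2 * N + N + j];
            output[i * N + j] = (g / (1.0f + std::exp(-g))) * u;
        }
    }
}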

@abenmao force-pushed the feature/layers/mha_bf16 branch 5 times, most recently from 21f180b to 6652c16 on December 4, 2023 08:30
}
template <typename T>
static void single_thread_cvt2bf16_inplace(T *buf, int m, int n, int stride) {
if (!std::is_same_v<T, bfloat16_t>)
Contributor: Shall we also check whether T is float?

@abenmao (Author): You are right. Done.
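
A minimal sketch of the suggested guard, with an illustrative stand-in for the project's bfloat16_t and a plain truncating conversion; it is not the PR's final code.

#include <cstdint>
#include <cstring>
#include <type_traits>

struct bfloat16_t { uint16_t value; };  // stand-in for the project's type, for illustration

// Only float input is converted in place; bf16 input is left untouched.
template <typename T>
static void single_thread_cvt2bf16_inplace(T *buf, int m, int n, int stride) {
    static_assert(std::is_same_v<T, float> || std::is_same_v<T, bfloat16_t>,
            "expected float or bfloat16_t");
    if constexpr (std::is_same_v<T, float>) {
        for (int i = 0; i < m; ++i) {
            bfloat16_t *dst = reinterpret_cast<bfloat16_t *>(buf + i * stride);
            for (int j = 0; j < n; ++j) {
                uint32_t bits;
                std::memcpy(&bits, buf + i * stride + j, sizeof(bits));
                dst[j].value = static_cast<uint16_t>(bits >> 16);  // truncating float -> bf16
            }
        }
    }
}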

for (int j = 0; j < headNum; j++) {
int srcOffEachLine = j * seqLen * headSize;
int dstOffEachHead = j * headSize;
static inline __m512 dilExpKernel(__m512 vecSrc) {
Contributor: Is this a vectorized version of exp? If so, there is already one named vexp elsewhere in the codebase.

@abenmao (Author): Changed to use our vexp function.
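
For readers without the codebase at hand, a generic AVX-512 exp approximation in the same spirit is sketched below; it is not the repository's vexp and uses coarse Taylor coefficients purely for illustration.

#include <immintrin.h>

// Illustrative exp(x) = 2^k * exp(r), with k = round(x / ln2) and r = x - k * ln2;
// exp(r) is approximated by a short polynomial and 2^k is applied via scalef.
static inline __m512 approx_exp_ps(__m512 x) {
    const __m512 log2e = _mm512_set1_ps(1.442695041f);
    const __m512 ln2 = _mm512_set1_ps(0.6931471806f);
    __m512 k = _mm512_roundscale_ps(_mm512_mul_ps(x, log2e),
            _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
    __m512 r = _mm512_fnmadd_ps(k, ln2, x);                     // r = x - k * ln2
    __m512 p = _mm512_set1_ps(1.0f / 24.0f);                    // Taylor series of exp(r) around 0
    p = _mm512_fmadd_ps(p, r, _mm512_set1_ps(1.0f / 6.0f));
    p = _mm512_fmadd_ps(p, r, _mm512_set1_ps(0.5f));
    p = _mm512_fmadd_ps(p, r, _mm512_set1_ps(1.0f));
    p = _mm512_fmadd_ps(p, r, _mm512_set1_ps(1.0f));
    return _mm512_scalef_ps(p, k);                              // p * 2^k
}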

@@ -878,7 +914,8 @@ class Attention {
}

virtual const float *getMask(const float *attnMask, int bId, int hId, int srcLen, int tgtLen) {
return attnMask + bId * srcLen * tgtLen;
return attnMask;
Contributor: Why is this different from the original?

@abenmao (Author): It was just for performance tuning and I forgot to roll it back. Comments have been added.
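
For reference, a tiny sketch of what the original offset selects, assuming the mask is stored contiguously as [batchSize, srcLen, tgtLen] (layout assumed here for illustration):

#include <cstddef>

// Illustrative only: return batch bId's srcLen x tgtLen mask slab.
static inline const float *maskForBatch(const float *attnMask, int bId, int srcLen, int tgtLen) {
    return attnMask + static_cast<size_t>(bId) * srcLen * tgtLen;
}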

int minBlk = (nth >= batchSize * numQHead ? 256 : 512);
int srcBlk = std::min(minBlk, srcLen);
int tgtBlk = std::min(minBlk, tgtLen);
int minBlk = (int)std::pow(2, int(std::log2(srcLen / 2)));
Contributor: Is there a design principle behind this? If so, could you add a comment?

@abenmao (Author): The current block size is derived from practical experience. Comments added.
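
A small sketch of the heuristic as written, rounding srcLen / 2 down to a power of two without the pow/log2 round trip; purely illustrative.

// Illustrative equivalent of (int)std::pow(2, (int)std::log2(srcLen / 2)) for srcLen >= 2:
// the largest power of two that does not exceed srcLen / 2.
static int roundDownPow2(int srcLen) {
    int blk = 1;
    while (blk * 2 <= srcLen / 2) blk *= 2;
    return blk;
}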

float **preSum = thrPtrBuf;
float **sum = thrPtrBuf + nth;
float **preMax = thrPtrBuf + nth * 2;
float **max = thrPtrBuf + nth * 3;
float **qkArr = thrPtrBuf + nth * 4;
float **expQkvArr = thrPtrBuf + nth * 5;
float **qArr = thrPtrBuf + nth * 6;

thrBuf = (float *)malloc(sizeof(float) * nth * arrStride);
Contributor: Would it be better to use SimpleMemPool to get the buffer? (SimpleMemPool maintains the buffer, so the next layer can use it directly.)

@abenmao (Author): Done.
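
A rough sketch of the buffer-reuse idea behind the suggestion: a named, grow-only cache so successive layers reuse one allocation. The class and method names here are illustrative, not SimpleMemPool's actual interface.

#include <cstdlib>
#include <string>
#include <unordered_map>

// Illustrative grow-only buffer cache; not the project's SimpleMemPool API.
class BufferCache {
public:
    static BufferCache &instance() {
        static BufferCache pool;
        return pool;
    }
    // Return a buffer of at least `bytes`, reusing the previous allocation when large enough.
    void *get(const std::string &name, size_t bytes) {
        Slot &slot = slots_[name];
        if (slot.size < bytes) {
            std::free(slot.ptr);
            size_t rounded = (bytes + 63) / 64 * 64;   // aligned_alloc needs a multiple of the alignment
            slot.ptr = std::aligned_alloc(64, rounded);
            slot.size = rounded;
        }
        return slot.ptr;
    }
private:
    struct Slot { void *ptr = nullptr; size_t size = 0; };
    std::unordered_map<std::string, Slot> slots_;
};

// Hypothetical usage in place of the raw malloc above:
// float *thrBuf = (float *)BufferCache::instance().get("flash_attn_thr_buf", sizeof(float) * nth * arrStride);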

memcpy(upW + i * colSplit, weightPTR + ctx->splitIdx * colSplit, colSplit * sizeof(float));
weightPTR += intermediateSize;

int enable = (getenv("ENABLE_CAT_MLP") ? atoi(getenv("ENABLE_CAT_MLP")) : 1);
Contributor: Why do we provide the option to disable this?

@abenmao (Author): The option is kept temporarily for future performance tuning; we can probably remove it later.

Contributor: A better approach might be a static variable, so we do not need to call getenv every time. (And if it is used multiple times in the file, we could declare a global static variable.)

@abenmao (Author): I added global variables in mlp_llama.cpp. Please review.
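
A minimal sketch of the suggested pattern: read the environment variable once and cache it, so getenv is not called on every layer. The wrapper name is illustrative; the variable name mirrors the PR's ENABLE_CAT_MLP.

#include <cstdlib>

// Read ENABLE_CAT_MLP once; later calls reuse the cached value (default: enabled).
static bool catMlpEnabled() {
    static const int enabled = [] {
        const char *env = std::getenv("ENABLE_CAT_MLP");
        return env ? std::atoi(env) : 1;
    }();
    return enabled != 0;
}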

@pujiang2018 (Contributor): @abenmao Have we checked the output of this PR for long tokens?

@abenmao force-pushed the feature/layers/mha_bf16 branch from 6652c16 to 3d9f90b on December 5, 2023 09:24
@abenmao (Author) commented Dec 5, 2023: Yes, we have checked the output with long prompts for those models.

dbg.dumpMatrix(gateWeight);
dbg.debugPrint("gate output:\n");
dbg.dumpMatrix(imBuffer);
dbg.debugPrint("gateWeight:\n");
Contributor: Does this need formatting?

Contributor: Looks like I made a mistake; the code is already formatted.

@abenmao force-pushed the feature/layers/mha_bf16 branch from 3d9f90b to 3321e78 on December 6, 2023 04:47
@pujiang2018 (Contributor): Will merge after the build server check passes.

@pujiang2018 merged commit 605e62e into intel:main on Dec 6, 2023
1 check passed