[neon] Modify neon sgemv_fp16 #46

- Previously, sgemv_fp16 depended on two conditions:
  1. the column or row count had to be divisible by 8
  2. it worked entirely in fp16 (which might raise accuracy issues)
- In this commit, we expect sgemv to work as follows:
  1. support every column length (with adaptive-compute optimization)
  2. use a temporary fp32 array to limit cumulative rounding error in large-scale Tensors
  3. accelerate fp32-to-fp16 copy and vice versa with neon to improve run time
- some trivial typo fixes included

**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
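The column-length handling described in this commit message can be pictured with a portable sketch. This is not the code from the PR: `sgemv_any_cols` is a hypothetical name, plain `float` stands in for the fp16 storage type, and an 8-element block plus a scalar tail is only one assumed way to realize "support every column length". The eight partial sums mirror the lanes a neon register would hold.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not nntrainer's API): y = A * x for a row-major
// rows x cols matrix. Plain float stands in for fp16 storage here.
std::vector<float> sgemv_any_cols(const std::vector<float> &A,
                                  const std::vector<float> &x,
                                  std::size_t rows, std::size_t cols) {
  assert(A.size() == rows * cols && x.size() == cols);
  std::vector<float> y(rows, 0.0f);
  // Largest multiple of 8 that does not exceed cols.
  const std::size_t blocked = cols - cols % 8;
  for (std::size_t r = 0; r < rows; ++r) {
    const float *row = A.data() + r * cols;
    // Eight fp32 partial sums, mirroring eight SIMD lanes.
    float lanes[8] = {0.0f};
    for (std::size_t c = 0; c < blocked; c += 8)
      for (std::size_t k = 0; k < 8; ++k)
        lanes[k] += row[c + k] * x[c + k];
    float acc = 0.0f;
    for (float v : lanes)
      acc += v;
    // Scalar tail: the columns left over when cols is not divisible by 8.
    for (std::size_t c = blocked; c < cols; ++c)
      acc += row[c] * x[c];
    y[r] = acc;
  }
  return y;
}
```

With 11 columns, for example, the first 8 go through the blocked loop and the remaining 3 through the tail, so no divisibility requirement is placed on the caller.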

- Instead of explicitly declaring a float16x4_t variable and converting it into float32x4_t, it is better to do the conversion inline, considering the number of registers on the device and memory consumption.

Resolves:

**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>

- Instead of explicitly declaring a float16x4_t variable and converting it into float32x4_t, it is better to do the conversion inline, considering the number of registers on the device and memory consumption.

**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>

- Previously, we used full-fp16 variables for the sgemv and sgemm loop code.
- However, such practice might cause accumulation error that exceeds our expected epsilon.
- Now, it uses intermediate fp32 values to preserve accuracy and avoid precision loss.

**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
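The accumulation problem mentioned above can be reproduced without fp16 hardware. The sketch below is an illustration only, not code from this PR: `round_to_half_precision` is a hypothetical helper that emulates just the 11 significant bits of fp16 (range limits and subnormals are ignored, and the default round-to-nearest-even mode is assumed), and the two summation functions are likewise made up for the demonstration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical emulation of fp16 precision: keeps 11 significant bits.
// Range limits and subnormals of real fp16 are NOT modeled.
float round_to_half_precision(float v) {
  assert(std::isfinite(v)); // this sketch handles finite inputs only
  if (v == 0.0f)
    return v;
  int exp = 0;
  float mant = std::frexp(v, &exp); // v = mant * 2^exp, 0.5 <= |mant| < 1
  mant = std::nearbyint(mant * 2048.0f) / 2048.0f; // round to 11 bits
  return std::ldexp(mant, exp);
}

// Adds 1.0 n times, rounding the running sum to fp16 precision each step.
float sum_ones_half_accumulator(std::size_t n) {
  float acc = 0.0f;
  for (std::size_t i = 0; i < n; ++i)
    acc = round_to_half_precision(acc + 1.0f);
  return acc;
}

// Adds 1.0 n times with a plain fp32 running sum.
float sum_ones_float_accumulator(std::size_t n) {
  float acc = 0.0f;
  for (std::size_t i = 0; i < n; ++i)
    acc += 1.0f;
  return acc;
}
```

Under these assumptions, `sum_ones_half_accumulator(4096)` stops growing at 2048, because at that magnitude the spacing between representable values is 2 and adding 1 rounds back down, while `sum_ones_float_accumulator(4096)` reaches 4096. This is the kind of drift an fp32 intermediate avoids in long dot products.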


Commits on Sep 26, 2023

Commits on Sep 27, 2023

Commits on Oct 5, 2023