[neon] Modify neon sgemv_fp16 #46
Conversation
- Previously, sgemv_fp16 depended on two conditions: 1. the column or row count had to be divisible by 8, and 2. the computation ran entirely in fp16 (which might raise accuracy issues).
- In this commit, we expect sgemv to: 1. support every column length (with adaptive-compute optimization), 2. use a temporary fp32 array to contain cumulative rounding error in large-scale Tensors, and 3. accelerate fp32-to-fp16 copies (and vice versa) with NEON to improve time performance.
- Some trivial typo fixes are included.

**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
Force-pushed from 0fc814a to cf32826
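The following is a minimal sketch of the approach described in the commit message above, not the actual nntrainer implementation: the fp16 GEMV row is accumulated in fp32, a scalar tail handles column counts that are not multiples of 8, and the result is converted back to fp16 only at the end. The function name `sgemv_fp16_sketch` and all variable names are illustrative assumptions, and the snippet assumes an AArch64 target with `__fp16`/NEON fp16 support.

```cpp
#include <arm_neon.h>
#include <cstdint>

// Sketch: y[i] = dot(A[i, :], x) with fp32 accumulation of fp16 inputs.
void sgemv_fp16_sketch(const __fp16 *A, const __fp16 *x, __fp16 *y,
                       uint32_t rows, uint32_t cols) {
  for (uint32_t i = 0; i < rows; ++i) {
    const __fp16 *row = A + i * cols;
    float32x4_t acc = vdupq_n_f32(0.f); // fp32 accumulator avoids fp16 drift
    uint32_t j = 0;
    for (; j + 8 <= cols; j += 8) {     // main 8-wide loop
      float16x8_t a = vld1q_f16(row + j);
      float16x8_t b = vld1q_f16(x + j);
      // widen each fp16 half to fp32 before the multiply-accumulate
      acc = vfmaq_f32(acc, vcvt_f32_f16(vget_low_f16(a)),
                           vcvt_f32_f16(vget_low_f16(b)));
      acc = vfmaq_f32(acc, vcvt_f32_f16(vget_high_f16(a)),
                           vcvt_f32_f16(vget_high_f16(b)));
    }
    float sum = vaddvq_f32(acc);
    for (; j < cols; ++j)               // scalar tail: any column length works
      sum += static_cast<float>(row[j]) * static_cast<float>(x[j]);
    y[i] = static_cast<__fp16>(sum);    // convert back to fp16 once, at the end
  }
}
```

Accumulating in fp32 is what bounds the cumulative rounding error mentioned in the commit message, at the cost of two extra widening conversions per 8 elements.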
LGTM
- Instead of explicitly declaring a float16x4_t and then converting it into a float32x4_t, it is better to inline the load-and-convert, considering the number of registers on the device and memory consumption.

Resolves:

**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
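To illustrate the register-pressure point in the commit message (a hypothetical snippet, not taken from the diff): the first helper keeps an explicitly named float16x4_t temporary alive, while the second folds the load and conversion into one expression. The names B, k, N, and n mirror the quoted diff below; the helper functions themselves are assumptions.

```cpp
#include <arm_neon.h>
#include <cstdint>

// Style 1: explicit fp16 temporary, as the commit moves away from.
float32x4_t load_b_explicit(const __fp16 *B, uint32_t k, uint32_t N, uint32_t n) {
  float16x4_t b_half = vld1_f16(&B[k * N + n]); // named fp16 temporary
  return vcvt_f32_f16(b_half);
}

// Style 2: load and widen in a single expression, as preferred in this commit.
float32x4_t load_b_inline(const __fp16 *B, uint32_t k, uint32_t N, uint32_t n) {
  return vcvt_f32_f16(vld1_f16(&B[k * N + n]));
}
```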
nntrainer/tensor/blas_neon.cpp
Outdated
float32x4_t b0_7_2_low = vcvt_f32_f16(vget_low_f16(b0_7_2));
float32x4_t b0_7_2_high = vcvt_f32_f16(vget_high_f16(b0_7_2));
float32x4_t b0_7_0_low = vcvt_f32_f16(vld1_f16(&B[k * N + n]));
float32x4_t b0_7_0_high = vcvt_f32_f16(vld1_f16(&B[k * N + n]));
This should be float32x4_t b0_7_0_high = vcvt_f32_f16(vld1_f16(&B[k * N + n + 4]));
Otherwise it will load the lower four elements again.
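A small sketch of the corrected pattern the reviewer is describing (names follow the quoted diff; the helper itself is hypothetical): the high fp32 half must be widened from the next four fp16 elements at offset n + 4, otherwise both conversions read the same lower four values.

```cpp
#include <arm_neon.h>

// Widen 8 consecutive fp16 values of B into two fp32 vectors.
inline void load_b_widened(const __fp16 *B, unsigned k, unsigned N, unsigned n,
                           float32x4_t &lo, float32x4_t &hi) {
  lo = vcvt_f32_f16(vld1_f16(&B[k * N + n]));     // elements n .. n+3
  hi = vcvt_f32_f16(vld1_f16(&B[k * N + n + 4])); // elements n+4 .. n+7
}
```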
Fixed. Thanks!
- Instead of explicitly declaring a float16x4_t and then converting it into a float32x4_t, it is better to inline the load-and-convert, considering the number of registers on the device and memory consumption.

**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
Force-pushed from ca7ca69 to 4f55742
It seems the changes to tensor.h and tensor.cpp are not related to this PR. Could you remove those files from this PR so that we can merge it directly?
Also, you can open this PR against upstream.
- Previously, we used full-fp16 variables in the sgemv and sgemm loop code.
- However, such practice might cause an accumulation error that exceeds our expected epsilon.
- Now, an intermediate fp32 value is used to preserve accuracy and avoid precision loss.

**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
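As a standalone illustration of the accumulation issue this commit addresses (not nntrainer code; requires a compiler and target with `__fp16` support, e.g. AArch64 GCC/Clang): summing many values directly in fp16 stalls once the running total exceeds the fp16 mantissa range, while accumulating in fp32 and converting once at the end gives the exact result.

```cpp
#include <cstdio>

int main() {
  __fp16 acc_fp16 = 0;
  float acc_fp32 = 0.f;
  for (int i = 0; i < 4096; ++i) {
    // round back to fp16 after every addition, as a pure-fp16 loop would
    acc_fp16 = static_cast<__fp16>(acc_fp16 + static_cast<__fp16>(1.0f));
    acc_fp32 += 1.0f;
  }
  // fp16 has an 11-bit significand, so adding 1.0 stops changing the running
  // sum around 2048 (spacing between representable values becomes 2);
  // the fp32 accumulator reaches the exact value.
  std::printf("fp16 accumulate: %f\n", static_cast<float>(acc_fp16)); // ~2048
  std::printf("fp32 accumulate: %f\n", acc_fp32);                     // 4096
  return 0;
}
```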
Already merged to upstream.
Self evaluation:
Signed-off-by: ss.kong [email protected]