
[neon] Modify neon sgemv_fp16 #46

Closed
wants to merge 4 commits

Conversation


@skykongkong8 commented on Sep 26, 2023

  • Previously, sgemv_fp16 depended on two conditions:
    1. the column or row length had to be divisible by 8
    2. it computed entirely in fp16, which might raise accuracy issues
  • With this commit, sgemv_fp16 is expected to:
    1. support every column length (with adaptive-compute optimization)
    2. use a temporary fp32 array to contain cumulative rounding error in large-scale Tensors (an illustrative sketch follows this description)
    3. accelerate the fp32-to-fp16 copy, and vice versa, with NEON to improve runtime
  • Some trivial typo fixes are included.

Self evaluation:

  1. Build test: [X]Passed [ ]Failed [ ]Skipped
  2. Run test: [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: ss.kong <[email protected]>
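
A minimal sketch of the fp32-accumulation idea described above. It is illustrative only: the function name `sgemv_fp16_sketch`, the row-major M x N layout, and the AArch64-only horizontal add are assumptions for this example, not the actual kernel changed in this PR.

```c
#include <arm_neon.h>

/* Illustrative sketch: y = A * x with fp16 storage, but all accumulation
 * done in fp32 so rounding error does not build up over long rows.
 * Assumes an AArch64 target with NEON fp16 support. */
void sgemv_fp16_sketch(const __fp16 *A, const __fp16 *x, __fp16 *y,
                       unsigned int M, unsigned int N) {
  for (unsigned int i = 0; i < M; ++i) {
    float32x4_t acc = vdupq_n_f32(0.f);
    unsigned int j = 0;
    for (; j + 8 <= N; j += 8) { /* 8-wide main body */
      float16x8_t a = vld1q_f16(&A[i * N + j]);
      float16x8_t b = vld1q_f16(&x[j]);
      /* widen each fp16 half to fp32 before multiply-accumulate */
      acc = vfmaq_f32(acc, vcvt_f32_f16(vget_low_f16(a)),
                      vcvt_f32_f16(vget_low_f16(b)));
      acc = vfmaq_f32(acc, vcvt_f32_f16(vget_high_f16(a)),
                      vcvt_f32_f16(vget_high_f16(b)));
    }
    float sum = vaddvq_f32(acc); /* horizontal add (AArch64) */
    for (; j < N; ++j)           /* scalar leftover: any N is supported */
      sum += (float)A[i * N + j] * (float)x[j];
    y[i] = (__fp16)sum; /* narrow back to fp16 once, at the end */
  }
}
```

The scalar leftover loop is what removes the divisible-by-8 restriction, and the fp32 accumulator is what addresses the accuracy concern.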

@s-debadri left a comment

LGTM

- Instead of explicitly declaring float16x4_t values and converting them into float32x4_t, it is better to do the conversion inline, considering the number of NEON registers on the device and memory consumption.

Resolves:

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
float32x4_t b0_7_2_low = vcvt_f32_f16(vget_low_f16(b0_7_2));
float32x4_t b0_7_2_high = vcvt_f32_f16(vget_high_f16(b0_7_2));
float32x4_t b0_7_0_low = vcvt_f32_f16(vld1_f16(&B[k * N + n]));
float32x4_t b0_7_0_high = vcvt_f32_f16(vld1_f16(&B[k * N + n]));


This should be float32x4_t b0_7_0_high = vcvt_f32_f16(vld1_f16(&B[k * N + n + 4]));
Otherwise it will load the lower four elements again.
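
As a side note (not part of this PR's diff), one way to avoid this kind of off-by-four offset is to load all eight fp16 values once and split the vector, matching the pattern already used for b0_7_2 above:

```c
/* illustrative alternative: one 8-lane load, then split into halves */
float16x8_t b0_7_0 = vld1q_f16(&B[k * N + n]);
float32x4_t b0_7_0_low  = vcvt_f32_f16(vget_low_f16(b0_7_0));
float32x4_t b0_7_0_high = vcvt_f32_f16(vget_high_f16(b0_7_0));
```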

@skykongkong8 (Author) replied

Fixed. Thanks!

- Instead of explicitly declaring float16x4_t values and converting them into float32x4_t, it is better to do the conversion inline, considering the number of NEON registers on the device and memory consumption (see the sketch after this commit message).

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
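
A small sketch of what "converting inline" means here. The names `b_half`, `b_f32`, and `b_f32_inline` are hypothetical; `B`, `k`, `N`, and `n` follow the naming of the hunk quoted above.

```c
/* explicit fp16 temporary: one more named vector competing for NEON registers */
float16x4_t b_half = vld1_f16(&B[k * N + n]);
float32x4_t b_f32  = vcvt_f32_f16(b_half);

/* inline conversion: the fp16 vector stays a compiler-managed temporary */
float32x4_t b_f32_inline = vcvt_f32_f16(vld1_f16(&B[k * N + n]));
```

Both forms compute the same value; folding the conversion into the expression leaves register allocation to the compiler and avoids keeping extra fp16 vectors live.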
@jijoongmoon (Owner) left a comment

It seems the changes to tensor.h and tensor.cpp are not related to this PR. Could you remove those files from this PR so that we can merge it directly?
Also, you could open this PR against upstream.

- Previously, we used full-fp16 variables in the sgemv and sgemm loop code.
- However, such practice might cause accumulation error that exceeds our expected epsilon.
- Now, intermediate fp32 values are used to preserve accuracy and avoid precision loss (a narrowing-copy sketch follows this commit message).

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
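
The intermediate fp32 values eventually have to be narrowed back to fp16 storage. Below is a minimal sketch of doing that copy with NEON rather than a scalar loop; the helper name `copy_f32_to_f16` and its signature are assumptions for illustration, not this PR's API.

```c
#include <arm_neon.h>

/* Hypothetical helper: narrow an fp32 scratch buffer into an fp16 buffer,
 * four lanes at a time, with a scalar tail so any length is supported. */
static void copy_f32_to_f16(const float *src, __fp16 *dst, unsigned int len) {
  unsigned int i = 0;
  for (; i + 4 <= len; i += 4)
    vst1_f16(&dst[i], vcvt_f16_f32(vld1q_f32(&src[i]))); /* 4 floats -> 4 halves */
  for (; i < len; ++i)
    dst[i] = (__fp16)src[i];
}
```

The reverse direction (fp16 to fp32) is the same pattern with vld1_f16, vcvt_f32_f16, and vst1q_f32.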
@skykongkong8 (Author) commented

Already merged to upstream.
Closing.

@skykongkong8 deleted the soft_pr branch on November 9, 2023, 00:35