Speed/sequence expand #9289
Conversation
a4470fe to 5447046
@@ -362,6 +362,9 @@ def __assert_is_close(self, numeric_grads, analytic_grads, names,
for a, b, name in itertools.izip(numeric_grads, analytic_grads, names):
    abs_a = np.abs(a)
    abs_a[abs_a < 1e-3] = 1
    print("actual", a)
    print("*****")
    print("expected", b)
Please remove this debug code.
the instance length*/
const int x_item_length, T* out_data) {
  constexpr int N = 1024;
  __shared__ int mem[N];
- The length of the shared memory array could be defined outside the kernel.
- I'm curious how much using shared memory affects the performance. Do you have a benchmark?
I'm curious how much using shared memory affects the performance. Do you have a benchmark?
Good question.
Without shared memory, the speed stays the same as with shared memory. Note that shared memory is roughly 100x faster than global memory, which may be what covers the overhead.
Place: CUDA
Time unit: ms
Sorted by total time in descending order in the same thread
Event Calls Total Min. Max. Ave.
thread0::sum 60819 4296.31 0.010656 4.47222 0.0706409
thread0::mul_grad 21669 3030.08 0.032448 2.92419 0.139835
thread0::sequence_softmax_grad 1959 1761.28 0.039712 4.09632 0.89907
thread0::mul 21669 1613.08 0.02496 2.75418 0.0744419
thread0::sequence_softmax 1959 1441.05 0.038592 3.60509 0.735606
thread0::elementwise_add_grad 9795 661.935 0.022528 2.10016 0.0675789
thread0::sequence_expand_grad 1959 654.569 0.121216 2.8896 0.334134
thread0::sequence_expand 1959 553.549 0.119872 6.49626 0.282567
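For reference, a minimal sketch of the shared-memory staging pattern being compared here; the kernel name, signature, and the fixed bound N are assumptions rather than the PR's exact code:

```cuda
// Hypothetical sketch of the shared-memory variant being benchmarked:
// each block stages the per-sequence offsets once, then all threads read
// the cached copy instead of going back to global memory.
template <typename T>
__global__ void sequence_expand_shared_sketch(const T* x_data, T* out_data,
                                              const size_t* offset,
                                              size_t lod_size) {
  constexpr int N = 1024;      // assumed upper bound on lod_size
  __shared__ size_t mem[N];

  // Strided cooperative copy: one pass over the offsets per block.
  for (int i = threadIdx.x; i < lod_size; i += blockDim.x) {
    mem[i] = offset[i];
  }
  __syncthreads();

  // ... the expand/copy loop would index through mem[] from here on ...
}
```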
Moving the offset computation out of the kernel gives some benefit:
Place: CUDA
Time unit: ms
Sorted by total time in descending order in the same thread
Event Calls Total Min. Max. Ave.
thread0::sum 60695 4183.43 0.010816 6.24307 0.0689254
thread0::mul_grad 21625 2983.08 0.032448 2.84122 0.137946
thread0::mul 21625 1599.23 0.026432 14.0889 0.0739527
thread0::sequence_softmax_grad 1955 1559.16 0.039232 3.93517 0.797522
thread0::sequence_softmax 1955 1243.7 0.035968 2.35155 0.636165
thread0::elementwise_add_grad 9775 645.952 0.020096 2.46816 0.0660821
thread0::sequence_expand_grad 1955 517.621 0.12704 2.53744 0.264768
thread0::lstm_grad 60 460.934 6.75344 8.54125 7.68223
thread0::sequence_expand 1955 416.672 0.124384 2.38714 0.213131
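A rough host-side sketch of "making the offset out of the kernel", following the same accumulation rule that appears in the kernel snippet below; the function and variable names are illustrative:

```cuda
#include <cstddef>
#include <vector>

// Hypothetical CPU-side precomputation of the per-sequence output offsets,
// replacing the accumulation previously done inside the kernel. The result
// would be copied to the device once and passed to the kernel as a pointer.
std::vector<size_t> ComputeExpandOffsets(const std::vector<size_t>& ref_lod,
                                         const std::vector<size_t>& x_lod) {
  std::vector<size_t> offsets(ref_lod.size(), 0);
  size_t offset = 0;
  for (size_t i = 1; i < ref_lod.size(); ++i) {
    // Sequence i-1 is repeated (ref_lod[i] - ref_lod[i-1]) times and has
    // (x_lod[i] - x_lod[i-1]) rows, so it contributes their product.
    offset += (ref_lod[i] - ref_lod[i - 1]) * (x_lod[i] - x_lod[i - 1]);
    offsets[i] = offset;
  }
  return offsets;
}
```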
// TODO(dzhwinter) : too many atomicAdd
// use shared memory to reduce memory visits
constexpr int N = 1024;
__shared__ int mem[N];
Same as above.
if (i < lod_size - 1) {
  offset += (ref_lod[i + 1] - ref_lod[i]) * (x_lod[i + 1] - x_lod[i]);
}
}
If the block size is 16x16 threads, lines 35~40 run 256 times, which means mem is written 256 times; this is unnecessary. Please double-check.
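One common way to avoid the redundant writes, sketched below under the assumption that only lod_size entries need filling: let a single row of the 2D block (or a strided subset of threads) do the initialization, then synchronize before anyone reads.

```cuda
// Hypothetical fix: only row 0 of the 2D thread block writes the shared
// array, so the initialization happens once per block rather than once
// per (threadIdx.x, threadIdx.y) pair.
__global__ void init_shared_once_sketch(const size_t* offset, size_t lod_size) {
  __shared__ int mem[1024];

  if (threadIdx.y == 0) {
    for (int i = threadIdx.x; i < lod_size; i += blockDim.x) {
      mem[i] = static_cast<int>(offset[i]);
    }
  }
  __syncthreads();  // all threads wait, then read mem[] as before

  // ... rest of the kernel unchanged ...
}
```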
int thread_y = 16;
int thread_z = 1024 / thread_x / thread_y;
int block_x = static_cast<int>(ref_lod.size());
dim3 block_size(thread_x, thread_y, thread_z);
Please double check the block_size.
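For reference, a quick sanity check on the launch configuration; thread_x is not visible in this hunk, so 32 below is only an assumption. The product of the three dimensions must stay within the 1024-threads-per-block limit.

```cuda
#include <cassert>

// Hypothetical helper mirroring the configuration above; asserts the CUDA
// limit of 1024 threads per block. thread_x = 32 is an assumption.
dim3 MakeBlockSize(int thread_x = 32) {
  int thread_y = 16;
  int thread_z = 1024 / thread_x / thread_y;  // 2 when thread_x == 32
  assert(thread_x * thread_y * thread_z <= 1024);
  return dim3(thread_x, thread_y, thread_z);
}
```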
@@ -13,7 +13,135 @@ See the License for the specific language governing permissions and
limitations under the License. */

#define EIGEN_USE_GPU
Maybe #define EIGEN_USE_GPU is no longer needed now.
const size_t* offset,
const size_t lod_size,
/* default=1,
   the instance length*/
Are lines 30~31 useful? If not, please remove them.
const T* dout_data, const size_t* ref_lod, const size_t* dx_lod,
const size_t* offset, const size_t lod_size,
/* default=1,
   the instance length*/
Same as above.
dim3 block_size(thread_x, thread_y, thread_z);
dim3 grid_size(block_x, 1);

sequence_expand_kernel<<<grid_size, block_size, 0, context.stream()>>>(
The design of sequence_expand_kernel is excellent, but I think the logic is a little complex; maybe it can be simpler.
From my limited understanding, sequence_expand_kernel copies one matrix to another according to the row indices of the source matrix, and those row indices can be computed on the CPU side (see the sketch after the examples below).
For example:
Case 1:
Given a 1-level LoDTensor input(X)
X.lod = [[0, 2, 4]]
X.data = [[a], [b], [c], [d]]
X.dims = [4, 1]
and input(Y)
Y.lod = [[0, 2, 4],
[0, 3, 6, 7, 8]]
ref_level: 0
then we get 1-level LoDTensor
Out.lod = [[0, 2, 4, 6, 8]]
Out.data = [[a], [b], [a], [b], [c], [d], [c], [d]]
Out.dims = [8, 1]
The row index should be [0,1,0,1,2,3,2,3].
Case 2:
Given 1-level LoDTensor input(X)
X.lod = [[0, 1, 4]]
X.data = [[a], [b], [c], [d]]
X.dims = [4, 1]
and input(Y)
Y.lod = [[0, 2, 4],
[0, 3, 6, 6, 8]]
ref_level: 0
then we get 1-level LoDTensor
Out.lod = [[0, 1, 2, 5, 8]]
Out.data = [[a], [a], [b], [c], [d], [b], [c], [d]]
Out.dims = [8, 1]
The row index should be [0,0,1,2,3,1,2,3].
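A minimal host-side sketch of that row-index computation; the function name is illustrative, but it reproduces the indices of both cases above from X.lod and the ref_level slice of Y.lod:

```cuda
#include <cstddef>
#include <vector>

// Hypothetical CPU-side computation of the expanded row indices.
//   x_lod   : X.lod at the level being expanded, e.g. {0, 2, 4}
//   ref_lod : Y.lod at ref_level,                e.g. {0, 2, 4}
std::vector<size_t> ComputeRowIndex(const std::vector<size_t>& x_lod,
                                    const std::vector<size_t>& ref_lod) {
  std::vector<size_t> row_index;
  for (size_t i = 0; i + 1 < x_lod.size(); ++i) {
    size_t repeat = ref_lod[i + 1] - ref_lod[i];  // times sequence i repeats
    for (size_t r = 0; r < repeat; ++r) {
      for (size_t row = x_lod[i]; row < x_lod[i + 1]; ++row) {
        row_index.push_back(row);
      }
    }
  }
  return row_index;
}

// Case 1: x_lod = {0, 2, 4}, ref_lod = {0, 2, 4} -> {0,1,0,1,2,3,2,3}
// Case 2: x_lod = {0, 1, 4}, ref_lod = {0, 2, 4} -> {0,0,1,2,3,1,2,3}
```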
Let's merge it first to avoid the PR becoming too large.
Fixes #9220.
Single-batch sequence_expand op: 0.595545 -> 0.331348, roughly a 1.8x speedup.
Note that there are many memory accesses inside the CUDA kernel; using shared memory could give a further improvement.
Before optimization:
After optimization: