
Speed/sequence expand #9289

Merged 16 commits into PaddlePaddle:develop on Apr 13, 2018

Conversation

@dzhwinter (Contributor) commented Mar 21, 2018

Fixes #9220.
Single-batch sequence_expand op: average time 0.595545 ms -> 0.331348 ms, roughly a 1.8x speedup.
Note that there is a lot of memory traffic inside the CUDA kernel; using shared memory may give a further enhancement.

Before optimization:

Event                                     Calls       Total       Min.        Max.        Ave.
thread0::sum                              40856       3485.82     0.012192    89.6786     0.0853197
thread0::mul_grad                         14556       2180.88     0.047808    3.56227     0.149827
thread0::sequence_softmax_grad            1316        1359.13     0.052192    39.9963     1.03278
thread0::mul                              14556       1185.32     0.037536    2.49907     0.0814315
thread0::sequence_softmax                 1316        1111.14     0.046752    7.49488     0.844328
thread0::sequence_pool                    1336        927.845     0.043168    89.3362     0.694495
thread0::sequence_expand_grad             1316        783.738     0.046336    4.13078     0.595545
thread0::sequence_pool_grad               1336        566.56      0.03648     1.31149     0.424072
thread0::sequence_expand                  1316        538.142     0.029536    18.5873     0.408922
thread0::elementwise_add_grad             6580        482.06      0.0304      0.765728    0.0732613

After optimization:

Event                                     Calls       Total       Min.        Max.        Ave.
thread0::sum                              40112       3388.82     0.012928    121.016     0.084484
thread0::mul_grad                         14292       2073.97     0.042816    1.68877     0.145114
thread0::sequence_softmax_grad            1292        1400.66     0.050528    5.28992     1.0841
thread0::sequence_softmax                 1292        1177.63     0.045984    2.67725     0.91148
thread0::mul                              14292       1105.29     0.033536    2.02675     0.0773365
thread0::sequence_pool                    1312        856.938     0.039168    1.75539     0.653154
thread0::sequence_pool_grad               1312        617.013     0.03408     2.27245     0.470284
thread0::elementwise_add_grad             6460        452.989     0.025408    0.492832    0.0701221
thread0::sequence_expand_grad             1292        428.101     0.140192    12.6613     0.331348
thread0::sequence_expand                  1292        354.059     0.139392    0.813536    0.274039
thread0::elementwise_add                  6460        307.216     0.027424    29.2352     0.0475566
thread0::lstm_grad                        40          304.187     6.11402     8.52278     7.60468

@@ -362,6 +362,9 @@ def __assert_is_close(self, numeric_grads, analytic_grads, names,
for a, b, name in itertools.izip(numeric_grads, analytic_grads, names):
abs_a = np.abs(a)
abs_a[abs_a < 1e-3] = 1
print("actual", a)
print("*****")
print("expected", b)
Contributor:
Please remove this debug code.

the instance length*/
const int x_item_length, T* out_data) {
constexpr int N = 1024;
__shared__ int mem[N];
Contributor:
  1. The length of the shared memory can be defined outside the kernel (see the sketch below).
  2. I'm curious how much using shared memory affects performance. Do you have a benchmark?
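
One way to define the shared-memory length outside the kernel, as point 1 suggests, is dynamic shared memory, where the byte size is chosen at launch time. A minimal sketch with hypothetical names, not the PR's actual kernel:

__global__ void cache_lod_sketch(const size_t* ref_lod, size_t lod_size) {
  // Dynamically sized shared buffer; its byte size is the third <<<>>> launch argument.
  extern __shared__ size_t cached_lod[];
  // Cooperatively stage the LoD offsets from global into shared memory.
  for (size_t i = threadIdx.x; i < lod_size; i += blockDim.x) {
    cached_lod[i] = ref_lod[i];
  }
  __syncthreads();
  // ... the expansion logic would read cached_lod[] instead of ref_lod[] ...
}

// Launch: the host picks the shared-memory size outside the kernel.
// cache_lod_sketch<<<grid, block, lod_size * sizeof(size_t), stream>>>(ref_lod, lod_size);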

dzhwinter (Author):
I'm curious how much using shared memory affects performance. Do you have a benchmark?

Good question.
Without shared memory, the speed is the same as with shared memory. Note that shared memory is roughly 100x faster than global memory, so maybe that offsets the overhead.

Place: CUDA
Time unit: ms
Sorted by total time in descending order in the same thread

Event                                     Calls       Total       Min.        Max.        Ave.
thread0::sum                              60819       4296.31     0.010656    4.47222     0.0706409
thread0::mul_grad                         21669       3030.08     0.032448    2.92419     0.139835
thread0::sequence_softmax_grad            1959        1761.28     0.039712    4.09632     0.89907
thread0::mul                              21669       1613.08     0.02496     2.75418     0.0744419
thread0::sequence_softmax                 1959        1441.05     0.038592    3.60509     0.735606
thread0::elementwise_add_grad             9795        661.935     0.022528    2.10016     0.0675789
thread0::sequence_expand_grad             1959        654.569     0.121216    2.8896      0.334134
thread0::sequence_expand                  1959        553.549     0.119872    6.49626     0.282567

dzhwinter (Author):
Moving the offset computation out of the kernel gives some benefit; see the profile below and the sketch after it.

Place: CUDA
Time unit: ms
Sorted by total time in descending order in the same thread

Event                                     Calls       Total       Min.        Max.        Ave.
thread0::sum                              60695       4183.43     0.010816    6.24307     0.0689254
thread0::mul_grad                         21625       2983.08     0.032448    2.84122     0.137946
thread0::mul                              21625       1599.23     0.026432    14.0889     0.0739527
thread0::sequence_softmax_grad            1955        1559.16     0.039232    3.93517     0.797522
thread0::sequence_softmax                 1955        1243.7      0.035968    2.35155     0.636165
thread0::elementwise_add_grad             9775        645.952     0.020096    2.46816     0.0660821
thread0::sequence_expand_grad             1955        517.621     0.12704     2.53744     0.264768
thread0::lstm_grad                        60          460.934     6.75344     8.54125     7.68223
thread0::sequence_expand                  1955        416.672     0.124384    2.38714     0.213131
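
A rough sketch of the hoisted offset computation (a hypothetical helper, not necessarily the PR's exact code): offsets[i] is the number of expanded rows before segment i, accumulated on the CPU from the two LoD levels, so the kernel can index its output directly.

#include <cstddef>
#include <vector>

std::vector<size_t> ComputeOutOffsets(const std::vector<size_t>& ref_lod,
                                      const std::vector<size_t>& x_lod) {
  std::vector<size_t> offsets(ref_lod.size(), 0);
  for (size_t i = 1; i < ref_lod.size(); ++i) {
    // Segment i-1 of X has (x_lod[i] - x_lod[i-1]) rows and is repeated
    // (ref_lod[i] - ref_lod[i-1]) times in the output.
    offsets[i] = offsets[i - 1] +
                 (ref_lod[i] - ref_lod[i - 1]) * (x_lod[i] - x_lod[i - 1]);
  }
  return offsets;
}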

// TODO(dzhwinter) : too many atomicAdd
// use shared memory to reduce memory visits
constexpr int N = 1024;
__shared__ int mem[N];
Contributor:
Same as above.

if (i < lod_size - 1) {
offset += (ref_lod[i + 1] - ref_lod[i]) * (x_lod[i + 1] - x_lod[i]);
}
}
Contributor:
If the thread block size is 16x16, lines 35~40 run 256 times, which means mem is assigned 256 times; that is unnecessary. Please double check.
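
A minimal sketch of one way to address this (a hypothetical fragment, not the PR's code): only the threadIdx.y == 0, threadIdx.z == 0 slice of the block fills the shared buffer, so it is written once per block rather than once per (threadIdx.y, threadIdx.z) combination.

__global__ void fill_shared_once_sketch(const size_t* offset, size_t lod_size) {
  constexpr size_t N = 1024;
  __shared__ size_t mem[N];
  if (threadIdx.y == 0 && threadIdx.z == 0) {
    // Stage the offsets into shared memory with a single slice of threads.
    for (size_t i = threadIdx.x; i < lod_size && i < N; i += blockDim.x) {
      mem[i] = offset[i];
    }
  }
  __syncthreads();
  // ... the expansion logic would then read mem[] instead of offset[] ...
}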

int thread_y = 16;
int thread_z = 1024 / thread_x / thread_y;
int block_x = static_cast<int>(ref_lod.size());
dim3 block_size(thread_x, thread_y, thread_z);
Contributor:
Please double check the block_size.
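
For reference, a hedged sketch of one way to validate it (hypothetical values, not the PR's code): a CUDA block is limited to 1024 threads in total and its z dimension to 64, so the product of the three dimensions should be checked.

#include <algorithm>
#include <cassert>
#include <cuda_runtime.h>

// Hypothetical helper: clamp thread_z so the block stays within device limits.
dim3 PickBlockSize(int thread_x, int thread_y) {
  int thread_z = std::min(64, 1024 / (thread_x * thread_y));
  assert(thread_z >= 1);
  assert(thread_x * thread_y * thread_z <= 1024);
  return dim3(thread_x, thread_y, thread_z);
}

// e.g. PickBlockSize(32, 16) gives dim3(32, 16, 2).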

@@ -13,7 +13,135 @@ See the License for the specific language governing permissions and
limitations under the License. */

#define EIGEN_USE_GPU
Contributor:
Maybe #define EIGEN_USE_GPU is not useful anymore.

const size_t* offset,
const size_t lod_size,
/* default=1,
the instance length*/
Contributor:
Are lines 30~31 useful? If not, please remove them.

const T* dout_data, const size_t* ref_lod, const size_t* dx_lod,
const size_t* offset, const size_t lod_size,
/* default=1,
the instance length*/
Contributor:
Same as above.

dim3 block_size(thread_x, thread_y, thread_z);
dim3 grid_size(block_x, 1);

sequence_expand_kernel<<<grid_size, block_size, 0, context.stream()>>>(
Contributor:
The design of sequence_expand_kernel is excellent, but I think the logic is a little complex; maybe it can be simpler.
From my limited understanding, sequence_expand_kernel copies one matrix to another according to the row index of the source matrix, and that row index can be computed on the CPU side.
For example:

Case 1:
Given a 1-level LoDTensor input(X)
    X.lod =  [[0,   2,        4]]
    X.data = [[a], [b], [c], [d]]
    X.dims = [4, 1]
and input(Y)
    Y.lod = [[0,    2,    4],
             [0, 3, 6, 7, 8]]
ref_level: 0
then we get 1-level LoDTensor
    Out.lod =  [[0,   2,        4,        6,        8]]
    Out.data = [[a], [b], [a], [b], [c], [d], [c], [d]]
    Out.dims = [8, 1]

The row index should be [0,1,0,1,2,3,2,3].

Case 2:

Given 1-level LoDTensor input(X)
    X.lod =  [[0,   1,        4]]
    X.data = [[a], [b], [c], [d]]
    X.dims = [4, 1]
and input(Y)
    Y.lod = [[0,    2,    4],
             [0, 3, 6, 6, 8]]
ref_level: 0
then we get 1-level LoDTensor
    Out.lod =  [[0,   1,   2,        5,             8]]
    Out.data = [[a], [a], [b], [c], [d], [b], [c], [d]]
    Out.dims = [8, 1]

The row index should be [0,0,1,2,3,1,2,3].
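
Based on the two cases above, the row index could be derived on the CPU roughly like this (hypothetical helper name, not code from this PR):

#include <cstddef>
#include <vector>

// For each segment i of the reference LoD, repeat rows x_lod[i]..x_lod[i+1] of X
// (ref_lod[i+1] - ref_lod[i]) times; the result is the source row index for every
// output row, so the GPU kernel only needs to gather rows of X into Out.
std::vector<size_t> ComputeRowIndex(const std::vector<size_t>& x_lod,
                                    const std::vector<size_t>& ref_lod) {
  std::vector<size_t> row_index;
  for (size_t i = 0; i + 1 < ref_lod.size(); ++i) {
    size_t repeat = ref_lod[i + 1] - ref_lod[i];
    for (size_t r = 0; r < repeat; ++r) {
      for (size_t row = x_lod[i]; row < x_lod[i + 1]; ++row) {
        row_index.push_back(row);
      }
    }
  }
  return row_index;
}

// Case 1: ComputeRowIndex({0, 2, 4}, {0, 2, 4}) -> {0, 1, 0, 1, 2, 3, 2, 3}
// Case 2: ComputeRowIndex({0, 1, 4}, {0, 2, 4}) -> {0, 0, 1, 2, 3, 1, 2, 3}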

@JiayiFeng (Collaborator) left a comment:

Merge it first to avoid it getting too large.

@JiayiFeng merged commit c1bf06f into PaddlePaddle:develop on Apr 13, 2018
Successfully merging this pull request may close these issues.

[Speed] speed up the sequence_expand op