optimize graph sampling in graph engine #21

Liwb5 · 2021-11-17T08:17:46Z

PR types

Function optimization

PR changes

Others

Describe

Optimize the data structure from C++ to Python to speed up sampling in the graph engine.

* graph engine demo * upload unsaved changes * fix dependency error * fix shard_num problem * py client * remove lock and graph-type * add load direct graph * add load direct graph * add load direct graph * batch random_sample * batch_sample_k * fix num_nodes size * batch brpc * batch brpc * add test * add test * add load_nodes; change add_node function * change sample return type to pair * resolve conflict * resolved conflict * resolved conflict * separate server and client * merge pair type * fix * resolved conflict * fixed segment fault; high-level VLOG for load edges and load nodes * random_sample return 0 * rm useless loop * test:load edge * fix ret -1 * test: rm sample * rm sample * random_sample return future * random_sample return int * test fake node * fixed here * memory leak * remove test code * fix return problem * add common_graph_table * random sample node &test & change data-structure from linkedList to vector * add common_graph_table * sample with srand * add node_types * optimize nodes sample * recover test * random sample * destruct weighted sampler * GraphEdgeBlob * WeightedGraphEdgeBlob to GraphEdgeBlob * WeightedGraphEdgeBlob to GraphEdgeBlob * pybind sample nodes api * pull nodes with step * fixed pull_graph_list bug; add test for pull_graph_list by step * add graph table;name * add graph table;name * add pybind * add pybind * add FeatureNode * add FeatureNode * add FeatureNode Serialize * add FeatureNode Serialize * get_feat_node * avoid local rpc * fix get_node_feat * fix get_node_feat * remove log * get_node_feat return py:bytes * merge develop with graph_engine * fix threadpool.h head * fix * fix typo * resolve conflict * fix conflict * recover lost content * fix pybind of FeatureNode * recover cmake * recover tools * resolve conflict * resolve linking problem * code style * change test_server port * fix code problems * remove shard_num config * remove redundant threads * optimize start server * remove logs * fix code problems by reviewers' suggestions * move graph files into a folder * code style change * remove graph operations from base table * optimize get_feat function of graph engine * fix long long count problem * remove redundant graph files * remove unused shell * recover dropout_op_pass.h * fix potential stack overflow when request number is too large & node add & node clear & node remove * when sample k is larger than neighbor num, return directly * using random seed generator of paddle to speed up * fix bug of random sample k * fix code style * fix code style * add remove graph to fleet_py.cc * fix blocking_queue problem * fix style * fix * recover capacity check * add remove graph node; add set_feature * add remove graph node; add set_feature * add remove graph node; add set_feature * add remove graph node; add set_feature * fix distributed op combining problems * optimize * remove logs * fix MultiSlotDataGenerator error * cache for graph engine * fix type compare error * more test&fix thread terminating problem * remove header * change time interval of shrink * use cache when sample nodes * remove unused function * change unique_ptr to shared_ptr * simplify cache template * cache api on client * fix * reduce sample threads when cache is not used * reduce cache memory * cache optimization * remove test function * remove extra fetch function Co-authored-by: Huang Zhengjie <[email protected]> Co-authored-by: Weiyue Su <[email protected]> Co-authored-by: suweiyue <[email protected]> Co-authored-by: luobin06 <[email protected]> Co-authored-by: liweibin02 <[email protected]> Co-authored-by: tangwei12 <[email protected]>
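One item in the commit body above says that when the requested sample size k is larger than the number of neighbors, sampling returns directly. The sketch below illustrates that early exit in plain Python. It is not the graph engine's code: `sample_neighbors` and its `seed` parameter are hypothetical names chosen for this example, and the standard library's `random` module stands in for Paddle's own random seed generator.

```python
import random


def sample_neighbors(neighbors, k, seed=None):
    """Return up to k distinct neighbors chosen uniformly at random.

    Hypothetical helper for illustration only; not part of the graph engine.
    """
    # Early exit: when k covers the whole neighbor list there is nothing to
    # choose, so return a copy of the list without touching the generator.
    if k >= len(neighbors):
        return list(neighbors)
    rng = random.Random(seed)
    # random.sample draws k distinct elements without replacement.
    return rng.sample(neighbors, k)
```

Calling `sample_neighbors([1, 2, 3], 5)` takes the early-exit path and returns all three neighbors, while `sample_neighbors(list(range(10)), 3)` draws three distinct values.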

…PaddlePaddle#36643) * add split_program * make ut faster * increase ut timeout * make result deterministic * add fuse_all_reduce pass * add ut framework, update * fix ut framework * remove useless code * add coverage support * update * fix CI * fix some bugs and fix ci coverage * fix conflict

* remove input dim check of activation in op_teller * remove input dim check of concat in op_teller * remove input dim check of clip in op_teller * remove input dim check of scale in op_teller * remove input dim check in op_teller * update attr check of slice in op_teller

…7122) * move extension into pten [no-verify] * append tensor methods by ext_tensor [no-verify] * append other tensor methods [no-verify] * ext related files tidy [no-verify] * include relation tidy [no-verify] * add pten tensor test [no-verify] * replace tensor in custom op & compile success * refine tensor constructor for unittest * custom relu jit run success * fix all custom op unittests * add inference cmake adapt [no-verify] * fix failed unittests * fix windows failed unittests * try to fix kunlun and inference failed * fix test_elementwise_api error * try to fix win compile failed * fix kunlun fp16 type error * remove useless handle error macro * add custom linear op test * fix compile failed & add win symbols * fix non pten kernel cast failed * add dll decl for api * polish several details * polish details by review comment * add dll_decl for register

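The implementation of fused_attention_op uses bias_add, which is itself implemented with kernel primitives. The kernel primitive WriteData API and its internal implementation were later changed so that the out-of-bounds check moved into a template parameter. As a result, the wrong branch was called, which produced out-of-bounds writes that corrupted the contents of other GPU memory. In practice, a single run of test_fused_attention_op_api.py almost never fails, but running it repeatedly in a loop with inputs of different shapes gives incorrect results. The failure is intermittent, so the bug is hard to notice.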
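The failure mode in the fused_attention_op note above, a write that skips its bounds check and spills into neighboring memory, can be modeled in a few lines of Python. This is a simplified sketch, not Paddle's kernel primitive code: `write_tile`, `run_demo`, and the `is_boundary` flag are hypothetical names, and a Python list stands in for GPU memory.

```python
def write_tile(memory, start, tile, valid_count, is_boundary):
    """Copy a fixed-size tile into memory, beginning at index start.

    is_boundary plays the role of the compile-time flag: the copy is
    clipped to valid_count elements only when it is True.
    """
    count = min(valid_count, len(tile)) if is_boundary else len(tile)
    for i in range(count):
        memory[start + i] = tile[i]


def run_demo(is_boundary):
    # One flat allocation holding two logical buffers back to back:
    # indices 0..5 belong to the output, indices 6..7 to an unrelated tensor.
    memory = [0] * 6 + [7, 7]
    tile = [1, 1, 1, 1]
    # The last tile of the output starts at index 4 and has 2 valid elements.
    write_tile(memory, 4, tile, valid_count=2, is_boundary=is_boundary)
    return memory
```

With `is_boundary=True` the unrelated values at indices 6 and 7 survive. With `is_boundary=False` they are overwritten, which mirrors how selecting the wrong branch can silently corrupt another buffer while the output itself still looks plausible.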

* reshape kernel refactor * fix compile bugs when run ci * support xpu for reshape * fix bugs when run unittest in kunlun ci * fix compile bugs when run kunlun * perfect code according to suggestion * add api and unit test for reshape

* fix. test=develop * fix. test=develop * fix. test=develop * fix. test=develop * fix. test=develop * fix ut. test=develop * fix ut. test=develop * fix ut. test=develop * refactor heter trainer. test=develop * fix. test=develop * fix ut. test=develop * fix ut. test=develop * fix ut. test=develop * fix ut. test=develop * fix ut. test=develop * fix ut. test=develop * fix ut. test=develop * fix ut. test=develop * fix. test=develop * fix. test=develop * fix. test=develop * fix. test=develop * fix ut. test=develop * fix ut. test=develop * fix ut. test=develop

* search and fill slot_feature * search and fill slot_feature, fix compile error * search and fill slot_feature, rename 8 as slot_num_ * remove debug code Co-authored-by: root <[email protected]>

CtfGo and others added 30 commits November 13, 2021 20:13

cinn_launch_op: skip checking input variables must be used (PaddlePad…

228eb89

…dle#37119) Modify several implementations on CinnLaunchOp: 1. Skip checking input variables must be used 2. Move current helper functions to a CinnlaunchContext

[PTen]Reshape Kernel Refactor (PaddlePaddle#37164)

895692e

* reshape kernel refactor * fix compile bugs when run ci * support xpu for reshape * fix bugs when run unittest in kunlun ci * fix compile bugs when run kunlun * perfect code according to suggestion

[heterps]bug fix for local training with --heter_worker_num (PaddlePa…

31cd914

…ddle#37166) * fix. test=develop * fix. test=develop * fix. test=develop * fix. test=develop * fix. test=develop * fix ut. test=develop * fix ut. test=develop * fix ut. test=develop

modify sparse_attention docs, test=document_fix (PaddlePaddle#36554)

6b0cc2b

* modify sparse_attention docs, test=develop * add warning * add warning ,test=document_fix

Optimize Matmul_v2 (PaddlePaddle#37037)

444a735

Optimize dot product of Matmul_v2

add fetch op for cinn graph output node of build_cinn_pass (PaddlePad…

10cc040

…dle#37172)

fix bug of indexing with ellipsis (PaddlePaddle#37182)

f2a56c6

Accessor 20211112 2 (PaddlePaddle#37181)

84b0ec9

[New features] Add elementwise_mul triple grad kernel (PaddlePaddle#3…

59fdf4d

…7152) * Add elementwise_mul triple grad kernel * Removed InplaceInferer and polished code

fix cinn_compile_test not pass problem (PaddlePaddle#37190)

83eef6d

Added BF16 to mean op (PaddlePaddle#37104)

df7cc45

* Added BF16 to mean op * fix for CI * fix for CI * fix for CI

fix:delete macro INFERENCE (PaddlePaddle#37130)

b628c31

fix 3 bug of new_executor (PaddlePaddle#37142)

8358d61

* fix 3 bug, test=develop * refine, test=develop

fix ctest depent probs (PaddlePaddle#37203)

cf958f2

remove needless declare (PaddlePaddle#37195)

9c59170

[new-exec] fix stream analysis (PaddlePaddle#37161)

584b4b2

* fix revord_event * refine class Instruction * refine Instruction and InterpreterCore * make instruction and operator_base consistent * support NoNeedBufferVar in stream_analyzer * fix place of event * add vlog before continue

[fleet_executor] Add sync method (PaddlePaddle#37167)

f49c2c2

supports the slice of upper tensor, test=develop (PaddlePaddle#37215)

c5ccff7

added onednn elu kernel (PaddlePaddle#37149)

ae40ee3

modify long time ut list (PaddlePaddle#37220)

5091fed

Make FLAGS_determinstic effective in conv2d forward. (PaddlePaddle#37173

ea47d21

) * Make FLAGS_determinstic effective in conv2d forward. * Add call of SetCinnCudnnDeterministic in cinn_launch op.

Make Distributed Pass UT Timeout Smaller (PaddlePaddle#37199)

a01e27c

* make pass ut timeout smaller * increase ut timeout

test=document_fix (PaddlePaddle#37234)

56810f4

for pure fp16 (PaddlePaddle#37230)

6ebc318

Add pure fp16 support for fused transformer.

veyron95 and others added 17 commits November 16, 2021 15:45

Removed unnecessary ENFORCE statement (PaddlePaddle#37219)

70b7c7e

refine pass by removing CommOpt, CalcOpt, ParallelOpt (PaddlePaddle#3…

4c160be

…7206)

Fix the logic of VarBase _to func (PaddlePaddle#37193)

f29a3c6

[psgpu]fix pipe bug:save and pull overlap; test=develop (PaddlePaddle…

62ec644

…#37233)

Added BF16 Pool2d grad (PaddlePaddle#37081)

f95d44a

* Added BF16 Pool2d grad * upstream pulled * fix for CI * fixes after review

decrease pten log level (PaddlePaddle#37239)

d8982c5

Dependence analysis (PaddlePaddle#37231)

d943459

* add * add BuildOperatorDependences * fix bug * add unittest for write after write * fix merge bug * fix

[Einsum] correct output dimension errors. (PaddlePaddle#37222)

5237cc0

* [Einsum] correct output dimension errors due to single element tensors. * [Einsum] format polish.

[npu][hybrid] support offload (PaddlePaddle#37224)

762819a

[Fleet Executor] Construct runtime graph (PaddlePaddle#37158)

0daa69d

rename TensorBase interface data_type() to dtype() (PaddlePaddle#37257)

1e9b3a3

copy beta pow to same place when skip_update=1 (PaddlePaddle#37245)

5e4b419

* copy beta pow to same place when skip_update=1 * fix xpu

add ut parallel (PaddlePaddle#37211)

1223238

fix compile error when pslib use cpu branch;test=develop (PaddlePaddl…

0057c12

…e#37248)

update dataset (PaddlePaddle#37194)

ca8c4f3

optimize the data structure from c++ to python to speed up sampling i…

06847bc

…n graph engine

Liwb5 closed this Nov 17, 2021
