Fix integration test failure due to compatibility with Ray 1.5.0 #4412

Merged
merged 1 commit into from
Aug 5, 2021

@shanyu-sys shanyu-sys commented Aug 5, 2021

@shanyu-sys shanyu-sys merged commit 5e662bd into intel-analytics:master Aug 5, 2021
shanyu-sys added a commit to shanyu-sys/analytics-zoo that referenced this pull request Sep 8, 2021
shanyu-sys added a commit to shanyu-sys/analytics-zoo that referenced this pull request Sep 9, 2021
shanyu-sys added a commit to shanyu-sys/analytics-zoo that referenced this pull request Sep 10, 2021
Le-Zheng pushed a commit to Le-Zheng/analytics-zoo that referenced this pull request Sep 13, 2021
Le-Zheng pushed a commit to Le-Zheng/analytics-zoo that referenced this pull request Sep 13, 2021
Le-Zheng pushed a commit to Le-Zheng/analytics-zoo that referenced this pull request Sep 14, 2021
Le-Zheng pushed a commit to Le-Zheng/analytics-zoo that referenced this pull request Sep 14, 2021
Le-Zheng pushed a commit to Le-Zheng/analytics-zoo that referenced this pull request Sep 15, 2021
Le-Zheng added a commit that referenced this pull request Sep 17, 2021
* [Enhancement] Check duplicate layers in the container (#2351)

* check duplicate layers in the container

* add more unit tests

* fix unit test

* fix test

* meet code review

* add [[MkString]] Operation (#2355)

* add [[MkString]] Operation

* add SerializerTest

* modify the ScalaDoc

* fix the compiler error

* Enhance and refactor the logic of InferShape (#2293)

* refactor keras api

style and imports

fix

* fix warning

* fix python

* fix ser test

* fix test

* fix test

* remove checking for weight sharing

* fix unittest

* Keras-like API for training and evaluation (#2306)

* update scala compile fit

* update

* python topology

* refactor lenet

* update

* refactor fit

* add python ut

* ut for image dataset

* fix ut

* fix

* update docs

* update readme

* fix

* [BugFix] TensorOp & SelectTensor (#2380)

* solve conflicts

* resolve conflict again

* Keras-like API functional merge and some fix (#2313)

* resolve conflicts

* update merge

* update merge

* update doc

* refactor python

* fix

* style

* meet review

* Fix reload model in python (#2382)

* fix

* fix default

* Public getInputShape and getOutputShape (#2401)

* open getInputShape and getOutputShape

* clean

* remove logger

* fix

* [Bug Fix] Duplicate check should sometimes be suspended (#2403)

* allow keras.Input layer skip duplicate check

* fix some bug

* add some comments

* meet code review

* fix timedistributed to make it compatible with 0.4 (#2408)

* Add Keras Website Documentation for Layers I (#2414)

* add core layers doc

* add advancedactivations and convolutional layers doc

* add dropout, embedding and normalization layers doc

* add pooling and recurrent layers doc

* update

* change

* update

* add merge

* add embedding

* change data

* modify

* update

* update

* change

* update

* Refine Keras-like API LeNet definition (#2407)

* lenet keras seq

* update

* refine

* update

* update

* fix java creator

* meet review

* fix lenet definition

* fix lenet

* test

* fix lenet

* update

* fix

* remove

* [Bug Fix] Fix typo in SpatialSeparableConvoluiton layer name and add related docs (#2420)

* fix typo in spatialseparableconvoluiton name and add docs

* meet code review

* meet code review and fix tests

* Add Keras Website Documentation for Layers II (#2421)

* add zeropadding doc

* modify

* add upsampling

* add atrousconvolution

* add deconvolution2d

* add more

* add locallyconnected

* refactor model evaluate (#2434)

* Keras-Style API website doc for training; refine doc format (#2441)

* training api

* add

* finish

* remove

* fix

* refine doc

* update

* remove

* update

* [Enhancement] ImageFrame adding more APIs (#2464)

* add api for evaluation

* rename method

* fix

* wrapper parquet

* fix api

* fix read issue

* fix BCE return Nan (#2473)

* [bug fix]refine getTimes and time counting. (#2506)

* refine getTimes

* delete some useless code

* sort times

* add unit test

* [New Feature] Add new operation Gather (#2510)

* add operation gather

* add comments

* gather support float indices

* add gather unit test

* some change

* fix style check

* fix squeeze test (#2509)

* [new feature]add operation max (#2523)

* add max

* add serialization test

* max spec

* meet code review

* meet code review

* fix unit test

* [new feature]add generateBackward for loadTF (#2529)

* max support one-element tensor indices (#2530)

* fix const status not handled correct in loop (#2531)

* add parameter sync for batchnorm (#2559)

* add parameter sync for batchnorm

* fix test

* fix test

* fix test

* fix test

* refinement

* fix

* fix test issue

* add backward compatibility

* refine to set Id instead of renaming

* fix style issue

* refinement per review

* fix

* add change to example

* refinement

* fix ut issue to avoid multiple context

* add comments

* fix style

* [new feature] refine Stridedslice, support begin/end/shrinkAxis mask (#2526)

* refine stridedslice

* delete some file

* meet code review

* fix serial unit test

* fix serialization test

* [new feature] multi optimMethods support in Optimizer (#2560)

* multiOptimMethod

* some update

* fix unit test

* fix ut

* fix unit test

* fix python unit test

* meet code review

* update optimizer.py

* update python

* meet code review

* fix: memory leak in `model.predictImageSet`. (#2557)

* fix: memory leak in `model.predictImageSet`.

There are three causes of the memory leak.

1. Repeated allocations in bigquant, which will be fixed in BigDL-core.
2. Modules are cloned repeatedly but never released. `model.predictImageSet`
   creates a new Predictor again and again.
3. Shared weights.

This patch adds a `StorageManager`, which contains a concurrent hash map
to track all allocations of native memory/resources and prevent
duplicate release. It is also helpful for debugging.
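
The registry idea described above can be sketched as follows. This is a minimal Python illustration, not the actual Scala `StorageManager`: the class layout, the method names `register`/`release`/`live_handles`, and the use of a lock-guarded dict in place of a concurrent hash map are all assumptions made for the example.

```python
import threading


class StorageManager:
    """Hypothetical registry of native allocations, keyed by handle."""

    def __init__(self):
        self._lock = threading.Lock()
        self._live = {}  # handle -> callback that frees the native resource

    def register(self, handle, release_fn):
        with self._lock:
            self._live[handle] = release_fn

    def release(self, handle):
        """Free a handle once; return False if it was already released."""
        with self._lock:
            release_fn = self._live.pop(handle, None)
        if release_fn is None:
            return False  # duplicate release is ignored
        release_fn()
        return True

    def live_handles(self):
        """List handles that have not been released yet (useful for debugging)."""
        with self._lock:
            return sorted(self._live)


freed = []
manager = StorageManager()
manager.register(1, lambda: freed.append(1))
manager.register(2, lambda: freed.append(2))
first = manager.release(1)
second = manager.release(1)  # second call on the same handle does nothing
```

Because every allocation goes through one map, a leak shows up as an entry that is still listed by `live_handles()` at shutdown.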

* fix: delete .

* refactor:  as the API for AbstractModule

* fix: distribute predictor memory leak

* fix: move delete operation to ModelBroadcast

* refinement per review

* fix ut

* fix scala version issue

* Feat: MKL-DNN Supports (#2482)

This feature enables mkl-dnn support, which can speed up deep learning models. We wrap the native C API in Java; the wrappers live in the BigDL-core project. In BigDL, we integrated convolution, batchnorm, maxpooling, avgpooling, relu, lrn, softmax, caddtable and concattable. Currently, it only supports creating models that contain only dnn layers or containers.

Because the data layout is optimized in mkl-dnn, an mkl-dnn model uses `DnnTensor`, which holds a native buffer, as its default tensor. So there are some things to note:

1. The user should copy the data from the JVM heap at the first layer and copy it back to the JVM heap at the last layer.
2. The user should compile the model with the phase (training/inference) and the input tensor size. Compilation infers and allocates the remaining information.

* fix: linear performance issue and serialization of java object in MklDnnTensor

* memory leak refactor

* memory leak and bn performance issues

1. Memory Leak
The internal MklDnnTensor buffer should not be re-assigned without
being released, so we check it first. At the first iteration, or after the
input size changes, we create a new MklDnnTensor as the buffer.

2. Bn perf
The JIT BatchNormalization only supports avx2 or avx512, and has much
better performance than the ref version. The input and gradOutput formats
should be the same to get the best performance.
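
The buffer rule in item 1 (allocate on the first iteration or when the input size changes, and release the old buffer before re-assigning) can be sketched like this. `NativeBuffer` and `Layer` are hypothetical stand-ins written for illustration; they are not BigDL classes.

```python
class NativeBuffer:
    """Stand-in for an off-heap tensor that must be released explicitly."""

    def __init__(self, size):
        self.size = size
        self.released = False

    def release(self):
        self.released = True


class Layer:
    def __init__(self):
        self._buffer = None
        self.allocations = []  # kept only so the example can be inspected

    def _ensure_buffer(self, size):
        # Allocate only on the first iteration or when the input size changes.
        if self._buffer is None or self._buffer.size != size:
            if self._buffer is not None:
                self._buffer.release()  # free the old buffer before re-assigning
            self._buffer = NativeBuffer(size)
            self.allocations.append(self._buffer)
        return self._buffer

    def forward(self, batch):
        return self._ensure_buffer(len(batch))


layer = Layer()
a = layer.forward([0.0] * 4)
b = layer.forward([1.0] * 4)  # same size: the buffer is reused
c = layer.forward([0.0] * 8)  # size changed: old buffer released, new one created
```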

* test: add some test cases for BatchNorm.

Floating-point computation in the JVM does not give the same results as native C/C++ code,
and batch norm amplifies the difference considerably, e.g. 10^-8 -> 10^-4 -> 10^-1

* fix: rebase with upstream master:

1. Concat and ConcatTable should inherit from DynamicContainer.
2. updateParameters has been deprecated.
3. zeroGradParameters should be final. But from now on, the Linear
   should use it.
4. Some other syntax or semantic errors.

* perf: single node and single model performance

* perf: single model

* feat: add fusion for mkl-dnn

* test: add test utils to compare dnn output

* test: add some tests compared with caffe

* add unit tests for dnn tensor

* add unit test for reorder memory

* test: fix the test regression errors

* checkin reorder manager

* add backward for sequential

* fix some bugs

* update core ref

* add unit tests

* refactor: move the static class DataType, AlgKind and so on to standalone class (#4)

* refactor: delete MklDnn.MemoryFormat

* refactor: move the static class DataType, AlgKind and so on to standalone class

* fix: core refactor errors

* refactor: spec errors (#5)

* Mkl dnn dev (#6)

* checkin reorder manager

* add container and refine reorder manager

* fix merge issue

* add join table forward

* refine interface (#7)

* add LRN and ReLU

* add pooling

* refactor: conv + linear + bn

* add JoinTable backward

* refactor: conv + linear + bn

* add cAddTable concattable

* fix: reorder failed on some of convs

* refactor: softmax

* refactor: fusion support

* refactor: resnet_50

* refactor: move tests to this branch

* refactor: delete useless files and enable the special old tests.
refactor: delete unused methods in MklDnnOps
fix: scalastyle check

* fix: rebase with upstream

* fix: ignore the prototxt tests

* fix: do not change the core commit ref

* fix: move set num of threads for mkldnn to ResNet50Perf

* fix: serialization disabled for mkldnn module

* [Issue fix] - Fix MM layer multi forward/backward issue (#2583)

* fix mm issue

* refinement

* [Bug Fix]clear preTopology's output while cloneCells (#2585)

* clear preTopology's output while cloneCells

* fix unit test

* Dnn model serialization supports. (#2598)

* feat: add simple serialization supports

* feat: mkl-dnn modules serialization supports

* fix: make primitive(desc) to private

* fix: typo

* fix: modified based on comments

* test: private to call api.

* Add serialization for mkl dnn (#2593)

Add dense weights and gradients and support optimizer (local, distribute).

Add a `Blob` for the pair of dense and native weights/gradients with the MemoryData layout.

* Modify the thread pool to adopt mkldnn models (#2608)

`Engine.default` will support a single thread, including for `LocalOptimizer` and `DistriOptimizer`. To support a single-threaded version of the `invokeWait2` method in `ThreadPool`, the thread pool is set to the current thread.

1. For a dnn model, it uses affinity to bind the omp threads. For performance reasons, the default thread must be the current main thread.
2. MTLabeldBGRImgToBatch uses a separate new thread pool called io, so it is not blocked when the default thread pool is single-threaded.
3. FileWriter does not use the default pool; otherwise the whole app gets stuck when creating the summary.
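
As a rough illustration of "set the thread pool to the current thread", the sketch below runs tasks inline on the caller when the pool size is 1 and falls back to worker threads otherwise. The class `InlineOrPooled` and its `invoke_and_wait` method are invented for this example and only loosely mirror `ThreadPool.invokeWait2`.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class InlineOrPooled:
    """Hypothetical pool: with size 1, tasks run on the calling thread."""

    def __init__(self, size):
        self._executor = ThreadPoolExecutor(max_workers=size) if size > 1 else None

    def invoke_and_wait(self, tasks):
        if self._executor is None:
            # Single-threaded mode: no hand-off, the caller does the work.
            return [task() for task in tasks]
        futures = [self._executor.submit(task) for task in tasks]
        return [future.result() for future in futures]

    def shutdown(self):
        if self._executor is not None:
            self._executor.shutdown()


def current_thread_id():
    return threading.get_ident()


caller_id = threading.get_ident()

single = InlineOrPooled(1)
inline_ids = single.invoke_and_wait([current_thread_id, current_thread_id])
single.shutdown()

pooled = InlineOrPooled(2)
pooled_ids = pooled.invoke_and_wait([current_thread_id])
pooled.shutdown()
```

In the single-threaded case any thread affinity already applied to the caller also applies to the tasks, which is the property the commit message relies on.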

* feature: add shutdown for optimizer which will release the native resources (#2609)

Release native resources at the end of training.

It calls `release` on every model cloned in the optimizer at the end of training.

1) `LocalOptimizer` is very simple because all cloned models are local.
2) `DistriOptimizer` is a little more complicated. We should release before `models.unpersist`; otherwise it
     will serialize and transfer the model again. `ModelBroadcast` also clones a new model when calling
     value, so those should be released as well.

* NLL unlabeled data fix (#2620)

* fix: the inference performance regression of mkldnn (#2622)

We should copy weights in updateOutput only during training. For inference, the weights are loaded beforehand and do not change.

* feat: training ResNet50 w/ dnn backend on Spark. (#2624)

* feat: resnet50 training with distributed mode
* fix: unknown segmentation fault
* fix: clone of dnn tensor
* fix: delete unused codes
* fix: bn initialization
* fix: performance regression
* fix: convergence regression
* fix: delete the release in ModelBroadcast
* fix: pass all unit tests and remove the segmentation fault.

* feat: add dnn vgg model. (#2627)

* feat: add dnn vgg model.

* fix: rename the ResNet50Perf to Perf

* fix join table will throw exception during backward if batchsize is changed (#2638)

* fix join table backward

* change to resize as

* feature: vgg-16 with mkldnn backend (#2631)

* feat: vgg-16 with mkldnn backend
* fix: tests errors
* fix: case class has too many arguments
* fix: vgg_16 blas model supports
* fix: protobuf of serializer
* fix: sgd poly test case error
* fix: consistency of poly impl
* fix: rename the version2 of Xavier to varianceNormAverage

* perf: need not narrow the gradients and zero gradients for dnn backend (#2632)

* perf: need not narrow the gradients and zero gradients for dnn backend
* fix: empty gradient zero for dnn backend
* fix: delete affine

* New parallel optimizer (#2643)

* add new parallel optimizer

* change info back to debug for metrics log

* refinement per comments

* refinement per comments on single model optimization

* refinement for sharing common methods

* fix style

* refinement to reuse duplicate code

* Fix transfer learning (#2645)

* fix transfer learning
* add ParseSingleExample, DecodeBmp tf loader
* add corresponding unit tests

* remove potential performance downgrader (#2651) (#2652)

* Fix transfer learning (#2645)

* fix transfer learning
* add ParseSingleExample, DecodeBmp tf loader
* add corresponding unit tests

* remove potential performance downgrader

* add dnn graph (#2666)

* add dnn graph

* move compile to forward, add graph test to perf

* add dnn graph option to example

* style check

* replace dnn with dnn graph in examples

* add no phase api when initPrimitives (#2686)

* delete phase in initPrimitives

* fix style check

* improve memoryReorder layer to handle conversion between nhwc and nchw (#2683)

* fix reorder to handle nhwc

* add init memory for ReorderMemory

* support same padding in dnn layer (#2684)

* support same padding in dnn layer

* meet review

* add BlasWrapper (#2690)

* add BlasWrapper

* refactor code

* meet review

* SerializerSpec excluded mkldnn.BlasWrapper

* change some comments

* add dnn output layer (#2691)

* add dnn output layer

* SerializerSpec excluded mkldnn Output

* change some comments

* add IR graph and conversion from IR graph to blas graph or dnn graph (#2704)

* add ir graph

* fix model evaluate & conv without bias

* add dnnMode & support table inputs

* irelement & graph layer use same weights

* meet pr comments and code refactor

* convert static graph to IR graph and build (#2711)

* add static graph to IR graph

* meet pr comments

* fix: move mkldnn computing to a single thread pool (#2724)

If we use the parent thread directly, there are two bugs:
1. The child threads forked from the parent thread will be bound to core 0
because of the affinity settings.
2. The native thread has some unknown thread-local variables. So if
the parent thread exits and is recreated, as happens with a thread from
Executors.newFixedThreadPool, the whole app will hit a segmentation fault.
The parent thread means the main thread (Local Mode) or the worker thread of
mapPartition (Distributed Mode).
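
The fix can be pictured as routing every computation through one long-lived worker thread, so thread-local state is created once and never orphaned. The sketch below is an analogy in Python; `_native_state` and `compute` are hypothetical stand-ins for the native library's thread-local variables and the mkldnn call.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

_native_state = threading.local()  # stand-in for native thread-local variables


def compute(x):
    # Lazily initialise per-thread state, as a native library might.
    if not hasattr(_native_state, "calls"):
        _native_state.calls = 0
    _native_state.calls += 1
    return threading.get_ident(), _native_state.calls, x * 2


# One dedicated worker: every computation sees the same thread and the same state.
dnn_pool = ThreadPoolExecutor(max_workers=1)
results = [dnn_pool.submit(compute, i).result() for i in range(3)]
dnn_pool.shutdown()

worker_ids = {thread_id for thread_id, _, _ in results}
call_counts = [count for _, count, _ in results]
outputs = [value for _, _, value in results]
```

The call counter keeps increasing across submissions, which shows that the per-thread state survives between tasks instead of being rebuilt on a fresh thread.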

* add ceilMode for Pooling & fix batchNorm evaluate (#2708)

* add ceilMode for Pooling & fix batchNorm evaluate

* add training status for dnn layer

* fix comments

* fix IRGraph init & Add regularizer (#2736)

* fix IRGraph init & Add regularizer

* meet review comments

* fix: update mkldnn version to v0.17 issues. (#2712)

There are two issues:

1. A padded tensor is required. mkl-dnn pads tensors, which
    uses more memory, e.g. 4x1x28x28 becomes 4x8x28x28 (avx2). It
    pads to a multiple of the SIMD width.
2. The TensorMMap between DenseTensor and DnnTensor. The previous impl
    allocated the DnnTensor when the model was created, which cost too much
    space. This patch allocates it at runtime instead.
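
The padding arithmetic in item 1 can be checked with a short sketch. It assumes an NCHW shape where only the channel dimension is padded, and a SIMD width of 8 floats for avx2; the helper names are made up for the example.

```python
def padded_shape(shape, simd_width):
    """Round the channel dimension (index 1, NCHW) up to a multiple of simd_width."""
    n, c, h, w = shape
    padded_c = -(-c // simd_width) * simd_width  # ceiling division
    return (n, padded_c, h, w)


def element_count(shape):
    total = 1
    for dim in shape:
        total *= dim
    return total


logical = (4, 1, 28, 28)
physical = padded_shape(logical, simd_width=8)  # 8 floats fit in one 256-bit register
overhead = element_count(physical) / element_count(logical)
```

For the shape quoted above, the padded tensor holds 8 times as many elements as the logical one, which is why allocating it eagerly at model creation was costly.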

* add computshape for some layers and add skip primitives in DnnGraph (#2740)

* add computshape for some layer and add skip primitives in DnnGraph

* meet pr comments

* include edge case to cover all the data types (#2742)

* layer auto fusion for dnn graph (#2746)

* add auto fusion in dnn graph

* refactor predict for dnn model (#2737)

* refactor predict for dnn model

* [New Feature] Calculating Scales (#2750)

* [New Feature]Calculating Scales

* recursively update mask for container module (#2754)

* recursively update mask for container module

* [Enhancement] - Speed up BlasWrapper performance under MKL-DNN (#2748)

* add parallel in Blaswrapper

* refactor to support ssd

* meet pr comments

* fix logger serialize

* feat: reorder for int8 supports (#2756)

1. Because of the new data type, we add a new attribute called dataType
    to `MemoryData`.
2. Because we need to pass the scales for FP32->int8 and int8->FP32 conversion,
    we add two new attributes called `mask` and `scales`.
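
To make the `mask` and `scales` pair more concrete, here is a small sketch for a 2-D tensor. It assumes the convention that bit i of the mask means "keep a separate scale for each index of dimension i" and that a scale is the maximum absolute value over the remaining dimensions. Both points are assumptions for illustration and are not confirmed by this commit message; `calc_scales` is a hypothetical helper.

```python
def calc_scales(matrix, mask):
    """Max-abs scales for a 2-D list; bit i of mask keeps dimension i separate."""
    keep_rows, keep_cols = bool(mask & 1), bool(mask & 2)
    scales = {}
    for r, row in enumerate(matrix):
        for c, value in enumerate(row):
            key = (r if keep_rows else 0, c if keep_cols else 0)
            scales[key] = max(scales.get(key, 0.0), abs(value))
    return [scales[key] for key in sorted(scales)]


weights = [[0.5, -2.0], [1.5, 0.25]]
global_scale = calc_scales(weights, mask=0)  # one scale for the whole tensor
per_row = calc_scales(weights, mask=1)       # one scale per index of dimension 0
per_col = calc_scales(weights, mask=2)       # one scale per index of dimension 1
```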

* fix conversion accuracy (#2760)

*  fix accuracy for saved model

* exclude mkldnn model when conversion

* feature: layer wise supports of int8 (#2762)

Enable the int8 data type in layers, especially for convolutions.
A given layer can then accept an int8 input. If you want fp32
output, you should add a reorder.

* feature: mkldnn int8 layer wise supports (#2759)

It includes 3 steps.

1. Generate the scales of the model.
   This needs an api like `generateScalesWithMask` to generate the scales of
   the fp32 model. The model returned is an fp32 model too.
2. Quantize the model.
   The `quantize()` api will be compatible with the `bigquant`
   backend and sets the quantize flag. When compiling,
   the quantized weight, output and input are generated by mkldnn at
   runtime.
3. Do the inference (forward).
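
The three steps can be mimicked on a single dot product. This is a toy sketch assuming symmetric quantization with a max-abs scale; the function names are invented here and do not correspond to the BigDL `generateScalesWithMask` or `quantize()` implementations.

```python
def generate_scale(values):
    """Step 1: record the maximum absolute value seen in FP32."""
    return max(abs(v) for v in values)


def quantize(values, scale):
    """Step 2: map FP32 values to signed 8-bit integers in [-127, 127]."""
    return [max(-127, min(127, round(v * 127.0 / scale))) for v in values]


def int8_dot(q_x, q_w, x_scale, w_scale):
    """Step 3: accumulate in integers, then rescale the result to FP32."""
    acc = sum(a * b for a, b in zip(q_x, q_w))
    return acc * (x_scale / 127.0) * (w_scale / 127.0)


x = [1.0, -0.5, 0.25]
w = [0.5, 2.0, -1.0]
x_scale, w_scale = generate_scale(x), generate_scale(w)
q_x, q_w = quantize(x, x_scale), quantize(w, w_scale)
approx = int8_dot(q_x, q_w, x_scale, w_scale)
exact = sum(a * b for a, b in zip(x, w))
```

The int8 result is close to the FP32 result but not identical, which is the accuracy trade-off that the scale generation step tries to keep small.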

* enable fusion by default (#2766)

* fix: use too much memory of mkldnn models (#2783)

* fix: inplace of input/output and weight dimension error (#2779)

Some layers' input and output use the same memory. We can't do forward in
`calcScales`, because by that time the input has been changed and its scales may
not be right. For example,

Sequential().add(Conv).add(ReLU)

will take two steps: seq.forward(input) first, and then, on reaching the ReLU, it
will do another forward, so the input will already be the output and the scales will be
wrong.
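
The aliasing problem can be reproduced in a few lines. The sketch assumes an in-place ReLU and a max-abs scale; `relu_inplace` and `max_abs` are illustrative helpers, not BigDL functions.

```python
def relu_inplace(values):
    """In-place ReLU: the returned list is the same object as the input."""
    for i, v in enumerate(values):
        if v < 0:
            values[i] = 0.0
    return values


def max_abs(values):
    return max(abs(v) for v in values)


conv_output = [-3.0, 1.0, 2.0]        # stand-in for the Conv output, i.e. the ReLU input

scale_before = max_abs(conv_output)   # correct input scale, taken before forward
relu_output = relu_inplace(conv_output)
scale_after = max_abs(conv_output)    # the input buffer now holds the output
```

Measuring the input scale after the forward pass gives 2.0 instead of 3.0, because the negative value that set the scale has been overwritten.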

For convolution's weight, the dimension is always 5, even when the group number
is 1. But for dnn convolution, if there's no group, the weight's dimension
should be 4.

* fix: the blas wrapper has no scales (#2778)

* fix softmax (#2777)

* fix: performance regression on resnet50 (#2774)

u8 to s8 or s8 to u8 needs no reorder in this case.

* fix log init (#2781)

* fix: dropout should init primitive (#2789)

* fix the wrong error message (#2800)

* [New feature] Add attention layer and ffn layer (#2795)

* add attention layer

* add ffn layer and more unit tests

* refactor according to pr comments

* add SerializationTest

* fix unit tests

* add python api

* bugfix - set mask for container (#2807)

* bugfix - set mask for container

* bugfix #2805: set dimension mask

* Update Graph.scala

* Update Graph.scala

* change set mask indicator's name

* rename set mask params

* fix: memory data hash code should contain data type (#2821)

* Optimize backward graph generation and CAddTable (#2817)

* Optimize backward graph generation and caddtable

* refine add table

* change api name

* add layer norm and expand size layers (#2819)

* add layer norm and expand size

* meet pr comments

* feat: enable global average pooling (#2823)

* feat: enable global average pooling

* test: add more unit tests

* Dilation in MKL-DNN Convolution (#2815)

* mkldnn-dilatedconv

* fix typos

fix typos

* make todo all uppercase

* fix: calculate arbitrary mask of scales (#2822)

* [New feature] add transformer layer (#2825)

* add transformer

* refactor class name

* use same embedding for translation

* fix pr comments

* Support init_spark_on_yarn and RayContext (#1344)

* rayrunner

* add a jvm killer

* disable killer from spark job and rely on jvm killer only

* add env and verify the cv2 installation

* enhance

* minor

* style

* comments

* local and enhancement

* better local strategy

* doc and style

* doc

* style

* revert

* doc

* disable

* comments

* fix test

* feat: MKLDNN LSTM unidirectional/bidirectional inference support (#2806)

* LSTM draft

* MKLDNN LSTM fixed MD

* added hiddenSize

* setMemoryData NativeData

* weights NativeData format set to ldigo, all 1 test passed

* fixed format any problem

* LSTM weights bias initialisation

* add LSTM2 in nn

* Bidirectional LSTM inference enabled

* modified Bidirectional test

* LSTMSpec input format conversion bug between bigdl and mkldnn fixed, not support random weights, bias

* fixed the last problem 1 3 2 4

* Three inference tests with randomly generated parameters

* Added comments and modified the LSTMSpec (tests using Equivalent.nearequals)

* Deleted nn/LSTM2. Renamed methods. Added a requirement in nn/TimeDistributed

* combined initMemoryDescs() into initFwdPrimitives()

* Add require for input size and hidden size matching if layers of LSTM is more than one

* Refactor RNN

* Add comment on gate order to mkldnn/RNN

* Add unidirectional multilayer test

* add comments/ modify UTs

* phase is not used anymore/ use isTraining() instead

* operationWant enhanced/ weight init/ release() parameters()

* remove input format check and change some variables names

* input format check / throw exception print info / release code

* comment style and RNNSerialTest

* remove unnecessary comments

* bug fix for cmul (#2836)

* bug fix for cmul

* meet pr comments

* Enhance Ray on spark (#1449)

* add more doc, spark_conf and extra_options

* fix pip install

* doc

* fix release.sh

* We should search and find the necessary jars from python env rather than upload them again (#1460)

* fix path

* jars

* set new storage to weight and bias for weight fusion (#2839)

* Add transformer to LM example (#2835)

* add transformer to LM example

* refactor dropout in Transformer

* meet pr comments

* feat: MKLDNN LSTM unidirectional/bidirectional backward support (#2840)

* MKLDNN LSTM backward support with accuracy testing

* fix: require consistent between shape and layout of mkldnn (#2824)

* fix: fusion for multi-group of convolution (#2826)

* fix: support int8 of jointable (#2827)

* fix: support int8 of jointable
* doc: add more docs

* fix acc bug & init dnn thread (#2841)

* support tnc and ntc conversion (#2844)

* support ntc in dnn layer (#2847)

* support ntc in dnn layer

* meet pr comments

* [WIP]Add beam search feature in transformer model (#2834)

* add beam search feature

* Update beam search feature and unit test

* add symbolToLogits function set check

* update clearState and add serial test

* add SequenceBeamSearch to python layers

* add createSequenceBeamSearch method to python api

* Remove ray dependencies from init_spark_on_yarn (#1500)

* remove ray dependencies

* delete

* update beam search feature for interface with transformer model (#2855)

* update beam search for padding value and cache structure

* update python API for beam search

* add comments and update python layer

* modify comments format

* modify comments format

* Support converting blas lstm to dnn lstm (#2846)

* convert from blas lstm to dnn lstm

* meet pr comments

* fix load lstm error bug (#2858)

* Add beam search in transformer (#2856)

* Add beam search in transformer

* meet pr comments

* fix: upgrade the performance of normalize (#2854)

* feat: add axis to softmax (#2859)

* Add an example notebook for implementing PS with Ray  (#1522)

* upup

* add ps notebook for ray

* Expose more option for driver in RayContext (#1541)

* expose more option for driver

* minor

* more

* modify run-pytests to check version before test ray (#1532)

* Create .keep

* update .keep path

* add rl_pong example

* move to rl_pong direction

* remove original file

* add parameter server example

* add license

* PEP8 checks

* PEP8 checks

* Add into integration test

* Wrap tests into bash function

* Update license

* PEP8 checks

* Correct syntax of rl_pong

* modify run-pytests to check version before test ray

* test pyspark version and spark home

* add check spark_home's pyspark in case pyspark can't be found

* add check version before run ray examples

* change spark home

* change spark home

* install packages which are needed in ray examples

* check error

* fix error

* change execution to spark-submit

* change memory

* change object memory to test

* add atari_py dependency

* remove .keep

* move ray test to new files

* change some ray-pip lines into function

* remove rl_pong and fix parameter_server iterations

* add iteration

* change iterate, print info

* add more info

* add __init__ files

* change ray to rayexample to avoid conflict and change spark-submit to python to submit tasks

* renamed foreach_evaluator to foreach_worker because rllib update and rename file rllib to rllibexample

* add a dedicated file for the ray test

* PEP8 check fix

* PEP8 check fix

* remove test_split

* remove --doctest-modules about ray

* add time.sleep

* feat: RoiAlign Forward (#2874)

* feat: Feature Pyramid Networks Forward (#2870)

* add gemm layer (#2882)

* add gemm layer

* add transpose in gemm layer

* add Shape layer (#2885)

* add shape layer

* add Gather layer (#2897)

* add gather layer

* [New feature] Add maskhead (#2892)

* support for maskhead

* fix ray and add more test (#1596)

* fix ray and add more test

Signed-off-by: Jieru Hong <[email protected]>

* modify raycontext and move test file to func

Signed-off-by: Jieru Hong <[email protected]>

* modify process and add sc.stop in the end

Signed-off-by: Jieru Hong <[email protected]>

* delete one repeat and check PEP8

Signed-off-by: Jieru Hong <[email protected]>

* change file name and remove some useless code

Signed-off-by: Jieru Hong <[email protected]>

* rename test yarn reinit file

Signed-off-by: Jieru Hong <[email protected]>

* ignore test reinit raycontext

Signed-off-by: Jieru Hong <[email protected]>

* modify  predict/predictClass function  (#2868)

* predictClass output modification

* predict/predictClass function modification in Beta Api

* predict/predictClass function modification

* predictClass function modification

* [New feature] Add Boxhead (#2894)

* add boxhead

* add SerialTest

* meet pr comments

* fix: Add TopBlocks to Feature Pyramid Networks (FPN) (#2899)

* Auto memory management for MKLDNN (#2867)

* add memory owner

* Add DnnTensor to MemoryOwner

* delete unused file

* style fix

* Move ReorderManager to MemoryOwner

* Fix compiling errors

* use Releasable as a general management type. release input layer.

* remove redundant null checking

* style fixes

* change _implicitMemoryOwner -> _this

* [New feature] Add region proposal (#2896)

* add Regionproposal

* [New feature] add maskrcnn (#2908)

* add maskrcnn

* fix mask head

* move maskrcnn to models

* add maskrcnn serialTest

* Add Onnx Supported Layers (#2902)

* remove duplicated layers

* feat: MKLDNN GRU forward/backward support (#2893)

* Onnx support: modify unsqueeze function (#2910)

* modify unsqueeze function

* Fix memory leaks on training (#2914)

* add memory owner

* Add DnnTensor to MemoryOwner

* delete unused file

* style fix

* Move ReorderManager to MemoryOwner

* Fix compiling errors

* use Releasable as a general management type. release input layer.

* remove redundant null checking

* fix memory leak in batch norm

* style fixes

* change _implicitMemoryOwner -> _this

* release submat

* release opencv submats

* support batch for mask head and pooler (#2926)

* support batch for mask head

* meet comments

* Onnx support: add a dim parameter to ops.Gather (#2920)

* add dim parameter to ops.Gather

* improve and simplify code

* improve and simplify code

* improve and simplify code

* improve and simplify code

* support batch for regionproposal (#2928)

* support batch for regionproposal

* Onnx support: add pos parameter to softmax (#2933)

* add pos parameter to softmax

* add pos parameter to softmax

* add pos parameter to softmax

* fix review problem

* fix review problem

* support batch input for boxhead (#2924)

* boxhead support batch input

* meet pr comments

* ONNX Support (#2918)

* onnx dev

* add onnx loader

* clean up

* add post processing for maskrcnn model (#2931)

* add mask postprocessing

* put image info to mask model

* revert back api (#2943)

* fix: softmax and bn+scale fusion (#2937)

* feat: multi models support with MKL-DNN backend (#2936)

* feat: multi models support with MKL-DNN backend

* add no argument apply api for softmax (#2945)

* add no argument apply api for softmax

* add no argument apply api for softmax

* add maskrcnn inference example (#2944)

* add maskrcnn inference example

* meet pr comments

* add model download url

* Minor fix for detecting sc is local (#1735)

* fix sc local

* update

* memory data cleanup (#2956)

* memory data cleanup

* Onnx support: RoiAlign and TopK parameter update (#2957)

* Topk add dim and increase parameter

* RoiAlign add max pooling mode

* add test cases

* add test cases

* add callZooFunc and change all callBigDlFunc to callZooFunc (#1793)

* feat: add softmax backward (#2967)

* feat: add softmax backward

* fix: fuse bn scale and relu to bn. (#2966)

* fix: fuse bn scale and relu.

* refactor anchor generator (#2963)

* refactor anchor generator

* meet pr comments

* fix code style

* ROIAlign refactor (#2960)

* ROIAlign refactor

* fix unit tests

* support roialign backward (#2975)

* support roialign backward

* fix sparselinear unit test

* Remove default mkl settings when start ray (#1837)

* fix: bn nhwc error, the channel should be the last dim (#2981)

* fix: softmax dnn backend wrong order of primitive (#2986)

* Add a method to merge nested StaticGraphs (#2985)

* NHWC support when running with MKL-DNN (#2989)

* support NHWC for MKLDNN

* fix unit tests

* Keras with MKL-DNN backend support (#2990)

* Fix ray get in related unit tests (#1985)

* move automl ut test from run-pytests into run-pytests-ray, add limitation for object_store_memory

* config unit tests

* fix echo message in tests script

* Support MXNetTrainer (#2143)

* add mxnet support

* add dummy data

* refactor api

* minor

* print ip

* update to the latest

* some update

* remove image classification example and update lenet

* add lenet

* fix style

* minor update

* update example

* add unit test

* style

* meet review

* style

* update

* fix resource tags

* update

* try to fix

* stop sc and ray_ctx

* add conftest to fix ut

* fix ray local test

* trial

* test

* move mxnet to package

* update conftest

* revert

* update

* [Fix issue #2150] Remove waiting in Ray init (#2151)

* remove wait

* remove waiting before and after

* Xshard pandas on ray support (#2115)

* add xshard interfaces

* add example

* update hdfs,s3 implementation

* update xshard api and example

* remote get_shards method

* update support for s3

* update pyarrow order

* add comments, docs, update example

* update example readme

* update code, add test

* update apply, read

* update text

* update by comments

* restore gitignore

* update license

* update test

* add conftest

* fix style

* fix style

* add xshard test

* update pytest

* update pytest

* Refactor RayContext (#2175)

* Add doc for MXNetTrainer and some code refactor (#2198)

* add doc and some code refactor

* minor

* update

* minor

* Change ray import folder structure (#2194)

* rebase

* revert automl readme

* Add README for MXNet LeNet example (#2208)

* initial readme

* update

* typo

* typo

* add batch size

* update style

* add validation for gluon

* fix style

* minor

* update

* minor

* more doc

* minor

* Upgrade ray to 0.8.4 (#2249)

* fix ray ip mismatch

* adjust to ray 0.8

* upgrade ray

* upgrade automl ray 0.8.4, feature ut failed

* fix test case and doc

* fix style

* fix style

* fix rllib

* update docker

* remove docker changes and change setup.py

* remove docker change

* reflect ray changes

* fix style

* fix cluster

Co-authored-by: Yu Shan <[email protected]>
Co-authored-by: Shan Yu <[email protected]>

* orca init (#2304)

* orca init

* xshard migration

* doc fix

* add license

* indent

* add csv files

* fix path

* Expose driver core in RayContext; polish ray docs (#2315)

* update doc

* style

* Orca MXNetTrainer migration (#2320)

* migrate mxnet_trainer to estimator

* fix

* newline

* indent

* fix test path

* fix

* ignore mxnet estimator test in spark2.4-

* estimator to trainer

* style

* Remove final for AbstractModule (#3001)

* Add initial version of PyTorchTrainer in orca.learn (#2349)

* add torchtrainer

* style

* meet review

* update

* fix typo

* add ut

* add test

* RayContext init return address info (#2376)

* return

* meet review

* style

* Get the current RayContext (#2411)

* initial

* update

* minor

* remove final setExtraParameters (#3014)

* deprecate nn.keras (#3013)

* deprecate nn.keras

* Move orca learn ray tests to a ray sub dir (#2577)

* move orca learn ray tests to a ray sub dir

* fix path

* fix KerasLayer new parameters() (#3034)

* Update init_spark_on_yarn (#2587)

* update

* meet review

* update doc

* style

* Minor update on Ray (#2604)

* minor update

* update

* Fix Spark configurations in RayContext (#2642)

* update

* meet review

* style

* minor

* update

* Add init_spark_standalone for local node (#2685)

* initial version

* remove

* update to local

* remove

* style

* fix path

* fix

* update pythonhome

* Add horovod estimator save/load/get_models/shutdown (#2702)

* add get_model/save/load/shutdown for pytorch horovod estimator

* change args

* remove condition

* change name

* remove get multiple models

* add pytorch estimator ut and disable for now

* meet comments

* Add Horovod tests (#2761)

* add pytorch horovod tests

* add horovod tf tests

* fix

* fix style

* fix tests

* fix tests

* fix tests

* fix tests

* fix tests

* fix tests

* Add init_orca_context (#2774)

* initial imple

* update

* meet review

* review and style

* remove stopped

* add doc

* minor

* move import

* fix mxnet

* remove

* add file lock on ray start to avoid port conflict on the same machine (#2777)

* Update UTs and examples with init_orca_context (#2787)

* update unit tests

* minor

* update

* update mxnet

* move barrier

* fix mxnet

* update

* bug fix

* update

* update test

* update mxnet example

* update mxnet

* minor

* minor

* minor

* update examples

* move ray import dependencies

* readme

* minor

* bug fix

* remove default

* fix duplicated ray worker logs (#2799)

* fix multiple ray worker logs

* refine comments

* Update raycontext.py

* Support init_spark_on_k8s (#2813)

* initial

* fix

* code refactor

* bug fix

* update docker

* style

* Support RayOnSpark for k8s and add docs (#2836)

* support ray on k8s

* add to init orca context

* style

* minor

* minor

* ut

* Hotfix for ray psutil on macOS (#2856)

* Fix psutil test fail on macOS.
* Add exception handle for running without root.

* [WIP] spark 3.0 (#3054)

* spark 3.0

* expose tcmf num_workers to zouwu (#2884)

* expose tcmf num_workers to zouwu

* fix style & add logger handler & default num_workers

* change backend to horovod

* change zouwu forecast base test case from ZooTestCase to TestCase

* split zouwu model forecast test into with and without ft

* change import

* change import

* change zouwu tests structure

* fix typo

* expose evaluate and fix num_workers after rebase

* change default value of num_workers

* Add ray rdd (#2996)

* add ray rdd

* fix style

* add more tests

* Fix orca ray pytorch example (#3007)

* fix horovod pytorch example

* fix bug

* fix process group

* fix style

* fix tests

* fix test

* fix tests

* revert ray context change

* Fix validate in Orca PyTorch Estimator (#3012)

* fix validate

* rename

* fix

* fix ut

* update

* minor

* fix ut

* squeeze target dimension (corner case) in ClassNLLCriterion (#3072)

* fix target dimension match error

* hotfix ClassNLLCriterion with cloned target (#3081)

* hotfix ClassNLLCriterion with cloned target

* Fix locale environment variables when launching Ray (#3167)

* run ray stop before ray tests started (fix ray memory test fail issue) (#3193)

* run ray stop before ray tests started

* add more ray stop

* fix ray memory issue

* change scope to class

* change mxnet scope to function

* attempt to fix ray memory (#3205)

* attempt to fix ray memory

* exclude webui

* Update doc for RayContext (#3211)

* update

* update

* Fix macOS ZombieProcess Exception (#3221)

* Fix macOS ZombieProcess
* Add ProcessLookupError catch

* add serializeUid (#3099)

* update doc (#3104)

* upgrade ray to 1.0 (#3257)

* upgrade ray to 1.0

fix automl

ray port

* fix tests

* fix bug

* fix bug

* fix tests

* fix example

* fix example

* fix tests

* change back

* comment out test

* update setup

* Add initial version of auto estimator (#3731)

* add initial version of auto estimator

* support str for torch optimizer and loss

* move create searcher

* change name to model_builder

* add best model

* add convert probability to class in metrics.Accuracy

* change condition order

* add pytorch ut

* change util name

* add LR_NAME

* add document for LR_NAME

* add check with best model and optimizer class test

* add ut for tf.keras

* move tests to orca/automl/autoestimator/

* change name

* add auto test

* fix pep8

* remove return in fit

* add raise error for fit multiple times

* remove optimizer class

* fix bug

* change error message

* read parquet dataset as tf.data.Dataset (#3956)

* Change zouwu to chronos in source codes (#4000)

* change zouwu to chronos in all codes

* change autotcn location

* change zouwu to chronos in notebook

* fix conflict

* fix conflict again

* fix bug

* manual check fix

* links in doc

* Add support for non-barrier mode to launch ray (#4014)

* add support for non-barrier mode

* fix style

* meet review

* meet review

* move barrier mode to zoocontext

* bug fix

* modify

* update

* May fix jenkins AutoEstimator randomly fail (#4164)

* change order of autoestimator test

* reduce core num and trial num

* Add assert error message when launching Ray for non-barrier mode (#4221)

* Be compatible to ray 1.5.0 (#4387)

* compatible to ray 1.5.0

* refine

* fix raycontext (#4412)

* Kill process group instead of iterator of pids in shutdown hook (#4494)

* kill process group instead of process iter

* change name

* change name

* update doc

* fix style

* change to string

* Move automl.model to orca.automl (#4667)

* delete common folder

* fix reference

* move test

* modify scripts

* Delete automl (#4680)

* rm automl

* rm test automl

* rm automl in tests

* Add ray daemon to kill ray processes (#4571)

* add ray daemon

* remove in bigdl

* add ray daemon in start_restricted_worker

* change to static method

* remove ProcessMonitor.register_shutdown_hook and clean_fn

* change name

* clean useless code

* add license

* zoo.ray -> bigdl.orca.ray; zoo.util->bigdl.dllib.utils

* zoo -> bigdl.dllib.utils.nncontext

* comment out other tests than ray

* change path in run-pytest-ray

* update test script and rename tfpark package

* fix python tfpark ut I

* turn off scala style check (#4695)

* add resources for python tfpark

* fix prepare_env

* add __init__.py in test subfolders

* remove orca/src/test

* fix scala style check and disable header check (#4715)

* uncomment partial keras ut test (#4716)

* move autograd to keras/autograd and migrate local estimator (#4721)

* bigdl keras private (#4731)

* add inferenceModelLoadOpenVINONg (#4730)

* uncomment TimeDistributed (#4736)

* move bigdl.nn to bigdl.dllib.nn

* rm bigdl keras ut test from keras

* update keras path in ut

Co-authored-by: Ian Wong <[email protected]>
Co-authored-by: tosky001 <[email protected]>
Co-authored-by: li,zhichao <[email protected]>
Co-authored-by: Kai Huang <[email protected]>
Co-authored-by: Xu Xiao <[email protected]>
Co-authored-by: Jerry Wu <[email protected]>
Co-authored-by: Quincy2014 <[email protected]>
Co-authored-by: zhangxiaoli73 <[email protected]>
Co-authored-by: Xin Qiu <[email protected]>
Co-authored-by: Yanzhang Wang <[email protected]>
Co-authored-by: Griffin Kardos <[email protected]>
Co-authored-by: megaSpoon <[email protected]>
Co-authored-by: LeicongLi <[email protected]>
Co-authored-by: yaochi <[email protected]>
Co-authored-by: Firecrackerxox <[email protected]>
Co-authored-by: majing921201 <[email protected]>
Co-authored-by: Jieru Hong <[email protected]>
Co-authored-by: Xiao <[email protected]>
Co-authored-by: Menooker <[email protected]>
Co-authored-by: Yu Shan <[email protected]>
Co-authored-by: Shane Huang <[email protected]>
Co-authored-by: Yang Wang <[email protected]>
Co-authored-by: jenniew <[email protected]>
Co-authored-by: Shan Yu <[email protected]>
Co-authored-by: Yina Chen <[email protected]>
Co-authored-by: Qiyuan Gong <[email protected]>
Co-authored-by: Junwei Deng <[email protected]>
Co-authored-by: dding3 <[email protected]>
Le-Zheng pushed a commit to Le-Zheng/analytics-zoo that referenced this pull request Sep 17, 2021
Le-Zheng pushed a commit to Le-Zheng/analytics-zoo that referenced this pull request Sep 17, 2021
Le-Zheng pushed a commit to Le-Zheng/analytics-zoo that referenced this pull request Sep 22, 2021
Le-Zheng pushed a commit to Le-Zheng/analytics-zoo that referenced this pull request Sep 22, 2021