New Features to run MobileVLM on orin #5132

Merged: 17 commits merged into ggerganov:master on Jan 31, 2024

Conversation

JidongZhang-THU (Contributor) commented Jan 26, 2024

New features:

  1. Sum_Rows:
     fix CUDA kernel overflow
     fix block shape error when nrows is too big
  2. Im2Col:
     support batch in CUDA
     support f32-to-f32 in both CPU and CUDA
  3. DepthWiseConv:
     supported via Im2Col and MulMat
  4. Pool_2d:
     support avg pooling in CUDA
  5. HardSigmoid:
     implemented in CUDA
  6. HardSwish:
     implemented in CUDA (a sketch of both activations follows below)

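For reference, HardSigmoid is min(1, max(0, (x + 3) / 6)) and HardSwish is x * HardSigmoid(x). A minimal sketch of element-wise CUDA kernels for the two activations (illustrative only; kernel and parameter names are hypothetical, not the exact kernels added in this PR):

    __global__ void hardsigmoid_f32(const float * x, float * dst, const int k) {
        const int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i >= k) return;
        // hardsigmoid(x) = min(1, max(0, (x + 3) / 6))
        dst[i] = fminf(1.0f, fmaxf(0.0f, (x[i] + 3.0f) / 6.0f));
    }

    __global__ void hardswish_f32(const float * x, float * dst, const int k) {
        const int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i >= k) return;
        // hardswish(x) = x * hardsigmoid(x)
        dst[i] = x[i] * fminf(1.0f, fmaxf(0.0f, (x[i] + 3.0f) / 6.0f));
    }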
(Several review comments on ggml-cuda.cu and ggml.c were marked as resolved.)
ggml.h:

@@ -1493,7 +1493,8 @@ extern "C" {
     int p1,
     int d0,
     int d1,
-    bool is_2D);
+    bool is_2D,
+    enum ggml_type dst_type);
ggerganov (Owner):

I'm not sure about this API.

Long term we would like all ops to be able to output a certain type, which would mean we would have to extend all ops with dst_type. Is this OK?

We already started using another pattern with ggml_mul_mat_set_prec(). It's not great, but we might want to look for consistency here.

Somewhat related: ggerganov/ggml#455
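For illustration, the two API patterns being compared (a sketch; tensor names are hypothetical, but ggml_im2col mirrors the new signature above and ggml_mul_mat_set_prec() is the existing setter precedent):

    // Pattern A (this PR): the destination type is an explicit argument
    struct ggml_tensor * cols = ggml_im2col(ctx, kernel, input,
                                            s0, s1, p0, p1, d0, d1,
                                            /*is_2D=*/true, GGML_TYPE_F16);

    // Pattern B (existing precedent): a post-hoc setter on the result tensor
    struct ggml_tensor * out = ggml_mul_mat(ctx, a, b);
    ggml_mul_mat_set_prec(out, GGML_PREC_F32);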

slaren (Collaborator):

I think this is something that we will need to deal with when we add support for a full F16 pipeline. I don't think adding a dst_type parameter to every op would be a good way to do this; a setting in ggml_context that sets the output type for all of its ops would make more sense. But as it is now, I don't think we have a better way to handle this.
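A purely hypothetical sketch of that context-level alternative (no such setter exists in ggml; this only illustrates the suggestion):

    // hypothetical: one context-wide default output type instead of a
    // dst_type parameter on every op
    ggml_set_default_dst_type(ctx, GGML_TYPE_F16);   // not a real ggml function
    struct ggml_tensor * cols = ggml_im2col(ctx, kernel, input,
                                            s0, s1, p0, p1, d0, d1, true);
    // cols would then come out as F16 without im2col taking an extra argument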

ggerganov (Owner):

For now should we merge it like this and figure out later?

slaren (Collaborator):

I think so.

slaren (Collaborator) commented Jan 29, 2024

The pool 2D test doesn't pass. This failure indicates both a correctness issue and a buffer overflow.

  POOL_2D(pool_type=1,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=1,s1=1,p0=1,p1=1): [POOL_2D] NMSE = 1.207690462 > 0.000000100 sentinel mismatch: sent_2 FAIL
  POOL_2D(pool_type=0,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=1,s1=1,p0=1,p1=1): [POOL_2D] NMSE = 0.583387657 > 0.000000100 sentinel mismatch: sent_2 FAIL

JidongZhang-THU (Contributor, author):
> The pool 2D test doesn't pass. This failure indicates both a correctness issue and a buffer overflow.
>
>   POOL_2D(pool_type=1,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=1,s1=1,p0=1,p1=1): [POOL_2D] NMSE = 1.207690462 > 0.000000100 sentinel mismatch: sent_2 FAIL
>   POOL_2D(pool_type=0,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=1,s1=1,p0=1,p1=1): [POOL_2D] NMSE = 0.583387657 > 0.000000100 sentinel mismatch: sent_2 FAIL

This is my result:

POOL_2D(pool_type=1,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=1,s1=1,p0=1,p1=1): OK
POOL_2D(pool_type=0,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=1,s1=1,p0=1,p1=1): OK

after running:

cmake .. -DCMAKE_BUILD_TYPE=Debug -DLLAMA_METAL=OFF -DLLAMA_CUBLAS=ON
cmake --build . --config Release
./bin/test-backend-ops

JidongZhang-THU (Contributor, author) commented Jan 30, 2024

> The pool 2D test doesn't pass. This failure indicates both a correctness issue and a buffer overflow.
>
>   POOL_2D(pool_type=1,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=1,s1=1,p0=1,p1=1): [POOL_2D] NMSE = 1.207690462 > 0.000000100 sentinel mismatch: sent_2 FAIL
>   POOL_2D(pool_type=0,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=1,s1=1,p0=1,p1=1): [POOL_2D] NMSE = 0.583387657 > 0.000000100 sentinel mismatch: sent_2 FAIL

I fixed a kernel bug in 1556d4c.

slaren (Collaborator) commented Jan 30, 2024

The buffer overflow is fixed, but the error is still high:

$ make clean; LLAMA_CUBLAS=1 make tests/test-backend-ops && tests/test-backend-ops -o POOL_2D -b CUDA0
I llama.cpp build info:
I UNAME_S:   Linux
I UNAME_P:   x86_64
I UNAME_M:   x86_64
I CFLAGS:    -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wdouble-promotion
I CXXFLAGS:  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi
I NVCCFLAGS:
I LDFLAGS:
I CC:        cc (Ubuntu 12.3.0-1ubuntu1~23.04) 12.3.0
I CXX:       g++ (Ubuntu 12.3.0-1ubuntu1~23.04) 12.3.0

rm -vrf *.o tests/*.o *.so *.dll benchmark-matmult common/build-info.cpp *.dot *.gcno tests/*.gcno *.gcda tests/*.gcda *.gcov tests/*.gcov lcov-report gcovr-report main quantize quantize-stats perplexity imatrix embedding vdot q8dot train-text-from-scratch convert-llama2c-to-ggml simple batched batched-bench save-load-state server gguf llama-bench libllava.a llava-cli baby-llama beam-search speculative infill tokenize benchmark-matmult parallel finetune export-lora lookahead lookup passkey tests/test-c.o tests/test-llama-grammar tests/test-grammar-parser tests/test-double-float tests/test-grad0 tests/test-opt tests/test-quantize-fns tests/test-quantize-perf tests/test-sampling tests/test-tokenizer-0-llama tests/test-tokenizer-0-falcon tests/test-tokenizer-1-llama tests/test-tokenizer-1-bpe tests/test-rope tests/test-backend-ops tests/test-autorelease
removed 'ggml-alloc.o'
removed 'ggml-backend.o'
removed 'ggml-cuda.o'
removed 'ggml-quants.o'
removed 'ggml.o'
removed 'tests/test-backend-ops'
I llama.cpp build info:
I UNAME_S:   Linux
I UNAME_P:   x86_64
I UNAME_M:   x86_64
I CFLAGS:    -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -I/usr/local/cuda/targets/aarch64-linux/include  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wdouble-promotion
I CXXFLAGS:  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -I/usr/local/cuda/targets/aarch64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi
I NVCCFLAGS: -use_fast_math --forward-unknown-to-host-compiler -arch=native -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128
I LDFLAGS:   -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib -L/usr/local/cuda/targets/aarch64-linux/lib -L/usr/lib/wsl/lib
I CC:        cc (Ubuntu 12.3.0-1ubuntu1~23.04) 12.3.0
I CXX:       g++ (Ubuntu 12.3.0-1ubuntu1~23.04) 12.3.0

cc  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -I/usr/local/cuda/targets/aarch64-linux/include  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wdouble-promotion    -c ggml.c -o ggml.o
nvcc -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -I/usr/local/cuda/targets/aarch64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -use_fast_math --forward-unknown-to-host-compiler -arch=native -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128  -Wno-pedantic -Xcompiler "-Wno-array-bounds -Wno-format-truncation -Wextra-semi" -c ggml-cuda.cu -o ggml-cuda.o
cc  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -I/usr/local/cuda/targets/aarch64-linux/include  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wdouble-promotion    -c ggml-alloc.c -o ggml-alloc.o
cc  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -I/usr/local/cuda/targets/aarch64-linux/include  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wdouble-promotion    -c ggml-backend.c -o ggml-backend.o
cc -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -I/usr/local/cuda/targets/aarch64-linux/include  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wdouble-promotion     -c ggml-quants.c -o ggml-quants.o
ggml-cuda.cu: In function ‘void ggml_cuda_op_pool2d(const ggml_tensor*, const ggml_tensor*, ggml_tensor*, const float*, const float*, float*, cudaStream_t)’:
ggml-cuda.cu:8719:26: warning: unused parameter ‘src1’ [-Wunused-parameter]
 8719 |     const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst,
      |       ~~~~~~~~~~~~~~~~~~~^~~~
ggml-cuda.cu:8720:23: warning: unused parameter ‘src1_dd’ [-Wunused-parameter]
 8720 |     const float * src0_dd, const float * src1_dd, float * dst_dd, cudaStream_t main_stream) {
      |          ~~~~~~~~~~~~~^~~~~~~
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -I/usr/local/cuda/targets/aarch64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi tests/test-backend-ops.cpp ggml.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o tests/test-backend-ops -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib -L/usr/local/cuda/targets/aarch64-linux/lib -L/usr/lib/wsl/lib
Testing 3 backends

Backend 1/3 (CPU)
  Skipping
Backend 2/3 (CUDA0)
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes
  Device 1: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes
  Backend name: CUDA0
  POOL_2D(pool_type=1,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=1,s1=1,p0=1,p1=1): [POOL_2D] NMSE = 0.811819988 > 0.000000100 FAIL
  POOL_2D(pool_type=0,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=1,s1=1,p0=1,p1=1): [POOL_2D] NMSE = 1.100436806 > 0.000000100 FAIL
  1182/1184 tests passed
  Backend CUDA0: FAIL

Backend 3/3 (CUDA1)
  Skipping
2/3 backends passed
FAIL

$ git rev-parse HEAD
1556d4ca17718417a6dad9bf73939625c2b2e7a0

The result is the same when building with cmake.

JidongZhang-THU (Contributor, author) commented Jan 30, 2024

> LLAMA_CUBLAS=1 make tests/test-backend-ops && tests/test-backend-ops -o POOL_2D -b CUDA0

Would it work if you changed the kernel like this?

-    #if __CUDA_ARCH__ >= 350
-                Ti cur = __ldg(i_ptr + i * iw + j);
-    #else
                 Ti cur = i_ptr[i * iw + j];
-    #endif
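(For context: __ldg() loads through the read-only data cache and is only available on compute capability >= 3.5; it returns the same value as a plain load, so the two lines below are functionally equivalent:)

    Ti a = __ldg(i_ptr + i * iw + j);   // load via the read-only data cache
    Ti b = i_ptr[i * iw + j];           // plain global load, same value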

JidongZhang-THU (Contributor, author):
> The buffer overflow is fixed, but the error is still high:
>
> [build log and failing POOL_2D results quoted above]
>
> The result is the same when building with cmake.

I did fix a bug in 379f89f, but it's strange that the tests passed when I ran them.

slaren (Collaborator) commented Jan 30, 2024

Max pool works now, but avg pool still fails:

  POOL_2D(pool_type=1,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=1,s1=1,p0=1,p1=1): [POOL_2D] NMSE = 0.062665811 > 0.000000100 FAIL
  POOL_2D(pool_type=0,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=1,s1=1,p0=1,p1=1): OK

JidongZhang-THU (Contributor, author) commented Jan 30, 2024

> make clean; LLAMA_CUBLAS=1 make tests/test-backend-ops && tests/test-backend-ops -o POOL_2D -b CUDA0
> I llama.cpp build info: […]

This was a count_include_pad issue; fixed in 49f09aa.
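For context, the two divisor conventions for average pooling over a padded window, as a minimal 1-D sketch (hypothetical helper; the actual fix aligns the CUDA kernel's divisor with the CPU reference):

    #include <stdbool.h>

    // average over a k-wide window at output index o (stride s, padding p)
    static float avg_pool_1d_at(const float * in, int iw, int o,
                                int k, int s, int p, bool count_include_pad) {
        float sum = 0.0f;
        int   n   = 0;                      // in-bounds elements actually summed
        for (int j = 0; j < k; ++j) {
            const int idx = o * s + j - p;  // window position mapped into the input
            if (idx >= 0 && idx < iw) {
                sum += in[idx];
                n++;
            }
        }
        // count_include_pad = true : padded positions still count toward the divisor
        // count_include_pad = false: divide only by the in-bounds count
        return count_include_pad ? sum / k : sum / (n > 0 ? n : 1);
    }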

slaren (Collaborator) commented Jan 30, 2024

The original tests pass. I have added more tests and some don't pass with F16. Here is the full result:

  POOL_2D(pool_type=avg,type_input=f16,ne_input=[10,10,3,1],k0=1,k1=1,s0=1,s1=1,p0=0,p1=0): [POOL_2D] NMSE = 0.999203570 > 0.000000100 FAIL
  POOL_2D(pool_type=avg,type_input=f16,ne_input=[10,10,3,1],k0=1,k1=1,s0=1,s1=1,p0=0,p1=1): [POOL_2D] NMSE = 0.999935250 > 0.000000100 FAIL
  POOL_2D(pool_type=avg,type_input=f16,ne_input=[10,10,3,1],k0=1,k1=1,s0=1,s1=1,p0=1,p1=0): [POOL_2D] NMSE = 1.034885951 > 0.000000100 FAIL
  POOL_2D(pool_type=avg,type_input=f16,ne_input=[10,10,3,1],k0=1,k1=1,s0=1,s1=1,p0=1,p1=1): [POOL_2D] NMSE = 1.035120687 > 0.000000100 FAIL
  POOL_2D(pool_type=avg,type_input=f16,ne_input=[10,10,3,1],k0=1,k1=1,s0=1,s1=2,p0=0,p1=0): [POOL_2D] NMSE = 1.000578880 > 0.000000100 FAIL
  POOL_2D(pool_type=avg,type_input=f16,ne_input=[10,10,3,1],k0=1,k1=1,s0=1,s1=2,p0=0,p1=1): [POOL_2D] NMSE = 1.137066784 > 0.000000100 FAIL
  POOL_2D(pool_type=avg,type_input=f16,ne_input=[10,10,3,1],k0=1,k1=1,s0=1,s1=2,p0=1,p1=0): [POOL_2D] NMSE = 1.000478261 > 0.000000100 FAIL
  POOL_2D(pool_type=avg,type_input=f16,ne_input=[10,10,3,1],k0=1,k1=1,s0=1,s1=2,p0=1,p1=1): [POOL_2D] NMSE = 1.272810572 > 0.000000100 FAIL
  POOL_2D(pool_type=avg,type_input=f16,ne_input=[10,10,3,1],k0=1,k1=1,s0=2,s1=1,p0=0,p1=0): [POOL_2D] NMSE = 1.056138950 > 0.000000100 FAIL
  POOL_2D(pool_type=avg,type_input=f16,ne_input=[10,10,3,1],k0=1,k1=1,s0=2,s1=1,p0=0,p1=1): [POOL_2D] NMSE = 0.970045638 > 0.000000100 FAIL
  POOL_2D(pool_type=avg,type_input=f16,ne_input=[10,10,3,1],k0=1,k1=1,s0=2,s1=1,p0=1,p1=0): [POOL_2D] NMSE = 1.046133138 > 0.000000100 FAIL
  POOL_2D(pool_type=avg,type_input=f16,ne_input=[10,10,3,1],k0=1,k1=1,s0=2,s1=1,p0=1,p1=1): [POOL_2D] NMSE = 1.027228429 > 0.000000100 FAIL
  POOL_2D(pool_type=avg,type_input=f16,ne_input=[10,10,3,1],k0=1,k1=1,s0=2,s1=2,p0=0,p1=0): [POOL_2D] NMSE = 1.000953057 > 0.000000100 FAIL
  POOL_2D(pool_type=avg,type_input=f16,ne_input=[10,10,3,1],k0=1,k1=1,s0=2,s1=2,p0=0,p1=1): [POOL_2D] NMSE = 1.152351934 > 0.000000100 FAIL
  POOL_2D(pool_type=avg,type_input=f16,ne_input=[10,10,3,1],k0=1,k1=1,s0=2,s1=2,p0=1,p1=0): [POOL_2D] NMSE = 0.999528377 > 0.000000100 FAIL
  POOL_2D(pool_type=avg,type_input=f16,ne_input=[10,10,3,1],k0=1,k1=1,s0=2,s1=2,p0=1,p1=1): [POOL_2D] NMSE = 0.910042767 > 0.000000100 FAIL
  POOL_2D(pool_type=avg,type_input=f16,ne_input=[10,10,3,1],k0=1,k1=3,s0=1,s1=1,p0=0,p1=0): [POOL_2D] NMSE = 1.018958447 > 0.000000100 FAIL
  POOL_2D(pool_type=avg,type_input=f16,ne_input=[10,10,3,1],k0=1,k1=3,s0=1,s1=1,p0=0,p1=1): [POOL_2D] NMSE = 1.007969632 > 0.000000100 FAIL
  POOL_2D(pool_type=avg,type_input=f16,ne_input=[10,10,3,1],k0=1,k1=3,s0=1,s1=1,p0=1,p1=0): [POOL_2D] NMSE = 1.000537444 > 0.000000100 FAIL
  POOL_2D(pool_type=avg,type_input=f16,ne_input=[10,10,3,1],k0=1,k1=3,s0=1,s1=1,p0=1,p1=1): [POOL_2D] NMSE = 0.995353507 > 0.000000100 FAIL
  POOL_2D(pool_type=avg,type_input=f16,ne_input=[10,10,3,1],k0=1,k1=3,s0=1,s1=2,p0=0,p1=0): [POOL_2D] NMSE = 1.000791375 > 0.000000100 FAIL
  POOL_2D(pool_type=avg,type_input=f16,ne_input=[10,10,3,1],k0=1,k1=3,s0=1,s1=2,p0=0,p1=1): [POOL_2D] NMSE = 1.038882169 > 0.000000100 FAIL
  POOL_2D(pool_type=avg,type_input=f16,ne_input=[10,10,3,1],k0=1,k1=3,s0=1,s1=2,p0=1,p1=0): [POOL_2D] NMSE = 0.998625353 > 0.000000100 FAIL
  POOL_2D(pool_type=avg,type_input=f16,ne_input=[10,10,3,1],k0=1,k1=3,s0=1,s1=2,p0=1,p1=1): [POOL_2D] NMSE = 0.997232468 > 0.000000100 FAIL
  POOL_2D(pool_type=avg,type_input=f16,ne_input=[10,10,3,1],k0=1,k1=3,s0=2,s1=1,p0=0,p1=0): [POOL_2D] NMSE = 1.001658055 > 0.000000100 FAIL
  POOL_2D(pool_type=avg,type_input=f16,ne_input=[10,10,3,1],k0=1,k1=3,s0=2,s1=1,p0=0,p1=1): [POOL_2D] NMSE = 0.984994554 > 0.000000100 FAIL
  POOL_2D(pool_type=avg,type_input=f16,ne_input=[10,10,3,1],k0=1,k1=3,s0=2,s1=1,p0=1,p1=0): [POOL_2D] NMSE = 0.999921076 > 0.000000100 FAIL
  POOL_2D(pool_type=avg,type_input=f16,ne_input=[10,10,3,1],k0=1,k1=3,s0=2,s1=1,p0=1,p1=1): [POOL_2D] NMSE = 1.114715196 > 0.000000100 FAIL
  POOL_2D(pool_type=avg,type_input=f16,ne_input=[10,10,3,1],k0=1,k1=3,s0=2,s1=2,p0=0,p1=0): [POOL_2D] NMSE = 0.999992255 > 0.000000100 FAIL
  POOL_2D(pool_type=avg,type_input=f16,ne_input=[10,10,3,1],k0=1,k1=3,s0=2,s1=2,p0=0,p1=1): [POOL_2D] NMSE = 0.981128804 > 0.000000100 FAIL
  POOL_2D(pool_type=avg,type_input=f16,ne_input=[10,10,3,1],k0=1,k1=3,s0=2,s1=2,p0=1,p1=0): [POOL_2D] NMSE = 1.001531619 > 0.000000100 FAIL
  POOL_2D(pool_type=avg,type_input=f16,ne_input=[10,10,3,1],k0=1,k1=3,s0=2,s1=2,p0=1,p1=1): [POOL_2D] NMSE = 1.014709895 > 0.000000100 FAIL
  POOL_2D(pool_type=avg,type_input=f16,ne_input=[10,10,3,1],k0=3,k1=1,s0=1,s1=1,p0=0,p1=0): [POOL_2D] NMSE = 1.012783650 > 0.000000100 FAIL
  POOL_2D(pool_type=avg,type_input=f16,ne_input=[10,10,3,1],k0=3,k1=1,s0=1,s1=1,p0=0,p1=1): [POOL_2D] NMSE = 1.129464706 > 0.000000100 FAIL
  POOL_2D(pool_type=avg,type_input=f16,ne_input=[10,10,3,1],k0=3,k1=1,s0=1,s1=1,p0=1,p1=0): [POOL_2D] NMSE = 0.981063991 > 0.000000100 FAIL
  POOL_2D(pool_type=avg,type_input=f16,ne_input=[10,10,3,1],k0=3,k1=1,s0=1,s1=1,p0=1,p1=1): [POOL_2D] NMSE = 1.036889803 > 0.000000100 FAIL
  POOL_2D(pool_type=avg,type_input=f16,ne_input=[10,10,3,1],k0=3,k1=1,s0=1,s1=2,p0=0,p1=0): [POOL_2D] NMSE = 0.998088056 > 0.000000100 FAIL
  POOL_2D(pool_type=avg,type_input=f16,ne_input=[10,10,3,1],k0=3,k1=1,s0=1,s1=2,p0=0,p1=1): [POOL_2D] NMSE = 1.011179116 > 0.000000100 FAIL
  POOL_2D(pool_type=avg,type_input=f16,ne_input=[10,10,3,1],k0=3,k1=1,s0=1,s1=2,p0=1,p1=0): [POOL_2D] NMSE = 1.001539971 > 0.000000100 FAIL
  POOL_2D(pool_type=avg,type_input=f16,ne_input=[10,10,3,1],k0=3,k1=1,s0=1,s1=2,p0=1,p1=1): [POOL_2D] NMSE = 1.244588730 > 0.000000100 FAIL
  POOL_2D(pool_type=avg,type_input=f16,ne_input=[10,10,3,1],k0=3,k1=1,s0=2,s1=1,p0=0,p1=0): [POOL_2D] NMSE = 0.998804437 > 0.000000100 FAIL
  POOL_2D(pool_type=avg,type_input=f16,ne_input=[10,10,3,1],k0=3,k1=1,s0=2,s1=1,p0=0,p1=1): [POOL_2D] NMSE = 1.012374516 > 0.000000100 FAIL
  POOL_2D(pool_type=avg,type_input=f16,ne_input=[10,10,3,1],k0=3,k1=1,s0=2,s1=1,p0=1,p1=0): [POOL_2D] NMSE = 1.032320652 > 0.000000100 FAIL
  POOL_2D(pool_type=avg,type_input=f16,ne_input=[10,10,3,1],k0=3,k1=1,s0=2,s1=1,p0=1,p1=1): [POOL_2D] NMSE = 1.049584847 > 0.000000100 FAIL
  POOL_2D(pool_type=avg,type_input=f16,ne_input=[10,10,3,1],k0=3,k1=1,s0=2,s1=2,p0=0,p1=0): [POOL_2D] NMSE = 1.001242671 > 0.000000100 FAIL
  POOL_2D(pool_type=avg,type_input=f16,ne_input=[10,10,3,1],k0=3,k1=1,s0=2,s1=2,p0=0,p1=1): [POOL_2D] NMSE = 1.007476776 > 0.000000100 FAIL
  POOL_2D(pool_type=avg,type_input=f16,ne_input=[10,10,3,1],k0=3,k1=1,s0=2,s1=2,p0=1,p1=0): [POOL_2D] NMSE = 0.998830986 > 0.000000100 FAIL
  POOL_2D(pool_type=avg,type_input=f16,ne_input=[10,10,3,1],k0=3,k1=1,s0=2,s1=2,p0=1,p1=1): [POOL_2D] NMSE = 1.060531142 > 0.000000100 FAIL
  POOL_2D(pool_type=avg,type_input=f16,ne_input=[10,10,3,1],k0=3,k1=3,s0=1,s1=1,p0=0,p1=0): [POOL_2D] NMSE = 0.987964582 > 0.000000100 FAIL
  POOL_2D(pool_type=avg,type_input=f16,ne_input=[10,10,3,1],k0=3,k1=3,s0=1,s1=1,p0=0,p1=1): [POOL_2D] NMSE = 1.110660686 > 0.000000100 FAIL
  POOL_2D(pool_type=avg,type_input=f16,ne_input=[10,10,3,1],k0=3,k1=3,s0=1,s1=1,p0=1,p1=0): [POOL_2D] NMSE = 0.967990393 > 0.000000100 FAIL
  POOL_2D(pool_type=avg,type_input=f16,ne_input=[10,10,3,1],k0=3,k1=3,s0=1,s1=1,p0=1,p1=1): [POOL_2D] NMSE = 1.103078426 > 0.000000100 FAIL
  POOL_2D(pool_type=avg,type_input=f16,ne_input=[10,10,3,1],k0=3,k1=3,s0=1,s1=2,p0=0,p1=0): [POOL_2D] NMSE = 1.000453049 > 0.000000100 FAIL
  POOL_2D(pool_type=avg,type_input=f16,ne_input=[10,10,3,1],k0=3,k1=3,s0=1,s1=2,p0=0,p1=1): [POOL_2D] NMSE = 1.100136834 > 0.000000100 FAIL
  POOL_2D(pool_type=avg,type_input=f16,ne_input=[10,10,3,1],k0=3,k1=3,s0=1,s1=2,p0=1,p1=0): [POOL_2D] NMSE = 0.999588599 > 0.000000100 FAIL
  POOL_2D(pool_type=avg,type_input=f16,ne_input=[10,10,3,1],k0=3,k1=3,s0=1,s1=2,p0=1,p1=1): [POOL_2D] NMSE = 0.985447313 > 0.000000100 FAIL
  POOL_2D(pool_type=avg,type_input=f16,ne_input=[10,10,3,1],k0=3,k1=3,s0=2,s1=1,p0=0,p1=0): [POOL_2D] NMSE = 0.999752246 > 0.000000100 FAIL
  POOL_2D(pool_type=avg,type_input=f16,ne_input=[10,10,3,1],k0=3,k1=3,s0=2,s1=1,p0=0,p1=1): [POOL_2D] NMSE = 1.032946381 > 0.000000100 FAIL
  POOL_2D(pool_type=avg,type_input=f16,ne_input=[10,10,3,1],k0=3,k1=3,s0=2,s1=1,p0=1,p1=0): [POOL_2D] NMSE = 0.985949820 > 0.000000100 FAIL
  POOL_2D(pool_type=avg,type_input=f16,ne_input=[10,10,3,1],k0=3,k1=3,s0=2,s1=1,p0=1,p1=1): [POOL_2D] NMSE = 0.950415594 > 0.000000100 FAIL
  POOL_2D(pool_type=avg,type_input=f16,ne_input=[10,10,3,1],k0=3,k1=3,s0=2,s1=2,p0=0,p1=0): [POOL_2D] NMSE = 1.001280259 > 0.000000100 FAIL
  POOL_2D(pool_type=avg,type_input=f16,ne_input=[10,10,3,1],k0=3,k1=3,s0=2,s1=2,p0=0,p1=1): [POOL_2D] NMSE = 1.043412381 > 0.000000100 FAIL
  POOL_2D(pool_type=avg,type_input=f16,ne_input=[10,10,3,1],k0=3,k1=3,s0=2,s1=2,p0=1,p1=0): [POOL_2D] NMSE = 0.998464873 > 0.000000100 FAIL
  POOL_2D(pool_type=avg,type_input=f16,ne_input=[10,10,3,1],k0=3,k1=3,s0=2,s1=2,p0=1,p1=1): [POOL_2D] NMSE = 1.195598004 > 0.000000100 FAIL
  POOL_2D(pool_type=max,type_input=f16,ne_input=[10,10,3,1],k0=1,k1=1,s0=1,s1=1,p0=0,p1=0): [POOL_2D] NMSE = 1.000442296 > 0.000000100 FAIL
  POOL_2D(pool_type=max,type_input=f16,ne_input=[10,10,3,1],k0=1,k1=1,s0=1,s1=1,p0=0,p1=1): OK
  POOL_2D(pool_type=max,type_input=f16,ne_input=[10,10,3,1],k0=1,k1=1,s0=1,s1=1,p0=1,p1=0): OK
  POOL_2D(pool_type=max,type_input=f16,ne_input=[10,10,3,1],k0=1,k1=1,s0=1,s1=1,p0=1,p1=1): OK
  POOL_2D(pool_type=max,type_input=f16,ne_input=[10,10,3,1],k0=1,k1=1,s0=1,s1=2,p0=0,p1=0): [POOL_2D] NMSE = 0.998886299 > 0.000000100 FAIL
  POOL_2D(pool_type=max,type_input=f16,ne_input=[10,10,3,1],k0=1,k1=1,s0=1,s1=2,p0=0,p1=1): OK
  POOL_2D(pool_type=max,type_input=f16,ne_input=[10,10,3,1],k0=1,k1=1,s0=1,s1=2,p0=1,p1=0): OK
  POOL_2D(pool_type=max,type_input=f16,ne_input=[10,10,3,1],k0=1,k1=1,s0=1,s1=2,p0=1,p1=1): OK
  POOL_2D(pool_type=max,type_input=f16,ne_input=[10,10,3,1],k0=1,k1=1,s0=2,s1=1,p0=0,p1=0): [POOL_2D] inf mismatch: CUDA0=0.765890 CPU=-340282346638528859811704183484516925440.000000 FAIL
  POOL_2D(pool_type=max,type_input=f16,ne_input=[10,10,3,1],k0=1,k1=1,s0=2,s1=1,p0=0,p1=1): OK
  POOL_2D(pool_type=max,type_input=f16,ne_input=[10,10,3,1],k0=1,k1=1,s0=2,s1=1,p0=1,p1=0): OK
  POOL_2D(pool_type=max,type_input=f16,ne_input=[10,10,3,1],k0=1,k1=1,s0=2,s1=1,p0=1,p1=1): OK
  POOL_2D(pool_type=max,type_input=f16,ne_input=[10,10,3,1],k0=1,k1=1,s0=2,s1=2,p0=0,p1=0): [POOL_2D] NMSE = 1.000165245 > 0.000000100 FAIL
  POOL_2D(pool_type=max,type_input=f16,ne_input=[10,10,3,1],k0=1,k1=1,s0=2,s1=2,p0=0,p1=1): OK
  POOL_2D(pool_type=max,type_input=f16,ne_input=[10,10,3,1],k0=1,k1=1,s0=2,s1=2,p0=1,p1=0): OK
  POOL_2D(pool_type=max,type_input=f16,ne_input=[10,10,3,1],k0=1,k1=1,s0=2,s1=2,p0=1,p1=1): OK
  POOL_2D(pool_type=max,type_input=f16,ne_input=[10,10,3,1],k0=1,k1=3,s0=1,s1=1,p0=0,p1=0): [POOL_2D] NMSE = 0.992005694 > 0.000000100 FAIL
  POOL_2D(pool_type=max,type_input=f16,ne_input=[10,10,3,1],k0=1,k1=3,s0=1,s1=1,p0=0,p1=1): [POOL_2D] NMSE = 0.980529778 > 0.000000100 FAIL
  POOL_2D(pool_type=max,type_input=f16,ne_input=[10,10,3,1],k0=1,k1=3,s0=1,s1=1,p0=1,p1=0): OK
  POOL_2D(pool_type=max,type_input=f16,ne_input=[10,10,3,1],k0=1,k1=3,s0=1,s1=1,p0=1,p1=1): OK
  POOL_2D(pool_type=max,type_input=f16,ne_input=[10,10,3,1],k0=1,k1=3,s0=1,s1=2,p0=0,p1=0): [POOL_2D] NMSE = 0.998626000 > 0.000000100 FAIL
  POOL_2D(pool_type=max,type_input=f16,ne_input=[10,10,3,1],k0=1,k1=3,s0=1,s1=2,p0=0,p1=1): [POOL_2D] NMSE = 0.990229979 > 0.000000100 FAIL
  POOL_2D(pool_type=max,type_input=f16,ne_input=[10,10,3,1],k0=1,k1=3,s0=1,s1=2,p0=1,p1=0): OK
  POOL_2D(pool_type=max,type_input=f16,ne_input=[10,10,3,1],k0=1,k1=3,s0=1,s1=2,p0=1,p1=1): OK
  POOL_2D(pool_type=max,type_input=f16,ne_input=[10,10,3,1],k0=1,k1=3,s0=2,s1=1,p0=0,p1=0): [POOL_2D] NMSE = 0.997318728 > 0.000000100 FAIL
  POOL_2D(pool_type=max,type_input=f16,ne_input=[10,10,3,1],k0=1,k1=3,s0=2,s1=1,p0=0,p1=1): [POOL_2D] NMSE = 0.998520385 > 0.000000100 FAIL
  POOL_2D(pool_type=max,type_input=f16,ne_input=[10,10,3,1],k0=1,k1=3,s0=2,s1=1,p0=1,p1=0): OK
  POOL_2D(pool_type=max,type_input=f16,ne_input=[10,10,3,1],k0=1,k1=3,s0=2,s1=1,p0=1,p1=1): OK
  POOL_2D(pool_type=max,type_input=f16,ne_input=[10,10,3,1],k0=1,k1=3,s0=2,s1=2,p0=0,p1=0): [POOL_2D] NMSE = 0.996793723 > 0.000000100 FAIL
  POOL_2D(pool_type=max,type_input=f16,ne_input=[10,10,3,1],k0=1,k1=3,s0=2,s1=2,p0=0,p1=1): [POOL_2D] NMSE = 0.999051836 > 0.000000100 FAIL
  POOL_2D(pool_type=max,type_input=f16,ne_input=[10,10,3,1],k0=1,k1=3,s0=2,s1=2,p0=1,p1=0): OK
  POOL_2D(pool_type=max,type_input=f16,ne_input=[10,10,3,1],k0=1,k1=3,s0=2,s1=2,p0=1,p1=1): OK
  POOL_2D(pool_type=max,type_input=f16,ne_input=[10,10,3,1],k0=3,k1=1,s0=1,s1=1,p0=0,p1=0): [POOL_2D] NMSE = 0.951262132 > 0.000000100 FAIL
  POOL_2D(pool_type=max,type_input=f16,ne_input=[10,10,3,1],k0=3,k1=1,s0=1,s1=1,p0=0,p1=1): OK
  POOL_2D(pool_type=max,type_input=f16,ne_input=[10,10,3,1],k0=3,k1=1,s0=1,s1=1,p0=1,p1=0): [POOL_2D] NMSE = 0.956884939 > 0.000000100 FAIL
  POOL_2D(pool_type=max,type_input=f16,ne_input=[10,10,3,1],k0=3,k1=1,s0=1,s1=1,p0=1,p1=1): OK
  POOL_2D(pool_type=max,type_input=f16,ne_input=[10,10,3,1],k0=3,k1=1,s0=1,s1=2,p0=0,p1=0): [POOL_2D] NMSE = 0.997704292 > 0.000000100 FAIL
  POOL_2D(pool_type=max,type_input=f16,ne_input=[10,10,3,1],k0=3,k1=1,s0=1,s1=2,p0=0,p1=1): OK
  POOL_2D(pool_type=max,type_input=f16,ne_input=[10,10,3,1],k0=3,k1=1,s0=1,s1=2,p0=1,p1=0): [POOL_2D] NMSE = 0.999224200 > 0.000000100 FAIL
  POOL_2D(pool_type=max,type_input=f16,ne_input=[10,10,3,1],k0=3,k1=1,s0=1,s1=2,p0=1,p1=1): OK
  POOL_2D(pool_type=max,type_input=f16,ne_input=[10,10,3,1],k0=3,k1=1,s0=2,s1=1,p0=0,p1=0): [POOL_2D] NMSE = 0.999134844 > 0.000000100 FAIL
  POOL_2D(pool_type=max,type_input=f16,ne_input=[10,10,3,1],k0=3,k1=1,s0=2,s1=1,p0=0,p1=1): OK
  POOL_2D(pool_type=max,type_input=f16,ne_input=[10,10,3,1],k0=3,k1=1,s0=2,s1=1,p0=1,p1=0): [POOL_2D] NMSE = 0.998004729 > 0.000000100 FAIL
  POOL_2D(pool_type=max,type_input=f16,ne_input=[10,10,3,1],k0=3,k1=1,s0=2,s1=1,p0=1,p1=1): OK
  POOL_2D(pool_type=max,type_input=f16,ne_input=[10,10,3,1],k0=3,k1=1,s0=2,s1=2,p0=0,p1=0): [POOL_2D] NMSE = 0.997448918 > 0.000000100 FAIL
  POOL_2D(pool_type=max,type_input=f16,ne_input=[10,10,3,1],k0=3,k1=1,s0=2,s1=2,p0=0,p1=1): OK
  POOL_2D(pool_type=max,type_input=f16,ne_input=[10,10,3,1],k0=3,k1=1,s0=2,s1=2,p0=1,p1=0): [POOL_2D] NMSE = 0.999541720 > 0.000000100 FAIL
  POOL_2D(pool_type=max,type_input=f16,ne_input=[10,10,3,1],k0=3,k1=1,s0=2,s1=2,p0=1,p1=1): OK
  POOL_2D(pool_type=max,type_input=f16,ne_input=[10,10,3,1],k0=3,k1=3,s0=1,s1=1,p0=0,p1=0): [POOL_2D] NMSE = 0.944884228 > 0.000000100 FAIL
  POOL_2D(pool_type=max,type_input=f16,ne_input=[10,10,3,1],k0=3,k1=3,s0=1,s1=1,p0=0,p1=1): [POOL_2D] NMSE = 0.926696798 > 0.000000100 FAIL
  POOL_2D(pool_type=max,type_input=f16,ne_input=[10,10,3,1],k0=3,k1=3,s0=1,s1=1,p0=1,p1=0): [POOL_2D] NMSE = 0.994287629 > 0.000000100 FAIL
  POOL_2D(pool_type=max,type_input=f16,ne_input=[10,10,3,1],k0=3,k1=3,s0=1,s1=1,p0=1,p1=1): [POOL_2D] NMSE = 0.955740948 > 0.000000100 FAIL
  POOL_2D(pool_type=max,type_input=f16,ne_input=[10,10,3,1],k0=3,k1=3,s0=1,s1=2,p0=0,p1=0): [POOL_2D] NMSE = 0.995466166 > 0.000000100 FAIL
  POOL_2D(pool_type=max,type_input=f16,ne_input=[10,10,3,1],k0=3,k1=3,s0=1,s1=2,p0=0,p1=1): [POOL_2D] NMSE = 0.952336879 > 0.000000100 FAIL
  POOL_2D(pool_type=max,type_input=f16,ne_input=[10,10,3,1],k0=3,k1=3,s0=1,s1=2,p0=1,p1=0): [POOL_2D] NMSE = 0.996944178 > 0.000000100 FAIL
  POOL_2D(pool_type=max,type_input=f16,ne_input=[10,10,3,1],k0=3,k1=3,s0=1,s1=2,p0=1,p1=1): [POOL_2D] NMSE = 0.995380036 > 0.000000100 FAIL
  POOL_2D(pool_type=max,type_input=f16,ne_input=[10,10,3,1],k0=3,k1=3,s0=2,s1=1,p0=0,p1=0): [POOL_2D] NMSE = 0.952263193 > 0.000000100 FAIL
  POOL_2D(pool_type=max,type_input=f16,ne_input=[10,10,3,1],k0=3,k1=3,s0=2,s1=1,p0=0,p1=1): [POOL_2D] NMSE = 0.991847250 > 0.000000100 FAIL
  POOL_2D(pool_type=max,type_input=f16,ne_input=[10,10,3,1],k0=3,k1=3,s0=2,s1=1,p0=1,p1=0): [POOL_2D] NMSE = 0.958395142 > 0.000000100 FAIL
  POOL_2D(pool_type=max,type_input=f16,ne_input=[10,10,3,1],k0=3,k1=3,s0=2,s1=1,p0=1,p1=1): [POOL_2D] NMSE = 0.943870832 > 0.000000100 FAIL
  POOL_2D(pool_type=max,type_input=f16,ne_input=[10,10,3,1],k0=3,k1=3,s0=2,s1=2,p0=0,p1=0): [POOL_2D] NMSE = 0.993283836 > 0.000000100 FAIL
  POOL_2D(pool_type=max,type_input=f16,ne_input=[10,10,3,1],k0=3,k1=3,s0=2,s1=2,p0=0,p1=1): [POOL_2D] NMSE = 0.957167099 > 0.000000100 FAIL
  POOL_2D(pool_type=max,type_input=f16,ne_input=[10,10,3,1],k0=3,k1=3,s0=2,s1=2,p0=1,p1=0): [POOL_2D] NMSE = 0.995173773 > 0.000000100 FAIL
  POOL_2D(pool_type=max,type_input=f16,ne_input=[10,10,3,1],k0=3,k1=3,s0=2,s1=2,p0=1,p1=1): [POOL_2D] NMSE = 0.995009793 > 0.000000100 FAIL
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=1,s1=1,p0=0,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=1,s1=1,p0=0,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=1,s1=1,p0=1,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=1,s1=1,p0=1,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=1,s1=2,p0=0,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=1,s1=2,p0=0,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=1,s1=2,p0=1,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=1,s1=2,p0=1,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=2,s1=1,p0=0,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=2,s1=1,p0=0,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=2,s1=1,p0=1,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=2,s1=1,p0=1,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=2,s1=2,p0=0,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=2,s1=2,p0=0,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=2,s1=2,p0=1,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=2,s1=2,p0=1,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=1,s1=1,p0=0,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=1,s1=1,p0=0,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=1,s1=1,p0=1,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=1,s1=1,p0=1,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=1,s1=2,p0=0,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=1,s1=2,p0=0,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=1,s1=2,p0=1,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=1,s1=2,p0=1,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=2,s1=1,p0=0,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=2,s1=1,p0=0,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=2,s1=1,p0=1,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=2,s1=1,p0=1,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=2,s1=2,p0=0,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=2,s1=2,p0=0,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=2,s1=2,p0=1,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=2,s1=2,p0=1,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=1,s1=1,p0=0,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=1,s1=1,p0=0,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=1,s1=1,p0=1,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=1,s1=1,p0=1,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=1,s1=2,p0=0,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=1,s1=2,p0=0,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=1,s1=2,p0=1,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=1,s1=2,p0=1,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=2,s1=1,p0=0,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=2,s1=1,p0=0,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=2,s1=1,p0=1,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=2,s1=1,p0=1,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=2,s1=2,p0=0,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=2,s1=2,p0=0,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=2,s1=2,p0=1,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=2,s1=2,p0=1,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=1,s1=1,p0=0,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=1,s1=1,p0=0,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=1,s1=1,p0=1,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=1,s1=1,p0=1,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=1,s1=2,p0=0,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=1,s1=2,p0=0,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=1,s1=2,p0=1,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=1,s1=2,p0=1,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=2,s1=1,p0=0,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=2,s1=1,p0=0,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=2,s1=1,p0=1,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=2,s1=1,p0=1,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=2,s1=2,p0=0,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=2,s1=2,p0=0,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=2,s1=2,p0=1,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=2,s1=2,p0=1,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=1,s1=1,p0=0,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=1,s1=1,p0=0,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=1,s1=1,p0=1,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=1,s1=1,p0=1,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=1,s1=2,p0=0,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=1,s1=2,p0=0,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=1,s1=2,p0=1,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=1,s1=2,p0=1,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=2,s1=1,p0=0,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=2,s1=1,p0=0,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=2,s1=1,p0=1,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=2,s1=1,p0=1,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=2,s1=2,p0=0,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=2,s1=2,p0=0,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=2,s1=2,p0=1,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=2,s1=2,p0=1,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=1,s1=1,p0=0,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=1,s1=1,p0=0,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=1,s1=1,p0=1,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=1,s1=1,p0=1,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=1,s1=2,p0=0,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=1,s1=2,p0=0,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=1,s1=2,p0=1,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=1,s1=2,p0=1,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=2,s1=1,p0=0,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=2,s1=1,p0=0,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=2,s1=1,p0=1,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=2,s1=1,p0=1,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=2,s1=2,p0=0,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=2,s1=2,p0=0,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=2,s1=2,p0=1,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=2,s1=2,p0=1,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=1,s1=1,p0=0,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=1,s1=1,p0=0,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=1,s1=1,p0=1,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=1,s1=1,p0=1,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=1,s1=2,p0=0,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=1,s1=2,p0=0,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=1,s1=2,p0=1,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=1,s1=2,p0=1,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=2,s1=1,p0=0,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=2,s1=1,p0=0,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=2,s1=1,p0=1,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=2,s1=1,p0=1,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=2,s1=2,p0=0,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=2,s1=2,p0=0,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=2,s1=2,p0=1,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=2,s1=2,p0=1,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=1,s1=1,p0=0,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=1,s1=1,p0=0,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=1,s1=1,p0=1,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=1,s1=1,p0=1,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=1,s1=2,p0=0,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=1,s1=2,p0=0,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=1,s1=2,p0=1,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=1,s1=2,p0=1,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=2,s1=1,p0=0,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=2,s1=1,p0=0,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=2,s1=1,p0=1,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=2,s1=1,p0=1,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=2,s1=2,p0=0,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=2,s1=2,p0=0,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=2,s1=2,p0=1,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=2,s1=2,p0=1,p1=1): OK

JidongZhang-THU (Contributor, author):
> The original tests pass. I have added more tests and some don't pass with F16.

Only FP32 is supported for now. ggml_compute_forward_pool_2d asserts src->type == GGML_TYPE_F32 in ggml.c, so why test FP16?

slaren (Collaborator) commented Jan 30, 2024

I didn't hit any asserts while testing F16, so I assume that it works. I see now that there are asserts in ggml_compute_forward_pool_2d, but they are only enabled in debug builds. These should be changed to GGML_ASSERT.
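(Standard assert() is compiled out by -DNDEBUG, which the release build flags above include, while GGML_ASSERT aborts in every build type. A minimal sketch of the change:)

    // before: only checked in debug builds, since -DNDEBUG disables assert()
    assert(src->type == GGML_TYPE_F32);

    // after: checked in release builds too
    GGML_ASSERT(src->type == GGML_TYPE_F32);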

ggml.h (review comment on lines 1603 to 1604):

    int p0,
    int p1);
slaren (Collaborator):

I think there was a reason for these to be float. @ggerganov


JidongZhang-THU (Contributor, author):

Fell back to float padding in 66dd123.

JidongZhang-THU (Contributor, author):

> I didn't hit any asserts while testing F16, so I assume that it works. I see now that there are asserts in ggml_compute_forward_pool_2d, but they are only enabled in debug builds. These should be changed to GGML_ASSERT.

Added the assert in ca4ec6d.

ggerganov merged commit 1560630 into ggerganov:master on Jan 31, 2024, with 46 of 51 checks passed.
ggerganov added a commit that referenced this pull request Jan 31, 2024
jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Feb 3, 2024
* New Feature: (the full feature list from the PR description above)

* fix tabs instead of spaces

* code clean

* CUDA POOL2D

* ADD POOL2D test case in test-backend-ops.cpp

* code clean

* fix pool2d_kernel

nits

* fix bug in pool2d kernel

* fix avg pooling, count_include_pad

nits

* test-backend-ops : add more pool_2d tests

* cuda : fix warnings and formatting

* ggml : check types in release builds too in pool_2d

* test-backend-ops : remove f16 pool_2d tests

* cuda : more style fixes

* Add assert in ggml_cuda_op_pool2d

* pool2d float padding fallback

* test-backend-ops : add dst_type to im2col

---------

Co-authored-by: slaren <[email protected]>
jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Feb 3, 2024
hodlen pushed two commits to hodlen/llama.cpp that referenced this pull request Apr 1, 2024