[Wait for #2568] [ Tensor ] add is_NaN check in Tensor @open sesame 05/10 14:17 #2574

jijoongmoon · 2024-05-08T10:39:20Z

In this PR

This PR adds the is_NaN function to check whether a tensor contains a NaN value.
It is used to detect NaN values during mixed precision training.
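
For illustration only, the idea behind such a check can be sketched in plain C++. This is a minimal scalar sketch, not the code in this PR: the name `contains_nan` and its signature are hypothetical, and the PR's own `has_nan` in `blas_interface.cpp` takes different arguments.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper (not the PR's has_nan): returns true if any of the
// first n elements of data is NaN.
inline bool contains_nan(const float *data, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    // NaN is the only floating-point value for which x != x holds.
    // std::isnan(data[i]) is an equivalent, more explicit spelling.
    if (data[i] != data[i])
      return true;
  }
  return false;
}
```

Note that infinities compare equal to themselves, so this sketch reports NaN only. The `x != x` trick also relies on IEEE 754 comparison semantics; options such as GCC's `-ffast-math` let the compiler assume NaN never occurs, which can remove the check.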

Self evaluation:

Build test: [X]Passed [ ]Failed [ ]Skipped
Run test: [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: jijoong.moon [email protected]

We will add a Var32 tensor if the variable weight is not full precision (FP32). This enables the weight update with full precision, and only the apply-gradient process uses this tensor. Therefore, the lifespan of this tensor should be "ApplyGradient".

. Modify TensorPool to generate weights considering mixed precision.

**Self evaluation:**

1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: jijoong.moon <[email protected]>
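
To see why a full-precision copy helps, here is a small sketch under stated assumptions. It does not use nntrainer's Weight or Tensor classes: `MixedWeight`, `apply_gradient`, and `round_to_fp16_precision` are hypothetical names, and FP16 is only emulated by rounding a `float` to 11 significand bits (range limits and subnormals are ignored).

```cpp
#include <cmath>

// Emulates rounding a float to FP16 precision (11 significand bits).
// The range limits and subnormals of real FP16 are deliberately ignored.
inline float round_to_fp16_precision(float x) {
  if (x == 0.0f || !std::isfinite(x))
    return x;
  int exponent = 0;
  // x = mantissa * 2^exponent with 0.5 <= |mantissa| < 1
  float mantissa = std::frexp(x, &exponent);
  float rounded = std::nearbyint(mantissa * 2048.0f) / 2048.0f;
  return std::ldexp(rounded, exponent);
}

// Hypothetical weight with a low-precision copy and an FP32 master copy.
struct MixedWeight {
  float var;   // stands in for the FP16 variable used in forward/backward
  float var32; // full-precision copy, touched only when applying gradients
};

// Apply one SGD step on the FP32 copy, then refresh the low-precision copy.
inline void apply_gradient(MixedWeight &w, float grad, float lr) {
  w.var32 -= lr * grad;
  w.var = round_to_fp16_precision(w.var32);
}

// The same step done directly in (emulated) FP16, for comparison.
inline float apply_gradient_low_precision(float var, float grad, float lr) {
  return round_to_fp16_precision(var - lr * grad);
}
```

With a weight of 1.0, a gradient of 1.0, and a learning rate of 1e-4, each low-precision step rounds back to 1.0 because the update is smaller than half the FP16 spacing just below 1.0, so the weight never moves. The FP32 copy accumulates the same steps and the weight drifts toward 0.99 as expected.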

This PR creates the variable FP32 tensor when we create the Weight and the Optimizer Weight.

. Update the manager to create the Weight with the var32 tensor, which is requested from the weight pool.
. Update the weight requests with the Weight Spec and the var, grad, and var32 tensors that were already created.
. Add a Tensor clone with a specific type in tensor.h.

Resolves:

**Self evaluation:**

1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: jijoong.moon <[email protected]>

This PR enables FP16 support for the layers below:

. input layer
. mse loss layer

Resolves:

**Self evaluation:**

1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: jijoong.moon <[email protected]>

This PR includes the mixed precision test case.

. Input - FC - MSE : "batch_size=2", "model_tensor_type=FP16-FP16", "loss_scale=128"

**Self evaluation:**

1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: jijoong.moon <[email protected]>
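
As a rough sketch of how a loss scale such as 128 and a NaN check usually fit together: gradients computed from the scaled loss are divided by the scale before the update, and the step is skipped if they are invalid. The helper name `unscale_gradients` is hypothetical, and rejecting infinities as well as NaN is an assumption that goes beyond what this PR describes.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: unscale gradients that were computed from a loss
// multiplied by loss_scale. Returns false and leaves grads untouched if any
// gradient is NaN or infinite, so the caller can skip this update step.
inline bool unscale_gradients(std::vector<float> &grads, float loss_scale) {
  for (float g : grads) {
    if (!std::isfinite(g)) // assumption: treat inf like NaN
      return false;
  }
  for (float &g : grads)
    g /= loss_scale;
  return true;
}
```

A caller would apply the optimizer step only when the function returns true. What to do on false (skip the step, lower the scale, or abort) is a policy decision that this sketch leaves open.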

taos-ci · 2024-05-08T10:39:23Z

📝 TAOS-CI Version: 1.5.20200925. Thank you for submitting PR #2574. Please follow the 1commit/1PR (one commit per PR) policy to get comments quickly from reviewers. Your PR must pass all verification processes of cibot before reviewers start the review process. If you are a new member of this project, please read the manuals in the documentation folder and the wiki page. To monitor the progress of your PR in more detail, visit http://ci.nnstreamer.ai/.

taos-ci · 2024-05-08T10:49:58Z

cibot: @jijoongmoon, A builder checker could not be completed because one of the checkers is not completed. In order to find out a reason, please go to http://ci.nnstreamer.ai/nntrainer/ci/repo-workers/pr-checker/2574-202405081939230.98045897483826-c26fdde8fd852939e23804ed95904be398fd97e4/.

taos-ci · 2024-05-08T12:00:44Z

cibot: @jijoongmoon, A builder checker could not be completed because one of the checkers is not completed. In order to find out a reason, please go to http://ci.nnstreamer.ai/nntrainer/ci/repo-workers/pr-checker/2574-202405082050250.96678900718689-4ecc13b19fc3fc3a9a17591d5cdb4a3abd6f4df1/.

taos-ci · 2024-05-08T13:06:47Z

cibot: @jijoongmoon, A builder checker could not be completed because one of the checkers is not completed. In order to find out a reason, please go to http://ci.nnstreamer.ai/nntrainer/ci/repo-workers/pr-checker/2574-202405082151470.54591298103333-ce40657c7b1ae15cfc71f9055823a8969ac60727/.

taos-ci · 2024-05-08T14:00:34Z

cibot: @jijoongmoon, A builder checker could not be completed because one of the checkers is not completed. In order to find out a reason, please go to http://ci.nnstreamer.ai/nntrainer/ci/repo-workers/pr-checker/2574-202405082246480.76972699165344-4d7e908bce0b6fd5d08b64f166e67ef68d6f4354/.

skykongkong8 · 2024-05-09T00:10:26Z

nntrainer/tensor/blas_interface.cpp

@@ -1090,4 +1090,37 @@ void ele_div(const unsigned int N, const float *X, const float *Y, float *Z,
    ele_div_fallback(N, X, Y, Z, alpha, beta, i_stride, o_stride);
 }

+bool has_nan(const size_t N, ml::train::TensorDim::DataType d_type,


I have not checked the other wait-for PRs yet, but the commit in THIS PR looks fine.

This commit modifies apply gradient in the optimizer. We do not need to save the optimizer variables in the weight type. Only the optimizer needs the optimizer variables, and we should update the weight with full precision to maintain accuracy. Therefore, remove the var32 tensors for the optimizer variables.

Resolves:

**Self evaluation:**

1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: jijoong.moon <[email protected]>

taos-ci · 2024-05-09T06:21:34Z

cibot: @jijoongmoon, A builder checker could not be completed because one of the checkers is not completed. In order to find out a reason, please go to http://ci.nnstreamer.ai/nntrainer/ci/repo-workers/pr-checker/2574-202405091505060.75402593612671-5d60df8de64131bb06587f7ae54df5dba46019c6/.

taos-ci · 2024-05-10T05:31:55Z

cibot: @jijoongmoon, A builder checker could not be completed because one of the checkers is not completed. In order to find out a reason, please go to http://ci.nnstreamer.ai/nntrainer/ci/repo-workers/pr-checker/2574-202405101417540.56086111068726-5d60df8de64131bb06587f7ae54df5dba46019c6/.

taos-ci · 2024-05-10T06:54:40Z

cibot: @jijoongmoon, A builder checker could not be completed because one of the checkers is not completed. In order to find out a reason, please go to http://ci.nnstreamer.ai/nntrainer/ci/repo-workers/pr-checker/2574-202405101527120.35327792167664-1c1f3432fbaf988d7167f2f41b6790dd5b832344/.

taos-ci · 2024-05-10T07:54:11Z

cibot: @jijoongmoon, A builder checker could not be completed because one of the checkers is not completed. In order to find out a reason, please go to http://ci.nnstreamer.ai/nntrainer/ci/repo-workers/pr-checker/2574-202405101622190.99789905548096-34f1ba77efdc7bdb5c1ea6b1bb4da4fecec028ab/.

This PR adds the is_NaN function to check whether a tensor contains a NaN value. It is used to detect NaN values during mixed precision training.

**Self evaluation:**

1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: jijoong.moon <[email protected]>

taos-ci

@jijoongmoon, 💯 All CI checkers are successfully verified. Thanks.

djeong20 · 2024-05-22T02:05:09Z

meson.build

+if get_option('enable-avx')
+   extra_defines += '-DUSE_AVX=1'
+   if get_option('platform') == 'tizen'
+      add_project_arguments(['-mavx2'], language: ['c','cpp'])


Would the Tizen platform always support AVX2 instructions?

djeong20 · 2024-05-22T02:05:27Z

nntrainer/tensor/blas_avx.cpp

+  int temp = 0;
+  size_t idx = 0;
+
+  // 16 single-precision check : ( X != X )


Suggested change

-  // 16 single-precision check : ( X != X )
+  // 16 half-precision check : ( X != X )

djeong20 · 2024-05-22T02:05:40Z

nntrainer/tensor/blas_avx.cpp

+      return true;
+  }
+
+  // 8 single-precision check : ( X != X )


Suggested change

-  // 8 single-precision check : ( X != X )
+  // 8 half-precision check : ( X != X )
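
As background for the half-precision wording, a NaN in IEEE 754 binary16 can also be recognized from its raw bits: the 5 exponent bits are all ones and the 10 mantissa bits are not all zero. The sketch below is a portable scalar illustration with hypothetical helper names. It is not the AVX code under review, which compares values with `X != X`.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper working on a raw IEEE 754 binary16 bit pattern.
// Layout: 1 sign bit, 5 exponent bits (0x7C00), 10 mantissa bits (0x03FF).
inline bool fp16_bits_is_nan(std::uint16_t bits) {
  const std::uint16_t exponent = bits & 0x7C00u;
  const std::uint16_t mantissa = bits & 0x03FFu;
  // All-ones exponent with a zero mantissa is infinity, not NaN.
  return exponent == 0x7C00u && mantissa != 0;
}

// Returns true if any of the first n bit patterns encodes a NaN.
inline bool fp16_bits_contain_nan(const std::uint16_t *data, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    if (fp16_bits_is_nan(data[i]))
      return true;
  }
  return false;
}
```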

djeong20 · 2024-05-22T02:06:30Z

packaging/nntrainer.spec

+%define avx_support -Denable-avx=true
+%else
+%define avx_support -Denable-avx=false
+%endif # arch aarch64


Suggested change

-%endif # arch aarch64
+%endif # arch x86_64

jijoongmoon · 2024-11-11T07:06:27Z

closed by #2663

jijoongmoon added 4 commits May 7, 2024 13:38

jijoongmoon requested review from myungjoo, again4you, jaeyun-jung, leemgs, wooksong, helloahn, kparichay, gichan-jang, anyj0527, zhoonit, lhs8928, songgot, jihochu, DonghakPark, SeoHyungjun, baek2sm, skykongkong8, djeong20, EunjuYang and a team as code owners May 8, 2024 10:39

github-actions bot added the Need Review label May 8, 2024

jijoongmoon force-pushed the is_nan branch from c26fdde to 4ecc13b Compare May 8, 2024 11:50

jijoongmoon force-pushed the is_nan branch from 3a6ff34 to ce40657 Compare May 8, 2024 12:51

jijoongmoon force-pushed the is_nan branch from ce40657 to 4d7e908 Compare May 8, 2024 13:46

skykongkong8 approved these changes May 9, 2024

jijoongmoon force-pushed the is_nan branch from 4d7e908 to 5d60df8 Compare May 9, 2024 06:05

jijoongmoon changed the title ~~[Wait for #2568] [ Tensor ] add is_NaN check in Tensor~~ [Wait for #2568] [ Tensor ] add is_NaN check in Tensor @open sesame 05/10 14:17 May 10, 2024

jijoongmoon force-pushed the is_nan branch from 5d60df8 to 1c1f343 Compare May 10, 2024 06:27

jijoongmoon force-pushed the is_nan branch from 1c1f343 to 34f1ba7 Compare May 10, 2024 07:22

jijoongmoon force-pushed the is_nan branch 2 times, most recently from b1dd77e to 366c357 Compare May 10, 2024 08:16

jijoongmoon force-pushed the is_nan branch from 366c357 to 59b7c2e Compare May 10, 2024 08:41

taos-ci approved these changes May 10, 2024

djeong20 reviewed May 22, 2024

DonghakPark mentioned this pull request Oct 30, 2024

[Wait for #2615] Enable Mixed Precision Training in NNTrainer @open sesame 11/09 15:18 #2663

Merged

jijoongmoon closed this Nov 11, 2024
