[Wait for #2568] [ Tensor ] add is_NaN check in Tensor @open sesame 05/10 14:17 #2574

Closed
wants to merge 6 commits into from

jijoongmoon
Collaborator

In this PR

This PR adds the is_NaN function to check whether a tensor contains a NaN
value. The check is used to detect NaN during mixed-precision training.
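
As a rough illustration of the kind of check described above (not the code in this PR), NaN is the only floating-point value that compares unequal to itself, so a scalar scan can test `x != x`. The name `has_nan_sketch` and its signature are hypothetical and chosen only for this sketch.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical scalar fallback; the name and signature are illustrative only
// and are not taken from the PR.
// NaN is the only floating-point value for which (x != x) is true.
bool has_nan_sketch(const float *data, std::size_t n) {
  assert(data != nullptr || n == 0);
  for (std::size_t i = 0; i < n; ++i) {
    if (data[i] != data[i])
      return true; // stop at the first NaN
  }
  return false;
}
```

Builds that enable `-ffast-math` may let the compiler assume NaN never occurs and drop the comparison, so a check like this generally needs that flag off for the file that contains it.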

Self evaluation:

  1. Build test: [X]Passed [ ]Failed [ ]Skipped
  2. Run test: [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: jijoong.moon [email protected]

We will add a Var32 Tensor if the Variable Weight is not Full
precision (FP32). This enables the Weight Update with full precision,
and only the Apply Gradient Process uses this Tensor. Therefore, the
lifespan of this tensor should be "ApplyGradient".

. Modify TensorPool to generate Weight considering Mixed Precision.
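
The motivation for a full-precision copy can be sketched as follows. This is not nntrainer code: `truncate_to_fp16_precision`, `MixedWeight` and both update functions are hypothetical, and the cast only mimics FP16 mantissa precision (10 bits) while ignoring its narrower exponent range. The sketch shows that small updates applied directly to a low-precision weight can be lost, while updates accumulated in an FP32 copy survive.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for an FP16 cast: keeps only the top 10 mantissa
// bits of a float (FP16 precision) and ignores FP16's exponent range.
float truncate_to_fp16_precision(float x) {
  std::uint32_t bits;
  std::memcpy(&bits, &x, sizeof(bits));
  bits &= 0xFFFFE000u; // clear the low 13 of the 23 mantissa bits
  std::memcpy(&x, &bits, sizeof(x));
  return x;
}

// Update applied directly to the low-precision weight.
float update_low_precision_only(float w_low, float lr, float grad, int steps) {
  assert(steps >= 0);
  for (int i = 0; i < steps; ++i)
    w_low = truncate_to_fp16_precision(w_low - lr * grad);
  return w_low;
}

// Weight with a full-precision copy, in the spirit of the "var32" tensor.
struct MixedWeight {
  float var32; // full-precision copy, touched only while applying gradients
  float var;   // low-precision working copy
};

// Update applied to the FP32 copy; the low-precision copy is refreshed
// from it after every step.
MixedWeight update_with_master_copy(MixedWeight w, float lr, float grad,
                                    int steps) {
  assert(steps >= 0);
  for (int i = 0; i < steps; ++i) {
    w.var32 -= lr * grad;
    w.var = truncate_to_fp16_precision(w.var32);
  }
  return w;
}
```

With a weight of 1.0, a learning rate of 1e-4 and a gradient of -1, the low-precision-only path never leaves 1.0 because each step is smaller than the representable spacing, whereas the FP32 copy accumulates the steps and the low-precision view eventually moves.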

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: jijoong.moon <[email protected]>
This PR creates the variable FP32 tensor when we create the Weight and
Optimizer Weight.

. update the manager to create the Weight with the var32 tensor, which is
requested from the weight pool.
. update the weight requests with the Weight Spec and the var, grad and var32
tensors that have already been created.
. add a Tensor clone with a specific type in tensor.h

Resolves:

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: jijoong.moon <[email protected]>
This PR enables the FP16 support for the layers below:

. input layer
. mse loss layer

Resolves:

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: jijoong.moon <[email protected]>
This PR includes the mixed precision test case.

. Input - FC - MSE
 : "batch_size=2", "model_tensor_type=FP16-FP16", "loss_scale=128"
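
How nntrainer consumes `loss_scale` internally is not shown in this thread. The sketch below follows the usual mixed-precision recipe under that assumption: gradients arrive multiplied by the loss scale, they are checked for non-finite values, and they are divided by the scale before the update. The function name `apply_scaled_gradient` and its parameters are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper, not an nntrainer API.
// Returns true if the step was applied, false if it was skipped because a
// scaled gradient was NaN or Inf.
bool apply_scaled_gradient(std::vector<float> &weights,
                           const std::vector<float> &scaled_grad,
                           float loss_scale, float lr) {
  assert(weights.size() == scaled_grad.size());
  assert(loss_scale > 0.0f);

  // Check every gradient before touching any weight, so a skipped step
  // leaves the weights unchanged.
  for (float g : scaled_grad) {
    if (!std::isfinite(g))
      return false;
  }

  // Undo the loss scaling, then apply a plain SGD step.
  for (std::size_t i = 0; i < weights.size(); ++i)
    weights[i] -= lr * (scaled_grad[i] / loss_scale);
  return true;
}
```

With `loss_scale = 128`, a scaled gradient of 128 corresponds to a true gradient of 1, so the step size seen by the weights is the same as it would be without scaling.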

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: jijoong.moon <[email protected]>
@taos-ci
Collaborator

taos-ci commented May 8, 2024

📝 TAOS-CI Version: 1.5.20200925. Thank you for submitting PR #2574. Please follow the 1commit/1PR (one commit per PR) policy to get comments quickly from reviewers. Your PR must pass all verification processes of cibot before the review process by reviewers starts. If you are a new member of this project, please read the manuals in the documentation folder and the wiki page. To monitor the progress of your PR in more detail, visit http://ci.nnstreamer.ai/.

@taos-ci
Collaborator

taos-ci commented May 8, 2024

:octocat: cibot: @jijoongmoon, A builder checker could not be completed because one of the checkers is not completed. In order to find out a reason, please go to http://ci.nnstreamer.ai/nntrainer/ci/repo-workers/pr-checker/2574-202405081939230.98045897483826-c26fdde8fd852939e23804ed95904be398fd97e4/.

@taos-ci
Collaborator

taos-ci commented May 8, 2024

:octocat: cibot: @jijoongmoon, A builder checker could not be completed because one of the checkers is not completed. In order to find out a reason, please go to http://ci.nnstreamer.ai/nntrainer/ci/repo-workers/pr-checker/2574-202405082050250.96678900718689-4ecc13b19fc3fc3a9a17591d5cdb4a3abd6f4df1/.

@taos-ci
Collaborator

taos-ci commented May 8, 2024

:octocat: cibot: @jijoongmoon, A builder checker could not be completed because one of the checkers is not completed. In order to find out a reason, please go to http://ci.nnstreamer.ai/nntrainer/ci/repo-workers/pr-checker/2574-202405082151470.54591298103333-ce40657c7b1ae15cfc71f9055823a8969ac60727/.

@taos-ci
Collaborator

taos-ci commented May 8, 2024

:octocat: cibot: @jijoongmoon, A builder checker could not be completed because one of the checkers is not completed. In order to find out a reason, please go to http://ci.nnstreamer.ai/nntrainer/ci/repo-workers/pr-checker/2574-202405082246480.76972699165344-4d7e908bce0b6fd5d08b64f166e67ef68d6f4354/.

@@ -1090,4 +1090,37 @@ void ele_div(const unsigned int N, const float *X, const float *Y, float *Z,
ele_div_fallback(N, X, Y, Z, alpha, beta, i_stride, o_stride);
}

bool has_nan(const size_t N, ml::train::TensorDim::DataType d_type,
Member

I have not checked the other wait-for PRs yet, but the commit in THIS PR looks fine.

This commit modifies apply gradient in the optimizer.
We do not need to save optimizer variables in the weight type. Only the
Optimizer needs the optimizer variables, and we should update the
weight with full precision to maintain accuracy. Therefore,
remove the var32 tensors for optimizer variables.

Resolves:

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: jijoong.moon <[email protected]>
@taos-ci
Collaborator

taos-ci commented May 9, 2024

:octocat: cibot: @jijoongmoon, A builder checker could not be completed because one of the checkers is not completed. In order to find out a reason, please go to http://ci.nnstreamer.ai/nntrainer/ci/repo-workers/pr-checker/2574-202405091505060.75402593612671-5d60df8de64131bb06587f7ae54df5dba46019c6/.

@jijoongmoon jijoongmoon changed the title from "[Wait for #2568] [ Tensor ] add is_NaN check in Tensor" to "[Wait for #2568] [ Tensor ] add is_NaN check in Tensor @open sesame 05/10 14:17" on May 10, 2024
@taos-ci
Collaborator

taos-ci commented May 10, 2024

:octocat: cibot: @jijoongmoon, A builder checker could not be completed because one of the checkers is not completed. In order to find out a reason, please go to http://ci.nnstreamer.ai/nntrainer/ci/repo-workers/pr-checker/2574-202405101417540.56086111068726-5d60df8de64131bb06587f7ae54df5dba46019c6/.

@taos-ci
Collaborator

taos-ci commented May 10, 2024

:octocat: cibot: @jijoongmoon, A builder checker could not be completed because one of the checkers is not completed. In order to find out a reason, please go to http://ci.nnstreamer.ai/nntrainer/ci/repo-workers/pr-checker/2574-202405101527120.35327792167664-1c1f3432fbaf988d7167f2f41b6790dd5b832344/.

@taos-ci
Collaborator

taos-ci commented May 10, 2024

:octocat: cibot: @jijoongmoon, A builder checker could not be completed because one of the checkers is not completed. In order to find out a reason, please go to http://ci.nnstreamer.ai/nntrainer/ci/repo-workers/pr-checker/2574-202405101622190.99789905548096-34f1ba77efdc7bdb5c1ea6b1bb4da4fecec028ab/.

@jijoongmoon jijoongmoon force-pushed the is_nan branch 2 times, most recently from b1dd77e to 366c357 Compare May 10, 2024 08:16
This PR adds the is_NaN function to check whether a tensor contains a NaN
value. The check is used to detect NaN during mixed-precision training.

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: jijoong.moon <[email protected]>
Collaborator

@taos-ci taos-ci left a comment

@jijoongmoon, 💯 All CI checkers are successfully verified. Thanks.

if get_option('enable-avx')
extra_defines += '-DUSE_AVX=1'
if get_option('platform') == 'tizen'
add_project_arguments(['-mavx2'], language: ['c','cpp'])
Contributor

would the Tizen platform always support AVX2 instructions?

int temp = 0;
size_t idx = 0;

// 16 single-precision check : ( X != X )
Contributor

Suggested change
- // 16 single-precision check : ( X != X )
+ // 16 half-precision check : ( X != X )
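
For readers unfamiliar with the `( X != X )` idiom in vector form, here is a minimal single-precision sketch using AVX intrinsics. It is not the code under review: `has_nan_avx_sketch` is a hypothetical name, the half-precision variant discussed in this thread is not reproduced, and the function falls back to a scalar loop when the compiler is not targeting AVX.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Hypothetical sketch: checks 8 single-precision values per iteration when
// AVX is available and handles the remainder with a scalar loop.
bool has_nan_avx_sketch(const float *x, std::size_t n) {
  assert(x != nullptr || n == 0);
  std::size_t i = 0;
#if defined(__AVX__)
  for (; i + 8 <= n; i += 8) {
    __m256 v = _mm256_loadu_ps(x + i); // unaligned load of 8 floats
    // _CMP_NEQ_UQ is "not equal, unordered, quiet": a lane compared with
    // itself is true only if that lane holds NaN.
    __m256 ne = _mm256_cmp_ps(v, v, _CMP_NEQ_UQ);
    if (_mm256_movemask_ps(ne) != 0)
      return true;
  }
#endif
  // Scalar tail (or the whole array when AVX is not enabled).
  for (; i < n; ++i) {
    if (x[i] != x[i])
      return true;
  }
  return false;
}
```

A 256-bit register holds 8 single-precision or 16 half-precision values, which is the distinction the suggested comment change draws.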

return true;
}

// 8 single-precision check : ( X != X )
Contributor

Suggested change
- // 8 single-precision check : ( X != X )
+ // 8 half-precision check : ( X != X )

%define avx_support -Denable-avx=true
%else
%define avx_support -Denable-avx=false
%endif # arch aarch64
Contributor

Suggested change
- %endif # arch aarch64
+ %endif # arch x86_64

@jijoongmoon
Collaborator Author

closed by #2663
