[Mixed Precision] Apply Gradient Clipping on Mixed Precision #2746

Open · DonghakPark opened this issue Oct 7, 2024 · 2 comments

@DonghakPark (Member)

Currently, mixed precision training is implemented in NNTrainer, but gradient clipping that takes the loss scale into account has not been implemented yet.

PyTorch's AMP example implements it as follows (gradients are unscaled before clipping), and NNTrainer needs an equivalent implementation.

import torch
from torch.amp import GradScaler, autocast  # use torch.cuda.amp on older PyTorch versions

scaler = GradScaler()

for epoch in epochs:
    for input, target in data:
        optimizer.zero_grad()
        with autocast(device_type='cuda', dtype=torch.float16):
            output = model(input)
            loss = loss_fn(output, target)
        scaler.scale(loss).backward()

        # Unscales the gradients of optimizer's assigned params in-place
        scaler.unscale_(optimizer)

        # Since the gradients of optimizer's assigned params are unscaled, clips as usual:
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)

        # optimizer's gradients are already unscaled, so scaler.step does not unscale them,
        # although it still skips optimizer.step() if the gradients contain infs or NaNs.
        scaler.step(optimizer)

        # Updates the scale for next iteration.
        scaler.update()
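For reference, the core of what scaler.unscale_(), clip_grad_norm_(), and the inf/NaN check inside scaler.step() do can be written as a few loops over a flat gradient buffer. The sketch below uses C++ (NNTrainer's implementation language) with illustrative names such as unscale_gradients and clip_grad_norm; it is not NNTrainer or PyTorch code, only the order of operations the example above relies on.

#include <cmath>
#include <vector>

// Divide every gradient by the loss scale; report whether all values stayed finite.
// (Counterpart of scaler.unscale_(optimizer) in the example above.)
bool unscale_gradients(std::vector<float> &grads, float loss_scale) {
  bool finite = true;
  for (float &g : grads) {
    g /= loss_scale;
    if (!std::isfinite(g))
      finite = false;
  }
  return finite;
}

// Scale gradients down so their global L2 norm does not exceed max_norm.
// (Counterpart of torch.nn.utils.clip_grad_norm_.)
void clip_grad_norm(std::vector<float> &grads, float max_norm) {
  float sq_sum = 0.0f;
  for (float g : grads)
    sq_sum += g * g;
  const float total_norm = std::sqrt(sq_sum);
  if (total_norm > max_norm) {
    const float coef = max_norm / (total_norm + 1e-6f);
    for (float &g : grads)
      g *= coef;
  }
}

// Per-iteration sequence: unscale first, clip only when the gradients are finite,
// otherwise skip the optimizer step (as scaler.step does).
bool unscale_clip_or_skip(std::vector<float> &grads, float loss_scale, float max_norm) {
  if (!unscale_gradients(grads, loss_scale))
    return false; // non-finite gradients: skip the update and let the scaler back off
  clip_grad_norm(grads, max_norm);
  return true; // safe to apply the optimizer step
}

The ordering matters: clipping before unscaling would compare the still-scaled gradient norm against max_norm, effectively multiplying the clipping threshold by the loss scale.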
@taos-ci (Collaborator) commented Oct 7, 2024

:octocat: cibot: Thank you for posting issue #2746. The person in charge will reply soon.

@DonghakPark (Member, Author)

Training Sequence

  1. Make an FP16 copy of the (FP32 master) weights
  2. Forward propagate using FP16 weights and activations
  3. Multiply the resulting loss by the scale factor
  4. Backward propagate using FP16 weights, activations, and gradients
  5. Multiply the weight gradients by 1/scale_factor (see the loss-scale sketch after this list)
  6. Optional processing (gradient clipping, weight decay)
  7. Update the master copy of the weights in FP32
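The scale factor used in steps 3 and 5 is usually adjusted dynamically, which is what scaler.update() does in the PyTorch example above. Below is a minimal sketch of such a policy; the constants mirror GradScaler's documented defaults, and the struct itself is illustrative only, not NNTrainer API.

// Dynamic loss-scale policy: shrink the scale after an overflow, grow it again
// after a run of stable iterations.
struct LossScaler {
  float scale = 65536.0f;      // initial loss scale (2^16)
  float growth_factor = 2.0f;  // applied after growth_interval good steps
  float backoff_factor = 0.5f; // applied after an inf/NaN overflow
  int growth_interval = 2000;  // good steps required before growing
  int good_steps = 0;

  // Call once per iteration, after the unscaled gradients have been checked.
  void update(bool found_inf_or_nan) {
    if (found_inf_or_nan) {
      scale *= backoff_factor; // overflow: the step was skipped, reduce the scale
      good_steps = 0;
    } else if (++good_steps >= growth_interval) {
      scale *= growth_factor;  // stable for a while: try a larger scale
      good_steps = 0;
    }
  }
};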
