Support for A100 #1

Open
suhasjs opened this issue Aug 8, 2022 · 4 comments

suhasjs commented Aug 8, 2022

I forked a copy of habitat to test and add support for A100s here -- https://github.com/suhasjs/habitat
I've appended specs for A100-SXM4-40GB in analyzer/habitat/data/devices.yml:

A100:
  compute_major: 8
  compute_minor: 0
  max_threads_per_block: 1024
  max_threads_per_multiprocessor: 2048
  regs_per_block: 65536
  regs_per_multiprocessor: 65536
  warp_size: 32
  shared_mem_per_block: 49152
  shared_mem_per_multiprocessor: 163840
  num_sms: 108
  shared_mem_per_block_optin: 0
  mem_bandwidth_gb: 1555
  base_clock_mhz: 1410
  peak_gflops_per_second: 8341
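
As a sanity check (just a sketch, not part of Habitat), a few of these fields can be cross-checked against what PyTorch reports for the device; the register, shared-memory, clock, and peak-FLOPS entries are not exposed this way and have to come from deviceQuery or the published specs:

import torch

props = torch.cuda.get_device_properties(0)
print("name:", props.name)                                # e.g. A100-SXM4-40GB
print("compute capability:", props.major, props.minor)    # expect 8, 0
print("num_sms:", props.multi_processor_count)            # expect 108
print("total memory (GiB):", props.total_memory / 2**30)  # ~40 for the 40GB part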

I've also added A100 to analyzer/habitat/analysis/mlp/devices.csv:

A100,40,HBM2,1555,108,9.5,19.0,300.0

In addition, I've added A100 to the list of DEVICES in experiments/run_experiment.py.

However, I'm unable to run profiling / gather raw data on the A100s with these changes.
I tried CUDA_VISIBLE_DEVICES=0 bash experiments/gather_raw_data.sh A100 and it spat out the following errors:

root@phodgx1:/mnt/new_habitat# CUDA_VISIBLE_DEVICES=0 bash experiments/gather_raw_data.sh A100
Processing: dcgan+64
Cross-device prediction: A100 -> P4000
Predicting requires_grad_ on P4000
Predicting zero_ on P4000
0, 448, 0, 32
Traceback (most recent call last):
  File "run_experiment.py", line 247, in <module>
    main()
  File "run_experiment.py", line 239, in main
    run_dcgan_experiments(context)
  File "run_experiment.py", line 153, in run_dcgan_experiments
    run_experiment_config(
  File "run_experiment.py", line 110, in run_experiment_config
    predicted_trace = trace.to_device(device)
  File "/mnt/new_habitat/analyzer/habitat/analysis/trace.py", line 48, in to_device
    operations = [
  File "/mnt/new_habitat/analyzer/habitat/analysis/trace.py", line 49, in <listcomp>
    operation.to_device(dest_device, actual_predictor)
  File "/mnt/new_habitat/analyzer/habitat/analysis/operation.py", line 85, in to_device
    return predictor.predict_operation(self, dest_device)
  File "/mnt/new_habitat/analyzer/habitat/analysis/predictor.py", line 89, in predict_operation
    self._wave_scale(operation.forward, dest_device),
  File "/mnt/new_habitat/analyzer/habitat/analysis/predictor.py", line 117, in _wave_scale
    predicted_kernels = list(map(
  File "/mnt/new_habitat/analyzer/habitat/analysis/predictor.py", line 118, in <lambda>
    lambda kernel: self._wave_scaling_strategy(
  File "/mnt/new_habitat/analyzer/habitat/analysis/wave_scaling/unified.py", line 30, in unified_wave_scaling
    return resimplified_wave_scaling(
  File "/mnt/new_habitat/analyzer/habitat/analysis/wave_scaling/resimplified.py", line 26, in resimplified_wave_scaling
    if (kernel.num_blocks // origin_wave_size == 0 and
ZeroDivisionError: integer division or modulo by zero

The errors are a result of origin_wave_size and origin_occupancy being set to zero in calculate_wave_info() in resimplified_wave_scaling.

Do I need to re-train the MLP predictors with A100 profiling data mixed in? I noticed there exists a file analyzer/habitat/analysis/mlp/train.py. Is this relevant?

Is this the right way to go about adding a new device type to Habitat? It would be great if you had a write-up on how to add a new device :) I would like to try out Habitat with a couple more device types.


jimgao1 commented Aug 9, 2022

  1. Division by 0 error
    The first issue is caused by an outdated version of cuda_occupancy.h being bundled as cpp/src/cuda/cuda_occupancy.h. That version of the header does not recognize devices with computeMajor=8 and returns CUDA_OCC_ERROR_UNKNOWN_DEVICE, so the call to cudaOccMaxActiveBlocksPerMultiprocessor in calculate_wave_info fails and returns 0, which produces the division-by-zero error.
    The solution would be to replace cpp/src/cuda/cuda_occupancy.h with /usr/local/cuda/include/cuda_occupancy.h (see the sketch below). I symlinked the file and was then able to run gather_raw_data.sh.

  2. Yeah. You would have to collect runtime data for linear, bmm, conv2d, and lstm using the scripts in tools/recording. After that, combine the data with the existing measurements and then retrain the MLPs.
    @geoffxy Do you still have the measurements for our existing devices?

EDIT: changed /usr/share -> /usr/local for cuda_occupancy.h.
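
For concreteness, a rough sketch of that header swap (assuming the checkout at /mnt/new_habitat from the traceback above and a standard CUDA toolkit layout; the equivalent shell command is ln -sf /usr/local/cuda/include/cuda_occupancy.h cpp/src/cuda/cuda_occupancy.h):

import os

repo_root = "/mnt/new_habitat"  # hypothetical checkout location, taken from the traceback
bundled = os.path.join(repo_root, "cpp/src/cuda/cuda_occupancy.h")
system_copy = "/usr/local/cuda/include/cuda_occupancy.h"

os.rename(bundled, bundled + ".bak")  # keep the outdated bundled header as a backup
os.symlink(system_copy, bundled)      # point the build at the CUDA toolkit's header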

jimgao1 self-assigned this on Aug 9, 2022

geoffxy commented Aug 9, 2022

Thanks @jimgao1 for helping with this! 😄

@geoffxy Do you still have the measurements for our existing devices?

Yep - I responded to our email thread.


suhasjs commented Aug 15, 2022

@jimgao1 Thanks a ton for your help! I was able to compile Habitat successfully on a DGX-A100 system. Additionally, I can run tools/recording/record*.py and run_experiment.py on two different GPU types (RTX2080Ti and Quadro RTX6000) and generate predictions for all the models on both.

@geoffxy I got your training data for the MLPs, added training data for the Quadro RTX 6000, and was able to train the MLPs to completion. However, this is where things start to break down.

I tried to run a naive bash gather_data.sh A100 with the compiled habitat package and got the following error when processing GNMT:

Cross-device prediction: A100 -> QRTX6000
Processing: gnmt+16
Cross-device prediction: A100 -> P4000
Traceback (most recent call last):
  File "run_experiment.py", line 248, in <module>
    main()
  File "run_experiment.py", line 243, in main
    run_gnmt_experiments(context)
  File "run_experiment.py", line 193, in run_gnmt_experiments
    run_experiment_config(
  File "run_experiment.py", line 111, in run_experiment_config
    predicted_trace = trace.to_device(device)
  File "/mnt/new_habitat/analyzer/habitat/analysis/trace.py", line 48, in to_device
    operations = [
  File "/mnt/new_habitat/analyzer/habitat/analysis/trace.py", line 49, in <listcomp>
    operation.to_device(dest_device, actual_predictor)
  File "/mnt/new_habitat/analyzer/habitat/analysis/operation.py", line 85, in to_device
    return predictor.predict_operation(self, dest_device)
  File "/mnt/new_habitat/analyzer/habitat/analysis/predictor.py", line 89, in predict_operation
    self._wave_scale(operation.forward, dest_device),
  File "/mnt/new_habitat/analyzer/habitat/analysis/predictor.py", line 117, in _wave_scale
    predicted_kernels = list(map(
  File "/mnt/new_habitat/analyzer/habitat/analysis/predictor.py", line 118, in <lambda>
    lambda kernel: self._wave_scaling_strategy(
  File "/mnt/new_habitat/analyzer/habitat/analysis/wave_scaling/unified.py", line 30, in unified_wave_scaling
    return resimplified_wave_scaling(
  File "/mnt/new_habitat/analyzer/habitat/analysis/wave_scaling/resimplified.py", line 26, in resimplified_wave_scaling
    if (kernel.num_blocks // origin_wave_size == 0 and
ZeroDivisionError: integer division or modulo by zero

The script finished all three configs for resnet, dcgan, and inception, but errors out for gnmt (I was able to run the script without any errors on both the RTX2080Ti and Quadro RTX6000 GPU types).
I thought this was an issue with the MLPs not being trained for lstm kernels on A100s, so I tried to run python3 record_lstm.py A100, and that runs into a deadlock (the process enters a sleep state and makes no progress after 10s of runtime, using 6.5GB of RAM). Here's the output of record_lstm.py after it was killed (manual kill signal) on the A100:

root@phodgx1:/mnt/new_habitat/tools/recording# CUDA_VISIBLE_DEVICES=2 python3 record_lstm.py A100                                                                                                                                             
2022-08-15 23:01 INFO     Total configurations: 322122547200
2022-08-15 23:01 INFO     Total configurations after filtering: 200000
2022-08-15 23:01 INFO     Slice size: 200000
2022-08-15 23:01 INFO     --- Found 2 recordings in lstm-A100-0.sqlite, so skipping the first 1 configurations ---
2022-08-15 23:01 INFO     Warming up...
2022-08-15 23:01 INFO     Starting to record. This process records slice 1 of 1.
Killed

I was able to profile bmm, conv2d, and linear on the A100s using the supplied scripts, so lstm is the only one that fails. Any idea why this might be happening? I think there's an issue with the LSTM backend that PyTorch uses on A100s, since the problem only shows up for GNMT. I ran strace on the script and found that the main process waits on a futex that it never receives a wake-up on.
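
To isolate the hang from the recording harness, a minimal standalone check along these lines (arbitrary shapes, not the configurations record_lstm.py sweeps) exercises the cuDNN LSTM forward/backward path on its own:

import torch

device = torch.device("cuda")
lstm = torch.nn.LSTM(input_size=512, hidden_size=512, num_layers=2).to(device)
x = torch.randn(64, 32, 512, device=device)  # (seq_len, batch, input_size)

out, _ = lstm(x)          # forward pass through the cuDNN LSTM kernels
out.sum().backward()      # backward pass
torch.cuda.synchronize()  # force all queued GPU work to finish
print("LSTM forward/backward completed")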

P.S.: Do I need to run record_conv2d.py multiple times for a given GPU type? I see that it samples a random 200k arg combinations and profiles them, but I'm not sure whether I need to profile more than that single 200k slice.


jimgao1 commented Aug 16, 2022

Hi Suhas,

Division by 0:

  • Did you update the cuda_occupancy.h file in your habitat installation? I was able to get things to run after updating this file.

Unable to record LSTM:

  • What version of CUDA/PyTorch are you running? You can obtain the former with grep -A1 "CUDA SDK" /usr/local/cuda/version.json and the latter with pip freeze | grep torch (or from Python, as sketched below). It might also help to attach the whole /usr/local/cuda/version.json file, since that includes versions for CUPTI, CUBLAS, etc.
  • I recall being able to record all 4 kernel types on an A100 instance on GCP. Are you running anything else simultaneously?
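
For reference, the PyTorch side can also be checked from Python (a quick sketch):

import torch

print("torch:", torch.__version__)
print("CUDA (as built into torch):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("device:", torch.cuda.get_device_name(0))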

Thanks!
