MSCCL Multithreaded regression alternative state management #1352

Draft
wants to merge 2 commits into
base: develop

Conversation

corey-derochie-amd
Collaborator

@corey-derochie-amd corey-derochie-amd commented Sep 26, 2024

Details

Do not mention proprietary info or link to internal work items in this PR.

Work item: Internal

What were the changes?
Address the multi-threaded (MT) MSCCL issue and re-enable MSCCL in MT mode by enabling MSCCL single-process mode.
This is a change to the original implementation: it hands off the state from thread-local to rank appropriately, and it uses a vector instead of an unordered_map.
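
As a rough illustration of the hand-off described above, here is a minimal C++ sketch. It is not the code in this PR: `RankState`, `RankStateTable`, `tlsActive`, and `activateRank` are hypothetical names, and the sketch assumes the number of ranks is known up front so that a vector indexed by rank can stand in for an unordered_map.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for per-rank state; not an RCCL type.
struct RankState {
  int device = -1;           // device this rank's buffers belong to
  bool initialized = false;  // set once the rank's state has been built
};

// One slot per rank in a contiguous vector (assumed alternative to an
// unordered_map keyed by rank). The vector is sized once and never resized,
// so references to its elements stay valid.
class RankStateTable {
 public:
  explicit RankStateTable(std::size_t nRanks) : slots_(nRanks) {}

  RankState& at(std::size_t rank) {
    assert(rank < slots_.size());
    return slots_[rank];
  }

 private:
  std::vector<RankState> slots_;
};

// Thread-local pointer to the state of the rank the calling thread is
// currently working on (hypothetical).
thread_local RankState* tlsActive = nullptr;

// Hand-off: before a thread does work for a rank, point its thread-local
// pointer at that rank's slot, initializing the slot on first use.
RankState& activateRank(RankStateTable& table, std::size_t rank, int device) {
  RankState& state = table.at(rank);
  if (!state.initialized) {
    state.device = device;
    state.initialized = true;
  }
  tlsActive = &state;
  return state;
}
```

In this sketch a single thread that drives several ranks calls `activateRank` each time it switches rank, so each rank keeps its own state instead of sharing whatever the thread touched last.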

Why were the changes made?
Severe regression when multiple devices are used per thread in rccl-tests allreduce.

How was the outcome achieved?
Set the device ID correctly so that scratch memory and sync buffers are allocated on the right device.
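
One common way to make sure allocations land on the intended device is a scoped guard that switches the current device and restores the previous one on exit. The sketch below is an assumption about how this could look, not the PR's implementation: `MockRuntime`, `DeviceGuard`, and `allocatePerRankBuffers` are hypothetical, and the mock stands in for the GPU runtime so the example runs without hardware. In a HIP build, `getDevice` and `setDevice` would correspond to `hipGetDevice` and `hipSetDevice`.

```cpp
#include <cassert>
#include <vector>

// Hypothetical mock of a GPU runtime; records which device was current
// at the time of each allocation.
struct MockRuntime {
  int current = 0;
  std::vector<int> allocDevices;

  int getDevice() const { return current; }
  void setDevice(int device) { current = device; }
  void allocate() { allocDevices.push_back(current); }
};

// RAII guard: selects `device` for the enclosing scope and restores the
// previously selected device when the scope ends.
template <typename Runtime>
class DeviceGuard {
 public:
  DeviceGuard(Runtime& rt, int device) : rt_(rt), saved_(rt.getDevice()) {
    assert(device >= 0);
    rt_.setDevice(device);
  }
  ~DeviceGuard() { rt_.setDevice(saved_); }

  DeviceGuard(const DeviceGuard&) = delete;
  DeviceGuard& operator=(const DeviceGuard&) = delete;

 private:
  Runtime& rt_;
  int saved_;
};

// One thread allocating buffers for several ranks, each on its own device.
template <typename Runtime>
void allocatePerRankBuffers(Runtime& rt, const std::vector<int>& rankDevices) {
  for (int device : rankDevices) {
    DeviceGuard<Runtime> guard(rt, device);
    rt.allocate();  // stands in for the scratch buffer
    rt.allocate();  // stands in for the sync buffer
  }
}
```

Without the guard, every allocation in the loop would use whichever device happened to be current on the thread, which matches the symptom described when multiple devices are used per thread.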

Additional Details:
Single-threaded (original | fix)

#                                                              out-of-place                       in-place            |  #                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong  |  #       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)         |  #        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
     1048576        262144     float     sum      -1    74.21   14.13   24.73      0    73.38   14.29   25.01      0  |       1048576        262144     float     sum      -1    72.71   14.42   25.24      0    72.84   14.40   25.19      0
     2097152        524288     float     sum      -1    73.45   28.55   49.97      0    73.24   28.64   50.11      0  |       2097152        524288     float     sum      -1    72.86   28.78   50.37      0    73.29   28.62   50.08      0
     3145728        786432     float     sum      -1    75.96   41.41   72.47      0    74.08   42.46   74.31      0  |       3145728        786432     float     sum      -1    76.17   41.30   72.27      0    74.21   42.39   74.18      0
     4194304       1048576     float     sum      -1    73.74   56.88   99.55      0    73.60   56.99   99.73      0  |       4194304       1048576     float     sum      -1    74.15   56.57   98.99      0    73.64   56.96   99.67      0
     5242880       1310720     float     sum      -1    74.04   70.82  123.93      0    76.57   68.47  119.83      0  |       5242880       1310720     float     sum      -1    73.98   70.87  124.02      0    75.36   69.57  121.75      0
     6291456       1572864     float     sum      -1    74.68   84.24  147.43      0    74.58   84.35  147.62      0  |       6291456       1572864     float     sum      -1    73.39   85.73  150.03      0    73.49   85.61  149.82      0
     7340032       1835008     float     sum      -1    79.56   92.26  161.46      0    82.47   89.00  155.76      0  |       7340032       1835008     float     sum      -1    79.33   92.53  161.92      0    82.20   89.30  156.27      0
     8388608       2097152     float     sum      -1    87.23   96.16  168.29      0    88.17   95.14  166.49      0  |       8388608       2097152     float     sum      -1    87.09   96.32  168.57      0    87.85   95.49  167.10      0
     9437184       2359296     float     sum      -1    94.98   99.36  173.88      0    97.67   96.62  169.08      0  |       9437184       2359296     float     sum      -1    94.96   99.38  173.92      0    97.63   96.66  169.16      0
    10485760       2621440     float     sum      -1    100.9  103.92  181.87      0    104.1  100.75  176.30      0  |      10485760       2621440     float     sum      -1    100.9  103.94  181.90      0    104.0  100.87  176.52      0
    11534336       2883584     float     sum      -1    111.3  103.68  181.43      0    113.9  101.25  177.18      0  |      11534336       2883584     float     sum      -1    111.2  103.75  181.56      0    113.8  101.38  177.42      0
    12582912       3145728     float     sum      -1    117.0  107.57  188.24      0    129.5   97.20  170.10      0  |      12582912       3145728     float     sum      -1    117.0  107.53  188.19      0    129.6   97.09  169.91      0
    13631488       3407872     float     sum      -1    127.4  107.00  187.26      0    132.6  102.82  179.93      0  |      13631488       3407872     float     sum      -1    127.3  107.11  187.45      0    132.6  102.83  179.95      0
    14680064       3670016     float     sum      -1    133.3  110.16  192.78      0    136.9  107.26  187.70      0  |      14680064       3670016     float     sum      -1    133.2  110.22  192.89      0    136.9  107.26  187.70      0
    15728640       3932160     float     sum      -1    143.3  109.74  192.05      0    149.1  105.50  184.63      0  |      15728640       3932160     float     sum      -1    143.4  109.70  191.97      0    148.9  105.60  184.80      0
    16777216       4194304     float     sum      -1    149.8  112.00  196.00      0    155.9  107.63  188.35      0  |      16777216       4194304     float     sum      -1    149.7  112.04  196.07      0    156.0  107.52  188.16      0
# Errors with asterisks indicate errors that have exceeded the maximum threshold.                                     |  # Errors with asterisks indicate errors that have exceeded the maximum threshold.
# Out of bounds values : 0 OK                                                                                         |  # Out of bounds values : 0 OK
# Avg bus bandwidth    : 144.171                                                                                      |  # Avg bus bandwidth    : 144.471 

Multi-threaded (original | fix)

#                                                              out-of-place                       in-place            |  #                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong  |  #       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)         |  #        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
     1048576        262144     float     sum      -1    28.15   37.25   65.19      0    27.52   38.10   66.67      0  |       1048576        262144     float     sum      -1    27.84   37.66   65.91      0    27.41   38.26   66.95      0
     2097152        524288     float     sum      -1    37.03   56.64   99.11      0    35.22   59.55  104.21      0  |       2097152        524288     float     sum      -1    34.81   60.25  105.43      0    35.08   59.79  104.63      0
     3145728        786432     float     sum      -1    44.72   70.34  123.09      0    45.77   68.73  120.27      0  |       3145728        786432     float     sum      -1    44.53   70.64  123.62      0    45.51   69.13  120.97      0
     4194304       1048576     float     sum      -1    52.56   79.79  139.64      0    54.18   77.42  135.48      0  |       4194304       1048576     float     sum      -1    52.31   80.18  140.31      0    54.21   77.38  135.41      0
     5242880       1310720     float     sum      -1    62.37   84.06  147.10      0    64.35   81.47  142.58      0  |       5242880       1310720     float     sum      -1    62.10   84.42  147.74      0    64.26   81.59  142.78      0
     6291456       1572864     float     sum      -1    68.41   91.97  160.95      0    70.77   88.90  155.58      0  |       6291456       1572864     float     sum      -1    68.37   92.03  161.04      0    70.56   89.16  156.03      0
     7340032       1835008     float     sum      -1    78.57   93.42  163.49      0    81.28   90.31  158.04      0  |       7340032       1835008     float     sum      -1    78.33   93.70  163.98      0    81.07   90.54  158.45      0
     8388608       2097152     float     sum      -1    84.57   99.19  173.59      0    87.37   96.01  168.02      0  |       8388608       2097152     float     sum      -1    84.42   99.37  173.89      0    87.33   96.05  168.09      0
     9437184       2359296     float     sum      -1    94.91   99.43  174.01      0    97.37   96.92  169.60      0  |       9437184       2359296     float     sum      -1    94.73   99.63  174.35      0    97.13   97.16  170.04      0
    10485760       2621440     float     sum      -1    100.7  104.14  182.25      0    103.6  101.22  177.13      0  |      10485760       2621440     float     sum      -1    100.7  104.15  182.26      0    103.4  101.46  177.55      0
    11534336       2883584     float     sum      -1    111.1  103.82  181.69      0    113.7  101.47  177.57      0  |      11534336       2883584     float     sum      -1    110.9  103.96  181.93      0    113.6  101.55  177.71      0
    12582912       3145728     float     sum      -1    117.0  107.59  188.28      0    130.4   96.49  168.86      0  |      12582912       3145728     float     sum      -1    116.8  107.69  188.46      0    130.4   96.51  168.89      0
    13631488       3407872     float     sum      -1    127.2  107.12  187.47      0    133.7  101.96  178.43      0  |      13631488       3407872     float     sum      -1    127.1  107.23  187.66      0    133.9  101.80  178.15      0
    14680064       3670016     float     sum      -1    133.1  110.31  193.04      0    138.5  105.96  185.44      0  |      14680064       3670016     float     sum      -1    133.0  110.40  193.21      0    138.1  106.27  185.98      0
    15728640       3932160     float     sum      -1    143.0  110.00  192.49      0    148.6  105.83  185.21      0  |      15728640       3932160     float     sum      -1    143.0  110.03  192.54      0    148.6  105.81  185.17      0
    16777216       4194304     float     sum      -1    149.5  112.23  196.40      0    155.8  107.67  188.42      0  |      16777216       4194304     float     sum      -1    149.3  112.36  196.63      0    155.9  107.63  188.36      0
# Errors with asterisks indicate errors that have exceeded the maximum threshold.                                     |  # Errors with asterisks indicate errors that have exceeded the maximum threshold.
# Out of bounds values : 0 OK                                                                                         |  # Out of bounds values : 0 OK
# Avg bus bandwidth    : 157.791                                                                                      |  # Avg bus bandwidth    : 158.254 

Multi-process (original | fix)

#                                                              out-of-place                       in-place            |  #                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong  |  #       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)         |  #        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
     1048576        262144     float     sum      -1    25.84   40.58   71.01      0    25.47   41.17   72.05      0  |       1048576        262144     float     sum      -1    25.88   40.52   70.91      0    25.62   40.93   71.62      0
     2097152        524288     float     sum      -1    32.79   63.95  111.92      0    33.11   63.35  110.86      0  |       2097152        524288     float     sum      -1    32.73   64.07  112.12      0    33.08   63.39  110.94      0
     3145728        786432     float     sum      -1    42.67   73.72  129.01      0    43.88   71.69  125.47      0  |       3145728        786432     float     sum      -1    42.66   73.75  129.06      0    43.86   71.72  125.52      0
     4194304       1048576     float     sum      -1    50.19   83.57  146.25      0    51.95   80.74  141.29      0  |       4194304       1048576     float     sum      -1    50.18   83.58  146.26      0    52.05   80.58  141.01      0
     5242880       1310720     float     sum      -1    60.03   87.34  152.85      0    62.49   83.91  146.84      0  |       5242880       1310720     float     sum      -1    60.04   87.32  152.82      0    62.38   84.04  147.08      0
     6291456       1572864     float     sum      -1    66.15   95.10  166.43      0    68.45   91.91  160.85      0  |       6291456       1572864     float     sum      -1    66.13   95.13  166.49      0    68.59   91.73  160.53      0
     7340032       1835008     float     sum      -1    76.43   96.03  168.05      0    79.18   92.71  162.24      0  |       7340032       1835008     float     sum      -1    76.36   96.12  168.21      0    79.16   92.73  162.27      0
     8388608       2097152     float     sum      -1    82.29  101.93  178.39      0    85.47   98.15  171.76      0  |       8388608       2097152     float     sum      -1    82.25  101.99  178.48      0    85.57   98.03  171.56      0
     9437184       2359296     float     sum      -1    92.68  101.82  178.19      0    95.21   99.12  173.45      0  |       9437184       2359296     float     sum      -1    92.59  101.93  178.37      0    95.19   99.14  173.49      0
    10485760       2621440     float     sum      -1    98.28  106.69  186.70      0    101.5  103.31  180.80      0  |      10485760       2621440     float     sum      -1    98.29  106.69  186.70      0    101.4  103.46  181.05      0
    11534336       2883584     float     sum      -1    108.8  105.99  185.48      0    111.3  103.67  181.42      0  |      11534336       2883584     float     sum      -1    108.8  106.01  185.52      0    111.5  103.47  181.07      0
    12582912       3145728     float     sum      -1    114.5  109.86  192.25      0    128.7   97.75  171.06      0  |      12582912       3145728     float     sum      -1    114.6  109.82  192.18      0    128.8   97.70  170.98      0
    13631488       3407872     float     sum      -1    125.0  109.02  190.78      0    131.7  103.49  181.10      0  |      13631488       3407872     float     sum      -1    125.1  108.99  190.73      0    131.8  103.45  181.04      0
    14680064       3670016     float     sum      -1    130.9  112.18  196.32      0    135.8  108.11  189.20      0  |      14680064       3670016     float     sum      -1    130.9  112.16  196.28      0    135.7  108.17  189.30      0
    15728640       3932160     float     sum      -1    141.0  111.56  195.23      0    146.7  107.23  187.65      0  |      15728640       3932160     float     sum      -1    141.1  111.47  195.07      0    146.4  107.42  187.99      0
    16777216       4194304     float     sum      -1    147.4  113.83  199.20      0    153.9  109.02  190.79      0  |      16777216       4194304     float     sum      -1    147.3  113.89  199.30      0    153.7  109.17  191.04      0
# Errors with asterisks indicate errors that have exceeded the maximum threshold.                                     |  # Errors with asterisks indicate errors that have exceeded the maximum threshold.
# Out of bounds values : 0 OK                                                                                         |  # Out of bounds values : 0 OK
# Avg bus bandwidth    : 162.34                                                                                       |  # Avg bus bandwidth    : 162.344 
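
For readers unfamiliar with the columns above: in rccl-tests, `algbw` is the message size divided by the elapsed time, and for allreduce `busbw` is `algbw` scaled by `2*(n-1)/n`, where `n` is the number of ranks. The sketch below reproduces the first single-threaded row. The rank count of 8 is an assumption inferred from the busbw/algbw ratio of 1.75 in the tables; it is not stated in this PR, and the function names are illustrative.

```cpp
#include <cassert>
#include <cmath>

// Algorithm bandwidth in GB/s from a size in bytes and a time in
// microseconds. 1 byte/us equals 1e-3 GB/s.
double algBandwidthGBs(double bytes, double timeUs) {
  assert(timeUs > 0.0);
  return bytes / timeUs / 1.0e3;
}

// Bus bandwidth for allreduce: algbw scaled by 2*(n-1)/n.
double allReduceBusBandwidthGBs(double algbwGBs, int nRanks) {
  assert(nRanks > 0);
  return algbwGBs * 2.0 * (nRanks - 1) / nRanks;
}
```

With 1048576 bytes in 74.21 us this gives an `algbw` of about 14.13 GB/s, and with 8 ranks a `busbw` of about 24.73 GB/s, matching the first row of the single-threaded table.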

Approval Checklist

Do not approve until these items are satisfied.

  • Verify the CHANGELOG has been updated, if
    • there are any NCCL API version changes,
    • any changes impact library users, and/or
    • any changes impact any other ROCm library.

@corey-derochie-amd corey-derochie-amd changed the title from "MSCCL Multithreaded regression root cause fix alternative" to "MSCCL Multithreaded regression alternative state allocation" Sep 26, 2024
@corey-derochie-amd corey-derochie-amd changed the title from "MSCCL Multithreaded regression alternative state allocation" to "MSCCL Multithreaded regression alternative state management" Sep 26, 2024
@corey-derochie-amd corey-derochie-amd force-pushed the msccl_regression_root_cause_fix branch from a2e190c to e88e0ec October 23, 2024 20:01