This repository has been archived by the owner on Oct 11, 2024. It is now read-only.

Use naive decompress for SM<8.0 #32

Merged
merged 5 commits into main on Feb 21, 2024

Conversation

@mgoin (Member) commented on Feb 20, 2024

A warning is printed when this fallback is triggered:

```
WARNING 02-20 22:21:27 sparse_w16a16.py:32] Unstructured sparse kernels are not optimized for NVIDIA SM < 8.0. Naive decompress kernels will be used and can be slower than dense models
```
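As a rough sketch of the kind of capability check that gates this fallback (the function name and return values here are hypothetical; the actual vLLM logic lives in `sparse_w16a16.py` and may differ):

```python
def choose_decompress_kernel(device_capability):
    """Pick a kernel path from the GPU's (major, minor) compute capability.

    Hypothetical helper for illustration only; SM < 8.0 (e.g. a T4,
    which is SM 7.5) takes the naive decompress path.
    """
    major, _minor = device_capability
    if major < 8:
        print("WARNING: Unstructured sparse kernels are not optimized for "
              "NVIDIA SM < 8.0. Naive decompress kernels will be used.")
        return "naive_decompress"
    return "optimized_sparse"
```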

Works on a T4 with:

```python
from vllm import LLM, SamplingParams

model = LLM(
    "nm-testing/opt-125m-pruned2.4",
    sparsity="sparse_w16a16",
    enforce_eager=True,
    dtype="float16",
)

sampling_params = SamplingParams(max_tokens=100, temperature=0)
outputs = model.generate("Hello my name is", sampling_params=sampling_params)
outputs[0].outputs[0].text
```

Tested in Colab: https://colab.research.google.com/drive/15xRvWX5gNaTb00BcaXhxwMm6yxavIKGN?usp=sharing
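For intuition, a naive decompress amounts to scattering the stored nonzero values back into a dense matrix according to a bitmask. This is a simplified pure-Python illustration, not the actual CUDA kernel:

```python
def naive_decompress(values, bitmask, shape):
    """Expand a bitmask-compressed sparse weight back to a dense matrix.

    values:  nonzero entries, in row-major order
    bitmask: one bool per dense position (True = position holds a value)
    shape:   (rows, cols) of the dense matrix
    """
    rows, cols = shape
    assert len(bitmask) == rows * cols
    assert sum(bitmask) == len(values)

    dense = [0.0] * (rows * cols)
    it = iter(values)
    for i, present in enumerate(bitmask):
        if present:
            dense[i] = next(it)  # scatter the next stored value
    # reshape the flat buffer into rows
    return [dense[r * cols:(r + 1) * cols] for r in range(rows)]
```

The real kernels avoid this full materialization on SM >= 8.0 by using sparse tensor cores; on older GPUs the decompressed dense weight is used directly, which is why the fallback can be slower than a dense model.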

@LucasWilkinson (Collaborator) left a comment:

LGTM, thanks for doing this!

@mgoin mgoin merged commit b61bc82 into main Feb 21, 2024
2 checks passed
@mgoin mgoin deleted the support-bitmask-fallback branch February 21, 2024 00:11
robertgshaw2-neuralmagic pushed a commit that referenced this pull request Feb 21, 2024
tlrmchlsmth pushed a commit that referenced this pull request Feb 21, 2024
robertgshaw2-neuralmagic pushed a commit that referenced this pull request Feb 21, 2024
robertgshaw2-neuralmagic pushed a commit that referenced this pull request Feb 22, 2024
robertgshaw2-neuralmagic pushed a commit that referenced this pull request Feb 22, 2024