Issues: microsoft/MInference
#86 [Question]: CUDA error: an illegal memory access was encountered when running benchmark_e2e.py
Label: question. Opened Nov 20, 2024 by lepangdan.
#85 [Feature Request]: Is it possible to get the returned logsumexp in streamingllm forward?
Label: feature request. Opened Nov 17, 2024 by 311dada.
#84 [Question]: Discrepancy in Pre-filling Time and Memory Consumption on Single A100
Label: question. Opened Nov 15, 2024 by lepangdan.
#83 [Question]: Am I using minference correctly?
Label: question. Opened Oct 30, 2024 by YLGH.
#82 [Question]: Analysis of attention scores (too sparse)
Label: question. Opened Oct 19, 2024 by wiluen.
#78 [Question]: Sparsity of minference
Label: question. Opened Sep 23, 2024 by susu1210.
#77 [Bug]: Torch not found: can't install with pip install (Python 3.12, CUDA 12.6 Update 1, PyTorch 2.4.1)
Label: bug. Opened Sep 20, 2024 by atemerev.
#76 [Question]: Could you provide more examples of other attention usage, e.g., dilated1, streaming, snapkv?
Label: question. Opened Sep 18, 2024 by gaow0007.
#75 [Bug]: loc("Minference/minference/ops/pit_sparse_flash_attention_v2.py":110:23): error: operation scheduled before its operands
Label: bug. Opened Sep 18, 2024 by leoyuppieqnew.
#74 [Feature Request]: Support LLaVA model / Low generation speed
Label: feature request. Opened Sep 18, 2024 by ThisisBillhe.
#73 [Question]: What is the speedup of the attention kernel in the current implementation?
Label: question. Opened Sep 10, 2024 by foreverpiano.
#71 Performance Degradation when Using MInference with Qwen2-7B-Instruct Model
Label: question. Opened Aug 26, 2024 by yumingfan-0219.
#67 [Bug]: vllm executor.driver_worker: 'RayWorkerWrapper' object has no attribute 'model_runner'
Label: bug. Opened Aug 8, 2024 by TPLink32.
#64 [Question]: Confusion about Optimal Search Pattern Configuration
Label: question. Opened Aug 6, 2024 by Dianaia.
#62 [Question]: It seems that minference does not currently support tensor parallelism under vllm, since in a multi-GPU environment the head_id is incorrect compared to a single GPU
Labels: feature request, question. Opened Aug 4, 2024 by zh2333.
#57 [Question]: Why is every head config saved with "vertical_and_slash"?
Label: question. Opened Jul 29, 2024 by fmmoret.
#56 Does MInference support CUDA 11.8?
Label: question. Opened Jul 29, 2024 by hensiesp32.
#53 Shape of slash mismatch when input batch size > 1
Label: bug. Opened Jul 23, 2024 by polarispw.
#52 [Question]: attn_type="minference" and attn_type="hf" give different results
Label: question. Opened Jul 21, 2024 by qiling1345.
#47 [Question]: Question about the settings of vertical_size and slash_size in the vertical_and_slash pattern
Label: question. Opened Jul 17, 2024 by ALUKErnel.
#46 [Question]: Does vertical_slash_sparse_attention support concatenating all batches into a single row, as flash_attn_2_cuda.varlen_fwd does?
Label: question. Opened Jul 17, 2024 by Amanda-Barbara.
#45 [Question]: ModuleNotFoundError: No module named 'minference.cuda'
Label: question. Opened Jul 16, 2024 by lai-serena.
#43 [Question]: Why is running MInference/examples/run_vllm.py not as fast as running vllm alone?
Label: question. Opened Jul 16, 2024 by zjjznw123.
#40 [Question]: How does vLLM use MInference through the OpenAI Compatible Server?
Label: question. Opened Jul 15, 2024 by jueming0312.