Add Attention Sinks (TVM portion) #300

kmn1024 · 2023-12-07T05:44:19Z

The TVM component of the Attention Sinks implementation (https://arxiv.org/abs/2309.17453). See mlc-ai/mlc-llm#1357

This API lets the caller choose (1) how many slots to use as sinks and (2) how far to trim the cache.

Callers can pick a small number of sink slots, as in the paper, or a number large enough to keep the entire system command.
A typical sliding-window approach would call this function after every append and trim to max_window_size. For better performance, callers can trim more frequently and aggressively.
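
To make the two knobs concrete, here is a minimal Python sketch of the eviction policy described above. It is illustrative only and is not the code in this PR: the function name `evict_with_sinks`, its parameters `num_sinks` and `trim_to`, and the use of a plain list in place of the KV cache are all hypothetical. It also assumes that "trim to" means the total number of slots kept, sinks included.

```python
from typing import List, TypeVar

T = TypeVar("T")


def evict_with_sinks(cache: List[T], num_sinks: int, trim_to: int) -> List[T]:
    """Hypothetical helper: keep the first `num_sinks` entries plus the most
    recent entries, so that at most `trim_to` entries remain in total."""
    if num_sinks < 0 or trim_to < num_sinks:
        raise ValueError("require 0 <= num_sinks <= trim_to")
    if len(cache) <= trim_to:
        # Nothing to evict yet; return an unchanged copy.
        return list(cache)
    num_recent = trim_to - num_sinks
    sinks = cache[:num_sinks]
    # Slicing from an explicit start index avoids the cache[-0:] pitfall
    # when num_recent is 0.
    recent = cache[len(cache) - num_recent:]
    return sinks + recent


# Sliding-window style usage: trim after every append.
cache: List[int] = []
for token in range(10):
    cache.append(token)
    cache = evict_with_sinks(cache, num_sinks=2, trim_to=6)

print(cache)  # [0, 1, 6, 7, 8, 9]
```

In this sketch the first two entries survive every trim, while the remaining four slots always hold the most recent tokens.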


kmn1024 · 2023-12-07T06:07:48Z

Sorry, bad PR.

tqchen and others added 11 commits November 22, 2023 10:29

[MLC][CI] Do not upstream

b484996

MLC local ci setup.

[MLC][CI] Do not upstream - Win/Mac Building CI (#137)

3ede084

This PR adds CI for Windows and macOS building, which may take 90-100 mins. Co-authored-by: Siyuan Feng <[email protected]>

[CI] Add GitHub Action to Trigger Jenkins

189412e

attn sink tvm portion

27f54ad

Merge branch 'mlc-ai:mlc' into main

00b61fd

Update EvictWithSinks API

64028c9

Merge branch 'main' of https://github.com/kmn1024/relax_attention_sinks

6d98ddd

Better EvictForSink API to allow clearing of all caches

81d12c4

Bugfix: cannot use memcpy/memset, use NDArray funcs.

25cdf3e

Better formatting

9f20a3a

Better comments on choosing sink tokens

fa45eaf

kmn1024 closed this Dec 7, 2023
