Add heuristic algorithm for speculative #3006

leng-yue · 2023-09-04T09:21:39Z

Based on Hugging Face's assisted generation blog, we've implemented a simple heuristic to determine the number of draft tokens. Specifically, if all draft tokens are accepted, we increase n_draft_token by 2; otherwise, we decrease it by 1. Check out some examples using the original 3 samples from issue #2926.

target model: Code Llama 34B Q4_K_M
draft model: Code Llama 7B Q4_K_M
device: 2x3090, NVLINK, cublas

# example 0
./speculative \
-m models/codellama-34b.Q4_K_M.gguf \
-md models/codellama-7b.Q4_K_M.gguf \
-p "// Quick-sort implementation in C (4 spaces indentation + detailed comments) and sample usage:\n\n#include" \
-e -ngl 1000 -t 4 -n 256 -c 4096 -s 8 --top_k 1 --draft 16

## With heuristic
n_draft   = 53
n_predict = 266
n_drafted = 245
n_accept  = 231
accept    = 94.286%

decoded  266 tokens in   11.990 seconds, speed:   22.185 t/s

## Without heuristic
n_draft   = 16
n_predict = 266
n_drafted = 254
n_accept  = 223
accept    = 87.795%

decoded  266 tokens in   13.234 seconds, speed:   20.099 t/s

# example 1
./speculative \
-m models/codellama-34b.Q4_K_M.gguf \
-md models/codellama-7b.Q4_K_M.gguf \
-p "// Dijkstra algorithm in C++ (4 spaces indentation + detailed comments) + sample usage:\n\n" \
-e -ngl 1000 -t 4 -n 4096 -c 4096 -s 20 --top_k 1 --draft 16

## With heuristic
n_draft   = 75
n_predict = 1546
n_drafted = 2261
n_accept  = 1263
accept    = 55.860%
decoded 1546 tokens in  109.811 seconds, speed:   14.079 t/s

## Without heuristic
n_draft   = 16
n_predict = 2071
n_drafted = 2400
n_accept  = 1749
accept    = 72.875%

decoded 2071 tokens in  126.508 seconds, speed:   16.370 t/s

# example 2
./speculative \
-m models/codellama-34b.Q4_K_M.gguf \
-md models/codellama-7b.Q4_K_M.gguf \
-p "# Dijkstra's shortest path algorithm in Python (4 spaces indentation) + complexity analysis:\n\n" \
-e -ngl 1000 -t 4 -n 512 -c 4096 -s 20 --top_k 1 --draft 16

## With heuristic
n_draft   = 38
n_predict = 219
n_drafted = 203
n_accept  = 193
accept    = 95.074%

decoded  219 tokens in    9.449 seconds, speed:   23.177 t/s

## Without heuristic
n_draft   = 16
n_predict = 219
n_drafted = 229
n_accept  = 187
accept    = 81.659%

decoded  219 tokens in   11.285 seconds, speed:   19.406 t/s

In Example 2, you'll see that varying n_draft produces different output tokens. This suggests a potential issue, which you can find detailed here:

llama.cpp/examples/speculative/speculative.cpp

Line 196 in e4386f4

    
           llama_eval(ctx_tgt, drafted.data(), drafted.size(), n_past_tgt, params.n_threads);

To my mind, the optimal approach would be to batch the drafts in groups (e.g., a batch size of 16) and feed them all at once to the target model, rather than concatenating single draft with the original prompt and feeding it as a sequence.

bobqianic · 2023-09-04T09:37:34Z

examples/speculative/speculative.cpp

+        } else {
+            n_draft -= 1;
+            LOG("drafted token rejected, n_draft = %d\n", n_draft);
+        }


Why does n_draft go up by 2 when all drafted tokens are accepted, but decrease by 1 when a drafted token is rejected? Is there a more efficient algorithm to handle this? The current approach seems similar to a simplified version of TCP Friendly Rate Control algorithm.

That's pretty much borrowed from Hugging Face's code. We could fine-tune it by tweaking some parameters. Since getting all tokens right is challenging, it seems reasonable to bump up n_draft by 2 when everything aligns and decrease it by 1 otherwise.

ggerganov · 2023-09-04T09:50:30Z

In Example 2, you'll see that varying n_draft produces different output tokens.

Yes, this is concerning - it should not happen, so probably a bug somewhere.

To my mind, the optimal approach would be to batch the drafts in groups (e.g., a batch size of 16) and feed them all at once to the target model, rather than concatenating single draft with the original prompt and feeding it as a sequence.

I don't understand - currently the entire sequence of drafted tokens is submitted as a single batch to the target model.

ggerganov

I don't reproduce the issue

Can you send the logs where you observe the different output between speculation and regular sampling?

ggerganov · 2023-09-04T10:01:41Z

examples/speculative/speculative.cpp

+        if (drafted.size() > 0 && all_accepted) {
+            n_draft += 2;
+            LOG("all drafted tokens accepted, n_draft = %d\n", n_draft);
+        } else {
+            n_draft -= 1;
+            LOG("drafted token rejected, n_draft = %d\n", n_draft);
+        }


n_draft should be constrained to not go below 2 for example and

Here is what I got after restricting the minimum n_draft to 2:

Outputs

// Dijkstra algorithm in C++ (4 spaces indentation + detailed comments) + sample usage: // 1. Add all nodes to the graph, and add edges between them with distances/weights. // 2. Call dijkstra(start_node) to get shortest paths from start_node to all other nodes. // It returns a map of <node, distance>. // 3. Use path_exists(node) to check if there is a path between start_node and node. // 4. Use get_shortest_distance(node) to get the shortest distance from start_node to node. // If no path exists, it returns -1. // 5. Use reconstruct_path(node) to reconstruct the shortest path between start_node and node. // It returns a vector of nodes that forms the path (including both start_node and node). // If no path exists, it returns an empty vector. #include <iostream> #include <vector> #include <map> #include <set> #include <queue> using namespace std; class Graph { public: struct Edge { int node; int distance; }; private: // adjacency list of the graph (to store all edges) vector<vector<Edge>> adj_list; // map to keep track of visited nodes during dijkstra() map<int, bool> visited; // map to keep track of distances from start node to other nodes during dijkstra() map<int, int> distance; public: Graph(int n) { adj_list.resize(n); } void add_edge(int src, int dest, int dist) { Edge edge = {dest, dist}; adj_list[src].push_back(edge); } map<int, int> dijkstra(int start) { // initialize all distances as -1 (inf). 0 is the starting node. for (int i = 0; i < adj_list.size(); ++i) { distance[i] = -1; } distance[start] = 0; // create a priority queue to get the minimum distance node, // visited helps avoid processing same node again priority_queue<pair<int, int>, vector<pair<int, int>>, greater<pair<int, int>>> pq; pq.push({0, start}); while (!pq.empty()) { // get the node with minimum distance from priority queue auto top = pq.top(); int u = top.second; pq.pop(); if (visited[u]) continue; // skip already visited nodes // mark the node as visited visited[u] = true; // process all neighbours of the current node 'u' for (auto edge : adj_list[u]) { int v = edge.node; int dist_v = distance[u] + edge.distance; // distance from 'u' to 'v' // check if there is shorted path to 'v' through 'u'. if (distance[v] == -1 || distance[v] > dist_v) { // update the distance of 'v' only if it is not in visited, // or can be reached with shorter distance from 'u'. distance[v] = dist_v; pq.push({dist_v, v}); // add 'v' to priority queue } } } return distance; } bool path_exists(int dest) { // return true if the destination node is in visited map. return visited[dest]; } int get_shortest_distance(int dest) { // return the shortest distance from start to destination node. return distance[dest]; } vector<int> reconstruct_path(int dest) { // vector to store the path vector<int> path; if (!visited[dest]) return path; // no path exists from start node to destination node. for (int v = dest; v != 0; v = distance[v]) { path.push_back(v); } reverse(path.begin(), path.end()); return path; } }; int main() { // create a graph with 9 nodes (0 to 8) Graph g(9); g.add_edge(0, 1, 4); g.add_edge(0, 7, 8); g.add_edge(1, 2, 8); g.add_edge(1, 7, 11); g.add_edge(2, 3, 7); g.add_edge(2, 8, 2); g.add_edge(2, 5, 4); g.add_edge(3, 4, 9); g.add_edge(3, 5, 14); g.add_edge(4, 5, 10); g.add_edge(5, 6, 2); g.add_edge(6, 7, 1); g.add_edge(6, 8, 6); g.add_edge(7, 8, 7); // call dijkstra() with start node as 0 auto distance = g.dijkstra(0); for (int i = 1; i < distance.size(); ++i) { cout << "Distance from 0 to " << i << ": "; if (!g.path_exists(i)) cout << "No path exists."; else cout << g.get_shortest_distance(i); cout << endl; } // print the shortest path from 0 to 8 auto path = g.reconstruct_path(8); for (int node : path) { cout << node << " "; } cout << endl; return 0; } encoded 24 tokens in 0.418 seconds, speed: 57.356 t/s decoded 1546 tokens in 109.633 seconds, speed: 14.102 t/s n_draft = 75 n_predict = 1546 n_drafted = 2261 n_accept = 1263 accept = 55.860% draft: llama_print_timings: load time = 725.46 ms llama_print_timings: sample time = 3855.13 ms / 1 runs ( 3855.13 ms per token, 0.26 tokens per second) llama_print_timings: prompt eval time = 101.84 ms / 24 tokens ( 4.24 ms per token, 235.67 tokens per second) llama_print_timings: eval time = 47541.13 ms / 2529 runs ( 18.80 ms per token, 53.20 tokens per second) llama_print_timings: total time = 110050.98 ms target: llama_print_timings: load time = 2122.55 ms llama_print_timings: sample time = 501.15 ms / 1546 runs ( 0.32 ms per token, 3084.92 tokens per second) llama_print_timings: prompt eval time = 54614.28 ms / 2495 tokens ( 21.89 ms per token, 45.68 tokens per second) llama_print_timings: eval time = 2831.66 ms / 72 runs ( 39.33 ms per token, 25.43 tokens per second) llama_print_timings: total time = 110779.95 ms

Full log: speculative.139912996806656.log
It shows that n_draft never goes under 10 in this case.

As a comparison, this one doesn't include heuristic algorithm:

Outputs

// Dijkstra algorithm in C++ (4 spaces indentation + detailed comments) + sample usage: // 1. Add all nodes to the graph, and add edges between them with distances/weights. // 2. Call dijkstra(start_node) to get shortest paths from start_node to all other nodes. // It returns a map of <node, distance>. // 3. Use path_exists(node) to check if there is a path between start_node and node. // 4. Use get_shortest_distance(node) to get the shortest distance from start_node to node. // If no path exists, it returns -1. // 5. Use reconstruct_path(node) to get the shortest path from start_node to node. // It returns a vector of nodes that make up the path. #include <iostream> #include <vector> #include <map> #include <set> #include <queue> using namespace std; class Graph { public: struct Edge { int node, distance; }; // Adds a directed edge between "from" and "to" with the given "distance". void add_edge(int from, int to, int distance) { edges[from].push_back({to, distance}); } // Returns true if there is an edge between "from" and "to", false otherwise. bool has_edge(int from, int to) const { for (const Edge& e : edges.at(from)) { if (e.node == to) return true; } return false; } // Returns the distance between "from" and "to". If there is no edge, returns -1. int get_distance(int from, int to) const { for (const Edge& e : edges.at(from)) { if (e.node == to) return e.distance; } return -1; } // Returns a map of <node, distance> representing the shortest paths from "start_node" to all other nodes. map<int, int> dijkstra(int start_node) const { priority_queue<pair<int, int>, vector<pair<int, int>>, greater<pair<int, int>>> pq; // (distance, node) map<int, bool> visited; map<int, int> distances; // <node, distance> pq.push({0, start_node}); distances[start_node] = 0; while (!pq.empty()) { auto top = pq.top(); int node = top.second; int distance = top.first; pq.pop(); if (visited[node]) continue; // already visited this node visited[node] = true; // update distances of neighbors for (const Edge& edge : edges.at(node)) { int neighbor_node = edge.node; int new_distance = distance + edge.distance; if (!distances.count(neighbor_node) || distances[neighbor_node] > new_distance) { pq.push({new_distance, neighbor_node}); distances[neighbor_node] = new_distance; } } } return distances; } // Returns true if there is a path between "start_node" and "node", false otherwise. bool path_exists(int start_node, int node) const { map<int, bool> visited; queue<int> q; q.push(start_node); while (!q.empty()) { int current = q.front(); q.pop(); if (current == node) return true; visited[current] = true; for (const Edge& edge : edges.at(current)) { int neighbor_node = edge.node; if (!visited[neighbor_node]) q.push(neighbor_node); } } return false; } // Returns the shortest distance from "start_node" to "node". If there is no path, returns -1. int get_shortest_distance(int start_node, int node) const { map<int, bool> visited; queue<pair<int, int>> q; // (distance, node) q.push({0, start_node}); while (!q.empty()) { auto top = q.front(); int distance = top.first; int current = top.second; q.pop(); if (current == node) return distance; visited[current] = true; for (const Edge& edge : edges.at(current)) { int neighbor_node = edge.node; int new_distance = distance + edge.distance; if (!visited[neighbor_node]) q.push({new_distance, neighbor_node}); } } return -1; } // Returns the shortest path from "start_node" to "node". If there is no path, returns an empty vector. vector<int> reconstruct_path(int start_node, int node) const { map<int, bool> visited; queue<pair<int, int>> q; // (distance, node) q.push({0, start_node}); // parents[i] is the parent of i in the shortest path from start_node to i. map<int, int> parents; while (!q.empty()) { auto top = q.front(); int distance = top.first; int current = top.second; q.pop(); if (current == node) break; visited[current] = true; for (const Edge& edge : edges.at(current)) { int neighbor_node = edge.node; int new_distance = distance + edge.distance; if (!visited[neighbor_node]) { q.push({new_distance, neighbor_node}); parents[neighbor_node] = current; } } } vector<int> path; for (int n = node; n != start_node; n = parents.at(n)) { path.push_back(n); } path.push_back(start_node); reverse(path.begin(), path.end()); return path; } private: // Map of <node, list of edges> representing the graph. map<int, vector<Edge>> edges; }; int main() { Graph g; g.add_edge(0, 1, 2); g.add_edge(0, 3, 4); g.add_edge(1, 2, 5); g.add_edge(1, 3, 6); g.add_edge(2, 3, 7); g.add_edge(2, 4, 8); g.add_edge(3, 4, 9); map<int, int> distances = g.dijkstra(0); for (auto [node, distance] : distances) { cout << "Distance from 0 to " << node << ": " << distance << endl; } cout << boolalpha; cout << "Path exists between 0 and 4: " << g.path_exists(0, 4) << endl; cout << "Shortest distance from 0 to 4: " << g.get_shortest_distance(0, 4) << endl; vector<int> path = g.reconstruct_path(0, 4); for (int node : path) { cout << node << " "; } cout << endl; } encoded 24 tokens in 0.419 seconds, speed: 57.252 t/s decoded 2071 tokens in 126.799 seconds, speed: 16.333 t/s n_draft = 16 n_predict = 2071 n_drafted = 2400 n_accept = 1749 accept = 72.875% draft: llama_print_timings: load time = 723.91 ms llama_print_timings: sample time = 4027.20 ms / 1 runs ( 4027.20 ms per token, 0.25 tokens per second) llama_print_timings: prompt eval time = 101.86 ms / 24 tokens ( 4.24 ms per token, 235.61 tokens per second) llama_print_timings: eval time = 51436.25 ms / 2638 runs ( 19.50 ms per token, 51.29 tokens per second) llama_print_timings: total time = 127217.59 ms target: llama_print_timings: load time = 2103.87 ms llama_print_timings: sample time = 687.69 ms / 2071 runs ( 0.33 ms per token, 3011.53 tokens per second) llama_print_timings: prompt eval time = 67933.80 ms / 2687 tokens ( 25.28 ms per token, 39.55 tokens per second) llama_print_timings: eval time = 2321.62 ms / 58 runs ( 40.03 ms per token, 24.98 tokens per second) llama_print_timings: total time = 127945.07 ms

Full log: speculative.140657253044224.log

The models I am using are: codellama-7b.Q4_K_M.gguf and codellama-34b.Q4_K_M.gguf.

Huh, very strange. After ~140 tokens of identical results, the target model samples a different token with 100% probability:

What GPU backend do you use? Is this CUDA?

Edit: the 100% probability is actually expected since we are doing --top_k 1 sampling

Also, can you try instead of using --top_k 1, to use --temp -1 and see if the problem persists.

I am using cuBLAS backend, and I got same output after changing --top_k 1 to --temp -1.
With heuristic: speculative-a.140629255983104.log
Without heuristic: speculative-b.140469514129408.log

leng-yue · 2023-09-04T10:54:32Z

I don't understand - currently the entire sequence of drafted tokens is submitted as a single batch to the target model.

Thanks for clarifying! I initially thought the current setup created a sequence using N_PREV + N_DRAFT, rather than running batches of N_PREV, N_PREV + 1, N_PREV + 2, ..., N_PREV + N_DRAFT - 1 in parallel.

JohannesGaessler · 2023-09-04T13:02:10Z

I should mention that as of right now the CUDA code is very much not optimized for the use of draft models so it may be better to benchmark the performance using other backends. Also I would intuitively assume that given enough optimization there will be no benefit to using a draft shorter than 8 tokens because that is the minimum length to fully utilize tensor cores (as of right now tensor cores are not used at all).

ggerganov · 2023-09-04T13:31:51Z

I did the following experiment:

Run perplexity with the same input, but changing the batch size via the -b parameter.
Here are the results for the first few iterations on different backends:

# Q4_0 7B
# batch sizes: 16, 32, 64, 128, 256, 512

# Metal:

[1]4.3263,[2]4.8290,[3]5.4475,[4]6.0514,[5]6.1813,[6]6.0808,[7]6.2560,[8]6.3670,[9]6.7256,[10]6.9356
[1]4.3263,[2]4.8291,[3]5.4476,[4]6.0515,[5]6.1814,[6]6.0809,[7]6.2560,[8]6.3670,[9]6.7256,[10]6.9356
[1]4.3261,[2]4.8290,[3]5.4475,[4]6.0514,[5]6.1813,[6]6.0808,[7]6.2560,[8]6.3669,[9]6.7256,[10]6.9356
[1]4.3263,[2]4.8291,[3]5.4476,[4]6.0515,[5]6.1814,[6]6.0809,[7]6.2561,[8]6.3670,[9]6.7256,[10]6.9356
[1]4.3263,[2]4.8290,[3]5.4476,[4]6.0515,[5]6.1814,[6]6.0809,[7]6.2560,[8]6.3670,[9]6.7256,[10]6.9356
[1]4.3264,[2]4.8291,[3]5.4476,[4]6.0515,[5]6.1814,[6]6.0809,[7]6.2561,[8]6.3670,[9]6.7256,[10]6.9356

# CPU (M2, LLAMA_ACCELERATE=OFF):

[1]4.3233,[2]4.8256,[3]5.4456,[4]6.0456,[5]6.1772,[6]6.0762  # SIMD is off for n_batch = 16 (ggml_vec_dot_f16)
[1]4.3214,[2]4.8286,[3]5.4463,[4]6.0497,[5]6.1802,[6]6.0800
[1]4.3214,[2]4.8286,[3]5.4463,[4]6.0497,[5]6.1802,[6]6.0800
[1]4.3214,[2]4.8286,[3]5.4463,[4]6.0497,[5]6.1802,[6]6.0800
[1]4.3214,[2]4.8286,[3]5.4463,[4]6.0497,[5]6.1802,[6]6.0800
[1]4.3214,[2]4.8286,[3]5.4463,[4]6.0497,[5]6.1802,[6]6.0800

# CPU (M2, LLAMA_ACCELERATE=ON):

[1]4.3233,[2]4.8256,[3]5.4456,[4]6.0456,[5]6.1772
[1]4.3256,[2]4.8287,[3]5.4475,[4]6.0515,[5]6.1813
[1]4.3258,[2]4.8288,[3]5.4475,[4]6.0515,[5]6.1813
[1]4.3253,[2]4.8284,[3]5.4470,[4]6.0511,[5]6.1810
[1]4.3256,[2]4.8286,[3]5.4472,[4]6.0511,[5]6.1810
[1]4.3257,[2]4.8286,[3]5.4471,[4]6.0511,[5]6.1810

# CUDA:

[1]4.3283,[2]4.8268,[3]5.4451,[4]6.0526,[5]6.1871,[6]6.0874,[7]6.2609,[8]6.3685,[9]6.7238
[1]4.3329,[2]4.8348,[3]5.4534,[4]6.0545,[5]6.1855,[6]6.0867,[7]6.2617,[8]6.3744,[9]6.7305
[1]4.3303,[2]4.8109,[3]5.4355,[4]6.0431,[5]6.1755,[6]6.0727,[7]6.2414,[8]6.3526,[9]6.7111
[1]4.3264,[2]4.8292,[3]5.4521,[4]6.0559,[5]6.1865,[6]6.0894,[7]6.2580,[8]6.3652,[9]6.7194
[1]4.3666,[2]4.8513,[3]5.4581,[4]6.0586,[5]6.1911,[6]6.0899,[7]6.2577,[8]6.3674,[9]6.7188
[1]4.3307,[2]4.8364,[3]5.4609,[4]6.0671,[5]6.1965,[6]6.0940,[7]6.2651,[8]6.3749,[9]6.7282

The CUDA results are much more sensitive to the batch size. Any ideas why is that?

I'm also a bit surprised that the CPU results are not identical across batch size.
Where is this variation coming from?

In any case, this explains the effect that @leng-yue observes with CUDA speculative decoding.

JohannesGaessler · 2023-09-04T13:42:58Z

I don't know why the results vary depending on batch size. Unless I'm missing something the values for different tokens should not be interacting with each other.

ggerganov · 2023-09-04T13:50:20Z

I guess this op makes the graph not invariant to the batch size, since the accumulated values in the dot products depends on the n_batch

llama.cpp/llama.cpp

Line 2430 in bd33e5a

struct ggml_tensor * KQV = ggml_mul_mat(ctx0, V, KQ_soft_max);

Is this the only source of the variation and can we improve it somehow?

JohannesGaessler · 2023-09-04T14:12:56Z

Wait, now that I think about it the difference in perplexity may not come from a difference in logits after all. In the perplexity binary the tokens that are used for the calculation depend on batch size (because for the first few tokens there is no context to infer the tokens from). So to conclude that the batch size changes results we need to first control for that.

The tokens used depend on context size, not batch size.

ggerganov · 2023-09-04T15:27:42Z

CPU results are actually invariant to the batch size. I just forgot to disable LLAMA_ACCELERATE which makes ggml fallback to CBLAS for different sets of ops depending on the batch size. Building with LLAMA_ACCELERATE=OFF results in identical perplexity.

The question now is why the GPU results are not invariant.
Will open a separate issue later to continue the discussion there.

leng-yue · 2023-09-04T22:19:43Z

After changing to the CPU backend and disabling LLAMA_ACCELERATE (it likely doesn't work for me because I'm using Linux, not MacOS), I still encountered varying results.
With heuristic: speculative-c.140400117237568.log
Without heuristic: speculative-d.140252638918464.log

ggerganov · 2023-09-05T05:03:07Z

After changing to the CPU backend

If you limit n_draft to not go above 30, can you confirm the CPU results are identical with and without heuristic?

leng-yue · 2023-09-05T05:24:07Z

It's still non-identical. speculative-e.140190271293248.log

ggerganov · 2023-09-05T06:00:36Z

examples/speculative/speculative.cpp

+                    n_draft = std::max(2, n_draft - 1);
+                }
+            }
+


@leng-yue

I've refactored the implementation to be more contained.
Also, we were rewarding the draft even when it hasn't sampled all n_draft tokens, which does not seem correct.

For example, let's say n_draft is 16 currently, but the draft samples just 3 tokens because the "low-probability" check has been triggered. Even if all 3 tokens were accepted, we should not reward the draft model, because this is just a small part of what we asked it to do.

Regarding the reproducibility - we will study this more in #3014
My guess is that this behavior would occur even without the heuristic - probably it's just less likely to happen for some reason when the heuristic is disabled.

In my earlier implementation, rewards were given only when all tokens were accepted. So if only 3 out of 16 tokens are accepted, the n_tokens value would be decreased by 1.

Not exactly. This check does not guarantee you have n_draft tokens accepted:

llama.cpp/examples/speculative/speculative.cpp

Lines 146 to 148 in 98230ef

if (i_dft == (int) drafted.size()) {

all_accepted = true;

}

The reason is because drafted.size() <= n_draft due to another heuristic of not drafting more tokens if the drafter becomes "unsure":

llama.cpp/examples/speculative/speculative.cpp

Lines 192 to 196 in 98230ef

// too low probability, stop drafting

if (cur_p.data[0].p < 2*cur_p.data[1].p) {

break;

}

So in the majority of cases, when all_accepted == true, you would have accepted less than n_draft tokens. That's why your n_draft would increase so much even beyond 70 in some cases.

JohannesGaessler · 2023-09-05T06:54:00Z

After changing to the CPU backend and disabling LLAMA_ACCELERATE (it likely doesn't work for me because I'm using Linux, not MacOS), I still encountered varying results.

Should you be expecting the exact same results in the first place? If I understand the paper correctly they only guarantee equivalent results within hardware numerics. This is a direct quote from the paper:

Even with greedy sampling, a single token deviating due to numerics could result in two sequences
diverging wildly. Since pseudo-random seeds are processed differently between ArS and SpS, and
because the different computation graphs lead to different numerics, we cannot not expect identical
outputs. However, we expect the samples to come from the same distribution within numerics and
we empirically verify this by evaluating these benchmarks.

leng-yue · 2023-09-05T07:12:40Z

The non-identical issue is discussing in #3014, I think it's not an issue of heuristic algorithm.

ggerganov · 2023-09-05T07:15:26Z

Yes, if our results are different for different batch size, we cannot expect 100% reproducible when using speculative decoding with greedy sampling, because the target model processes the sequence in different batch sizes based on what the draft model provides.

The only source of numerical differences for different batch sizes that I currently see (and I'm still not 100% sure about it) is the KQV op because we are accumulating n_batch number of floats in the dot products, while in all other ops the number of elements being accumulated does not depend on n_batch. So I expect small variations due to this and even less variations with F32 cache. However, the GPU results seem a bit more variable than the expected numerical variation, so I think we should look more into it and understand where this is coming from.

Let's move the discussion to #3014

JohannesGaessler · 2023-09-05T07:20:04Z

If I understand the previous discussion correctly the issue is that when limiting the draft length the results change. My point is that this is to be expected independently from any potential bugs in the code that change results depending on batch size. @charliexchen can you weigh in on this?

charliexchen · 2023-09-05T09:11:47Z

Assuming magically precise and deterministic hardware, Spec Sampling should be identical for greedy target models (this is true even if the drafter is stochastic!). However, if the target model is stochastic then it will deterministically give a different result to normal sampling since it processes the pseudorandom numbers differently.

In practice, 8-16bit numerics are annoying and things will likely diverge after a few tokens. If you have any numeric deltas between batches, that will show up for both spec sampling and normal sampling.

leng-yue · 2023-09-14T11:29:05Z

Should we merge this PR then?

* Add heuristic algo for speculative * Constrain minimum n_draft to 2 * speculative : improve heuristic impl * speculative : be more rewarding upon guessing max drafted tokens * speculative : fix typos --------- Co-authored-by: Georgi Gerganov <[email protected]>

Add heuristic algo for speculative

98230ef

bobqianic reviewed Sep 4, 2023

View reviewed changes

ggerganov requested changes Sep 4, 2023

View reviewed changes

Constrain minimum n_draft to 2

9248528

ggerganov mentioned this pull request Sep 4, 2023

llama : understand why GPU results are different for different batch sizes #3014

Closed

ggerganov added 4 commits September 5, 2023 08:49

speculative : improve heuristic impl

dddd784

Merge branch 'master' into HEAD

4e6e951

speculative : be more rewarding upon guessing max drafted tokens

d9559b7

speculative : fix typos

b5efa62

ggerganov reviewed Sep 5, 2023

View reviewed changes

ggerganov approved these changes Sep 14, 2023

View reviewed changes

ggerganov merged commit 35f7304 into ggerganov:master Sep 14, 2023
26 checks passed

leng-yue deleted the add-speculative-heuristic branch September 15, 2023 04:46

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add heuristic algorithm for speculative #3006

Add heuristic algorithm for speculative #3006

leng-yue commented Sep 4, 2023

bobqianic Sep 4, 2023 •

edited

Loading

leng-yue Sep 4, 2023 •

edited

Loading

ggerganov commented Sep 4, 2023

ggerganov left a comment

ggerganov Sep 4, 2023

leng-yue Sep 4, 2023

leng-yue Sep 4, 2023

ggerganov Sep 4, 2023 •

edited

Loading

ggerganov Sep 4, 2023

leng-yue Sep 4, 2023

leng-yue commented Sep 4, 2023 •

edited

Loading

JohannesGaessler commented Sep 4, 2023

ggerganov commented Sep 4, 2023 •

edited

Loading

JohannesGaessler commented Sep 4, 2023

ggerganov commented Sep 4, 2023

JohannesGaessler commented Sep 4, 2023 •

edited

Loading

ggerganov commented Sep 4, 2023

leng-yue commented Sep 4, 2023

ggerganov commented Sep 5, 2023

leng-yue commented Sep 5, 2023

ggerganov Sep 5, 2023

leng-yue Sep 5, 2023

ggerganov Sep 5, 2023

leng-yue Sep 5, 2023

JohannesGaessler commented Sep 5, 2023

leng-yue commented Sep 5, 2023

ggerganov commented Sep 5, 2023

JohannesGaessler commented Sep 5, 2023

charliexchen commented Sep 5, 2023 •

edited

Loading

leng-yue commented Sep 14, 2023


	// too low probability, stop drafting
	if (cur_p.data[0].p < 2*cur_p.data[1].p) {
	break;
	}

Add heuristic algorithm for speculative #3006

Add heuristic algorithm for speculative #3006

Conversation

leng-yue commented Sep 4, 2023

bobqianic Sep 4, 2023 • edited Loading

Choose a reason for hiding this comment

leng-yue Sep 4, 2023 • edited Loading

Choose a reason for hiding this comment

ggerganov commented Sep 4, 2023

ggerganov left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ggerganov Sep 4, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

leng-yue commented Sep 4, 2023 • edited Loading

JohannesGaessler commented Sep 4, 2023

ggerganov commented Sep 4, 2023 • edited Loading

JohannesGaessler commented Sep 4, 2023

ggerganov commented Sep 4, 2023

JohannesGaessler commented Sep 4, 2023 • edited Loading

ggerganov commented Sep 4, 2023

leng-yue commented Sep 4, 2023

ggerganov commented Sep 5, 2023

leng-yue commented Sep 5, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

JohannesGaessler commented Sep 5, 2023

leng-yue commented Sep 5, 2023

ggerganov commented Sep 5, 2023

JohannesGaessler commented Sep 5, 2023

charliexchen commented Sep 5, 2023 • edited Loading

leng-yue commented Sep 14, 2023

bobqianic Sep 4, 2023 •

edited

Loading

leng-yue Sep 4, 2023 •

edited

Loading

ggerganov Sep 4, 2023 •

edited

Loading

leng-yue commented Sep 4, 2023 •

edited

Loading

ggerganov commented Sep 4, 2023 •

edited

Loading

JohannesGaessler commented Sep 4, 2023 •

edited

Loading

charliexchen commented Sep 5, 2023 •

edited

Loading