Grammar optimization: eliminate redundant grammar trees (~4x faster grammar sampling) #6616

Merged
HanClinto merged 1 commit into ggerganov:master from gbnf-optimize-ambiguity2 on Apr 12, 2024

Conversation

Collaborator

@HanClinto HanClinto commented Apr 11, 2024

tl;dr

~4x-5x speedup on processing complex grammars.

Previously discussed in #4218 (comment)

Motivation:

The grammar stacks have a tendency to explode exponentially in the case of redundant ambiguities (see #4218 (comment)). In these cases, the grammar engine winds up duplicating effort by evaluating the feasibility of multiple redundant copies of grammars, even though they are equivalent.

Fix:

After each token, the grammar engine builds stacks representing the possible directions the parser could go. When building these stacks, this change takes care not to add new grammars to the stack if they are already there. Although std::find() is an O(N) operation, the savings gained from not adding trees of redundant grammars vastly outweigh the cost of this minor check.
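
To illustrate the idea, here is a minimal, self-contained sketch rather than the actual llama.cpp code. It assumes the candidates are held as a list of stacks of element pointers; `grammar_element`, `grammar_stack`, `push_unique`, and `count_stacks` are hypothetical names made up for this example. It models a chain of two-way choices whose branches lead to the same continuation, and counts how many stacks survive with and without the std::find() check.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the engine's types (illustrative only).
struct grammar_element { int type; int value; };
using grammar_stack = std::vector<const grammar_element *>;

// Append `stack` only if an identical stack is not already in `stacks`.
// Returns true when the stack was added. The lookup is a linear scan.
bool push_unique(std::vector<grammar_stack> & stacks, const grammar_stack & stack) {
    if (std::find(stacks.begin(), stacks.end(), stack) != stacks.end()) {
        return false; // duplicate: expanding it again would only repeat work
    }
    stacks.push_back(stack);
    return true;
}

// Model `depth` chained ambiguities: at each step every stack splits into
// two alternatives that resolve to the same continuation.
std::size_t count_stacks(int depth, bool dedupe) {
    static const grammar_element next = {0, 'a'};
    grammar_stack initial;
    initial.push_back(&next);
    std::vector<grammar_stack> stacks;
    stacks.push_back(initial);

    for (int i = 0; i < depth; ++i) {
        std::vector<grammar_stack> expanded;
        for (const grammar_stack & s : stacks) {
            for (int alt = 0; alt < 2; ++alt) {
                if (dedupe) {
                    push_unique(expanded, s);
                } else {
                    expanded.push_back(s);
                }
            }
        }
        stacks = std::move(expanded);
    }
    return stacks.size();
}
```

In this toy model, the count doubles at every step without the check and stays at one with it, which is the kind of blow-up the chained ambiguity test exercises.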

Results:

Integration Benchmark

I expanded the grammar integration tests to add some crude timing metrics.

Before:
./tests/test-grammar-integration
Expected error:  parse: error parsing grammar: Undefined rule identifier 'numero'
End of expected error. Test successful.

Timings:
Simple grammar: 18 us
Complex grammar: 939 us
Chained ambiguity: 238687 us
Chained ambiguity (grouped): 45 us
Failure missing root: 5 us
Failure missing reference: 188 us
After:
./tests/test-grammar-integration
Expected error:  parse: error parsing grammar: Undefined rule identifier 'numero'
End of expected error. Test successful.

Timings:
Simple grammar: 38 us
Complex grammar: 1032 us
Chained ambiguity: 71 us
Chained ambiguity (grouped): 19 us
Failure missing root: 4 us
Failure missing reference: 156 us

Note the significant improvement in the chained ambiguity case (238687 us down to 71 us), which is what this PR was targeting most directly. Most of the other speed differences seem to be within the noise floor, and aren't consistently faster or slower one way or the other.

"Full" benchmark

Thanks to @ochafik for teaching me how to use Hyperfine, here are the results of the benchmark cribbed from #6609 for a ~9x speedup vs. master:

Hyperfine Results
Benchmark 1: ./main \
        -mu https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf \
        --grammar-file json_numbers.grammar \
        -p "List of 20 integers starting from 0" \
        --seed 12344 (branch = master)
  Time (mean ± σ):     19.028 s ±  0.887 s    [User: 11.001 s, System: 5.890 s]
  Range (min … max):   18.092 s … 20.205 s    5 runs

Benchmark 2: ./main \
        -mu https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf \
        --grammar-file json_numbers.grammar \
        -p "List of 20 integers starting from 0" \
        --seed 12344 (branch = gbnf-optimize-ambiguity2)
  Time (mean ± σ):      2.095 s ±  0.051 s    [User: 0.861 s, System: 0.143 s]
  Range (min … max):    2.058 s …  2.175 s    5 runs

Summary
  ./main \
        -mu https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf \
        --grammar-file json_numbers.grammar \
        -p "List of 20 integers starting from 0" \
        --seed 12344 (branch = gbnf-optimize-ambiguity2) ran
    9.08 ± 0.48 times faster than ./main \
        -mu https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf \
        --grammar-file json_numbers.grammar \
        -p "List of 20 integers starting from 0" \
        --seed 12344 (branch = master)

It's interesting to me that even though this test's grammar seeks to remove redundant ambiguities, this PR still offers enough improvement to be quite significant.

Contributor

github-actions bot commented Apr 11, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 448 iterations 🚀

Expand details for performance related PR only
  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=10513.65ms p(95)=27209.89ms fails=, finish reason: stop=396 truncated=52
  • Prompt processing (pp): avg=116.48tk/s p(95)=513.75tk/s
  • Token generation (tg): avg=28.08tk/s p(95)=36.3tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=gbnf-optimize-ambiguity2 commit=80553e53c075a2d4e1936735eb06368eba5f599d

Charts (llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 448 iterations; x-axis 1712865211 --> 1712865839; data series omitted):
  • llamacpp:prompt_tokens_seconds
  • llamacpp:predicted_tokens_seconds
  • llamacpp:kv_cache_usage_ratio
  • llamacpp:requests_processing

@HanClinto HanClinto force-pushed the gbnf-optimize-ambiguity2 branch from 3bbceb5 to 80553e5 Compare April 11, 2024 19:09
Collaborator Author

HanClinto commented Apr 11, 2024

Was getting cross-platform build errors on the new benchmark reporting that I put into test-grammar-integration, so removed it for now. The tests (along with my "cool" benchmarking utility functions) are still available to view in 490d06f, and maybe I'll figure out the %lld vs. %ld fprintf() issue at some point to bring it back in.
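
For reference, one common cause of %lld vs. %ld mismatches is that the integer type behind a std::chrono tick count differs between platforms. The sketch below assumes that is what happened here (the actual build errors are not shown in this thread) and sidesteps it by casting the count to `long long` before formatting. `time_us` and `format_timing` are hypothetical helpers, not the ones from 490d06f.

```cpp
#include <chrono>
#include <cstdio>
#include <string>

// Run `fn` once and return the elapsed time in microseconds.
// The explicit cast pins the result to `long long` on every platform,
// so "%lld" is always the matching format specifier.
template <typename Fn>
long long time_us(Fn && fn) {
    const auto start = std::chrono::steady_clock::now();
    fn();
    const auto end = std::chrono::steady_clock::now();
    return static_cast<long long>(
        std::chrono::duration_cast<std::chrono::microseconds>(end - start).count());
}

// Format one line in the style of the timing output above.
std::string format_timing(const char * label, long long us) {
    char buf[128];
    std::snprintf(buf, sizeof(buf), "%s: %lld us", label, us);
    return std::string(buf);
}
```

A call such as `format_timing("Simple grammar", time_us([] { /* work under test */ }))` would then produce a line in the same shape as the timings listed earlier.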

Regardless, taking out the changes to the integration tests makes this a much cleaner change, and hopefully easier to review.

Also rebased on top of latest master to take advantage of speed improvements in #6609, and the speed improvements from this PR are still there. Benching with 10 iterations showed a roughly 4x-5x speedup:

`gbnf-optimize-ambiguity2` ran 5.41 ± 0.23 times faster than `master`
Benchmark 1: ./main \
        -mu https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf \
        --grammar-file json_numbers.grammar \
        -p "List of 20 integers starting from 0" \
        --seed 12344 (branch = master)
  Time (mean ± σ):     11.198 s ±  0.372 s    [User: 8.113 s, System: 1.386 s]
  Range (min … max):   10.675 s … 11.607 s    10 runs

Benchmark 2: ./main \
        -mu https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf \
        --grammar-file json_numbers.grammar \
        -p "List of 20 integers starting from 0" \
        --seed 12344 (branch = gbnf-optimize-ambiguity2)
  Time (mean ± σ):      2.070 s ±  0.053 s    [User: 0.845 s, System: 0.119 s]
  Range (min … max):    2.026 s …  2.213 s    10 runs

Summary
  ./main \
        -mu https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf \
        --grammar-file json_numbers.grammar \
        -p "List of 20 integers starting from 0" \
        --seed 12344 (branch = gbnf-optimize-ambiguity2) ran
    5.41 ± 0.23 times faster than ./main \
        -mu https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf \
        --grammar-file json_numbers.grammar \
        -p "List of 20 integers starting from 0" \
        --seed 12344 (branch = master)

Edit: Was accidentally comparing against old master, so was mistakenly adding the improvements from #6609 to my numbers, instead of correctly separating them. After I updated to use latest head, improvements dropped from ~9x to ~5x. Still good, but not that good. :)

Overall I'm feeling optimistic about this PR, but would love your review whenever you have time, @ochafik

@HanClinto HanClinto changed the title Grammar Optimization: Eliminate Redundant Grammar Trees Grammar optimization: eliminate redundant grammar trees (~9x faster grammar sampling) Apr 11, 2024
@HanClinto HanClinto changed the title Grammar optimization: eliminate redundant grammar trees (~9x faster grammar sampling) Grammar optimization: eliminate redundant grammar trees (~4x faster grammar sampling) Apr 11, 2024
Collaborator

ochafik commented Apr 12, 2024

This is amazing (getting 8.5x speedup on my M2 Max), and such a small change, great catch!

Collaborator

@ochafik ochafik left a comment

Looks great to me!!

@HanClinto HanClinto merged commit 04a5ac2 into ggerganov:master Apr 12, 2024
61 of 62 checks passed
tybalex pushed a commit to rubra-ai/tools.cpp that referenced this pull request Apr 17, 2024