
JSON schema conversion: ⚡️ faster repetitions, min/maxLength for strings, cap number length #6555

Merged: 17 commits merged into ggerganov:master from json-faster-repetitions2 on Apr 12, 2024

Conversation

@ochafik (Collaborator) commented Apr 8, 2024

Another follow-up to #5978.

  • Bug fixes:

    • Grammars generated for the following schemas will no longer trigger combinatorial explosions during inference (also see llama : speed-up grammar sampling #4218 (comment); a sketch of the nested-repetition trick follows this list):

      • {"items": {"type": "number"}, "minItems": 10, "maxItems": 1000}: this used to hang forever, it's now running smoothly

        Show command

        Before:

         ./main -m --grammar-file \
           <( echo '{"items": {"type": "number"}, "minItems": 10, "maxItems": 100}' | \
           python examples/json-schema-to-grammar.py - \
         ) -p "List of 50 numbers"           
         > [0,1,2,3,4,5,6,7,8,9, <...hangs...>

        After (note that the Python script now uses underscores):

         ./main -m --grammar-file \
           <( echo '{"items": {"type": "number"}, "minItems": 10, "maxItems": 100}' | \
           python examples/json_schema_to_grammar.py - \
         ) -p "List of 50 numbers"           
         > [1234, 5678, 1010, 1111, 1212, 1313, 1414, 1515, 1616, 1717, 1818, 1919, 2020, 2121, 2222, 2323, 2424, 2525, 2626, 2727, 2828, 2929, 3030, 3131, 3232, 3333, 3434, 3535, 3636, 3737, 3838, 3939, 4040, 4141, 4242, 4343, 4444, 4545, 4646, 4747, 4848, 4949, 5050]
      • {"type": "string", "pattern": "^a{10,100}$"}

    • Numbers & integers now have a capped precision (JSON itself allows arbitrary precisions numbers but there's no point in exceeding JavaScript's - roughly 15 digits; zealous LLMs may otherwise generate an infinite sequence 0.33333333333... when prompted for "one third")

    • Allow null in untyped JSON objects

  • New features:

    • Support string length constraints: {"type": "string", "minLength": 10, "maxLength": 100}

    • The Python converter can now be imported more easily (underscored file name)
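
To make the repetition change concrete, here's a minimal sketch of the nested-repetition idea (illustrative only; `build_repetition` is a hypothetical name, not the PR's actual helper). Instead of emitting `max_items` independent optionals, the optional tail is nested so every partial match has exactly one derivation, and the tail is built iteratively so very large counts stay clear of Python's recursion limit. The same mechanism caps digit counts for numbers and enforces string `minLength`/`maxLength`:

```python
def build_repetition(item: str, min_items: int, max_items: int) -> str:
    """Return a GBNF expression matching `item` between min_items and max_items times.

    A flat expansion like `"a"? "a"? "a"?` lets a partial match be derived in
    combinatorially many ways during sampling; nesting the optional tail,
    `("a" ("a" ("a")?)?)?`, leaves exactly one derivation per prefix. The tail
    is built innermost-first in a loop, so huge max_items won't overflow the stack.
    """
    required = ' '.join([item] * min_items)
    tail = ''
    for _ in range(max_items - min_items):
        tail = f'({item} {tail})?' if tail else f'({item})?'
    return f'{required} {tail}'.strip()

# A pattern like "^a{2,4}$":
print(build_repetition('"a"', 2, 4))    # "a" "a" ("a" ("a")?)?
# Capping number precision at 15 digits is the same trick:
print(build_repetition('[0-9]', 1, 15))
```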

I've hopefully simplified the code by adding a simple dependency mechanism for primitive rules and by unifying all the repetition code.

I've also updated the GBNF doc to mention the performance gotchas, and documented the server's response_format parameter for schema-constrained JSON output.
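
As a hedged illustration of that last point (the payload shape follows this PR's server/README update; verify the field names against the current docs before relying on them), a schema-constrained chat completion against a locally running server could look like:

```python
import json
import urllib.request

# Assumes ./server is listening on localhost:8080.
payload = {
    "messages": [{"role": "user", "content": "Give me three primes."}],
    "response_format": {
        "type": "json_object",  # llama.cpp extension: also accepts a schema
        "schema": {"items": {"type": "integer"}, "minItems": 3, "maxItems": 3},
    },
}
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["choices"][0]["message"]["content"])
```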

github-actions bot (Contributor) commented Apr 8, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 450 iterations 🚀

Expand details (performance-related PRs only)
  • Concurrent users: 8, duration: 10m
  • HTTP request: avg=10489.27ms p(95)=26481.41ms fails=, finish reason: stop=394 truncated=56
  • Prompt processing (pp): avg=110.66tk/s p(95)=487.97tk/s
  • Token generation (tg): avg=26.2tk/s p(95)=36.03tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=json-faster-repetitions2 commit=9c33ee99302caac14c79f12c43e7a61462dc0730

[chart: llamacpp:prompt_tokens_seconds; llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 450 iterations]
[chart: llamacpp:predicted_tokens_seconds; llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 450 iterations]

[chart: llamacpp:kv_cache_usage_ratio; llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 450 iterations]
[chart: llamacpp:requests_processing; llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 450 iterations]

@ochafik ochafik marked this pull request as ready for review April 9, 2024 08:44
@ochafik ochafik changed the title JSON schema conversion: faster repetitions, min/maxLength for strings, cap number length JSON schema conversion: ⚡️ faster repetitions, min/maxLength for strings, cap number length Apr 9, 2024
@HanClinto (Collaborator) commented:
How much more effort would it be to benchmark these improvements?

@ochafik (Collaborator, Author) commented Apr 10, 2024

How much more effort would it be to benchmark these improvements?

@HanClinto It can quickly become an unfair fight depending on the schema and the model's generation choices (in some cases the speed may appear the same, but the worst case is now bounded).

If you expand the "Show command" drawer in the PR's description, the simple example I gave goes from being essentially stuck until the end of the universe to something very smooth (edit: still sluggish, but making interactive progress).

Here's how to benchmark any schema you'd like (using hyperfine):

git clone https://github.com/ochafik/llama.cpp --branch json-faster-repetitions2 llama.cpp-faster-rep
cd llama.cpp-faster-rep && git pull

echo '{"items": {"type": "number"}, "maxItems": 10}' > schema.json && \
  git checkout json-faster-repetitions2 && \
  python examples/json_schema_to_grammar.py schema.json > fast.grammar && \
  git checkout master && \
  python examples/json-schema-to-grammar.py schema.json > slow.grammar && \
  make clean && make -j LLAMA_CURL=1 main && \
  mkdir -p models/7B && \
  hyperfine --warmup 1 -L speed fast,slow './main -mu https://huggingface.co/NousResearch/Hermes-2-Pro-Mistral-7B-GGUF/resolve/main/Hermes-2-Pro-Mistral-7B.Q5_K_M.gguf --grammar-file {speed}.grammar -p "List of 10 numbers" --seed 1234'

# The warmup run will download the model; it will take a while & use up 5.2GB

It gives an 8x speedup for that specific seed & model (other values may not show improvements, or the master branch may time out).

Show output
Benchmark 1: ./main --grammar-file fast.grammar -p "List of 10 numbers" --seed 1234
  Time (mean ± σ):      2.645 s ±  0.058 s    [User: 1.113 s, System: 0.232 s]
  Range (min … max):    2.604 s …  2.800 s    10 runs
 
Benchmark 2: ./main --grammar-file slow.grammar -p "List of 10 numbers" --seed 1234
  Time (mean ± σ):     20.999 s ±  0.285 s    [User: 16.764 s, System: 2.612 s]
  Range (min … max):   20.656 s … 21.480 s    10 runs
 
Summary
  ./main --grammar-file fast.grammar -p "List of 10 numbers" --seed 1234 ran
    7.94 ± 0.20 times faster than ./main --grammar-file slow.grammar -p "List of 10 numbers" --seed 1234

Lemme know if you'd like me to test any specific kind of schema.

Collaborator:

Not sure I'm on board with the rename to use underscores -- while there are a few other files with underscores (such as pydantic_models_to_grammar.py), most seem to use hyphens (pydantic-models-to-grammar-examples.py, etc.), and it seems like the old filename might be better?

Owner:

Originally I wanted all filenames in the repo to use hyphens, but I later found out that Python does not work well with hyphens in filenames (e.g. I think you cannot import a Python file whose name contains hyphens). So I think it's better to eventually rename all Python files to use underscores in their filenames.
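
For context on the hyphen problem: a hyphen isn't valid in a Python identifier, so `import json-schema-to-grammar` is a SyntaxError, and the only way to load such a file is importlib:

```python
import importlib.util

# `import json-schema-to-grammar` fails to even parse; the importlib escape
# hatch (using the pre-rename path from this repo) looks like this:
spec = importlib.util.spec_from_file_location(
    "json_schema_to_grammar", "examples/json-schema-to-grammar.py")
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)
```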

Collaborator Author:

Tbh I did all this as a prerequisite for #6389, in which I need to import the converter from Python. I also found out llama-cpp-python inlines that file in their codebase, since it's hard / not trivial to import (short of using importlib, which feels dirty).

}

RESERVED_NAMES = set(["root", *PRIMITIVE_RULES.keys(), *DATE_RULES.keys()])
DOTALL = '[\\U00000000-\\U0010FFFF]'
DOT = '[\\U00000000-\\x09\\x0B\\x0C\\x0E-\\U0010FFFF]'
Collaborator:

Not sure whether it would be more performant or not, but I'm curious if:
DOT = '[^\\x0A\\x0D]'
would be easier / faster to process.

Collaborator Author:

Updated to simpler negative range, thanks!!
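
For the record, a quick brute-force check (an editor's illustration) that the simplified negated class accepts exactly the same characters as the original two-range union:

```python
# Original DOT: [\U00000000-\x09\x0B\x0C\x0E-\U0010FFFF]
def old_dot(c: str) -> bool:
    return ('\x00' <= c <= '\x09') or c in '\x0B\x0C' or ('\x0E' <= c <= '\U0010FFFF')

# Simplified DOT: [^\x0A\x0D], i.e. anything but LF and CR
def new_dot(c: str) -> bool:
    return c not in '\x0A\x0D'

assert all(old_dot(chr(i)) == new_dot(chr(i)) for i in range(0x110000))
```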

@@ -89,3 +89,13 @@ This guide provides a brief overview. Check out the GBNF files in this directory
```
./main -m <model> --grammar-file grammars/some-grammar.gbnf -p 'Some prompt'
```

## Troubleshooting
Collaborator:

I like this section. After this gets merged in, I'll write a section on the dangers of left-recursion.

Collaborator Author:

We should probably also document the json->grammar converters here; I'll send that separately.

@ochafik (Collaborator, Author) commented Apr 12, 2024

After (#6609 &) @HanClinto's ✨ magical #6616 ✨, this PR is still required, and we can dramatically increase the # of repetitions without impacting sampling performance.

At 200 repetitions the PR is 18x faster (switched to phi-2), and from 500 reps master is astronomically slow. Since the bottleneck then became the finite stack in the recursive repetition-rule generator, I've rewritten it iteratively, and we can now go to 10k repetitions smoothly 🤯 (at 100k the C++ server segfaults, which I suggest we keep as a follow-up investigation).

Show benchmark commands for 10k reps
echo '{"items": {"type": "number"}, "maxItems": 10000}' > schema.json && \
  git checkout json-faster-repetitions2 && \
  python examples/json_schema_to_grammar.py schema.json > fast.grammar && \
  git checkout master && \
  python examples/json-schema-to-grammar.py schema.json > slow.grammar && \
  make clean && make -j LLAMA_CURL=1 main && \
  mkdir -p models/7B && \
  hyperfine --warmup 1 -L speed fast,slow './main -mu https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf --grammar-file {speed}.grammar -p "List of 10 numbers" --seed 1234'

@HanClinto (Collaborator) left a review:

Looks good to me! Great work on this PR -- this is really stellar work!

The only other suggestion I can think of is that it might be worth adding integration tests to compare the output of the Python json_schema_to_grammar.py vs. json-schema-to-grammar.cpp -- but I'm not sure we care enough about lockstep equivalency to make wrapping it worthwhile.

Overall this looks great, and I'm very impressed with it -- GREAT work on all of this @ochafik!

@ochafik (Collaborator, Author) commented Apr 12, 2024

it might be worth adding integration tests to compare the output of the Python json_schema_to_grammar.py vs. json-schema-to-grammar.cpp -- but I'm not sure we care enough about lockstep equivalency to make wrapping it worthwhile.

Fully agree, so much so that I've already done this in #5978 :-D: https://github.com/ggerganov/llama.cpp/blob/master/tests/test-json-schema-to-grammar.cpp (it also tests the JS version)

Overall this looks great, and I'm very impressed with it -- GREAT work on all of this @ochafik !

Thank you so much for your help, ideas & proactive reviews! (and your own speedups) Love this team work 👍

@ochafik ochafik merged commit ab9a324 into ggerganov:master Apr 12, 2024
53 of 59 checks passed
@HanClinto (Collaborator) commented:

Fully agree, so much so that I've already done this in #5978 :-D: https://github.com/ggerganov/llama.cpp/blob/master/tests/test-json-schema-to-grammar.cpp (it also tests the JS version)

haha -- very nicely done. 😄

Again, awesome job -- super happy to see this merged in!

tybalex pushed a commit to rubra-ai/tools.cpp that referenced this pull request on Apr 17, 2024:

JSON schema conversion: ⚡️ faster repetitions, min/maxLength for strings, cap number length (ggerganov#6555)

* json: rename python schema converter to make import easier

* server: skip null json_schema / grammar fields

* json: deps management for primitive rules (+ allow null values)

* json: optimize repetitions for minItems/maxItems and regexps: `a{,3}` goes from `"a"? "a"? "a"?` (explosive combos) to `(a (a (a)?)?)?`

* grammars: add troubleshooting section to readme

* json: cap length of numbers to 15 digits before/after decimal point

(avoids infinite gen, e.g. "one third" -> `0.333333333333...`)

* json: unify all repetition code (w/ or w/o sep)

* json: support string minLength/maxLength

* server+json: update server/README w/ result_format

* nits

* json: fix type error w/ python 3.8

* json: fix server/README (json_schema in /completion vs. result_format in /v1/chat/completions)

* json: simplify DOT `{"type": "string", "pattern": "^.$"}`

* json: remove recursion in opt_repetitions (avoids Python stack overflow)

* json: rm dead code

* json: rm useless assert & ggml.h import