Add Jinja template support #11016

Merged
merged 47 commits into from
Jan 21, 2025

@ochafik ochafik commented Dec 30, 2024

Subset of #9639 with just the Jinja templating support.

Proper tool support (grammar constraints, lazy grammar triggering, tool call parsing & stop reason) will come in a follow-up PR.

  • Copies minja.hpp & chat-template.hpp from google/minja (created for this 😅) at this commit
  • Adds --jinja flag to llama-server, llama-cli, llama-run
  • Adds --chat-template-file flag to llama-server, llama-cli (related: Added chat template support to llama-run #11215 )
  • Loads tokenizer.chat_template (or tokenizer.chat_template.tool_use if defined, only when the request has tools).
  • Dual testing in test-chat-template.cpp of the legacy ad hoc templating & the Jinja route. Wherever the expected outputs diverge, the Jinja expectations should be more correct (note that templates are run w/ trim_blocks = true, lstrip_blocks = true)
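
The template loading rule in the list above (use tokenizer.chat_template.tool_use only when it is defined and the request has tools) can be sketched roughly as follows. This is an illustrative Python sketch, not the C++ code from this PR; `select_chat_template` and the plain dict standing in for the model metadata are assumptions made for the example.

```python
from typing import Optional


def select_chat_template(metadata: dict, request_has_tools: bool) -> Optional[str]:
    """Pick a template source following the rule described in the PR text.

    `metadata` is a hypothetical dict standing in for the model's key/value
    metadata; the real lookup happens in C++.
    """
    default = metadata.get("tokenizer.chat_template")
    tool_use = metadata.get("tokenizer.chat_template.tool_use")
    # The tool_use variant is used only if it exists AND the request has tools.
    if request_has_tools and tool_use is not None:
        return tool_use
    return default


meta = {
    "tokenizer.chat_template": "default template",
    "tokenizer.chat_template.tool_use": "tool-use template",
}
print(select_chat_template(meta, request_has_tools=True))   # tool-use template
print(select_chat_template(meta, request_has_tools=False))  # default template
```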

Example usage:

# Launch in background
./build/bin/llama-server \
  -hfr bartowski/Qwen2.5-7B-Instruct-GGUF \
  -hff Qwen2.5-7B-Instruct-Q4_K_M.gguf \
  --jinja &

curl http://localhost:8080/v1/chat/completions \
  -d '{
    "model": "gpt-3.5-turbo",
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "ipython",
          "description": "Runs code in an ipython interpreter and returns the result of the execution after 60 seconds.",
          "parameters": {
            "type": "object",
            "properties": {
              "code": {
                "type": "string",
                "description": "The code to run in the ipython interpreter."
              }
            },
            "required": ["code"]
          }
        }
      }
    ],
    "messages": [
      {
        "role": "user",
        "content": "Print a hello world message with python (using single quotes '"'"' for strings)."
      }
    ]
  }'
Output:
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "<tool_call>\n{\"name\": \"ipython\", \"arguments\": {\"code\": \"print('Hello world!')\"}}\n</tool_call>",
        "role": "assistant"
      }
    }
  ],
  "created": 1736811609,
  "model": "gpt-3.5-turbo",
  "system_fingerprint": "b4494-a57bb94e",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 25,
    "prompt_tokens": 205,
    "total_tokens": 230
  },
  "id": "chatcmpl-5YJXFVhvjoMDlLx1asuWNdSO3JVWWsUF",
  "timings": {
    "prompt_n": 1,
    "prompt_ms": 155.151,
    "prompt_per_token_ms": 155.151,
    "prompt_per_second": 6.445333900522716,
    "predicted_n": 25,
    "predicted_ms": 419.714,
    "predicted_per_token_ms": 16.78856,
    "predicted_per_second": 59.56437002339688
  }
}
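
Since tool call parsing is deferred to a follow-up PR, the tool call in the output above arrives as plain text inside `content`. Until then, a client could extract it itself. The following is a minimal Python sketch that assumes the `<tool_call>...</tool_call>` wrapping seen in this particular response (other templates may use other conventions); `extract_tool_calls` is a hypothetical helper, not part of llama.cpp.

```python
import json
import re

# Captures whatever sits between the tags; DOTALL lets "." span newlines.
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def extract_tool_calls(content: str) -> list:
    """Return the JSON payload of every <tool_call> block found in `content`."""
    return [json.loads(payload) for payload in TOOL_CALL_RE.findall(content)]


# The "content" field from the response above, after JSON decoding.
content = (
    "<tool_call>\n"
    '{"name": "ipython", "arguments": {"code": "print(\'Hello world!\')"}}\n'
    "</tool_call>"
)
calls = extract_tool_calls(content)
print(calls[0]["name"])               # ipython
print(calls[0]["arguments"]["code"])  # print('Hello world!')
```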

TODO:

  • Add cross-testing in test-chat-template.cpp (note that minja is tested against a lot of templates in its own repo)
  • Add some instructions here
  • Add more server tests to exercise the template overrides.

@github-actions github-actions bot added script Script related examples python python script changes server labels Dec 30, 2024
@ericcurtin
Collaborator

Feel free to add the option to llama-run for basic testing also @ochafik

@github-actions github-actions bot added the testing Everything test related label Jan 13, 2025

namespace minja {

class chat_template {
Collaborator

One idea to be able to #include "chat-template.hpp" in main is to forward declare json here without #include <json.hpp>, and only define the prototype of class chat_template here. Then we will need a new file chat-template.cpp that holds the actual implementation, including #include <json.hpp>

Collaborator

(Not sure if this even works, but we can do in another PR, just noting my idea here)

Collaborator Author

I was hoping to keep minja header-only for now, but happy to explore options as a follow-up :-)

Collaborator

@ngxson left a comment

Impressive work, thanks! Let's wait for @ggerganov to do another pass, then I think it's good to go!

@@ -4,22 +4,26 @@

server = ServerPreset.tinyllama2()


-@pytest.fixture(scope="module", autouse=True)
+@pytest.fixture(autouse=True)
Collaborator

I'm not exceptionally good at pytest so maybe I'm missing something. Could you explain why scope="module" is removed here?

Collaborator Author

scope="module" was making the ServerProcess server instance shared between all the tests in the module (file). Even though it's stopped in stop_server_after_each_test, it carried previous settings over to subsequent tests, spilling server flags over (this became more important w/ the jinja & chat_template params)

Collaborator

Ok, thanks for the explanation. Seems like scope="module" is not what I wanted. I want the fixture to only affect a single file, since the idea is that one test unit uses one model
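
The state leak discussed in this thread can be illustrated without pytest. The sketch below is a simplified Python model of the situation, not the actual test code: `FakeServer` is a hypothetical stand-in for ServerProcess, and the loop plays the role of two consecutive tests, the first of which enables jinja.

```python
class FakeServer:
    """Hypothetical stand-in for the test suite's ServerProcess."""

    def __init__(self):
        self.jinja = False
        self.chat_template = None


def run_two_tests(fresh_server_per_test: bool) -> list:
    """Run a 'jinja' test followed by a 'default' test and report the jinja
    flag each test ends up running with."""
    shared = FakeServer()  # roughly what a module-scoped fixture hands out
    observed = []
    for wants_jinja in (True, False):
        server = FakeServer() if fresh_server_per_test else shared
        if wants_jinja:
            server.jinja = True  # the first test opts in to --jinja
        observed.append(server.jinja)
    return observed


print(run_two_tests(fresh_server_per_test=False))  # [True, True]
print(run_two_tests(fresh_server_per_test=True))   # [True, False]
```

With a shared instance the second test silently inherits the flag set by the first; recreating the instance per test keeps each test's settings isolated.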

]
)
-def test_chat_completion(model, system_prompt, user_prompt, max_tokens, re_content, n_prompt, n_predicted, finish_reason):
+def test_chat_completion(model, system_prompt, user_prompt, max_tokens, re_content, n_prompt, n_predicted, finish_reason, jinja, chat_template):
Collaborator

TODO: we can also add a "slow" test that can test tool call with a big model like Hermes or Qwen (see an example in test_infill.py). I'll have a look in the next few days.

Collaborator Author

@ochafik commented Jan 20, 2025

hehe absolutely, this is coming in #9639 or another subpart of it (tool call parsing + conditional grammars)

https://github.com/ochafik/llama.cpp/blob/76893f588019ba09c5f4726a97994ffac91ecf34/examples/server/tests/unit/test_chat_completion.py#L300

Collaborator Author

btw I wonder if there's any reason to override the LLAMA_CACHE to tmp in server tests? I've been struggling with disk space on my MBP 😅

Collaborator

It's mostly to provide a way to isolate tests if a user has multiple clones of the llama.cpp source code on the machine. Maybe you can symlink that tmp directory to external storage?

common/minja.hpp Outdated
Comment on lines 41 to 48
static std::string normalize_newlines(const std::string & s) {
#ifdef _WIN32
static const std::regex nl_regex("\r\n");
return std::regex_replace(s, nl_regex, "\n");
#else
return s;
#endif
}
Collaborator

Not sure what the original purpose of this was, but I think it can be removed, along with the definition of ENDL as \r\n on win32. It shouldn't make a difference with stringstream.

Collaborator Author

Dropped ENDL + 1 usage of this function (at end of rendering; one is still needed to shield the parser from CRLFs), thanks!

@ngxson
Collaborator

ngxson commented Jan 21, 2025

Small thing to note is that some jinja templates are not "linear", meaning each conversation turn is not self-contained, but can modify the content before it.

For example, the new deepseek-r1 distilled models have {% set content = content.split('</think>')[-1] %} to remove the thinking process from the conversation history. I also once saw a template that adds an EOS token after each formatted chat, which also breaks this logic.

The consequence is that it will break common_chat_format_single (used in llama-cli) and apply_chat_template (used by llama-run), since they assume that each new message is self-contained (i.e. an addition, not a modification).

A solution is to also track the cached tokens at the token level (not the conversation level), which I introduced in #11203. @ericcurtin feel free to port this to llama-run if you want. This approach is similar to the server implementation.
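
To make the "non-linear template" problem concrete, here is a small Python sketch using toy renderers. These are not real chat templates and not llama.cpp code; all function names and the `<role>...</role>` format are hypothetical. It checks whether rendering the first k messages is always a prefix of rendering k+1 messages, which is roughly the assumption an incremental formatter relies on.

```python
def render_linear(messages):
    """Toy template: each turn is rendered independently of the others."""
    return "".join(f"<{m['role']}>{m['content']}</{m['role']}>\n" for m in messages)


def render_strip_think(messages):
    """Toy template that, like the deepseek-r1 snippet quoted above, drops
    everything up to '</think>' from assistant turns."""
    out = []
    for m in messages:
        content = m["content"]
        if m["role"] == "assistant":
            content = content.split("</think>")[-1]
        out.append(f"<{m['role']}>{content}</{m['role']}>\n")
    return "".join(out)


def render_trailing_eos(messages):
    """Toy template that appends an end marker after the whole chat."""
    return render_linear(messages) + "<eos>"


def is_prefix_stable(render, messages):
    """True if rendering k messages is always a prefix of rendering k+1."""
    return all(
        render(messages[: k + 1]).startswith(render(messages[:k]))
        for k in range(1, len(messages))
    )


chat = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "<think>greet back</think>Hello!"},
    {"role": "user", "content": "Bye"},
]

print(is_prefix_stable(render_linear, chat))        # True
print(is_prefix_stable(render_trailing_eos, chat))  # False

# The text the model actually saw and produced still contains the thinking,
# so that cached text is not a prefix of the re-rendered history.
cached = render_linear(chat[:2])
print(render_strip_think(chat).startswith(cached))  # False
```

Comparing at the token level, as suggested above, sidesteps the assumption: only the longest common token prefix between the cache and the fresh render is reused.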

@ochafik ochafik merged commit 6171c9d into ggerganov:master Jan 21, 2025
47 checks passed
@ochafik
Collaborator Author

ochafik commented Jan 21, 2025

Thanks everyone for the insightful reviews! More from #9639 to come soon :-)

@fairydreaming
Collaborator

Not sure if this is a special case or the template is broken, but when I load minimax-text-01 (my work-in-progress) with the following template:

{% for message in messages %}{% if message['role'] == 'system' %}{{ '<beginning_of_sentence>system ai_setting=assistant\\n' + message['content'][0]['text'] + '<end_of_sentence>\\n'}}{% elif message['role'] == 'user' %}{{ '<beginning_of_sentence>user name=user\\n' + message['content'][0]['text'] + '<end_of_sentence>\\n'}}{% elif message['role'] == 'assistant' %}{{ '<beginning_of_sentence>ai name=assistant\\n' }}{% for content in message['content'] | selectattr('type', 'equalto', 'text') %}{% generation %}{{ content['text'] }}{% endgeneration %}{% endfor %}{{ '<end_of_sentence>\\n' }}{% endif %}{% endfor %}{% if add_generation_prompt %}{{ '<beginning_of_sentence>ai name=assistant\\n' }}{% endif %}

with this PR llama.cpp crashes during model loading:

terminate called after throwing an instance of 'std::runtime_error'
  what():  Expected block keyword at row 1, column 492:
{% for message in messages %}{% if message['role'] == 'system' %}{{ '<beginning_of_sentence>system ai_setting=assistant\n' + message['content'][0]['text'] + '<end_of_sentence>\n'}}{% elif message['role'] == 'user' %}{{ '<beginning_of_sentence>user name=user\n' + message['content'][0]['text'] + '<end_of_sentence>\n'}}{% elif message['role'] == 'assistant' %}{{ '<beginning_of_sentence>ai name=assistant\n' }}{% for content in message['content'] | selectattr('type', 'equalto', 'text') %}{% generation %}{{ content['text'] }}{% endgeneration %}{% endfor %}{{ '<end_of_sentence>\n' }}{% endif %}{% endfor %}{% if add_generation_prompt %}{{ '<beginning_of_sentence>ai name=assistant\n' }}{% endif %}
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           ^

@ochafik
Collaborator Author

ochafik commented Jan 21, 2025

> Not sure if this is a special case or the template is broken, but when I load minimax-text-01 (my work-in-progress) with the following template:


Hey @fairydreaming, thanks for testing & reporting! Your template contains an exotic {% generation %}...{% endgeneration %} syntax that doesn't seem to be supported by, say, this online jinja parser either.

> terminate called after throwing an instance of 'std::runtime_error'
> what(): Expected block keyword at row 1, column 492:

I could certainly make the error more informative though, feel free to file something on https://github.com/google/minja to that end (and/or any feature request).

Looking forward to testing your model, good luck with it!

@fairydreaming
Collaborator

@ochafik I did some research and it seems to be a custom keyword introduced in HF transformers: huggingface/transformers#30650

Fortunately among all the models I have currently on disk only MiniMax-Text-01 uses this.

@ochafik
Collaborator Author

ochafik commented Jan 22, 2025

> @ochafik I did some research and it seems to be a custom keyword introduced in HF transformers: huggingface/transformers#30650
>
> Fortunately among all the models I have currently on disk only MiniMax-Text-01 uses this.

@fairydreaming thanks for researching that, will track support in google/minja#28
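
Until such keywords are handled, a quick pre-check can show which block keywords in a template are likely to trip a parser. The Python sketch below is only an illustration: `KNOWN_BLOCK_KEYWORDS` is a hypothetical allow-list chosen for the example and is not minja's actual set of supported keywords, and `unknown_block_keywords` is not part of any library.

```python
import re

# Hypothetical allow-list for illustration only; this is NOT minja's actual
# list of supported keywords.
KNOWN_BLOCK_KEYWORDS = {
    "for", "endfor", "if", "elif", "else", "endif",
    "set", "endset", "macro", "endmacro", "filter", "endfilter",
}

# Matches the first word after "{%" (or "{%-").
BLOCK_TAG_RE = re.compile(r"\{%-?\s*(\w+)")


def unknown_block_keywords(template: str) -> list:
    """Return the sorted block keywords in `template` not in the allow-list."""
    found = set(BLOCK_TAG_RE.findall(template))
    return sorted(found - KNOWN_BLOCK_KEYWORDS)


# Shortened excerpt in the style of the template reported above.
snippet = (
    "{% for content in message['content'] %}"
    "{% generation %}{{ content['text'] }}{% endgeneration %}"
    "{% endfor %}"
)
print(unknown_block_keywords(snippet))  # ['endgeneration', 'generation']
```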
