[Feature]: Integrate with lm-format-enforcer #3713

Closed

simon-mo opened this issue Mar 29, 2024 · 11 comments · Fixed by #3868

Comments

@simon-mo
Collaborator

🚀 The feature, motivation and pitch

While the existing Outlines state machine provides great, state-of-the-art performance, it trades this off against a one-off compile time when working with a schema. For endpoint products running models as a service, with customers supplying many different schemas, that cost might not be acceptable. In that case, we should integrate with lm-format-enforcer from @noamgat.

We already have an existing logits processor interface, and guided decoding is tested. It should be quite straightforward to add an integration for it. In the end it should be selectable via some flag like --guided-decoding-backend=....
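
To make the interface concrete, here is a toy sketch (not the eventual integration): a logits processor in vLLM is just a callable that receives the token ids generated so far plus the raw logits and returns adjusted logits, and a guided-decoding backend plugs in exactly there. The processor below is hypothetical and only bans a single token id.

  from typing import List

  import torch
  from vllm import LLM, SamplingParams

  def ban_token_zero(token_ids: List[int], logits: torch.Tensor) -> torch.Tensor:
      # Toy example: forbid token id 0 at every decoding step. A real guided
      # decoding backend would instead mask every token that violates the
      # user-supplied schema or grammar.
      logits[0] = float("-inf")
      return logits

  llm = LLM(model="facebook/opt-125m")
  params = SamplingParams(max_tokens=32, logits_processors=[ban_token_zero])
  outputs = llm.generate(["Hello,"], params)
  print(outputs[0].outputs[0].text)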

Alternatives

No response

Additional context

No response

@noamgat
Contributor

noamgat commented Mar 29, 2024

I think I'll be able to execute this integration rather quickly if we agree on how the user chooses which decoding backend to use. Are you OK with the flag you suggested (guided-decoding-backend)?

@simon-mo
Collaborator Author

Yes, the flag sounds natural to me. A more complicated change would be to use lm-format-enforcer while the Outlines FSM is compiling.

But just using flags should be fine as a first step.
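
A minimal sketch of that fallback idea, with entirely hypothetical names, just to make the intent concrete:

  import json

  def choose_logits_processor(schema: dict, outlines_cache: dict, lmfe_processor):
      # Hypothetical: serve from the compiled Outlines FSM when it is ready,
      # otherwise fall back to the compile-free lm-format-enforcer processor
      # while compilation happens in the background.
      key = json.dumps(schema, sort_keys=True)
      compiled = outlines_cache.get(key)
      if compiled is not None:
          return compiled
      return lmfe_processor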

@noamgat
Contributor

noamgat commented Mar 31, 2024

I've started working on it; this commit shows how the information propagates.
Can you take a quick look to verify that I am going in the direction you envisioned?

@jamestwhedbee
Contributor

very excited for this

@simon-mo
Collaborator Author

simon-mo commented Apr 4, 2024

@noamgat engine args is the right place to put it. Do we need it in model config still?

@noamgat
Contributor

noamgat commented Apr 5, 2024

> @noamgat engine args is the right place to put it. Do we need it in model config still?

The reason I also put it in ModelConfig, and not just in EngineArgs, is that AsyncLLMEngine.__init__() receives the configs tuple rather than the EngineArgs, and I wanted to pass the information through to it. Can you think of a better way?

In the meantime I'm continuing to the actual LMFE integration.

@noamgat
Contributor

noamgat commented Apr 5, 2024

Submitted a pull request!

@noamgat
Contributor

noamgat commented Apr 5, 2024

Note: I ended up removing the argument from ModelConfig and adding a new DecodingConfig class instead.
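
For anyone following along, a rough sketch of what such a config object could look like if it only carries the backend choice (field and value names are illustrative and may differ from the PR):

  from dataclasses import dataclass

  @dataclass
  class DecodingConfig:
      # Engine-wide default for which backend builds the guided-decoding
      # logits processors; "outlines" mirrors the pre-existing behaviour.
      guided_decoding_backend: str = "outlines"

      def __post_init__(self):
          valid = {"outlines", "lm-format-enforcer"}
          if self.guided_decoding_backend not in valid:
              raise ValueError(
                  f"guided_decoding_backend must be one of {valid}, "
                  f"got {self.guided_decoding_backend!r}")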

@noamgat
Contributor

noamgat commented Apr 8, 2024

Updated the pull request with a per-request decoding backend parameter, to make testing easier (both for unit tests and for people evaluating the different options).
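
Assuming the per-request parameter is exposed through the OpenAI-compatible server's extra request fields alongside the existing guided_json field (names here follow this thread and may differ from the final PR), comparing the two backends from a client could look roughly like this:

  from openai import OpenAI

  client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
  schema = {"type": "object", "properties": {"answer": {"type": "string"}}}

  for backend in ("outlines", "lm-format-enforcer"):
      completion = client.chat.completions.create(
          model="mistralai/Mixtral-8x7B-Instruct-v0.1",
          messages=[{"role": "user", "content": "Answer as JSON."}],
          extra_body={
              # Per-request override of the engine-wide default backend.
              "guided_json": schema,
              "guided_decoding_backend": backend,
          },
      )
      print(backend, completion.choices[0].message.content)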

@ksjadeja

ksjadeja commented Apr 13, 2024

Can we have this support for local serving using AsyncLLMEngine? My scenario/use-case is as follows: I am using Mixtral 8x7B on a g5.12xlarge EC2 instance and serving it locally using Python. I want my model to generate output conforming to a strict JSON schema using the guided_json feature, and also use AsyncLLMEngine for faster responses. Can someone tell me how I can do that?

This is what I do:

  from vllm.engine.arg_utils import AsyncEngineArgs
  from vllm.engine.async_llm_engine import AsyncLLMEngine
  from vllm.sampling_params import SamplingParams
  
  kwargs = dict()
  kwargs["disable_log_stats"] = True
  kwargs["disable_log_requests"] = True
  engine_args = AsyncEngineArgs(
      model="casperhansen/mixtral-instruct-awq",
      tokenizer=None,
      tokenizer_mode="auto",
      trust_remote_code=True,
      tensor_parallel_size=4,
      dtype="auto",
      quantization="awq",
      revision=None,
      tokenizer_revision=None,
      seed=0,
      gpu_memory_utilization=0.65,
      swap_space=4,
      enforce_eager=False,
      max_context_len_to_capture=8192,
      disable_custom_all_reduce=False,
      **kwargs,
  )
  
  mixtral_model = AsyncLLMEngine.from_engine_args(engine_args)
  sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=1024)

And then for serving:
  from vllm.utils import random_uuid

  request_id = random_uuid()
  results_generator = mixtral_model.generate(prompt, sampling_params, request_id)

  final_output = None
  async for request_output in results_generator:
      final_output = request_output

  assert final_output is not None
  prompt = final_output.prompt
  text_outputs = [output.text for output in final_output.outputs]

Where can I include the guided_json feature and provide the schema?

@noamgat
Contributor

noamgat commented Apr 24, 2024

> Can we have this support for local serving using AsyncLLMEngine? [...] Where can I include the guided_json feature and provide the schema?

The code you are using is not 'serving', but rather 'script-based inference'.
The way to modify it is to add logits processors that perform the structured decoding to the sampling params.
An example reference for this can be found in the LMFE vLLM integration sample notebook.
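
For reference, a condensed sketch of that approach using lm-format-enforcer's vLLM integration helpers with the synchronous LLM class (helper names come from lm-format-enforcer and may vary between versions; the sample notebook remains the authoritative example):

  from lmformatenforcer import JsonSchemaParser
  from lmformatenforcer.integrations.vllm import (
      build_vllm_logits_processor, build_vllm_token_enforcer_tokenizer_data)
  from vllm import LLM, SamplingParams

  schema = {
      "type": "object",
      "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
      "required": ["name", "age"],
  }

  llm = LLM(model="casperhansen/mixtral-instruct-awq", quantization="awq",
            tensor_parallel_size=4)
  tokenizer_data = build_vllm_token_enforcer_tokenizer_data(llm)

  # The logits processor masks every token that would violate the schema.
  logits_processor = build_vllm_logits_processor(
      tokenizer_data, JsonSchemaParser(schema))

  sampling_params = SamplingParams(
      temperature=0.8, top_p=0.95, max_tokens=1024,
      logits_processors=[logits_processor])

  outputs = llm.generate(["Describe a person as JSON:"], sampling_params)
  print(outputs[0].outputs[0].text)

The same logits processor can be attached to the SamplingParams you pass to AsyncLLMEngine.generate() in your script; the tokenizer data would then typically be built from the underlying tokenizer rather than from an LLM object.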
