[Feature]: Integrate with lm-format-enforcer #3713

Closed

simon-mo opened this issue Mar 29, 2024 · 11 comments · Fixed by #3868

Comments

@simon-mo
Collaborator

🚀 The feature, motivation and pitch

While the existing Outlines state machine provides great, state-of-the-art performance, it trades this off against a one-off compile time when working with a schema. For endpoint products running models as a service, with customers supplying many different schemas, that cost might not be acceptable. In that case, we should integrate with lm-format-enforcer from @noamgat.

We already have an existing logits processor interface, and guided decoding is tested. It should be quite straightforward to add an integration for it. In the end it should be selectable via some flag like --guided-decoding-backend=....
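
To make the interface concrete, here is a toy sketch (not the eventual integration): a logits processor in vLLM is just a callable that receives the token ids generated so far plus the raw logits and returns adjusted logits, and a guided-decoding backend plugs in exactly there. The processor below is hypothetical and only bans a single token id.

  from typing import List

  import torch
  from vllm import LLM, SamplingParams

  def ban_token_zero(token_ids: List[int], logits: torch.Tensor) -> torch.Tensor:
      # Toy example: forbid token id 0 at every decoding step. A real guided
      # decoding backend would instead mask every token that violates the
      # user-supplied schema or grammar.
      logits[0] = float("-inf")
      return logits

  llm = LLM(model="facebook/opt-125m")
  params = SamplingParams(max_tokens=32, logits_processors=[ban_token_zero])
  outputs = llm.generate(["Hello,"], params)
  print(outputs[0].outputs[0].text)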

Alternatives

No response

Additional context

No response

@noamgat
Contributor

noamgat commented Mar 29, 2024

I think I'll be able to execute this integration rather quickly if we agree on how the user chooses which decoding backend to use. Are you OK with the flag you suggested (guided-decoding-backend)?

@simon-mo
Collaborator Author

Yes, the flag sounds natural to me. A more complicated change would be to use lm-format-enforcer while the Outlines FSM is compiling.

But just using flags should be fine as a first step.
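
A minimal sketch of that fallback idea, with entirely hypothetical names, just to make the intent concrete:

  import json

  def choose_logits_processor(schema: dict, outlines_cache: dict, lmfe_processor):
      # Hypothetical: serve from the compiled Outlines FSM when it is ready,
      # otherwise fall back to the compile-free lm-format-enforcer processor
      # while compilation happens in the background.
      key = json.dumps(schema, sort_keys=True)
      compiled = outlines_cache.get(key)
      if compiled is not None:
          return compiled
      return lmfe_processor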

@noamgat
Contributor

noamgat commented Mar 31, 2024

I've started working on it; this commit shows how the information propagates.
Can you take a quick look to verify that I am going in the direction you envisioned?

@jamestwhedbee
Contributor

very excited for this

@simon-mo
Collaborator Author

simon-mo commented Apr 4, 2024

@noamgat engine args is the right place to put it. Do we need it in model config still?

@noamgat
Contributor

noamgat commented Apr 5, 2024

> @noamgat engine args is the right place to put it. Do we need it in model config still?

The reason I also put it in ModelConfig, and not just in EngineArgs, is that AsyncLLMEngine.__init__() receives the configs tuple rather than the EngineArgs, and I wanted to pass the information through to it. Can you think of a better way?

In the meantime I'm continuing to the actual LMFE integration.

@noamgat
Contributor

noamgat commented Apr 5, 2024

Submitted a pull request!

@noamgat
Contributor

noamgat commented Apr 5, 2024

Note: I ended up removing the argument from ModelConfig and adding a new DecodingConfig class instead.
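
For anyone following along, a rough sketch of what such a config object could look like if it only carries the backend choice (field and value names are illustrative and may differ from the PR):

  from dataclasses import dataclass

  @dataclass
  class DecodingConfig:
      # Engine-wide default for which backend builds the guided-decoding
      # logits processors; "outlines" mirrors the pre-existing behaviour.
      guided_decoding_backend: str = "outlines"

      def __post_init__(self):
          valid = {"outlines", "lm-format-enforcer"}
          if self.guided_decoding_backend not in valid:
              raise ValueError(
                  f"guided_decoding_backend must be one of {valid}, "
                  f"got {self.guided_decoding_backend!r}")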

@noamgat
Contributor

noamgat commented Apr 8, 2024

Updated the pull request with a per-request decoding backend parameter, to make testing easier (both for unit tests and for people evaluating the different options).
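
Assuming the per-request parameter is exposed through the OpenAI-compatible server's extra request fields alongside the existing guided_json field (names here follow this thread and may differ from the final PR), comparing the two backends from a client could look roughly like this:

  from openai import OpenAI

  client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
  schema = {"type": "object", "properties": {"answer": {"type": "string"}}}

  for backend in ("outlines", "lm-format-enforcer"):
      completion = client.chat.completions.create(
          model="mistralai/Mixtral-8x7B-Instruct-v0.1",
          messages=[{"role": "user", "content": "Answer as JSON."}],
          extra_body={
              # Per-request override of the engine-wide default backend.
              "guided_json": schema,
              "guided_decoding_backend": backend,
          },
      )
      print(backend, completion.choices[0].message.content)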

@ksjadeja

ksjadeja commented Apr 13, 2024

Can we have this support for local serving using AsyncLLMEngine? My scenario/use-case is as follows: I am using Mixtral 8x7B on a g5.12xlarge EC2 instance and serving it locally using Python. I want my model to generate output conforming to a strict JSON schema using the guided_json feature, and also use AsyncLLMEngine for faster responses. Can someone tell me how I can do that?

This is what I do:

  from vllm.engine.arg_utils import AsyncEngineArgs
  from vllm.engine.async_llm_engine import AsyncLLMEngine
  from vllm.sampling_params import SamplingParams
  
  kwargs = dict()
  kwargs["disable_log_stats"] = True
  kwargs["disable_log_requests"] = True
  engine_args = AsyncEngineArgs(
      model="casperhansen/mixtral-instruct-awq",
      tokenizer=None,
      tokenizer_mode="auto",
      trust_remote_code=True,
      tensor_parallel_size=4,
      dtype="auto",
      quantization="awq",
      revision=None,
      tokenizer_revision=None,
      seed=0,
      gpu_memory_utilization=0.65,
      swap_space=4,
      enforce_eager=False,
      max_context_len_to_capture=8192,
      disable_custom_all_reduce=False,
      **kwargs,
  )
  
  mixtral_model = AsyncLLMEngine.from_engine_args(engine_args)
  sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=1024)

And then for serving:
  from vllm.utils import random_uuid

  request_id = random_uuid()
  results_generator = mixtral_model.generate(prompt, sampling_params, request_id)

  final_output = None
  async for request_output in results_generator:
      final_output = request_output

  assert final_output is not None
  prompt = final_output.prompt
  text_outputs = [output.text for output in final_output.outputs]

Where can I include the guided_json feature and provide the schema?

@noamgat
Contributor

noamgat commented Apr 24, 2024

> Can we have this support for local serving using AsyncLLMEngine? [...] Where can I include the guided_json feature and provide the schema?

The code you are using is not 'serving', but rather 'script-based inference'.
The way to modify it is to add logits processors that perform the structured decoding to the sampling params.
An example reference for this can be found in the LMFE vLLM integration sample notebook.
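
For reference, a condensed sketch of that approach using lm-format-enforcer's vLLM integration helpers with the synchronous LLM class (helper names come from lm-format-enforcer and may vary between versions; the sample notebook remains the authoritative example):

  from lmformatenforcer import JsonSchemaParser
  from lmformatenforcer.integrations.vllm import (
      build_vllm_logits_processor, build_vllm_token_enforcer_tokenizer_data)
  from vllm import LLM, SamplingParams

  schema = {
      "type": "object",
      "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
      "required": ["name", "age"],
  }

  llm = LLM(model="casperhansen/mixtral-instruct-awq", quantization="awq",
            tensor_parallel_size=4)
  tokenizer_data = build_vllm_token_enforcer_tokenizer_data(llm)

  # The logits processor masks every token that would violate the schema.
  logits_processor = build_vllm_logits_processor(
      tokenizer_data, JsonSchemaParser(schema))

  sampling_params = SamplingParams(
      temperature=0.8, top_p=0.95, max_tokens=1024,
      logits_processors=[logits_processor])

  outputs = llm.generate(["Describe a person as JSON:"], sampling_params)
  print(outputs[0].outputs[0].text)

The same logits processor can be attached to the SamplingParams you pass to AsyncLLMEngine.generate() in your script; the tokenizer data would then typically be built from the underlying tokenizer rather than from an LLM object.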
