
(inference-request)=

# Inference Request

The main class used to describe requests to `GptManager` is `InferenceRequest`. It is structured as a map from tensor names to tensors, together with a `uint64_t` request ID. The mandatory input tensors required to create a valid `InferenceRequest` object are listed below. Sampling configuration parameters are documented in the {ref}`gpt-runtime` section; their detailed descriptions are omitted from the tables.

| Name | Shape | Type | Description |
| :--- | :--- | :--- | :--- |
| `request_output_len` | [1, 1] | `int32_t` | Max number of output tokens |
| `input_ids` | [1, num_input_tokens] | `int32_t` | Tensor of input tokens |
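
A minimal sketch of how a request with just these mandatory tensors might be assembled. The `Int32Tensor` struct, `NamedTensorMap` alias, and `makeMinimalRequest` function below are illustrative stand-ins for the named-tensor map described above, not the actual `InferenceRequest` API:

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical named-tensor container; the real InferenceRequest stores
// tensors by name in a similar fashion.
struct Int32Tensor
{
    std::vector<int64_t> shape;
    std::vector<int32_t> data;
};
using NamedTensorMap = std::map<std::string, Int32Tensor>;

// Build the two mandatory tensors for a single request.
NamedTensorMap makeMinimalRequest(std::vector<int32_t> const& inputTokens, int32_t maxNewTokens)
{
    NamedTensorMap request;
    // input_ids: shape [1, num_input_tokens]
    request["input_ids"] = {{1, static_cast<int64_t>(inputTokens.size())}, inputTokens};
    // request_output_len: shape [1, 1]
    request["request_output_len"] = {{1, 1}, {maxNewTokens}};
    return request;
}
```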

Optional tensors that can be supplied to `InferenceRequest` are shown below. Default values are specified where applicable:

| Name | Shape | Type | Description |
| :--- | :--- | :--- | :--- |
| `streaming` | [1] | `bool` | (Default=`false`). When `true`, stream out tokens as they are generated. When `false`, return only once the full generation has completed. |
| `beam_width` | [1] | `int32_t` | (Default=1) Beam width for this request; set to 1 for greedy sampling |
| `temperature` | [1] | `float` | Sampling Config param: `temperature` |
| `runtime_top_k` | [1] | `int32_t` | Sampling Config param: `topK` |
| `runtime_top_p` | [1] | `float` | Sampling Config param: `topP` |
| `len_penalty` | [1] | `float` | Sampling Config param: `lengthPenalty` |
| `early_stopping` | [1] | `int` | Sampling Config param: `earlyStopping` |
| `repetition_penalty` | [1] | `float` | Sampling Config param: `repetitionPenalty` |
| `min_length` | [1] | `int32_t` | Sampling Config param: `minLength` |
| `presence_penalty` | [1] | `float` | Sampling Config param: `presencePenalty` |
| `frequency_penalty` | [1] | `float` | Sampling Config param: `frequencyPenalty` |
| `no_repeat_ngram_size` | [1] | `int32_t` | Sampling Config param: `noRepeatNgramSize` |
| `random_seed` | [1] | `uint64_t` | Sampling Config param: `randomSeed` |
| `end_id` | [1] | `int32_t` | End token ID. If not specified, defaults to -1 |
| `pad_id` | [1] | `int32_t` | Pad token ID |
| `embedding_bias` | [1, vocab_size] | `float` | The bias is added to the logits of each token in the vocabulary before decoding occurs. Positive values encourage the sampling of a token, while negative values discourage it. A value of 0.f leaves the logit unchanged. |
| `bad_words_list` | [1, 2, num_bad_words] | `int32_t` | Bad words list. Consider an example with two bad words, where the first word contains tokens [5, 7, 3] and the second contains tokens [9, 2]. In total there are 5 tokens, so the tensor shape should be [1, 2, 5]. The first row of the tensor must contain the token IDs, while the second row must store the inclusive-scan offsets of the word lengths (in number of tokens). Hence, the `bad_words_list` tensor would look like: [[[5, 7, 3, 9, 2], [3, 5, -1, -1, -1]]]. A short sketch of computing these offsets follows the table. |
| `stop_words_list` | [1, 2, num_stop_words] | `int32_t` | Stop words list. See `bad_words_list` for a description of the expected tensor shape and content |
| `prompt_embedding_table` | [1] | `float16` | P-tuning prompt embedding table |
| `prompt_vocab_size` | [1] | `int32_t` | P-tuning prompt vocab size |
| `lora_task_id` | [1] | `uint64_t` | Task ID for the given `lora_weights`. This ID is expected to be globally unique. To perform inference with a specific LoRA for the first time, `lora_task_id`, `lora_weights`, and `lora_config` must all be given. The LoRA will be cached, so that subsequent requests for the same task only require `lora_task_id`. If the cache is full, the oldest LoRA will be evicted to make space for new ones. An error is returned if `lora_task_id` is not cached |
| `lora_weights` | [num_lora_modules_layers, D x Hi + Ho x D] | `float` (model data type) | Weights for a LoRA adapter. Refer to {ref}`lora` for more information. |
| `lora_config` | [num_lora_modules_layers, 3] | `int32_t` | LoRA configuration tensor: [module_id, layer_idx, adapter_size (D, a.k.a. R value)]. Refer to {ref}`lora` for more information. |
| `return_log_probs` | [1] | `bool` | When `true`, include log probs in the output |
| `return_context_logits` | [1] | `bool` | When `true`, include context logits in the output |
| `return_generation_logits` | [1] | `bool` | When `true`, include generation logits in the output |
| `draft_input_ids` | [num_draft_tokens] | `int32_t` | Draft tokens to be leveraged in the generation phase to potentially generate multiple output tokens in one in-flight batching iteration |
| `draft_logits` | [num_draft_tokens, vocab_size] | `float` | Draft logits associated with `draft_input_ids`, to be leveraged in the generation phase to potentially generate multiple output tokens in one in-flight batching iteration |
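
As a concrete illustration of the `bad_words_list` layout, the sketch below flattens a list of bad words into the two rows described above. The helper name and container types are illustrative only:

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Flatten a list of bad words (each a vector of token IDs) into the two rows
// of a [1, 2, num_bad_word_tokens] tensor: row 0 holds the concatenated token
// IDs, row 1 holds the inclusive-scan offsets of the word lengths, padded
// with -1 to the same length as row 0.
std::pair<std::vector<int32_t>, std::vector<int32_t>> flattenBadWords(
    std::vector<std::vector<int32_t>> const& badWords)
{
    std::vector<int32_t> tokenIds;
    std::vector<int32_t> offsets;
    for (auto const& word : badWords)
    {
        tokenIds.insert(tokenIds.end(), word.begin(), word.end());
        offsets.push_back(static_cast<int32_t>(tokenIds.size()));
    }
    offsets.resize(tokenIds.size(), -1); // pad with -1
    return {tokenIds, offsets};
}

// Example: {{5, 7, 3}, {9, 2}} -> tokenIds = {5, 7, 3, 9, 2}, offsets = {3, 5, -1, -1, -1}
```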

## Responses

Responses from `GptManager` are formatted as a list of tensors. The table below shows the set of output tensors returned by `GptManager` (via the `SendResponseCallback`):

| Name | Shape | Type | Description |
| :--- | :--- | :--- | :--- |
| `output_ids` | [beam_width, num_output_tokens] | `int32_t` | Tensor of output tokens. When streaming is enabled, this holds a single token. |
| `sequence_length` | [beam_width] | `int32_t` | Number of output tokens. When streaming is set, this will be 1. |
| `output_log_probs` | [1, beam_width, num_output_tokens] | `float` | Only if `return_log_probs` is set on input. Tensor of log probabilities of the output tokens. |
| `cum_log_probs` | [1, beam_width] | `float` | Only if `return_log_probs` is set on input. Cumulative log probability of the generated sequence. |
| `context_logits` | [1, num_input_tokens, vocab_size] | `float` | Only if `return_context_logits` is set on input. Tensor of input (context) token logits. |
| `generation_logits` | [1, beam_width, num_output_tokens, vocab_size] | `float` | Only if `return_generation_logits` is set on input. Tensor of output token logits. |
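
To make the layout concrete, the following sketch extracts the tokens generated for one beam from a response, assuming the outputs have already been gathered into a hypothetical name-to-data map. The `ResponseMap` alias and `beamTokens` helper are illustrative, not the callback's actual signature:

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical flattened view of a response: tensor name -> row-major data.
// The real callback delivers named tensors; this only illustrates the layout.
using ResponseMap = std::map<std::string, std::vector<int32_t>>;

// Extract the tokens generated for beam `beam` from output_ids, using
// sequence_length to know how many of the num_output_tokens slots are valid.
std::vector<int32_t> beamTokens(ResponseMap const& response, int32_t numOutputTokens, int32_t beam)
{
    auto const& outputIds = response.at("output_ids");      // [beam_width, num_output_tokens]
    auto const& seqLength = response.at("sequence_length"); // [beam_width]
    auto const begin = outputIds.begin() + beam * numOutputTokens;
    return {begin, begin + seqLength[beam]};
}
```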