input tokens exceeded max_input_tokens #2638

Open
LanSnowZ opened this issue Oct 12, 2024 · 0 comments

System Info

Docker

Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.80.0
Commit sha: 169178b
Docker label: sha-169178b
nvidia-smi

Args {
model_id: "/share/base_model/Mistral-Nemo-Instruct-2407-GPTQ",
revision: None,
validation_workers: 2,
sharded: None,
num_shard: None,
quantize: Some(
Gptq,
),
speculate: None,
dtype: None,
trust_remote_code: false,
max_concurrent_requests: 128,
max_best_of: 2,
max_stop_sequences: 4,
max_top_n_tokens: 5,
max_input_tokens: Some(
8192,
),
max_input_length: None,
max_total_tokens: Some(
10240,
),
waiting_served_ratio: 0.3,
max_batch_prefill_tokens: None,
max_batch_total_tokens: None,
max_waiting_tokens: 20,
max_batch_size: None,
cuda_graphs: None,
hostname: "545eaf4c39af",
port: 80,
shard_uds_path: "/tmp/text-generation-server",
master_addr: "localhost",
master_port: 29500,
huggingface_hub_cache: None,
weights_cache_override: None,
disable_custom_kernels: false,
cuda_memory_fraction: 1.0,
rope_scaling: None,
rope_factor: None,
json_output: false,
otlp_endpoint: None,
otlp_service_name: "text-generation-inference.router",
cors_allow_origin: [],
api_key: None,
watermark_gamma: None,
watermark_delta: None,
ngrok: false,
ngrok_authtoken: None,
ngrok_edge: None,
tokenizer_config_path: None,
disable_grammar_support: false,
env: false,
max_client_batch_size: 4,
lora_adapters: None,
usage_stats: On,
}

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

I launched a TGI server on an A100 GPU machine serving the Mistral-Nemo-Instruct-2407-GPTQ model.
As shown in the config above, I set max_input_tokens to 8192 and max_total_tokens to 10240. But when I sent a message containing more than 8192 tokens, it did not seem to be truncated. The request was sent roughly as in the sketch below; the error info follows it.
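
A minimal sketch of how the message was sent (the endpoint URL, model name, and prompt are placeholders; max_tokens matches the max_new_tokens in the error):

# Sketch only: the real prompt is roughly 9266 tokens long.
import requests

long_prompt = "..."  # placeholder for a user message longer than 8192 tokens

resp = requests.post(
    "http://localhost:80/v1/chat/completions",
    json={
        "model": "tgi",
        "messages": [{"role": "user", "content": long_prompt}],
        "max_tokens": 1000,
        "stream": True,
    },
    stream=True,
)
for line in resp.iter_lines():
    if line:
        print(line.decode("utf-8"))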

2024-10-11T11:27:58.527278Z ERROR chat_completions:async_stream:generate_stream: text_generation_router::infer: router/src/infer/mod.rs:105: `inputs` tokens + `max_new_tokens` must be <= 10240. Given: 9266 `inputs` tokens and 1000 `max_new_tokens`
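
As I read the error, the request exceeds both limits from the config above; a quick check (plain Python, numbers taken from the error message):

# Limits from the launcher config vs. the numbers reported by the router.
max_input_tokens = 8192
max_total_tokens = 10240

input_tokens = 9266     # `inputs` tokens reported in the error
max_new_tokens = 1000   # `max_new_tokens` reported in the error

print(input_tokens <= max_input_tokens)                   # False: 9266 > 8192
print(input_tokens + max_new_tokens <= max_total_tokens)  # False: 10266 > 10240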

My question:

  1. Will TGI automatically truncate the user input according to max_input_tokens?
  2. Is there a parameter I can use to truncate the input to at most max_input_tokens? (See the sketch below for the client-side workaround I would rather avoid.)
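
To illustrate question 2, this is roughly the client-side truncation I would rather not have to do myself (a sketch using the transformers tokenizer; the model path and limit mirror the config above):

# Client-side workaround sketch (not TGI behaviour): truncate the prompt to
# max_input_tokens with the model's own tokenizer before sending the request.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "/share/base_model/Mistral-Nemo-Instruct-2407-GPTQ"
)

def truncate_prompt(text: str, max_input_tokens: int = 8192) -> str:
    ids = tokenizer(text, truncation=True, max_length=max_input_tokens)["input_ids"]
    return tokenizer.decode(ids, skip_special_tokens=True)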

Thanks a lot for the help.

Expected behavior

Input tokens should be truncated.
