When there are multiple GPUs, only one GPU is used #7664

Open
gyr66 opened this issue Sep 27, 2024 · 4 comments
Labels: question (Further information is requested), verify to close (Verifying if the issue can be closed)

Comments

@gyr66

gyr66 commented Sep 27, 2024

Description
When there are multiple GPUs, only one GPU is used.

Triton Information
Container: nvcr.io/nvidia/tritonserver:24.08-trtllm-python-py3

To Reproduce
Follow the instructions at https://github.com/triton-inference-server/tutorials/blob/main/Popular_Models_Guide/Llama2/trtllm_guide.md

docker run --rm -it --net host --shm-size=2g \
    --ulimit memlock=-1 --ulimit stack=67108864 --gpus all \
    -v /root/models/Meta-Llama-3.1-8B-Instruct:/root/.cache/huggingface \
    -v /mnt/data/engines:/engines \
    nvcr.io/nvidia/tritonserver:24.08-trtllm-python-py3

pip install git+https://github.com/triton-inference-server/[email protected]

triton import -m llama-3.1-8b-instruct --backend tensorrtllm

triton start

The model configuration file (/root/models/llama-3.1-8b-instruct/config.pbtxt) is:

backend: "python"
max_batch_size: 256

model_transaction_policy {
  decoupled: True
}

input [
  {
    name: "text_input"
    data_type: TYPE_STRING
    dims: [ 1 ]
  },
  {
    name: "decoder_text_input"
    data_type: TYPE_STRING
    dims: [ 1 ]
    optional: true
  },
  {
    name: "image_input"
    data_type: TYPE_FP16
    dims: [ 3, -1, -1 ]
    optional: true
  },
  {
    name: "max_tokens"
    data_type: TYPE_INT32
    dims: [ 1 ]
  },
  {
   name: "bad_words"
   data_type: TYPE_STRING
   dims: [ -1 ]
   optional: true
  },
  {
   name: "stop_words"
   data_type: TYPE_STRING
   dims: [ -1 ]
   optional: true
  },
  {
    name: "end_id"
    data_type: TYPE_INT32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "pad_id"
    data_type: TYPE_INT32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "top_k"
    data_type: TYPE_INT32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "top_p"
    data_type: TYPE_FP32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "temperature"
    data_type: TYPE_FP32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "length_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "repetition_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "min_length"
    data_type: TYPE_INT32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "presence_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "frequency_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "random_seed"
    data_type: TYPE_UINT64
    dims: [ 1 ]
    optional: true
  },
  {
    name: "return_log_probs"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "return_context_logits"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "return_generation_logits"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "beam_width"
    data_type: TYPE_INT32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "stream"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    optional: true
  },
  {
    name: "prompt_embedding_table"
    data_type: TYPE_FP16
    dims: [ -1, -1 ]
    optional: true
  },
  {
    name: "prompt_vocab_size"
    data_type: TYPE_INT32
    dims: [ 1 ]
    optional: true
  },
  {
      name: "embedding_bias_words"
      data_type: TYPE_STRING
      dims: [ -1 ]
      optional: true
  },
  {
      name: "embedding_bias_weights"
      data_type: TYPE_FP32
      dims: [ -1 ]
      optional: true
  },
  {
      name: "num_draft_tokens",
      data_type: TYPE_INT32,
      dims: [ 1 ]
      optional: true
  },
  {
      name: "use_draft_logits",
      data_type: TYPE_BOOL,
      dims: [ 1 ]
      reshape: { shape: [ ] }
      optional: true
  }
]
output [
  {
    name: "text_output"
    data_type: TYPE_STRING
    dims: [ -1 ]
  },
  {
    name: "cum_log_probs"
    data_type: TYPE_FP32
    dims: [ -1 ]
  },
  {
    name: "output_log_probs"
    data_type: TYPE_FP32
    dims: [ -1, -1 ]
  },
  {
    name: "context_logits"
    data_type: TYPE_FP32
    dims: [ -1, -1 ]
  },
  {
    name: "generation_logits"
    data_type: TYPE_FP32
    dims: [ -1, -1, -1 ]
  },
  {
    name: "batch_index"
    data_type: TYPE_INT32
    dims: [ 1 ]
  }
]

parameters: {
  key: "accumulate_tokens"
  value: {
    string_value: "${accumulate_tokens}"
  }
}
parameters: {
  key: "tensorrt_llm_model_name"
  value: {
    string_value: "tensorrt_llm"
  }
}
parameters: {
  key: "tensorrt_llm_draft_model_name"
  value: {
    string_value: ""
  }
}
parameters: {
  key: "multimodal_encoders_name"
  value: {
    string_value: "${multimodal_encoders_name}"
  }
}

instance_group [
  {
    count: 1
    kind : KIND_GPU
    gpus: [ 0 ]
  },
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 1 ]
  }
]

Clearly, I have specified it to use GPU 0 and GPU 1.

The postprocessing, preprocessing, and tensorrt_llm models are left unchanged.

Expected behavior

The model should be loaded on GPU 0 and GPU 1, and requests should be distributed across them based on load.

Here is what I got:

[screenshot: nvidia-smi output]

As you can see, the model is only loaded on GPU 0.

When I run a benchmark, only GPU 0 is used as well:

[screenshot: GPU utilization during the benchmark]

Here is the Triton server log:

root@ubuntu22:/opt/tritonserver# triton start
triton - INFO - Starting a Triton Server locally with model repository: /root/models
triton - INFO - Reading server output...
I0927 07:14:19.053887 3017 pinned_memory_manager.cc:277] "Pinned memory pool is created at '0x73e104000000' with size 268435456"
I0927 07:14:19.060743 3017 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 0 with size 67108864"
I0927 07:14:19.060759 3017 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 1 with size 67108864"
I0927 07:14:19.290249 3017 model_lifecycle.cc:472] "loading: llama-3.1-8b-instruct:1"
I0927 07:14:19.290321 3017 model_lifecycle.cc:472] "loading: postprocessing:1"
I0927 07:14:19.290360 3017 model_lifecycle.cc:472] "loading: preprocessing:1"
I0927 07:14:19.290413 3017 model_lifecycle.cc:472] "loading: tensorrt_llm:1"
I0927 07:14:19.511929 3017 libtensorrtllm.cc:55] "TRITONBACKEND_Initialize: tensorrtllm"
I0927 07:14:19.511965 3017 libtensorrtllm.cc:62] "Triton TRITONBACKEND API version: 1.19"
I0927 07:14:19.511970 3017 libtensorrtllm.cc:66] "'tensorrtllm' TRITONBACKEND API version: 1.19"
I0927 07:14:19.511973 3017 libtensorrtllm.cc:86] "backend configuration:\n{\"cmdline\":{\"auto-complete-config\":\"true\",\"backend-directory\":\"/opt/tritonserver/backends\",\"min-compute-capability\":\"6.000000\",\"default-max-batch-size\":\"4\"}}"
[TensorRT-LLM][WARNING] gpu_device_ids is not specified, will be automatically set
I0927 07:14:19.530974 3017 libtensorrtllm.cc:114] "TRITONBACKEND_ModelInitialize: tensorrt_llm (version 1)"
[TensorRT-LLM][WARNING] max_beam_width is not specified, will use default value of 1
[TensorRT-LLM][WARNING] iter_stats_max_iterations is not specified, will use default value of 1000
[TensorRT-LLM][WARNING] request_stats_max_iterations is not specified, will use default value of 0
[TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.9 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] kv_cache_host_memory_bytes not set, defaulting to 0
[TensorRT-LLM][WARNING] kv_cache_onboard_blocks not set, defaulting to true
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] sink_token_length is not specified, will use default value
[TensorRT-LLM][WARNING] enable_kv_cache_reuse is not specified, will be set to false
[TensorRT-LLM][WARNING] enable_chunked_context is not specified, will be set to false.
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64
[TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8
[TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05
[TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB
[TensorRT-LLM][WARNING] multiBlockMode is not specified, will be set to false
[TensorRT-LLM][WARNING] enableContextFMHAFP32Acc is not specified, will be set to false
[TensorRT-LLM][WARNING] decoding_mode parameter is invalid or not specified(must be one of the {top_k, top_p, top_k_top_p, beam_search, medusa}).Using default: top_k_top_p if max_beam_width == 1, beam_search otherwise
[TensorRT-LLM][WARNING] gpu_weights_percent parameter is not specified, will use default value of 1.0
[TensorRT-LLM][WARNING] encoder_model_path is not specified, will be left empty
[TensorRT-LLM][INFO] Engine version 0.12.0.dev2024080600 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 3
[TensorRT-LLM][INFO] Initialized MPI
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 256
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 256
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 1048576
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 1048576
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 8192 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
I0927 07:14:22.499985 3017 python_be.cc:1923] "TRITONBACKEND_ModelInstanceInitialize: llama-3.1-8b-instruct_0_0 (GPU device 0)"
I0927 07:14:22.500067 3017 python_be.cc:1923] "TRITONBACKEND_ModelInstanceInitialize: llama-3.1-8b-instruct_1_0 (GPU device 1)"
I0927 07:14:23.339154 3017 python_be.cc:1923] "TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (CPU device 0)"
I0927 07:14:23.406562 3017 python_be.cc:1923] "TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (CPU device 0)"
I0927 07:14:24.257499 3017 model_lifecycle.cc:839] "successfully loaded 'llama-3.1-8b-instruct'"
I0927 07:14:25.953954 3017 model_lifecycle.cc:839] "successfully loaded 'preprocessing'"
I0927 07:14:26.009815 3017 model_lifecycle.cc:839] "successfully loaded 'postprocessing'"
[TensorRT-LLM][INFO] Loaded engine size: 15387 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 800.00 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 15380 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 6.12 GB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 4.63 GB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 39.50 GiB, available: 12.40 GiB
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 1429
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][WARNING] maxAttentionWindow and maxSequenceLen are too large for at least one sequence to fit in kvCache. they are reduced to 91456
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 1429
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 11.16 GiB for max tokens in paged KV cache (91456).
[TensorRT-LLM][WARNING] cancellation_check_period_ms is not specified, will be set to 100 (ms)
[TensorRT-LLM][WARNING] stats_check_period_ms is not specified, will be set to 100 (ms)
I0927 07:14:41.151772 3017 libtensorrtllm.cc:184] "TRITONBACKEND_ModelInstanceInitialize: tensorrt_llm_0_0"
I0927 07:14:41.152077 3017 model_lifecycle.cc:839] "successfully loaded 'tensorrt_llm'"
I0927 07:14:41.152199 3017 server.cc:604]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0927 07:14:41.152246 3017 server.cc:631]
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Backend     | Path                                                            | Config                                                                                                                                                        |
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
| python      | /opt/tritonserver/backends/python/libtriton_python.so           | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}} |
| tensorrtllm | /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}} |
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0927 07:14:41.152284 3017 server.cc:674]
+-----------------------+---------+--------+
| Model                 | Version | Status |
+-----------------------+---------+--------+
| llama-3.1-8b-instruct | 1       | READY  |
| postprocessing        | 1       | READY  |
| preprocessing         | 1       | READY  |
| tensorrt_llm          | 1       | READY  |
+-----------------------+---------+--------+

I0927 07:14:41.246701 3017 metrics.cc:877] "Collecting metrics for GPU 0: NVIDIA A100-PCIE-40GB"
I0927 07:14:41.246738 3017 metrics.cc:877] "Collecting metrics for GPU 1: NVIDIA A100-PCIE-40GB"
I0927 07:14:41.252099 3017 metrics.cc:770] "Collecting CPU metrics"
I0927 07:14:41.252238 3017 tritonserver.cc:2598]
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                                                                                                                           |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                                                                                                                                          |
| server_version                   | 2.49.0                                                                                                                                                                                                          |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging |
| model_repository_path[0]         | /root/models                                                                                                                                                                                                    |
| model_control_mode               | MODE_NONE                                                                                                                                                                                                       |
| strict_model_config              | 0                                                                                                                                                                                                               |
| model_config_name                |                                                                                                                                                                                                                 |
| rate_limit                       | OFF                                                                                                                                                                                                             |
| pinned_memory_pool_byte_size     | 268435456                                                                                                                                                                                                       |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                                                                                                                                                        |
| cuda_memory_pool_byte_size{1}    | 67108864                                                                                                                                                                                                        |
| min_supported_compute_capability | 6.0                                                                                                                                                                                                             |
| strict_readiness                 | 1                                                                                                                                                                                                               |
| exit_timeout                     | 30                                                                                                                                                                                                              |
| cache_enabled                    | 0                                                                                                                                                                                                               |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0927 07:14:41.254711 3017 grpc_server.cc:2463] "Started GRPCInferenceService at 0.0.0.0:8001"
I0927 07:14:41.254940 3017 http_server.cc:4694] "Started HTTPService at 0.0.0.0:8000"
I0927 07:14:41.296047 3017 http_server.cc:362] "Started Metrics Service at 0.0.0.0:8002"
@oandreeva-nv
Contributor

Hi @gyr66, thanks for your question. I believe this is because the TRT-LLM engine is built for a single GPU by default in the Triton CLI; @rmccorm4 will correct me if I'm wrong.
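For context, an engine that spans multiple GPUs generally has to be rebuilt with tensor parallelism rather than taken from the CLI quickstart. A rough sketch of that flow, assuming the TensorRT-LLM examples/llama scripts and placeholder paths (not the exact commands the Triton CLI runs):

# Sketch only: convert the HF checkpoint with tensor parallelism (TP=2),
# then build the engine from the converted checkpoint.
python3 convert_checkpoint.py --model_dir /root/.cache/huggingface \
    --output_dir /tmp/llama_ckpt_tp2 --dtype float16 --tp_size 2

trtllm-build --checkpoint_dir /tmp/llama_ckpt_tp2 \
    --output_dir /engines/llama-3.1-8b-instruct-tp2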

oandreeva-nv added the question (Further information is requested) label on Sep 27, 2024
@rmccorm4
Collaborator

Hi @gyr66, thanks for raising this issue and thanks for trying the Triton CLI!

As Olga mentioned, yes, the default configs produced are currently for a "quickstart" path and are pre-defined as a single Triton model instance of KIND_MODEL (it can be a multi-GPU model, but only a single Triton instance). KIND_MODEL tells Triton that the backend (TRT-LLM) will handle device placement/setup as needed, for example loading a TP=2 engine on 2 GPUs within a single Triton model instance.
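For illustration, an instance_group using KIND_MODEL looks roughly like this in config.pbtxt (a sketch of the idea, not necessarily the exact config the CLI generates):

# A single KIND_MODEL instance: the TRT-LLM backend handles device
# placement itself (e.g. spreading a TP=2 engine across 2 GPUs).
instance_group [
  {
    count: 1
    kind: KIND_MODEL
  }
]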

Running multiple model instances would require further knowledge of the TRT-LLM backend, and it may not work exactly the same way as with other backends because the current implementation uses MPI for communication.

There is a guide with more comprehensive details and documentation on the various components involved in serving multiple TRT-LLM model instances; please check it out: https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/llama_multi_instance.md#running-multiple-instances-of-llama-model-on-multiple-gpus.

Hopefully the Triton CLI-generated configs give you a good, functional starting point for a single instance, and they can then be tweaked by following that guide to support multiple instances.

CC @Tabrizian for visibility

oandreeva-nv added the verify to close (Verifying if the issue can be closed) label on Sep 30, 2024
@oandreeva-nv
Contributor

@gyr66, let us know if there's anything else we can help you with. Feel free to close this issue.

@gyr66
Author

gyr66 commented Oct 2, 2024

Thank you so much for your patient and detailed responses! I am wondering: if I don't use TP, could I simply start an independent server process for each GPU and place an NGINX load balancer in front? Would this be consistent with leader mode?
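For example (a rough sketch of what I have in mind; ports and paths are placeholders, and I have not tested this):

# One independent Triton process per GPU, each on its own ports:
CUDA_VISIBLE_DEVICES=0 tritonserver --model-repository=/root/models \
    --http-port 8000 --grpc-port 8001 --metrics-port 8002 &
CUDA_VISIBLE_DEVICES=1 tritonserver --model-repository=/root/models \
    --http-port 8010 --grpc-port 8011 --metrics-port 8012 &

# NGINX would then round-robin client requests across
# 127.0.0.1:8000 and 127.0.0.1:8010 via a simple upstream block.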
