Not able to load llama 3 70b on inf2.24xlarge instance #92

sangraamp · 2024-07-10T09:16:15Z

Here is the code I am using, taken directly from the aws-neuron-samples repository:

import time
import os
import torch
from transformers import AutoTokenizer
from transformers_neuronx import LlamaForSampling, NeuronAutoModelForCausalLM
from transformers import LlamaForCausalLM, LlamaTokenizer, PreTrainedTokenizerFast
from transformers_neuronx import LlamaForSampling, NeuronConfig, GQA, QuantizationConfig
from transformers_neuronx.config import GenerationConfig 

# Set this to the Hugging Face model ID
model_id = "meta-llama/Meta-Llama-3-70B-Instruct"

# Only include if accuracy mismatch is observed.
# https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/torch/transformers-neuronx/index.html#release-0-10-0-332
os.environ['NEURON_CC_FLAGS'] = '--enable-mixed-precision-accumulation -O1'
os.environ['FI_EFA_FORK_SAFE'] = '1'
os.environ['NCCL_DEBUG'] = 'INFO'

neuron_config = NeuronConfig(  
        attention_layout="BSH",
        group_query_attention=GQA.REPLICATED_HEADS,
)

!huggingface-cli login --token <TOKEN>

# load meta-llama/Llama-3-70B to the NeuronCores with 12-way tensor parallelism and run compilation
# LlamaForSampling
neuron_model = LlamaForSampling.from_pretrained(model_id, neuron_config=neuron_config, batch_size=1, tp_degree=12, amp='f16', n_positions=2048)
neuron_model.to_neuron()

And this is the error I am getting:

2024-Jul-10 09:14:33.741595 12878:14360 ERROR  TDRV:dmem_alloc_internal                     Failed to alloc DEVICE memory: 45429760
2024-Jul-10 09:14:33.794880 12878:14360 ERROR  TDRV:dml_dump                                Wrote nrt memory alloc debug info to /tmp/nrt_mem_log_device_2_668e50f9.csv
2024-Jul-10 09:14:33.846959 12878:14360 ERROR  TDRV:log_dev_mem                             Failed to allocate 43.325MB (usage: dma rings) on ND 2:NC 0, current utilization:
	* total: 15.972GB
	* model code: 482.430MB
	* model constants: 1.518MB
	* tensors: 13.170GB
	* shared scratchpad: 512.000MB
	* runtime: 7.352KB
	* dma rings: 1.471GB
	* collectives: 367.002MB

2024-Jul-10 09:14:34.041531 12878:14360 ERROR  TDRV:dma_ring_alloc                          Failed to allocate TX ring
2024-Jul-10 09:14:34.080369 12878:14360 ERROR  TDRV:dma_ring_create_static_rings_for_queue_bundle_instanceFailed to allocate static rx ring for queue qSPPIOParam0_5
2024-Jul-10 09:14:34.129462 12878:14360 ERROR  TDRV:io_create_queues                        Failed to create static rings for qSPPIOParam0
2024-Jul-10 09:14:34.172891 12878:14360 ERROR  TDRV:kbl_model_add                           Failed to allocated io queues and dbtc hashes for io
2024-Jul-10 09:14:34.235940 12878:14360 ERROR  NMGR:dlr_kelf_stage                          Failed to load subgraph
2024-Jul-10 09:14:34.273739 12878:14360 ERROR  NMGR:kmgr_load_nn_internal_v2                Failed to stage graph: kelf-0.json to NeuronCore
2024-Jul-10 09:14:34.317379 12878:14360 ERROR  NMGR:kmgr_load_nn_post_metrics               Failed to load NN: /tmp/ec2-user/neuroncc_compile_workdir/05680919-7207-45a6-a42a-7acb62519043/model.MODULE_06822009809fe50b37aa+334ad6c6.neff, err: 4
2024-Jul-10 09:14:34.390428 12878:14360 ERROR   NRT:nrt_infodump                            Neuron runtime information - please include in any support request:
2024-Jul-10 09:14:34.438518 12878:14360 ERROR   NRT:nrt_infodump                            ------------->8------------[ cut here ]------------>8-------------
2024-Jul-10 09:14:34.485780 12878:14360 ERROR   NRT:nrt_infodump                            NRT version: 2.21.41.0 (fb1705f5f26a084084cc75d6f4201472a1aa8ff1)
2024-Jul-10 09:14:34.532483 12878:14360 ERROR   NRT:nrt_infodump                            CCOM version: 2.0.0.0- (compat 41)
2024-Jul-10 09:14:34.572602 12878:14360 ERROR   NRT:nrt_infodump                            Instance ID: i-07d4ba8b6e9fb9052
2024-Jul-10 09:14:34.612319 12878:14360 ERROR   NRT:nrt_infodump                            Cluster ID: N/A
2024-Jul-10 09:14:34.648575 12878:14360 ERROR   NRT:nrt_infodump                            Kernel: Linux 5.10.219-208.866.amzn2.x86_64 #1 SMP Tue Jun 18 14:00:06 UTC 2024
2024-Jul-10 09:14:34.699198 12878:14360 ERROR   NRT:nrt_infodump                            Nodename: ip-172-16-4-94.ec2.internal
2024-Jul-10 09:14:34.739835 12878:14360 ERROR   NRT:nrt_infodump                            Driver version: 2.17.17.0

2024-Jul-10 09:14:34.788188 12878:14360 ERROR   NRT:nrt_infodump                            Failure: NRT_RESOURCE in nrt_load()
2024-Jul-10 09:14:34.829030 12878:14360 ERROR   NRT:nrt_infodump                            Visible cores: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
2024-Jul-10 09:14:34.873548 12878:14360 ERROR   NRT:nrt_infodump                            Environment:
2024-Jul-10 09:14:34.908542 12878:14360 ERROR   NRT:nrt_infodump                                NEURON_RT_ROOT_COMM_ID=localhost:51709
2024-Jul-10 09:14:34.949721 12878:14360 ERROR   NRT:nrt_infodump                                NEURON_CC_FLAGS=--enable-mixed-precision-accumulation -O1
2024-Jul-10 09:14:34.996342 12878:14360 ERROR   NRT:nrt_infodump                                NEURON_RT_DISABLE_EXECUTION_BARRIER=1
2024-Jul-10 09:14:35.041222 12878:14362 ERROR  TDRV:dmem_alloc_internal                     Failed to alloc DEVICE memory: 45429760
2024-Jul-10 09:14:35.083685 12878:14360 ERROR   NRT:nrt_infodump                                NEURON_RT_IO_RING_CACHE_SIZE=1
2024-Jul-10 09:14:35.181352 12878:14356 ERROR  TDRV:dmem_alloc_internal                     Failed to alloc DEVICE memory: 45429760
2024-Jul-10 09:14:35.120533 12878:14362 ERROR  TDRV:dml_dump                                Wrote nrt memory alloc debug info to /tmp/nrt_mem_log_device_3_668e50fb.csv
2024-Jul-10 09:14:35.192126 12878:14356 ERROR  TDRV:dml_dump                                Wrote nrt memory alloc debug info to /tmp/nrt_mem_log_device_0_668e50fb.csv
2024-Jul-10 09:14:35.150819 12878:14360 ERROR   NRT:nrt_infodump                            -------------8<-----------[ cut to here ]-----------8<------------
2024-Jul-10 09:14:35.191118 12878:14362 ERROR  TDRV:log_dev_mem                             Failed to allocate 43.325MB (usage: dma rings) on ND 3:NC 0, current utilization:
	* total: 15.972GB
	* model code: 482.430MB
	* model constants: 1.518MB
	* tensors: 13.170GB
	* shared scratchpad: 512.000MB
	* runtime: 7.352KB
	* dma rings: 1.471GB
	* collectives: 367.002MB

2024-Jul-10 09:14:35.595755 12878:14356 ERROR  TDRV:log_dev_mem                             Failed to allocate 43.325MB (usage: dma rings) on ND 0:NC 0, current utilization:
	* total: 15.972GB
	* model code: 482.430MB
	* model constants: 1.518MB
	* tensors: 13.170GB
	* shared scratchpad: 512.000MB
	* runtime: 7.352KB
	* dma rings: 1.471GB
	* collectives: 367.002MB

2024-Jul-10 09:14:35.595767 12878:14356 ERROR  TDRV:dma_ring_alloc                          Failed to allocate TX ring
2024-Jul-10 09:14:35.595775 12878:14356 ERROR  TDRV:dma_ring_create_static_rings_for_queue_bundle_instanceFailed to allocate static rx ring for queue qSPPIOParam0_5
2024-Jul-10 09:14:35.595783 12878:14356 ERROR  TDRV:io_create_queues                        Failed to create static rings for qSPPIOParam0
2024-Jul-10 09:14:35.595792 12878:14356 ERROR  TDRV:kbl_model_add                           Failed to allocated io queues and dbtc hashes for io
2024-Jul-10 09:14:35.595659 12878:14362 ERROR  TDRV:dma_ring_alloc                          Failed to allocate TX ring
2024-Jul-10 09:14:35.646487 12878:14361 ERROR  TDRV:dmem_alloc_internal                     Failed to alloc DEVICE memory: 45429760
2024-Jul-10 09:14:35.692808 12878:14362 ERROR  TDRV:dma_ring_create_static_rings_for_queue_bundle_instanceFailed to allocate static rx ring for queue qSPPIOParam0_5
2024-Jul-10 09:14:35.743583 12878:14361 ERROR  TDRV:dml_dump                                Wrote nrt memory alloc debug info to /tmp/nrt_mem_log_device_2_668e50fb.csv
2024-Jul-10 09:14:35.929348 12878:14362 ERROR  TDRV:io_create_queues                        Failed to create static rings for qSPPIOParam0
2024-Jul-10 09:14:35.969899 12878:14361 ERROR  TDRV:log_dev_mem                             Failed to allocate 43.325MB (usage: dma rings) on ND 2:NC 1, current utilization:
	* total: 15.972GB
	* model code: 482.430MB
	* model constants: 1.518MB
	* tensors: 13.170GB
	* shared scratchpad: 512.000MB
	* runtime: 7.352KB
	* dma rings: 1.471GB
	* collectives: 367.002MB

2024-Jul-10 09:14:36.038292 12878:14362 ERROR  TDRV:kbl_model_add                           Failed to allocated io queues and dbtc hashes for io
2024-Jul-10 09:14:36.083961 12878:14361 ERROR  TDRV:dma_ring_alloc                          Failed to allocate TX ring
2024-Jul-10 09:14:36.645417 12878:14361 ERROR  TDRV:dma_ring_create_static_rings_for_queue_bundle_instanceFailed to allocate static rx ring for queue qSPPIOParam0_5
2024-Jul-10 09:14:36.696237 12878:14361 ERROR  TDRV:io_create_queues                        Failed to create static rings for qSPPIOParam0
2024-Jul-10 09:14:36.740940 12878:14361 ERROR  TDRV:kbl_model_add                           Failed to allocated io queues and dbtc hashes for io
2024-Jul-10 09:14:36.941805 12878:14356 ERROR  NMGR:dlr_kelf_stage                          Failed to load subgraph
2024-Jul-10 09:14:37.064098 12878:14356 ERROR  NMGR:kmgr_load_nn_internal_v2                Failed to stage graph: kelf-0.json to NeuronCore
2024-Jul-10 09:14:37.114592 12878:14356 ERROR  NMGR:kmgr_load_nn_post_metrics               Failed to load NN: /tmp/ec2-user/neuroncc_compile_workdir/05680919-7207-45a6-a42a-7acb62519043/model.MODULE_06822009809fe50b37aa+334ad6c6.neff, err: 4
2024-Jul-10 09:14:37.200973 12878:14362 ERROR  NMGR:dlr_kelf_stage                          Failed to load subgraph
2024-Jul-10 09:14:37.264635 12878:14362 ERROR  NMGR:kmgr_load_nn_internal_v2                Failed to stage graph: kelf-0.json to NeuronCore
2024-Jul-10 09:14:37.326906 12878:14362 ERROR  NMGR:kmgr_load_nn_post_metrics               Failed to load NN: /tmp/ec2-user/neuroncc_compile_workdir/05680919-7207-45a6-a42a-7acb62519043/model.MODULE_06822009809fe50b37aa+334ad6c6.neff, err: 4
2024-Jul-10 09:14:37.496254 12878:14361 ERROR  NMGR:dlr_kelf_stage                          Failed to load subgraph
2024-Jul-10 09:14:37.537924 12878:14361 ERROR  NMGR:kmgr_load_nn_internal_v2                Failed to stage graph: kelf-0.json to NeuronCore
2024-Jul-10 09:14:37.694363 12878:14361 ERROR  NMGR:kmgr_load_nn_post_metrics               Failed to load NN: /tmp/ec2-user/neuroncc_compile_workdir/05680919-7207-45a6-a42a-7acb62519043/model.MODULE_06822009809fe50b37aa+334ad6c6.neff, err: 4
2024-Jul-10 09:14:37.962790 12878:14359 ERROR  TDRV:dmem_alloc_internal                     Failed to alloc DEVICE memory: 45429760
2024-Jul-10 09:14:38.016498 12878:14359 ERROR  TDRV:dml_dump                                Wrote nrt memory alloc debug info to /tmp/nrt_mem_log_device_1_668e50fe.csv
2024-Jul-10 09:14:38.101532 12878:14359 ERROR  TDRV:log_dev_mem                             Failed to allocate 43.325MB (usage: dma rings) on ND 1:NC 1, current utilization:
	* total: 15.972GB
	* model code: 482.430MB
	* model constants: 1.518MB
	* tensors: 13.170GB
	* shared scratchpad: 512.000MB
	* runtime: 7.352KB
	* dma rings: 1.471GB
	* collectives: 367.002MB

2024-Jul-10 09:14:38.126653 12878:14357 ERROR  TDRV:dmem_alloc_internal                     Failed to alloc DEVICE memory: 45429760
2024-Jul-10 09:14:38.308905 12878:14359 ERROR  TDRV:dma_ring_alloc                          Failed to allocate TX ring
2024-Jul-10 09:14:38.355977 12878:14357 ERROR  TDRV:dml_dump                                Wrote nrt memory alloc debug info to /tmp/nrt_mem_log_device_0_668e50fe.csv
2024-Jul-10 09:14:38.384665 12878:14359 ERROR  TDRV:dma_ring_create_static_rings_for_queue_bundle_instanceFailed to allocate static rx ring for queue qSPPIOParam0_5
2024-Jul-10 09:14:38.431822 12878:14357 ERROR  TDRV:log_dev_mem                             Failed to allocate 43.325MB (usage: dma rings) on ND 0:NC 1, current utilization:
	* total: 15.972GB
	* model code: 482.430MB
	* model constants: 1.518MB
	* tensors: 13.170GB
	* shared scratchpad: 512.000MB
	* runtime: 7.352KB
	* dma rings: 1.471GB
	* collectives: 367.002MB

2024-Jul-10 09:14:38.478367 12878:14359 ERROR  TDRV:io_create_queues                        Failed to create static rings for qSPPIOParam0
2024-Jul-10 09:14:38.658402 12878:14357 ERROR  TDRV:dma_ring_alloc                          Failed to allocate TX ring
2024-Jul-10 09:14:38.699628 12878:14359 ERROR  TDRV:kbl_model_add                           Failed to allocated io queues and dbtc hashes for io
2024-Jul-10 09:14:38.736785 12878:14357 ERROR  TDRV:dma_ring_create_static_rings_for_queue_bundle_instanceFailed to allocate static rx ring for queue qSPPIOParam0_5
2024-Jul-10 09:14:38.827323 12878:14357 ERROR  TDRV:io_create_queues                        Failed to create static rings for qSPPIOParam0
2024-Jul-10 09:14:38.869528 12878:14357 ERROR  TDRV:kbl_model_add                           Failed to allocated io queues and dbtc hashes for io
2024-Jul-10 09:14:39.129413 12878:14359 ERROR  NMGR:dlr_kelf_stage                          Failed to load subgraph
2024-Jul-10 09:14:39.179234 12878:14359 ERROR  NMGR:kmgr_load_nn_internal_v2                Failed to stage graph: kelf-0.json to NeuronCore
2024-Jul-10 09:14:39.224599 12878:14359 ERROR  NMGR:kmgr_load_nn_post_metrics               Failed to load NN: /tmp/ec2-user/neuroncc_compile_workdir/05680919-7207-45a6-a42a-7acb62519043/model.MODULE_06822009809fe50b37aa+334ad6c6.neff, err: 4
2024-Jul-10 09:14:39.301859 12878:14357 ERROR  NMGR:dlr_kelf_stage                          Failed to load subgraph
2024-Jul-10 09:14:39.340985 12878:14357 ERROR  NMGR:kmgr_load_nn_internal_v2                Failed to stage graph: kelf-0.json to NeuronCore
2024-Jul-10 09:14:39.386028 12878:14357 ERROR  NMGR:kmgr_load_nn_post_metrics               Failed to load NN: /tmp/ec2-user/neuroncc_compile_workdir/05680919-7207-45a6-a42a-7acb62519043/model.MODULE_06822009809fe50b37aa+334ad6c6.neff, err: 4
2024-Jul-10 09:14:40.090594 12878:14364 ERROR  TDRV:dmem_alloc_internal                     Failed to alloc DEVICE memory: 45429760
2024-Jul-10 09:14:40.147149 12878:14364 ERROR  TDRV:dml_dump                                Wrote nrt memory alloc debug info to /tmp/nrt_mem_log_device_4_668e5100.csv
2024-Jul-10 09:14:40.202259 12878:14364 ERROR  TDRV:log_dev_mem                             Failed to allocate 43.325MB (usage: dma rings) on ND 4:NC 0, current utilization:
	* total: 15.972GB
	* model code: 482.430MB
	* model constants: 1.518MB
	* tensors: 13.170GB
	* shared scratchpad: 512.000MB
	* runtime: 7.352KB
	* dma rings: 1.471GB
	* collectives: 367.002MB

2024-Jul-10 09:14:40.399826 12878:14364 ERROR  TDRV:dma_ring_alloc                          Failed to allocate TX ring
2024-Jul-10 09:14:40.439652 12878:14364 ERROR  TDRV:dma_ring_create_static_rings_for_queue_bundle_instanceFailed to allocate static rx ring for queue qSPPIOParam0_5
2024-Jul-10 09:14:40.490375 12878:14364 ERROR  TDRV:io_create_queues                        Failed to create static rings for qSPPIOParam0
2024-Jul-10 09:14:40.535327 12878:14364 ERROR  TDRV:kbl_model_add                           Failed to allocated io queues and dbtc hashes for io
2024-Jul-10 09:14:40.707030 12878:14364 ERROR  NMGR:dlr_kelf_stage                          Failed to load subgraph
2024-Jul-10 09:14:40.752697 12878:14364 ERROR  NMGR:kmgr_load_nn_internal_v2                Failed to stage graph: kelf-0.json to NeuronCore
2024-Jul-10 09:14:40.804289 12878:14364 ERROR  NMGR:kmgr_load_nn_post_metrics               Failed to load NN: /tmp/ec2-user/neuroncc_compile_workdir/05680919-7207-45a6-a42a-7acb62519043/model.MODULE_06822009809fe50b37aa+334ad6c6.neff, err: 4
2024-Jul-10 09:14:41.171702 12878:14366 ERROR  TDRV:dmem_alloc_internal                     Failed to alloc DEVICE memory: 45429760
2024-Jul-10 09:14:41.239493 12878:14366 ERROR  TDRV:dml_dump                                Wrote nrt memory alloc debug info to /tmp/nrt_mem_log_device_5_668e5101.csv
2024-Jul-10 09:14:41.309022 12878:14366 ERROR  TDRV:log_dev_mem                             Failed to allocate 43.325MB (usage: dma rings) on ND 5:NC 0, current utilization:
	* total: 15.972GB
	* model code: 482.430MB
	* model constants: 1.518MB
	* tensors: 13.170GB
	* shared scratchpad: 512.000MB
	* runtime: 7.352KB
	* dma rings: 1.471GB
	* collectives: 367.002MB

2024-Jul-10 09:14:41.564347 12878:14366 ERROR  TDRV:dma_ring_alloc                          Failed to allocate TX ring
2024-Jul-10 09:14:41.616024 12878:14366 ERROR  TDRV:dma_ring_create_static_rings_for_queue_bundle_instanceFailed to allocate static rx ring for queue qSPPIOParam0_5
2024-Jul-10 09:14:41.682776 12878:14366 ERROR  TDRV:io_create_queues                        Failed to create static rings for qSPPIOParam0
2024-Jul-10 09:14:41.741836 12878:14366 ERROR  TDRV:kbl_model_add                           Failed to allocated io queues and dbtc hashes for io
2024-Jul-10 09:14:41.935258 12878:14366 ERROR  NMGR:dlr_kelf_stage                          Failed to load subgraph
2024-Jul-10 09:14:41.987894 12878:14366 ERROR  NMGR:kmgr_load_nn_internal_v2                Failed to stage graph: kelf-0.json to NeuronCore
2024-Jul-10 09:14:42.047050 12878:14366 ERROR  NMGR:kmgr_load_nn_post_metrics               Failed to load NN: /tmp/ec2-user/neuroncc_compile_workdir/05680919-7207-45a6-a42a-7acb62519043/model.MODULE_06822009809fe50b37aa+334ad6c6.neff, err: 4
2024-Jul-10 09:14:42.787422 12878:14365 ERROR  TDRV:dmem_alloc_internal                     Failed to alloc DEVICE memory: 45429760
2024-Jul-10 09:14:42.860855 12878:14365 ERROR  TDRV:dml_dump                                Wrote nrt memory alloc debug info to /tmp/nrt_mem_log_device_4_668e5102.csv
2024-Jul-10 09:14:42.935357 12878:14365 ERROR  TDRV:log_dev_mem                             Failed to allocate 43.325MB (usage: dma rings) on ND 4:NC 1, current utilization:
	* total: 15.972GB
	* model code: 482.430MB
	* model constants: 1.518MB
	* tensors: 13.170GB
	* shared scratchpad: 512.000MB
	* runtime: 7.352KB
	* dma rings: 1.471GB
	* collectives: 367.002MB

2024-Jul-10 09:14:43.217854 12878:14365 ERROR  TDRV:dma_ring_alloc                          Failed to allocate TX ring
2024-Jul-10 09:14:43.273898 12878:14365 ERROR  TDRV:dma_ring_create_static_rings_for_queue_bundle_instanceFailed to allocate static rx ring for queue qSPPIOParam0_5
2024-Jul-10 09:14:43.345432 12878:14365 ERROR  TDRV:io_create_queues                        Failed to create static rings for qSPPIOParam0
2024-Jul-10 09:14:43.408332 12878:14365 ERROR  TDRV:kbl_model_add                           Failed to allocated io queues and dbtc hashes for io
2024-Jul-10 09:14:43.571130 12878:14363 ERROR  TDRV:dmem_alloc_internal                     Failed to alloc DEVICE memory: 45429760
2024-Jul-10 09:14:43.632292 12878:14363 ERROR  TDRV:dml_dump                                Wrote nrt memory alloc debug info to /tmp/nrt_mem_log_device_3_668e5103.csv
2024-Jul-10 09:14:43.693735 12878:14363 ERROR  TDRV:log_dev_mem                             Failed to allocate 43.325MB (usage: dma rings) on ND 3:NC 1, current utilization:
	* total: 15.972GB
	* model code: 482.430MB
	* model constants: 1.518MB
	* tensors: 13.170GB
	* shared scratchpad: 512.000MB
	* runtime: 7.352KB
	* dma rings: 1.471GB
	* collectives: 367.002MB

2024-Jul-10 09:14:44.060269 12878:14363 ERROR  TDRV:dma_ring_alloc                          Failed to allocate TX ring
2024-Jul-10 09:14:44.106787 12878:14363 ERROR  TDRV:dma_ring_create_static_rings_for_queue_bundle_instanceFailed to allocate static rx ring for queue qSPPIOParam0_5
2024-Jul-10 09:14:44.167318 12878:14363 ERROR  TDRV:io_create_queues                        Failed to create static rings for qSPPIOParam0
2024-Jul-10 09:14:44.220394 12878:14363 ERROR  TDRV:kbl_model_add                           Failed to allocated io queues and dbtc hashes for io
2024-Jul-10 09:14:44.283276 12878:14365 ERROR  NMGR:dlr_kelf_stage                          Failed to load subgraph
2024-Jul-10 09:14:44.329275 12878:14365 ERROR  NMGR:kmgr_load_nn_internal_v2                Failed to stage graph: kelf-0.json to NeuronCore
2024-Jul-10 09:14:44.382803 12878:14365 ERROR  NMGR:kmgr_load_nn_post_metrics               Failed to load NN: /tmp/ec2-user/neuroncc_compile_workdir/05680919-7207-45a6-a42a-7acb62519043/model.MODULE_06822009809fe50b37aa+334ad6c6.neff, err: 4
2024-Jul-10 09:14:44.590002 12878:14363 ERROR  NMGR:dlr_kelf_stage                          Failed to load subgraph
2024-Jul-10 09:14:44.649774 12878:14363 ERROR  NMGR:kmgr_load_nn_internal_v2                Failed to stage graph: kelf-0.json to NeuronCore
2024-Jul-10 09:14:44.714672 12878:14363 ERROR  NMGR:kmgr_load_nn_post_metrics               Failed to load NN: /tmp/ec2-user/neuroncc_compile_workdir/05680919-7207-45a6-a42a-7acb62519043/model.MODULE_06822009809fe50b37aa+334ad6c6.neff, err: 4
2024-Jul-10 09:14:47.587886 12878:14367 ERROR  TDRV:dmem_alloc_internal                     Failed to alloc DEVICE memory: 45429760
2024-Jul-10 09:14:47.629812 12878:14358 ERROR  TDRV:dmem_alloc_internal                     Failed to alloc DEVICE memory: 45429760
2024-Jul-10 09:14:47.668107 12878:14367 ERROR  TDRV:dml_dump                                Wrote nrt memory alloc debug info to /tmp/nrt_mem_log_device_5_668e5107.csv
2024-Jul-10 09:14:47.733447 12878:14358 ERROR  TDRV:dml_dump                                Wrote nrt memory alloc debug info to /tmp/nrt_mem_log_device_1_668e5107.csv
2024-Jul-10 09:14:47.765276 12878:14367 ERROR  TDRV:log_dev_mem                             Failed to allocate 43.325MB (usage: dma rings) on ND 5:NC 1, current utilization:
	* total: 15.972GB
	* model code: 482.430MB
	* model constants: 1.518MB
	* tensors: 13.170GB
	* shared scratchpad: 512.000MB
	* runtime: 7.352KB
	* dma rings: 1.471GB
	* collectives: 367.002MB

2024-Jul-10 09:14:48.089815 12878:14367 ERROR  TDRV:dma_ring_alloc                          Failed to allocate TX ring
2024-Jul-10 09:14:48.089868 12878:14358 ERROR  TDRV:log_dev_mem                             Failed to allocate 43.325MB (usage: dma rings) on ND 1:NC 0, current utilization:
	* total: 15.972GB
	* model code: 482.430MB
	* model constants: 1.518MB
	* tensors: 13.170GB
	* shared scratchpad: 512.000MB
	* runtime: 7.352KB
	* dma rings: 1.471GB
	* collectives: 367.002MB

2024-Jul-10 09:14:48.129966 12878:14367 ERROR  TDRV:dma_ring_create_static_rings_for_queue_bundle_instanceFailed to allocate static rx ring for queue qSPPIOParam0_5
2024-Jul-10 09:14:48.416932 12878:14358 ERROR  TDRV:dma_ring_alloc                          Failed to allocate TX ring
2024-Jul-10 09:14:48.467853 12878:14367 ERROR  TDRV:io_create_queues                        Failed to create static rings for qSPPIOParam0
2024-Jul-10 09:14:48.526637 12878:14358 ERROR  TDRV:dma_ring_create_static_rings_for_queue_bundle_instanceFailed to allocate static rx ring for queue qSPPIOParam0_5
2024-Jul-10 09:14:48.571499 12878:14367 ERROR  TDRV:kbl_model_add                           Failed to allocated io queues and dbtc hashes for io
2024-Jul-10 09:14:48.644401 12878:14358 ERROR  TDRV:io_create_queues                        Failed to create static rings for qSPPIOParam0
2024-Jul-10 09:14:48.775574 12878:14358 ERROR  TDRV:kbl_model_add                           Failed to allocated io queues and dbtc hashes for io
2024-Jul-10 09:14:48.981758 12878:14367 ERROR  NMGR:dlr_kelf_stage                          Failed to load subgraph
2024-Jul-10 09:14:49.043694 12878:14367 ERROR  NMGR:kmgr_load_nn_internal_v2                Failed to stage graph: kelf-0.json to NeuronCore
2024-Jul-10 09:14:49.113538 12878:14367 ERROR  NMGR:kmgr_load_nn_post_metrics               Failed to load NN: /tmp/ec2-user/neuroncc_compile_workdir/05680919-7207-45a6-a42a-7acb62519043/model.MODULE_06822009809fe50b37aa+334ad6c6.neff, err: 4
2024-Jul-10 09:14:49.287849 12878:14358 ERROR  NMGR:dlr_kelf_stage                          Failed to load subgraph
2024-Jul-10 09:14:49.343165 12878:14358 ERROR  NMGR:kmgr_load_nn_internal_v2                Failed to stage graph: kelf-0.json to NeuronCore
2024-Jul-10 09:14:49.552911 12878:14358 ERROR  NMGR:kmgr_load_nn_post_metrics               Failed to load NN: /tmp/ec2-user/neuroncc_compile_workdir/05680919-7207-45a6-a42a-7acb62519043/model.MODULE_06822009809fe50b37aa+334ad6c6.neff, err: 4

However, as mentioned in this notebook:

The memory required to host any model can be computed with:

total memory = bytes per parameter * number of parameters

When using float16 casted weights for a 8 billion parameter model, this works out to 2 * 70B or ~140GB of weights. In reality, the total space required is often greater than just the number of parameters due to caching attention layer projections (KV caching). This caching mechanism grows memory allocations linearly with sequence length and batch size.

To get very large language models to fit on Inf2 & Trn1, tensor parallelism is used to split weights, data, and compute across multiple NeuronCores. The number of NeuronCores that the weights are split across can be controlled by setting the tp_degree parameter. This parallelism degree must be chosen to ensure that the memory usage per NeuronCore will be less than the physical 16GB limit. When configuring tensor parallelism, the memory per NeuronCore can be computed with:

memory per core = (bytes per parameter * number of parameters) / tp_degree

This can be used to compute the minimum instance sizing by ensuring that the value selected for tp_degree results in less than 16GB allocated per NeuronCore.

Using this formula, plugging in the tp_degree I am using (12), I obtain the memory per core as 2 * 70B / 12 which is 140GB / 12 which is less than 12GB per core! I cannot figure out why the cores are running out of memory in this case.

The text was updated successfully, but these errors were encountered:

jeffhataws · 2024-07-10T21:26:25Z

Hi @sangraamp , thanks for filing the issue. The example is verified to work with TP=24. For TP=12, while the formula can give a rough back-of-envelop estimate, there are additional memory usage from runtime and compiler that would increase the memory usage above just the parameters and KV cache. You can try limiting the number of buckets by setting n_positions to an explicit list (bucketing info):

Instead of:

neuron_model = LlamaForSampling.from_pretrained(model_id, neuron_config=neuron_config, batch_size=1, tp_degree=12, amp='f16', n_positions=2048)

Making n_positions a list will force a single bucket to be compiled:
neuron_model = LlamaForSampling.from_pretrained(model_id, neuron_config=neuron_config, batch_size=1, tp_degree=12, amp='f16', n_positions=[2048])

sangraamp · 2024-07-11T08:06:45Z

@jeffhataws Thanks a lot for the workaround! Can successfully load the model in now. The issue I'm facing now is quite weird, the original issue of loading the model in has been resolved, though.
When I try to generate output using this code (for Llama 3 70B Instruct):

from transformers import AutoConfig
from transformers_neuronx import HuggingFaceGenerationModelAdapter

# Use the `HuggingFaceGenerationModelAdapter` to access the generate API
config = AutoConfig.from_pretrained(model_id)
neuron_model = HuggingFaceGenerationModelAdapter(config, neuron_model)

# Construct a tokenizer and encode prompt text
tokenizer = AutoTokenizer.from_pretrained(model_id)
prompt = "What is loss mitigation<|eot_id|>"
encoded_input = tokenizer(prompt, return_tensors="pt")
print(encoded_input)
# Convert the tensor to a list
input_ids_list = encoded_input.input_ids.tolist()[0]
# Decode the input_ids
decoded_prompt = tokenizer.decode(input_ids_list, skip_special_tokens=True)
print(decoded_prompt)

# Run inference using temperature
with torch.inference_mode():
    neuron_model.reset_generation()
    generated_sequence = neuron_model.generate(
        input_ids=encoded_input.input_ids,
        attention_mask=encoded_input.attention_mask,
        do_sample=True,
        max_length=256,
    )

print([tokenizer.decode(tok) for tok in generated_sequence])

Here is the output/gibberish I get:

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
{'input_ids': tensor([[128000,   3923,    374,   4814,  66860, 128009]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1]])}
What is loss mitigation
['<|begin_of_text|>What is loss mitigation<|eot_id|> P (...\n Isaac\n Pggergger alantlrogglse Zend/etc Suff Zend Prve\xadt muelselseISStae\nbé GazEMONlez Zendlselse.gblseemodendx Harleybrulech Lucky wonders muecacrameslselechtaebéISSlech.gb NoirlechEMON399.gblse\nndxlech ultralech muelechbru399lezbé gcуlezBrunramesbélselech\nlezaley bru&Rlse&Rtaemaelechrames gsi wondersbé_gc vice GC wonders wondersaleyGXramestaeBrun\nlechbéndxbrulechlse Burnettlech dereg gc wonders ultracacndxGXmaeлон\r\nndx muerames_sqrtndxлонlez muetaehaarGXrn gcíslsendxlezndx_sqrt ultra.compatлон\nmaealeyvrd HarleyzoslechuratlezBrunISSlez_gcheimer\n Brunlezemodetaelez Harley_gctae dereglezlezlse wonders Harley.gblezixelechlech\nbélechBrun\n gcbérameslezlezBrunlez urn\nndx Harleyvrdлонzoslez Harleyndxrnlechzoslse.gb Harley.gb wonders_gc mueBrunmae\n_sqrtGX_gcDaemontaebru gc GC airyBrunhaar Harley mue_gcndx Harleyrames;;;;;; Harley Harley\ntaern gcuratлонlech Harleytaelech\n.gshaar399ixetaebé.gblez Gord']

When I run the same code with Llama 3 8B Instruct, it gives coherent outputs:

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
{'input_ids': tensor([[128000,   3923,    374,   4814,  66860, 128009]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1]])}
What is loss mitigation
['<|begin_of_text|>What is loss mitigation<|eot_id|>igma\nWhat is loss mitigation?\nLoss mitigation is a process used in real estate investing and financing to reduce the risk of loss on a property. It involves taking steps to reduce the potential financial harm associated with a property that may be experiencing financial difficulties, such as a distressed property or a property that is facing a high risk of default.\nLoss mitigation strategies are commonly used by loan servicers, lenders, and property owners to minimize the financial damage caused by a property that is not producing the expected income or has other financial challenges. These strategies can include:\n1. Loan modifications: Modifying the terms of a loan to reduce the monthly mortgage payment or to extend the loan period.\n2. Short sales: Selling a property for less than the outstanding mortgage balance, with the lender agreeing to forgive the remaining debt.\n3. Foreclosure alternatives: Offering alternatives to foreclosure, such as a deed in lieu of foreclosure or a mortgage release.\n4. REO acquisition: Acquiring a property that is in default or has been foreclosed and selling it to a new buyer.\n5. Lease options: Allowing the tenant to purchase the property by making lease payments that can be applied to the purchase price.\n6. Market value appraisal: Appraising the property to determine']

I can't seem to figure out what is going wrong here.

jeffhataws · 2024-07-12T21:55:25Z

@sangraamp thanks for reporting back and glad that you have made progress. I agree that the LLam3 70B output is not expected and will investigate. One thing you can try is to match the flag recommended in the example by just having --enable-mixed-precision-accumulation and not include -O1.

sangraamp · 2024-07-16T10:42:19Z

@jeffhataws Thanks, tried removing the flag but still facing the same issue.

jeffhataws · 2024-07-16T15:53:00Z

Thank you @sangraamp. We have reproduced the issue with TP=12 and will be looking into it.

sangraamp · 2024-07-22T18:27:25Z

That's great, @jeffhataws! Hoping to hear from you soon regarding any steps I'll need to take, if necessary, to not encounter this error.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Not able to load llama 3 70b on inf2.24xlarge instance #92

Not able to load llama 3 70b on inf2.24xlarge instance #92

sangraamp commented Jul 10, 2024

jeffhataws commented Jul 10, 2024

sangraamp commented Jul 11, 2024 •

edited

Loading

jeffhataws commented Jul 12, 2024

sangraamp commented Jul 16, 2024

jeffhataws commented Jul 16, 2024

sangraamp commented Jul 22, 2024

Not able to load llama 3 70b on inf2.24xlarge instance #92

Not able to load llama 3 70b on inf2.24xlarge instance #92

Comments

sangraamp commented Jul 10, 2024

jeffhataws commented Jul 10, 2024

sangraamp commented Jul 11, 2024 • edited Loading

jeffhataws commented Jul 12, 2024

sangraamp commented Jul 16, 2024

jeffhataws commented Jul 16, 2024

sangraamp commented Jul 22, 2024

sangraamp commented Jul 11, 2024 •

edited

Loading