
Build failure on Jetson Orin #2004

Closed
malv-c opened this issue Jun 26, 2023 · 8 comments

malv-c commented Jun 26, 2023

Both llama.cpp, built with: % cmake .. -DLLAMA_CUBLAS=ON -DLLAMA_CUDA_DMMV_F16=ON -DLLAMA_CUDA_DMMV_Y=16
and koboldcpp, built with: % cmake .. -DLLAMA_CUBLAS=1
fail with: ggml.h(218): error: identifier "__fp16" is undefined
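
For context, the identifier being rejected comes from the 16-bit float typedef in ggml.h, which at the time looked roughly like the following (a sketch; the exact line number and comments may differ between checkouts):

// ggml.h, around line 218 (sketch, not the exact upstream source)
#ifdef __ARM_NEON
// on ARM with NEON, ggml uses the compiler's native 16-bit float type
typedef __fp16 ggml_fp16_t;
#else
typedef uint16_t ggml_fp16_t;
#endif

__fp16 is an Arm C language extension provided by gcc/clang; the error message suggests the nvcc front end used for the CUDA build does not accept it here.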

@manbehindthemadness

I came here just for this:
exact same problem on an AGX Orin (JetPack 5.1.1, L4T 35.3.1).

/usr/src/llama.cpp/ggml.h(218): error: identifier "__fp16" is undefined

@manbehindthemadness

Ahhhh, Cortex ARMv8+ processors no longer support NEON; the library must be fully 64-bit. They can support 32-bit, but only when running within a 32-bit operating system / kernel.

@manbehindthemadness

@malv-c If you replace __fp16 with uint16_t on line 218 of ggml.h, the project builds and cuBLAS works without issue.
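
As a sketch, that workaround amounts to editing the typedef shown earlier so that nvcc never sees the ARM-specific __fp16 keyword (at the cost of also disabling the native fp16 type for the regular compiler):

// ggml.h, workaround sketch
// was: typedef __fp16 ggml_fp16_t;
typedef uint16_t ggml_fp16_t;  // store fp16 values as raw 16-bit integers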

@manbehindthemadness

Even though this builds successfully, it does seem to be attempting to use NEON; I am unsure whether this will have a performance impact...

llama.cpp: loading model from /opt/gpt-models/vicuna-7b-1.1.ggmlv3.q8_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 7 (mostly Q8_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required  = 1924.88 MB (+ 1026.00 MB per state)
llama_model_load_internal: allocating batch_size x 1 MB = 512 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 32 repeating layers to GPU
llama_model_load_internal: offloading non-repeating layers to GPU
llama_model_load_internal: offloading v cache to GPU
llama_model_load_internal: offloading k cache to GPU
llama_model_load_internal: offloaded 35/35 layers to GPU
llama_model_load_internal: total VRAM used: 8234 MB
llama_new_context_with_model: kv self size  =  256.00 MB
AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 
llama_print_timings:        load time =   593.09 ms

swittk (Contributor) commented Jun 26, 2023

Does this thread help? #1455

@manbehindthemadness

Oh! This here looks like it might be the silver bullet: #1455 (comment)
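
(For anyone landing here later: one illustrative shape of that kind of fix, an assumption and not necessarily the exact patch linked above, is to keep __fp16 for ordinary ARM compilers and fall back to uint16_t only when nvcc, which defines __CUDACC__, is compiling the header:)

// illustrative guard (assumption, not necessarily the linked patch)
#if defined(__ARM_NEON) && !defined(__CUDACC__)
typedef __fp16 ggml_fp16_t;   // native 16-bit float for gcc/clang on ARM
#else
typedef uint16_t ggml_fp16_t; // raw 16-bit storage when nvcc compiles ggml.h
#endif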

malv-c (Author) commented Jun 27, 2023 via email

This issue was closed because it has been inactive for 14 days since being marked as stale.
