
Segmentation Fault Error "not enough space in the context's memory pool" #52

Closed
cannin opened this issue Mar 12, 2023 · 22 comments
Labels: bug, need more info, stale

Comments

@cannin

cannin commented Mar 12, 2023

This prompt with the 65B model on an M1 Max (64 GB) results in a segmentation fault. It works with the 30B model. Are there problems with longer prompts? Related to #12

./main --model ./models/65B/ggml-model-q4_0.bin --prompt "You are a question answering bot that is able to answer questions about the world. You are extremely smart, knowledgeable, capable, and helpful. You always give complete, accurate, and very detailed responses to questions, and never stop a response in mid-sentence or mid-thought. You answer questions in the following format:

Question: What’s the history of bullfighting in Spain?

Answer: Bullfighting, also known as "tauromachia," has a long and storied history in Spain, with roots that can be traced back to ancient civilizations. The sport is believed to have originated in 7th-century BCE Iberian Peninsula as a form of animal worship, and it evolved over time to become a sport and form of entertainment. Bullfighting as it is known today became popular in Spain in the 17th and 18th centuries. During this time, the sport was heavily influenced by the traditions of medieval jousts and was performed by nobles and other members of the upper classes. Over time, bullfighting became more democratized and was performed by people from all walks of life. Bullfighting reached the height of its popularity in the 19th and early 20th centuries and was considered a national symbol of Spain. However, in recent decades, bullfighting has faced increasing opposition from animal rights activists, and its popularity has declined. Some regions of Spain have banned bullfighting, while others continue to hold bullfights as a cherished tradition. Despite its declining popularity, bullfighting remains an important part of Spanish culture and history, and it continues to be performed in many parts of the country to this day.

Now complete the following questions:

Question: What happened to the field of cybernetics in the 1970s?

Answer: "

Results in

...
sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000


You are a question answering bot that is able to answer questions about the world. You are extremely smart, knowledgeable, capable, and helpful. You always give complete, accurate, and very detailed responses to questions, and never stop a response in mid-sentence or mid-thought. You answer questions in the following format:

Question: What’s the history of bullfighting in Spain?

Answer: Bullfighting, also known as tauromachia, has a long and storied history in Spain, with roots that can be traced back to ancient civilizations. The sport is believed to have originated in 7th-century BCE Iberian Peninsula as a form of animal worship, and it evolved over time to become a sport and form of entertainment. Bullfighting as it is known today became popular in Spain in the 17th and 18th centuries. During this time, the sport was heavily influenced by the traditions of medieval jousts and was performed by nobles and other members of the upper classes. Over time, bullfighting became more democratized and was performed by people from all walks of life. Bullfighting reached the height of its popularity in the 19th and early 20th centuries and was considered a national symbol of Spain. However, in recent decades, bullfighting has faced increasing opposition from animal rights activists, and its popularity has declined. Some regions of Spain have banned bullfighting, while others continue to hold bullfights as a cherished tradition. Despite its declining popularity, bullfighting remainsggml_new_tensor_impl: not enough space in the context's memory pool (needed 701660720, available 700585498)
zsh: segmentation fault  ./main --model ./models/65B/ggml-model-q4_0.bin --prompt
@gjmulder
Collaborator

Are you running out of memory?

gjmulder added the "need more info" label Mar 15, 2023
@Gobz

Gobz commented Mar 16, 2023

I experience this as well, and I always have 5-6 GB of RAM free when it occurs and around 20 GB of swap.
It appears to be a known problem with memory allocation, based on ggerganov's comments in #71.

@Green-Sky
Collaborator

potentially fixed by #213

@edwios

edwios commented Mar 24, 2023

The latest commit b6b268d gives a segmentation fault right away, without even dropping into the input prompt. This was run on a Mac M1 Max with 64 GB RAM. The crash happened with the 30B LLaMA model but not with 7B. It was working fine even with the 65B model before this commit.

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.
 - If you want to submit another line, end your input in '\'.

 
Text transcript of a never ending dialog, where User interacts with an AI assistant named ChatLLaMa.
ChatLLaMa is helpful, kind, honest, friendly, good at writing and never fails to answer User’s requests immediately and with details and precision.
There are no annotations like (30 seconds passed...) or (to himself), just what User and ChatLLaMa say aloud to each other.
The dialog lasts for years, the entirety of it is shared below. It's 1000ggml_new_tensor_impl: not enough space in the context's memory pool (needed 536987232, available 536870912)
./chatLLaMa: line 53: 99012 Segmentation fault: 11  ./main $GEN_OPTIONS --model "$MODEL" --threads "$N_THREAD" --n_predict "$N_PREDICTS" --color --interactive --reverse-prompt "${USER_NAME}:" --prompt "

@Green-Sky
Collaborator

@edwios try 404e1da (the one before 483bab2), or try my PR #438 (closed since gg is going to do it differently, but it should still work until then).

@edwios

edwios commented Mar 24, 2023

The last known-good commit I have just tested was indeed 404e1da.

@ggerganov
Owner

What error do you get with 483bab2?

@edwios

edwios commented Mar 24, 2023

Same, ./chatLLaMa: line 53: 99012 Segmentation fault: 11 ./main $GEN_OPTIONS --model "$MODEL" --threads "$N_THREAD" --n_predict "$N_PREDICTS" --color --interactive --reverse-prompt "${USER_NAME}:" --prompt "

main-2023-03-24-155839.ips.zip

@Green-Sky
Collaborator

ggml_new_tensor_impl: not enough space in the context's memory pool (needed 536987232, available 536870912)
Segmentation fault (core dumped)

Just a few batches in (30B q4_1; I just fed it a large file with -f).

@edwios

edwios commented Mar 25, 2023

Yippy! Commit 2a2e63c did fix the issue beautifully! Thank you!!

@eshaanagarwal

Hi, I am facing an out-of-memory error for the context while using the GPT4All 1.3 Groovy model on a machine with 32 CPUs and 512 GB RAM, using CPU inference.

@dukeeagle

Bumping @eshaanagarwal's comment! I am facing the same issue.

@sw
Contributor

sw commented Jul 10, 2023

Did it work for you with commit 2a2e63c and can you narrow down the commit that broke it?

In #1237, I changed some size_t parameters to int; I'm now worried that may be the culprit. This was done because the dequantize functions already used int for the number of elements.
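
As a rough illustration of that concern (this is not the actual ggml code, and the tensor size below is made up): a byte count that fits comfortably in size_t can wrap to a negative value once it is narrowed to a 32-bit int.

import numpy as np

n_elements = 70_000 * 8_192   # hypothetical element count for a large intermediate tensor
n_bytes = n_elements * 4      # float32 bytes: 2_293_760_000, which exceeds INT_MAX (2_147_483_647)

# A C-style narrowing cast to 32 bits silently wraps the value into negative territory.
wrapped = int(np.array([n_bytes], dtype=np.int64).astype(np.int32)[0])
print(n_bytes, "->", wrapped)

This only illustrates the suspicion above; it is not a confirmed diagnosis.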

@Uralstech

I am getting the same error: ggml_new_tensor_impl: not enough space in the context's memory pool (needed 20976224, available 12582912). I see that this has been a problem since March 12th.

I am using llama-2-13b-chat.ggmlv3.q3_K_S.bin from TheBloke on Google Cloud Run with 32 GB RAM and 8 vCPUs. The service is using llama-cpp-python.

I'm quite new to llama.cpp, so excuse any mistakes. This is the relevant part of my script:

app: FastAPI = FastAPI(title=APP_NAME, version=APP_VERSION)
llama: Llama = Llama(model_path=MODEL_PATH, n_ctx=4096, n_batch=2048, n_threads=cpu_count())

response_model: Type[BaseModel] = model_from_typed_dict(ChatCompletion)

# APP FUNCTIONS

@app.post("/api/chat", response_model=response_model)
async def chat(request: ChatCompletionsRequest) -> Union[ChatCompletion, EventSourceResponse]:
    print("Chat-completion request received!")

    completion_or_chunks: Union[ChatCompletion, Iterator[ChatCompletionChunk]] = llama.create_chat_completion(**request.dict(), max_tokens=4096)
    completion: ChatCompletion = completion_or_chunks

    print("Sending completion!")
    return completion
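
A side note on the configuration above: with n_ctx=4096 and max_tokens=4096, a long chat prompt plus the requested completion can exceed the context window. Below is a rough sketch of capping max_tokens to whatever room is left; remaining_budget and its headroom are illustrative helpers, assuming llama-cpp-python's tokenize() and n_ctx().

def remaining_budget(llm, messages, headroom: int = 256) -> int:
    # Count the prompt tokens and leave some headroom for the chat template.
    text = "\n".join(m["content"] for m in messages)
    n_prompt = len(llm.tokenize(text.encode("utf-8")))
    return max(16, llm.n_ctx() - n_prompt - headroom)

# In the handler, something along the lines of:
#   capped = remaining_budget(llama, request.dict().get("messages", []))
#   completion = llama.create_chat_completion(**request.dict(), max_tokens=capped)

This does not address the underlying allocation error, but it rules out one easy way to overrun the context.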

@anujcb

anujcb commented Aug 14, 2023

I am getting this with Wizard-Vicuna-7B-Uncensored.ggmlv3.q4_0.bin, only when I send in embeddings from vector DB search results. Inference without the retriever works fine, without this issue. I will try the regular Llama 2 and see what happens. This is where it happens: in ConversationalRetrievalChain from langchain.chains.

(screenshot)

@anujcb

anujcb commented Aug 14, 2023

OK, I was able to make it work by reducing the number of docs to 1; any value above 1 throws the memory access violation.

(screenshot)

@slaren
Collaborator

slaren commented Aug 14, 2023

It would really help to diagnose this if you are able to reproduce it with one of the examples in this repository. If that's not possible, I would suggest looking into what parameters are being passed to llama_eval. This could happen if n_tokens is higher than n_batch, or if n_tokens + n_past is higher than n_ctx.
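
A minimal sketch of those checks, written against llama-cpp-python since that is what the recent reports here use (the model path and sizes are illustrative):

from llama_cpp import Llama

llm = Llama(model_path="./models/7B/ggml-model-q4_0.bin", n_ctx=2048, n_batch=512)

prompt = "Question: What are the risk factors for Tesla?\nAnswer:"
tokens = llm.tokenize(prompt.encode("utf-8"))
n_past = 0  # tokens already evaluated in this context (0 for a fresh prompt)

# The whole sequence must fit inside the context window; the high-level API splits
# the prompt into n_batch-sized chunks, but a manual eval loop must also keep each
# call's n_tokens <= n_batch.
assert n_past + len(tokens) <= llm.n_ctx(), "n_tokens + n_past exceeds n_ctx"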

@anujcb

anujcb commented Aug 14, 2023

I think the issue may be because of the special characters in the context. This was the context sent to the LLM to generate from:
(screenshot)
I debugged it and intercepted the call before this text was sent to the LLM. I copy-pasted it into TextPad to clean out the special characters, and it seemed to work.
(screenshot)

The context is produced from a vector DB containing chunks of Tesla's 10-K filings for the last 4 years. It looks like when the chunking was done, the special characters got into the vector DB, and the LLM was not able to process the special characters.

The prompt that went with the context was "what are the risk factors for Tesla?"

@anujcb

anujcb commented Aug 15, 2023

binary_path: F:\ProgramData\Anaconda3\envs\scrapalot-research-assistant\lib\site-packages\bitsandbytes\cuda_setup\libbitsandbytes_cuda116.dll
CUDA SETUP: Loading binary F:\ProgramData\Anaconda3\envs\scrapalot-research-assistant\lib\site-packages\bitsandbytes\cuda_setup\libbitsandbytes_cuda116.dll...
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce GTX 1070, compute capability 6.1
INFO: Started server process [24636]
INFO: Waiting for application startup.
llama.cpp: loading model from ./../llama.cpp/models/Vicuna/Wizard-Vicuna-7B-Uncensored.ggmlv3.q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 4096
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_head_kv = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: n_gqa = 1
llama_model_load_internal: rnorm_eps = 1.0e-06
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.08 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 596.40 MB (+ 2048.00 MB per state)
llama_model_load_internal: allocating batch_size x (512 kB + n_ctx x 128 B) = 512 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 32 repeating layers to GPU
llama_model_load_internal: offloading non-repeating layers to GPU
llama_model_load_internal: offloading v cache to GPU
llama_model_load_internal: offloading k cache to GPU
llama_model_load_internal: offloaded 35/35 layers to GPU
llama_model_load_internal: total VRAM used: 6106 MB
llama_new_context_with_model: kv self size = 2048.00 MB
AVX = 1 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |

@anujcb

anujcb commented Aug 15, 2023

(Quoting my earlier comment about the special characters in the context.)

UPDATE: After extensive testing, I have concluded that this is not caused by the special characters; it is caused by the amount of text being sent as context. I can comfortably send about 2,000 characters (2 KB) without this memory issue, sometimes even more (I think this depends on how much memory I have free... maybe).

I am using CUDA with an old GPU and an older processor (AVX2 = 0), and 32 GB of memory.
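
If the amount of retrieved text is the trigger, one workaround is to trim the retrieved chunks by token count rather than character count before building the prompt. This is only a sketch (it assumes llama-cpp-python, and the budget numbers are illustrative), not a fix for the underlying allocation error:

from llama_cpp import Llama

llm = Llama(model_path="./models/Vicuna/Wizard-Vicuna-7B-Uncensored.ggmlv3.q4_0.bin", n_ctx=4096)

def build_prompt(chunks, question, max_new_tokens=512):
    # Reserve room for the answer, the question, and a little template headroom.
    budget = llm.n_ctx() - max_new_tokens - len(llm.tokenize(question.encode("utf-8"))) - 64
    kept, used = [], 0
    for chunk in chunks:
        n = len(llm.tokenize(chunk.encode("utf-8")))
        if used + n > budget:
            break
        kept.append(chunk)
        used += n
    context = "\n\n".join(kept)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"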

@anujcb

anujcb commented Aug 15, 2023

(Quoting slaren's suggestion above about reproducing the issue with one of the examples in this repository and checking the parameters passed to llama_eval.)

Is there an example where I can send in a context and the prompt?

Contributor

github-actions bot commented Apr 9, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions bot closed this as completed Apr 9, 2024