-
I think it's quite possible to have an mmap mode, but we would still need to allocate memory to use the weights once they are needed, and if the model is too big it will still swap. As far as I know there is no easy way to avoid that. So right now we have lazy loading from a file path, which may effectively achieve the same goal: we don't mmap the model, we just read the weights into memory at the time they are needed.
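To make the lazy-loading point concrete, here is a minimal sketch, assuming the usual MLX behavior that `mx.load` returns lazily evaluated arrays whose data is only read once the arrays are actually used (the file name is a placeholder):

```python
import mlx.core as mx

# Sketch only: "model.safetensors" is a placeholder path.
# mx.load returns MLX arrays; with MLX's lazy evaluation the underlying
# data does not have to be read from disk at this point.
weights = mx.load("model.safetensors")

# Forcing evaluation is what materializes the data in memory, which is
# why weights are effectively read "at the time they are needed".
mx.eval(weights)
```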
-
Thanks for the response.
Regardless of the original question, this is good to know!
Is this done by default when loading a model? I.e., if I do `model, tokenizer = load("mlx-community/Nous-Hermes-2-Mixtral-8x7B-DPO-4bit")` and then `response = generate(model, tokenizer, prompt="hello", verbose=True)`, is it already lazy-loading the weights from the file on demand?
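For completeness, the question's snippet as a self-contained example, assuming the `mlx_lm` package, whose `load`/`generate` helpers the snippet appears to use:

```python
from mlx_lm import load, generate

# Load the model and tokenizer from the Hugging Face repo named in the question.
model, tokenizer = load("mlx-community/Nous-Hermes-2-Mixtral-8x7B-DPO-4bit")

# Generate a completion; this is where the weights actually get used.
response = generate(model, tokenizer, prompt="hello", verbose=True)
```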
-
Hi, first, thank you so much for the wonderful work on `mlx`!

Are there any plans to support memory-mapping models from disk?

When I try to load a model that's larger than the memory I have available, it uses `swap` memory anyway, so maybe allowing to `mmap` the model directly would be more convenient and allow for faster loading (i.e., faster script startup and faster time-to-first-token)? It would also make it more efficient to run multiple processes that read the same model file?

I'm putting those as questions because I'm not super familiar with the unified memory model of the new Macs and how it interacts with the disk / GPU memory / etc. So maybe what I'm saying doesn't make a lot of sense.
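For readers unfamiliar with memory-mapping: the idea is to let the OS page the file in on demand and share those pages across processes, rather than copying the whole file into each process. A generic, hypothetical illustration with NumPy (not an mlx API; the file name and dtype are made up):

```python
import numpy as np

# Hypothetical example: map a raw float16 weight file instead of reading it.
# The OS only pages in the regions that are actually touched, and multiple
# processes mapping the same file share those pages via the page cache.
weights = np.memmap("weights.bin", dtype=np.float16, mode="r")

# Copying a slice forces that region to be paged in from disk (or cache).
first_block = np.array(weights[:1024])
```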