-
I think it's quite possible to have an mmap mode, but we would still need to allocate memory to use the weights once they are needed, and if the model is too big it will still swap. As far as I know there is no easy way to avoid that. So right now we have lazy loading from a file path, which may effectively achieve the same goal: we don't mmap the model, we just read the weights into memory at the time they are needed.
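To make the lazy-loading point concrete, here is a minimal sketch, assuming the usual MLX behavior that `mx.load` returns lazily evaluated arrays whose data is only read once the arrays are actually used (the file name is a placeholder):

```python
import mlx.core as mx

# Sketch only: "model.safetensors" is a placeholder path.
# mx.load returns MLX arrays; with MLX's lazy evaluation the underlying
# data does not have to be read from disk at this point.
weights = mx.load("model.safetensors")

# Forcing evaluation is what materializes the data in memory, which is
# why weights are effectively read "at the time they are needed".
mx.eval(weights)
```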
-
Thanks for the response.
Regardless of the original question, this is good to know!
Is this done by default when loading a model? I.e., if I do `model, tokenizer = load("mlx-community/Nous-Hermes-2-Mixtral-8x7B-DPO-4bit")` and then `response = generate(model, tokenizer, prompt="hello", verbose=True)`, is it already lazy-loading the weights from the file on demand?
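For completeness, the question's snippet as a self-contained example, assuming the `mlx_lm` package, whose `load`/`generate` helpers the snippet appears to use:

```python
from mlx_lm import load, generate

# Load the model and tokenizer from the Hugging Face repo named in the question.
model, tokenizer = load("mlx-community/Nous-Hermes-2-Mixtral-8x7B-DPO-4bit")

# Generate a completion; this is where the weights actually get used.
response = generate(model, tokenizer, prompt="hello", verbose=True)
```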
-
Hi, first, thank you so much for the wonderful work on `mlx`!

Are there any plans to support memory-mapping models from disk?

When I try to load a model that's larger than the memory I have available, it uses `swap` memory anyway, so maybe allowing to `mmap` the model directly would be more convenient and allow for faster loading (i.e., faster script startup and faster time-to-first-token)? It would also make it more efficient to run multiple processes that read the same model file?

I'm putting those as questions because I'm not super familiar with the unified memory model of the new Macs and how it interacts with the disk / GPU memory / etc. So maybe what I'm saying doesn't make a lot of sense.
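For readers unfamiliar with memory-mapping: the idea is to let the OS page the file in on demand and share those pages across processes, rather than copying the whole file into each process. A generic, hypothetical illustration with NumPy (not an mlx API; the file name and dtype are made up):

```python
import numpy as np

# Hypothetical example: map a raw float16 weight file instead of reading it.
# The OS only pages in the regions that are actually touched, and multiple
# processes mapping the same file share those pages via the page cache.
weights = np.memmap("weights.bin", dtype=np.float16, mode="r")

# Copying a slice forces that region to be paged in from disk (or cache).
first_block = np.array(weights[:1024])
```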