Training on M1 "MPS" #28
I take it back. Seems like these are 8 x 40 GB systems. There is a good paper on cramming [Cramming: Training a Language Model on a Single GPU in One Day]; I thought some work along these lines was done here as well.
Actually I think this issue is great to keep open, in case anyone investigates nanoGPT in …
What is the actual memory requirement? Will a Mac Studio with 128 GB RAM be sufficient for training?
Refining the above comment slightly: do you currently have any (rough is fine) estimates of the relative sizes of the memory footprint for just the model parameters, the params plus forward activations as a function of batch size, and the backward graph as a function of batch size, on the 8xA100 40GB configuration? Where does it peak across the server during training? That might start to inform people on how to lay this out on the resources they have.
Also relevant for inference.
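In the absence of measured numbers, here is a back-of-the-envelope sketch of the parameter-related footprint only; the ~124M parameter count and fp32 assumption are illustrative guesses, and activations (which scale with batch size and block size) are deliberately left out:

```python
# Hypothetical estimate: a GPT-2-small-scale model (~124M params) trained in
# fp32 with AdamW. Activations are ignored here; they grow with batch size and
# sequence length and usually dominate the peak memory.
n_params = 124_000_000
bytes_fp32 = 4

weights = n_params * bytes_fp32
grads   = n_params * bytes_fp32
adamw   = 2 * n_params * bytes_fp32   # exp_avg + exp_avg_sq moment buffers

total_gb = (weights + grads + adamw) / 1e9
print(f"~{total_gb:.1f} GB for params/grads/optimizer state, before activations")
```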
I haven't had a chance to do any benchmarking yet, but training starts just fine on M1 Ultra with …
I use Google Colab with the smaller model.
I tried out the "i only have a macbook" recipe from the README but with --device="mps" and it seems to run faster. With CPU, one iteration takes roughly 100 ms, whereas with mps it is about 40 ms. My machine is a baseline Mac Studio.
That's for training a very small transformer. My machine has 64 GB RAM, M1 Max. For a BERT-medium-like architecture, this is how it goes:
@itakafu thank you for reporting, i'll add mentions of …
Tested on MacBook Air M2, without charger: with mps, roughly 150~200 ms for one iteration, just for reference.
Confirmed works great w/ device='mps'. But make sure to install this version of PyTorch:
$ pip3 install --pre torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/nightly/cpu
I'm getting <40 ms. Thank you SO MUCH for this.
@tomeck Weird, I'm getting 300 ms on M2 (MacBook Air 16GB):
Just out of curiosity, I'm getting 17 ms with a Ryzen 7 5700X and a 3060 Ti, 64 GB RAM. What kind of iteration time does an A100 do? Are they horribly faster? I have a friend with 2x 3080s and I'm considering doing the big one...
Yep, the README documentation doesn't make sense in terms of ms calculations on the A100. It states: … This would mean 500,000 / 86,400 ≈ 5.787 iterations per second, i.e. 1000 / 5.787 ≈ 172.8 ms per iteration.
Oh I'm being stupid, I'm getting 17 ms on Shakespeare; I bet it'd be way higher on OpenWebText.
Thanks to this thread I got it working on my M2 MacBook Pro - I wrote up some detailed notes here: https://til.simonwillison.net/llms/nanogpt-shakespeare-m2
I also built a little tool you can copy and paste the training log output into to get a chart: https://observablehq.com/@simonw/plot-loss-from-nanogpt Example output:
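For anyone who prefers to stay offline, a small sketch in the same spirit; the log-line format and the "train.log" file name are assumptions based on nanoGPT's console output, not part of the repo:

```python
# Hypothetical offline equivalent of the Observable notebook: parse
# "iter N: loss X, ..." lines from a saved training log and plot the loss curve.
import re
import matplotlib.pyplot as plt

iters, losses = [], []
with open("train.log") as f:                      # assumed: console output piped to a file
    for line in f:
        m = re.match(r"iter (\d+): loss ([\d.]+)", line)
        if m:
            iters.append(int(m.group(1)))
            losses.append(float(m.group(2)))

plt.plot(iters, losses)
plt.xlabel("iteration")
plt.ylabel("training loss")
plt.show()
```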
I think the mps section of the README may be inaccurate: my understanding is that mps just utilizes the on-chip GPU. To use the Neural Engine you'd have to port it to CoreML — which may or may not speed up training but should do wonders for inference. See the PyTorch announcement here.
For training, you have to use MPS. For inference you can use the ANE.
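A minimal sketch of what that distinction means in PyTorch terms: the "mps" device drives the on-chip GPU and supports autograd, so training works there, while reaching the Neural Engine requires a separate Core ML export, which is inference-only. The toy module below is illustrative, not nanoGPT code:

```python
# Training on Apple silicon goes through the MPS backend (on-chip GPU);
# autograd works there, so the usual forward/backward loop runs unchanged.
import torch
import torch.nn as nn

device = "mps" if torch.backends.mps.is_available() else "cpu"

model = nn.Linear(16, 16).to(device)     # stand-in for a real model
x = torch.randn(8, 16, device=device)
loss = model(x).sum()
loss.backward()                          # backward pass is supported on MPS
print(device, loss.item())
```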
Hey @simonw, thanks for sharing the tutorial on your website! I tried it on my MacBook Air M2 and I'm getting much worse performance:
Currently on Python 3.11. I spent a couple of hours trying to reinstall everything but it didn't help. Does anyone have ideas about what could be wrong here?
MacBook M1 Max results on train_shakespeare_char:
It appears that after 086ebe1 was merged, the training performance on M1/M2 is significantly slower.
Thanks @deepaktalwardt! I am using the command suggested by @simonw:
After reverting that commit this is literally flying on my MacBook Pro M2 Max! So just make sure the … Stopped training after 10k iters, which took 4 min 18 s.
Has anyone tried 'mps' together with 'compile=True' and succeeded?
+1 to reverting 086ebe1; I went from 1500 ms to 70 ms per iteration.
Indeed, I also made my own fork and reverted 086ebe1, resulting in a dramatic speedup on my Mac mini M1!
Simon, thank you very much for your walk-through of an installation of nanoGPT on Apple silicon. By the way, I just tried to run …
Yep, and as follows:
Overriding: dataset = shakespeare
~/nanoGPT master ± pip list | grep torch
~/nanoGPT master ± python --version
Reverting commit 086ebe1 or overriding … BTW, I also did not need the nightly PyTorch build for this; the version available on MacPorts did fine. I did have to comment out code in …
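For reference, a hedged sketch of what the non-reverting route looks like: nanoGPT configs are plain Python files whose variables can also be overridden as command-line flags, so something along these lines should be equivalent to reverting the commit (the file name and exact flag values here are illustrative):

```python
# Hypothetical override config for single-GPU Apple-silicon training,
# equivalent to e.g.:
#   python train.py config/train_shakespeare_char.py --device=mps \
#       --compile=False --gradient_accumulation_steps=1
device = "mps"                     # use the on-chip GPU via the MPS backend
compile = False                    # torch.compile isn't usable with MPS here
gradient_accumulation_steps = 1    # don't multiply per-iteration time on one GPU
```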
Unfortunately, I cannot confirm the above statement: using a fresh installation of this repo, trying to train "Shakespeare" took approx. 2.2 s per iteration on a Mac mini M1 with 16 GB RAM - after reverting 086ebe1 again, every iteration took only 0.067 s or even less (what a dramatic change!)
That's strange. When I revert, which effectively sets gradient_accumulation_steps to 1, I get no change, so to me it seems commit 21f9bff resolves things. Ideas anyone?
Well, if you look into commit 21f9bff and compare that with the statement you used for testing (…). Did you also test …?
I know, this overrides the setting in …
No, I only have one GPU, so from my understanding of this issue I want this value to be at … It is not clear to me why the code current at the time of writing is dramatically slower for you than the code after reverting the commit. Are you seeing different values for …?
Well, I think the reason why setting … I tested nanoGPT with the Shakespeare dataset, not with shakespeare_char, which is why I ran into the same problem as a few weeks ago. And since setting …
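To spell out why that setting matters, here is a rough, runnable sketch of a gradient-accumulation loop with toy stand-ins (illustrative names and sizes, not the repo's exact code): each reported "iteration" runs the forward/backward pass gradient_accumulation_steps times before a single optimizer step, so on one GPU the wall-clock time per iteration scales roughly with that factor.

```python
# Illustrative gradient-accumulation loop: one optimizer step per
# gradient_accumulation_steps forward/backward passes, which is why a large
# value makes a single-GPU run appear that many times slower per iteration.
import torch
import torch.nn as nn

model = nn.Linear(32, 32)                        # stand-in for the GPT model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
gradient_accumulation_steps = 4                  # kept small for the demo

for micro_step in range(gradient_accumulation_steps):
    x = torch.randn(8, 32)                       # stand-in for fetching a batch
    loss = model(x).pow(2).mean()
    (loss / gradient_accumulation_steps).backward()   # accumulate scaled grads

optimizer.step()
optimizer.zero_grad(set_to_none=True)
```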
Maybe simply try:
python train.py config/train_gpt2.py
Hi all, I have an mps error, but only when doing architecture sweeps. Can someone comment on this issue?
Do you folks not run into this issue with a buggy …?
Hi there! My M1 Pro gets slower when I use …
Yeah, I'm having the same problem, but only with inference.
It's been a couple of weeks since I last checked on this, but I had the same issue. What's even more strange is that, at least for me, this slowdown only happened when generating the first sample (in …). This issue only happens with MPS.
at this point, i might have to subscribe to google colab just to run the code... the downfall of poverty
Most people do not have access to 8xA100 40GB systems, but a single M1 Max laptop with 64 GB memory could host the training. How difficult is it to port this code to "MPS"?
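Not difficult in principle: recent PyTorch builds expose Apple's Metal Performance Shaders as a regular device string, so the main change is selecting it at startup. A minimal check using the standard PyTorch API (not repo-specific code):

```python
# Pick the best available backend; "mps" requires a sufficiently recent PyTorch
# build (the nightlies recommended in this thread, or releases from 1.12 on).
import torch

if torch.backends.mps.is_available():
    device = "mps"
elif torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"

print(f"using device: {device}")
```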