You’re right, I made a mistake here - I only really noticed this when reviewing the video; I guess since I made it I could tell the difference. Next time the graphs will be easier to distinguish!
Thanks, Daniel, for the video and for sharing the links to the materials. You're a legend. Got an M3 Pro 14" (11-core CPU, 14-core GPU, 18GB) last month and have been wondering whether it was an optimal move.
Surprised that you did not include RAM bandwidth at the beginning. Whenever you do non-batched inference, memory bandwidth becomes your main constraint instead of GPU performance, as shown in your M1 Pro to M3 Pro comparison. llama.cpp's M-series benchmarking shows really nicely why the M3 Pro with its 150GB/s (instead of 200GB/s) memory is the problem, not its (faster) GPU. If one just does inference and has large models requiring lots of RAM, the M2 Ultra really shines with its 800GB/s of bandwidth. Totally agree that with training and batching it's different, and NVIDIA's new GPU performance blows away Apple silicon.
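For anyone who wants a back-of-envelope check on that, here's a tiny sketch (numbers are illustrative, not measured): single-stream decode has to stream all the weights for every generated token, so bandwidth divided by model size gives a rough upper bound on tokens/sec.

```python
# Illustrative only: real decode speed is lower because of overheads,
# but it can't beat this bandwidth-derived ceiling.
def max_tokens_per_sec(params_billion: float, bytes_per_param: float, bandwidth_gb_s: float) -> float:
    model_gb = params_billion * bytes_per_param  # GB that must be read per generated token
    return bandwidth_gb_s / model_gb

# Hypothetical 7B model at ~4-bit quantisation (~0.5 bytes/param)
for name, bw in [("M3 Pro, 150 GB/s", 150), ("M1 Pro, 200 GB/s", 200), ("M2 Ultra, 800 GB/s", 800)]:
    print(f"{name}: <= {max_tokens_per_sec(7, 0.5, bw):.0f} tokens/s")
```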
That's YouTube-quality education: good enough, but most of the time it's missing crucial details, and that mistake ends up twisting the truth, especially on performance. Although this person has studied and gets paid a BIG salary to know such details..... weird, but I put it down to a simple human mistake. Still a good video!
Woah, I didn't know about the memory bandwidth difference between the M1 and M3. Thank you for the information. I just wanted to try raw out-of-the-box testing. Fantastic insight and thank you again.
@@gaiustacitus4242 yes, but you can split layers across multiple cards. For me, I decided on an M2 Max 96GB Mac Studio and not a 1kW+ heater PC, even though in pure GPU horsepower the 4090 is much faster. And I never regretted it. Correction - I now regret my M2 Max decision since last week, because Apple/macOS Sequoia will finally do nested virtualization, but only on M3 and above. And with this I have hopes of virtualized GPUs at some point. Nvidia/CUDA has always been virtualizable and works in Docker containers/VMs.
Not even that much - it doesn't come close for those who really use TensorFlow and PyTorch. Besides, if your production environment is in the cloud, those two libraries are better integrated than MLX. On top of that, for quick deployments you already have containers preconfigured and optimized with those libraries and CUDA, since cloud servers are dominated by NVIDIA and not Apple's "Neural Engine".
When you have the same chip you will hit the silicon lottery: one machine will have a better GPU while the other will have a better CPU, depending on dead transistors and small lottery-based differences. So I'm not surprised that an M3 Pro and an M3 Max with the same Neural Engine will perform differently. The silicon lottery is a real thing that will always be a factor in computing. Great video by the way, and very informative.
For small LLMs, you are correct. For 13B-parameter or larger LLMs, a maximum-spec Mac Studio M2 Ultra or MacBook Pro M3 Max will outperform the best Windows-based solution you can build. Of course, the new Copilot+ PCs running Snapdragon X Elite CPUs will also outperform the desktop build you've recommended when running 3B to 3.8B parameter LLMs.
Would be interesting to see how the 128GB version of the M3 Max performs compared to the RTX cards on very large datasets, since roughly 75% (~96GB) can be used as VRAM on that Apple silicon.
Hi Daniel, will you be teaching something more than image classification? You are the best programming teacher I have ever followed. Looking forward to your new deep learning course on ZTM.
Sir, I follow all of your blogs, videos, etc. I want to be an ML Engineer, so I enrolled in your 'Complete ML and Data Science' course on ZTM. What a marvellous way of teaching ❤❤
One thing is clear, even as a PC person: Macs have a steep advantage with the M3's dynamic RAM-to-VRAM conversion and low power. Sure, they don't have the hardware or software of NVIDIA, but for some AI users the entry price for the VRAM is a winner.
I believe you can also target Apple silicon's NPU rather than the GPU from PyTorch, and I am sure it will perform better. Though I'm not sure how much memory the NPU has access to. It would be great if you could explore this and do a video on it.
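For what it's worth, my understanding is that PyTorch's built-in `mps` backend only targets the GPU; reaching the Neural Engine usually means exporting the model to Core ML. A minimal, hedged sketch assuming `coremltools` is installed (the toy model and filename are made up for illustration):

```python
import torch
import coremltools as ct  # pip install coremltools

# Hypothetical toy model purely for illustration.
model = torch.nn.Sequential(
    torch.nn.Linear(128, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 10),
).eval()

example = torch.randn(1, 128)
traced = torch.jit.trace(model, example)

# compute_units asks Core ML to prefer the CPU + Neural Engine over the GPU;
# Core ML still decides per-layer whether the ANE can actually run it.
mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(shape=(1, 128))],
    compute_units=ct.ComputeUnit.CPU_AND_NE,
    convert_to="mlprogram",
)
mlmodel.save("tiny_classifier.mlpackage")  # made-up filename
```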
The M3 Pro being slower/not much faster in some tests is probably because of the slower RAM. I'd be interested to see how 30- and 40-series cards stack up, but considering the cost of the laptops already, this is quite the effort, so no complaints.
@@kborak I'm not a mac user, I wouldn't buy Apple hardware for love or money. But the chips are still pretty good so it's interesting to see how they stack up to a better GPU for this kind of workload.
This is interesting! Between the M3 Pro 16GB (150GB/s) and the M3 Max 32GB (400GB/s), and considering the M1 Pro 32GB (200GB/s), would you suggest that RAM is a much more important factor for these ML tasks than memory bandwidth? Or something else? Would be keen to see a test between an M3 Pro 32GB and your M1 Pro 32GB to see whether the 50GB/s difference in memory bandwidth makes any real-world difference (also one less GPU core but a faster boost clock on the M3 Pro).
Very helpful thanks Daniel. I was going to race out and buy an M3 to do my ML work, but I will hold off for now. I suspect Apple will do something to help boost performance considerably on the software side, but who knows.
Finally a useful video. Too many “reviews” focus solely on content creators. Now I know I can do light ML on my Mac. And do the heavy lifting with my 30 series RTX card.
I'd love to see you test the M3 Ultra with 64 GB RAM when it comes out. I am using the M2 Studio Ultra at present and wonder if it will be worth upgrading. Running batches, it gets warm, but I've never heard its fan yet.
7B parameters ÷ 250,000,000 (i.e. delete 7 zeroes and divide by 25) = 28GB, which is close enough as simple maths for estimating GB of memory from model parameters (assuming fp32, 4 bytes per parameter).
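Spelled out as a quick sanity check (the rule is just bytes = parameters × bytes per parameter):

```python
def model_size_gb(params: float, bytes_per_param: float) -> float:
    return params * bytes_per_param / 1e9

print(model_size_gb(7e9, 4.0))  # fp32: 28.0 GB  (the "divide by 250 million" rule)
print(model_size_gb(7e9, 2.0))  # fp16/bf16: 14.0 GB
print(model_size_gb(7e9, 0.5))  # ~4-bit quant: 3.5 GB
```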
I wish you would make videos covering AI news - you're probably more qualified to talk about new developments in this space than 80% of these “AI channels”.
The comparison between the M1 Pro and M3 Pro is not ideal. The M3 Pro you are testing is the binned version with only 14 cores, however you're comparing it to the full M1 Pro. To get accurate performance measurements it's best to measure both full chips rather than the binned version - that way we can truly see whether the memory bandwidth makes any difference when it comes to machine learning.
Hi Daniel! What a great PyTorch tutorial you have made. Thanks for that! Also thanks for this speed-comparison video. Could you record a video comparing the speed of the different Colab tiers? I mean free, $10, and $50. The M3 Max and your Titan (which you have already done) could be added too. Maybe one of your friends has a $50 account and can run those tests for you [for all of us :)]
Yeah you're right, I also just found out that the M1 Pro has higher memory bandwidth than the M3 Pro (200GB/s vs 150GB/s), thanks to another comment. That likely adds to the performance advantage of the M1. Strange to me that a 2-year-old chip can comfortably outperform a newer chip.
I have only a 16 GB M1 Pro; on the first 2 benchmarks I get similar or slightly faster speeds. I will try to run the other benchmarks - I got sidetracked modifying the 1st benchmark to run on a quad RTX 1070 setup.
In the process of learning ML/AI-related tasks. Based on your experience, would you prefer a 13” MBP M2 24GB RAM ($1,299 new) or a 14” MBP M3 Pro 18GB RAM ($1,651 used)?
The 24GB of RAM would allow you to load larger models. But it also depends on how many GPU cores the two laptops have. Either way, both are great machines to start learning on
IMHO MacBooks are only inference machines, not training machines. They're great for running 7B, 13B, 30B LLMs locally (depending on your amount of RAM) or for quick student-scale training on something like MNIST. I personally write training code and run experiments with a small batch size on my M1 Pro, then copy the code to my 3090 PC and run long training with bigger batches and fp16. While the PC is busy, I run the next experiments in parallel on the laptop. If you load your main laptop with a big training run, you will have an uncomfortable experience if you want to browse, game, etc. in parallel with training.
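The nice thing is the same script can pick whichever accelerator is available, so nothing needs to change when you copy it from the Mac to the 3090 box. A minimal sketch:

```python
import torch

# Pick whichever accelerator the current machine has.
if torch.cuda.is_available():
    device = torch.device("cuda")   # the 3090 PC
elif torch.backends.mps.is_available():
    device = torch.device("mps")    # the M1 Pro laptop
else:
    device = torch.device("cpu")

print(f"Training on: {device}")
# model.to(device); batch = batch.to(device)  # rest of the training loop is unchanged
```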
While this is a nice buying guide for my next laptop, this is just a shining endorsement for Google Colab. What an insane value for new-comers looking to learn while not being hobbled by old equipment.
On my M1 Max 64GB... I'm getting 8208 on Core ML Neural Engine... My Core ML GPU falls more in line at 6442... All this while powering 3 screens, watching YouTube and a Twitch stream. Not that I expect those things to add much load... but it is nice to have a machine that can basically do everything at once with near-zero penalty.
If you really want to show Apple silicon's advantage, just wait till the M3 Ultra comes out with 256GB of memory and then use a model that needs that much memory. Then the only comparison would be ~3 A100s. With Apple's new MLX and flash attention we might even get better results.
Can’t wait to pick one up. I was planning on an M2 Ultra, but I’m expecting to keep this machine for a good while as part of my server rack, so M3 Ultra it is!
Although it's nice to see vision models, most people wanted to see inference with transformer LLMs, then fine-tuning with LoRA, SFT. Llama 2 at Q4_0 is hardly a test - even an 8GB Mac with Metal can run that. Would like to see different quants at 33B and 70B with different loaders: AWQ, GPTQ, exllama, etc.
At one point you say the bottleneck is memory copies from CPU to GPU and back, but the M-series doesn't have to do memory copies because it's all shared memory. In fact, one of the first optimizations for code on Apple Silicon is removing all the memory copying code because it's an easy gain. Have you accounted for this in either your code or the library code you're using, or both?
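Worth noting this depends on the framework: PyTorch's `mps` backend still has you call `.to("mps")`, whereas Apple's MLX leans on unified memory directly, with no explicit transfers. A tiny illustrative sketch, assuming the `mlx` package is installed:

```python
import mlx.core as mx  # pip install mlx (Apple silicon only)

# Arrays live in unified memory: no .to("gpu") / .cpu() copies anywhere.
a = mx.random.normal((2048, 2048))
b = mx.random.normal((2048, 2048))
c = a @ b      # dispatched to the GPU by default, operating on the same memory
mx.eval(c)     # MLX is lazy, so force the computation to actually run
print(c.shape)
```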
I am a medical doctor with a recently acquired Ph.D. in pharmacology. I am currently engaged in clinical research, focusing on identifying factors that lead to therapeutic failure in patients with various conditions. My work involves analyzing patient data files that include sociodemographic information, pathological records, clinical data, and treatment details. These datasets typically contain between 100 and 2,000 variables per patient, with a maximum of 1,000 patients in an ideal scenario. I will be using R and RStudio to process and analyze this data in various ways. Based on your experience, could you suggest a computer configuration capable of handling this type of data processing efficiently? Thanks in advance!
Thanks for this, really useful - it confirms my initial thought to just get an M1 Pro 16GB over an M3 8GB (the M1 Pro is slightly cheaper). My M1 Pro is similar to yours, 10 CPU + 16 GPU cores but just 16GB, and it has been slightly faster on both PyTorch benchmarks. I was then curious to see how it compares to a quad RTX 1070 setup. I modified your code (I will make a PR) to use all four GPUs for CIFAR100. In general it is faster than the M1 Pro; what's interesting is how a single card compares to quad cards. On small batches CIFAR100 was really bad, but by batch size 512 it was faster than a single card (34 secs at batch size 1024). It keeps improving until 3072 with 16 secs, then gets worse at 4096, back to 19 secs, similar to 2048. Also by batch size 4096 the GPU VRAM is almost full, close to 8GB.
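For anyone curious, the multi-GPU change is roughly just wrapping the model in `DataParallel` - a hypothetical sketch (not the actual PR), which also hints at why tiny batches were slow: the scatter/gather overhead dwarfs the per-GPU work until batches get big.

```python
import torch
import torch.nn as nn

# Stand-in model just to show the wrapping; the real benchmark uses a CNN on CIFAR100.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 100))

if torch.cuda.device_count() > 1:
    # Replicates the model on every visible GPU and splits each batch across them.
    model = nn.DataParallel(model)
model = model.to("cuda")

x = torch.randn(1024, 3, 32, 32, device="cuda")  # e.g. a 1024-image batch
out = model(x)    # scatter -> per-GPU forward -> gather on cuda:0
print(out.shape)  # torch.Size([1024, 100])
```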
The problem with reliance on nVidia GPUs is that performance takes a nosedive once the LLM can no longer be loaded into the video card's onboard RAM. Any M-series Mac with 128 GB RAM will outperform a PC equipped with 120 GB RAM and the best available nVidia GPU. I know because I've invested in builds of both sets of hardware only to learn the hard way that a Windows PC with an nVidia 4090 GPU with 24 GB RAM is extremely disappointing for 13B parameter or larger LLMs. The smaller LLMs do not yield acceptable quality of results. At present, your best approach to running a private LLM that approaches the accuracy of ChatGPT 4o is a Mac Studio M2 Ultra with 192 GB RAM and maximum CPU/GPU cores, followed by a MacBook Pro M3 MAX with 128 GB RAM and maximum CPU/GPU cores. Of course, if your goal is to just tinker with a local LLM to gain a better understanding of how AI works, then run smaller LLMs on a Windows PC with an nVidia GPU.
Both the M3 Pro and the M3 Max you tested have lower memory bandwidth than the previous M1/M2 Pro and M1/M2 Max, and since bandwidth is hugely important, that was reflected in your results. The M1/M2 Pro have 200 GB/s whereas the M3 Pro only has 150 GB/s. The M1/M2 Max have 400 GB/s of bandwidth, but the M3 Max model you chose only has 300 GB/s (there are also M3 Max models with 400 GB/s).
Wow! I didn’t even know this… excellent info. So what makes the bandwidth increase from the base models? Is it RAM upgrades or storage? Or something else?
Yes that would be a perfect laptop to start learning ML. You can get quite far with that machine. Just beware that you might want to upgrade the memory (RAM) so you can use larger models.
@@mrdbourke Sir, I have an M3 Air 16 GB and a MacBook Pro M3 Pro 18 GB. Which should I go for if I am starting to learn and grow in ML long-term? The price difference between the two is 30,000/-. Please advise, thank you.
@@krishna1-c6d You don't need such heavily powered machines to start learning ML. Just use Google Colab to learn. Maybe then, once you implement projects, you will understand which is better.
Question: I bought the M1 Max with 64 GB RAM and a 32-core GPU. Like you, I am still extremely satisfied with my purchase two years later. I like your setup using the Apple machine in conjunction with a box with that RTX 4090 installed. Would that setup run in parallel with my GPU cores? And similarly, if I added equivalent RAM to that box, would it work together with my installed 64 GB?
Hm, in my opinion, a strange metric because "effectiveness per dollar" doesn't really tell you much. My bike costs $300 and my car cost $10000. My bike averages around 20 mph and my car 75 mph. That comes out to 30x the price for 4x the speed. Did this tell you anything? In my opinion, no. What is a far more useful metric is the options the purchase makes available to you. If I have a car, traveling 10 miles for food is a very easy decision to make. If I only have a bike, traveling 10 miles is a major decision. With the right hardware, you unlock options like "iterative experimentation" whereas before, you had to carefully choose your workloads. And as he mentions, certain configurations simply lock you out of certain desired avenues. (8 GB of RAM is too little for many projects.) So yeah... spend is not a very useful metric, in my opinion. Choosing the bike over the car is a pretty pricey choice for reasons beyond money.
@@jks234 Interesting analogy, but the car has many other features (keeps you warm, carries 4 people....). When buying compute power for AI, then yes, you could also consider that a laptop might be better than a desktop for convenience, but it's not really like the car example. If you were comparing a mainframe to a laptop to a desktop it might be nearer this analogy. Guess it won't matter soon, as the cheapest will be cloud, purely by volume!
This is really a great video. The problem I have is that all my development is on a laptop, and I think this is wrong. The conundrum is simple: I will present my work, that's a given, so how do I develop on a much more powerful desktop and still have the ability to present my work? I hate PowerPoints of screenshots; I want to really show what I'm doing.
Great video. Could you please update us on whether the new MLX changes the results or your conclusion at all? Would love to know if the M-series chips are as good as others are saying.
So cool - are you able to run these tests on an M3 Max chip with a maxed-out RAM configuration? Could it be more "usable" than, say, a 4090 with "only" 24GB of dedicated VRAM?
There are too many considerations that were left out. M3 chips need more RAM because they’re sharing it with the system - you want 48-64GB for these tests. In addition, this didn’t mention the difference between performance cores and efficiency cores; the ratios changed with the latest M3 CPUs. Finally, with RAM upgrades you want to consider the memory throughput, which was capped lower unless you upgraded the M3 Max. All in all this is a good general comparison for affordable devices that students may have. I’d like to see an upgraded M3 Max/64GB/4TB, acknowledging that NVIDIA would still be faster. Of course, if speed is the game you’d put this on an AWS server somewhere and just have it churn for you.
The Titan is five years old. It would have been nice to include a current GPU like the 4090 - it can be 2.5× faster than the 3090, which is itself newer than the Titan.
Seems the M3 is crippled on most tests due to low memory, so it's less a real M-series comparison and more a “how low RAM can hurt you” test. I would have loved to see all models with the same RAM, or all being base or maxed-out models. That said, interesting insights on the effect of RAM and how NVIDIA performs when we’re talking strictly GPU.
All the M3 models are the base variant in their category. The only upgraded model was the M1 Pro (you can’t buy it anymore). But yes, you’re right, it would be cool to see them all with the same RAM!
Do Apple silicon chips handle the workload on the neural cores themselves, or do they need to be specifically invoked via an SDK from the code? What was the workload on them during each test? I wonder if they were invoked at all. If they were, it sounds like they don't matter compared to the GPU; however, it's claimed they can do something like 17 TOPS, which outperforms any Google Coral. Moreover, Apple claims the neural cores are 60% faster on the M3 compared to the M1. Confused now.
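One way to check after the fact is to watch Neural Engine power while a benchmark runs; if it stays at zero, the ANE wasn't touched. A hedged sketch - the `ane_power` sampler name is from memory, so check `man powermetrics`:

```python
import subprocess

# ane_power sampler name is an assumption from memory; cpu_power/gpu_power are standard.
cmd = [
    "sudo", "powermetrics",
    "--samplers", "cpu_power,gpu_power,ane_power",
    "-i", "1000",   # sample every 1000 ms
    "-n", "5",      # take 5 samples, then exit
]
result = subprocess.run(cmd, capture_output=True, text=True)
print(result.stdout)  # look for the ANE power lines while your benchmark is running
```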
In this video, the M3 base model has only 8GB RAM and the M1 Pro has 32GB RAM. What if I'm choosing between an M3 base with 16GB RAM and an M1 Pro that also has 16GB RAM - should I still go for the M1 Pro? Thanks
Not for pure non-batched inference, where memory bandwidth as well as memory size is the main constraint. There the M2 Ultra's 800GB/s vs the 4090's ~1,000GB/s is not so bad. The higher GPU power of the 4090 really shines with batched processing.
Seasoned ML/AI engineers know just about the only thing we use our laptops/personal machines for is a web client to log in to cloud services and train from there 😂
The M1 Pro doesn't perform better because of more GPU cores, but because the M3 Pro was seriously handicapped - not just fewer performance cores, but the memory bandwidth is severely cut back.