Amazing video, very useful. I was trying to find content on using Apple Silicon. I have an M3 Pro but am still having some problems; 7B models are very hard to run. Thanks for the show!!!
Dear Nono, First off, thanks a lot for your work, it's super helpful. Can you elaborate on what's going on when we're downloading the shards and loading the tokenizers and safetensors? To me it seemed like the Gemma model I had previously downloaded and cached was being downloaded again. Or is that just showing the progress of loading it into memory? How can I make sure the local, cached model is being used? Thanks again and all the best, Flo
5 months ago
Hey, Florian! The models and shards get downloaded to HuggingFace's local cache on your machine (on macOS, for me, that's ~/.cache/huggingface) and they shouldn't be re-downloaded on every execution. Transformers normally runs a quick check to verify the files are already there and only downloads what's missing. As you say, there's always a progress bar for loading the model into memory. 👌🏻 Nono
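If you want to be certain only the local cache is used and nothing is fetched from the network, you can pass local_files_only=True to from_pretrained (or set the HF_HUB_OFFLINE=1 environment variable). Here's a minimal sketch, assuming you've already downloaded google/gemma-2b-it once; the model name is just an example:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# local_files_only=True makes Transformers read from the local HuggingFace
# cache (~/.cache/huggingface by default) and raise an error instead of
# downloading if the files aren't there yet.
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b-it", local_files_only=True)
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b-it", local_files_only=True)
```

If it errors, the files weren't in the cache and a normal from_pretrained call would have re-downloaded them.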
Nice information. I just had one quick question: what is the configuration of your M3 Max? How many GPU cores and how much RAM?
5 months ago
Hey! I have an Apple M3 Max 14-inch MacBook Pro with 64 GB of Unified Memory (RAM) and 16 CPU cores (12 performance and 4 efficiency). It's awesome that PyTorch now supports Apple Silicon's Metal Performance Shaders (MPS) backend for GPU acceleration, which makes local inference and training much, much faster. For instance, each denoising step of Stable Diffusion XL takes ~2s with the MPS backend and ~20s on the CPU. I hope this helps! Nono
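If it helps, here's a quick way to check whether PyTorch can see the MPS backend and to run a model on it. This is just a minimal sketch (google/gemma-2b-it and the prompt are placeholder examples, not the exact setup from the video):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Use the Apple Silicon GPU (MPS) when available, otherwise fall back to CPU.
device = "mps" if torch.backends.mps.is_available() else "cpu"
print(f"Using device: {device}")

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b-it")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b-it").to(device)

# Inputs must live on the same device as the model.
inputs = tokenizer("Write a haiku about the M3 Max.", return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```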
Thanks for this information. I created exactly the same environment, but for the GPU version, upon calling generate it returns the same prompt as output (nothing more), whereas this works perfectly fine when I use the CPU code. Here's my GPU code:

from transformers import AutoTokenizer, AutoModelForCausalLM
import time

start = time.time()

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b-it")
print(f"Tokenizer = {type(tokenizer)}")

model = AutoModelForCausalLM.from_pretrained("google/gemma-2b-it", device_map="auto")
print(f"Model = {type(model)}")

input_text = "Tell me ten best places to eat in Pune, India"
input_ids = tokenizer(input_text, return_tensors="pt").to("mps")
print(input_ids)

outputs = model.generate(**input_ids, max_new_tokens=300)
print(outputs)
print(tokenizer.decode(outputs[0]))

end = time.time()
print(f"Total Time = {end - start} Sec")