Assuming you mean Llama 3.1 405B: I've tried many times, and every attempt has ended in failure. It's just too big: 405 billion parameters at Q4 is about 1.6 trillion bits, which works out to something ludicrous like 203 GB of VRAM for the weights alone. The biggest GPU we have is 80 GB of VRAM. In theory you can shard across multiple GPUs, but I haven't found a config that works reliably. Maybe some time in the future 405B will work (or a dedicated inference service will show up), but for now 70B is realistically the limit. The arithmetic is sketched below.
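A minimal sketch of that back-of-the-envelope math, counting only the weights and ignoring KV cache and activation overhead (the helper function is my own, purely illustrative):

```python
def weight_vram_gb(n_params: float, bits_per_param: float) -> float:
    """VRAM needed for the weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9  # bits -> bytes -> GB

params = 405e9  # Llama 3.1 405B
print(f"Q4:   {weight_vram_gb(params, 4):.1f} GB")   # ~202.5 GB
print(f"Q8:   {weight_vram_gb(params, 8):.1f} GB")   # ~405.0 GB
print(f"FP16: {weight_vram_gb(params, 16):.1f} GB")  # ~810.0 GB
# Even at Q4, an 80 GB card (A100/H100 class) would need at least
# three-way sharding just to hold the weights.
```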
@flydotio Yes, oops, I meant the 405B; none of the names stick in my lizard brain. I just wondered how people were even using a model that big! Thanks for your response. P.S. I just started deploying my apps on Fly.io, and you guys rock!
Great video! Do you know how to decrypt an encrypted model and load it into memory?
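One minimal sketch of a way to do that, assuming the checkpoint was encrypted ahead of time with a symmetric Fernet key (the file name, key handling, and helper are all hypothetical, not from the video): decrypt into an in-memory buffer and pass that to `torch.load`, so the plaintext weights never touch disk.

```python
import io
from cryptography.fernet import Fernet
import torch

def load_encrypted_state_dict(path: str, key: bytes) -> dict:
    """Decrypt an encrypted checkpoint and load it without writing plaintext to disk."""
    with open(path, "rb") as f:
        ciphertext = f.read()
    plaintext = Fernet(key).decrypt(ciphertext)    # raises InvalidToken on a wrong key
    buffer = io.BytesIO(plaintext)                 # in-memory file object
    return torch.load(buffer, map_location="cpu")  # standard checkpoint load

# Usage (the key would normally come from a secrets manager, not source code):
# state_dict = load_encrypted_state_dict("model.pt.enc", key)
# model.load_state_dict(state_dict)
```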