For models around 70B, I am getting timeout issues using vanilla Ollama. It works on the first pull/run, but times out when I need to reload the model. Do you have any recommendations for keeping the same model running persistently?
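One thing that may help here is Ollama's `keep_alive` request parameter: sending a request with `keep_alive` set to `-1` asks the server to keep the model loaded indefinitely instead of unloading it after the default idle period. Below is a minimal sketch, assuming Ollama is listening on its default local address. The helper names `build_keep_alive_request` and `preload` are made up for illustration, and the model tag you pass in is whatever you pulled. This will not fix the timeout if the slow part is the load itself, but it should avoid repeated reloads.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_keep_alive_request(model: str, keep_alive=-1) -> urllib.request.Request:
    """Build a request with no prompt, which asks Ollama to load `model`.

    keep_alive=-1 asks the server to keep the model in memory indefinitely;
    a duration string such as "1h" is also accepted by Ollama.
    """
    payload = {"model": model, "keep_alive": keep_alive}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def preload(model: str, timeout_s: float = 600.0) -> str:
    """Send the request and return the raw response body (hypothetical helper).

    The generous client-side timeout is a guess to give a large model time to load.
    """
    req = build_keep_alive_request(model)
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        return resp.read().decode("utf-8")
```

Calling `preload("<your-model-tag>")` once after the server starts should leave the model resident. Setting the `OLLAMA_KEEP_ALIVE` environment variable on the server is the equivalent server-wide setting.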
This is very informative, thanks :) Curious why you used a g4dn.xlarge GPU instance ($300/month) instead of a t3.medium CPU instance ($30/month)? I assumed the 8-billion-parameter model was out of reach on regular hardware. What is the largest model size that works on the g4dn.xlarge? To put this into perspective, I have a $4K MacBook (16 GB RAM) that can really only run the large (150 million parameter) or medium (100 million parameter) model, and I think the t3.medium on AWS can only run the small (50 million parameter) model.
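For a rough sense of what fits where, here is a back-of-the-envelope sketch that estimates the memory needed just to hold a model's weights at a given precision. It is only an approximation: it ignores the KV cache, activations, and runtime overhead, so real usage will be higher. The function name `weight_memory_gb` is made up for illustration. For reference, a g4dn.xlarge has one NVIDIA T4 with 16 GB of GPU memory, and a t3.medium has 4 GB of RAM.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Compare an 8-billion-parameter model at common precisions.
for bits in (16, 8, 4):
    gb = weight_memory_gb(8e9, bits)
    print(f"8B params at {bits}-bit: ~{gb:.0f} GB of weights")
```

By this estimate, whether an 8B model fits on a given machine depends heavily on the quantization used, not just on the parameter count.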