Fine-tuning saves you on token cost: you don't have to provide examples of a very specific format with every request. In a larger system with many more users, this is preferable to long instruction prompts.
Interesting. I think long-term cost savings on input tokens can be a big factor. If you're spending $200k a month on LLM calls and your prompt is several thousand tokens, fine-tuning really starts to make sense. Benchmarking standards are also kinda tricky. I've found that saying a model is "as good or better" leaves a lot of room for it being absolute crap in certain scenarios. 4o is "as good or better" than legacy 4, which is laughable when you're asking 4o a complex coding question or stuck in a loop where it provides the same bad suggestion over and over. Thank you for sharing.
But input tokens are very cheap compared to output tokens, and for fine-tuning you'd also need to consider the cost of collecting data, the fine-tuning job itself, and the cost of re-training whenever you want to change something.
@N7Tonik very cheap… at what scale? I've found that "very cheap" can turn into $200k pretty quickly. Collecting data is free if you're already doing it for analysis, and fine-tuning jobs are relatively cheap too. There are trade-offs to weigh; saying it's cheap and moving on only gets you so far. Even at small scale, fine-tuning starts to make sense if you have the data you need and aren't getting the results you want from prompt engineering.
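The trade-off in this thread can be sketched as a back-of-envelope comparison: a long few-shot prompt repeated on every request vs. a fine-tuned model that needs only a short prompt but (typically) costs more per token. All prices, token counts, and the one-off job cost below are illustrative assumptions, not real rates.

```python
# Back-of-envelope: when does fine-tuning beat a long instruction prompt?
# Every number here is an assumption for illustration only.
PROMPT_TOKENS_SAVED = 3000            # few-shot examples dropped after fine-tuning
REMAINING_PROMPT = 500                # tokens still sent with every request
BASE_INPUT_PRICE = 0.50 / 1_000_000   # $/input token, base model (assumed)
FT_INPUT_PRICE = 1.50 / 1_000_000     # $/input token, fine-tuned model (assumed)
FT_JOB_COST = 100.0                   # one-off training cost (assumed)

def monthly_cost(requests: int, fine_tuned: bool) -> float:
    """Input-token spend per month for a given request volume."""
    if fine_tuned:
        return requests * REMAINING_PROMPT * FT_INPUT_PRICE
    return requests * (REMAINING_PROMPT + PROMPT_TOKENS_SAVED) * BASE_INPUT_PRICE

for reqs in (10_000, 1_000_000, 100_000_000):
    base = monthly_cost(reqs, fine_tuned=False)
    ft = monthly_cost(reqs, fine_tuned=True) + FT_JOB_COST  # amortize job in month 1
    print(f"{reqs:>11,} requests/mo: prompt ${base:,.2f} vs fine-tuned ${ft:,.2f}")
```

Under these assumed numbers the fine-tuned model wins once volume is high enough to swallow the job cost and the higher per-token rate, which is the point both sides above are circling: the answer depends entirely on scale.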
If you want the LLM to strictly output JSON, it's a great idea to enable JSON mode when making the API call to the server. JSON output format is supported by Ollama, Groq, OpenAI, and probably other inference API providers too. Combining the prompt shown in the video with JSON mode should give the best outputs.
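As a minimal sketch of what enabling JSON mode looks like: in OpenAI's Chat Completions API it's the `response_format` field set to `{"type": "json_object"}` (Ollama has an analogous `"format": "json"` option). The model name and prompts below are placeholders; only the request body is shown, so no API key or network call is needed.

```python
import json

def build_json_mode_request(system_prompt: str, user_msg: str) -> dict:
    """Build a Chat Completions request body with JSON mode enabled."""
    return {
        "model": "gpt-4o-mini",  # assumption: any JSON-mode-capable model
        "messages": [
            # Note: JSON mode requires the word "JSON" to appear in the prompt.
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_msg},
        ],
        # This is the switch that constrains the output to valid JSON.
        "response_format": {"type": "json_object"},
    }

body = build_json_mode_request(
    "Extract the fields and reply in JSON.",
    "Name: Ada Lovelace, born 1815",
)
print(body["response_format"])

# With JSON mode on, the returned message content is guaranteed parseable,
# so downstream code can call json.loads() directly (sample content shown):
sample_content = '{"name": "Ada Lovelace", "born": 1815}'
parsed = json.loads(sample_content)
print(parsed["name"])
```

The prompt engineering from the video still matters for *which* fields the model emits; JSON mode only guarantees the output is syntactically valid JSON.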