
In-Context Learning: EXTREME vs Fine-Tuning, RAG 

Discover AI · 44K subscribers · 4.3K views

Published: 22 Oct 2024

Comments: 10
@DaveRetchless · 5 months ago
Your content is the best detailed explanation of new papers and topics as they evolve. Thank you for the exceptional work.
@code4AI · 5 months ago
Appreciate your comment!
@AaronALAI · 5 months ago
I built a rig at home which lets me run Mixtral 8x22B quantized to 8-bit using exllamav2. It has a 64k context length, and I've found that in-context learning works very well. This is just my subjective observation, but I have my setup so that new conversations on specific topics first incorporate a bunch of context. It's a small upfront time cost (about a minute on initial setup), but the model responds much better this way. Additionally, I've tried giving the model a bunch of context up front via RAG with similar results. I think in-context learning is going to dominate RAG and fine-tuning because of its simplicity, its dynamic nature, and the fact that you need fewer resources to have a larger impact on the model output.
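A minimal sketch of the front-loading pattern described in this comment, assuming a generic `generate(prompt)` callable rather than the commenter's actual exllamav2 setup; the only point is that the example context is paid for once, at the start of a new conversation.

```python
# Illustrative only: front-load a new conversation with in-context examples.
# `generate` stands in for whatever backend is used (e.g. an exllamav2 generator).

ICL_EXAMPLES = [
    {"q": "Summarize: ...", "a": "..."},  # placeholder examples
    {"q": "Summarize: ...", "a": "..."},
]

def build_icl_prefix(examples):
    """Concatenate examples into a prefix that is processed once per conversation."""
    parts = [f"Q: {ex['q']}\nA: {ex['a']}" for ex in examples]
    return "\n\n".join(parts) + "\n\n"

def new_conversation(user_message, generate):
    # Upfront cost: the whole prefix is evaluated on the first turn only;
    # subsequent turns reuse the conversation state.
    prompt = build_icl_prefix(ICL_EXAMPLES) + f"Q: {user_message}\nA: "
    return generate(prompt)
```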
@kevinpham6658 · 5 months ago
Really interesting approach. Using sglang's RadixAttention and the fork primitive right after all the ICL examples would make this strategy extremely viable and fast, because you only have to evaluate the examples once. Multiple forks == multi-LoRA, but without the hassle of fine-tuning?
@frankschilder8974 · 5 months ago
Very nice summary. I liked in particular your insights at the end of the video. I'm wondering, however, about the additional cost of ICL+ for a production system compared to a fine-tuned model. It would be nice to see a chart comparing the inference costs, answering the question of which approach would be more cost-effective in the long run, especially for high-throughput scenarios.
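A back-of-the-envelope sketch of the trade-off this comment asks about, with entirely made-up prices, prefix lengths, and training cost; only the structure of the comparison is meant (per-request prefix cost vs a one-off fine-tuning cost, and how prefix caching shifts the break-even).

```python
# Hypothetical numbers; the per-request cost of the question/answer tokens is the
# same for both approaches and is therefore left out.
PRICE_PER_1K_PREFIX_TOKENS = 0.0005   # assumed price, not a real quote
ICL_PREFIX_TOKENS = 8_000             # in-context examples sent with each request
FINE_TUNING_UPFRONT = 500.0           # assumed one-off training cost

def icl_extra_cost(requests, cache_hit_rate=0.0):
    # Prefix tokens not served from a KV cache are billed on every request.
    billed_prefix = ICL_PREFIX_TOKENS * (1 - cache_hit_rate)
    return requests * billed_prefix / 1000 * PRICE_PER_1K_PREFIX_TOKENS

def fine_tuning_extra_cost(requests):
    # A fine-tuned model needs no example prefix, only the upfront training run.
    return FINE_TUNING_UPFRONT

for n in (10_000, 100_000, 1_000_000):
    print(n, icl_extra_cost(n), icl_extra_cost(n, cache_hit_rate=0.9),
          fine_tuning_extra_cost(n))
```

With these made-up numbers, ICL is cheaper at low volume and fine-tuning wins at high volume, while aggressive prefix caching pushes the break-even point out by roughly the cache hit rate.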
@kevinpham6658 · 5 months ago
If you use sglang's fork primitive, you cache the ICL tokens and get them for free on subsequent calls.
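A minimal sketch of the pattern suggested in these two comments, following sglang's fork idiom; the endpoint URL, toy examples, and token limit are placeholders. The in-context examples form a shared prefix, and RadixAttention lets each forked continuation reuse that prefix's KV cache instead of re-evaluating it.

```python
import sglang as sgl

@sgl.function
def icl_forked(s, examples, questions):
    # Shared prefix: all in-context examples are evaluated once.
    for ex in examples:
        s += f"Q: {ex['q']}\nA: {ex['a']}\n\n"
    # Fork right after the examples; each branch reuses the cached prefix.
    forks = s.fork(len(questions))
    for f, q in zip(forks, questions):
        f += f"Q: {q}\nA: " + sgl.gen("answer", max_tokens=64)
    # Join the branch outputs back into the main state.
    for q, f in zip(questions, forks):
        s += f"Q: {q}\nA: {f['answer']}\n"

# Placeholder endpoint for a locally running sglang server.
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))
state = icl_forked.run(
    examples=[{"q": "2+2?", "a": "4"}, {"q": "3+5?", "a": "8"}],
    questions=["7+6?", "9+9?"],
)
print(state.text())
```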
@MrJucazila · 2 months ago
Thanks so much for this content, it's super clear, thanks! 🙂
@code4AI · 2 months ago
You're very welcome!
@covertassassin1885 · 5 months ago
@code4AI How could we apply this to longer problems? Having more examples where each example is long would fill up the context window very rapidly (e.g., summarization). How would you recommend we balance those? My ideas:
1. Simply use RAG + ICL to get the most relevant examples until the context is nearly filled.
2. If the output of the model is short but the input is long, show many examples of the output/answer and just omit the long input that would take up many tokens. There are a few reasons behind this: the answer is typically a condensed form of the information, meaning much of the utility of the example is in the answer; the answer has the formatting the model should follow; and it prevents dilution of the context window. (If you are filling a lot of the context window with tokens from the actual input of the problem, then the model has fewer tokens of example answer text to pay attention to.)
3. Potentially the inverse of #2 could also be useful: if the output is long for a short input, e.g. writing a piece of code to solve a problem, then give the LLM multiple examples of the input so it knows the general types of things it should be solving for.
What are your thoughts on #2? I think it would still be very important to give a variety of examples to make sure you get a good distribution. However, maybe a 4th option would be even better:
4. Hybrid ICL: use RAG to retrieve a few full-length examples, but then append many short examples (e.g. just the output, if the input is long). This way the model can look at a few full problems and solutions, but then has many more examples of the answer to reference. These output answers, if in the form of chain-of-thought, could be similar to what you referenced at the end with regard to many examples of reasoning. (A rough sketch of this hybrid idea follows below.)
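A rough sketch of the "Hybrid ICL" idea in point 4 above, assuming hypothetical `count_tokens` and `retrieve_similar` helpers (passed in as callables) and an arbitrary budget split: a few retrieved full-length examples first, then as many short output-only examples as still fit.

```python
# Illustrative only: `count_tokens` and `retrieve_similar` stand in for a real
# tokenizer and a RAG retriever; the 50/50 budget split is an assumption.

def build_hybrid_icl_context(query, full_examples, output_only_examples,
                             count_tokens, retrieve_similar,
                             budget=32_000, full_share=0.5):
    parts, used = [], 0

    # 1) A few retrieved full-length examples (input + output), up to half the budget.
    for ex in retrieve_similar(query, full_examples):
        block = f"Input:\n{ex['input']}\nOutput:\n{ex['output']}\n\n"
        cost = count_tokens(block)
        if used + cost > budget * full_share:
            break
        parts.append(block)
        used += cost

    # 2) Many short output-only examples to show format and reasoning style.
    for ex in output_only_examples:
        block = f"Example output:\n{ex['output']}\n\n"
        cost = count_tokens(block)
        if used + cost > budget:
            break
        parts.append(block)
        used += cost

    return "".join(parts)
```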
@RickySupriyadi · 5 months ago
In the end, they (Google and Anthropic) also learn from open source and from the research of the wider community to create their products.