
What's the Best Chunk Size for LLM Embeddings?

Matt Williams
27K subscribers
11K views

When working with embeddings, one of the challenging decisions is how big the chunks in the embedding should be. In this video I look at the question and reach some surprising conclusions.
Code is on GitHub under technovangelist/videoprojects
Be sure to sign up to my monthly newsletter at technovangelist.com/newsletter
And if interested in supporting me, sign up for my patreon at / technovangelist
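The sliding-window chunking strategy the video experiments with can be sketched roughly like this (a minimal word-based sketch in TypeScript, matching the video's bun/TS stack; the function name and shape are hypothetical, not the repo's actual code):

```typescript
// Split text into chunks of `size` words, with `overlap` words shared
// between consecutive chunks (the two knobs the video's experiment varies).
function chunkWords(text: string, size: number, overlap: number): string[] {
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  const chunks: string[] = [];
  const step = Math.max(1, size - overlap); // advance by size minus overlap
  for (let i = 0; i < words.length; i += step) {
    chunks.push(words.slice(i, i + size).join(" "));
    if (i + size >= words.length) break; // last window reached the end
  }
  return chunks;
}
```

Each chunk would then be embedded separately; the experiment compares answer quality as `size` and `overlap` change.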

Science

Published: Mar 13, 2024

Comments: 77
@sergey_is_sergey 4 months ago
I read the thumbnail as "Talk With Your Dogs" and was impressed with just how multi-modal Ollama has become.
@HistoryIsAbsurd 4 months ago
😂😀😂 love it
@sebington-ai 4 months ago
Hi Matt, I really like your experimental approach to chunk size, so useful. I'm looking forward to installing bun (a step into the unknown for me 😱) and trying out your code! Thanks for your videos, they are a pleasure to watch 🙂
@melaronvalkorith1301 9 days ago
Loved your video and especially the experiment you ran! Everything is so new and changing so fast with LLMs that experiments like yours are very valuable. I would love to see you perform a more in-depth experiment on this and/or anything else you think would be worthwhile! It would be interesting to see a program that keeps two representations of the same document with different chunk sizes, one high and one low (e.g. 100 and 5), and uses one or the other based on what you need from it.
@lucioussmoothy 4 months ago
Great video - I dig your style. Just back from spring break adventures and needed something to get pumped about before diving into the AI madness on Monday. Well done, sir.
@darenbaker4569 4 months ago
Brilliant results, thank you.
@c0t1 4 months ago
Thank you for addressing this - great video! Thanks for providing an example for analyzing the chunk size and overlap. I've wanted to try this exact thing but wasn't sure of a good way to programmatically assess the quality of the LLM's response. I'm surprised that the chunk size makes so much of a difference in the end results, and I'd LOVE to see your analysis of the myriad vector DBs out there!
@technovangelist 4 months ago
It's so rare to see someone use the word myriad correctly. Yes, the vector DB looks are coming soon.
@c0t1 4 months ago
I know there is not an innumerable number of vector DBs out there, but from the perspective of a relative newbie in this space, there might as well be. I feel like the ex-Soviet man when first confronted with 100 different brands of shampoo.
@tal7atal7a66 4 months ago
Excellent pro tutorials ❤, clean English too, and smooth explanations. Thank you.
@sahajamitrawat 4 months ago
I love all your videos, so please continue posting them. On chunking: I recently built a RAG app to do Q&A on our product docs, where I loaded all the product documentation into a vector DB. In my case I converted these docs into a JSON array where each JSON object talks about one independent requirement for a given capability. So there is no fixed chunk size for me; I stored each requirement in one chunk irrespective of its size, and I have not used any overlap. I was surprised by the accuracy of the results with this approach with the nomic-embed-text embedding.
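The per-requirement chunking described in this comment, where each semantic unit becomes one variable-length chunk instead of a fixed-size window, could look something like this (a sketch; the interface fields and function name are hypothetical, not the commenter's actual schema):

```typescript
// One independent requirement per JSON object, as the commenter describes.
interface Requirement {
  capability: string;
  text: string;
}

// Each requirement becomes one chunk, whatever its length - no fixed
// chunk size and no overlap.
function chunksFromRequirements(reqs: Requirement[]): string[] {
  return reqs.map((r) => `${r.capability}: ${r.text}`);
}
```

Each resulting string would then be embedded and stored as its own vector DB entry.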
@NLPprompter 4 months ago
Care to make a video presentation about it? Seems interesting.
@rj7855 4 months ago
Very interesting, as usual
@HistoryIsAbsurd 4 months ago
Thank you good sir!
@jim02377 4 months ago
Nicely done. I struggle with chunking on my projects, and sometimes I think the performance varies from day to day. I am curious how chunk size affects RAG performance on smaller models like Mistral 7B vs the big ones like GPT-3.5 Turbo?
@AlekseyRubtsov 4 months ago
Thanks!
@veiculoseaventuras 3 months ago
What a spectacular video! Loved it! Big hug, straight from Brazil!
@HistoryIsAbsurd 4 months ago
Hey, not sure if this is your thing or not (no worries if not, totally cool!), but could you concatenate your videos about embeddings into one super tutorial, or maybe set up a playlist of your embedding videos? Super good info in your videos, man.
@user-ro4ov2xv7s 2 days ago
What K value for retrieval did you see had the best results?
@phizc 4 months ago
9:35 One interesting twist would be to use a rather short chunk size to look up the embedding, then provide the text of the couple of previous and next chunks as well to the LLM. That would give the LLM a longer context to use for providing the answer while still being able to find the closest matching embedding. I'm not too surprised about the results, except perhaps about overlap not making much of a difference. I would think having enough overlap to at least have complete sentences would be ideal.
@ginisksam 4 months ago
Ciao Matt, thanks for the insight. I've been playing with Langchain + Ollama (nomic-embed-text) + Groq (free for now, and fast) - chunking a PDF article with 20 sub-topics. Found that chunk sizes of 1024 and 512 without overlap sufficed. I finish off using gpt-3.5 (free) to merge my outputs for each topic to produce a rather good summary for my own use, on a CPU laptop for now. On a large single-topic PDF, my go-to approach is to ask the LLM to generate a list of questions for me, then peruse those questions - this allows for better, more enjoyable reading and understanding of the PDF. What's your view on this? Keep up your videos. Cheers.
@StudyWithMe-mh6pi 4 months ago
Music is inspiring me to try chunking text :-)
@meyermc80 4 months ago
Stating the obvious a little, but there are more than 2 variables to control for here. The video mentions another big one: the specific application. Another huge variable is the specific embedding model you use and what capabilities it was trained with. Also, how are you retrieving: the top match, the top 5 matches, or anything above a similarity threshold? How do you measure similarity: cosine, Euclidean, Manhattan, ...? Are you embedding raw chunks or generating hypothetical embeddings? I love any material that helps answer any part of this. 😍
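Of the similarity measures this comment lists, cosine similarity is the one most commonly used for embeddings; a minimal implementation (assuming equal-length, non-zero vectors) looks like:

```typescript
// Cosine similarity between two embedding vectors: the dot product divided
// by the product of the vector magnitudes. Ranges from -1 to 1, where 1
// means the vectors point in the same direction.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Euclidean and Manhattan distance would replace the body with `sqrt(Σ(aᵢ-bᵢ)²)` and `Σ|aᵢ-bᵢ|` respectively; which one works best depends on the embedding model and vector DB.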
@technovangelist 4 months ago
Yes. I can only do so much in one video, so a lot of that is coming in future videos.
@cnmoro55 4 months ago
Several months ago I ran a very similar test. I found that chunks of 150 words with 20 words of overlap were the best for my case. But now I am using token chunks instead of word chunks. Approximately, those 150 words are equivalent to 300 tokens.
@technovangelist 4 months ago
It tends to average out to 3 words per 4 tokens in general, but it depends on the text.
@cnmoro55 4 months ago
@technovangelist Yes, you're right, but my use case is for the Portuguese language, so the token count tends to be a little higher :(
@panckreous 4 months ago
Remember when the MCU was great, and how, as good as a movie may have been, the post-credits stingers were always the best part? Thank you for filling the void. 10:31 (alternatively, "dude...")
@technovangelist 4 months ago
The only reference I know of for MCU is related to Marvel, so no, I don't remember when the MCU was great, because every Marvel movie has been garbage. I have gone to some and just walked out, they were so bad. I went to see Dune 2 this weekend and would have walked out on that if I hadn't fallen asleep.
@yevhendyachenko1384 1 month ago
Could you test chunking a code repo?
@technovangelist 1 month ago
Um, I'm not sure what you mean here.
@mvdiogo 4 months ago
Very nice video. I use pgvector; for me, Chroma DB did not support all my files.
@HenryETaylor 4 months ago
I'm barely a novice on any of this, but intuitively I'm wondering why the length of your longest question doesn't factor into choosing your minimum embedding length. Along the same intuitive line, wouldn't you want to pad or augment your answers so that questions always start at the beginning of a chunk? If your questions, and combinations of your questions, are the most likely search criteria you are likely to receive from users, wouldn't guaranteeing that each question is totally encapsulated in a chunk improve the speed and accuracy of the resulting searches?
@rafaelrodrigues6320 3 months ago
The embedding length (number of dimensions) is fixed, depending on the model you choose, so I think by "minimum embedding length" you meant "minimum chunk length". It's a good idea to determine your chunking method according to the queries you're expecting, but the longest question may be an outlier; be careful with that. I didn't quite get your final question, but you can't tailor your chunks to each specific query. Chunking your texts, embedding them, and indexing them for efficient searching takes quite a long time. Everything must be ready for when your user wants to do a search.
@darenbaker4569 4 months ago
Vector DBs: can't wait! My favourite is Milvus for local dev.
@wholeness 4 months ago
Chunk from Goonies?
@HarmAarts 3 months ago
I do have a question: wouldn't chunking by sentence(s) make sense?
@technovangelist 3 months ago
Yup. Look at some of the later videos.
@daryladhityahenry 4 months ago
Hi! Really interesting to hear that 100 words is kind of the best... Still, this is based on the use case, right? Say the question is: how to do xyz step by step? It's impossible to answer in 100 words, right? And if the how-to list is split into chunks (even with overlap), it will lose the context above, right? I mean, say there are 10 list items. The first 3 list items know the context since they're at the start (1st chunk). The second chunk has 3 more list items, and maybe it doesn't know the start of the sentence, which is the key. This is the problem, right? Or can embeddings somehow capture the connection between the list and the main intention? Thanks.
@technovangelist 4 months ago
That was the summary at the end... you have to experiment with what you are trying to do.
@daryladhityahenry 4 months ago
@technovangelist I see. So there's really no generalization about that? Hmm. If that's right, then real implementations are really limited to simple things? (Or it basically depends on how well the LLM handles context?)
@technovangelist 4 months ago
I found for my use case even the complicated stuff was covered by 100 words or less.
@daryladhityahenry 4 months ago
@technovangelist Woahh, I see. Nice. Thanks for sharing :D
@neodim1639 4 months ago
What about semantic chunking?
@ilianos 4 months ago
I was waiting for this as well.
@ilianos 4 months ago
Recently, I even heard about "agentic chunking". Pretty interesting concept!
@95jack44 4 months ago
And what about using an LLM to do the embedding? Seems like it would know the best place to cut the text!
@technovangelist 4 months ago
You definitely don't want to use a regular LLM to do the embedding; that's not what they are for. Your results are going to be so much better using an embedding model for that purpose.
@spartaleonidas540 3 months ago
@technovangelist I think he meant in terms of chunking the text: a semantic chunker instead of rigid arithmetic.
@technovangelist 3 months ago
Same answer. An embedding model is much better for this.
@idleidle3448 3 months ago
What's the difference between RAG and embeddings?
@technovangelist 3 months ago
Embeddings are a part of what goes into a vector DB that is used for RAG.
@idleidle3448 3 months ago
@technovangelist Thanks Matt! Do you have a buymeacoffee link or Patreon?
@technovangelist 2 months ago
Well, I do have that Patreon now. Just set it up: patreon.com/technovangelist
@chrisBruner 4 months ago
I'm starting to think I'm going to have to learn TS (and bun).
@mungojelly 4 months ago
The SMS length of 160 characters was based on research suggesting that was long enough for most things people say. I think we overestimate how much information we're adding most times we say more than that. A few words is enough to enter a completely different world of meaning; we forget how just a sentence or two is enough to bring us almost anywhere.
@technovangelist 4 months ago
Yes, but if we have a long document with a bunch of concepts discussed in it, I thought I would need a longer chunk size to capture the ideas in the doc.
@technovangelist 4 months ago
And I remember seeing that story once, so why was the original limit 140? The story was specifically about 160 characters fitting on one guy's postcard. I worked for a fax vendor, and folks who peddle out-of-date communication mediums tend to be interested in each other, I think.
@mungojelly 4 months ago
@technovangelist I don't know why we talk about it in terms of static docs; I think of it mostly in terms of asking LLMs for specific forms and categorizations. Maybe it depends on how much text you have already; it's like a dirty way to ingest stuff made by dirty random humans, I guess.
@mshonle 4 months ago
@technovangelist I will send you some Gregg shorthand via carrier pigeon. I must use duct tape, alas, because my wax for seals ran out when my second cousin thought he had found "ceiling wax".
@technovangelist 4 months ago
Nice, I look forward to that. I remember watching a video a year ago comparing authentication to wax seals in olden days.
@AndreaBorruso 2 months ago
Hi, the ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-6QAlbThWomc.html video isn't available any more. Is there a new URL? Thank you.
@technovangelist 2 months ago
Ok. I don't know what that is or why you are leaving a comment here.
@AndreaBorruso 2 months ago
@technovangelist It's the URL you suggest at the 2:30 mark: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-9HbU9Of-Ptw.html
@technovangelist 2 months ago
That was more of an example of the kind of output I want. The actual video doesn't matter.
@AndreaBorruso 2 months ago
@technovangelist I feel like a jerk. Thank you.
@technovangelist 2 months ago
I should have put a silly word in the URL to be more obvious.
@valueray 3 months ago
Why do docs all need to have the same size? Can't it be dynamic?
@fabriziocasula 4 months ago
Ciao Matt, where is the code? :-)
@technovangelist 4 months ago
On GitHub: technovangelist/videoprojects in my repos.
@fabriziocasula 4 months ago
@technovangelist thank you :-)
@technobanjo 4 months ago
Like granted for using Bun
@florentflote 4 months ago

@mshonle 4 months ago
What about a more content-driven approach, such as using a traditional NLP library like spaCy to first break your text up into sentences? (You can remove all punctuation and let spaCy decide the breaks, or I suppose you could go the evil regular-expression route.) Also, what about hierarchical embeddings? E.g., use the sentence embeddings, but also have paragraph embeddings, section embeddings, and so on, until you have document embeddings at the very top?
@technovangelist 4 months ago
Normally I do it based on multiple sentences. No need to use a library for simple stuff like that. Actually, the first version of this video was going to do that, but I wanted to see if sub-sentence chunk sizes were useful... and they are. Most vector DBs that I have used also support what you have referred to as hierarchical embeddings. But you can go too far pretty easily, potentially giving the whole document, which makes it almost as inefficient as without RAG.
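Sentence-based chunking without an NLP library, in the spirit of the reply above, can be sketched with a simple regex (a naive sketch; real text needs more care around abbreviations like "e.g." and decimal points):

```typescript
// Group sentences into chunks of `sentencesPerChunk` sentences each.
// The regex grabs runs of non-terminator characters followed by .!? marks;
// text without any terminator falls back to a single chunk.
function chunkBySentences(text: string, sentencesPerChunk: number): string[] {
  const sentences = text.match(/[^.!?]+[.!?]+/g) ?? [text];
  const chunks: string[] = [];
  for (let i = 0; i < sentences.length; i += sentencesPerChunk) {
    chunks.push(
      sentences
        .slice(i, i + sentencesPerChunk)
        .map((s) => s.trim())
        .join(" ")
    );
  }
  return chunks;
}
```

A library like spaCy handles the edge cases this regex misses, but for clean prose the simple approach often suffices.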