I am a fan of preparing data beforehand and semantic chunking - these random 1k-token chunking methods do not work for me. In the medical field you need at least 95% accuracy :)
@@jawadmansoor6064 If you mean 'expensive' in terms of computing power - you're right. As for expensive in terms of money - maybe it could be implemented using one of the great free LLMs that are available? What do you say, dear @MervinPraison?
This is so mind-blowing!! Is there any option for beginner coders? I'd be happy to pay a one-time fee for software that helps me organize my data this way. Especially if the RAG system is able to pull the references from the retrieved info. That'd be awesome for a deeper understanding of multiple (not hundreds, but thousands of) papers in academic research. Thanks for the video!
Nice tutorials, great stuff. How do we store JSON data? Chunking isn't doing a great job with key-value pairs; instead it splits at random brackets and so on. Do you have a solution?
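One option worth checking (an assumption on my part, not something from the video): LangChain ships a RecursiveJsonSplitter that splits on the JSON structure itself, so key-value pairs stay together instead of being cut at arbitrary brackets. A minimal sketch:

```python
# Split JSON by its structure rather than by raw character count.
from langchain_text_splitters import RecursiveJsonSplitter

json_data = {
    "patient": {"id": 42, "name": "Jane Doe"},
    "visits": [{"date": "2024-01-10", "notes": "Follow-up visit"}],
}

splitter = RecursiveJsonSplitter(max_chunk_size=300)
docs = splitter.create_documents(texts=[json_data])  # one Document per JSON sub-tree
for d in docs:
    print(d.page_content)
```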
Outstanding! An absolute masterclass. Thank you so much, Mervin. Personally, I think this is one of your best videos to date. I watched, enthralled!
Thanks for an amazing, educational and insightful video. One comment, please: it seems "Chunk #1" gets skipped along the way; the proposition that begins with "This capability" isn't categorized into the first chunk. This seems to be caused by an error of the LLM, which didn't de-contextualize that sentence correctly as instructed by the prompt. Am I right? Any idea why it happened? Thanks again for a great video!
So once you've used this tactic to create a vector database with all the content inside it and saved it to a Chroma DB, how would you use an LLM to query the database afterwards? Based on how you presented this, the chunking seems to be part of the retrieval process as well, and running this script every time doesn't seem that great.
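In case it helps, here is a rough sketch of the follow-up query step: the idea is that chunking and embedding happen once at indexing time, and afterwards you only load the persisted store and query it. This assumes LangChain, OpenAI models, and a Chroma DB persisted to ./chroma_db (the directory and model names are placeholders, not taken from the video):

```python
# Load an already-persisted Chroma DB and answer questions against it;
# no re-chunking happens here.
from langchain.chains import RetrievalQA
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings, ChatOpenAI

vectordb = Chroma(
    persist_directory="./chroma_db",        # same directory used during indexing
    embedding_function=OpenAIEmbeddings(),  # must match the embeddings used to index
)

qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-3.5-turbo"),
    retriever=vectordb.as_retriever(search_kwargs={"k": 4}),
)
print(qa.invoke({"query": "What are the main chunking strategies?"})["result"])
```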
Best 15 mins of my day. Excited to try the agentic chunker. Would love to hear your thoughts on cases where the source content contains a mix of text and a lot of tables.
@@MervinPraison I installed langchain-experimental and it is listed when I check with pip list. However, when I try to import it, it cannot be resolved. I tried restarting the terminal and conda, and reopening the folder, but nothing seems to work.
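If it helps anyone else hitting this: when pip list shows the package but the import still can't be resolved, the editor is usually pointed at a different Python interpreter than the one pip installed into. A quick check (the import path below is where the semantic splitter lives in recent langchain-experimental releases, as far as I know):

```python
import sys
print(sys.executable)  # should be the same environment where pip installed langchain-experimental

# In recent versions, the semantic splitter is importable like this:
from langchain_experimental.text_splitter import SemanticChunker
```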
Awesome as always! I have a question: can we use any other LLM for agentic chunking? Can you suggest a free, open-source alternative to GPT-3.5? I have the same setup with a local Ollama Mistral; can I use that for agentic chunking as well?
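Not sure about the exact class used in the video, but in principle any chat model LangChain can wrap should be able to drive the proposition step. Here is a hedged sketch using a local Mistral via Ollama (the model name and prompt are illustrative, and the agentic chunker itself may need a small change to accept a custom llm):

```python
# Sketch: use a local Ollama Mistral as the LLM behind proposition generation.
from langchain_community.chat_models import ChatOllama
from langchain_core.prompts import ChatPromptTemplate

llm = ChatOllama(model="mistral", temperature=0)  # assumes `ollama pull mistral` was run

prompt = ChatPromptTemplate.from_template(
    "Decompose the following text into simple, standalone propositions:\n\n{text}"
)
chain = prompt | llm
print(chain.invoke({"text": "The month is October. It is also fall."}).content)
```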
Thank you so much for the video!! When using models like "all-mpnet-base-v2" from the Sentence Transformers library to generate embeddings and store them in a vector DB, which are based on architectures like BERT, the maximum token limit per input sequence is typically 512 tokens. As long as my input text does not exceed that 512-token maximum, we don't need to deal with the different chunking strategies covered in this video. In other words, chunking strategies are meant for long input text (thousands of words). Is that a correct understanding?
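For what it's worth, a quick way to sanity-check that assumption is to count tokens against the model's own window before skipping chunking. This sketch assumes the sentence-transformers package and reads the model's reported max_seq_length rather than hard-coding 512, since some checkpoints default to a smaller window:

```python
# Minimal sketch: count tokens for a text and compare against the embedding
# model's sequence window; anything longer gets silently truncated, so it
# should be chunked first.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2")
text = "your input document goes here"  # placeholder

n_tokens = len(model.tokenizer(text)["input_ids"])
print(f"{n_tokens} tokens; model window = {model.max_seq_length} tokens")

if n_tokens > model.max_seq_length:
    print("Too long for a single embedding: the tail would be dropped, so chunk first.")
```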
You're right, chunking is only relevant for longer input. Though I'm not sure it would be correct to say it's ONLY relevant when the context length of your LLM is exceeded, or whether RAG makes sense nonetheless.
@@ilianos Yeah, we retrieve the top-k closest vectors, rank them if necessary, and then feed them to the LLM, so we won't exceed the context window of our LLM.
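For anyone who wants to see that pattern concretely, here is a minimal sketch of the retrieve-top-k-then-stuff step, assuming a Chroma DB persisted at ./chroma_db and OpenAI embeddings (both placeholders; k and the prompt wording are illustrative too):

```python
# Sketch of "retrieve top-k, then feed only those chunks to the LLM".
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings, ChatOpenAI

vectordb = Chroma(persist_directory="./chroma_db", embedding_function=OpenAIEmbeddings())

question = "What does the document say about agentic chunking?"
docs = vectordb.similarity_search(question, k=3)  # only the 3 closest chunks

context = "\n\n".join(d.page_content for d in docs)  # small enough for the context window
answer = ChatOpenAI(model="gpt-3.5-turbo").invoke(
    f"Answer using only this context:\n{context}\n\nQuestion: {question}"
)
print(answer.content)
```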
Thank you very much! This video is helpful for those interested in building RAG apps. I'm not sure whether it can also help with chunking legal or contract documents in PDF format.
You are correct, thanks for the feedback. First I recorded the whole tutorial and at the end realised I had forgotten to press the record button 🤦♂️ Then, out of frustration, I recorded it a second time.
I applied the semantic approach to topic modelling and it produces better topicality and better-titled propositions when used with an HF LLM. I will try the last method for the DB creation. Thanks a million!
Because the problem I'm facing is that the data is very scattered… there's a website (1000 URLs) that contains everything, but simply looking at one URL and chunking it won't yield valuable information for a query. So imagine I have to pull parts of 5-6 URLs to create a useful document for retrieval. I think it's because I don't have structured and organised documents, so I resort to manual chunking. I know it's very inefficient, so I'm looking for ways to solve this problem.