I am a fan of preparing data beforehand and semantic chunking - these random 1k-token chunking methods do not work for me. In the medical field you need at least 95% accuracy :)
@@jawadmansoor6064 If you mean 'expensive' in terms of computing power - you're right. As for expensive in terms of money - maybe it could be implemented using one of the great free LLMs that are available? What do you say, dear @MervinPraison?
This is so mind-blowing!! Is there any option for beginner coders? I'd be happy to pay a one-time fee for software that helps me organize my data this way. Especially if the RAG system is able to pull the references from the retrieved info. That'd be awesome for a deeper understanding of multiple (not hundreds, but thousands of) papers in academic research. Thanks for the video!
Nice tutorials, great stuff. How do we store JSON data? Chunking isn't doing a great job with key-value pairs; instead it splits at random brackets and so on. Do you have a solution?
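One option worth checking (an assumption on my part, not something from the video): LangChain ships a RecursiveJsonSplitter that splits on the JSON structure itself, so key-value pairs stay together instead of being cut at arbitrary brackets. A minimal sketch:

```python
# Split JSON by its structure rather than by raw character count.
from langchain_text_splitters import RecursiveJsonSplitter

json_data = {
    "patient": {"id": 42, "name": "Jane Doe"},
    "visits": [{"date": "2024-01-10", "notes": "Follow-up visit"}],
}

splitter = RecursiveJsonSplitter(max_chunk_size=300)
docs = splitter.create_documents(texts=[json_data])  # one Document per JSON sub-tree
for d in docs:
    print(d.page_content)
```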
Outstanding! An absolute masterclass. Thank you so much, Mervin. Personally, I think this is one of your best videos to date. I watched, enthralled!
Thanks for an amazing, educational and insightful video. One comment, please: it seems "Chunk #1" gets skipped along the way; the proposition that begins with "This capability" isn't categorized into the first chunk. This seems to be caused by an error of the LLM, which didn't de-contextualize that sentence correctly as instructed by the prompt. Am I right? Any idea why it happened? Thanks again for a great video!
So once you've used this tactic to create a vector database with all the content inside it and saved it to a Chroma DB, how would you use an LLM to query the database afterwards? Based on how you presented this, the chunking seems to be part of the retrieval process as well, and running this script every time doesn't seem that great.
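In case it helps, here is a rough sketch of the follow-up query step: the idea is that chunking and embedding happen once at indexing time, and afterwards you only load the persisted store and query it. This assumes LangChain, OpenAI models, and a Chroma DB persisted to ./chroma_db (the directory and model names are placeholders, not taken from the video):

```python
# Load an already-persisted Chroma DB and answer questions against it;
# no re-chunking happens here.
from langchain.chains import RetrievalQA
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings, ChatOpenAI

vectordb = Chroma(
    persist_directory="./chroma_db",        # same directory used during indexing
    embedding_function=OpenAIEmbeddings(),  # must match the embeddings used to index
)

qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-3.5-turbo"),
    retriever=vectordb.as_retriever(search_kwargs={"k": 4}),
)
print(qa.invoke({"query": "What are the main chunking strategies?"})["result"])
```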
Best 15 mins of my day. Excited to try the agentic chunker. Would love to hear your thoughts on cases where the source content contains a mix of text and a lot of tables.
@@MervinPraison I installed langchain-experimental and it is listed when I check with pip list. However, when I try to import it, it cannot be resolved. I tried restarting the terminal and conda, and reopening the folder, but nothing seems to work.
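If it helps anyone else hitting this: when pip list shows the package but the import still can't be resolved, the editor is usually pointed at a different Python interpreter than the one pip installed into. A quick check (the import path below is where the semantic splitter lives in recent langchain-experimental releases, as far as I know):

```python
import sys
print(sys.executable)  # should be the same environment where pip installed langchain-experimental

# In recent versions, the semantic splitter is importable like this:
from langchain_experimental.text_splitter import SemanticChunker
```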
Awesome as always! I have a question: can we use any other LLM for agentic chunking? Can you suggest a free, open-source alternative to GPT-3.5? I have the same setup with a local Ollama Mistral; can I use that for agentic chunking as well?
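Not sure about the exact class used in the video, but in principle any chat model LangChain can wrap should be able to drive the proposition step. Here is a hedged sketch using a local Mistral via Ollama (the model name and prompt are illustrative, and the agentic chunker itself may need a small change to accept a custom llm):

```python
# Sketch: use a local Ollama Mistral as the LLM behind proposition generation.
from langchain_community.chat_models import ChatOllama
from langchain_core.prompts import ChatPromptTemplate

llm = ChatOllama(model="mistral", temperature=0)  # assumes `ollama pull mistral` was run

prompt = ChatPromptTemplate.from_template(
    "Decompose the following text into simple, standalone propositions:\n\n{text}"
)
chain = prompt | llm
print(chain.invoke({"text": "The month is October. It is also fall."}).content)
```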
Thank you so much for the video!! When using models like "all-mpnet-base-v2" from the Sentence Transformers library to generate embeddings and store them in a vector DB, which are based on architectures like BERT, the maximum token limit per input sequence is typically 512 tokens. As long as my input text does not exceed that 512-token maximum, we don't need to deal with the different chunking strategies covered in this video. In other words, chunking strategies are meant for long input text (thousands of words). Is that a correct understanding?
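For what it's worth, a quick way to sanity-check that assumption is to count tokens against the model's own window before skipping chunking. This sketch assumes the sentence-transformers package and reads the model's reported max_seq_length rather than hard-coding 512, since some checkpoints default to a smaller window:

```python
# Minimal sketch: count tokens for a text and compare against the embedding
# model's sequence window; anything longer gets silently truncated, so it
# should be chunked first.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2")
text = "your input document goes here"  # placeholder

n_tokens = len(model.tokenizer(text)["input_ids"])
print(f"{n_tokens} tokens; model window = {model.max_seq_length} tokens")

if n_tokens > model.max_seq_length:
    print("Too long for a single embedding: the tail would be dropped, so chunk first.")
```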
You're right, chunking is only relevant for longer input. Though I'm not sure it would be correct to say it's ONLY relevant when the context length of your LLM is exceeded, or whether RAG makes sense nonetheless.
@@ilianos Yeah, we retrieve the top-k closest vectors, rank them if necessary, and then feed them to the LLM, so we won't exceed the context window of our LLM.
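For anyone who wants to see that pattern concretely, here is a minimal sketch of the retrieve-top-k-then-stuff step, assuming a Chroma DB persisted at ./chroma_db and OpenAI embeddings (both placeholders; k and the prompt wording are illustrative too):

```python
# Sketch of "retrieve top-k, then feed only those chunks to the LLM".
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings, ChatOpenAI

vectordb = Chroma(persist_directory="./chroma_db", embedding_function=OpenAIEmbeddings())

question = "What does the document say about agentic chunking?"
docs = vectordb.similarity_search(question, k=3)  # only the 3 closest chunks

context = "\n\n".join(d.page_content for d in docs)  # small enough for the context window
answer = ChatOpenAI(model="gpt-3.5-turbo").invoke(
    f"Answer using only this context:\n{context}\n\nQuestion: {question}"
)
print(answer.content)
```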
Thank you very much! This video is helpful for those interested in building RAG apps. I'm not sure whether it can also help with chunking legal or contract documents in PDF format.
You are correct, thanks for the feedback. First I recorded the whole tutorial and at the end realised I had forgotten to press the record button 🤦♂️ Then, out of frustration, I recorded it a second time.
I applied the semantic approach to topic modelling and it produces better topicality and better-titled propositions when used with an HF LLM. I will try the last method for the DB creation. Thanks a million!
Because the problem I'm facing is that the data is very scattered… there's a website (1000 URLs) that contains everything, but simply looking at one URL and chunking it won't yield valuable information for a query. So imagine I have to pull parts of 5-6 URLs to create a useful document for retrieval. I think it's because I don't have structured and organised documents, so I resort to manual chunking. I know it's very inefficient, so I'm looking for ways to solve this problem.