
LangChain: How to Properly Split your Chunks 

Prompt Engineering
173K subscribers · 29K views

In this video, we take a deep dive into the RecursiveCharacterTextSplitter class in LangChain. How you split your chunks/data determines the quality of the answers you get when you try to chat with your documents using LLMs. Learn how to properly use text splitters in LangChain.
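As a rough illustration of what the video covers, here is a simplified, pure-Python sketch of the recursive splitting idea (not LangChain's actual implementation): try coarse separators first (paragraphs), fall back to finer ones (lines, words), and finally hard-split by character count.

```python
def recursive_split(text, separators=("\n\n", "\n", " "), chunk_size=100):
    """Greedily pack split pieces into chunks of at most chunk_size chars."""
    if len(text) <= chunk_size:
        return [text]
    for sep in separators:
        if sep in text:
            chunks, current = [], ""
            for part in text.split(sep):
                candidate = current + sep + part if current else part
                if len(candidate) <= chunk_size:
                    current = candidate
                else:
                    if current:
                        chunks.append(current)
                    current = part
            if current:
                chunks.append(current)
            # Recurse into any piece that is still too large.
            return [c for chunk in chunks
                    for c in recursive_split(chunk, separators, chunk_size)]
    # Last resort: hard split by character count.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

text = "First paragraph here.\n\nSecond paragraph here.\n\nThird one."
print(recursive_split(text, chunk_size=30))
# → ['First paragraph here.', 'Second paragraph here.', 'Third one.']
```

Because the paragraph separator `"\n\n"` succeeds here, the splitter never has to fall back to lines or words; a paragraph longer than `chunk_size` would trigger that fallback.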
#llm #langchain #PDFchat
▬▬▬▬▬▬▬▬▬▬▬▬▬▬ CONNECT ▬▬▬▬▬▬▬▬▬▬▬
☕ Buy me a Coffee: ko-fi.com/prom...
🔴 Support my work on Patreon: Patreon.com/PromptEngineering
🦾 Discord: / discord
▶️️ Subscribe: www.youtube.co...
📧 Business Contact: engineerprompt@gmail.com
💼Consulting: calendly.com/e...
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
LINKS: python.langcha...
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
All Interesting Videos:
Everything LangChain: • LangChain
Everything LLM: • Large Language Models
Everything Midjourney: • MidJourney Tutorials
AI Image Generation: • AI Image Generation Tu...

Published: Oct 3, 2024

Comments: 77
@CacoNonino · 1 year ago
Please make more videos like this one! Many people got into AI without a coding background; we are missing more detailed videos on these topics!
@AJJU_OZA · 1 year ago
Answer me... are Prompt Engineering's videos for developers (appreciation) only?
@CacoNonino · 1 year ago
@@AJJU_OZA Well, if they were, I would not have been here for so long, hahaha. What I meant is that for those who don't have coding knowledge and want to do more than replicate GitHub repos, this hands-on type of video is phenomenal! In my case I am working on a text-based RPG game, and the basic concept of this video was one I had yet to grasp. Answered!
@CacoNonino · 1 year ago
@@AJJU_OZA I mean, if the channel also had an LLM Python-focused course, I would be one of the people paying for it. I bet there are tons of people changing careers who also need in-depth explanations of basic concepts like this one!
@ml-techn · 1 year ago
@@CacoNonino What do you mean by an LLM Python-focused course?
@CacoNonino · 1 year ago
@@ml-techn I mean, I studied economics but changed careers midway to data engineering! Now I'm building more and more things on top of LLMs. All my coding knowledge came from using ChatGPT over the last year, and I think it is the same for a lot of people, hence why the tutorial videos are so popular. Am I making sense? There are a thousand videos out there that mention splitting text into chunks, but not many explaining how that is specifically done the way he did it here!
@parisneto · 6 months ago
Just found your channel, and while I initially wished I could have had you as a professor in a classroom (maybe back in college 30 years ago), I really think you are helping create a better world for many with your content, careful explanations, and examples. That is the true purpose and mission of a teacher. Congrats!
@deepaksingh9318 · 7 months ago
And I think nobody can explain concepts more simply than you do. I tried 10 different videos to check how the recursive splitter behaves when a paragraph is the chunk size, and you explained it :) I love how you cover every aspect from a learning point of view. Thanks again.
@engineerprompt · 6 months ago
Glad it was helpful. Make sure to watch the next one :)
@WinstonWalker-fc7ty · 1 year ago
I'd love to see videos on both embedding size and modifying the text splitter! I'm particularly interested in strategies that would enable inclusion of citations, e.g. a medical article that includes numbered citations at the end of each sentence with the reference list at the end of the document.
@ShaneHolloman · 1 year ago
Excellent to have someone break these concepts down so clearly. Keep going, this is great!
@RealEstate3D · 1 year ago
This is the first time I have seen content on optimal chunk lengths. In addition, it might be interesting to cover how to integrate metadata, for example which page of a book, which URL, or which paragraph of a legal text a chunk comes from. This metadata will also take up space in the retrieval context. Good work. Definitely continue down this road.
@asithakoralage628 · 1 year ago
Thank you, you explain things very clearly, and I have been watching your content. It is really good and honest. Please keep making these types of videos. Thanks a lot.
@SachinChavan13 · 1 year ago
Please keep making more videos like this. I found this one very helpful.
@engineerprompt · 1 year ago
More to come 😎
@yazanrisheh5127 · 1 year ago
Finally understood this. I remember asking on Discord, and I think you also replied, but the fact that an entire video was made on this made it much, much clearer. Thank you so much! Could you make a video about vector stores: which one to use, how to know what to use, and the code behind them? I saw a couple like FAISS, ChromaDB, DeepLake, etc., and for my chatbot it's pretty much the last thing I have left to do, but I still don't understand most of the vector stores.
@wassimsaioudi116 · 8 months ago
Incredible! I hope you'll provide more videos like this one!
@mdfarhananis8950 · 1 year ago
Really useful. Please continue making these.
@adnanrizve5551 · 1 year ago
Great work! Very simple but really thorough. Please create more videos for this series.
@TheCloudShepherd · 10 months ago
Damn, you explained that better in 3 minutes than most other videos did in 30.
@engineerprompt · 10 months ago
Glad it was helpful.
@e_hana_kakou · 1 year ago
Appreciate all your content. I'd love to know more about chunking customization. Thanks! 🤙
@darshan7673 · 10 months ago
Great video, thanks for creating it!
@izainonline · 1 year ago
Great video for understanding chunks and text splitters.
@engineerprompt · 1 year ago
Thank you 🙏
@貴-b3w · 9 months ago
Great video, thanks for creating it! 😀
@hvbris_ · 1 year ago
Good video. For the dataset I am working with, I found that splitting by tokens produces better results, but it really depends on the data you're working with, to be honest!
@gangs0846 · 8 months ago
Thank you!
@SmashPhysical · 1 year ago
Great explanation, thanks, this will be super useful!
@kenchang3456 · 8 months ago
Excellent explanation, thank you. Just curious: why is this the only video in your Demystifying LangChain playlist?
@engineerprompt · 8 months ago
Thank you. There are just way too many things to cover, but I'm now getting back to RAG and will be making a lot more content on it.
@AA_135 · 1 year ago
Great explanation!
@unshadowlabs · 1 year ago
I have seen a lot of videos on how to use these chunks with a vector database and have the LLM use RAG as a knowledge base. There seem to be very few videos on how to use the chunked data to fine-tune an LLM like Llama 2. I would love to see a video that covers using raw or chunked data to fine-tune an LLM without having to convert it into something like question-and-answer or instruct formatting.
@duncanprins9944 · 1 year ago
Great! Much appreciated 😊
@weber1209rafael · 1 year ago
Please create more content with in-depth information about how to use this in a smart way. I'm currently building a domain-specific knowledge base to create an "AI expert" on a certain topic, and I am trying to find the right way to store all the knowledge.
@hl236 · 7 months ago
More videos on chunking and embedding, please.
@guanjwcn · 1 year ago
Please continue with these; they are useful.
@surajthakkar3420 · 8 months ago
Hello mate, any chance you can make a video on context-aware chunking, which can improve the quality of chunks/output drastically?
@RichardGetzPhotography · 1 year ago
Yes, please do a video on embedding settings. I am currently using these parameters:
- VECTOR_SIZE (int): the size of the vector for the text embeddings (e.g., 300).
- WINDOW_SIZE (int): the context window size for text embeddings, capturing larger contextual information (e.g., 20).
- MIN_COUNT (int): the minimum frequency count for words to be considered in the text embeddings (e.g., 1).
- EPOCHS (int): the number of training iterations for the Doc2Vec model (e.g., 500).
@JourneyMindMap · 8 months ago
Thanks, dude.
@arkodeepchatterjee · 1 year ago
Really useful, please continue making videos like this.
@goncaavci1579 · 1 year ago
Please make a video about embedding size. You are awesome, thank you for the videos.
@Ken129100 · 1 year ago
Thanks for the video! What if you want to chunk a large PDF of 300 pages? How do you determine the chunk size? In your example you can observe the length of each paragraph by inspection, but that might be hard to do for a large file. I would appreciate it if you shared your opinion.
@nirsarkar · 1 year ago
Please do create one for custom splitting. I have a particular document where I would like to define chunks demarcated by a special sequence.
@Zivafgin · 1 year ago
Great content! Keep it up, please :)
@gerardorosiles8918 · 1 year ago
Very nice video. I think anyone working on semantic search goes through the experience you described here. Have you seen a study that checks the performance of different embeddings with respect to chunk size? Also, what are the different available models for embeddings? I have been using the FAISS models; I have heard you mention another one. What would be a good strategy for picking one versus another?
@jstormclouds · 1 year ago
I feel I get the gist, but I am interested in more on the topic.
@VerdonTrigance · 7 months ago
How do I define my own list of separators? Can I set multiple separators for paragraphs and multiple for sentences at the same time?
@texasfossilguy · 1 year ago
What about dynamic chunk size as a potential future feature? How does this work for a large series of documents like textbooks and other PDFs such as science articles or legal documents? What is a "best guess" for the parameters?
@shivanshugautam1381 · 1 year ago
Hi, I am having the same problem. Do you have any idea how we can split our documents into chunks efficiently?
@MattGoldenberg · 1 year ago
Hmm, curious why you're splitting by character count and not by token count? Our recursive splitter always bottoms out at token count based on the model we're using, since the model can't see character-level data, and token count is the limiting factor we actually care about when inferencing.
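The token-budget approach this comment describes can be sketched as follows. This is a simplified illustration where a naive whitespace "tokenizer" stands in for a real BPE tokenizer; in practice you would measure length with the tokenizer of the model you are inferencing with (LangChain, for instance, offers `RecursiveCharacterTextSplitter.from_tiktoken_encoder` for that).

```python
def split_by_token_budget(text, max_tokens=50):
    # Naive whitespace "tokenization" as a stand-in for a real tokenizer.
    words = text.split()
    # Pack consecutive tokens into chunks of at most max_tokens each.
    return [" ".join(words[i:i + max_tokens])
            for i in range(0, len(words), max_tokens)]

print(split_by_token_budget("one two three four five six", max_tokens=4))
# → ['one two three four', 'five six']
```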
@PerFeldvoss · 1 year ago
What if you could preprocess the texts and reorganize sentences by "key subject relationships"? That is, as a supplement to the original text, you could perhaps make chunks of text that summarize different key subjects. The AI would produce a (creative) list of these subjects and then use that list when running through the text again... (and you could then "make LangChain know" which sentences actually belong together!)
@walidmaly3 · 1 year ago
Please continue making videos like this. Any chance you can share the code as well?
@AJJU_OZA · 1 year ago
Sir, are Prompt Engineering's videos for developers (appreciation) only...?
@mikelugarte · 1 year ago
I have a CSV file with product descriptions and IDs. I need to query the descriptions with the user input in order to get the product ID. I am using CharacterTextSplitter to split the full file into chunks with one line per chunk. After that I want to do a similarity_search to get the lines of the CSV that contain descriptions similar to the user input. I'm using the newline separator to split the text by lines but, for whatever reason, it sometimes doesn't work. I'd love to see an example of CharacterTextSplitter in this kind of situation, or how to use RecursiveCharacterTextSplitter to do the same.
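For the row-per-chunk case described above, one alternative is to skip the text splitter entirely and parse the rows with the standard `csv` module, which sidesteps separator-handling quirks. The data and column names (`id`, `description`) below are made up for illustration.

```python
import csv
import io

# Hypothetical product catalog; in practice this would come from a file.
csv_text = "id,description\n1,red running shoes\n2,blue denim jacket\n"

# One chunk per row, keeping the product ID attached to its description
# so a similarity search over the chunks can recover the ID.
rows = csv.DictReader(io.StringIO(csv_text))
chunks = [f'{row["description"]} (id={row["id"]})' for row in rows]
print(chunks)
# → ['red running shoes (id=1)', 'blue denim jacket (id=2)']
```

Keeping the ID inside each chunk (or in per-chunk metadata, if the vector store supports it) means the similarity search result carries the answer directly.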
@TousifAhamedNadaf · 1 year ago
I am facing the same issue. I have managed to write generic code for chunking; however, I am only able to get results for small datasets, not for large ones. Did you manage to solve it?
@rutvikghori2410 · 6 months ago
Thank you! How can I resolve splitting issues when I have multiple files and want to generate a summary for each one individually?
@engineerprompt · 5 months ago
In that case, look into summarization-specific chains. Map-reduce will be a good start.
@rutvikghori2410 · 5 months ago
@@engineerprompt Suppose these are code files and I want to generate a summary for each separately. What should I do?
@r0f115L4m · 1 year ago
Thank you for your video. What program are you using to create your diagrams?
@amol5146 · 8 months ago
Can you please explain how the chunk_overlap parameter works?
@engineerprompt · 8 months ago
Let's say you define the chunk size to be 1000 characters with an overlap of 200. In this case, the first chunk will cover characters 1-1000, and the second chunk will run from 801 to 1800, because there is an overlap of 200. Hope this helps.
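The overlap behaviour described in this reply can be sketched as a sliding window: each chunk starts `chunk_size - overlap` characters after the previous one, so consecutive chunks share `overlap` characters. A simplified illustration, not LangChain's actual implementation:

```python
def chunk_with_overlap(text, chunk_size=1000, overlap=200):
    # Each new chunk starts `step` characters after the previous one,
    # so consecutive chunks share `overlap` characters.
    step = chunk_size - overlap
    chunks = []
    for i in range(0, len(text), step):
        chunks.append(text[i:i + chunk_size])
        if i + chunk_size >= len(text):
            break
    return chunks

text = "x" * 1800
print([len(c) for c in chunk_with_overlap(text)])
# → [1000, 1000]: characters 1-1000, then 801-1800
```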
@amol5146 · 8 months ago
@@engineerprompt Thank you! Does chunk_overlap also follow the default separator list?
@waelmashal7594 · 1 year ago
If we go through our docs, measure the length of each paragraph, and set the chunk size to the maximum length, would that help? Or maybe take the average length across all paragraphs? It depends on the splitter; what do you think?
@engineerprompt · 11 months ago
This might be dated, but yes, that can be one approach. Another is to use regular expressions if there is a pattern in the data. There are now more advanced retrieval methods that can compress the data in the documents to make it more relevant to the query. A lot is happening in this space.
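The data-driven approach floated in this exchange, i.e. deriving the chunk size from the documents' own paragraph lengths, could be sketched as follows. The function name and the paragraph delimiter are assumptions for illustration.

```python
def suggest_chunk_size(text, use_max=True):
    # Measure the length of each non-empty paragraph (split on blank lines),
    # then suggest either the maximum or the average as the chunk size.
    lengths = [len(p) for p in text.split("\n\n") if p.strip()]
    return max(lengths) if use_max else sum(lengths) // len(lengths)

doc = "short para.\n\na somewhat longer paragraph here.\n\nmid one."
print(suggest_chunk_size(doc))  # → 33 (length of the longest paragraph)
```

Using the maximum guarantees no paragraph is split mid-way; using the average keeps chunks smaller at the cost of splitting the longest paragraphs.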
@subhashinavolu1704 · 1 year ago
What if the PDF has tables too? I see the PDF loader in LangChain is not reading the tables. How do you solve that? And if it is solved, how does the recursive text splitter work with such tabular data?
@vertigoz · 4 months ago
The link no longer works.
@computerauditor · 1 year ago
🔥🔥🔥
@fra8156 · 1 year ago
What about making a video using a very small LLM that every PC can handle, using it on a very specific task, fine-tuning it, and showing every step from zero to hero, all working offline? That way everyone can get hands-on with this "lab" and learn by doing...
@MuhammadDanyalKhan · 8 months ago
I had a question on this video, i.e., how to split chunks: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-n0uPzvGTFI0.html ... How can I find the best chunk size for financial statements?
@frazuppi4897 · 1 year ago
In real life you need to do way more, and all the tutorials are basically splitting some decent txt files, but this is a good introduction.
@CarlosIvanDonet · 1 year ago
Does this work with a cpp local model? Like modelname-ggmlv1.q4_1.bin
@engineerprompt · 1 year ago
Yes, it will work with any model.