
Semantic Chunking - 3 Methods for Better RAG 

James Briggs · 68K subscribers
12K views
Published: 29 Sep 2024

Comments: 26
@wassfila · 4 months ago
This is really promising, thank you. It's really hard to get an overview of the cost/benefit for end results from a RAG end-user perspective — something like a comparison table would help.
@looppp · 3 months ago
great video
@kevinozero · 26 days ago
Very strange: this keeps breaking sentences up midway even though the sentence conveys one message, like a clause in a contract. Not impressed.
@AGI-Bingo · 3 months ago
Hi James, could you please cover how to do "citing" with RAG, with the option to open the original source? That would be cool ❤ I'd also love to see an example for LiveRag, which watches certain files or folders for changes, then re-chunks, embeds, removes outdated entries, and saves diffs. What do you think about these? Thanks a lot!
@tarapogancev · 3 months ago
If you are using Pinecone or similar vector database, along with the vector entry you can usually also add specific metadata. I mostly keep the original text stored within that vector as a 'content' metadata field, and then add other fields for the file's name, topic etc. :) This way, you can cross-reference your data for the users to navigate easily.
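The record shape being described can be sketched in plain Python. This is only an illustration of the pattern — toy embedding values stand in for real model output, and the 'content'/'filename'/'topic' field names follow the comment above rather than any fixed Pinecone schema:

```python
# Sketch of storing chunk text and source info as metadata alongside a vector,
# mirroring the record shape a vector DB like Pinecone accepts on upsert.
# The embedding values here are toy numbers, not real model output.

def make_record(chunk_id, embedding, text, filename, topic):
    """Bundle a vector with metadata so retrieval can surface the source."""
    return {
        "id": chunk_id,
        "values": embedding,
        "metadata": {"content": text, "filename": filename, "topic": topic},
    }

records = [
    make_record("doc1-0", [0.1, 0.2], "Chunking splits text...", "chunking.pdf", "rag"),
    make_record("doc1-1", [0.3, 0.1], "Embeddings map text...", "chunking.pdf", "rag"),
]

# Cross-reference: collect every chunk that came from a given file.
from_file = [r["metadata"]["content"] for r in records
             if r["metadata"]["filename"] == "chunking.pdf"]
print(len(from_file))  # 2
```

With a real client, each record would be passed to the index's upsert call, and the metadata fields could then drive filtered queries.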
@AGI-Bingo · 3 months ago
Got it, so you could also add "filepath" and trigger opening the file. I wonder if there's a way to jump to and highlight a specific part of the text after opening (e.g. a PDF). Also, @@tarapogancev do you know of a way to run diffs on files and delete/re-upload all relevant chunks? Watching files and folders for changes, then triggering re-embedding, to keep everything automatically up-to-date. Thanks 🙏 👍
@tarapogancev · 3 months ago
@@AGI-Bingo The idea of highlighting relevant text sounds great! I am yet to face the UI portion of this problem, trying to achieve similar results. :) I haven't worked with automatic syncs, but they would be very useful! So far, from what I've seen, AWS Knowledge Bases and Azure's AI Search (if I remember correctly) both offer options to sync data manually when needed. It's not as convenient, but I'm thinking it's not a bad solution either, considering it is possibly less work on the server side, and maybe fewer credits for OpenAI or other LLM services. Sorry I couldn't offer help on this topic, but I hope you come up with a great solution! :D
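One simple way to drive the kind of automatic re-sync discussed in this thread is to hash file contents and diff snapshots: changed and removed paths tell you which chunks to delete and re-embed. A minimal sketch (not any library's API; file contents are passed in as bytes for brevity rather than read from disk):

```python
import hashlib

def snapshot(files):
    """Map each path to a hash of its content (files: dict path -> bytes)."""
    return {p: hashlib.sha256(data).hexdigest() for p, data in files.items()}

def diff(old, new):
    """Return (added, changed, removed) paths between two snapshots."""
    added = [p for p in new if p not in old]
    changed = [p for p in new if p in old and new[p] != old[p]]
    removed = [p for p in old if p not in new]
    return added, changed, removed

before = snapshot({"a.md": b"alpha", "b.md": b"beta"})
after = snapshot({"a.md": b"alpha v2", "c.md": b"gamma"})
print(diff(before, after))  # (['c.md'], ['a.md'], ['b.md'])
```

A file watcher would take the `added` and `changed` paths to re-chunk and re-embed, and drop vectors for `removed` and `changed` paths from the index first.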
@prasunkumar2106 · 1 month ago
How can I use llama3.1 to achieve this?
@Piero-xi1yi · 3 months ago
Could you please explain the logic and concept of your code? How does this compare with the semantic_chunker from LangChain / LlamaIndex (it uses something like your cumulative approach, with a sliding window of n sentences and an "adaptive" threshold based on percentile)?
@BB-ou5ui · 3 months ago
Hi! That's exactly what I was looking for and have been exploring with a personal implementation, trying out different strategies beyond dense vectors... Have you considered using multi-vector models like ColBERT? To some extent, you could work with matrix similarities on bigger contexts... I'm also testing some weighted strategies using SPLADE, but it's too early to make claims 😊
@ΛΑΦ · 3 months ago
Can we use Ollama for the embedding?
@hughesadam87 · 2 months ago
I've been using a tool called unstructured to split my documents into known sections (i.e. title, abstract, paragraphs). Do you think having these sections a priori is helpful for chunking, or is it better to just feed plaintext to the chunking strategy and let it do all the grouping/separating?
@CBCELIMUPORTALORG · 4 months ago
🎯 Key points for quick navigation:
📘 The video introduces three semantic chunking methods for text data, improving retrieval-augmented generation (RAG) applications.
💻 Demonstrates use of the "semantic chunkers library," showcasing practical examples via a Colab notebook, requiring OpenAI's API key.
📊 Focuses on a dataset of AI arXiv papers, applying semantic chunking to manage the data's complexity and improve processing efficiency.
🤖 Discusses the need for an embedding model to facilitate semantic chunking, highlighting OpenAI's embedding model as a primary tool.
📈 Outlines the "statistical chunking method" as a recommended approach for its efficiency, cost-effectiveness, and automatic parameter adjustments.
🔍 Explains "consecutive chunking" as being cost-effective and relatively fast, but requiring more manual input for tuning parameters.
📝 Presents "cumulative chunking" as a method that builds embeddings progressively, offering noise resistance but at a higher computational cost.
🌐 Notes the adaptability of chunking methods to different data modalities, with specific mention of their suitability for text and potential for video.
Made with HARPA AI
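The consecutive method summarized above can be sketched in a few lines: embed neighbouring sentences, compare them, and split wherever similarity falls below a hand-tuned threshold. This is a conceptual illustration only — a toy bag-of-words `embed()` stands in for a real embedding model, and it is not the library's implementation:

```python
import math

def embed(text):
    """Toy bag-of-words 'embedding': word -> count."""
    vec = {}
    for word in text.lower().split():
        vec[word] = vec.get(word, 0) + 1
    return vec

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[k] * b.get(k, 0) for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def consecutive_chunk(sentences, threshold=0.3):
    """Split where neighbouring sentences fall below a fixed similarity."""
    chunks, current = [], [sentences[0]]
    for prev, sent in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(sent)) < threshold:
            chunks.append(current)  # similarity dropped: start a new chunk
            current = [sent]
        else:
            current.append(sent)
    chunks.append(current)
    return chunks

sents = ["cats purr", "cats sleep", "stocks fell", "stocks rose"]
print(consecutive_chunk(sents))
# [['cats purr', 'cats sleep'], ['stocks fell', 'stocks rose']]
```

The fixed `threshold` is the manual tuning the summary mentions; it would need adjusting per embedding model and corpus.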
@ariugarte · 3 months ago
Hello, it's a fantastic tool! But I encountered some problems with tables in PDFs and with strings that use characters such as '-' to separate phrases or sections. I end up with chunks that are much bigger than the maximum size.
@lavamonkeymc · 3 months ago
Where's the advanced LangGraph video?
@talesfromthetrailz · 3 months ago
How would you compare the Statistical chunker with the rolling window splitter you used for semantic chunking? Do you prefer one over the other? I'm designing a recommendation system that uses user queries to match to certain outputs they may want. Thanks!
@jamesbriggs · 3 months ago
StatisticalChunker is actually just a more recent version of the rolling window splitter; it includes handling for larger documents and some other optimizations, so I'd recommend the statistical one.
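The "automatic parameter adjustment" idea behind the statistical approach can be illustrated by deriving the split threshold from the document's own similarity distribution, e.g. mean minus one standard deviation, instead of hand-picking it. This is a rough conceptual sketch with toy bag-of-words embeddings; the real StatisticalChunker in semantic-chunkers is considerably more sophisticated:

```python
import math

def embed(text):
    """Toy bag-of-words 'embedding': word -> count."""
    vec = {}
    for word in text.lower().split():
        vec[word] = vec.get(word, 0) + 1
    return vec

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[k] * b.get(k, 0) for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def statistical_chunk(sentences):
    """Split where neighbour similarity drops below mean - 1 std dev."""
    sims = [cosine(embed(a), embed(b)) for a, b in zip(sentences, sentences[1:])]
    mean = sum(sims) / len(sims)
    std = math.sqrt(sum((s - mean) ** 2 for s in sims) / len(sims))
    threshold = mean - std  # adaptive: derived from the data itself
    chunks, current = [], [sentences[0]]
    for sim, sent in zip(sims, sentences[1:]):
        if sim < threshold:
            chunks.append(current)
            current = [sent]
        else:
            current.append(sent)
    chunks.append(current)
    return chunks

sents = ["cats purr", "cats sleep", "stocks fell", "stocks rose"]
print(statistical_chunk(sents))
# [['cats purr', 'cats sleep'], ['stocks fell', 'stocks rose']]
```

The point of the adaptive threshold is that the same code works across documents with different baseline similarity levels, which is why no manual tuning is needed.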
@maxlgemeinderat9202 · 4 months ago
Nice video! So, e.g., if I am reading in docs with unstructured io, can I then use the semantic chunker instead of a RecursiveCharacterTextSplitter?
@jamesbriggs · 4 months ago
Yes you can, there's an (old, I should update) example here: github.com/aurelio-labs/semantic-router/blob/main/docs/examples/unstructured-element-splitter.ipynb. The "splitter" there is equivalent to the StatisticalChunker in semantic-chunkers.
@KenRossPhotography · 4 months ago
Super interesting - thanks for that! I'll definitely be experimenting with those chunking variations.
@jamesbriggs · 4 months ago
Awesome, would love to hear how it goes
@samcavalera9489 · 3 months ago
Hi James, first off, I want to express my immense gratitude for your insightful videos on RAG and other AI topics. Your content is so enriching that I find myself watching each video at least twice! I do have a couple of questions that I hope you can shed some light on:
1) When using OpenAI's small embedding model with the RecursiveCharacterTextSplitter, is there a general guideline for determining the optimal chunk size and overlap size? I'm looking for a rule of thumb that could help me set the right values for these parameters.
2) My work primarily involves using RAG on scientific papers, which often include figures that sometimes convey more information than the text itself. Is there a technique to incorporate these figures into the vector database along with the paper's text? Essentially, for multi-modal vector embedding that includes both text and images, what's the best approach?
I greatly appreciate your insight 🙏🙏🙏
@jamesbriggs · 3 months ago
Hey, thanks for the message! For (1) my rule of thumb is 200-300 tokens with a 20-40 token overlap. For (2) you can use multimodal models (like gpt-4o) to describe what is in the image, then embed that; alternatively you could use a text-image embedding model, but they don't capture as much detail as a multimodal LLM. Hope that helps :)
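The 200-300 token / 20-40 token overlap rule of thumb above can be sketched as a simple splitter. Here whitespace "tokens" stand in for a real tokenizer such as tiktoken, so the counts are approximate; `chunk_size` and `overlap` are the two knobs the rule of thumb sets:

```python
def split_tokens(text, chunk_size=250, overlap=30):
    """Split text into overlapping windows of roughly chunk_size tokens."""
    tokens = text.split()          # stand-in for a real tokenizer
    step = chunk_size - overlap    # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break                  # last window reached the end
    return chunks

# 600 synthetic tokens -> three windows, each sharing 30 tokens
# with its predecessor.
doc = " ".join(f"tok{i}" for i in range(600))
chunks = split_tokens(doc)
print(len(chunks))  # 3
```

The overlap means each boundary sentence appears in two chunks, which softens the retrieval cost of a bad split point.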
@samcavalera9489 · 3 months ago
@@jamesbriggs many thanks James 🙏🙏🙏
@jamesbriggs · 4 months ago
📌 Code: github.com/aurelio-labs/semantic-chunkers/blob/main/docs/00-chunkers-intro.ipynb
⭐ Article: www.aurelio.ai/learn/semantic-chunkers-intro