PDFs present a unique challenge for chunking in retrieval-augmented generation (RAG) systems. Unlike formats that store only the text itself, PDFs also encode the exact position of text and visual elements on the page. What appears to a human reader as a single, continuous piece of text can actually be made up of multiple separate fragments, each positioned at specific coordinates to look continuous. This is often done to fine-tune the appearance of the text, such as adjusting the spacing between letters (kerning) or placing graphical elements with precision. While these positioning details enhance the visual presentation for humans, they complicate text processing for machines. In many PDFs, especially poorly formatted ones, text is not stored as a continuous stream; instead, it is broken into blocks or lists of elements scattered across the page. This fragmentation makes it difficult to extract and chunk text coherently, so extracted documents can range from well-organized to entirely jumbled, depending on the original layout and formatting.
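A minimal sketch of how such fragments might be stitched back into reading order, assuming each fragment carries x/y page coordinates (the `Fragment` shape here is hypothetical; real extractors such as pdf.js expose similar per-item position data):

```typescript
// Hypothetical fragment shape; real PDF extractors expose similar data
// (a string plus its position on the page).
interface Fragment {
  text: string;
  x: number; // horizontal position on the page
  y: number; // vertical position (PDF y grows upward)
}

// Group fragments that share roughly the same baseline into one line,
// then sort each line left-to-right so the reading order is restored.
function stitchFragments(fragments: Fragment[], yTolerance = 2): string {
  const lines: Fragment[][] = [];
  const sorted = [...fragments].sort((a, b) => b.y - a.y || a.x - b.x);
  for (const frag of sorted) {
    const line = lines.find((l) => Math.abs(l[0].y - frag.y) <= yTolerance);
    if (line) line.push(frag);
    else lines.push([frag]);
  }
  return lines
    .map((line) => line.sort((a, b) => a.x - b.x).map((f) => f.text).join(""))
    .join("\n");
}

// Three fragments that render as one visual line, plus a line below them.
const fragments: Fragment[] = [
  { text: "Hel", x: 10, y: 700 },
  { text: "lo ", x: 30, y: 700.5 },
  { text: "world", x: 50, y: 700 },
  { text: "Next line", x: 10, y: 688 },
];
console.log(stitchFragments(fragments)); // "Hello world\nNext line"
```

The y-tolerance matters because kerned fragments often sit a fraction of a point off the shared baseline; exact equality would split one visual line into several.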
I was waiting for this one, thanks!! Your videos are amazing and are really inspiring me to start contributing to the community with whatever I'm able to. Thanks a lot for the time and effort you put into these
Video ideas: My idea is to make tiny LLMs do specific tasks, for example working out a strategy for converting a PDF into structured data and then passing it to a model for fine-tuning. I currently do this manually: I look at the structure inside the PDFs to determine how each one is constructed, and then run a script to convert the PDFs in bulk into structured data.
Matt, one quick comment: it's important to note the max chunk size supported by each embedder. The problem with sentence-length variation is that the chunk size limit can be violated, e.g. 256, 512, 8192 tokens. If that happens, the array produced by the embedder still works, but text is silently dropped: whatever exceeded the limit will not be included in the array that represents the text. Your search function (e.g. cosineN) will not work as you expect when producing the best-matching chunks. Dropped text can happen at many points, including what the model itself can support, and at no point will you get any warning that it is happening, so engineering to these limits is super important for production-ready results.
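One way to guard against that silent truncation is to check each chunk against the model's limit before embedding and split anything oversized. A rough sketch (the 512-token limit and the whitespace-based token estimate are assumptions; your embedder's actual tokenizer gives exact counts):

```typescript
// Rough token estimate via whitespace splitting. Real embedders use their
// own tokenizers, so treat this as a conservative approximation only.
function estimateTokens(text: string): number {
  return text.split(/\s+/).filter(Boolean).length;
}

// Split a chunk that would exceed the embedder's limit into pieces that
// fit, instead of letting the model silently drop the overflow.
function enforceLimit(chunk: string, maxTokens = 512): string[] {
  const words = chunk.split(/\s+/).filter(Boolean);
  if (words.length <= maxTokens) return [chunk];
  const pieces: string[] = [];
  for (let i = 0; i < words.length; i += maxTokens) {
    pieces.push(words.slice(i, i + maxTokens).join(" "));
  }
  return pieces;
}

// A 1200-word chunk becomes three pieces (512 + 512 + 176 words),
// each safely under the limit.
const big = Array.from({ length: 1200 }, (_, i) => `w${i}`).join(" ");
const pieces = enforceLimit(big, 512);
console.log(pieces.length); // 3
console.log(pieces.every((p) => estimateTokens(p) <= 512)); // true
```

In practice you would also add some overlap between pieces so a sentence cut at a boundary still matches queries about either half.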
Thank you for another interesting video. I've been able to get a much better understanding of RAG in general, and the role of embeddings specifically. One thing I'm curious to know more about is how embeddings work with other modalities. Ollama has support for 3 different embedding models for text, but I haven't been able to find much information on embedding models that support other modalities like audio or video. Specifically, I would love to be able to use Ollama for RAG with image embeddings. As a sample use case, consider embedding your entire collection of Google Photos and then having a conversation with a model about your photos, for example "What did I see when I visited the Eiffel Tower?" Anyway, thanks for sharing your videos, much appreciated!
Wow. Excellent, clearly explained. Thank you. I'm trying to rewire my web dev brain into RAG and fine tuning. Can you explain how to fine-tune? I'd like to make an AI customer service chatbot for my wife's gym website. Thanks again.
Hello Matt, thanks a lot for your videos. I'm not using Bun normally, but I've heard a lot of good things about it. If I were to try it on Windows 11, how can I try different models? Do I have to download them myself, or does your Bun script handle this? Or do I have to use Ollama as I did in one of your previous videos? Thanks for the hint to use small documents. I made my first RAG steps by downloading our whole website (university) with gpt-crawler, which results in a JSON file with the HTML content of 500 pages. Even for me as a human, it's very bad content. import.ts worked fine. Is it normal that the answer takes minutes on a 16-core 4750 Ryzen laptop without a GPU?
That is a worthless solution for a few reasons. First, it's Windows-only. I don't mind too much that it is a paid product, but it's a command-line tool, so it's not worth even looking at for a RAG solution.
I tried the exact same embed model, main model, and source URL to run the embedding on, and gave the same query to search.ts, but got weird results (when asked about the earthquake in Taiwan, the model returned info on Apple products). Is anyone facing a similar issue?