I am blown away by how much information is densely packed into this. You've got yourself a new subscriber, sir. It's staggering to think about how these technologies will shape the landscape for data and analytics. This is just the beginning.
Damn, I spent like a week researching this shit on my own, and have been working on almost exactly the same thing. Processing MDX files into embeddings etc. It’s really cool to see somebody doing almost the same exact thing. Makes me think I am really on the right track!
@@automioai In this project no PDF files were used - all the documentation was written directly in MDX. You'll have to do some research on ways to extract text from PDF files. Once you have that, I wouldn't bother with MDX at all - just generate embeddings directly on that content.
Prediction: this is gonna get a million views. I just saw the Fireship video about vector databases and wanted to understand embeddings. Before I could even search, this video was on the page. Though I wasn't interested in a 40 min video (had a feeling I'd just stop after 5 mins like I usually do), I ended up watching it all. The rabbit hole 🐇 🕳️ format is so naturally elegant. Clear end-to-end use case. I secretly don't want to share it with anyone, but I am forced to fulfill my prediction.
The clarity of this video while maintaining detailed granularity of the subject is very impressive and very appreciated. Thank you for making this video.
Found your channel while learning React Three Fiber, subbed with notifications immediately. Today I get a notification for a well-explained ChatGPT tutorial, right as I embark on building a similar thing. Fantastic continued work, thank you very much!
I've been looking for this information for months. Such an excellent tutorial, and I love that Supabase's code is all open source so I can actually clone it and read how it works in detail later. Thank you so much for the walkthrough. Super talented dude too - love the Blender stuff.
@@RabbitHoleSyndrome why is the generate-embeddings file in the video so different than what is in the repo now? I can't find anything you talk about in minutes 10-13.
Hey @@fraternitas5117! Supabase moves pretty quick - the code I referenced has since been refactored to support multiple knowledge sources (i.e. more than just markdown). You can find the markdown-specific code here: github.com/supabase/supabase/blob/1b2361c099c2573afa1fe59d3187343bb8f1bcab/apps/docs/scripts/search/sources/markdown.ts
What a fantastic video and content. I've gone through multiple videos trying to better understand embeddings and how to work with ChatGPT in the best way for querying large amounts of content and producing an analyzed response. I'm not a developer - I have a background in computer science, but I'm a software salesperson who is curious about technology - and I was able to completely understand your video and content. Subscribed, liked, and will be watching more of your videos. Thank you!
Wow. I'm so thrilled to know that you were one of the people behind that great feature. I've been using Supabase for 6 months and have been pretty happy with it, except for the docs and the transition to 2.0. I was blown away when I saw it generate the code for me when I started typing in its documentation.
@@RabbitHoleSyndrome So far so good! I think one challenge is knowing how to check whether a user's email address exists (or other specific user metadata). I couldn't find it in the docs. There was a GitHub issue which said to store the user's data in a separate table, since the auth table is private. I ended up doing that and haven't had any problems. Btw, thanks again for all the awesomeness!
The most amazing thing is that I basically made a chatbot app in less than a week with only the help of GPT-4, with no prior knowledge of AWS services, PostgreSQL, or Python. Everything you described in the video is what GPT-4 told me. All of the servers and the database are set up; it has memory, STT, TTS, and Cognito login/register.
Fireship guy? Either way, it's been pretty useful in terms of learning APIs and how to connect them to my no-code builder. I spent hours trying to get things working, and the assistant basically told me what I was doing wrong and how to fix it. So well done with the implementation.
Dude, that's exactly what I was looking for. No more bullshit articles with clickbait titles - just DIY in its essence. Is there a way to support you through Patreon or something?
Insane, I love how you're able to do so many things. My laptop can't keep up with every wish of mine at the moment (getting into 3D), but I hope I'll be able to soon.
This is a glorious illustration. Thank you very much! I've been trying to find an example of doing this, and yours has put it all together for me! Subscribed!
28:37 Whenever I see examples of decoder (GPT) prompts starting with "You are a helpful finance advisor" or "You are an enthusiastic support rep", I can almost see the AI clearing its throat and sitting up straight and saying "right, ok". Gimme that can-do attitude, GPT!
One thing that might help is if the answer showed links to the documents the information was pulled from. Since you currently fetch which documents to run ChatGPT on based on feature similarity, maybe you could change the prompt so that it also returns the links of the documents that were deemed "similar documents".
First time here. This is so well done. Subscribed. Your view count will explode! I like how you approached the topic in a very calm way without jumping on the "LLMs will take over the world" train :) You don't happen to have the Clippy Blender asset somewhere?
Amazing video! This is exactly what I've been looking for for a long time. You basically explain everything I wanted to know about how to create a search engine using OpenAI. But I have a few questions: how much did you spend on the OpenAI embedding API building this? How much does Supabase spend monthly on searches using the OpenAI API? Is it possible to use an open-source embedding API instead of calling the OpenAI API? Wouldn't it be less expensive than the approach you took?
Glad it was useful! As for costs, you may be surprised how inexpensive OpenAI embeddings are (at least I was). To put it in perspective, for the Supabase guides we currently have around 1500 page sections, which total just over 220,000 tokens. At OpenAI's current embedding price ($0.0004/1k tokens), that brought us to just under $0.10 for the entire guide knowledge base (a roughly one-time pre-processing cost). After that, the average query is likely only a small fraction of a cent.
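If you want to sanity-check that math, it's just tokens ÷ 1000 × price - the numbers below are the ones from this reply:

```ts
// Back-of-envelope embedding cost for the whole knowledge base
const totalTokens = 220_000; // ~1500 page sections
const pricePer1kTokens = 0.0004; // USD, text-embedding-ada-002 at the time
const cost = (totalTokens / 1000) * pricePer1kTokens;
console.log(cost.toFixed(3)); // "0.088" → just under $0.10, one-time
```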
This is a fantastic video! Thank you very much for sharing :D Quick question - currently, if the info is not in the documentation, it responds "Sorry, I don't know how to help with that." But how can we make it respond like this instead: "Sorry, I don't have relevant info in the documentation, but you can do something like this..."? E.g. "I don't have any info about how to make banana pancakes in the documentation, but here is how you can make them..." The idea is to make it act like ChatGPT on top of the information provided. Keen to know more on this, and thank you so much for making this video :D
I have not seen such a great video in a while! How wonderfully you explained the whole process 👍👍! Could you explain a bit more about how you chose 0.78 as the threshold for the embeddings comparison? Did you verify statistically whether the most relevant sections can be found with it?
Glad you liked it! 0.78 was a first-stab threshold that worked best based on a limited sample of test queries. I wouldn't claim that this number is universal - it could almost certainly vary by domain.
"pretty much every single open source project I've seen that has documentation uses either markdown or MDX". I'm pretty sure the embedding process and AI stuff is interesting... The real question is who would do an entire markdown for a specific open source project... For free!! I mean I understand the willingness to give code for free since you not only get exposure, but you get to show off your coding abilities... But documentation... Let's review the lifecycle of a project just to see how difficult is to materialize a piece of **Documentation.** First comes the goal of a project, what problem it aims to solve. Maybe the goal is to compete financially. Then the scouting process, this may be combined with actual development, even if you think you know everything you'll get lost... Finally the testing phase, the project has been "finished" and it is being tested on every possible configuration and all different devices. Finally comes the documentation. Now the best/worst part is that the Documentation is a WHOLE program on it's own, with it's own set of keywords and ruleset. You know what's my guess?? Only the projects actually making money... Or the ones with a wealthy owner are the ones getting documentation. In fact, the fact that this documentation is a 1:1 match with it's super nice web page, tells me that the web page came BEFORE the MDX doc... Which means the entire project was envisioned as a capital backed enterprise with investors and what not. EVEN IF the documentation process is something which is automated, the question is what type of input this automation accepts... No one will code in function of a good automated documentation... So I'm sure you either a) need to clean the automated autogenerated documentation b) a low payed intern does it. c) there is no such thing as being automated, you just need a line of coke and full goblin mode. d) there is actual people doing documentation for free... You want to know the worst thing? I've never found ANY documentation worth my time reading... None. Most documentation hide important information and make assumptions about what users will or will not do with the code. I've seen documentation from million dollars companies being COMPLETELY wrong about what the actual code does. Systems cannot be explained with words since they are 2 dimensional, you cannot reference, cross reference or encapsulate ideas, to abstract them and use them as functions... you actually need to see the source code to REALLY understand. What I see happening, or what I REALLY really hope happens is that the cost of doing menial jobs such as writing documentation will balloon because the market will notice it's importance in the process of AI embedding. Writing the documentation IS the real hard work.
Great content. I wonder if it can generate code as well. Let's say you're doing this for Cloudflare products - KV, Workers, Durable Objects, etc. - with all the docs injected. Then you give a prompt like "generate a worker for x" and it generates specifically against the given docs.
Loved it - came for the embeddings and left with that and a whole lot more! Too bad Clippy didn't make it onto the web at Supabase :). Copyright problems?
Thanks for watching! Clippy did make it - but things move fast at Supabase 😅. The feature has evolved into a unified cmd+k menu and was just renamed "Supabase AI".
What kind of vectors did you generate from ChatGPT? Are they word vectors? You passed a whole section of the MDX, so are they not word vectors but paragraph vectors?
Hey! Context injection does a couple of things:
1. It primes the prompt with specific information we want GPT to reply with.
2. We can always use up-to-date information - anytime we need to add/remove/update information in our knowledge base, it's as simple as a DB update. No need to re-fine-tune the model all over again.
See the sketch below for what this looks like in practice.
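As a rough sketch (not the actual Supabase code - the variable names here are illustrative), context injection just means pasting the matched sections into the prompt:

```ts
// Illustrative only: build a prompt with the matched doc sections injected.
const matchedSections = [
  { content: 'pgvector lets you store embeddings in a Postgres column…' },
];
const userQuery = 'How do I store embeddings?';

const prompt = `Answer the question using only the context below.

Context:
${matchedSections.map((s) => s.content).join('\n---\n')}

Question: ${userQuery}`;
```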
Amazing! How does Clippy update the vector DB and process the text as embeddings when NEW documentation is added? Is it automatic? Maybe I missed it in the video.
Great question! The `generate-embeddings` script was designed to be diff-based, so the next time you run it, it pulls in only the documents that have changed and re-creates embeddings on just those. It currently works using checksums:
1. Generate a checksum for the content and store it in the DB.
2. The next time the script runs, compare the checksums. If they don't match, the content has changed and the embeddings should be re-generated.
The script runs on CI, so anytime documents change, a GitHub Action triggers the script. See this PR for details: github.com/supabase/supabase/pull/13936
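The checksum logic looks roughly like this (just a sketch - the real script lives in the repo, and the DB access here is faked with a Map):

```ts
import { createHash } from 'node:crypto';

// Sketch of the diff-based check: re-embed only when content has changed.
const checksum = (content: string) =>
  createHash('sha256').update(content).digest('base64');

function shouldRegenerate(
  path: string,
  content: string,
  storedChecksums: Map<string, string> // stand-in for the DB table
): boolean {
  const current = checksum(content);
  if (storedChecksums.get(path) === current) return false; // unchanged → skip
  storedChecksums.set(path, current); // record the new checksum
  return true; // changed → re-generate embeddings for this document
}
```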
@@RabbitHoleSyndrome Thank you for making this content for us for free! I very much appreciate it, and you motivate me to share knowledge with my friends :)
Thanks! PDF to MDX will be a tough task. But if your end goal is embeddings, you could consider pulling the content out of the PDF and generating embeddings on it directly without getting MDX involved.
This is amazing - a super clear and short explanation of embeddings, and a great walkthrough of the code touching on the relevant parts. Subbed to the channel 👍 I see the code is still in the Supabase repo - but Clippy is not there? What happened?
Glad it was helpful!

Edit: Just realized you might have been talking about the Clippy graphic on the site, not the code. Search and Clippy have been combined into the same interface - you can find Clippy by clicking search, then switching to "Ask Clippy".

Original: Things move quick at Supabase 😆 The Clippy frontend code got moved when search was upgraded to also use embeddings. After the refactor everything is just under "Search": github.com/supabase/supabase/blob/0ecc238ad6d81202bb2301f7919b166a98929697/apps/docs/components/Search/SearchModal.tsx The backend Clippy logic is still in the same edge function: github.com/supabase/supabase/blob/0ecc238ad6d81202bb2301f7919b166a98929697/supabase/functions/clippy-search/index.ts
@@RabbitHoleSyndrome Cool. You can give Supabase this feedback: a) this video significantly increased my interest in adopting Supabase; b) the Clippy icon is too small - it should be large and obtrusive. Serious prompt engineering question: with the Chat Completion API, do you set a system persona and pack the whole prompt into the user message? Or would you break a prompt like this one into, say, 3 user entries: context, question, "use markdown"? Does it make a difference?
Good feedback 🙌 and great question about prompt engineering in the chat API. We are actually experimenting with this right now, trying to understand what produces the best results. At the moment we do a bit of both (a system message and a user message, with some prompt overlap between the two). OpenAI says the model doesn't pay strong attention to the system message, so it may be better to use the system message strictly to set identity, and provide instructions & context in a user message - roughly like the sketch below.
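Concretely, the split we're experimenting with looks something like this (the values here are illustrative):

```ts
// Identity in the system message; instructions + context in the user message.
const context = '…matched doc sections go here…';
const question = 'How do I create a row level security policy?';

const messages = [
  { role: 'system', content: 'You are an enthusiastic Supabase support rep.' },
  {
    role: 'user',
    content: `Answer using only this context, formatted as markdown:

${context}

Question: ${question}`,
  },
];
```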
Great video! How closely do the .mdx files have to match this structure before they can be processed into embeddings? Do they need to export the meta const, for example?
Are there alternatives to OpenAI for creating these vectors? I don't really feel comfortable building something around a closed-source API that is controlled by one vendor.
Really great question. You'll want to look into sentence embeddings. There has been a lot of work on the OSS side with Sentence-BERT (SBERT) that you can check out. You might also want to look at Universal Sentence Encoder (USE) and InferSent - see the sketch below for one way to run an SBERT-family model yourself.
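For example, you can run an SBERT-family model locally in JS with the open-source Transformers.js library (a sketch - the library and model choice are my suggestion, not something covered in the video):

```ts
import { pipeline } from '@xenova/transformers';

// all-MiniLM-L6-v2 is a small SBERT-family model that runs fully locally.
const embed = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');
const output = await embed('How do I store embeddings in Postgres?', {
  pooling: 'mean',   // mean-pool token embeddings into one sentence vector
  normalize: true,   // unit-length vectors → dot product = cosine similarity
});
console.log(output.data.length); // 384-dimensional sentence embedding
```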
LlamaIndex actually uses OpenAI (text-embedding-ada-002) by default for embeddings today - it's more of a toolkit layer to assist with the workflow. There are many other alternatives, though (which LlamaIndex supports via LangChain), that are worth checking out: langchain.readthedocs.io/en/latest/reference/modules/embeddings.html
16:10 I'd love an explanation of how embeddings work with document structure - e.g. a query like "summarize chapter 3". Embeddings by themselves don't seem to capture the structure of the chunks contained under the title chunk "chapter 3". All explanations of embeddings I've seen rely only on the text content within a chunk.
Can you make the completion also say which chunks were used and link to them, or provide "read more" links? Or would you do that by just listing the "top 5" matches from the context?
So if I understood correctly: the embeddings were only used to check for similarity between the user's input and the docs' content, in order to provide the prompt with relevant (text) context, right? Is there a way to provide the GPT model with the embeddings themselves instead?
That's correct. You could have used an alternate search method, but embeddings align nicely with LLMs since they are produced by a language model themselves. Unfortunately, no - there is currently no way to inject embeddings directly into GPT. Maybe this will change in the future, or become available in open-source models like LLaMA, the same way we've seen it happen with Stable Diffusion.
It wasn't released yet (things move fast these days 😆). We've actually just updated to use gpt-3.5-turbo - much cheaper, and better suited for multi-message chat-style interactions (though single-prompt responses are still possible). The most difficult challenge with gpt-3.5-turbo has been getting it to work well with a prompt - it doesn't seem quite as good at following the original instructions.
Hey - CSV files are definitely doable. It will mostly come down to how you plan to pre-process them. Perhaps you could import your CSV file into a table and generate embeddings on the content within it - something like the sketch below.
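A hedged sketch of what that could look like (naive CSV handling, and the column layout is an assumption on my part):

```ts
import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

const csvText = `name,description
pgvector,Store embeddings in Postgres`;

// Naive parsing: embed each data row as one chunk of content.
const rows = csvText.split('\n').slice(1); // skip the header row
for (const row of rows) {
  const res = await openai.embeddings.create({
    model: 'text-embedding-ada-002',
    input: row,
  });
  const embedding = res.data[0].embedding; // store this next to the row content
  console.log(row, embedding.length);
}
```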
Loved the video - it validated a lot of the decisions we're making at work. I have a question, however, about the section on context injection. You mention that you search for relevant information to inject into the prompt. How do you accomplish the search part? Is it using an index, or a SQL query across all columns?
Glad it was helpful! The search is done through embeddings - we perform a similarity search between the embedding generated from the user's query and the pre-generated embeddings of the knowledge base (stored in a column using pgvector). Roughly like the sketch below.
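From the client's perspective it looks something like this (a sketch - the RPC name and parameters are illustrative stand-ins for the real ones in the Supabase repo):

```ts
import { createClient } from '@supabase/supabase-js';

const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_ANON_KEY!
);

// queryEmbedding comes from the OpenAI embeddings API for the user's query.
const queryEmbedding: number[] = []; // fill with the 1536 floats from ada-002

// Call a Postgres function that ranks stored embeddings by similarity.
const { data: sections, error } = await supabase.rpc('match_page_sections', {
  query_embedding: queryEmbedding,
  match_threshold: 0.78, // the first-stab threshold mentioned earlier
  match_count: 10,       // top matches to inject as context
});
```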
@@RabbitHoleSyndrome Ahh, so: 1) call the OpenAI embedding API for the query, 2) use cosine similarity to compare the query embedding against the stored embeddings, 3) inject the top results into a prompt that we compile and send to the OpenAI completion API?
Hey, great content! I was looking for exactly this. Do you have a Discord for asking questions? I was making the same thing, but for ".md" files rather than MDX, and ran into some issues.