8 Minutes LangChain OpenAI Beginner Tutorial | ChatGPT with your PDF

19K views

An introduction to LangChain, OpenAI's chat endpoint, and the Chroma DB vector database. This is a step-by-step tutorial on how to make a ChatGPT that uses data from your own PDF, with a breakdown of the sources behind each answer. The video includes a few diagrams explaining the architecture and concepts to show how easy it can be.
🔗 Links
Source code: github.com/edrickdch/chat-pdf
PDF: www.imf.org/en/Publications/W...
OpenAI: platform.openai.com/
LangChain: python.langchain.com/en/lates...
Chroma DB: docs.trychroma.com/
⏳ Timestamps
00:00 Intro
00:11 Overview Large Language Model
00:51 Introduce LangChain
01:03 Part 1 - Ingestion Architecture
02:25 OpenAI API Key
02:48 Part 1 - Ingestion Code
03:04 Parse PDF
03:39 Clean text
04:06 Create text chunks
04:30 Create embeddings + Store Chroma DB Vector
04:45 Part 1 - Recap
04:55 Part 2 - Conversation Architecture
05:45 Part 2 - Conversation Code
06:27 Creating a LangChain chain to chat
07:00 Demo ChatPDF
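
The "Clean text" and "Create text chunks" steps listed in the timestamps can be sketched in plain Python. This is a hypothetical illustration, not the code from the linked repository: the function names `clean_text` and `chunk_text` and the `chunk_size` and `overlap` values are assumptions chosen for the example.

```python
import re
from typing import List


def clean_text(text: str) -> str:
    """Join words hyphenated across line breaks, then collapse whitespace."""
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)  # "exam-\nple" -> "example"
    text = re.sub(r"\s+", " ", text)              # newlines and tabs -> one space
    return text.strip()


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> List[str]:
    """Split text into character windows of chunk_size that overlap by `overlap`."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one neighbouring chunk, which tends to help retrieval. Each chunk would then be embedded and stored in the vector database.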
💌 Link to the newsletter
practical-ai-builder.beehiiv....
🙋‍♂️ Need help?
You can reach out by email at edrick@daceflow.ai if you're looking to improve your business workflows using AI or build an AI app.

Science

Published: 27 Jun 2024

Comments: 41

@igorg.8624 2 months ago

This is incredibly useful for those starting out in this career field. Thanks.

@garthgoldwater5256 1 year ago

This is the first PDF-related Python GPT code I've been able to find on GitHub in two weeks that actually runs on basically the first try (I had to manually add tiktoken for some reason, and the README could use a note about a .env file). You've immeasurably improved my life. Thank you, and great overview of how it works!

@edrickdch 1 year ago

Appreciate the feedback! Good catch, will add something about the .env file

@garthgoldwater5256 1 year ago

Any idea why pipenv install wouldn't grab tiktoken? I'm new to Python.

@edrickdch 1 year ago

@garthgoldwater5256 In this case, tiktoken is a dependency used by the langchain library; we're not using it directly. Most likely there's something wrong in the langchain repo. I took a look at langchain's GitHub repo, and it seems you're not the only one who had to deal with this issue: github.com/hwchase17/langchain/issues/2814 and github.com/hwchase17/langchain/issues/3811

@NeferLopez 1 year ago

Awesome video! Can't wait to see your upcoming vids 👏

@WarongkornTritipakit 11 months ago

Thank you for the clear and simple tutorial. This one is for true beginners like me, and it helps a lot.

@davidignatiusanyaeche8101 1 year ago

This is 🔥 Great content about AI!

@ndch9226 1 year ago

Wow, thanks for this content! It helped me understand a lot.

@stanrymkiewicz4118 1 year ago

Love it

@xiosagikana28 1 year ago

Good video, thank you so much for the tutorial. I do have a question about the "text_to_docs" function: you said "text" is a "List[str]", but you pass the variable "cleaned_text_pdf" as a parameter, which should be a "List[Tuple[int, str]]" since it is the result of the function "clean_text". Or am I missing something? PyCharm keeps flagging this issue.
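
Going by the description in this comment, annotating the parameter as a list of (page number, text) pairs would match what is actually passed in. The sketch below is an assumption about the shape of the data, not the repository's code: the `Page` alias, the `chunk_size` default, and the plain dictionaries (standing in for LangChain document objects) are all hypothetical.

```python
from typing import Dict, List, Tuple

Page = Tuple[int, str]  # (page number, cleaned page text)


def text_to_docs(pages: List[Page], chunk_size: int = 1000) -> List[Dict[str, object]]:
    """Turn (page_number, text) pairs into chunk records that keep their source page."""
    docs: List[Dict[str, object]] = []
    for page_num, page_text in pages:
        for start in range(0, len(page_text), chunk_size):
            docs.append({
                "page_content": page_text[start:start + chunk_size],
                # Keeping the page number is what makes a sources breakdown possible.
                "metadata": {"page_number": page_num, "chunk": start // chunk_size},
            })
    return docs
```

With the annotation matching the real argument type, a type checker such as the one in PyCharm should stop flagging the call.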

@garvitakalra3324 1 year ago

Thank you for this great tutorial. I was able to follow it, and it works as expected except for the tables in the PDF. Could you please advise how I can get answers from the content in a table? When I look at the chunk text, the table content is all mixed up.

@edrickdch 1 year ago

Parsing a table from a PDF is not straightforward. It's actually quite challenging because of how PDFs store data: they are meant to be viewed, not edited or parsed, so the visual layout of a table is lost during parsing. It can be solved on a case-by-case basis if you look at the pattern in the parsed output of a particular table and rearrange it manually. This is also a drawback of semantic search: structured queries (like querying a table) are not something the architecture shown in this tutorial can do well on its own.

@nair38 1 year ago

Thanks for this. Very helpful. I ran the code, but I don't get PDF-specific responses, just ChatGPT responses.
Question: what economic impacts were unforeseen?
Sources:
Answer: I'm sorry, could you please provide me with more context? What specific event or situation are you referring to?

@m1ar1vin 1 year ago

Please post more

@Jonathan-rm6kt 1 year ago

This is a great tutorial, thank you. But how would you create the embeddings/vector store so that you can ask a question like "Summarize Chapter 3"? As the LangChain documentation points out, this is actually difficult because the document search isn't able to look for relevance by text content; it needs a semantic understanding of the body of text contained in chapter 3. If anyone has an explanation for this, it would be great to hear!

@edrickdch 1 year ago

I think what you are asking could be solved by using self-querying. You'd need to add metadata to your text chunks to indicate which chapter they belong to. Then you could send the query "Give me all of chapter 3" and get back all the documents from chapter 3. After that, you could run a summarize chain on them using something like "map reduce". Just an idea; there are most likely many different ways to solve this. python.langchain.com/en/latest/modules/indexes/retrievers/examples/chroma_self_query.html
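
The idea in this reply (tag chunks with chapter metadata, fetch one chapter, then summarize with map reduce) can be sketched in plain Python. This is a simplified illustration under stated assumptions: the filter is applied directly instead of being derived from the question by an LLM as self-querying would do, and `summarize` is a hypothetical callable standing in for an LLM call. None of these names come from LangChain.

```python
from typing import Callable, Dict, List

Chunk = Dict[str, object]  # assumed shape: {"text": str, "chapter": int}


def chunks_for_chapter(chunks: List[Chunk], chapter: int) -> List[str]:
    """Metadata filter: keep only the text of chunks tagged with the given chapter."""
    return [str(c["text"]) for c in chunks if c["chapter"] == chapter]


def map_reduce_summary(texts: List[str], summarize: Callable[[str], str]) -> str:
    """Summarize each chunk on its own (map), then summarize the summaries (reduce)."""
    if not texts:
        return ""
    partial_summaries = [summarize(t) for t in texts]  # map step
    return summarize("\n".join(partial_summaries))     # reduce step
```

The map step keeps every call within the model's context window, and the reduce step combines the partial summaries into one answer.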

@Jonathan-rm6kt 1 year ago

@edrickdch Makes sense, thank you. For some reason, using retrieval metadata filter predicates doesn't work that well for me; it seems to return too many irrelevant docs. Likely an error of mine, though. I've heard the term "hybrid search" thrown around (it combines semantic, keyword, and vector search) but haven't seen much literature on it.

@eiyrm 1 year ago

ChatGPT itself says not to send any personal data. Is it good practice, or ethical, to use the OpenAI API in business and real-world projects? What are your thoughts on this?

@harinisri2962 1 year ago

Hi, thanks for the tutorial. I have a question: if I ask something that is not relevant to the input PDF, will it respond that it doesn't know, or will it give some random answer?

@edrickdch 1 year ago

It will respond that it does not know. This behavior is due to a prompt used in LangChain's source code that will tell the LLM to not make up answers and simply respond that it doesn't know if no relevant knowledge was found.
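
A minimal sketch of how such a prompt can be assembled from the retrieved chunks. The wording of `PROMPT_TEMPLATE` is illustrative and should not be read as a quotation of LangChain's source, and `build_prompt` is a hypothetical helper, not a LangChain function.

```python
from typing import List

# Illustrative wording: the key part is the instruction not to invent an answer.
PROMPT_TEMPLATE = (
    "Use the following pieces of context to answer the question at the end. "
    "If you don't know the answer, just say that you don't know; "
    "don't try to make up an answer.\n\n"
    "{context}\n\n"
    "Question: {question}\n"
    "Helpful Answer:"
)


def build_prompt(context_chunks: List[str], question: str) -> str:
    """Insert the retrieved chunks and the user's question into the template."""
    context = "\n\n".join(context_chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)
```

When retrieval returns nothing relevant, the context gives the model nothing to draw on, and the instruction steers it toward saying it does not know.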

@user-kl3cz6ne4u 1 year ago

Hey Edrick, could you make a video on setting up, like how to get the .env file and Chroma? This would help anyone who is just beginning.

@edrickdch 1 year ago

I wrote a step-by-step guide on setting up the env that you can find on GitHub: github.com/edrickdch/chat-pdf/blob/master/README.md Let me know if anything's unclear or if it doesn't work for you.
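
For readers unsure what the .env file is for: it typically holds the API key as an environment variable so the key stays out of the source code. The sketch below only checks that the key is present. `OPENAI_API_KEY` is the variable name the OpenAI client library reads; `get_openai_api_key` is a hypothetical helper, and how the repository itself loads the file (for example with python-dotenv's `load_dotenv`) is an assumption to verify against the README.

```python
import os
from typing import Mapping, Optional


def get_openai_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the API key from the environment, or fail with a clear message."""
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; add it to your .env file "
            "or export it in your shell."
        )
    return key
```

A .env file for this purpose would contain a single line of the form `OPENAI_API_KEY=` followed by your own key, and it should not be committed to version control.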

@anishpillai 1 year ago

Would it be safe to use this project for protected content like company SOP documents?

@edrickdch 1 year ago

Not as is. To achieve privacy and retain ownership of the data, I would swap OpenAI's model for one you can run locally. And also swap the embedding provider for a local model (Hugging Face). I could make a video on that with more details if there is interest.

@anishpillai 1 year ago

@edrickdch I think a video explaining that would definitely be helpful to a lot of people.

@lukaszmarchlewicz 1 year ago

When executing your code I get an error. Can you help me? 'chromadb.errors.NoIndexException: Index not found, please create an instance before querying'

@edrickdch 1 year ago

Is it the ingestion or the conversation part? Which line is throwing the error?

@westifer8838 1 year ago

2:47 What is that UI? I'm having trouble following this part. Where can I find it?

@edrickdch 1 year ago

It's my code editor. I use VS Code. You can learn more or download it here: code.visualstudio.com/

@vinsi90184 1 year ago

How do you deal with tables in PDF?

@edrickdch 1 year ago

@codeboi3087 1 year ago

It burnt through my OpenAI API

@Sean-vv9kk 1 year ago

Promo>SM 🤩

@RedShipsofSpainAgain 1 year ago

This is an excellent tutorial, Edrick. But please: stop with the infantile thumbnail tactic of using a video thumbnail with some exaggerated facial expression in a sad ploy to get more views. We all know about this YouTube marketing tactic and we're not 13 years old. Your content is excellent as-is; you don't need to sell out using the ol' thumbnail with faces and mouth agape with huge eyes to get viewers to watch. Keep making good content, and your channel will be successful. :-)

@aditisharma2831 11 months ago

Hi, I am facing an error when executing the ingest.py file. It says:
Traceback (most recent call last):
  File "D:\new_fuham\chat-pdf\src\ingest.py", line 132, in <module>
    vector_store = Chroma.from_documents(
  File "D:\new_fuham\venv\lib\site-packages\langchain\vectorstores\chroma.py", line 412, in from_documents
    return cls.from_texts(
  File "D:\new_fuham\venv\lib\site-packages\langchain\vectorstores\chroma.py", line 373, in from_texts
    chroma_collection = cls(
  File "D:\new_fuham\venv\lib\site-packages\langchain\vectorstores\chroma.py", line 88, in __init__
    self._client = chromadb.Client(self._client_settings)
  File "D:\new_fuham\venv\lib\site-packages\chromadb\__init__.py", line 107, in Client
    system = System(settings)
  File "D:\new_fuham\venv\lib\site-packages\chromadb\config.py", line 175, in __init__
    if settings[key] is not None:
  File "D:\new_fuham\venv\lib\site-packages\chromadb\config.py", line 110, in __getitem__
    raise ValueError(LEGACY_ERROR)
ValueError: You are using a deprecated configuration of Chroma.

@loicquivron3872 1 year ago

Hi, very nice tool. I was trying to launch it and got the following error after running the ingest Python script:
Retrying langchain.embeddings.openai.embed_with_retry.<locals>._embed_with_retry in 4.0 seconds as it raised RateLimitError: You exceeded your current quota
Do we need a paid version of ChatGPT to get it running?

@edrickdch 1 year ago

You don't need the paid version of ChatGPT to run this. You've received a rate limit error, which means you've made too many requests in a short period of time. The API will refuse further requests until some time passes. Did you run the ingestion only once, or multiple times in a row? You can read more about it here: platform.openai.com/docs/guides/rate-limits/overview
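
Waiting between attempts, as described in this reply, is usually done with exponential backoff; the log line quoted in the question shows LangChain already retrying in a similar way. The sketch below is a generic illustration, not LangChain's or OpenAI's implementation: `retry_with_backoff` and its parameters are hypothetical names. It only helps with transient rate limits; a "You exceeded your current quota" message can also mean the account has no remaining credit, which waiting will not fix.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    call: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, doubling the wait after each failure, up to max_attempts tries."""
    for attempt in range(max_attempts):
        try:
            return call()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            sleep(base_delay * (2 ** attempt))  # waits 1s, 2s, 4s, ... by default
    raise ValueError("max_attempts must be at least 1")
```

In practice `retry_on` would be narrowed to the rate limit exception of the client library in use, so that unrelated errors fail immediately.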

@aditisharma2831 11 months ago

@edrickdch Hey, I have used API keys from different accounts for this error but am still facing the same error. Can you help with it?

@aditisharma2831 11 months ago

I am facing the same problem. Did you find a solution to this?