Just wanted to drop a comment to say thank you for creating and sharing this insightful video on how to use LangChain and ChatGPT for web scraping and data extraction. The step-by-step demonstration using Python, Beautiful Soup, and Playwright was clear and extremely easy to follow. Keep up the excellent work, and I'm looking forward to your future content. Thanks again!
I tried using function calling for invoice data extraction, but when the schema content and descriptions got big I noticed a weird regression where GPT would return a weird {"text": nonsense} blob instead of output matching the schema. For reference, I was using gpt-3.5-turbo-1106.
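In case it helps anyone hitting the same thing, here's a trimmed-down sketch of the kind of function schema I mean. All the field names here are made up for illustration; the real schema was much bigger, which is when the regression showed up.

```python
# Hypothetical invoice-extraction function schema (field names are
# illustrative, not from a real invoice pipeline). In my testing, the
# {"text": ...} regression seemed to appear once descriptions got long.
extract_invoice = {
    "name": "extract_invoice",
    "description": "Extract structured fields from an invoice.",
    "parameters": {
        "type": "object",
        "properties": {
            "invoice_number": {"type": "string", "description": "Invoice ID"},
            "total": {"type": "number", "description": "Grand total"},
            "line_items": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "description": {"type": "string"},
                        "amount": {"type": "number"},
                    },
                },
            },
        },
        "required": ["invoice_number", "total"],
    },
}

# This dict is what you'd hand to the API, e.g. as
# tools=[{"type": "function", "function": extract_invoice}]
# in a chat.completions.create call.
print(extract_invoice["name"])
```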
Hi, sorry for the late reply! Were you able to get this working? Which version of Python and which package versions are you using? Happy to help, please link a gist and I'll take a look. Thank you!
output = await run_player("url") fails for me; I don't know why. I even installed asyncio and tried again. If I remove the await it runs but does not give correct output. Everything else is fine, so why is this happening?
So just to be sure I understand this correctly...
- It will only scrape one page at a time; it won't do a full directory (say, a shared folder of Google Docs).
- You have to know in advance what information you want; it looks for that specifically and generates output based on your query.
- You cannot have it scan a number of pages/documents and *then* ask various questions about the content.
- The info that it scrapes is not persistent from one query to the next, much less from one session to the next.
- The scraped data is private to you; it does not get fed back into the model.
Is that right?

Backstory: I'm an author. I'm looking for a way to feed all my manuscripts and copious notes, timelines, plot outlines, etc. into an LLM and then be able to ask it questions about the content. Sort of a virtual-assistant dynamic story bible that helps me keep all my details straight without having to take time to dig for the info myself. (Like "What color are Karen's eyes?" or "In which book did Joe meet Captain Huffer for the first time?")

I'm thinking GPT4All is my best bet for now, but boy, all the demos I've seen of it are horrendously slow. I haven't yet found an online-hosted model that will 1) take that much data (we're talking multiple 100k-word novels, plus notes) and 2) keep it private to me, not feed it back into the model. (If you know of one, please tell! :) )
Hi Carrie, yes, you're right that this only scrapes one page at a time, but it could easily be edited to scrape any number of pages. This video was more of a demonstration of web scraping and how you can get structured output from the things you scrape pretty easily (most likely to be subsequently stored in a db somewhere).

I think you're looking for something different, and if I understand correctly, you should be able to accomplish something fairly close to what you want with embeddings of all your works/notes plus a half-decent chat model. The performance of the open-source models compared to OpenAI and even Anthropic is just not in the same league at the moment, unfortunately, BUT you should try out a few OS ones (Alpaca, Vicuna, or one of the EleutherAI ones like gpt-j-6b), though you'll probably need quite a bit of RAM.

I totally understand the desire to keep things local, but OpenAI does state that they "will not use API data to train OpenAI models or improve OpenAI's service offering". Whether or not we can trust what they say is TBD. In any case, I'm sorry I couldn't be more helpful, but I will mess around with this a little and see if I can come up with something over the next few weeks. Thank you for your comment, and sorry for the long reply 😅
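To make the embeddings idea concrete, here's a toy sketch of the retrieval step: each chunk of your notes gets a vector, and a question is answered by finding the most similar chunk. The vectors and chunk texts below are completely made up for illustration; in a real setup they'd come from an embedding model (e.g. OpenAI embeddings or a local sentence-transformers model) run over your manuscripts.

```python
import math

# Made-up note chunks with made-up 3-dimensional "embeddings".
# Real embeddings would be hundreds of dimensions, produced by a model.
chunks = {
    "Karen has green eyes and red hair.":       [0.9, 0.1, 0.0],
    "Joe met Captain Huffer in book two.":      [0.1, 0.9, 0.0],
    "The timeline of book one spans a summer.": [0.0, 0.2, 0.9],
}

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_chunk(query_vec):
    # Return the chunk whose embedding is most similar to the query's.
    return max(chunks, key=lambda text: cosine(query_vec, chunks[text]))

# A question like "What color are Karen's eyes?" would be embedded the
# same way; here we fake its vector as one close to the Karen chunk.
print(nearest_chunk([0.8, 0.2, 0.1]))  # → "Karen has green eyes and red hair."
```

The retrieved chunk(s) then get pasted into the chat model's prompt as context, which is how you sidestep the "can't fit multiple 100k-word novels in one prompt" problem.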
@@petswolrd280 Yup, that's what I was getting at with embeddings. What do you recommend for a good vector store? I want to give Pinecone another try; it's been a while.