
How to Scrape and Extract Data with Langchain GPT Function Calling 

tylerwhatsgood
2.5K subscribers
10K views

the extractooor
- say hi on Twitter: tylerwhatsgood_
- notebook: github.com/linuxandchill/scra...
#gpt4 #gpt3 #ai #python #webscraping

Published: Jun 25, 2023

Comments: 37
@CalmCascade. 1 year ago
Just wanted to drop a comment to say thank you for creating and sharing this insightful video on how to use Langchain and chatGPT for web scraping and data extraction. The step-by-step demonstration using Python, Beautiful Soup, and Playwright was clear and extremely easy to follow. Keep up the excellent work and I'm looking forward to your future content. Thanks again!
@fortestingpurposesonly2697 1 year ago
love the videos! Enjoying the straight-to-the-point and fun commentary. Very honest and very helpful!
@SigmaScorpion 9 months ago
I accidentally landed and now subscribed... you're lit 🔥
@BuildMagic 1 year ago
awesome, you’re a fogging genius
@the-web-scraping-guy 4 months ago
Great video
@jsfnnyc 1 year ago
Great video. I learn so much just from reading other people's code.
@jsfnnyc 1 year ago
PS: Texas represent!! Lolz
@tylerwhatsgood 1 year ago
@jsfnnyc haha thanks brother 🤠🤠🤠
@MZ-gf6fx 10 months ago
good job, kiddo! subscribed.
@nicolasmartinez7302 10 months ago
Love your videos. Have you thought about making the code available through Google Colab?
@brezl8 1 year ago
@user-qn9sk4jy1p 6 months ago
hey, you don't have to re-declare the function on each cell! also i would like to see if generating the schemas can also be done using the openai api
@MohamedJemai-pw6gn 3 months ago
I tried using function calling for invoice data extraction, but when the schema content and description got big I noticed a weird regression: the model would return a weird {text:nonsense} instead of output matching the schema. For reference, I was using gpt 3.5 1106.
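One way to catch the failure described above is to check every returned object against the schema before using it. The following is a minimal sketch, assuming the schema is a dict with "properties" and "required" keys; `invoice_schema`, its fields, and `is_valid_record` are hypothetical names made up for illustration and are not taken from the video's notebook.

```python
# Hypothetical schema in a JSON-Schema-like shape (field names are assumptions).
invoice_schema = {
    "properties": {
        "invoice_number": {"type": "string"},
        "total": {"type": "number"},
        "vendor": {"type": "string"},
    },
    "required": ["invoice_number", "total"],
}

# Map schema type names to the Python types we accept for them.
_TYPES = {"string": str, "number": (int, float), "integer": int, "boolean": bool}


def is_valid_record(record, schema):
    """Return True if record has every required key and no unknown or mistyped keys."""
    if not isinstance(record, dict):
        return False
    props = schema["properties"]
    if any(key not in record for key in schema.get("required", [])):
        return False
    for key, value in record.items():
        if key not in props:
            return False  # e.g. a stray {"text": "..."} object
        expected = _TYPES.get(props[key].get("type"))
        if expected is None:
            continue  # unknown or missing type name: skip the type check
        if isinstance(value, bool) and expected is not bool:
            return False  # bool is a subclass of int, so reject it explicitly
        if not isinstance(value, expected):
            return False
    return True
```

Records that fail the check can be logged and re-requested, possibly with a smaller schema, instead of being written to storage.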
@aadhilimam8253 10 months ago
how can we use a vector store as input for the langchain extraction chain?
@user-qn9sk4jy1p 6 months ago
also one needs to fidget with selenium or playwright instead of bs4 to navigate to/from pages
@shivamkumar-qp1jm 1 year ago
You can also give the option to save in the CSV file
@tylerwhatsgood 1 year ago
yes totally!!
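Saving to CSV can be sketched with the standard library alone. This assumes the extraction step returns a list of dicts, one per scraped item; the `records` data and the `records_to_csv` helper are hypothetical and only illustrate the idea.

```python
import csv
import io


def records_to_csv(records, fieldnames, out):
    """Write a list of dicts to a file-like object as CSV, leaving missing keys blank."""
    writer = csv.DictWriter(out, fieldnames=fieldnames, extrasaction="ignore", restval="")
    writer.writeheader()
    writer.writerows(records)


# Hypothetical extraction output: one dict per scraped item.
records = [
    {"title": "Article one", "author": "A. Writer"},
    {"title": "Article two"},  # the model omitted "author" for this item
]

buffer = io.StringIO()
records_to_csv(records, ["title", "author"], buffer)
csv_text = buffer.getvalue()
```

To write to disk instead of memory, pass a file opened with `open("results.csv", "w", newline="")` in place of the buffer.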
@MarxOrx 10 months ago
Bro. 😂 i came for copper but i found G O L D.
@ronm3804 1 year ago
is this a better method than gpt-engineer?
@uiucdsc 7 months ago
i am getting a not implemented error from running the async playwright function. unable to figure out why
@tylerwhatsgood 6 months ago
hi sorry for late reply! were u able to get this working? which version of python and package are u using? happy to help pls link a gist and i'll take a look! thank u!
@vishnunair5837 4 months ago
faced the same problem in my jupyter notebook. for future reference, make sure to pip install playwright (or pip install --upgrade playwright) & then run playwright install
@mertzorlu386 1 year ago
I get a rate limit exceeded error from langchain after several tries. What do you recommend, Tyler?
@tylerwhatsgood 1 year ago
did u get this working? sometimes the oai api is under load
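A common way to cope with intermittent rate-limit errors is to retry with exponential backoff. The sketch below is generic and not tied to any library: `call_with_backoff` is a hypothetical helper, and because the exact exception class raised on a rate limit depends on the client version installed, it is left as the `retry_on` parameter.

```python
import random
import time


def call_with_backoff(fn, retries=5, base_delay=1.0, retry_on=(Exception,), sleep=time.sleep):
    """Call fn(); on a retryable error wait base_delay * 2**attempt plus jitter, then retry."""
    for attempt in range(retries):
        try:
            return fn()
        except retry_on:
            if attempt == retries - 1:
                raise  # out of attempts: surface the last error
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            sleep(delay)
```

Usage would look like `call_with_backoff(lambda: chain.run(text), retry_on=(SomeRateLimitError,))`, where `chain`, `text`, and `SomeRateLimitError` stand for whatever objects the notebook actually uses.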
@user-hc8cu2ww9i 11 months ago
output = await run_player("url") fails for me; don't know why. I even installed asyncio and tried. If I remove await there, it does not give the correct output. Everything else is fine. Why is this happening?
@tylerwhatsgood 11 months ago
can you put your code in a gist and share? i can take a look and try to diagnose, thank u!!
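Without seeing the code the cause can't be confirmed, but two general points about `await` may be relevant to the question above. Calling an async function without `await` returns a coroutine object rather than its result, and a bare top-level `await` only works where an event loop is already running (such as a Jupyter cell), not in a plain script. A minimal sketch, using a hypothetical stand-in `fetch_page` in place of the real scraping function:

```python
import asyncio


async def fetch_page(url):
    """Stand-in for an async scraping function; a real one would drive a browser."""
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    return f"<html>content of {url}</html>"


# Without await, calling the function only creates a coroutine object.
pending = fetch_page("https://example.com")
is_coroutine = asyncio.iscoroutine(pending)
pending.close()  # discard it cleanly to avoid a "never awaited" warning

# In a plain .py script, top-level await is a syntax error, so start a loop explicitly.
# (In Jupyter a loop is already running: use `await fetch_page(...)` in the cell instead.)
html = asyncio.run(fetch_page("https://example.com"))
```

Note that asyncio ships with Python, so it does not need to be installed separately.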
@whackojaco 6 months ago
Any possibility to use an open source LLM to achieve similar results?
@tylerwhatsgood 6 months ago
yes definitely! what's your favorite OS LLM?
@whackojaco 4 months ago
@tylerwhatsgood At the moment my preference is Mistral 7B.
@CristianOrihuelaTorres 3 months ago
u only have to change the llm used, langchain has integrations for most of them
@denzelcanvasYT 1 year ago
would love to connect with you bro
@tylerwhatsgood 1 year ago
hmu anytime!
@carriebartkowiak 1 year ago
So just to be sure I understand this correctly...
- It will only scrape one page at a time; it won't do a full directory (say, a shared folder of Google Docs)
- You have to know in advance what information you want; it looks for that specifically and generates output based on your query
- You cannot have it scan a number of pages/documents and *then* ask various questions about the content
- The info that it scrapes is not persistent from one query to the next, much less from one session to the next
- The scraped data is private to you; it does not get fed back into the model
Is that right?

Backstory: I'm an author. I'm looking for a way to feed all my manuscripts and copious notes, timelines, plot outlines, etc. into an LLM and then be able to ask it questions about the content. Sort of a virtual assistant / dynamic story bible that helps me keep all my details straight without having to take time to dig for the info myself. (Like "What color are Karen's eyes?" or "In which book did Joe meet Captain Huffer for the first time?")

I'm thinking GPT4All is my best bet for now, but boy, all the demos I've seen of it are horrendously slow. I haven't yet found an online-hosted model that will 1) take that much data (we're talking multiple 100k-word novels, plus notes) and 2) keep it private to me, not feed it back into the model. (If you know of one, please tell! :) )
@tylerwhatsgood 1 year ago
hi Carrie, yes you're right that this only scrapes 1 page at a time, but it could easily be edited to scrape any number of pages. This video was more of a demonstration of web scraping and how you can get structured output from the things you scrape pretty easily (most likely to be subsequently stored in a db somewhere).

I think that you are looking for something different, and if i understand correctly you should be able to accomplish something fairly close to what you want with embeddings of all your works/notes and a half-decent chat model.

the performance of the open source models compared to openai and even anthropic is just not in the same league at the moment unfortunately, BUT you should try out a few OS ones (alpaca, vicuna, or one of the eleutherAI ones like gpt-j-6b), though you'll probably need quite a bit of ram.

i totally understand the desire to keep things local, but OpenAI does state that they "will not use API data to train OpenAI models or improve OpenAI's service offering". whether or not we can trust what they say is tbd.

in any case, i'm sorry i couldn't be more helpful, but i will mess around with this a little and see if i can come up with something over the next few weeks. thank you for your comment and sorry for the long reply 😅
@tylerwhatsgood 1 year ago
haven’t tried it but mpt-30B looks promising!
@petswolrd280 1 year ago
You can use vector db and retrieval chain
@tylerwhatsgood 1 year ago
@petswolrd280 yup that's what i was getting at w embeddings. what do you recommend for a good vector store? want to give pinecone another try, it's been a while
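The embeddings-plus-retrieval idea discussed in this thread can be sketched in a few lines. This is a toy illustration only: a real setup would use an embedding model and a vector store, whereas here `embed` is a hypothetical stand-in that just counts words, and the `chunks` are made-up example text. The mechanics are the same: embed the question, rank the stored chunks by similarity, and hand the best matches to the chat model as context.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(question, chunks, k=1):
    """Return the k chunks most similar to the question."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


# Hypothetical manuscript chunks; in practice these come from splitting the documents.
chunks = [
    "Karen has green eyes and red hair.",
    "Joe met Captain Huffer aboard the ferry in book two.",
    "The harbor town floods every spring.",
]
best = top_k("What color are Karen's eyes?", chunks)
```

Word counts miss synonyms and paraphrases, which is exactly what learned embeddings are meant to handle, so treat this as a picture of the data flow rather than a usable retriever.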