This script I threw together saves me hours. 

John Watson Rooney
86K subscribers
20K views

Finding the best way to scrape data from a site is time-consuming. This script uses Selenium Wire to view the network requests a site makes and gives you back a list of URLs and JSON responses.
Proxies: nodemaven.com/...
Patreon: / johnwatsonrooney (NEW free tier)
Scraper API www.scrapingbe...
Donations: www.paypal.com...
Hosting: Digital Ocean: m.do.co/c/c7c9...
Gear I use: www.amazon.co....
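
The approach in the description can be sketched roughly as below. This is a minimal illustration, not the script from the video: the function names `collect_json_responses` and `main` are hypothetical, and filtering on a `Content-Type` that contains "json" is an assumption about what counts as a JSON response. The Selenium Wire pieces used are `driver.requests`, `request.response`, and `seleniumwire.utils.decode`.

```python
import json


def collect_json_responses(requests, decode_body):
    """Return (url, parsed_json) pairs for captured requests that returned JSON.

    `requests` is an iterable of captured request objects (for example
    `driver.requests`); `decode_body` is a callable (body, encoding) -> bytes,
    such as `seleniumwire.utils.decode`.
    """
    results = []
    for request in requests:
        response = request.response
        if response is None:
            # The browser never received a response for this request.
            continue
        # Assumption: treat anything whose Content-Type mentions "json" as JSON.
        if "json" not in response.headers.get("Content-Type", ""):
            continue
        encoding = response.headers.get("Content-Encoding", "identity")
        try:
            payload = json.loads(decode_body(response.body, encoding).decode("utf-8"))
        except ValueError:
            # Covers both invalid JSON and bytes that are not valid UTF-8.
            continue
        results.append((request.url, payload))
    return results


def main(url):
    # Imported here so the helper above can be used without selenium-wire installed.
    from seleniumwire import webdriver
    from seleniumwire.utils import decode

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        found = collect_json_responses(driver.requests, decode)
    finally:
        driver.quit()

    for request_url, _payload in found:
        print(request_url)
    return found
```

Calling `main("https://example.com")` opens Chrome, loads the page, prints the URL of every JSON response it saw, and returns the parsed payloads so they can be saved or searched.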

Published: 28 Sep 2024

Comments: 71
@liketheduck 6 months ago
Fantastic “apprentice” content. This assumes a basic understanding but also pushes the novice forward. I really appreciate it!
@Extrey 1 year ago
I didn't even know that Selenium could be used like this. Thank you very much, great work as always))
@jagdish1o1 1 year ago
I used seleniumwire to create a scraping bot. It’s a very good package for grabbing backend requests. Using Selenium, I logged in, then grabbed the cookies and the backend API ;) Then I simply closed the browser and used the Python requests lib to make the requests, which made things a little bit faster. Eventually I dockerized everything, pushed the container image to AWS ECR, and ran it in parallel on AWS ECS. Pretty amazing.
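The hand-off described in the comment above (log in with the browser, then reuse the cookies in plain HTTP calls) might look something like this. It is only a sketch: the helper names are hypothetical, and the standard library's `urllib` is swapped in for the `requests` package the commenter used so that the example has no third-party dependencies. The input is the list of dicts that Selenium's `driver.get_cookies()` returns.

```python
from urllib.request import Request


def cookie_header(selenium_cookies):
    """Join Selenium cookie dicts into a single Cookie header value."""
    return "; ".join(f"{c['name']}={c['value']}" for c in selenium_cookies)


def build_api_request(api_url, selenium_cookies, user_agent=None):
    """Build a urllib Request that carries the browser session's cookies."""
    headers = {"Cookie": cookie_header(selenium_cookies)}
    if user_agent:
        # Reusing the browser's User-Agent is optional; some backends check it.
        headers["User-Agent"] = user_agent
    return Request(api_url, headers=headers)
```

After logging in, `build_api_request(api_url, driver.get_cookies())` gives a request that can be passed to `urllib.request.urlopen` once the browser has been closed with `driver.quit()`.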
@datacleaningchallenge2029 1 year ago
Impressive. What's your email? I need to ask you a question related to your code.
@DerekMurawsky 5 months ago
This is really great, and a great foundation, too. I can see this being extended to support so many things.
@pldvs 1 year ago
"Because. I. Don't. Care..." 😂😂
@JohnWatsonRooney 1 year ago
haha
@dubey_ji 7 days ago
This is really good, thank you so much for this tutorial.
@zakariaboulouarde4591 4 months ago
Hello, thank you for the amazing video. May I ask how I can bypass a Cloudflare 403 Forbidden when requesting an API? Thank you for all your efforts 🙏🏽
@jessejames3169 1 year ago
Love your thought process behind writing this! It makes it easy to follow why you do a certain step, and whether it’s necessary for others! Great vids, keep it up!
@JohnWatsonRooney 1 year ago
Glad it was helpful!
@sandunwijethunga6787 1 year ago
Great video. Thank you, John ❤
@Garycarlyle 7 days ago
How did this work without importing 'requests'?
@kocahmet1 1 year ago
golden content here
@satyajeetkumar3993 1 year ago
Hi John!! I really appreciate this new content. I have a question. I was using Selenium WebDriver with Chrome to fetch data from a website. The script works fine, but after a certain number of iterations the driver stops behaving the way it should and I get a NoneType error. I tried clearing the cookies and starting a new session, then continuing from where I left off, but it still doesn't work. Any suggestions?? I really appreciate it!! Thanks!!
@JohnWatsonRooney 1 year ago
Hard to say, but when I get problems like this I always check what the direct output from loading the page is. You could be hitting a captcha.
@satyajeetkumar3993 1 year ago
Actually, the new page is loading properly. I didn't check the terminal output, but the page loads. After that, when I look for an element on the same page that I know is there, I get an error.
@tizianonakamader8177 1 year ago
Amazing content, thank you
@JohnWatsonRooney 1 year ago
Very welcome
@StonedApe420 1 year ago
Can it make a complete copy of requests, with URL, headers and payload?
@satwikawasthi2002 1 year ago
What if the API is only called when a user action occurs?
@JohnWatsonRooney 1 year ago
The next step to upgrade this would be to run the same thing but insert clicks on various page links first and check each one.
@satwikawasthi2002 1 year ago
@@JohnWatsonRooney thanks for the reply 🙏 Also, most importantly, a POST API that expects custom keys in its headers or payload will not give the expected response. Please make a video on how to handle this.
@maloukemallouke9735 1 year ago
Thank you. I am wondering if you make money with these tools????
@AndyTutify 1 year ago
Are you no longer using Neovim?
@JohnWatsonRooney 1 year ago
I still use Neovim. I decided to use VS Code for video demos as I thought it would include more people.
@Niuroteya 1 year ago
I don't really get it.. I mean, you can filter the Network tab by link or by the word "api" too if you want to. Plus this solution will not work for everything, but the Network tab will. Other than filtering for only the needed requests, this solution doesn't seem to do anything. And yeah, you can do a bit more advanced filtering here, but.. does this really save a lot of time for some kind of task? It's just hard for me to see how. Did I miss something? I've been writing AJAX scripts dealing with forms for the past year+ and for me it would be absolutely useless.
@JohnWatsonRooney 1 year ago
I use it when I am given a URL and want to do some quick checks, saving any JSON output so I can search inside it all from my terminal. I chose to semi-automate something I was doing regularly, that's all.
@markbennett5626 1 year ago
Maybe not for everyone, but once scripted, including a user prompt for the URL, it'll be quicker than using the Network tab and give a much nicer response. Plus I can see adding the additional steps of recording session keys and further calls.. Thanks John
@user-tk5ir1hg7l 1 year ago
Is this better than Puppeteer network events?
@JohnWatsonRooney 1 year ago
I have limited experience with Puppeteer. I expect it to be the same, although I prefer selenium-wire to Playwright for network events.
@user-tk5ir1hg7l 1 year ago
@@JohnWatsonRooney ok, how about Playwright network events? Does it have similar functionality, or would you still recommend going with seleniumwire?
@spab87 8 months ago
Hi, thanks a lot, this was very helpful to learn from. I use contextlib.suppress; it's actually faster than try/except and I think it looks better. Your function would look like this:
    import contextlib
    for request in driver.requests:
        with contextlib.suppress(Exception):
            data = decodesw(
                request.response.body,
                request.response.headers.get("Content-Encoding", "identity")
            )
            resp = json.loads(data.decode("UTF-16"))
            resps.append(resp)
    return resps
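The `contextlib.suppress` pattern from the comment above can be tried without a browser. The sketch below is a stand-alone variation, not the function from the video: the name `parse_json_bodies` is hypothetical, it suppresses only `ValueError` rather than every `Exception` so unrelated bugs still surface, and it assumes the bodies are UTF-8.

```python
import contextlib
import json


def parse_json_bodies(bodies):
    """Parse each bytes body as JSON, silently skipping the ones that fail."""
    parsed = []
    for body in bodies:
        # A ValueError (invalid JSON or undecodable bytes) skips just this body;
        # the loop then carries on with the next one.
        with contextlib.suppress(ValueError):
            parsed.append(json.loads(body.decode("utf-8")))
    return parsed
```

This behaves like a `try`/`except ValueError: pass` around the loop body; which form reads better is largely a matter of taste.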
@ХайлайтыДлиннойВоли
Can I bypass hqq.tv devtool blocking using this?
@valoclips2896 1 year ago
Nice idea. But I still prefer to log the requests via the Network tab or Burp Suite. Chromedriver detection will also kick in on some sites.
@JohnWatsonRooney 1 year ago
Fair enough, it does have some uses but also limitations, as you say.
@Septumsempra8818 1 year ago
Anyone else update Chrome on their PC and have all their scrapers break? 😅
@MasoomNini 1 year ago
Hi John, big fan. Thanks for the tutorials ❤ I'd like to contact you on social media; I need help scraping one site, please.
@ivanowdenis 1 year ago
Hello John, could you make a video on how to scrape data that a server sends through a WebSocket connection in real time?
@bakasenpaidesu 1 year ago
.
@darylhunt9070 1 year ago
Good video. Do you capture API keys in Selenium Wire as well? Some APIs use session keys.
@JohnWatsonRooney 1 year ago
You can grab any headers and cookies, yeah.
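As a rough illustration of the reply above, session-related request headers can be read straight off the captured requests. The function name `extract_session_headers` and the default header list are assumptions made for this sketch; Selenium Wire exposes each captured request's `url` and `headers`, and the headers object supports `.get()`.

```python
def extract_session_headers(requests, wanted=("Authorization", "Cookie")):
    """Map each request URL to whichever of the wanted headers it carried."""
    found = {}
    for request in requests:
        present = {}
        for name in wanted:
            value = request.headers.get(name)
            if value:
                present[name] = value
        if present:
            # Requests without any of the wanted headers are left out.
            found[request.url] = present
    return found
```

With a live session this would be called as `extract_session_headers(driver.requests)`; extending `wanted` with a site-specific header name is the obvious way to adapt it.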
@abdelrahmankhaled8239 4 months ago
Complete noob here, just started web scraping. For some reason the seleniumwire import is giving me this error:
    import blinker._saferef
    ModuleNotFoundError: No module named 'blinker._saferef'
I've been searching online for help for hours. I changed Python versions (currently using the same one you're using in the video) and nothing seems to work. Please help, thank you in advance.
@DudethatGross 4 months ago
pip install blinker ?
@twelfth4927 6 months ago
Guys, I'm watching with passion, but what would this be helpful for? What are web scrapers actually doing?
@DudethatGross 4 months ago
Gathering data that would otherwise be difficult to get without a proper API
@kite759 1 year ago
that's very useful, thank you
@AleksT28 1 year ago
I was working with selenium / selenium-wire until I hit an issue I could not debug: when dockerised, selenium-wire was not listening on the port where Selenium was running.
@JohnWatsonRooney 1 year ago
That's interesting. I haven't tried dockerising it, but I will keep an eye open for issues.
@AllifIzzuddin 1 year ago
So this is kinda like Playwright network events, right?
@JohnWatsonRooney 1 year ago
Yes, same thing, but I found it better to use.
@KishanParmar-x4u 1 year ago
Are you using the JetBrains Mono font? If yes, how does it look so thin?
@JohnWatsonRooney 1 year ago
It is, yeah. I don't know, I didn't do anything other than select that font, sorry.
@mitvpankaj2454 1 year ago
Great work bro!! I have one question: when I try to scrape Walmart, the "robot or human" pop-up comes up every time. Can you please guide me on how to bypass this type of bot detection system? Thanks, and love your content, because of you I learned Python!! 👍
@JohnWatsonRooney 1 year ago
Check out undetected chrome driver. There’s some good information for it that might help.
@mitvpankaj2454 1 year ago
I tried, bro, but it's still showing the same issue. If you have any reference or video, can you please suggest it? It'll be very helpful for me and others too :)
@linuxkerem 1 year ago
Are you using Arch Linux, sir? And thanks for the content! 🥰
@JohnWatsonRooney 1 year ago
Thanks! It's actually just Ubuntu + i3.
@linuxkerem 1 year ago
@@JohnWatsonRooney Wow, I guess my mind went straight to Arch when I saw a Hyprland-style window manager 😁
@TheCulpritgamer 6 months ago
Can you please share the script you created, for my future reference??
@iamshiva003 1 year ago
What are the VS Code theme and font used in this video?
@JohnWatsonRooney 1 year ago
GitHub Dark theme and JetBrains Mono!
@iamshiva003 1 year ago
@@JohnWatsonRooney thank you
@TimoTalksTech 1 year ago
Amazing, just what I was looking for. I need to look into whether I could fetch all the IPs too.
@AhmedThahir2002 1 year ago
Hi John! Love your work. Could you share the code from your videos?
@markbennett5626 1 year ago
Maybe John has the code available to Patreon members ;)
@AhmedThahir2002 1 year ago
@@markbennett5626 Ohhhhh okay, no issues hehe :)