It's my first year in programming and there was nothing new here, actually. I don't even think the pain was worth it; I'd just make the scraper in JS and have it return a JSON string.
Nice video. It’s worth noting as well that many APIs will paginate, so rather than checking how many total results exist and manually iterating over them, you just check whether the ‘next page url’ or equivalent key exists in the results and, if so, get that page too, merging/appending each time until it no longer exists and the dataset is complete 👍
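The pattern described above can be sketched like this; the `next` and `results` key names are assumptions (every API names them differently), and the fake `pages` dict stands in for real `requests.get(url).json()` calls:

```python
def collect_all(fetch_page, first_url):
    """Follow 'next' links, appending each page's 'results' until no 'next' key remains."""
    items = []
    url = first_url
    while url:
        page = fetch_page(url)
        items.extend(page.get("results", []))
        url = page.get("next")  # missing/None means we've reached the last page
    return items

# Fake three-page API for illustration:
pages = {
    "/api?page=1": {"results": [1, 2], "next": "/api?page=2"},
    "/api?page=2": {"results": [3, 4], "next": "/api?page=3"},
    "/api?page=3": {"results": [5]},
}
print(collect_all(pages.get, "/api?page=1"))  # [1, 2, 3, 4, 5]
```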
I rarely praise anything, but this tutorial was SO good! Well explained, no filler. In 7 or 8 minutes you guided me through finding the hidden information I needed, which tools I need to use and how to automate it. This tutorial gave me enough confidence to try to write my first Python script! Within hours I built a scraper that can pull all metadata for a full NFT collection from a marketplace. Without this video it would have taken days/weeks to discover all of this
Because of this video, I was able to start my own rockets and satellites company. In only four hours, I started the company, launched thousands of rockets, and now I have my own interplanetary wireless intranet from which I can control the entire galaxy! Thanks again!
Please ignore my first comment. I checked out your first video in this series and learned about using scrapy shell to test each line of code. With that I found the bug in my code. The code worked PERFECTLY as advertised. You're the man! Much thanks!
Loved everything about this video! Great delivery style, production quality and interesting topic for me. First time visitor to this channel and not a Python user (thanks, YouTube, for your weird but helpful predictive algorithms).
Wow, thanks for this excellent tutorial! I just spent all this time writing cumbersome Selenium code, when it turns out all the data I was looking for was already right there!
This video was the answer to my prayers! The next best option was to watch a one-hour video and hope they would teach what you taught... in 10 minutes!!! 👏👏👏
This video came into my feed just a couple of days after I used exactly this method to collect some data from a website. Very good info! This is much easier than web scraping. Unfortunately, in my case, the data I could get out of the API was incomplete, and each item in the response contained a URL to a page with the rest of the info I needed, so I had to write some code to fetch each of those pages and scrape the info from them. But still much easier than having to scrape the initial list as well.
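A minimal sketch of that two-step approach, with hypothetical names; `fetch_detail` stands in for whatever function fetches and scrapes each item's own page:

```python
def enrich(items, fetch_detail):
    """Merge each listing item with the extra fields fetched from its detail URL."""
    return [{**item, **fetch_detail(item["url"])} for item in items]

# Stand-ins for the incomplete API listing and the per-item detail pages:
listing = [{"id": 1, "url": "/item/1"}, {"id": 2, "url": "/item/2"}]
details = {"/item/1": {"specs": "A"}, "/item/2": {"specs": "B"}}
print(enrich(listing, details.get))
```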
I like how you regularly start sentences with 'you might think', assuming we are all idiots. I approve; glad smart people like you make time to explain to us plebs how the world works. Appreciated.
Been using Python for a couple of years now as a picked-up language, and I really appreciate getting to see how someone experienced approaches these problems.
I have tried this method, but sadly the site I am trying to scrape from returns "error": "invalid_client", "error_description": "The client credentials provided were invalid, the request is unauthorized." Am I out of luck?
Nice video! Used a similar method to collect European Court of Human Rights case documents since there is no official API. Glad to see such methods gaining popularity online, it’s so useful!
There is always something new to learn. I've been spending hours grinding out this kind of information by hand-writing a whole program to get my result ;D Thanks!
Thanks for sharing! This has helped me a lot. After struggling for weeks with selenium, I was able to apply this technique fairly quickly, and am now using it as source to scrape ETF-composition data to feed directly into a PowerBI dataset. Much appreciated!
Nice tutorial on scraping; some tricks I have been using myself, and some others I'd never heard of until now. Thanks for sharing!!! Small adjustments if I may (please don't take this as criticism): I think you don't need to loop over each product to copy it into your res, you can use extend instead. Also, I think the header didn't change, so you can take it out of the loop over pages.
Greetings from Brazil! Thank you! I just had to adjust some of the quote marks in the header: there were some 'chained' double quotes (like ""windows""), making some of the header's strings be interpreted by Python as code, not text. I just had to change the inner double quotes to single quotes (e.g. "'windows'") and it worked perfectly! Can't wait to try your other tutorials. Once more, thank you very much!
I went from "hm, okay, yeah" to "HOLY SHIT, THAT'S THE DOPEST SHIT I'VE EVER SEEN". I'm starting to get into this niche and I intend to learn more Python and SQL (you know, Data Analysis stuff/jobs), and I'm doing a project to scrape NBA statistics, but there are always some errors and it ends up taking a long time. BUT THIS IS GOLD CONTENT, KEEP IT UP
Nice tutorial. But one important thing you haven't mentioned is that most such APIs usually have some sort of authorization (based on headers, referrer, token, key, whitelist, etc.).
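For the simple header/token case, it often comes down to copying the values the browser sends (visible in the Network tab) into your own request. A stdlib-only sketch; the URL and token are placeholders:

```python
from urllib.request import Request

# Placeholder values; in practice you'd copy these from the browser's Network tab.
headers = {
    "Authorization": "Bearer YOUR_TOKEN_HERE",
    "Referer": "https://example.com/shop",
    "User-Agent": "Mozilla/5.0",
}
req = Request("https://example.com/api/products", headers=headers)
print(req.get_header("Authorization"))  # Bearer YOUR_TOKEN_HERE
```

This won't help with expiring tokens or signed requests, but for many hidden APIs it's all that's needed.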
@@Al3xdude19 You have to REALLY deviate from normal behaviour to catch blocks like that lol. If you fire off requests as fast as possible then yeah you'll probably get caught
John, a specific video about how to scrape a React website would be nice. It uses a mix of HTML and JSON data on pages... just an idea. Keep up the good work, loving it.
This was nearly exactly my job back in 2014/2015 for a giant e-com shoe company. Was always nice when you'd come across a brand that included their inventory count in their API. But yes selenium/watir all day lol
Really nice, juicy piece of knowledge. This XHR tab changes the game. However, an issue arises: how to tackle the cookie-expiry problem, and what if the API needs a JWT token or private key?
Instead of looping over the list and doing an append of each individual item, you can do list().extend(list()) which extends the list with the new list. The result of this is 1 list of dictionaries (basically an identical result to how you did it) but with less and cleaner code.
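In other words (sketch with made-up data):

```python
# Two 'pages' of results, as they might come back from an API:
pages = [[{"name": "a"}, {"name": "b"}], [{"name": "c"}]]

res = []
for page in pages:
    # Instead of: for product in page: res.append(product)
    res.extend(page)

print(res)  # [{'name': 'a'}, {'name': 'b'}, {'name': 'c'}]
```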
Great video! Thanks so much for sharing! I think you should consider some academic research program (if you haven't already). I am sure you would do amazing work. Congrats and thanks again!
Hi there, I found your channel, where each and every video is delicately made for web scraping and automation, which helps me a lot as I work with web scraping and web automation. I have a request: if possible, please make a video on Python POST methods against a stateful API v1, and how to mimic cookies and sessions to get the job done. Thank you.
Hi John. Amazing content as always. Do you think I can skip learning Scrapy for now? Can I do most scraping tasks just by using BeautifulSoup and requests-html?
Complete rookie here. I’m trying to understand scraping to help access my lap times from the mylaps api utilizing my own interface. This is intimidating for a novice like me.
This works in a lot of cases where the API is open. However, in cases like social media platforms, where you have to have an account to access the API, or WordPress websites where the API is turned off, it won't work. The best approach in these situations is really just to use Selenium or something similar and try to crawl the pages with a delay.
Yea, as you say, anything that needs a login is much more tricky. In some cases you can pass the cookie and headers around and maintain the session, but sometimes selenium/playwright is the best option.
I've experienced being blocked when scraping a website, but can I be blocked when I scrape through an API? Should I use sleep like when scraping an actual website? Keep up the good work, not much wholesome web scraping content out there.
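Yes, API endpoints can rate-limit or block you just like HTML pages, so a pause between calls is still sensible. A sketch of a polite fetch loop; the delay bounds are arbitrary, and `fetch` stands in for whatever function makes the actual request:

```python
import random
import time

def polite_get(urls, fetch, min_delay=1.0, max_delay=3.0):
    """Call fetch on each URL with a random pause in between, to look less bot-like."""
    results = []
    for url in urls:
        results.append(fetch(url))
        time.sleep(random.uniform(min_delay, max_delay))
    return results

# Demo with a fake fetcher and near-zero delays:
print(polite_get(["/a", "/b"], str.upper, min_delay=0, max_delay=0.01))  # ['/A', '/B']
```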
Okay, so should I get information for 10 different products in one API call if I can, instead of making 10 calls and getting information about one product in each? Even if it's the same amount of data, is it easier for the server to handle? @@JohnWatsonRooney
Thank you so much for the tutorial. I have a question: how do I get the authentication value to include in the header? Can I do it automatically and without Selenium? At the moment I get it manually from the Network tab; furthermore, the authentication value expires after a while.
I don't think you can, no. What I do is load up the page once with Selenium, grab all the headers and cookies, and use them in subsequent requests with this method.
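The only Selenium-specific bit is converting the output of `driver.get_cookies()` (a list of dicts) into the plain `{name: value}` dict that `requests` expects. A sketch using fake cookie data in place of a live driver:

```python
def selenium_cookies_to_requests(cookie_list):
    """Convert driver.get_cookies() output into the {name: value} dict requests wants."""
    return {c["name"]: c["value"] for c in cookie_list}

# Shape of what driver.get_cookies() typically returns (trimmed):
fake_cookies = [{"name": "session", "value": "abc123", "domain": ".example.com"}]
print(selenium_cookies_to_requests(fake_cookies))  # {'session': 'abc123'}
```

Then something like `requests.get(api_url, cookies=selenium_cookies_to_requests(driver.get_cookies()), headers=my_headers)` should carry the browser session over (`api_url` and `my_headers` being whatever you captured from the Network tab).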
Do you have any tips to get, for example, your own Instagram followers? When I copy the cURL request and paste it into Insomnia, I don't get the same JSON as in the browser; instead I get some HTML which previews the Instagram logo. I assume it has to do with some authentication, but I have no idea how to fix it.
Thanks for this - and other - videos, John. Super helpful! Regarding the cookie expiring, can you suggest a way to use playwright to programmatically generate the cookie used on the API request? I am assuming that cookie isn’t the same as the cookie used for the request of the html but maybe that’s wrong?