Easily the best video on web scraping in Python I've ever seen. Only 20 minutes, but it has more content than many 1hr+ tutorials. You've also explained many useful cases (e.g. what to do if an element is missing). Thank you!
This is excellent content. I've been browsing for hours looking for a clear and detailed explanation and was lucky enough to find your video. And only 20 mins long! Thank you for sharing!
Hi! I know this is a really late comment, but I've been trying to follow every guide on the internet and somehow it won't return any of the site's content. I only get the divs with class, body, etc., but no content inside of them - like I only get the top of the tree. I use a login and find_all. Do you know what this problem is called / how to solve it? I know it's pretty vague but I don't want to make you read a whole paragraph haha
Try saving/printing the whole response as text and check whether the data you want is actually there - if not, try the same site but use Playwright to load it with a browser instead and see if that works
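For anyone hitting the same issue, a rough sketch of both checks (the URL is just a placeholder):

    import requests
    from playwright.sync_api import sync_playwright

    url = "https://example.com/products"  # placeholder

    # Step 1: check whether the data is in the raw response at all
    r = requests.get(url)
    print(r.text)  # search this output for the content you expected

    # Step 2: if it isn't, render the page with a real browser via Playwright
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        html = page.content()  # the HTML after JavaScript has run
        browser.close()
    print(html)

If the data only shows up in step 2, the site is building the page with JavaScript, which is why plain requests sees an empty tree.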
Thank you. Why do I get this error? name = soup.find('h1', class_='product-main__name').text.strip() AttributeError: 'NoneType' object has no attribute 'text' I tried "r.text", but it didn't help
A few hours later, and your tutorial's already being implemented in my everyday web scraping / data cleaning work at the office. +1 Subscribed! Edit: Do you think you'll make a video combining the stuff in this video with using Jupyter Notebook to fill a CSV/Excel file?
Thanks for the lessons. I have a website where I must input dates, then I get a list of links; I have to click on each one and scrape a JSON file. I am doing this with Selenium because I am a beginner and BeautifulSoup doesn't seem to work, but it's super slow. I think the site uses JavaScript. Is there a better way to do this than using a headless browser? Do you have a video that might help me?
Thank you, but it would have been more valuable if there were somewhere to download the code from so we could study it at our leisure. I don't know about anyone else, but I can barely read the code.
Is there any other way to scrape data from all the pages besides using the page number from the URL? The website I am trying to get data from does not generate a new URL for each page.
The requests-html and requests libraries don't work in an AWS Lambda function. When I use urllib it works! My method (with the imports it needs): from urllib.request import urlopen; from bs4 import BeautifulSoup; html = urlopen(baseUrl); soup = BeautifulSoup(html.read(), 'lxml')
Thanks John, that was a very helpful video. As an economics major, I really need to be able to gather lots of data and process it efficiently, so web scraping was just a natural thing to learn. Keep up the good work!
Great tutorial, just went through it. An excellent progression from the last one, as most of the scraping that I have wanted to do involves "digging in". I feel that I am finally learning, as I noticed the issue with the rating as we were typing it out!
The entire code does not work for me using the requests method, but when I used the WebDriver browser method it worked fine. The problem is that the page has to load every time, which takes a long time. I also used headless mode to avoid rendering the page every time, but it still doesn't work. I think the page has to load first to fetch the data, so why does your code return the data fast without loading it? When I use the requests method it returns empty, None, or []. I use exactly the same code as you, so maybe the headers are the issue? Which headers did you use?
When you program web scrapers for work, do you wrap the different parts of your code in functions and then call them from a main function, like get_Links(), get_products(), etc., or do you just leave it as a long script because it's simple enough? Also, thank you so much for your content. I'm not a STEM student, but I was able to learn enough to build my own dataset for school even though I had never programmed before. Thank you so much for taking all this time.
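For what it's worth, that function-based layout might look something like this minimal sketch (the URL and selectors are placeholders, and the function names just mirror the ones in the question):

    import requests
    from bs4 import BeautifulSoup

    BASE_URL = "https://example.com"  # placeholder site

    def get_links(page):
        # fetch one listing page and return the product links on it
        r = requests.get(f"{BASE_URL}/products?page={page}")
        soup = BeautifulSoup(r.text, "html.parser")
        return [a["href"] for a in soup.find_all("a", class_="product-link")]

    def get_product(url):
        # fetch one product page and return its details as a dict
        r = requests.get(url)
        soup = BeautifulSoup(r.text, "html.parser")
        return {"name": soup.find("h1").text.strip(), "url": url}

    def main():
        products = []
        for page in range(1, 4):
            for link in get_links(page):
                products.append(get_product(link))
        print(products)

    if __name__ == "__main__":
        main()

Splitting it up like this makes each piece easy to test on its own, even if a short one-off script is fine as one block.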
Hi John, thank you for these amazing videos. My question is: how can I deal with variable elements such as images and save them in one cell in my CSV file, like this: Images: img1,img2,img3
Hi - yes, however you can't separate them with a "," - it would need to be something else if it's going to stay a CSV file. If you loop through all the images and concatenate the names together, e.g. images = img1 + "-" + img2, that would work
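A quick sketch of that idea, assuming the image names are already collected in a list:

    # e.g. gathered from something like soup.find_all('img')
    image_names = ["img1.jpg", "img2.jpg", "img3.jpg"]

    # join with a separator that won't clash with the CSV delimiter
    images_cell = "-".join(image_names)
    row = {"name": "some product", "images": images_cell}
    print(row)  # {'name': 'some product', 'images': 'img1.jpg-img2.jpg-img3.jpg'}

join() handles any number of images, so it copes with products that have one image or ten.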
Fantastic video - not only is it easy to follow along, but the explanations afford a genuine learning opportunity, rather than just a simple copy and paste. As someone new to Python, a big thanks is in order!
Hello, thank you very much for the video. Is there a way to talk to you directly to see if you can help me with an extraction that is proving impossible for me? I don't understand why it doesn't work :S
When we send a GET request we include headers with it to identify ourselves, plus other information. It's those headers we are adding the user agent to in this case. You would also include headers when you POST - they go along with the request
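A minimal sketch of what that looks like with requests (the User-Agent string and URLs are just examples):

    import requests

    # a browser-style user agent, sent alongside the rest of the headers
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

    # headers go along with a GET...
    r = requests.get("https://example.com", headers=headers)

    # ...and with a POST in exactly the same way
    r = requests.post("https://example.com/login", headers=headers, data={"q": "test"})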
What a great training video. Thank you John. You are a great instructor: you explain well, you're easy to follow and clear, and you use a real-life example (real-life challenges one would come across). Lots of aha moments on things I had been struggling with, including how to keep your program running when an element is not present (using try/except) rather than having it come to a complete stop. How easy was that? It only took me 6+ days of searching.
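For anyone who lands here searching for the same thing, the pattern looks roughly like this (the tag, class, and sample HTML are invented for illustration):

    from bs4 import BeautifulSoup

    soup = BeautifulSoup("<h1>Product</h1>", "html.parser")

    # if the element is missing, .find() returns None and .text raises AttributeError
    try:
        rating = soup.find("span", class_="rating").text.strip()
    except AttributeError:
        rating = None  # keep going instead of crashing

    print(rating)  # None here, since the sample HTML has no rating element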
I have seen a lot of videos related to scraping. One more thing I want to learn is how to scrape the Reuters website using proxies - how to extract the headline, date, and paragraphs for each article. There is also a "load more articles" button used to reach the next articles; I also want to learn how to handle that. Could you make a separate video on it, please?
Thank you, it was so useful. I have a question: I want to crawl product data and at the same time get the product description, which is behind a link on another page. How can we crawl the product description when it is on another link?
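The usual pattern (URL and selectors here are placeholders) is to collect the links from the listing page first, then request each one and parse the description there:

    import requests
    from bs4 import BeautifulSoup

    listing = requests.get("https://example.com/products")
    soup = BeautifulSoup(listing.text, "html.parser")

    for a in soup.find_all("a", class_="product-link"):
        product_url = a["href"]  # join with the base URL if the links are relative
        # follow each link and parse the description from the product page
        detail = BeautifulSoup(requests.get(product_url).text, "html.parser")
        description = detail.find("div", class_="description").text.strip()
        print(product_url, description)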
I get this really weird bug where it thinks I have set my page to 60 products per page. I know you said you had the same problem but you then set it to 20. For some reason, no matter what I change it to, it's still stuck at 60.
My URL doesn't change with any of the actions - the page with the data and the page I see when first going to the website have the same URL. I fail at the first request and don't know where to go from there. Any suggestions?
Nice video, really helped! Thing is, I got this message after trying to scrape a website: "Pardon Our Interruption. As you were browsing, something about your browser made us think you were a bot. There are a few reasons this might happen: You've disabled JavaScript in your web browser. You're a power user moving through this website with super-human speed. You've disabled cookies in your web browser. A third-party browser plugin, such as Ghostery or NoScript, is preventing JavaScript from running..." It works if you haven't made any requests for a while, but if you make too many it starts giving this error again... Does anyone know how to solve this? Really appreciate it!
Finally, a good tutorial about store data scraping. Your content is better than paid Udemy content on Python scraping. Could you make a video on scraping news articles/blogs in bulk, full article?
@@JohnWatsonRooney Btw, how do I save it automatically into well-structured data (*.txt or *.csv) without the data looking like this: { 'name': 'hsjjdhdkdkhdkdk', 'rating': 'gorbhdjdjdjjdjdj', 'url': 'hskskfldldjdnddk' ....} ?
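If each scraped item is a dict like that, one minimal option is csv.DictWriter (the field names are assumed from the example above):

    import csv

    # assume each scraped product is a dict like in the comment above
    products = [
        {"name": "item one", "rating": "4.5", "url": "https://example.com/1"},
        {"name": "item two", "rating": "3.0", "url": "https://example.com/2"},
    ]

    with open("products.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "rating", "url"])
        writer.writeheader()        # column headers as the first row
        writer.writerows(products)  # one row per product dict

This gives you a proper spreadsheet-style file instead of printed dicts.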
John, your video was fantastic! I appreciate the clear explanation, but I'm curious: will your approach work for any website? Looking forward to your insights!
Hello John, I am watching your videos with great interest - many thanks for them. I decided to follow this playlist, but it looks like the two-year-old method no longer works. So I decided to use requests_html to render the pages; it works very slowly. Then I tried Playwright. That looks better, but there is a pop-up asking you to accept cookies, and after several pages it stops working. Then, thanks to your video on hidden APIs, I found the API - and it works, but I get banned after some pages: first it returns a 524 reply, and after that just a mess of symbols. I can share my code with you; maybe I made some mistakes. Or, if you still find it interesting, could you check how this site might be scraped and make an update? That would be great.
How can I fix this problem? I'm new here. name = soup.find('h1', class_='product_title').text.strip() AttributeError: 'NoneType' object has no attribute 'strip'
That means that nothing is being found. Try removing the .text and .strip and see what happens - also double-check the HTML to make sure those are the correct tag and class for the element you are looking for
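In code, that debugging step looks roughly like this (the sample HTML is invented):

    from bs4 import BeautifulSoup

    soup = BeautifulSoup("<h1 class='other_title'>Widget</h1>", "html.parser")

    # .find() returns None when nothing matches, so check before calling .text
    element = soup.find("h1", class_="product_title")
    if element is None:
        print("element not found - check the tag and class in the page's HTML")
    else:
        print(element.text.strip())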
@@JohnWatsonRooney Thanks for the response bro, I already fixed the problem - I just replaced productlinks.append(baseurl + link['href']) with productlinks.append(link['href'])
This is what I was looking for. Most YouTubers just made a video about how to scrape the first page but didn't show how to fetch the data for each product and then do pagination. Now it's very clear to me. I am getting one error - "'NoneType' object has no attribute 'text'" - after 30 or 40 iterations. I wonder what that means? I tried checking solutions on Stack Overflow but the code shows the same error. And yes, this is very useful for beginners to intermediates. Keep making such videos - I have subscribed to your channel.
Thanks for your kind words! It sounds like maybe you reached the end and got all the pages - see what happens when you go to the last page in the browser
@@JohnWatsonRooney Yes, I have got all the pages. I have one more question to ask. I am putting the link below. I am trying to extract the company details, such as name, telephone number, etc. The HTML shows it is within a list tag, and within each list tag there is a span tag with an itemprop attribute. I am trying to select the spans by itemprop but I am not getting the result that I want. idn.bizdirlib.com/node/5290
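Selecting by itemprop should work via BeautifulSoup's attrs argument - a minimal sketch with made-up HTML:

    from bs4 import BeautifulSoup

    html = "<li><span itemprop='name'>Acme Ltd</span><span itemprop='telephone'>12345</span></li>"
    soup = BeautifulSoup(html, "html.parser")

    # match spans by their itemprop attribute; True matches any itemprop value
    for span in soup.find_all("span", attrs={"itemprop": True}):
        print(span["itemprop"], span.text)

You can also pass a specific value, e.g. attrs={"itemprop": "telephone"}, to grab just one field.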
Looks like I get rate limited (a 403 Forbidden) in the second loop, after looping over each link to get the name, reviews, and price. It got me all the links, but I get a Forbidden when looping over each link to get the data. Any tips?
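Not from the video, just a general tip that often helps with 403s inside a loop: slow the requests down and send a browser-like User-Agent (the header value and links below are placeholders):

    import time
    import requests

    headers = {"User-Agent": "Mozilla/5.0"}  # example value, any browser UA string works

    links = ["https://example.com/p/1", "https://example.com/p/2"]  # placeholder links
    for link in links:
        r = requests.get(link, headers=headers)
        print(link, r.status_code)
        time.sleep(2)  # pause between requests to stay under the site's rate limit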
This is an amazing tutorial, but this line - r = requests.get(testlink, headers=headers) - gives me a 403 status code. Why? Can you tell me?! I put in the latest versions of the headers.
Hey dude, just started my first internship, and this video has been immensely helpful! I really appreciate the effort put in and all the useful tips. Thanks!
Thanks for this brilliant video. Actually, I have followed the steps and everything was fine, but when creating the df I got this error: InvalidSchema: No connection adapters were found. Would you please assist 😇