Easily the best video on web scraping in Python I've ever seen. Only 20 minutes, but it has more content than many 1hr+ tutorials. You've also explained many useful cases (e.g. what to do if an element is missing). Thank you!
This is excellent content. I've been browsing for hours looking for a clear and detailed explanation and was lucky enough to find your video. And only 20 mins long! Thank you for sharing!
Hi! I know this is a really late comment, but I've been trying to follow every guide on the internet and somehow it won't return any of the site's content. I only get the divs with class, body, etc., but no content inside of them - like I only get the top of the tree. I use a login and find_all. Do you know what this problem is called / how to solve it? I know it's pretty vague but I don't want to make you read a whole paragraph haha
Try saving/printing the whole response as text and check whether the data you want is actually there - if not, try the same site but use Playwright to load it with a browser instead and see if that works
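For anyone hitting the same issue, a rough sketch of both checks (the URL is just a placeholder):

    import requests
    from playwright.sync_api import sync_playwright

    url = "https://example.com/products"  # placeholder

    # Step 1: check whether the data is in the raw response at all
    r = requests.get(url)
    print(r.text)  # search this output for the content you expected

    # Step 2: if it isn't, render the page with a real browser via Playwright
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        html = page.content()  # the HTML after JavaScript has run
        browser.close()
    print(html)

If the data only shows up in step 2, the site is building the page with JavaScript, which is why plain requests sees an empty tree.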
Thank you. Why do I get this error? name = soup.find('h1', class_='product-main__name').text.strip() AttributeError: 'NoneType' object has no attribute 'text' I tried "r.text", but it didn't help
A few hours later, and your tutorial's already being implemented in my everyday web scraping / data cleaning work at the office. +1 Subscribed! Edit: Do you think you'll make a video combining the stuff in this video with using Jupyter Notebook to fill a CSV/Excel file?
Thanks for the lessons. I have a website where I must input dates, then I get a list of links; I have to click on each one and scrape a JSON file. I am doing this with Selenium because I am a beginner and BeautifulSoup doesn't seem to work, but it's super slow. I think the site uses JavaScript. Is there a better way to do this than using a headless browser? Do you have a video that might help me?
Thank you, but it would have been more valuable if there were somewhere to download the code from so we could study it at our leisure. I don't know about anyone else, but I can barely read the code.
Is there any other way to scrape data from all the pages besides using the page number from the URL? The website I am trying to get data from does not generate a new URL for each page.
The requests-html and requests libraries don't work in an AWS Lambda function. When I use urllib it works! My method (with the imports it needs): from urllib.request import urlopen; from bs4 import BeautifulSoup; html = urlopen(baseUrl); soup = BeautifulSoup(html.read(), 'lxml')
Thanks John, that was a very helpful video. As an economics major, I really need to be able to gather lots of data and process it efficiently, so web scraping was just a natural thing to learn. Keep up the good work!
Great tutorial, just went through it. An excellent progression from the last one, as most of the scraping that I have wanted to do involves "digging in". I feel that I am finally learning, as I noticed the issue with the rating as we were typing it out!
The entire code does not work for me using the requests method, but when I used the WebDriver browser method it worked fine. The problem is that the page has to load every time, which takes a long time. I also used headless mode to avoid rendering the page every time, but it still doesn't work. I think the page has to load first to fetch the data, so why does your code return the data fast without loading it? When I use the requests method it returns empty, None, or []. I use exactly the same code as you, so maybe the headers are the issue? Which headers did you use?
When you program web scrapers for work, do you wrap the different parts of your code in functions and then call them from a main function, like get_Links(), get_products(), etc., or do you just leave it as a long script because it's simple enough? Also, thank you so much for your content. I'm not a STEM student, but I was able to learn enough to build my own dataset for school even though I had never programmed before. Thank you so much for taking all this time.
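For what it's worth, that function-based layout might look something like this minimal sketch (the URL and selectors are placeholders, and the function names just mirror the ones in the question):

    import requests
    from bs4 import BeautifulSoup

    BASE_URL = "https://example.com"  # placeholder site

    def get_links(page):
        # fetch one listing page and return the product links on it
        r = requests.get(f"{BASE_URL}/products?page={page}")
        soup = BeautifulSoup(r.text, "html.parser")
        return [a["href"] for a in soup.find_all("a", class_="product-link")]

    def get_product(url):
        # fetch one product page and return its details as a dict
        r = requests.get(url)
        soup = BeautifulSoup(r.text, "html.parser")
        return {"name": soup.find("h1").text.strip(), "url": url}

    def main():
        products = []
        for page in range(1, 4):
            for link in get_links(page):
                products.append(get_product(link))
        print(products)

    if __name__ == "__main__":
        main()

Splitting it up like this makes each piece easy to test on its own, even if a short one-off script is fine as one block.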
Hi John, thank you for these amazing videos. My question is: how can I deal with variable elements such as images and save them in one cell in my CSV file, like this: Images: img1,img2,img3
Hi - yes, however you can't separate them with a "," - it would need to be something else if it's going to stay a CSV file. If you loop through all the images and concatenate the names together, e.g. images = img1 + "-" + img2, that would work
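A quick sketch of that idea, assuming the image names are already collected in a list:

    # e.g. gathered from something like soup.find_all('img')
    image_names = ["img1.jpg", "img2.jpg", "img3.jpg"]

    # join with a separator that won't clash with the CSV delimiter
    images_cell = "-".join(image_names)
    row = {"name": "some product", "images": images_cell}
    print(row)  # {'name': 'some product', 'images': 'img1.jpg-img2.jpg-img3.jpg'}

join() handles any number of images, so it copes with products that have one image or ten.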
Fantastic video - not only is it easy to follow along, but the explanations afford a genuine learning opportunity, rather than just a simple copy and paste. As someone new to Python, a big thanks is in order!
Hello, thank you very much for the video. Is there a way to talk to you directly to see if you can help me with an extraction that is proving impossible for me? I don't understand why it doesn't work :S
When we send a GET request we include headers with it to identify ourselves, plus other information. It's those headers we are adding the user agent to in this case. You would also include headers when you POST - they go along with the request
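A minimal sketch of what that looks like with requests (the User-Agent string and URLs are just examples):

    import requests

    # a browser-style user agent, sent alongside the rest of the headers
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

    # headers go along with a GET...
    r = requests.get("https://example.com", headers=headers)

    # ...and with a POST in exactly the same way
    r = requests.post("https://example.com/login", headers=headers, data={"q": "test"})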
What a great training video. Thank you John. You are a great instructor: you explain well, you're easy to follow and clear, and you use a real-life example (real-life challenges one would come across). Lots of aha moments on things I had been struggling with, including how to keep your program running when an element is not present (using try/except) rather than having it come to a complete stop. How easy was that? It only took me 6+ days of searching.
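For anyone who lands here searching for the same thing, the pattern looks roughly like this (the tag, class, and sample HTML are invented for illustration):

    from bs4 import BeautifulSoup

    soup = BeautifulSoup("<h1>Product</h1>", "html.parser")

    # if the element is missing, .find() returns None and .text raises AttributeError
    try:
        rating = soup.find("span", class_="rating").text.strip()
    except AttributeError:
        rating = None  # keep going instead of crashing

    print(rating)  # None here, since the sample HTML has no rating element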
I have seen a lot of videos related to scraping. One more thing I want to learn is how to scrape the Reuters website using proxies - how to extract the headline, date, and paragraphs for each article. There is also a "load more articles" button used to reach the next articles; I also want to learn how to handle that. Could you make a separate video on it, please?
Thank you, it was so useful. I have a question: I want to crawl product data and at the same time get the product description, which is behind a link on another page. How can we crawl the product description when it is on another link?
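The usual pattern (URL and selectors here are placeholders) is to collect the links from the listing page first, then request each one and parse the description there:

    import requests
    from bs4 import BeautifulSoup

    listing = requests.get("https://example.com/products")
    soup = BeautifulSoup(listing.text, "html.parser")

    for a in soup.find_all("a", class_="product-link"):
        product_url = a["href"]  # join with the base URL if the links are relative
        # follow each link and parse the description from the product page
        detail = BeautifulSoup(requests.get(product_url).text, "html.parser")
        description = detail.find("div", class_="description").text.strip()
        print(product_url, description)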
I get this really weird bug where it thinks I have set my page to 60 products per page. I know you said you had the same problem but you then set it to 20. For some reason, no matter what I change it to, it's still stuck at 60.
My URL doesn't change with any of the actions - the page with the data and the page I see when first going to the website have the same URL. I fail at the first request and don't know where to go from there. Any suggestions?
Nice video, really helped! Thing is, I got this message after trying to scrape a website: "Pardon Our Interruption. As you were browsing, something about your browser made us think you were a bot. There are a few reasons this might happen: You've disabled JavaScript in your web browser. You're a power user moving through this website with super-human speed. You've disabled cookies in your web browser. A third-party browser plugin, such as Ghostery or NoScript, is preventing JavaScript from running..." It works if you haven't made any requests for a while, but if you make too many it starts giving this error again... Does anyone know how to solve this? Really appreciate it!
Finally, a good tutorial about store data scraping. Your content is better than paid Udemy content on Python scraping. Could you make a video on scraping news articles/blogs in bulk, full article?
@@JohnWatsonRooney Btw, how do I save it automatically into well-structured data (*.txt or *.csv) without the data looking like this: { 'name': 'hsjjdhdkdkhdkdk', 'rating': 'gorbhdjdjdjjdjdj', 'url': 'hskskfldldjdnddk' ....} ?
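If each scraped item is a dict like that, one minimal option is csv.DictWriter (the field names are assumed from the example above):

    import csv

    # assume each scraped product is a dict like in the comment above
    products = [
        {"name": "item one", "rating": "4.5", "url": "https://example.com/1"},
        {"name": "item two", "rating": "3.0", "url": "https://example.com/2"},
    ]

    with open("products.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "rating", "url"])
        writer.writeheader()        # column headers as the first row
        writer.writerows(products)  # one row per product dict

This gives you a proper spreadsheet-style file instead of printed dicts.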
John, your video was fantastic! I appreciate the clear explanation, but I'm curious: will your approach work for any website? Looking forward to your insights!
Hello John, I am watching your videos with great interest - many thanks for them. I decided to follow this playlist, but it looks like the two-year-old method no longer works. So I decided to use requests_html to render the pages; it works very slowly. Then I tried Playwright. That looks better, but there is a pop-up asking you to accept cookies, and after several pages it stops working. Then, thanks to your video on hidden APIs, I found the API - and it works, but I get banned after some pages: first it returns a 524 reply, and after that just a mess of symbols. I can share my code with you; maybe I made some mistakes. Or, if you still find it interesting, could you check how this site might be scraped and make an update? That would be great.
How can I fix this problem? I'm new here. name = soup.find('h1', class_='product_title').text.strip() AttributeError: 'NoneType' object has no attribute 'strip'
That means that nothing is being found. Try removing the .text and .strip and see what happens - also double-check the HTML to make sure those are the correct tag and class for the element you are looking for
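In code, that debugging step looks roughly like this (the sample HTML is invented):

    from bs4 import BeautifulSoup

    soup = BeautifulSoup("<h1 class='other_title'>Widget</h1>", "html.parser")

    # .find() returns None when nothing matches, so check before calling .text
    element = soup.find("h1", class_="product_title")
    if element is None:
        print("element not found - check the tag and class in the page's HTML")
    else:
        print(element.text.strip())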
@@JohnWatsonRooney Thanks for the response bro, I already fixed the problem - I just replaced productlinks.append(baseurl + link['href']) with productlinks.append(link['href'])
This is what I was looking for. Most YouTubers just made a video about how to scrape the first page but didn't show how to fetch the data for each product and then do pagination. Now it's very clear to me. I am getting one error - "'NoneType' object has no attribute 'text'" - after 30 or 40 iterations. I wonder what that means? I tried checking solutions on Stack Overflow but the code shows the same error. And yes, this is very useful for beginners to intermediates. Keep making such videos - I have subscribed to your channel.
Thanks for your kind words! It sounds like maybe you reached the end and got all the pages - see what happens when you go to the last page in the browser
@@JohnWatsonRooney Yes, I have got all the pages. I have one more question to ask. I am putting the link below. I am trying to extract the company details, such as name, telephone number, etc. The HTML shows it is within a list tag, and within each list tag there is a span tag with an itemprop attribute. I am trying to select the spans by itemprop but I am not getting the result that I want. idn.bizdirlib.com/node/5290
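Selecting by itemprop should work via BeautifulSoup's attrs argument - a minimal sketch with made-up HTML:

    from bs4 import BeautifulSoup

    html = "<li><span itemprop='name'>Acme Ltd</span><span itemprop='telephone'>12345</span></li>"
    soup = BeautifulSoup(html, "html.parser")

    # match spans by their itemprop attribute; True matches any itemprop value
    for span in soup.find_all("span", attrs={"itemprop": True}):
        print(span["itemprop"], span.text)

You can also pass a specific value, e.g. attrs={"itemprop": "telephone"}, to grab just one field.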
Looks like I get rate limited (a 403 Forbidden) in the second loop, after looping over each link to get the name, reviews, and price. It got me all the links, but I get a Forbidden when looping over each link to get the data. Any tips?
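Not from the video, just a general tip that often helps with 403s inside a loop: slow the requests down and send a browser-like User-Agent (the header value and links below are placeholders):

    import time
    import requests

    headers = {"User-Agent": "Mozilla/5.0"}  # example value, any browser UA string works

    links = ["https://example.com/p/1", "https://example.com/p/2"]  # placeholder links
    for link in links:
        r = requests.get(link, headers=headers)
        print(link, r.status_code)
        time.sleep(2)  # pause between requests to stay under the site's rate limit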
This is an amazing tutorial, but this line - r = requests.get(testlink, headers=headers) - gives me a 403 status code. Why? Can you tell me?! I put in the latest versions of the headers.
Hey dude, just started my first internship, and this video has been immensely helpful! I really appreciate the effort put in and all the useful tips. Thanks!
Thanks for this brilliant video. Actually, I have followed the steps and everything was fine, but when creating the df I got this error: InvalidSchema: No connection adapters were found. Would you please assist 😇