Great video. I think three things are REALLY worth noting, because the answer to the failures isn't necessarily that the proxies were blocked:
- Sometimes the proxies couldn't be connected to in the first place.
- Sometimes the server can't handle so many back-to-back requests.
- Some proxies are set up by malicious actors to gain unauthorized access to connecting hosts.
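A minimal sketch of telling the first two failure modes apart when testing a proxy, assuming the httpbin.org/ip test endpoint and a placeholder proxy address:

    import requests

    proxy = "203.0.113.5:8080"  # hypothetical proxy, replace with one from your list
    proxies = {"http": f"http://{proxy}", "https": f"http://{proxy}"}

    try:
        r = requests.get("https://httpbin.org/ip", proxies=proxies, timeout=5)
        r.raise_for_status()
        print("working:", r.json())
    except requests.exceptions.ProxyError:
        print("couldn't connect to the proxy at all")
    except requests.exceptions.Timeout:
        print("proxy or target server too slow / overloaded")
    except requests.exceptions.RequestException as e:
        print("other failure:", e)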
I've been binging your content because of an Amazon scraper I'm working on. Can't help but giggle at the consistent struggle you have typing "requests". All in good fun! Keep up the great content!
@JohnWatsonRooney the only multitasking I do is copying what you type while trying to understand the concepts behind it, which you have explained exceptionally well. Thanks for being a virtual tutor, John! (Still having issues with bot detection with my bs4+requests setup, even though I tried time.sleep and randomizing the user agent with the fake-useragent library.)
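For anyone with the same issue, a rough sketch of the random-delay-plus-random-user-agent approach mentioned above, assuming the fake-useragent package and placeholder URLs:

    import random
    import time

    import requests
    from fake_useragent import UserAgent

    ua = UserAgent()
    for url in ["https://example.com/page1", "https://example.com/page2"]:  # placeholder URLs
        headers = {"User-Agent": ua.random}  # fresh random user agent per request
        r = requests.get(url, headers=headers)
        print(url, r.status_code)
        time.sleep(random.uniform(2, 6))  # randomized delay between requests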
9:28 this part changed my programming outlook to a drastic extent. The wonders you can pull off using the threaded approach are just sublime. Thanks John once again. It seems I'm running out of ways to express the gratitude you're due.
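For anyone curious, the threaded proxy-checking pattern is roughly this (a sketch, assuming a proxylist of "ip:port" strings and the httpbin.org/ip test endpoint, not the video's exact code):

    import requests
    from concurrent.futures import ThreadPoolExecutor

    proxylist = ["203.0.113.5:8080", "198.51.100.7:3128"]  # placeholder proxies

    def check(proxy):
        proxies = {"http": f"http://{proxy}", "https": f"http://{proxy}"}
        try:
            r = requests.get("https://httpbin.org/ip", proxies=proxies, timeout=5)
            print("working:", proxy, r.json())
        except requests.exceptions.RequestException:
            pass  # dead proxy, skip it

    with ThreadPoolExecutor(max_workers=20) as executor:
        executor.map(check, proxylist)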
Thanks to the man who has all the answers to my questions. Man, you have a well of wisdom when it comes to scraping, Python, JSON, and everything that matters in that field. Although I've been a pro IT guy for many, many years (PM, consultant, architect, and advisor), this field of expertise was rather unexplored for me, but following your videos made it crystal clear. Thanks again for sharing.
Beautiful content as always! I tried scraping the site from this video using bs4, as it's the only framework I know so far. I hope you make the video on scraping this site that you mentioned.
Thank you John! It gets more interesting every time you upload. By the way, could you start covering Requests+BeautifulSoup+Splash soon, especially the setup? I have a couple of questions for your next Q&A series. I'm excited!
Loved the way you explain. This is the first time I've come across your content and I enjoyed learning every second. Will this script also work with SOCKS5 proxies?
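For anyone else wondering: requests can speak SOCKS5 if the optional dependency is installed (pip install requests[socks]); you just change the scheme in the proxies dict. A minimal sketch with a placeholder proxy address:

    import requests

    proxies = {
        "http": "socks5://203.0.113.5:1080",   # hypothetical SOCKS5 proxy
        "https": "socks5://203.0.113.5:1080",
    }
    r = requests.get("https://httpbin.org/ip", proxies=proxies, timeout=5)
    print(r.json())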
Thanks! Do you know any services with rotating proxies that rotate every 60 seconds and where you can choose mixed geo or a specific country? I have mine from proxy-store, but it's my first service; I want cheap alternatives and to find out about other options.
Hey John, I am a uni student studying Data Analytics, currently doing a unit on "Data Acquisition", and your videos are far better at walking through the complexities of web scraping than the course itself! I'm doing enough web scraping now that I think it's worth looking at paid rotating residential proxies. Do you have a service you recommend, even via affiliate links? If you don't have any links, I think it would be worth seeking out such sponsorship possibilities soon.
Hey John! New subscriber here. I'm enjoying your channel very much. I have one suggestion though: in most of your videos you refer to previous ones and say you're going to post the links somewhere, but you don't. As a newcomer it's a bit difficult to find the video you're referring to, since your thumbnails and titles are generally similar. Links would help new subscribers move through your content smoothly. Cheers!
Hi John, great video, and thank you for the time and effort you put into creating these videos for us. I was wondering if you've posted the updated version of this video that you mentioned, because I couldn't find any other proxy tutorial on your channel.
Hey! Thank you for such a detailed video. Is it possible to skip CAPTCHAs by rotating working proxies on a website? Or is there a more efficient way to do it?
I think so, yes. It's important to have working proxies, but also to act like a real user as much as possible: use complete and real headers, don't send too many requests, and rotate through proxies randomly, not in order.
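A rough sketch of that idea, assuming a proxylist of "ip:port" strings and placeholder header values copied from a real browser:

    import random

    import requests

    proxylist = ["203.0.113.5:8080", "198.51.100.7:3128"]  # placeholder proxies
    headers = {
        "User-Agent": "Mozilla/5.0 ...",  # paste a complete, real header set from your browser
        "Accept-Language": "en-GB,en;q=0.9",
    }

    proxy = random.choice(proxylist)  # random, not sequential
    proxies = {"http": f"http://{proxy}", "https": f"http://{proxy}"}
    r = requests.get("https://example.com", proxies=proxies, headers=headers, timeout=5)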
Hello. Even when requests.get returns a 200 response for a URL and the proxy looks like it's working, when we load the actual website it always shows "can't access", load timeout, rendering timeout, etc. Is there any way to check whether those proxies will work normally? Thank you so much.
Unfortunately most free proxies are blocked by the main websites, so that could be your issue. You can try to find some that do work, but in my experience it can be tough.
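One way to pre-filter, sketched below: test each proxy against the actual site you want to scrape rather than a generic IP-echo endpoint (the target URL here is a placeholder):

    import requests

    def works_for_site(proxy, url="https://example.com"):  # use your real target here
        proxies = {"http": f"http://{proxy}", "https": f"http://{proxy}"}
        try:
            r = requests.get(url, proxies=proxies, timeout=5)
            return r.status_code == 200
        except requests.exceptions.RequestException:
            return False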
Hi John, all your content is very helpful, as always. I suppose this is possible: when you're scraping a site and after a few requests you get blocked or asked for a verification code, can you skip the current proxy and grab another proxy from the list? Thank you!
Hi Jonathon. Sure, that is very possible. Instead of trying to handle the error of getting blocked, I would just rotate to a new proxy for each new request. You can spread the load out that way.
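A sketch of that per-request rotation, assuming a proxylist of "ip:port" strings and placeholder URLs:

    import requests
    from itertools import cycle

    proxylist = ["203.0.113.5:8080", "198.51.100.7:3128"]  # placeholder proxies
    proxy_pool = cycle(proxylist)

    for url in ["https://example.com/page1", "https://example.com/page2"]:
        proxy = next(proxy_pool)  # fresh proxy for every request
        proxies = {"http": f"http://{proxy}", "https": f"http://{proxy}"}
        try:
            r = requests.get(url, proxies=proxies, timeout=5)
            print(url, r.status_code)
        except requests.exceptions.RequestException:
            continue  # dead proxy, move on to the next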
I had the error below and solved it by going into the documentation and using the example under proxies to set up the proxies. Maybe the requests library has changed a bit since the video. "requests.exceptions.InvalidURL: Proxy URL had no scheme"
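That error means the proxy URL needs an explicit scheme prefix, which newer versions of requests enforce. The documented dict format looks like this (placeholder address):

    import requests

    proxies = {
        "http": "http://203.0.113.5:8080",   # note the http:// scheme prefix
        "https": "http://203.0.113.5:8080",
    }
    r = requests.get("https://httpbin.org/ip", proxies=proxies, timeout=5)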
Thanks John for yet another useful video. I'm new to web scraping and have been blocked from a site I want to scrape. I'm sure there are packages out there to save the full content of a website locally so we can scrape it with no issues (and I'm not talking about big sites such as Amazon). Do you think this is possible, and if so, why is no one else talking about it? How would you go about it, please?
Hello, is there a way to make the proxies change constantly via an API? For example, you have a list of 10k proxies from numerous sources, but the proxies get updated every 5 minutes.
Not entirely sure what you mean, but if you can request a proxy list, store the proxies and use them for a few minutes, then request again and update, that would work. The easiest solution would be to download the proxy list every 5 minutes, store it in a file, and use that file to import new proxies into your scraper.
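A rough sketch of that refresh loop, assuming a hypothetical proxy-list endpoint that returns one "ip:port" per line:

    import time

    import requests

    PROXY_SOURCE = "https://example.com/proxies.txt"  # hypothetical list endpoint

    def refresh_proxies(path="proxies.txt"):
        r = requests.get(PROXY_SOURCE, timeout=10)
        with open(path, "w") as f:
            f.write(r.text)

    def load_proxies(path="proxies.txt"):
        with open(path) as f:
            return [line.strip() for line in f if line.strip()]

    while True:
        refresh_proxies()
        proxylist = load_proxies()
        # ... scrape with proxylist for a while ...
        time.sleep(300)  # re-download every 5 minutes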
Can I ask something: with this technique, can we still use a Session from requests to scrape faster? Or, when using proxies, do we have to establish a new connection with the server from scratch for every request?
Yeah, that's right. The proxy only changes your IP on each separate request, so if you are reusing a session it wouldn't work; you have to create a new connection each time.
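In practice that means calling plain requests.get with a different proxies dict each time, rather than reusing one Session object. A sketch with placeholder proxies and URLs:

    import random

    import requests

    proxylist = ["203.0.113.5:8080", "198.51.100.7:3128"]  # placeholder proxies

    for url in ["https://example.com/page1", "https://example.com/page2"]:
        proxy = random.choice(proxylist)
        proxies = {"http": f"http://{proxy}", "https": f"http://{proxy}"}
        r = requests.get(url, proxies=proxies, timeout=5)  # new connection per request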
Thanks for this video. I'm using requests_html for my scraper. Do you know what the equivalent of print(r.json()) is? I'd like to be sure the scraper is using the right proxy. Thank you!
Great video! How would I go about getting the equivalent of the r.json() response (what I want to know is which IP was used) when targeting a URL like Google, where .json() won't work?
I tried your code; the problem is at 2:59. The response is 200, but when I use print(r.json()) there's an error, so it goes to except. Without .json() the proxy list shows as working. Please tell me why print(r) and print(r.json()) give different results.
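A likely cause, for this comment and the two above: r.json needs parentheses to actually parse the body, and r.json() raises an exception if the response body isn't JSON, even on a 200, which sends you into the except branch. For non-JSON targets like Google, you can verify the IP against a JSON echo endpoint first and then just check the status on the real target. A sketch with a placeholder proxy:

    import requests

    proxy = "203.0.113.5:8080"  # hypothetical proxy
    proxies = {"http": f"http://{proxy}", "https": f"http://{proxy}"}

    r = requests.get("https://httpbin.org/ip", proxies=proxies, timeout=5)
    print(r)           # the Response object, e.g. <Response [200]>
    print(r.json())    # the parsed JSON body, e.g. {'origin': '203.0.113.5'}

    # for targets that don't return JSON, check the status/text instead:
    g = requests.get("https://www.google.com", proxies=proxies, timeout=5)
    print(g.status_code, len(g.text))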
I'm currently scraping Facebook with Selenium for my final project. (I can't use the API for many reasons, and I can't change the source, as my project depends solely on Facebook, in case you're going to say it's not allowed.) I switch user agents, but should I use proxies too? I get blocked quite often and I'm fairly new to this.
It's really helpful, John. I just want to ask: is it possible to use OpenVPN instead? I was wondering about routing the requests through OpenVPN; I think it could be awesome.
Really informative video, but is it possible to use a proxy with any Python program or module? I mean, can I use a proxy with the smtplib module, etc.? Sir, if you have any solution or reference, please tell me.
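smtplib has no proxy option of its own, but one common workaround is the third-party PySocks package, which can route any socket through a SOCKS proxy by monkey-patching the socket module. A sketch with placeholder hosts; note this affects all new sockets in the process:

    import socket
    import smtplib

    import socks  # pip install PySocks

    socks.set_default_proxy(socks.SOCKS5, "203.0.113.5", 1080)  # hypothetical SOCKS5 proxy
    socket.socket = socks.socksocket  # route all new sockets through the proxy

    with smtplib.SMTP("smtp.example.com", 587) as server:  # placeholder SMTP host
        server.ehlo()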
Great video. I tried a proxy available online and it came back with a 200 status code, but when I then try to print the text (page.text), I get a NoneType object. Can you help me understand why this would be the case?
Thanks, that was awesome! Do you suggest any method for searching around 20,000 words a day on Google and getting the results, without getting blocked?
I know this video is specifically about requests, but can this be done using normal Selenium? I know the HOST:PORT proxy configuration for Selenium works, but can Selenium proxies be configured using an authenticated proxy configuration (USER:PASSWORD@PROXY:PORT)? From my research, questions on the internet, and support tickets with ChromeDriver and Selenium, it sounds like this isn't possible.
I got this error with every proxy I use: "requests.exceptions.ProxyError: HTTPSConnectionPool Max retries exceeded with url: /ip (Caused by ProxyError('Cannot connect to proxy.', ConnectionResetError(54, 'Connection reset by peer')))". Please help me.
Thank you very much for this amazing tutorial. As for the proxy-scraping code: I tried exporting the proxylist to a CSV file and that works, but I noticed the value 0 on the first row (I recognize this is the column name of the pandas DataFrame). I tried searching for how to get rid of it, but no luck. How can I remove the column name from the DataFrame output?

    df = pd.DataFrame(proxylist)
    df.to_csv('Table.csv', encoding='utf-8', index=False)

This works for the row index, not for the column name.
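For anyone hitting the same thing: that leading 0 is the default column name, and to_csv takes a header argument to suppress it (or you can name the column when building the DataFrame):

    import pandas as pd

    proxylist = ["203.0.113.5:8080"]  # placeholder

    df = pd.DataFrame(proxylist)
    df.to_csv('Table.csv', encoding='utf-8', index=False, header=False)  # no index, no 0 header

    # or give the column a meaningful name instead:
    pd.DataFrame(proxylist, columns=['proxy']).to_csv('proxies.csv', index=False)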
Hello John, I am trying to implement this code with a CSV file of proxies that already work. The code runs without any errors but doesn't give me any output whatsoever. I believe my issue originates in the extract function, and I was hoping you could lend me a hand if possible. I am looking forward to the sequel to this video that you said you would make, so I can understand it further. Thank you.
It's possible; the downside, as far as I am aware, is that you need to close the browser and start a new one each time you rotate the proxy. That adds a lot of time.
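On the USER:PASSWORD@PROXY:PORT question above: plain Selenium/ChromeDriver doesn't accept credentials in the proxy flag, but the third-party selenium-wire package can. A sketch with placeholder credentials and proxy address:

    from seleniumwire import webdriver  # pip install selenium-wire

    sw_options = {
        "proxy": {
            "http": "http://user:pass@203.0.113.5:8080",   # hypothetical authenticated proxy
            "https": "http://user:pass@203.0.113.5:8080",
        }
    }
    driver = webdriver.Chrome(seleniumwire_options=sw_options)
    driver.get("https://httpbin.org/ip")
    print(driver.page_source)
    driver.quit()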
I have tried so many proxies and haven't found a working one. What's the best approach to getting a working proxy? Another question: I tried using my own IP address and port as a proxy, but that failed too!
The free ones never really seem to work! Unfortunately I believe you need to use a paid service. It's something I want to check out in the future, but I haven't used one yet.