Recursion ➰ for Paginated Web Scraping 

DevTips
353K subscribers
31K views

Sponsored by Brilliant - thanks! Be one of the first 200 people to sign up with this link and get 20% off your annual subscription to Brilliant.org!
brilliant.org/DevTips/
We figure out how to deal with the paginated search results in our web scrape. RECURSION is our tool - not as difficult as you might think!!
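For a taste of the approach, here is a minimal sketch of recursive pagination with Puppeteer. The `.compact` selector and the `page=` URL pattern are stand-ins, not the episode's exact code - see the GitHub link below for the real thing.

```js
const puppeteer = require('puppeteer');

// Scrape one page, then recurse into the next page. An empty page is
// the base case that ends the recursion.
const extractPartners = async (page, url) => {
  await page.goto(url);
  const partners = await page.evaluate(() =>
    Array.from(document.querySelectorAll('.compact'), el => el.textContent)
  );
  if (partners.length === 0) return []; // no results: stop recursing
  const nextUrl = url.replace(/page=(\d+)/, (_, n) => `page=${Number(n) + 1}`);
  return partners.concat(await extractPartners(page, nextUrl));
};

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  console.log(await extractPartners(page, 'https://example.com/partners?page=1'));
  await browser.close();
})();
```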
🗿 MILESTONES
⏯ 00:12 Fika 🍪
⏯ 13:10 Extracting the next page number with regex
⏯ 16:50 Encounter with prettier... 🌋
⏯ 18:39 ➰ Recap
⏯ 20:15 TIME FOR RECURSION 😎
⏯ 29:00 Quick Google rant 🌋
⏯ 29:23 ➰➰ Rerecap by Commenting the Code
See the previous episode where we explain Puppeteer and finding the data to scrape
▶️ Web Scraping with Node...
The code used in this video is on GitHub
🗒 github.com/chhib/scraper/tree...
Puppeteer - Headless Chrome browser for scraping (instead of PhantomJS)
🔪 github.com/GoogleChrome/puppe...
The editor is called Visual Studio Code and is free. Look for the Live Share extension to share your environment with friends.
💻 code.visualstudio.com/
DevTips is a weekly show for YOU who want to be inspired 👍 and learn 🖖 about programming. Hosted by David and MPJ - two notorious bug generators 💖 and teachers 🤗. Exploring code together and learning programming along the way - yay!
DevTips has a sister channel called Fun Fun Function, check it out!
❤️ / funfunfunction
#recursion #webscraping #nodejs

Published: Oct 11, 2018

Comments: 86
@justvashu · 5 years ago
I would have used the "next" button in the navigation and followed its href to get the next page until there are no more next pages
@OfficialDevTips · 5 years ago
Great idea!
@PavelIsp · 4 years ago
@OfficialDevTips Yeah, that would make sense, since a lot of sites have a different strategy for pagination :)
@aquibkadri4984 · 4 years ago
@justvashu How do you get the total count?
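A sketch of that "follow the next button" approach - the `a.next` selector and starting URL are assumptions, not any real site's markup:

```js
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  let url = 'https://example.com/partners?page=1';
  const partners = [];
  while (url) {
    await page.goto(url);
    partners.push(...await page.evaluate(() =>
      Array.from(document.querySelectorAll('.compact'), el => el.textContent)
    ));
    // href of the "next" button, or null when there is no next page
    url = await page.evaluate(() => {
      const next = document.querySelector('a.next');
      return next ? next.href : null;
    });
  }
  console.log(partners);
  await browser.close();
})();
```

This also answers the total-count question: you never need the count - you just stop when the next link disappears.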
@simoneicardi3967 · 5 years ago
...and you just answered my question on the previous video! Thanks! I enjoyed these two web scraping episodes so much.
@naansequitur · 5 years ago
These two web scraping vids are awesome! Would love to see one on building a crawler 🕸
@hafidooz8034 · 4 years ago
What's the difference between scraping and crawling?
@pacopepe92 · 4 years ago
@internet4543 Did you do something like that? I'm trying to do that.
@Paltibenlaish · 3 years ago
@hafidooz8034 I think a crawler doesn't get the content, it just hits URLs - not sure.
@yjk22 · 5 years ago
Awesome lesson, really practical
@gmjitendra · 4 years ago
Thank you so much, David, for this amazing scraping video.
@kasio99 · 5 years ago
Love this video - learned so much and the guys are entertaining to listen to. Thanks
@TheUKFishingGuy · 4 years ago
Awesome stuff!
@AbhishekKumar-mq1tt · 5 years ago
Thank u for this awesome video
@drewlomax7837 · 5 years ago
David, great video. As for that h1 tag... they have a history of funny h1 tags on these landing pages. A little over a year ago, before the "360" rebranding changed their marketing site, I was looking at how they formatted their markup for SEO on one of their product pages. I noticed that the h1 tag was in the markup and said, for example, "Google Tag Manager...", but it was not visible to the user. If I remember correctly, on desktop the h1 tag had display:none attached to it. Then, once the hamburger menu breakpoint was crossed, it was still display:none until you opened the menu, at which point display:none was removed and the h1 tag was wrapped around an img element with an image of the stylized "Google Tag Manager..." The actual text "Google Tag Manager..." in the h1 tag was hidden with CSS and probably used as a fallback. After some research on Matt Cutts' blog, I found out that this is semi-okay to do.
@charlyecastro · 5 years ago
Great Vid! You guys should go over Docker next
@amanpreet-dev · 4 years ago
Nice tutorial and very well explained
@avecho123 · 3 years ago
Great tutorial, thank you so much for sharing! I am wondering how to design the function so it stops when a certain number of found products is reached (e.g. when 50 total partners are found, stop the recursion and proceed to other parts of the code)?
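One way to do that, as a sketch (the `.compact` selector and `page=` URL pattern are assumptions): thread the running total through the recursion and stop once the cap is reached.

```js
const MAX_PARTNERS = 50;

// Stop recursing once `collected` reaches the cap, even if more pages exist.
const extractPartners = async (page, url, collected = []) => {
  if (collected.length >= MAX_PARTNERS) return collected.slice(0, MAX_PARTNERS);
  await page.goto(url);
  const partners = await page.evaluate(() =>
    Array.from(document.querySelectorAll('.compact'), el => el.textContent)
  );
  if (partners.length === 0) return collected; // ran out of pages first
  const nextUrl = url.replace(/page=(\d+)/, (_, n) => `page=${Number(n) + 1}`);
  return extractPartners(page, nextUrl, collected.concat(partners));
};
```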
@Paltibenlaish · 3 years ago
This is amazing, thank you so much xxxx
@kryptic100k3 · 3 years ago
Amazing thanks.
@jolyonfavreau3160 · 4 years ago
Thank you!!! Excellent video that really helped when trying to figure out Puppeteer, and moreover recursion! I did find that the count in the recursion didn't like numbers over 9, so I added these two lines to account for pagination numbers of any length: ``` const digit = currentPageNumber.toString().length; const newStreet = street.slice(0, -digit); ``` Thanks again for a well-timed video that saved the day :)
@codenikninja9814 · 5 years ago
Subscribed!
@caiolins2495 · 4 years ago
Thanks!
@spoooget · 5 years ago
I'm impressed that you didn't get an error saying 'browser is not defined'!
@OfficialDevTips · 5 years ago
You mean because it is used at the beginning of the function? The function is not run until it is called. At *const partners = extractPartners(firstUrl)* - that's when we need browser to be defined, and it has been, just above. The code is not run from top to bottom!
@spoooget · 5 years ago
Yeah! My thought was that your extractPartners would need to know what browser is as it is evaluated. But I'm happy to be proven wrong - it's a great way to learn. I really enjoyed this web scraping series. Hope the fika was tasty ;)
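A tiny illustration of the point above (a sketch; the URL is a placeholder): defining the function runs nothing inside it, so `browser` only has to exist by the time the function is called.

```js
const puppeteer = require('puppeteer');

let browser; // declared here, assigned later

// Nothing in this body executes at definition time; `browser` is
// looked up only when extractPartners is actually invoked.
const extractPartners = async (url) => {
  const page = await browser.newPage();
  await page.goto(url);
  return page.title(); // stand-in for the real extraction
};

(async () => {
  browser = await puppeteer.launch();                        // assigned first...
  console.log(await extractPartners('https://example.com')); // ...then called
  await browser.close();
})();
```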
@Soundtech98 · 5 years ago
David, please bring back the music when you timelapse :) Interested to see where this project is going. Keep it up - always looking forward to the next episode of this series.
@OfficialDevTips · 5 years ago
Yeah, cool! I'll try doing that more - it just takes time, so I try to get something out even when I don't have time to add the finishing touches.
@trendYou · 4 years ago
Thanks! Hmm, silly-questions section here: the first rule of scraping is "be nice", don't overload servers, etc. Wouldn't it be nicer if we first copied all the result pages and scraped them locally? What's the general approach?
@alexzanderflores4185 · 5 years ago
Why all the regex stuff over just passing the page number as an argument and creating the URL in the method?
@OfficialDevTips · 5 years ago
Why call a variable X instead of Y? It is just one way of solving it; there are thousands of ways. Here we were lucky the pattern was so simple - for the next site it may not be. Sure, it could be done differently, and using regex for this exact example was perhaps slightly overengineered. In programming there is never only one way of solving something. I like to not stuff parameters into the function; I think it looks neat and makes it simple to understand what's going on when browsing through the code. By passing the URL it is easy to scan and understand: "aha, the function will use that URL and get partners out of it".
@arzievsultan · 5 years ago
I would say "aha, next page" but not "aha, next URL". You took a worse solution and now you have to explain why it is good.
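For comparison, the page-number alternative the commenters describe might look like this sketch (the base URL and selector are assumptions):

```js
// Build the URL from a page number inside the function instead of
// extracting the number from the previous URL with a regex.
const BASE_URL = 'https://example.com/partners?page='; // hypothetical

const extractPartners = async (page, pageNumber = 1) => {
  await page.goto(`${BASE_URL}${pageNumber}`);
  const partners = await page.evaluate(() =>
    Array.from(document.querySelectorAll('.compact'), el => el.textContent)
  );
  if (partners.length === 0) return []; // base case
  return partners.concat(await extractPartners(page, pageNumber + 1));
};
```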
@Laek4 · 5 years ago
While the cat's away the fika comes out to play.
@congthanhinh7987 · 4 years ago
thanks :like
@kainarilyasov4644 · 5 years ago
Did the same thing with another website. Everything is the same, but sometimes it returns an empty array [ ], and sometimes it scrapes only 10 pages even though there are 14. Why is that? I am so tired.
@sridharnetha6003 · 4 years ago
Hi David! My pagination URL has no page parameter. Is there any way to scrape the AJAX response? The required content is loaded client-side via AJAX.
@jordihoven2349 · 3 years ago
Would love to see how you would save the data into a JSON or .txt file, or even Firebase
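That isn't covered in the episode, but a minimal JSON save in Node could look like this sketch (assuming `partners` is the array the scraper returned):

```js
const fs = require('fs');

// Stand-in for the real scraped array.
const partners = ['Partner A', 'Partner B'];

// Pretty-printed JSON written next to the script.
fs.writeFileSync('partners.json', JSON.stringify(partners, null, 2));
```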
@AntonKristensen · 5 years ago
You could use the HTTP response status code to stop the recursion... You could probably also create another Puppeteer instance that runs in parallel to check whether there is a next page, instead of using the same instance - that would perhaps double the speed.
@OfficialDevTips · 5 years ago
Regarding the first suggestion: the site still returns 200, so that won't work in this particular case. If we were to do this on thousands of pages and multiple sites - yes, that's a cool idea. At this stage, though, I think that is a bit too much overoptimization.
@Djzaamir · 5 years ago
Hi David, I guess it'd be more interesting and catchy if you added some sound effects to the intro ;)
@JohnnyMylot · 5 years ago
Hello everyone, I think it does not work anymore. The "Compact" class is no longer there. How do you fix that? I tried with "Landscape" and it returns an empty array either way.
@gauravthakur3085 · 5 years ago
Which text editor are you using?
@DrempDK · 5 years ago
It's Visual Studio Code.
@pjmclenon · 3 years ago
Hello, I have other basic Python web scrape code that saves to a CSV file - what code would I add here so we can save to a CSV file, please? Lisa, and thank you
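This series is in Node rather than Python, but for reference a minimal CSV write in Node might look like this sketch (assuming `partners` is an array of strings):

```js
const fs = require('fs');

const partners = ['Partner A', 'Partner, B']; // stand-in data

// Quote each value so commas inside names survive the CSV format.
const rows = partners.map(p => `"${p.replace(/"/g, '""')}"`);
fs.writeFileSync('partners.csv', ['partner', ...rows].join('\n'));
```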
@constantinyt4845 · 4 years ago
I think you could simply use a while loop until the function returns an empty array
@tanvorn9323 · 4 years ago
This is what a while loop is for
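Sketched, the iterative version the two comments above suggest (selector and URL pattern hypothetical):

```js
// Loop instead of recursing: fetch pages until one comes back empty.
const extractAllPartners = async (page, baseUrl) => {
  const all = [];
  for (let pageNumber = 1; ; pageNumber++) {
    await page.goto(`${baseUrl}${pageNumber}`);
    const partners = await page.evaluate(() =>
      Array.from(document.querySelectorAll('.compact'), el => el.textContent)
    );
    if (partners.length === 0) break; // empty page: done
    all.push(...partners);
  }
  return all;
};
```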
@JBuchmann · 4 years ago
I'd like to see you deploy this (maybe to Firebase Hosting, using a Firebase Cloud Function). You'd probably run into an annoying CORS error, so I'd be interested to see how you resolve it. For me, following the CORS tips in the Firebase Cloud Functions docs doesn't seem to help with web scraping with Puppeteer. :(
@cjoshmartin · 5 years ago
Just use a set; each time, update the set with the pages that are new to the list
@OmgImAlexis · 4 years ago
I would have just probed for the 404 page, used that to learn the number of pages, and fetched them all in parallel. Not really sure why recursion would be needed here.
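That parallel idea, as a sketch (it assumes the page count really can be discovered up front; the URL pattern and selector are hypothetical):

```js
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const pageCount = 5; // assume discovered up front, e.g. by probing for a 404
  const urls = Array.from(
    { length: pageCount },
    (_, i) => `https://example.com/partners?page=${i + 1}`
  );
  // One tab per URL, all navigating concurrently.
  const results = await Promise.all(urls.map(async (url) => {
    const page = await browser.newPage();
    await page.goto(url);
    const partners = await page.evaluate(() =>
      Array.from(document.querySelectorAll('.compact'), el => el.textContent)
    );
    await page.close();
    return partners;
  }));
  console.log(results.flat());
  await browser.close();
})();
```

Note the DevTips reply above, though: this site returns 200 even for out-of-range pages, so the 404 probe would not reveal the count here.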
@djsamke384 · 5 years ago
Hi David. Can you please address the legality of web scraping? After I watched your last video with Mattias, I got really excited and did some examples of my own. However, I later found out that web scraping can carry legal consequences if done wrong. So I read the terms of use of a few websites and found that web scraping is prohibited in all of them. Can you advise us on how to use this properly, because we could go to jail out of ignorance? Otherwise, thanks for the videos.
@OfficialDevTips · 5 years ago
It depends heavily on the jurisdiction. In Sweden even personal information - like how much you earn, where you live, etc. - is public, so we are pretty used to that. I'm sure it is more restricted in other countries. As we argued in the previous video, if the content is there in the public domain, it ought to be available to anyone, server or human alike. Still, any publisher is of course allowed to do what they please. If they want to block your IP because you drain their resources or do things they suspect are not OK, they have every right to do that (I presume!).
@djsamke384 · 5 years ago
Cool. So just to be clear, breaching a company's terms of use will not result in a "cyber-crime" prison sentence or a fraud charge? So it's safe to say the worst that could happen - apart from being sued for copyright infringement if you reuse the content - is getting blocked?
@OfficialDevTips · 5 years ago
I can't give legal advice. I don't know where you're located, and it depends on the jurisdiction. But you should definitely beware of the terms of use. This is common with APIs: many allow for a lot of fun things... until you read the terms of use for the API. :( Swedish sites typically do not have terms of use (I don't know if it is implied through our constitution somehow), so Mattias and I are not very used to that even being an issue.
@djsamke384 · 5 years ago
Oh OK, cool... Thanks for the videos!!!
@4franz4 · 5 years ago
Are you from Eastern Europe?
@ksubota · 5 years ago
Sweden
@4franz4 · 5 years ago
@ksubota I like your name. The best day of the week 🙂
@djsamke384 · 5 years ago
Can you do a video on how to track a user's (maybe of your website) exact location using either IP, MAC address, or any other way, except the lame geolocation in JavaScript which requires user permission? Please don't just get the address of the server farms, which is as far as I got... Please try to get the user's exact location, like when we use Google Earth and can see the user's house or office, depending on where they are... Awesome videos, you know I'm a subscriber!!!!
@mikequinn · 5 years ago
So let me see if I have this correct. You want to be able to see a user's exact location (house/office/whatever) without them granting express permission that allows you to do so? I don't think so.
@djsamke384 · 5 years ago
Oh cool... not for perverted reasons... solely with the intention of improving user experience
@GifCoDigital · 5 years ago
lol, would you like their passwords and door keys as well?? WTF dude, there are reasons this is NOT possible.
@djsamke384 · 5 years ago
Ah, maybe you don't think on my level, brah... you can have sensitive information and not misuse it. I do have their passwords if they log into my website, don't be dumb, brah
@djsamke384 · 5 years ago
@GifCoDigital btw it is possible using the MAC address
@Mordredur · 5 years ago
Wouldn't an infinite loop work that breaks when there is a 404?
@OfficialDevTips · 5 years ago
Sure, it would work. It's the never-ending discussion of "which is best, a loop or recursion?". Many argue loops are easier to comprehend and that's why you should use them, but that has also had the effect that people rarely use recursion and do not understand it. The purpose of the video is to explain recursion through an example you can relate to, instead of the traditional Fibonacci numbers example - who uses that in the real world? JavaScript is a functional language and the recursive approach fits better, in my opinion. Haskell, also a functional language, doesn't even have loops; you have to recurse. (It is still a 200 OK response in this example. I take it you mean when reaching a page with no items on it.)
@Mordredur · 5 years ago
DevTips Thanks for the answer. An infinite-loop solution would have been a boring video :)
@OfficialDevTips · 5 years ago
Also, I expect to crawl nested structures later on, like category trees. That would be more difficult with loops.
@ronijuppi · 5 years ago
@OfficialDevTips I had to write a recursive and a non-recursive function that prints a binary tree for a tech interview. Writing it without recursion was surprisingly hard. I always just defaulted to recursion on that one.
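To illustrate the nested-structure point: a category tree crawl falls out of the recursive shape almost for free. A sketch, with hypothetical `.item` and `.subcategory a` selectors:

```js
// Recursively collect items from a category page and all of its
// subcategories, depth-first.
const crawlCategory = async (page, url) => {
  await page.goto(url);
  const { items, subcategoryUrls } = await page.evaluate(() => ({
    items: Array.from(document.querySelectorAll('.item'), el => el.textContent),
    subcategoryUrls: Array.from(document.querySelectorAll('.subcategory a'), a => a.href),
  }));
  let all = items;
  for (const subUrl of subcategoryUrls) {
    all = all.concat(await crawlCategory(page, subUrl)); // recurse into children
  }
  return all;
};
```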
@pawanpoojary2339 · 4 years ago
10:23 When your code runs on the first attempt without any error.
@stephenjames2951 · 5 years ago
Use your regex capture group to get the URL before the page number - not hard-coded
@OfficialDevTips · 5 years ago
Great suggestion!
@thiagovilla970 · 4 years ago
How so?
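One reading of the suggestion, as a small sketch: capture everything before the page number so the base URL never has to be hard-coded.

```js
// Split the URL into "everything before the number" and the number itself.
const url = 'https://example.com/partners?page=7'; // hypothetical
const [, base, pageNumber] = url.match(/^(.*page=)(\d+)$/);
const nextUrl = `${base}${Number(pageNumber) + 1}`;
console.log(nextUrl); // https://example.com/partners?page=8
```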
@logandarsee4433 · 4 years ago
I have a serious, and only slightly related, question. The truth is I am not a coder; I am renting software via ParseHub. I can use the software just fine, but the website I am scraping, despite having tens of thousands of desired results, has a page limit of 15. There is no way I can get the amount of information I need from such small scrapes. Is there any way to bypass this page limit and gain access to the totality of the actual results, as opposed to the pitiful amount I am able to see at this time?
@logandarsee4433 · 4 years ago
www.bbb.org/search?find_country=USA&find_latlng=41.308563%2C-81.051155&find_loc=Nelson%2C%20OH&find_text=Contractors&page=14&touched=9 This is the kind of result I am talking about.
@mazwrld · 5 years ago
Yo, can you guys do basic Java programs? I'm studying it in school.
@GifCoDigital · 5 years ago
lol, change schools!! Don't waste your time learning that crap. And definitely don't ask a JavaScript-dedicated channel to teach it.
@FraJaiFrey · 5 years ago
Great videos! Does anybody know a channel that is as fun as this one or Fun Fun Function, but uses Python?
@pureretro5979 · 5 years ago
Using a regular expression to pull out the page number seems a little odd to me rather than simply passing the function a base URL and the initial page number. Still, a great video.
@OfficialDevTips · 5 years ago
I'm in Regex Anonymous. I use it to make coffee.
@ConquerJS · 5 years ago
LOL
@FredoCorleone · 4 years ago
The meat of this is something like: _return extractedPartners.length ? extractedPartners.concat(extractPartners(id + 1)) : []_
@sazzadhossainnirjhor5582 · 3 years ago
What if there is no page=1 or page=2, and pagination happens with javascript:GoPage(2) or javascript:GoPage(3)? How can I scrape that? (jobs.bdjobs.com/jobsearch.asp?fcatId=1&icatId=) If anyone has any idea, please suggest it - I'm stuck here. Thanks in advance.
@mtheoryx83 · 5 years ago
Kinda just imagining their SEO/analytics folks freaking out at the web hits all over their site! Just to keep the developers on their toes, send over an IE6 user agent. (I'm kidding, don't do that.)
@JaveGeddes · 3 years ago
For someone who wants views, you're awfully careless with our time: whistling at the beginning of the video, then going straight to an ad...
@madturk7057 · 1 year ago
Haaahaaa, code is ugly