Who is FASTER? Scrapy vs GO vs ME 

John Watson Rooney
88K subscribers
7K views

Check out ScrapingBee for yourself here: www.scrapingbe...
A fun look at scraping three ways to see which is fastest: Scrapy, Go, or my own async code.
Scraper API www.scrapingbe...
Patreon: / johnwatsonrooney
Donations: www.paypal.com...
Hosting: Digital Ocean: m.do.co/c/c7c9...
Gear I use: www.amazon.co....

Published: 11 Oct 2024

Comments: 18
@Cheerfulnag, 1 year ago
So, basically, for Python: if execution time is important, async httpx; if speed of development is important, Scrapy. Cool to see such tests. I didn't know that Scrapy was so much slower.
@jeroenvermunt3372, 1 year ago
That would be true if scraping were only that. Unfortunately, in practice you have to deal with a lot more, and there Scrapy can be more or less beneficial.
@Cheerfulnag, 1 year ago
@jeroenvermunt3372 Yeah, I forgot to mention a scenario where Scrapy doesn't have what's necessary for the project :)
@Jorge86797, 1 year ago
The Scrapy scraper and the custom asyncio-based solution presented here are not logically or algorithmically equivalent, so from my point of view it is not fair to compare them directly.

At 4:01 we see that the custom asyncio-based solution has a get_total_pages method: after processing the first catalog page, the scraper sends requests to all remaining catalog pages. In the parse method of your Scrapy solution (0:43), only one request, for the next catalog page, is created per processed catalog response. That means the request for page 4 is created and sent only after catalog page 3 has been processed, page 5 only after page 4, and so on. This causes additional delay and increases the idle time of the Scrapy application, and it is logically different from your custom asyncio-based code.

However, if you implement the same approach in Scrapy and adjust its concurrency settings to "CONCURRENT_REQUESTS": 150, "CONCURRENT_REQUESTS_PER_DOMAIN": 150 and, most importantly, "LOG_LEVEL": "INFO", the resulting performance may look closer to the other options presented. In my tests, at least, the "elapsed_time" stats parameter in Scrapy dropped from 25 seconds to 11-13.

This approach is nothing unusual or unique in Scrapy. I quite frequently see it in Stack Overflow questions (scrapy tag) and on other Scrapy community channels, implemented either in start requests (with a hardcoded number of pages inside a loop) or by setting start_urls with a list comprehension (an f-string with indices from range).
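The scheduling difference described in this comment can be illustrated with a small standard-library simulation. This is only a sketch, not the code from the video: `fetch` stands in for an HTTP request by sleeping, and `LATENCY` and `TOTAL_PAGES` are made-up values chosen for illustration. It compares following one "next page" link at a time against fetching the first page and then requesting all remaining pages at once.

```python
import asyncio
import time

LATENCY = 0.02     # assumed per-request round trip in seconds (made up)
TOTAL_PAGES = 10   # assumed number of catalog pages (made up)


async def fetch(page: int) -> int:
    """Stand-in for an HTTP request: wait, then return the page number."""
    await asyncio.sleep(LATENCY)
    return page


async def chained() -> list:
    """Request page N+1 only after page N has been processed."""
    results = []
    for page in range(1, TOTAL_PAGES + 1):
        results.append(await fetch(page))
    return results


async def fan_out() -> list:
    """Fetch page 1 to learn the page count, then request the rest at once."""
    first = await fetch(1)
    rest = await asyncio.gather(
        *(fetch(page) for page in range(2, TOTAL_PAGES + 1))
    )
    return [first, *rest]


def timed(coro_fn):
    """Run a coroutine function to completion; return (result, seconds)."""
    start = time.perf_counter()
    pages = asyncio.run(coro_fn())
    return pages, time.perf_counter() - start


if __name__ == "__main__":
    for name, fn in (("chained", chained), ("fan-out", fan_out)):
        pages, seconds = timed(fn)
        print(f"{name}: {len(pages)} pages in {seconds:.3f}s")
```

With these assumed values, the chained version waits for roughly TOTAL_PAGES round trips in sequence, while the fan-out version waits for roughly two: one for the first page and one for all the others in parallel. Real scrapers add parsing time, connection limits, and server-side throttling, so the gap in practice will differ.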
@marcosolari1012, 1 year ago
Hi John, have you considered making a video on scraping and downloading images using Scrapy (if you haven't already)? I'm having a hard time in this area.
@JohnWatsonRooney, 1 year ago
That’s not something I’ve covered, actually. Last I checked, it was hard to rename them once scraped. I will have a look again and see!
@marcosolari1012, 1 year ago
@JohnWatsonRooney Thanks John, let me know if anything comes of it, please.
@Cheerfulnag, 1 year ago
Oh, htop in the background, fellow Linux user.
@DaveThomson, 1 year ago
Yeah, but it's just there for effect; no one needs to run htop if they're not actively looking at it.
@Cheerfulnag, 1 year ago
@DaveThomson And?
@davidmurphy563, 1 year ago
Title should read "who's"; not "whose".
@newvortexmind2738, 1 year ago
Hi, how can I use Scrapy to scrape an HTML response I get from using a scraper API?
@whitebullet3246, 1 year ago
You wouldn't happen to have a Discord channel or some form of community place, would you?
@JohnWatsonRooney, 1 year ago
I don’t at the moment, but I’ve been considering a Discord.
@zdinbalti, 1 year ago
How about Rust?
@JohnWatsonRooney, 1 year ago
I thought about including Rust, but I’ve only just scratched the surface of the language and didn’t think it would be good to include on that basis.
@sarimali9843, 1 year ago
Hello John, I hope you're fine. I just want to know how I can resolve Google reCAPTCHA.