
Website to Dataset in an instant 

John Watson Rooney
79K subscribers
7K views

1000 items in one API request... creating a dataset from a simple API call. I enjoyed this one; there will be a part 2 where I clean the data with Pandas.
This is a Scrapy project using the sitemap spider, saving the data to an SQLite database using a pipeline.
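The pipeline code itself isn't shown in the description, but the idea it names - a Scrapy item pipeline that writes each scraped item into SQLite - can be sketched roughly like this. The table and field names here are my assumptions for illustration, not the video's actual schema:

```python
import sqlite3

class SQLitePipeline:
    """Minimal Scrapy-style item pipeline that persists items to SQLite.

    Scrapy pipelines are plain classes: Scrapy calls open_spider(),
    process_item() per item, and close_spider() for you once the class
    is registered in ITEM_PIPELINES in settings.py.
    """

    def __init__(self, db_path="products.db"):
        self.db_path = db_path

    def open_spider(self, spider):
        # One connection for the whole crawl.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            """CREATE TABLE IF NOT EXISTS products (
                   name TEXT, price REAL, url TEXT
               )"""
        )

    def process_item(self, item, spider):
        self.conn.execute(
            "INSERT INTO products (name, price, url) VALUES (?, ?, ?)",
            (item.get("name"), item.get("price"), item.get("url")),
        )
        return item  # pass the item on to any later pipelines

    def close_spider(self, spider):
        self.conn.commit()
        self.conn.close()
```

On the spider side, the video uses Scrapy's `SitemapSpider`, which takes a `sitemap_urls` list and calls your parse callback for each page discovered in the sitemap.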
Join the Discord to discuss all things Python and Web with our growing community! / discord
If you are new, welcome! I am John, a self-taught Python developer working in the web and data space. I specialize in data extraction and JSON web APIs, both server and client. If you like programming and web content as much as I do, you can subscribe for weekly content.
:: Links ::
My Patrons really keep the channel alive, and get early content / johnwatsonrooney (NEW free tier)
Recommended Scraper API: www.scrapingbee.com?fpr=jhnwr
I host almost all my stuff on Digital Ocean: m.do.co/c/c7c90f161ff6
A rundown of the gear I use to create videos: www.amazon.co.uk/shop/johnwat...
Proxies I recommend: nodemaven.com/?a_aid=JohnWats...
:: Disclaimer ::
Some or all of the links above are affiliate links; I receive a small commission should you choose to purchase any services or items through them.

Science

Published: 16 Mar 2024

Comments: 28
@shubhammore6332 · 1 month ago
I never comment on youtube videos but this has been so helpful. Thank you. Subscriber++
@stevenlomon · 3 months ago
Super neat!! Also as a Swede I chuckled at "this is a pretty standard e-commerce site" when talking about Sweden's most valuable brand haha
@JohnWatsonRooney · 3 months ago
haha! yeah huge brand..! thanks for watching
@superredevil12 · 23 days ago
love your video man, great content!
@cagan8 · 3 months ago
Just followed, great content
@graczew · 3 months ago
Good stuff as always. I will try to use this with the fotmob website. 👍😉
@jayrangai2119 · 2 months ago
You are the best!
@matthewschultz5480 · 1 month ago
Thank you very much John, great series - I am a bit stuck between this video and the cleaning-with-Polars video: taking the JSON terminal output and converting it for use in Polars. Is there a function I can add to the code to output to CSV (or JSON)? I considered importing the csv and json libraries and writing one, but I'm unsure on this step. Many thanks again
@LuicMarin · 3 months ago
I bet you can't make a video on how to get past Cloudflare-protected websites - not a simple test Cloudflare site, but proper ones where Cloudflare's detection works properly
@negonifas · 3 months ago
not bad, thanks a lot.
@mattrgee · 3 months ago
Thanks! Another really useful video. What would be the best way to either remove unwanted columns or extract only the required columns then output a json file containing only the required data? This and your 'hidden API' video have been so helpful.
@JohnWatsonRooney · 3 months ago
thanks! you could remove the keys from the JSON (dict) in Python before loading it into a dataframe, or if you are going to use the dataframe anyway, remove them there by dropping columns
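John's first suggestion - stripping unwanted keys from each JSON record (a plain dict) before it ever reaches a DataFrame - can be sketched like this. The field names are made up for illustration:

```python
# Records as they might come back from an API, with fields we don't want.
raw = [
    {"name": "Tee", "price": 9.99, "tracking_id": "x1", "internal_rank": 3},
    {"name": "Cap", "price": 4.50, "tracking_id": "x2", "internal_rank": 7},
]

# Keep only the keys we care about, dropping everything else.
wanted = {"name", "price"}
cleaned = [{k: v for k, v in row.items() if k in wanted} for row in raw]

print(cleaned)
# [{'name': 'Tee', 'price': 9.99}, {'name': 'Cap', 'price': 4.5}]
```

The second suggestion is the pandas-side equivalent: once the records are loaded, `df.drop(columns=["tracking_id", "internal_rank"])` removes the same fields from the DataFrame.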
@TheJFMR · 3 months ago
I use polars instead of pandas. Anything rewritten in Rust tends to perform better ;-)
@ying1296 · 3 months ago
thank you so much for this! I always had the issue of trying to scrape data from sites whose paging is based on "Load More"
@JohnWatsonRooney · 3 months ago
Glad it helped!
@mohamedtekouk8215 · 3 months ago
Kind of magic, thank you very much 😭😭😭 Can this be used for scraping multiple pages?
@rianalee3138 · 3 months ago
yes
@RyanAI-kk1kv · 3 months ago
I'm currently working on a project that involves scraping Amazon's data. I tried a few methods that didn't work, which led me to your video. However, when I loaded Amazon and looked through the JSON files, I couldn't find any that included the products. Why is that? What do you recommend I should do?
@viratchoudhary6827 · 3 months ago
I discovered this method three years ago🙂
@milesmofokeng1551 · 3 months ago
How long have you been using Linux? And would you recommend an Arch Linux distro?
@JohnWatsonRooney · 3 months ago
3 years full time, dual boot/on and off for 10+. I use Fedora at the moment, seems to be a good mix. Unless you rely on Windows-specific software for work, or play games, 100% Linux. The only thing I don't do on Linux is edit videos, and that's for convenience.
@heroe1486 · 2 months ago
@@JohnWatsonRooney Most games are more than playable thanks to Proton now though; the only drawbacks are the ones with really intrusive anti-cheat like Valorant's.
@JohnWatsonRooney · 2 months ago
@@heroe1486 yeah, it's good to see - the last thing I played was PoE and that was absolutely fine
@EmonNaim · 3 months ago
😘😘😘
@schoimosaic · 3 months ago
Thanks for the video, as always. In my attempt, the website's response didn't include a 'metadata' key. Instead, the page restriction was specified under the 'parameter' key, as shown below. Despite setting 'pageSize' to 1000, I only received a maximum of 100 items, which suggests a limit preset by the admin. I'm uncertain how to bypass this apparent 100-item restriction.
params = {
    ...
    'lang': 'en-CA',
    'page': '1',
    'pageSize': '1000',
    'path': '',
    'query': 'laptop',
    ...
}
@JohnWatsonRooney · 3 months ago
there will be a restriction within their API - I was surprised the one in my example went up so high; 100 seems about right. You will have some kind of pagination available to get the rest of the results
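The pagination John describes - requesting page after page at the server's maximum page size until a short or empty page signals the end - can be sketched like this. `fetch_page` stands in for whatever HTTP call the site actually needs; the parameter names mirror the ones in the comment above and are assumptions:

```python
def collect_all(fetch_page, page_size=100):
    """Collect every item from a paged API capped at page_size per request.

    fetch_page(page=..., page_size=...) must return a list of items
    for that page; a page shorter than page_size means we've hit the end.
    """
    items, page = [], 1
    while True:
        batch = fetch_page(page=page, page_size=page_size)
        items.extend(batch)
        if len(batch) < page_size:  # short (or empty) page: no more results
            break
        page += 1
    return items
```

With requests, `fetch_page` might be a small wrapper around something like `requests.get(url, params={"page": str(page), "pageSize": str(page_size), "query": "laptop"}).json()` - the exact URL and the key holding the item list depend on the site.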