
Web Scraping with Beautiful Soup - Make Databases from Scratch 

71K views
1,743 likes

In this video we'll extract information from web pages and store it in a CSV file.
STEP 1. We'll scrape a webpage with Beautiful Soup.
STEP 2. We'll fine-tune the extracted information with Regex.
STEP 3. We'll store the information in a DataFrame.
STEP 4. We'll save the DataFrame to a CSV file.
Webpage URL:
docs.python.org/3/library/random.html
Jupyter Notebook Code:
github.com/MariyaSha/WebScraping
Read Blog Post in Medium:
medium.com/@mariyasha888/web-scraping-with-beautiful-soup-2c45a731df2e
Read more about Beautiful Soup in the Documentation:
www.crummy.com/software/BeautifulSoup/bs4/doc/
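For reference, a minimal sketch of the four steps in code (the tag choice and the cleanup regex are assumptions about the page structure, not a copy of the linked notebook):

    import re
    import urllib.request

    import pandas as pd
    from bs4 import BeautifulSoup

    URL = "https://docs.python.org/3/library/random.html"

    # STEP 1: fetch the page and parse it with Beautiful Soup
    html = urllib.request.urlopen(URL).read()
    soup = BeautifulSoup(html, "html.parser")

    # STEP 2: pull the function signatures out of <dt> tags and tidy them with regex
    names = []
    for dt in soup.find_all("dt"):
        text = re.sub(r"\s+", " ", dt.get_text()).strip()  # collapse whitespace, incl. non-breaking spaces
        names.append(text)

    # STEP 3: store the extracted information in a DataFrame
    df = pd.DataFrame({"function": names})

    # STEP 4: save the DataFrame to a CSV file
    df.to_csv("random_module_functions.csv", index=False)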

Science

Published: 3 Jun 2020

Comments: 113
@iansjackson 3 years ago
Could I respectfully mention a slight efficiency: the extraction and the cleanup can go onto one line, e.g. item = item.text.replace(' ', ' ').
@PythonSimplified 3 years ago
But of course, Ian! 😃 I always appreciate your input, especially when something can be done more efficiently! Thanks for taking the time and sharing your solution 😊
@iansjackson 3 years ago
@PythonSimplified thank you
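A minimal sketch of the one-liner suggested above, assuming the character being replaced is the non-breaking space that the docs pages use (the quote in the comment doesn't show it clearly):

    import urllib.request
    from bs4 import BeautifulSoup

    html = urllib.request.urlopen("https://docs.python.org/3/library/random.html").read()
    soup = BeautifulSoup(html, "html.parser")

    for dt in soup.find_all("dt"):
        item = dt.text.replace("\xa0", " ")  # extract and clean in one step; '\xa0' is an assumed target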
@jesusarmentasegura 1 year ago
You're so clear at explaining and you make it look so easy. You make me feel like a senior programmer even though I barely know how to turn on my PC. Thank you very much.
@jaznarossi3331 3 years ago
This was INCREDIBLY helpful! Thank you for explaining all of this so clearly.
@PythonSimplified 3 years ago
Thank you so much Jazna, glad I could help! 😁
@webscrapingwithandy2110 4 years ago
⭐️ Timestamps ⭐️
0:59 - import libs
1:33 - load html code from a url
2:37 - using Chrome Inspector (DevTools)
3:06 - find all function names
4:53 - find all function descriptions
7:02 - store the data inside a DataFrame
8:50 - export the data into a CSV file
@MrJackod91 2 years ago
I saw a 3-hour course about this but it was hard for me; you made it so easy and I understood everything!!! Thanks!!!
@tav9755 3 years ago
Big respect: Beautifulsoup in 10 mins. Wow
@securitydogma2961 2 years ago
I love this channel. I'm learning so much from these fantastic tutorials.
@dorotamarkowska5542 3 years ago
It is so simple when you explain it! Thanks a lot.
@PythonSimplified 3 years ago
You're welcome, I'm glad you liked this tutorial! 😀
@leonardoemanuelbaizre9705 3 years ago
How do you manage to program so easily? How do you reason so easily? The code just comes out of you. You remember each function of each library; you remember everything very easily. How did you manage to learn Python with such skill? I really like your videos, I learn a lot. Greetings from Argentina.
@gummypotatoes2887 2 years ago
It's not memorization, it is knowing how computers work. If you fully understand what is happening in each line written, you will remember the syntax.
@johntessier6459 3 years ago
I appreciate you making this, I am currently using this and your example code to do some scraping for work. I have no real prior knowledge of using python, although I do have a background in computers. This is very helpful, although I have to scale the script to work for multiple sites. Thank you. :)
@srini3828 3 years ago
Excellent, you are my teacher; the explanation is very simple.
@markslima1557 1 year ago
Very cool video, I learned a lot! Note: for the EXAMPLE section, the tag has changed from 'div' to 'section'; that will help. Thanks.
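A hedged sketch of the adjustment Mark describes: try <section> first and fall back to <div> (the id value is illustrative, not taken from the video; check it in DevTools first):

    import urllib.request
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(urllib.request.urlopen("https://docs.python.org/3/library/random.html").read(), "html.parser")

    # Newer builds of the docs wrap the examples in <section> instead of <div>, so try both
    example = soup.find("section", attrs={"id": "examples"}) or soup.find("div", attrs={"id": "examples"})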
@neilgyverabangan6989 2 years ago
The best! Thank you soo much for the tutorial! I hope you'll make a video about scraping LinkedIn profiles from a search result. Thanks!
@mayankmaurya9990 2 years ago
Learning from you is a beautiful task.
@billreed1606 2 years ago
This was good, clear, and concise. Thank You
@rockbinary8520 3 years ago
you are my ideal programmer
@huuNguyen-bm2ju 3 years ago
Beautiful teacher and clear code.
@chiranjeebroychowdhury7759 3 years ago
Another nice and simple video showing the basics of BeautifulSoup. But viewers should be aware that not every website on the Internet should be scraped. People should first verify which portions of a website are allowed to be scraped and then run the scraper, otherwise there will be trouble.
@PythonSimplified 3 years ago
Thank you so much Chiranjeeb! 😁 I wouldn't be so sure about the "allowed" part though, as there's nothing you can do with web scraping that can't be done manually with your keyboard and mouse! If something is not meant to be copied - you won't have the option to interact with that particular element, it's quite easy to do with simple HTML/CSS! But again - as long as you can do it with your mouse and keyboard - why would web scraping be any different?

To support my claim, if web scraping actually got anybody in trouble - I would be in prison a long time ago along with all the cyber security experts, marketing advisors and software testing individuals. There is nothing illegal about automation and scraping if you're using it for educational or personal purposes! When you try to distribute this information, however, this is where you might get in trouble with copyrights - but as long as you keep the usage within the "fair use" guidelines - you don't even need to ask permission from the original owner to use their content. This allows us to freely use the internet without worrying about downloading pictures or copying text/code that was created by others 😉

In terms of social media platforms though - you are correct! The terms of service of some platforms may "not-allow" you to crawl or scrape - but they can't get you into legal trouble no matter how hard they try!!! They sure can delete your account or restrict it - but this is why I always recommend using a dummy account instead of your personal account whenever you're running bots 😁 If they delete your account - just open a new one and keep scraping like a boss! 🤪
@chiranjeebroychowdhury7759 3 years ago
@PythonSimplified What you are saying makes sense. As long as you don't distribute, you are good to go. Also, websites have a robots.txt file; I think that has more info on which sub-domains might be restricted.
@PythonSimplified 3 years ago
@chiranjeebroychowdhury7759 yup! Or you can also include a robots meta tag inside the head section of a webpage to prevent some of the scraping. It's great to combine with the robots.txt! With that said - you can always bypass it with fancy Python commands or services like ScrapingBee that use proxies. Alternatively, you can even use Selenium to bypass these restrictions as it generates a browser window, which makes it very hard to distinguish between a legitimate user and a bot 😃 Headless browsers are a bit more problematic from that point of view...
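For anyone who wants to check the robots.txt rules programmatically before scraping, a minimal sketch using the standard library (the URLs are just the docs page from this video):

    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("https://docs.python.org/robots.txt")
    rp.read()

    # True if the rules allow a generic crawler to fetch this path
    print(rp.can_fetch("*", "https://docs.python.org/3/library/random.html"))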
@chiranjeebroychowdhury7759 3 years ago
@PythonSimplified Yeah, I am going to watch your Selenium videos this week. I never really got around to learning it; it's not really used much in academics. I am actually a CS teacher, so I never felt the need to learn Selenium because it's not in many curriculums. But I do realize now that it's actually pretty interesting!! And I am so glad to see that you interact with commenters so promptly! Maybe you could consider having a Discord server or something similar.
@keithrfield 1 year ago
@PythonSimplified Hello. Any time you browse a webpage you are downloading content; anything you see on a webpage is currently on your computer. So you don't even need to specifically request that content be downloaded, as it is downloaded by default. If downloading content were illegal, every user of the Internet would be a criminal. The issue comes when you use someone else's content for profit. I once found a picture of a squirrel on the net that I used in a website I was creating for a pest control company. Several months later the website owner received a letter from a law firm asking for damages. FOR A PICTURE OF A SQUIRREL!!!!! It was very unethical, as there were no damages: the picture was taken by someone on vacation and they didn't suffer any losses from its use on the website. We ignored the letter and that was the end of it.
@_mytube_ 3 years ago
I like that relaxing music.. as this can be stressful
@belamonetatheanalyst 1 year ago
so easy to understand! thanks!
@sebastianmt02 3 years ago
Beautiful teacher!! Thank you very much!
@nczioox1116 4 years ago
Nice! I've been using BS and Selenium to scrape and store in a Heroku Postgres database. It's been an awesome small project to play with.
@PythonSimplified 4 years ago
I've been a huge fan of BeautifulSoup for a long time, and today I found out it actually has a "daughter" library named Mechanical Soup, which offers functionality very similar to Selenium. I thought I should mention it, as I'm still in awe of how simple it is! :)
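A minimal sketch of the Mechanical Soup library mentioned above: a stateful browser built on top of Beautiful Soup (the URL is just the page from this video):

    import mechanicalsoup

    browser = mechanicalsoup.StatefulBrowser()
    browser.open("https://docs.python.org/3/library/random.html")
    soup = browser.get_current_page()   # a regular BeautifulSoup object
    print(soup.title.text)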
@begaal9840 3 years ago
Somehow my learning ability skyrocketed
@PythonSimplified 3 years ago
Glad I could help! 😁😁😁
@khayamkhan2702 4 years ago
I appreciate this, but now we have ParseHub, which makes data scraping much easier. You just need to write a small program for re-updating the data and boom, ParseHub makes it way easier to scrape the data. Anyhow, it was a lovely tutorial. Keep it up!
@PythonSimplified 4 years ago
Thank you so much Khayam! I'll definitely check out ParseHub, I'm a big fan of easier methods! Thank you for letting me know :)
@vijaykumarlokhande1607 3 years ago
Good work. It is the shortest and most useful one.
@PythonSimplified 3 years ago
Thank you, I'm glad you liked it! 😃
@kasperandreasen4110 2 years ago
Boom, and then I got into web scraping, thanks.
@angelfoodcake1979 3 years ago
As usual, a great tutorial to follow along. I don't know, though, why my lists are both len 25 :D I got confused by the jump to a second soup.findAll. I would have liked to see that second example finished to the end, where you get the DataFrame, but it's straight back to the previous example. Specifically, I would have liked to see more on re usage to filter out all the stuff around the useful information, because I feel like it comes with more than just the useful text. I found that a straight to_csv also works in Colab; it then shows up in the left panel under Files, from where it can be downloaded. But connecting your own Drive is more elegant, of course. Fab job. I'm really enjoying these tutorials you put out! Keep them coming :)
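A small sketch of the two points above: trimming the extra material around each description with re, and saving straight to CSV in Colab (the trailing-note pattern is an assumption; adjust it to whatever surrounds your text):

    import re
    import urllib.request

    import pandas as pd
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(urllib.request.urlopen("https://docs.python.org/3/library/random.html").read(), "html.parser")

    descriptions = []
    for dd in soup.find_all("dd"):
        text = re.sub(r"\s+", " ", dd.get_text()).strip()               # collapse whitespace
        text = re.sub(r"(Changed|Deprecated) in version.*$", "", text)  # drop trailing version notes (assumed pattern)
        descriptions.append(text)

    # In Colab, a plain to_csv lands in the Files panel on the left, no Drive mount needed
    pd.DataFrame({"description": descriptions}).to_csv("descriptions.csv", index=False)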
@mmchaava 3 years ago
very good presentation, easy to follow. thank you
@moisessenju2592 3 years ago
Very good, thank you for sharing!
@PythonSimplified 3 years ago
You're welcome, enjoy! 😃
@tikka8558 4 years ago
Great work!! very helpful
@PythonSimplified 4 years ago
Thank you!! I'm really glad I was able to help! :)
@rafaelsantos5332 4 years ago
Great job!
@PythonSimplified 4 years ago
Thank you! :)
@thesouthsidedev1812 3 years ago
You just earned a new sub, keep the videos coming.
@PythonSimplified 3 years ago
Thank you so much The SouthSideDev, welcome aboard! 😃
@cccccccccccccccc132 3 years ago
Cool, I enjoyed it a lot, very interesting and well explained; I'll be using your videos a lot as a guide.
@kangna2268 2 years ago
Thank you, this really helps me as I'm learning.
@elastvd7503 2 years ago
Great, thank you
@avazart614 3 years ago
Beautiful Soup, Beautiful Dress !
@nikosroom1913 3 years ago
You are amazing, thank you so much.
@KhalilYasser 3 years ago
Thanks a lot for this awesome tutorial.
@PythonSimplified 3 years ago
You're absolutely welcome, enjoy! :)
@kikiryki 1 year ago
Hi Mariya, thanks for your tutorial. Could you please explain in a new video how to iterate a search in a web page's search box (a business directory) from a list of about 30 company names stored in an Excel column, scraping some specific data from the results HTML (e.g. VAT number, address) into a new CSV or XLS file? Thank you!
@curiositymars6688 4 years ago
Superb
@PythonSimplified 4 years ago
Thank you! :)
@pawesauga440 3 years ago
great! thank You
@philtoa334 3 years ago
Very nice.
@dimitrichakma 2 years ago
what should i watch? code or you
@digigoliath 4 years ago
Great Job Mariya. Beautiful Girl presents Beautiful Soup. But seriously, great content & quality work. Enjoyed this video very much.
@PythonSimplified 4 years ago
Thank you so much!! :D I'm glad you liked the video and I really appreciate your lovely feedback! :)
@digigoliath 4 years ago
@PythonSimplified You're welcome. Good work must be appreciated. I came here from the FB Python group.
@PythonSimplified 4 years ago
@digigoliath that's awesome! I'm there all the time 😀
@kelaskita6765 3 years ago
Hi Mariya, thanks a lot for your tutorial. For a next tip, could you please show us how to scrape a DataTable on a web page? As you may know, a DataTable usually has lots of pagination. Let's say it holds 1000 rows but only shows 50 rows per page; how do we scrape the remaining 950 rows? I can tell you that it all comes from a single URL.
@jettsprogrammingchannel7832 3 years ago
Jeez, this really puts the 'Simp' in Python Simplified, srsly. Have you seen some of these comments? Great tutorial tho.
@mirfaramarzhussaini3413 3 years ago
Cool
@greyblack3835 3 years ago
Sweet
@idk____idk6530 3 years ago
Just one question: how many days would it take me to become like you in Python? 😔
@northwindx79 3 years ago
Hello, can you use the same code on another web server instead of Google? Thanks.
@iftakharhussain 4 years ago
How would you scrape data from dynamically loading webpages and store it in a MySQL database?
@PythonSimplified 4 years ago
Hi Iftakhar, I'm currently working on a video of pulling data out of Instagram, which is based on React. Once I reach satisfying results, I'll definitely let you know. The MySQL part, however, is not something I'm planning on getting into in the near future, it's a bit more complicated than what I can teach at the moment, my apologies on that. Try asking in Python groups on FB, there's lots of support among the Python community and it might be something that other users will find helpful.
@PythonSimplified 4 years ago
Your best bet would probably be Selenium; here's a link to the documentation: selenium-python.readthedocs.io/ It all depends on the exact web technology you're looking to scrape, as every module deals with different languages.
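A minimal sketch of the Selenium route suggested above: render the page in a real browser, then hand the finished HTML to Beautiful Soup (driver setup varies by Selenium version; the URL is a placeholder for a JavaScript-heavy page):

    from bs4 import BeautifulSoup
    from selenium import webdriver

    driver = webdriver.Chrome()          # assumes a Chrome driver is available
    driver.get("https://example.com")    # placeholder for a dynamically loaded page
    html = driver.page_source            # the HTML after the scripts have run
    driver.quit()

    soup = BeautifulSoup(html, "html.parser")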
@samiarahman5763 3 years ago
I'm facing a problem. I wrote the same code as yours, but it says urllib is undefined. How do I fix this? Please help me solve it; I badly need to solve this now.
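That error usually just means the module was never imported; a hedged guess at the fix:

    # "urllib is undefined" typically means the import is missing or incomplete;
    # importing the submodule explicitly is the safest form:
    import urllib.request

    html = urllib.request.urlopen("https://docs.python.org/3/library/random.html").read()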
@roshanrajsingh4838 3 years ago
I'd love to know more about you. Where can I connect with you?
@PythonSimplified 3 years ago
Hi Roshan, thank you! I've deleted most of my social media accounts due to privacy concerns; however, I'm still on LinkedIn: linkedin.com/in/mariyasha888/ and I still haven't deleted my Instagram account, which I might use in the future (or I might move to the middle of the woods and delete it too 🤣): instagram.com/mariyasha888/ I'm working on an official Python Simplified website, where we can communicate just like on social media but with no sinister use of our data by tech giants 😁 I'll keep you posted!
@kostas6915 2 years ago
Hello, a technical question: can we web scrape an HTML table when a login is required to access that page? (Assuming we have the credentials, of course.)
@keithrfield 1 year ago
You can do this using MechanicalSoup
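A minimal sketch of a MechanicalSoup login, assuming a plain HTML form (the URL and field names are placeholders; inspect the real form to find them):

    import mechanicalsoup

    browser = mechanicalsoup.StatefulBrowser()
    browser.open("https://example.com/login")   # placeholder login page
    browser.select_form("form")                 # select the (first) form on the page
    browser["username"] = "my_user"             # placeholder field names
    browser["password"] = "my_password"
    browser.submit_selected()

    # The session cookies persist in the browser object, so the protected page is now reachable
    browser.open("https://example.com/members/table")
    table_soup = browser.get_current_page()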
@dury10 4 years ago
Hi, using your script I get this error (line 2, in 'from bs4 import BeautifulSoup as bs'): ModuleNotFoundError: No module named 'bs4'. P.S. I'm using PyCharm.
@PythonSimplified 4 years ago
Hi Dury! I'm not using PyCharm so I'm not sure if I'm the best person to ask... but from what I've seen on Stack Overflow, PyCharm might have an issue with the bs4 library. May I suggest you use a different library altogether? Mechanical Soup is based on Beautiful Soup, but it's way more convenient and can do so much more! :D Here's a link to the documentation: mechanicalsoup.readthedocs.io/en/stable Or you can check out my video on Mechanical Soup: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-drDdb1MBBfI.html Let me know if it worked out! :)
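A hedged note on the error itself: ModuleNotFoundError usually just means the package isn't installed in the interpreter PyCharm is pointed at.

    # Install Beautiful Soup into the interpreter PyCharm uses, e.g. from its terminal:
    #   python -m pip install beautifulsoup4
    # After that, the import from the video resolves normally:
    from bs4 import BeautifulSoup as bs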
@worldmusic6221 3 years ago
thnx
@ratheeshkaippada8274 1 year ago
Could you please help me... I'm extracting a value from the HTML (an element whose text is 5.89) and I want to get just the value 5.89.
@PythonSimplified 1 year ago
You can either go for the innerHTML or text attribute, but if you're scraping a table - there's no point using Beautiful Soup, you can do it much quicker with Pandas. I have a super simple tutorial on it, you can check it out here 😃: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-oF-EMiPZQGA.html
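A minimal sketch of the Pandas route mentioned above (read_html returns one DataFrame per <table> on the page; it needs an HTML parser such as lxml installed, and the URL below is a placeholder):

    import pandas as pd

    tables = pd.read_html("https://example.com/prices")   # placeholder page containing an HTML table
    df = tables[0]                                         # first table on the page
    print(df.iloc[0, 1])                                   # pick out a single cell, e.g. a value like 5.89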
@MuhammadAbdullah-uv1vt 3 years ago
You are also Beautiful like Beautiful Soup💖
@geopolitica5106 3 years ago
Thank you, gorgeous.
@pacisjules1391 3 years ago
I love you so much
@hopelesssouldisappointment5121 3 years ago
Can you explain those modules (re, urllib.request) first and then continue explaining web scraping with bs4? That would be helpful 😅😅😅
@davidrowlands8962 3 years ago
Are you related to Web Dev Simplified?
@PythonSimplified 3 years ago
Nope :) Unfortunately I'm not even familiar with the brand/channel... anything cool you recommend to check out?
@liftcarryfetish1296 2 years ago
Your CLEAVAGE is so Beautiful. I mean Beautiful Soup 🔥
@thelostman5625 3 years ago
What is the name of the intro tune?
@章鱼-e4p 3 years ago
Miss, could you add subtitles? I can't quite understand the audio.
@UMARUMAR-bl9bc 2 years ago
I swear to God I’m in love with you….
@badral-balushi9907 2 years ago
If you hadn't shown that this is actually being done to create a database, it would have made me feel discouraged, as I never understood the use of web scraping before... thanks 😄
@Honokaa65 1 year ago
You should have named it "Web Scraping with Beautiful Girl"
@ColinPittendrigh 2 years ago
If I knew how to write songs I'd write one about BeautifulSoup
@oracle11iappsdba 3 years ago
Beautiful Face teaching Beautiful Soup. Just kidding....
@itgo2754 4 years ago
you so beautiful :D
@PythonSimplified 4 years ago
Thank you! :)
@aristideregal 3 years ago
@PythonSimplified He was talking about the soup
@PythonSimplified 3 years ago
@aristideregal oh... ok 😢
@aristideregal 3 years ago
@PythonSimplified Just kidding. Your videos are amazing. A bit too hard for me to understand, but I am putting in the work to level up!
@trippy7415 4 years ago
Bruh
@PythonSimplified 4 years ago
Bruuuuuuh
@brokerkamil5773 1 year ago
Thx
@1conscience0dimension 3 years ago
soup('dt') works too, or example = soup("div", attrs={"id": "bookkeeping-functions"})
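Right: calling the soup object is documented shorthand for find_all, so each pair below returns the same results (the id is the one quoted in the comment):

    import urllib.request
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(urllib.request.urlopen("https://docs.python.org/3/library/random.html").read(), "html.parser")

    dts = soup("dt")                                                  # same as the find_all form below
    dts = soup.find_all("dt")

    example = soup("div", attrs={"id": "bookkeeping-functions"})      # same as the find_all form below
    example = soup.find_all("div", attrs={"id": "bookkeeping-functions"})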