This is one of the best tutorials I've ever seen. I have been searching for this kind of tutorial. Thanks. How do I send a CSV file or table as an input parameter instead of reading all files from a folder? Please share a link or video that can help.
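A minimal sketch of what the question asks for, assuming the video's folder-scanning loop: wrap the per-file logic in a function that takes the CSV path (and optionally a table name) as a parameter. `process_csv` and its upload step are hypothetical names for illustration, not from the video.

```python
import os

def process_csv(csv_path, table_name=None):
    """Process a single CSV passed as a parameter instead of scanning a folder."""
    # derive the table name from the file name if none is given
    table = table_name or os.path.splitext(os.path.basename(csv_path))[0]
    # ... clean the file and upload it to the database under `table` ...
    return table

print(process_csv("datasets/Customer Contracts$.csv"))  # → Customer Contracts$
```

Called this way, the same cleaning/upload code runs on exactly one file, so the function can also be reused inside the original folder loop.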
Hey Nate, your videos are just too good. I love how your channel is so dedicated to real-world data science. By the way, I noticed that you started a video series, "For your Data Science Project", and I really want you to continue making videos for this particular series because there's literally no one on RU-vid with such guidance on DS projects, and I have been looking for one for a very long time. I have my placements just 12 months away and I really want to make a full-stack data science project. Thank you.
Amazingly helpful and easy to follow. I love this series on automating common tasks with Python. Could you do a series on using Python to call an API and store the JSON output in a database? Thank you again!
I'm glad you like this series. I wasn't sure if it was a topic people liked or found boring, but I'll aim to do a few more. My next one can definitely be automating an API call and storing the data in a database. I have a few SQL videos in the queue right now, but I'll aim to create another Python video sometime early next year. I think what I'll also do is speed up the coding process. Correct me if I'm wrong, but you don't actually need to see me coding, so I might just show the code one line at a time and explain it. If you have a strong opinion about it one way or another, let me know.
@@stratascratch Personally, the coding part is very helpful to me, especially how you describe each step and each piece of your code. When you do it live it is slow enough for me to process what you are doing and understand. If you jump ahead and skip over the coding, it's too fast for me and I'll have to figure out each piece of what you are doing (lots of pausing). I'm a novice, so that's my bias. Thank you!
@@stratascratch Hey Nate, I think this is one of the best data science channels out there. I've spent the last few months learning DS from kaggle and tutorials on udemy/yt, but you're the first person to code the way I'd like to learn. Unfortunately, most people focus on the DS part, so they completely ignore good software development practices. I would love to see more series on model building or cleaning data.
@@grzegorzzawadzki3048 Thanks so much! Glad you enjoy these videos. I agree with you on creating more videos on model building and cleaning data. I wish I had the time to create those videos =(. The Python videos like this one take so much effort and time that I'm never able to do much else. I'll think about some other DS topics to turn into videos! Thanks for the kind words and for watching my videos!
Thanks for watching. If there are any requests, please let me know and I'll try to make a video about it. Also, let me know if you think the coding is too slow and should be faster. I'm not always sure if people want to see me actually type code or if they would rather just see me copy and paste the code in to make the video go faster.
Nate, this content is so good you don't even have to ask viewers to subscribe; anyone who values it as much as we do will subscribe and do whatever it takes to keep in touch. This is extremely great content; may you continue to soar in everything you do!
Great content! Very thoroughly and clearly explained. I appreciate you taking the time to make such great content! I would love to see more of these series because they are not only educational, but implementable!
Thank you so much, Nate, for all your help ❤ Could you please make projects from scratch about APIs & pipelines, in sales or any area you want? I've seen your video about it. Also, any advice for learning? I don't know what I should learn first in these fields.
Hey Nate, I am getting this error while moving the files to the datasets directory:

mv 'Customer Contracts$.csv' datasets
 ^
SyntaxError: invalid syntax

The datasets directory is created but the files are not moving.
I tried with this command:

import os
import shutil

def move_csv_files(csv_files, origin, target):
    # move files from the origin directory to the target directory
    print(origin)
    print(target)
    for csv in csv_files:
        shutil.move(os.path.join(origin, csv), target)
I use Oracle only sporadically and not very extensively. I would advise you to post your question on Stack Overflow; someone should be able to help you there. Also try Oracle Communities and the cx_Oracle community on GitHub: Issues · oracle/python-cx_Oracle
Awesome video (series). You explaining the code, and even going from beginner (video 1) to advanced (video 2) to expert (video 3), makes these such a valuable asset for data analysts/scientists. I would really love for you to continue with automations like this. These beat any Udemy/Coursera videos, any day! Is it possible to get a 4th video with more SQL stuff, such as: updating data in the DB with a new CSV file (maybe something that gets updated daily), or appending new data to existing data (overwriting whatever gets duplicated)? I would also love some automation with openpyxl (or similar libraries you are familiar with; maybe creating some charts with the cleaned data). I appreciate the effort you've put into these videos (I've purchased the lifetime access on the platform) and I will make sure others learn about your channel!
Thanks for your support! There are some vids here - ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-fklHBWow8vE.html and ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-77IVf0zgmwI.html - that cover the topics you're talking about. The only difference is that they use an API to collect data. I would follow the same guidelines but use a CSV and a pd.DataFrame rather than make an API call to collect the data. But topics like overwriting with new data, etc., are covered.
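To make the "append new data, overwrite duplicates" part concrete, here's a hedged sketch of an upsert keyed on an id column. sqlite3 stands in for Postgres purely so the snippet is self-contained; Postgres uses the same INSERT ... ON CONFLICT (id) DO UPDATE syntax, and the table and column names here are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL)")
conn.execute("INSERT INTO sales VALUES (1, 10.0), (2, 20.0)")

# a daily CSV load might yield rows like this: id 2 is a duplicate, id 3 is new
new_rows = [(2, 25.0), (3, 30.0)]
conn.executemany(
    "INSERT INTO sales VALUES (?, ?) "
    "ON CONFLICT(id) DO UPDATE SET amount = excluded.amount",
    new_rows,
)
rows = conn.execute("SELECT id, amount FROM sales ORDER BY id").fetchall()
print(rows)  # → [(1, 10.0), (2, 25.0), (3, 30.0)]
```

The duplicate id is overwritten and the new id is appended in a single statement, which avoids a separate delete-then-insert step.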
@@stratascratch I'm having an issue with a specific CSV I work with (I tested the final code with other CSVs and they work just fine). I'm getting this error: "QueryCanceled: COPY from stdin failed: error in .read() call: UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 6589: character maps to <undefined> CONTEXT: COPY sfs_tool_t1, line 1". It passes these prints: "opened db succesfully", "csv was created succesfully", "file opened in memory", and then comes the error. Any thoughts? (Googling doesn't provide any helpful suggestions.)
I've identified the columns that generate the issue (3 of them, containing text, like sentences). My CSV is an export from a SharePoint list. I suspect some of the characters get messed up while downloading, so there's probably an issue with the encoding of that text. Perhaps I should do something with the text within those columns (some cleaning).
@@diaconescutiberiu7535 Definitely the encoding. Try to force UTF-8 and that should fix it. I like to export from gSheets because they have the cleanest CSVs and I've never had an issue loading a csv I exported from Google. I always have problems loading csvs that are exported from Excel. Hope that helps
The real answer is that I randomly chose it over 10 years ago when I was just starting and kept with it. It's also super easy to deploy on AWS, and Postgres is better than other free options like MySQL for analytics. Here's an article with the comparison (hackr.io/blog/postgresql-vs-mysql). On the job, most companies use industry-grade DBs like Hive, Greenplum, Snowflake, and MS SQL Server. There are only slight differences in terms of syntax.
How should we go about it if we need to do some cleaning inside the CSV files? I have multiple CSV files which I need to upload to my Postgres DB; the files have different columns, so the cleaning is different for each column. If I just run this: "dataframe["country"] = dataframe["country"].replace(to_replace=['Kingdom','States','Kong','Emirates','Rico'], value=['United Kingdom','USA','Hong Kong','UAE','Puerto Rico'])" it will do the job and upload the file that does have a country column, but it will fail for the other CSVs. The script is giving me KeyError: 'country' (which is obvious, as the other files don't have it).
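One way around the KeyError (a sketch, not necessarily how the video would do it): guard each column-specific cleaning rule with a membership check on df.columns, so files that lack the column pass through untouched.

```python
import pandas as pd

def clean_country(df):
    # only apply the rule when the column actually exists in this file
    if "country" in df.columns:
        df["country"] = df["country"].replace(
            to_replace=["Kingdom", "States", "Kong", "Emirates", "Rico"],
            value=["United Kingdom", "USA", "Hong Kong", "UAE", "Puerto Rico"],
        )
    return df

df1 = clean_country(pd.DataFrame({"country": ["Kingdom", "Kong"]}))
df2 = clean_country(pd.DataFrame({"amount": [1, 2]}))  # no country: no KeyError
print(df1["country"].tolist())  # → ['United Kingdom', 'Hong Kong']
```

The same pattern scales to one small cleaning function per column, applied in a loop over all CSVs, with each function deciding for itself whether it applies.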
This video is almost two years old and I have no hope of getting an answer, but here it comes: I don't understand DBs completely, but wouldn't it be bad practice to start a connection multiple times in the for loop? From what I know, I'd start the connection above the for loop and then proceed with the loop. What do you think? Would it work?
@@stratascratch As far as I understood, the for loop will connect at each iteration and disconnect at the end of it. Did I get it wrong or something? Hahah. What I thought:

connect_function()
for_loop_function()
disconnect_function()

The way it is in the video, the connection happens inside the for loop. As we see at 22:58, the function "upload_db" is inside the for loop. I thought that starting a connection multiple times in the for loop could be a bad thing, and it would be better to start it before the loop, as I noted up there 🤔🤔
@@AzureCz No, it's not bad practice to open a connection only when you need it. To be honest, it really doesn't matter unless you're keeping the connection open for a long period of time, since you might get a timeout.
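For reference, the connect-once pattern discussed in this thread looks roughly like this. sqlite3 stands in for psycopg2 so the snippet runs on its own, but the shape (connect once, work in the loop, close once) is the same idea.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # one connection, opened before the loop
conn.execute("CREATE TABLE uploads (name TEXT)")

for name in ["a.csv", "b.csv", "c.csv"]:  # loop reuses the open connection
    conn.execute("INSERT INTO uploads VALUES (?)", (name,))

conn.commit()
count = conn.execute("SELECT COUNT(*) FROM uploads").fetchone()[0]
conn.close()  # closed once, after the loop
print(count)  # → 3
```

Either structure works for a handful of files; reusing one connection mainly saves the per-file connect/disconnect overhead, at the cost of keeping it open long enough to risk a timeout on very long runs.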
Thank you for a great tutorial! How do you automate running scripts? Do you manually run the Python script every time you want to upload CSV files into the AWS database, or do you use scheduling software like Airflow or Jenkins so you don't need to worry about running it manually? Are you planning on covering scheduling a script that runs periodically?

As for making an interactive dashboard that ingests real-time data, how can I refresh the database automatically so what's displayed on the dashboard is real-time? Let's say operators run a machine that outputs CSV files in a specific format, which are stored on an internal server in our company. I basically want my web application to ingest the CSV files in that directory, get them into the AWS database, and display the aggregated result in real time. Any ideas? Sorry for the long questions, but I would appreciate your help!
Automating scripts is the hardest part of the process. It's not hard because it's technically difficult; it's hard because there are so many tools out there that can do it for you. I usually rely on whatever tools my company uses for automation, which has included Jenkins, Airflow, and Domino. For personal use, I either use Airflow (create an AWS EC2 instance and install Airflow) or just manually run the script each time I need it. I wasn't going to cover scheduling scripts in this series, but you're spot on with mentioning Jenkins and Airflow.
@@stratascratch In case you are running Airflow on an AWS EC2 instance, you still need to upload CSV files manually from local to the EC2 instance first and then use Airflow to move them into AWS Postgres, right? Is there a way to read files on a local machine from an EC2 instance?
@@steven345lll1 Yea, that's totally fair. For CSV files, I'm not aware of any automated way to upload them from local to EC2. I suppose you could write an Airflow py script that pings your local machine for CSVs (but it's probably best to ping a Google Drive account or something more static). Otherwise, I would build the pipeline so that it doesn't even use CSV files: all data goes from the API to the db, and all the transformations are done in Airflow. I've never had to implement this use case, but it does seem plausible.