This is one of the best tutorials I've ever seen. I have been searching for this kind of tutorial. Thanks. How do I send a CSV file or table as an input parameter instead of reading all files from a folder? Please share a link or video that can help.
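A minimal sketch of what the question asks for, assuming the video's folder-scanning loop: wrap the per-file logic in a function that takes the CSV path (and optionally a table name) as a parameter. `process_csv` and its upload step are hypothetical names for illustration, not from the video.

```python
import os

def process_csv(csv_path, table_name=None):
    """Process a single CSV passed as a parameter instead of scanning a folder."""
    # derive the table name from the file name if none is given
    table = table_name or os.path.splitext(os.path.basename(csv_path))[0]
    # ... clean the file and upload it to the database under `table` ...
    return table

print(process_csv("datasets/Customer Contracts$.csv"))  # → Customer Contracts$
```

Called this way, the same cleaning/upload code runs on exactly one file, so the function can also be reused inside the original folder loop.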
Hey Nate, your videos are just too good. I love how your channel is so dedicated to real-world data science. By the way, I noticed that you started a video series, "For your Data Science Project", and I really want you to continue making videos for this particular series because there's literally no one on RU-vid with such guidance on DS projects, and I have been looking for one for a very long time. I have my placements just 12 months away and I really want to make a full-stack data science project. Thank you.
Amazingly helpful and easy to follow. I love this series on automating common tasks with Python. Could you do a series on using Python to call an API and store the JSON output in a database? Thank you again!
I'm glad you like this series. I wasn't sure if it was a topic people liked or found boring, but I'll aim to do a few more. My next one can definitely be automating an API call and storing the data in a database. I have a few SQL videos in the queue right now, but I'll aim to create another Python video sometime early next year. I think what I'll also do is speed up the coding process. Correct me if I'm wrong, but you don't actually need to see me coding, so I might just show the code one line at a time and explain it. If you have a strong opinion about it one way or another, let me know.
@@stratascratch Personally, the coding part is very helpful to me, especially how you describe each step and each piece of your code. When you do it live it is slow enough for me to process what you are doing and understand. If you jump ahead and skip over the coding, it's too fast for me and I'll have to figure out each piece of what you are doing (lots of pausing). I'm a novice, so that's my bias. Thank you!
@@stratascratch Hey Nate, I think this is one of the best data science channels out there. I've spent the last few months learning DS from kaggle and tutorials on udemy/yt, but you're the first person to code the way I'd like to learn. Unfortunately, most people focus on the DS part, so they completely ignore good software development practices. I would love to see more series on model building or cleaning data.
@@grzegorzzawadzki3048 Thanks so much! Glad you enjoy these videos. I agree with you on creating more videos on model building and cleaning data. I wish I had the time to create those videos =(. The Python videos like this one take so much effort and time that I'm never able to do much else. I'll think about some other DS topics to turn into videos! Thanks for the kind words and for watching my videos!
Thanks for watching. If there are any requests, please let me know and I'll try to make a video about it. Also, let me know if you think the coding is too slow and should be faster. I'm not always sure if people want to see me actually type code or if they would rather just see me copy and paste the code in to make the video go faster.
Nate, this content is so good you don't even have to ask viewers to subscribe; anyone who values it as much as we do will subscribe and do whatever it takes to keep in touch. This is extremely great content; may you continue to soar in everything you do!
Great content! Very thoroughly and clearly explained. I appreciate you taking the time to make such great content! I would love to see more of these series because they are not only educational, but implementable!
Thank you so much, Nate, for all your help ❤ Could you please make projects from scratch about APIs & pipelines, in sales or any area you want? I've seen your video about it. Also, any advice for learning? I don't know what I should learn first in these fields.
Hey Nate, I am getting this error while moving the files to the datasets directory:

mv 'Customer Contracts$.csv' datasets
 ^
SyntaxError: invalid syntax

The datasets directory is created but the files are not moving.
I tried with this command:

import os
import shutil

def move_csv_files(csv_files, origin, target):
    # move files from the origin directory to the target directory
    print(origin)
    print(target)
    for csv in csv_files:
        shutil.move(os.path.join(origin, csv), target)
I use Oracle only sporadically and not very extensively. I would advise you to post your question on Stack Overflow; someone should be able to help you there. Also try Oracle Communities and the cx_Oracle community on GitHub: Issues · oracle/python-cx_Oracle
Awesome video (series). You explaining the code, and even going from beginner (video 1) to advanced (video 2) to expert (video 3), makes these such a valuable asset for data analysts/scientists. I would really love for you to continue with automations like this. These beat any Udemy/Coursera videos, any day! Is it possible to get a 4th video with more SQL stuff, such as: updating data in the DB with a new CSV file (maybe something that gets updated daily), or appending new data to existing data (overwriting whatever gets duplicated)? I would also love some automation with openpyxl (or similar libraries you are familiar with; maybe creating some charts with the cleaned data). I appreciate the effort you've put into these videos (I've purchased the lifetime access on the platform) and I will make sure others learn about your channel!
Thanks for your support! There are some vids here - ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-fklHBWow8vE.html and ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-77IVf0zgmwI.html - that cover the topics you're talking about. The only difference is that they use an API to collect data. I would follow the same guidelines but use a CSV and a pd.DataFrame rather than make an API call to collect the data. But topics like overwriting with new data, etc., are covered.
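To make the "append new data, overwrite duplicates" part concrete, here's a hedged sketch of an upsert keyed on an id column. sqlite3 stands in for Postgres purely so the snippet is self-contained; Postgres uses the same INSERT ... ON CONFLICT (id) DO UPDATE syntax, and the table and column names here are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL)")
conn.execute("INSERT INTO sales VALUES (1, 10.0), (2, 20.0)")

# a daily CSV load might yield rows like this: id 2 is a duplicate, id 3 is new
new_rows = [(2, 25.0), (3, 30.0)]
conn.executemany(
    "INSERT INTO sales VALUES (?, ?) "
    "ON CONFLICT(id) DO UPDATE SET amount = excluded.amount",
    new_rows,
)
rows = conn.execute("SELECT id, amount FROM sales ORDER BY id").fetchall()
print(rows)  # → [(1, 10.0), (2, 25.0), (3, 30.0)]
```

The duplicate id is overwritten and the new id is appended in a single statement, which avoids a separate delete-then-insert step.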
@@stratascratch I'm having an issue with a specific CSV I work with (I tested the final code with other CSVs and they work just fine). I'm getting this error: "QueryCanceled: COPY from stdin failed: error in .read() call: UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 6589: character maps to <undefined> CONTEXT: COPY sfs_tool_t1, line 1". It passes these prints: "opened db succesfully", "csv was created succesfully", "file opened in memory", and then comes the error. Any thoughts? (Googling doesn't provide any helpful suggestions.)
I've identified the columns that generate the issue (3 of them, containing text, like sentences). My CSV is an export from a SharePoint list. I suspect some of the characters get messed up while downloading, so there's probably an issue with the encoding of that text. Perhaps I should do something with the text within those columns (some cleaning).
@@diaconescutiberiu7535 Definitely the encoding. Try to force UTF-8 and that should fix it. I like to export from gSheets because they have the cleanest CSVs and I've never had an issue loading a csv I exported from Google. I always have problems loading csvs that are exported from Excel. Hope that helps
The real answer is that I randomly chose it over 10 years ago when I was just starting and kept with it. It's also super easy to deploy on AWS, and Postgres is better than other free options like MySQL for analytics. Here's an article with the comparison (hackr.io/blog/postgresql-vs-mysql). On the job, most companies use industry-grade DBs like Hive, Greenplum, Snowflake, and MS SQL Server. There are only slight differences in terms of syntax.
How should we go about it if we need to do some cleaning inside the CSV files? I have multiple CSV files which I need to upload to my Postgres DB; the files have different columns, so the cleaning is different for each column. If I just run this: "dataframe["country"] = dataframe["country"].replace(to_replace=['Kingdom','States','Kong','Emirates','Rico'], value=['United Kingdom','USA','Hong Kong','UAE','Puerto Rico'])" it will do the job and upload the file that does have a country column, but it will fail for the other CSVs. The script is giving me KeyError: 'country' (which is obvious, as the other files don't have it).
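One way around the KeyError (a sketch, not necessarily how the video would do it): guard each column-specific cleaning rule with a membership check on df.columns, so files that lack the column pass through untouched.

```python
import pandas as pd

def clean_country(df):
    # only apply the rule when the column actually exists in this file
    if "country" in df.columns:
        df["country"] = df["country"].replace(
            to_replace=["Kingdom", "States", "Kong", "Emirates", "Rico"],
            value=["United Kingdom", "USA", "Hong Kong", "UAE", "Puerto Rico"],
        )
    return df

df1 = clean_country(pd.DataFrame({"country": ["Kingdom", "Kong"]}))
df2 = clean_country(pd.DataFrame({"amount": [1, 2]}))  # no country: no KeyError
print(df1["country"].tolist())  # → ['United Kingdom', 'Hong Kong']
```

The same pattern scales to one small cleaning function per column, applied in a loop over all CSVs, with each function deciding for itself whether it applies.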
This video is almost two years old and I have no hope of getting an answer, but here it comes: I don't understand DBs completely, but wouldn't it be bad practice to start a connection multiple times in the for loop? From what I know, I'd start the connection above the for loop and then proceed with the loop. What do you think? Would it work?
@@stratascratch As far as I understood, the for loop will connect at each iteration and disconnect at the end of it. Did I get it wrong or something? Hahah. What I thought:

connect_function()
for_loop_function()
disconnect_function()

The way it is in the video, the connection happens inside the for loop. As we see at 22:58, the function "upload_db" is inside the for loop. I thought that starting a connection multiple times in the for loop could be a bad thing, and it would be better to start it before the loop, as I noted up there 🤔🤔
@@AzureCz No, it's not bad practice to open a connection only when you need it. To be honest, it really doesn't matter unless you're keeping the connection open for a long period of time, since you might get a timeout.
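For reference, the connect-once pattern discussed in this thread looks roughly like this. sqlite3 stands in for psycopg2 so the snippet runs on its own, but the shape (connect once, work in the loop, close once) is the same idea.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # one connection, opened before the loop
conn.execute("CREATE TABLE uploads (name TEXT)")

for name in ["a.csv", "b.csv", "c.csv"]:  # loop reuses the open connection
    conn.execute("INSERT INTO uploads VALUES (?)", (name,))

conn.commit()
count = conn.execute("SELECT COUNT(*) FROM uploads").fetchone()[0]
conn.close()  # closed once, after the loop
print(count)  # → 3
```

Either structure works for a handful of files; reusing one connection mainly saves the per-file connect/disconnect overhead, at the cost of keeping it open long enough to risk a timeout on very long runs.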
Thank you for a great tutorial! How do you automate running scripts? Do you manually run the Python script every time you want to upload CSV files into the AWS database, or do you use scheduling software like Airflow or Jenkins so you don't need to worry about running it manually? Are you planning on covering scheduling a script that runs periodically?

As for making an interactive dashboard that ingests real-time data, how can I refresh the database automatically so what's displayed on the dashboard is real-time? Let's say operators run a machine that outputs CSV files in a specific format, which are stored on an internal server in our company. I basically want my web application to ingest the CSV files in that directory, get them into the AWS database, and display the aggregated result in real time. Any ideas? Sorry for the long questions, but I would appreciate your help!
Automating scripts is the hardest part of the process. It's not hard because it's technically difficult; it's hard because there are so many tools out there that can do it for you. I usually rely on whatever tools my company uses for automation, which has included Jenkins, Airflow, and Domino. For personal use, I either use Airflow (create an AWS EC2 instance and install Airflow) or just manually run the script each time I need it. I wasn't going to cover scheduling scripts in this series, but you're spot on with mentioning Jenkins and Airflow.
@@stratascratch In case you are running Airflow on an AWS EC2 instance, you still need to upload CSV files manually from local to the EC2 instance first and then use Airflow to move them into AWS Postgres, right? Is there a way to read files on a local machine from an EC2 instance?
@@steven345lll1 Yea, that's totally fair. For CSV files, I'm not aware of any automated way to upload them from local to EC2. I suppose you could write an Airflow py script that pings your local machine for CSVs (but it's probably best to ping a Google Drive account or something more static). Otherwise, I would build the pipeline so that it doesn't even use CSV files: all data goes from the API to the db, and all the transformations are done in Airflow. I've never had to implement this use case, but it does seem plausible.