Many thanks, you are simply superb, one of the best resources available on the internet. The best part of all the workshops you share is that they always have practical content. Truly appreciated. Many thanks!
wow wow wow ... just awesome, Sir. Thank you so much for taking the time to build this beautiful resource for all the beginners to learn from your knowledge. Thank you once again 🙏🙏
Once again, this is a great tutorial. Thank you. I was wondering what your view is on running Spark ETL on AWS Glue versus an Amazon EMR Spark cluster. Which would you prefer between these two services, assuming AWS cost isn't a concern?
If you keep cost aside, the primary differences are: 1. Glue is serverless; EMR is IaaS. 2. Glue has scheduling and workflow mechanisms built in; EMR needs support from other services like CloudWatch and Step Functions. 3. Glue supports Scala, PySpark, and Python shell only; EMR supports a wider range of frameworks such as Hive, Pig, and HBase. So my recommendation is to use Glue if you are working with Scala, Python, or PySpark. But if you are running Hive- or Pig-like programs, EMR is the choice. Hope it helps.
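To make the serverless difference concrete, here is a minimal boto3 sketch of starting a Glue job run. The job name and argument are hypothetical placeholders, and the actual API call is commented out so the sketch runs without AWS credentials:

```python
# Sketch: kicking off a serverless Glue job run (job name and
# arguments are hypothetical placeholders).
def build_glue_run(job_name, input_path):
    """Build the keyword arguments for glue.start_job_run."""
    return {
        "JobName": job_name,
        # Job arguments are passed through to the Glue script at run time
        "Arguments": {"--input_path": input_path},
    }

run_args = build_glue_run("my-etl-job", "s3://my-bucket/input/")
print(run_args["JobName"])  # my-etl-job

# The actual call needs boto3 and AWS credentials:
# import boto3
# glue = boto3.client("glue")
# response = glue.start_job_run(**run_args)
# print(response["JobRunId"])
```

With Glue there is no cluster to provision; you start a named job and the service allocates capacity for you.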
Hi, first of all thank you for this video. My question: while I successfully created the cluster and notebook, my Jupyter notebook says "kernel error" and I am unable to solve it. My cluster is ready to use.
Sir, can we get a dedicated playlist to master EMR, or any other open-source resources to help us learn it from scratch, with the same pattern of teaching new things and implementing them at the same time? If possible, please prepare a dedicated EMR-focused playlist. Jai Hind
Not sure I get the question. Why would you call the notebook from the job using boto3? If you want some data processing, simply create an EMR step and submit it. Hope it helps.
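A minimal sketch of submitting a PySpark script as an EMR step with boto3. The cluster ID and S3 path are hypothetical placeholders, and the actual submission call is commented out so the sketch runs without AWS credentials:

```python
# Sketch: defining a spark-submit step for an EMR cluster
# (cluster id and script path are hypothetical).
def build_spark_step(script_s3_path):
    """Build the step definition that spark-submits a script on the cluster."""
    return {
        "Name": "pyspark-task",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",  # EMR's generic command runner
            "Args": ["spark-submit", "--deploy-mode", "cluster", script_s3_path],
        },
    }

step = build_spark_step("s3://my-bucket/scripts/process.py")
print(step["HadoopJarStep"]["Args"][0])  # spark-submit

# The actual submission needs boto3 and AWS credentials:
# import boto3
# emr = boto3.client("emr")
# emr.add_job_flow_steps(JobFlowId="j-XXXXXXXXXXXXX", Steps=[step])
```

Unlike the Glue route, here you manage the cluster yourself and attach work to it as steps.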
I tried the workshop myself and followed all the steps carefully. When I tried PySpark programming to run tasks using the notebook, I clicked Run and nothing happened. I do not see anything in the output folder. Please help.
For step 5, where you write code in the Jupyter notebook, can you please share the output of each of the code statements you are running? That might give me some clue.
@@AWSTutorialsOnline I tried again, starting with the first line of code (to import the library). I copied the code and clicked Run (as per the steps in the tutorial); it does not give any output and just jumps to a new line.
Hi - I published a workshop which can help you. Here is the link - aws-glue-pyspark-lab.s3-website-eu-west-1.amazonaws.com/labs/ It covers working with the Glue Data Catalog and a Redshift cluster, but the same code can be used with PostgreSQL as well. Hope it helps.
@@AWSTutorialsOnline thanks for the input. But if we use a JDBC connection in a dynamic frame to write the data into RDS, we will get performance issues. Is there any other way to do this?
Suppose I have 1 master and 1 core node in EMR. [ df = spark.read.csv("s3://...../demo.csv") ] I submit this task in EMR. After executing this line of code, I should have data in the dataframe. But is that demo.csv data also getting saved in HDFS? If yes, how can I find the demo.csv data in HDFS? And if not, where is the data stored after reading from S3?
Sorry Rishi, I somehow missed your comment. Apologies for that. Reading from S3 does not copy demo.csv into HDFS. A DataFrame is lazy: that line only builds an execution plan, and nothing is read until you run an action such as count or write. At that point the data is loaded straight from S3 into the executors' memory as partitions. Nothing lands in HDFS unless you explicitly write it there (for example, df.write to an hdfs:// path), though Spark may spill or shuffle to local disk during processing.