Welcome to PySparkPulse, your go-to channel for mastering PySpark from beginner to advanced levels! Join me as we explore comprehensive tutorials, breaking down key concepts with practical examples. Dive into real-world applications, and gear up for the toughest interviews with our in-depth coverage of PySpark and data engineering interview questions. Whether you're a novice or seasoned pro, PySparkPulse is your gateway to mastering the art of big data processing. Subscribe now to stay ahead in the dynamic world of PySpark and elevate your data engineering skills. Let's spark innovation together! #PySpark #DataEngineering #PySparkPulse
query = spark.sql("""
    with cte as (
        select p.prescription_id, p.doctor_id, p.medication_id, m.medication_name
        from medications as m
        inner join prescriptions as p
            on m.medication_id = p.medication_id
    )
    select medication_name, count(distinct doctor_id) as distinct_doctor_count
    from cte
    group by medication_name
    having count(distinct doctor_id) >= 3
""")
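The CTE's logic above (medications prescribed by at least 3 distinct doctors) can be sanity-checked in plain Python without a Spark session. The (medication_name, doctor_id) pairs below are made up and only stand in for the joined medications/prescriptions rows:

```python
from collections import defaultdict

# Hypothetical pairs standing in for the joined rows; the data is made up.
rows = [
    ("Aspirin", 1), ("Aspirin", 2), ("Aspirin", 3), ("Aspirin", 3),
    ("Ibuprofen", 1), ("Ibuprofen", 1),
    ("Metformin", 2), ("Metformin", 4), ("Metformin", 5), ("Metformin", 6),
]

# Collect distinct doctors per medication, mirroring
# count(distinct doctor_id) ... group by medication_name.
doctors_per_med = defaultdict(set)
for med, doc in rows:
    doctors_per_med[med].add(doc)

# having count(distinct doctor_id) >= 3
result = sorted(m for m, docs in doctors_per_med.items() if len(docs) >= 3)
print(result)  # ['Aspirin', 'Metformin']
```

Note that repeated prescriptions by the same doctor (Aspirin by doctor 3, Ibuprofen by doctor 1) do not inflate the count, which is exactly why the query needs distinct.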
For question 2, it doesn't mention calculating the sum for the previous day and the current day. I believe it should be the sum per policy type, not for specific days; please check. Also, if you have the avg, why do we need the sum, since sum = 2 * avg here?
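The sum = 2 * avg observation only holds because exactly two values (previous day and current day) are aggregated; in general sum = n * avg for n values. A quick check with made-up numbers:

```python
# With exactly two values (previous day, current day), sum == 2 * avg.
prev_day, curr_day = 120.0, 80.0
values = [prev_day, curr_day]

total = sum(values)
avg = total / len(values)

print(total, avg)        # 200.0 100.0
assert total == 2 * avg  # holds only because len(values) == 2

# More generally, sum == n * avg for n values.
n_values = [10.0, 20.0, 30.0]
assert sum(n_values) == len(n_values) * (sum(n_values) / len(n_values))
```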
Please check the below link www.linkedin.com/feed/update/urn:li:activity:7160308705211645952?updateEntityUrn=urn%3Ali%3Afs_feedUpdate%3A%28V2%2Curn%3Ali%3Aactivity%3A7160308705211645952%29
QUESTION 1
-------------------
from pyspark.sql.functions import split

print('INPUT DATAFRAME')
data = [(1, "Alice", "123 Main Street,New York,USA"),
        (2, "Bob", "456 Oak Avenue,San Francisco,USA"),
        (3, "Carol", "789 Pine Road,Los Angeles,USA"),
        (4, "David", "321 Elm Lane,Chicago,USA"),
        (5, "Emily", "654 Maple Drive,Miami,USA")]
schema = ("emp_id", "emp_name", "emp_add")
df = spark.createDataFrame(data, schema)
df.show(truncate=False)

print('OUTPUT DATAFRAME')
# Note: .show() returns None, so keep the transformation and the display
# separate if you want to reuse the DataFrame later.
df1 = df.withColumn("street_name", split("emp_add", ",")[0]) \
        .withColumn("city_name", split("emp_add", ",")[1]) \
        .withColumn("country_name", split("emp_add", ",")[2]) \
        .drop("emp_add")
df1.show(truncate=False)

print('O/p by using the getItem() function')
df2 = df.withColumn("street_name", split("emp_add", ",").getItem(0)) \
        .withColumn("city_name", split("emp_add", ",").getItem(1)) \
        .withColumn("country_name", split("emp_add", ",").getItem(2)) \
        .drop("emp_add")
df2.show(truncate=False)

THEORY
--------------
In PySpark, the getItem() function retrieves an element from an array or a value from a map column within a DataFrame. The split() function returns an array column, which can be further processed with array functions or exploded into multiple rows if needed. Each element of the array corresponds to a substring resulting from the split operation. The limit argument of split() specifies the maximum number of resulting substrings, which is useful when you want to cap how many parts the original string is split into.
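The same street/city/country extraction can be sanity-checked in plain Python with str.split, which is what PySpark's split() does column-wise (the sample address reuses the data above). One caveat worth noting: the limit arguments of the two APIs are off by one:

```python
# Plain-Python analogue of split("emp_add", ",")[0..2] from the solution above.
emp_add = "123 Main Street,New York,USA"

street_name, city_name, country_name = emp_add.split(",")
print(street_name)   # 123 Main Street
print(city_name)     # New York
print(country_name)  # USA

# Caveat: Python's str.split(maxsplit=N) counts *splits* (giving N+1 parts),
# while PySpark's split(str, pattern, limit) caps the resulting *array length*
# at limit. So Python maxsplit=1 corresponds to PySpark limit=2.
parts = emp_add.split(",", 1)  # at most 1 split -> 2 parts
print(parts)  # ['123 Main Street', 'New York,USA']
```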
QUESTION 2
--------------------
from pyspark.sql.functions import row_number, col
from pyspark.sql.window import Window

print('INPUT DATASET')
data = [(1, 'Math', 90), (1, 'Science', 93), (1, 'History', 85),
        (2, 'Math', 85), (2, 'Science', 79), (2, 'History', 96),
        (3, 'Math', 95), (3, 'Science', 87), (3, 'History', 77),
        (4, 'Math', 78), (4, 'Science', 91), (4, 'History', 90),
        (5, 'Math', 92), (5, 'Science', 84), (5, 'History', 88)]
schema = ("id", "subject", "Marks")
df = spark.createDataFrame(data, schema)
df.show()

windowSpec = Window.partitionBy("subject").orderBy(col("Marks").desc())
df1 = df.withColumn("row_number", row_number().over(windowSpec))
print("OUTPUT AFTER APPLYING ROW_NUMBER() WINDOW FUNCTION")
df1.show(truncate=False)

print("OUTPUT OF STUDENT ID WHO SCORED TOP IN EACH SUBJECT")
df1.filter(df1.row_number == 1).select("id", "subject").show()

THEORY
-------------
PySpark window ranking functions:
row_number() gives a sequential row number, starting from 1, within each window partition.
rank() provides a rank to rows within a window partition; it leaves gaps in the rank when there are ties.
dense_rank() ranks rows within a window partition without any gaps; it is similar to rank(), the difference being that rank() leaves gaps when there are ties.
lag() is the same as the LAG function in SQL. It lets you access a previous row's value within the partition based on a specified offset, which is helpful for comparative analysis or calculating differences between consecutive rows.
lead() is the same as the LEAD function in SQL. Similar to lag(), it retrieves the column value from a following row within the partition based on a specified offset, which helps in accessing subsequent row values for comparison or predictive analysis.
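The difference between row_number(), rank(), and dense_rank() on ties can be sketched in plain Python; the marks below are made up, with one tie at 90 to show the behavior:

```python
# Marks sorted descending, as a window's ORDER BY ... DESC would produce.
marks = sorted([95, 90, 90, 85], reverse=True)

row_number, rank, dense_rank = [], [], []
for i, m in enumerate(marks):
    row_number.append(i + 1)              # always sequential, even on ties
    if i > 0 and m == marks[i - 1]:
        rank.append(rank[-1])             # tie keeps the same rank...
        dense_rank.append(dense_rank[-1])
    else:
        rank.append(i + 1)                # ...and rank() leaves a gap after it
        dense_rank.append(dense_rank[-1] + 1 if dense_rank else 1)  # no gaps

print(row_number)  # [1, 2, 3, 4]
print(rank)        # [1, 2, 2, 4]  <- gap: no rank 3
print(dense_rank)  # [1, 2, 2, 3]  <- no gap
```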
Good information. Constructive criticism: you were explaining too fast. The chart explanation could be one video and the practical another; that way, two videos of 10 to 12 minutes each would have been more helpful. Best of luck 👍🏻
There can be multiple reasons for this even when spill to disk is possible. Some tasks require a certain amount of execution memory, and if it is not available, that may lead to OOM. Or, if we cache some larger data, that may also lead to OOM.
So unionByName is used when columns are interchanged or switched/swapped between the given DataFrames, isn't it, Sir? So that it will align the columns by name.
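What unionByName does can be sketched in plain Python: rows are combined by column *name*, not by position, so swapped columns line up correctly. The column names and rows below are made up for illustration:

```python
# Two "DataFrames" with the same columns in a different order.
df_a_cols = ["id", "name"]
df_a_rows = [(1, "Alice"), (2, "Bob")]

df_b_cols = ["name", "id"]  # columns swapped relative to df_a
df_b_rows = [("Carol", 3)]

def union_by_name(cols_a, rows_a, cols_b, rows_b):
    """Realign rows_b to cols_a by column name, then concatenate (a sketch
    of what DataFrame.unionByName does, versus position-based union)."""
    index = [cols_b.index(c) for c in cols_a]  # where each column sits in b
    realigned = [tuple(row[i] for i in index) for row in rows_b]
    return rows_a + realigned

print(union_by_name(df_a_cols, df_a_rows, df_b_cols, df_b_rows))
# [(1, 'Alice'), (2, 'Bob'), (3, 'Carol')]
```

A plain position-based union would have produced ('Carol', 3) in the (id, name) slots, mixing up the columns; aligning by name avoids that.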
You haven't faced any Python coding questions, but to be on the safer side, if we need to prepare just for Python coding questions, which Python topics would you suggest?
Hey Priyam! Could you explain the Thoughtworks interview process and how you cleared it? Also, regarding the notice period, how should one approach it, and how did you approach it? Thanks, and appreciation for all the content.
Hi Sir,

pyspark:
from pyspark.sql.functions import col, count

joinDf = df.alias('a').join(df.alias('b'), col('a.ManagerId') == col('b.Id'))
final_df = joinDf.groupBy('b.Id', 'b.Name') \
                 .agg(count(col('a.Id')).alias('No_of_reportees')) \
                 .filter(col('No_of_reportees') >= 3)
final_df.show()

%sql
with cte as (
    select ManagerId, count(*) as no_of_reports
    from office
    group by ManagerId
    having no_of_reports >= 5
)
select o.Name as Emp_Name, c.no_of_reports
from cte c
join office o on c.ManagerId = o.Id

Output:
Emp_Name      no_of_reports
Jane Smith    6
David Lee     5
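The self-join logic above (count direct reportees per manager, then filter) can be sanity-checked in plain Python. The employee rows below are made up, not the actual office table, and the threshold follows the PySpark version (>= 3):

```python
from collections import Counter

# Hypothetical (Id, Name, ManagerId) rows; None means no manager.
employees = [
    (1, "Jane Smith", None),
    (2, "Alice", 1), (3, "Bob", 1), (4, "Carol", 1),
    (5, "Dan", 2), (6, "Eve", 2),
]

# group by ManagerId having count(*) >= 3, then join back for the name.
reportees = Counter(mgr for _, _, mgr in employees if mgr is not None)
names = {emp_id: name for emp_id, name, _ in employees}

result = [(names[mgr], n) for mgr, n in reportees.items() if n >= 3]
print(result)  # [('Jane Smith', 3)]
```

Jane Smith (Id 1) has three direct reportees, so she is the only manager that survives the filter; Alice (Id 2) has only two.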
%sql
with result as (
    select medication_id, count(distinct doctor_id) as count_distinct_doc
    from prescriptions
    group by medication_id
    having count_distinct_doc >= 3
)
select m.medication_name
from result r
join medications m on r.medication_id = m.medication_id