
PySpark Scenarios 2: How to read variable number of columns data in a PySpark DataFrame

TechLake
46K subscribers
15K views

Published: 27 Oct 2024

Comments: 38
@krishnachaitanyareddy2781 1 year ago
Excellent video thanks for sharing sir
@mohansonale8020 8 months ago
Really, it's an amazing knowledge-sharing video.
@jaydeeppatidar4189 2 years ago
Those were good basic and must-know questions. Thanks for sharing!
@_jasonj_ 1 year ago
Great video, this is exactly what I need, but I have a question. When I split my data, which is something like 1234|5678, using | as the delimiter instead of ",", why does my result come out as ["1","2","3","4","|","5","6","7","8"] instead of ["1234","5678"]? *EDIT* - found the solution: the pipe delimiter needs to be escaped in the split statement as "\\|" for it to work properly.
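(For reference, a minimal sketch of that escape, assuming an existing SparkSession; the DataFrame and column names here are made up for illustration. split() treats its pattern as a regex, so a bare | matches the empty string between characters, while "\\|" matches the literal pipe.)

from pyspark.sql.functions import split, col

df = spark.createDataFrame([("1234|5678",)], ["value"])

# "|" alone is regex alternation and splits between every character;
# escaping it as "\\|" splits on the literal pipe character.
df.select(split(col("value"), "\\|").alias("parts")).show(truncate=False)
# parts -> [1234, 5678]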
@sravankumar1767 2 years ago
Simply superb 👌 👏 👍
@lakshminarayana3168 10 months ago
Thanks for sharing the knowledge ❤
@sumantadutta5485 2 years ago
This implementation is very specific to one scenario and its assumptions. In real scenarios, one will not receive CSV data with a different number of values per record. The assumption here is that all data arrives in the correct schema order, like id, name, address, email, phone, so you can map each value to the correct column. We are just not showing which value belongs to which field but assuming it implicitly. Also, without a schema no downstream application will be able to handle this data, as it will never know which column contains what. Processing it as JSON could be the best way to handle a dynamic schema.
@TRRaveendra 2 years ago
That's true for on-premises data warehousing projects migrating to the cloud. But when it comes to advanced analytics projects where the source system is IoT or machine-generated data (for example server-side and machine-generated logs at network companies), you can expect different kinds of CSV files: with a header, without a header, with multiple headers, or with a variable number of columns.
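(As an aside on the JSON point above, a minimal sketch, assuming an existing SparkSession and a hypothetical people.json file with one JSON object per line; Spark infers the schema as the union of all keys it sees, so records with different fields still land in one DataFrame.)

# Each line is one JSON object; records may carry different subsets of keys.
df = spark.read.json("people.json")
df.printSchema()   # columns are the union of keys across all records
df.show()          # keys missing from a record show up as null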
@Someonner 2 years ago
There are enough egotistical idiots trying to flex while taking an interview who ask such sorts of questions.
@explorewithaj4580 2 years ago
Excellent👏
@ravulapallivenkatagurnadha9605 2 years ago
Nice video
@mohitmotwani9256 1 year ago
Defining a schema for the CSV can also solve the problem, but very interesting video. Thanks
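(For what it's worth, a minimal sketch of that explicit-schema approach, assuming an existing SparkSession; the file name and column names are assumptions taken from the example data discussed in the comments.)

from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField("id", StringType(), True),
    StructField("name", StringType(), True),
    StructField("address", StringType(), True),
    StructField("email", StringType(), True),
    StructField("phoneno", StringType(), True),
])

# With the default PERMISSIVE mode, rows that have fewer values than the
# schema get nulls in the trailing columns instead of failing the read.
df = spark.read.csv("customers.csv", schema=schema)
df.show()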
@mehmetkaya4330 2 years ago
Great! Thanks
@TRRaveendra 2 years ago
You're welcome
@manognachowdary9797 1 year ago
Thank you for educating! Is there a video on dynamically selecting specific column names from a source dataset and renaming them as per the target, to find mismatches between datasets? If there is one, please share it.
@akashsonone2838 1 year ago
Getting an error while running it on PyCharm:
for i in range(splitable_df.select(max(size("Splitable_col"))).collect([0][0])):
TypeError: 'int' object is not callable
What could be the reason for this?
@harishkanta3711 1 year ago
Hello, can you please explain: in df.select(max(size('splittable_col'))).collect()[0][0], why did we add collect()[0][0] at the end?
@TRRaveendra 1 year ago
collect() returns the data as a list of Row objects, so it's Python slicing: the first row, then the first item in it.
@harishkanta3711 1 year ago
@@TRRaveendra thank you
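(Putting this thread together, a rough sketch of the overall pattern from the video, assuming an existing SparkSession and a comma-delimited text file with a variable number of fields per line; the file name and column names are placeholders. Note that max is imported under an alias so it does not shadow Python's built-in max.)

from pyspark.sql.functions import split, size, col, max as max_

# Read each line as a single string column named "value".
raw_df = spark.read.text("customers.txt")

# Split each line on the delimiter into an array column.
splitable_df = raw_df.select(split(col("value"), ",").alias("splitable_col"))

# collect() returns a list of Row objects, so [0][0] pulls out the scalar:
# the field count of the widest row.
max_cols = splitable_df.select(max_(size("splitable_col"))).collect()[0][0]

# Build one column per position; shorter rows get nulls in the extra slots.
final_df = splitable_df.select(
    [col("splitable_col")[i].alias("col" + str(i)) for i in range(max_cols)]
)
final_df.show()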
@mohansonale8020 8 months ago
Another question: can we rename the above five columns of the final DataFrame with different names?
@krishnachaitanyareddy2781 1 year ago
Can I get a real-time project on Azure for interviews, with Azure Data Factory and Azure Databricks? Please let me know.
@purnimabharti2306 2 years ago
What if we want to give the column names dynamically as well, something else instead of c01?
@TRRaveendra 2 years ago
Use the toDF() or withColumnRenamed() functions to give new column names or rename existing columns.
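(A small sketch of both options, assuming the final DataFrame from the video has five positional columns; the new names are just examples.)

# toDF() replaces all column names positionally in one call.
renamed_df = final_df.toDF("id", "name", "address", "email", "phoneno")

# withColumnRenamed() renames one column at a time.
renamed_df = (final_df
              .withColumnRenamed("col0", "id")
              .withColumnRenamed("col1", "name"))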
@starmscloud 1 year ago
Try this: you can create a dictionary of old and new column names and apply it dynamically, just like below.

from pyspark.sql.functions import col

colDict: dict = {"col0": "id", "col1": "name", "col2": "address", "col3": "email", "col4": "phoneno"}
df1.select([col(column).alias(colDict.get(column, column)) for column in df1.columns]).display()
@chandramouli8407 2 years ago
Sir, are there any tutorials available on mainframes for beginners? Please share the link. Thank you sir 🙏🙏
@shubne 1 year ago
You can also use .option('mode', 'permissive')
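(For context, PERMISSIVE is Spark's default CSV mode; a small sketch, with the file name, schema and corrupt-record column name as assumptions, showing how it keeps bad rows instead of failing.)

df = (spark.read
      .option("mode", "PERMISSIVE")
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .schema("id STRING, name STRING, _corrupt_record STRING")
      .csv("customers.csv"))

# Rows that don't fit the schema are kept, with the raw line
# captured in the _corrupt_record column.
df.show(truncate=False)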
@srikantha7290 2 years ago
Hi Sir, I'm working in the IBM TSM backup domain for the past 6 years. I'm planning to switch my career to an Azure Data Engineering course. Please suggest the best way and training with job support. For Azure Data Engineer:
1. SQL Server & T-SQL queries
2. Azure fundamentals
3. Azure Active Directory
4. Azure Data Factory
5. Azure Synapse Analytics
6. Synapse Studio
7. Azure Storage (Blob)
8. Big data analytics (ADLS)
9. ADLA
10. U-SQL
11. Azure Databricks
12. Azure Key Vault
Are you covering all these topics? Please answer and share the complete details of the course. Kindly do the needful. I have also sent a mail, sir, please check and reply 🙏.
@jaydeeppatidar4189 2 years ago
You should go for the DP-900 and AZ-900 certifications first; you will get knowledge about Azure resources by doing these two certifications. After that you should either go for DP-203 or start learning Azure Data Factory. You may also learn Apache Spark/Hadoop to be a data engineer, but if you specifically want to be an Azure data engineer then you should go for Azure Data Factory and Azure Synapse Analytics. Azure Key Vault, Azure Functions and Azure Active Directory are basic things, so don't panic about those. A real Azure data engineer mainly works with Azure Data Factory, Azure Databricks and Azure Data Lake Gen2. I would suggest going for Apache Spark and Databricks, which are popular nowadays. Azure Data Factory + Azure Databricks is a must for an Azure data engineer. Apache Spark + Databricks + basics of Hadoop and HDFS should be sufficient for a data engineer as a starter. Please note that in the data engineering field you must have strong SQL knowledge.
@srikantha7290 2 years ago
@@jaydeeppatidar4189 Thank you so much for the reply. Sir, are you providing any training? Please share your contact details via mail; I already sent one, please check and reply sir 🙏.
@jaydeeppatidar4189 2 years ago
@@srikantha7290 No, I am a fresher, but I went through this situation myself. I was also confused by this type of question at an early stage of my career, so I thought to share my knowledge with you so that you can at least start learning. I would suggest learning from Udemy for a better experience and well-structured learning.
@srikantha7290 2 years ago
@@jaydeeppatidar4189 ok thanks
@vaibhavverma1340 1 year ago
Hello sir, how do I write the same code in PySpark in the PyCharm IDE?
@TRRaveendra 1 year ago
Create a SparkSession and specify the data file location on your local system.
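(For example, a minimal local setup in a PyCharm script might look like this; the app name and file path are placeholders.)

from pyspark.sql import SparkSession

# local[*] runs Spark embedded in this Python process, using all local cores.
spark = (SparkSession.builder
         .master("local[*]")
         .appName("variable-columns-demo")
         .getOrCreate())

raw_df = spark.read.text("C:/data/customers.txt")
raw_df.show(truncate=False)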
@aryic0153 1 year ago
In Scala?
@mdfurqan 1 year ago
There is one catch: what if the value itself contains the comma (the delimiter)?
@TRRaveendra 1 year ago
Then the data needs to be quoted, enclosed in double or single quotes.
@mdfurqan 1 year ago
@@TRRaveendra Let's suppose it's generated data; do we have something to handle this at the time of writing the code?
@TRRaveendra 1 year ago
@@mdfurqan Use the read mode option DROPMALFORMED and it will reject any bad rows.
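(A small sketch of that option; the file name and schema are assumptions. With DROPMALFORMED, rows that don't conform to the declared schema are silently dropped rather than failing the read.)

df = (spark.read
      .option("mode", "DROPMALFORMED")
      .schema("id STRING, name STRING, address STRING, email STRING, phoneno STRING")
      .csv("customers.csv"))

# Malformed rows (e.g. wrong field count) are dropped from the result.
df.show()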