Great video, this is exactly what I need, but I have a question. When I split my data, which is something like 1234|5678, using | as the delimiter instead of a comma, why does my result come out as ["1","2","3","4","|","5","6","7","8"] instead of ["1234","5678"]? *EDIT* - found the solution: since split treats the delimiter as a regex, the pipe needs to be escaped in the split statement as "\\|" for it to work properly.
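The commenter's fix is right: Spark's split (like Java's String.split) treats the delimiter as a regular expression, and | is the regex alternation metacharacter. A minimal sketch of the same behavior using Python's re module, which also treats the pattern as a regex:

```python
import re

s = "1234|5678"

# "|" is a regex metacharacter (alternation), so a literal pipe
# must be escaped in the pattern.
print(re.split(r"\|", s))  # ['1234', '5678']

# re.escape builds the escaped pattern for you, which is handy
# when the delimiter comes from a variable:
delimiter = "|"
print(re.split(re.escape(delimiter), s))  # ['1234', '5678']
```

The same idea carries over to PySpark: pass "\\|" (or a pre-escaped pattern) to pyspark.sql.functions.split instead of the bare "|".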
This implementation is very specific to one scenario and its assumptions. In real scenarios, one will not receive CSV data with a different number of values per field. The assumption here is that all data arrives in the correct schema order, like id, name, address, email, phone, so you can map each value to the correct column. We are just not showing which value belongs to which field, but assuming it implicitly. Also, without a schema, no downstream application will be able to handle this data, as it will never know which column contains what. Processing it as JSON could be the best way to handle a dynamic schema.
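To illustrate the JSON point: because each JSON record carries its own field names, rows with different shapes can be normalized without relying on positional order. A minimal pure-Python sketch (the records and field names are made up for illustration):

```python
import json

# Records with a dynamic schema: each row may carry different fields.
records = [
    '{"id": 1, "name": "Alice", "email": "a@example.com"}',
    '{"id": 2, "phone": "555-0100"}',
]

rows = [json.loads(r) for r in records]

# Union of all field names seen across the rows
all_keys = sorted({key for row in rows for key in row})

# Normalize: every row gets every key; missing values become None
normalized = [{k: row.get(k) for k in all_keys} for row in rows]
print(normalized)
```

Spark does essentially the same thing when it infers a schema from JSON: fields absent in a record simply come out as null.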
That's true for on-premises data warehousing projects migrating to the cloud. When it comes to advanced analytics projects where the source system is IoT or machine-generated data (for example, server-side data from network-based companies), you can expect many kinds of CSV files: with a header, without a header, with multiple headers, or with a variable number of header columns.
Thank you for educating! Is there a video on dynamically selecting specific column names from a source dataset and renaming them to match the target, in order to find mismatches between datasets? If there is one, please share it.
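While waiting for a dedicated video, the mismatch part can be sketched with plain set logic over the two column lists. The column names below are hypothetical, just to show the pattern:

```python
# Hypothetical column lists for a source and a target dataset
source_cols = ["id", "name", "address", "email", "phoneno"]
target_cols = ["id", "full_name", "address", "email", "phone"]

# Columns present on one side but missing on the other
missing_in_target = [c for c in source_cols if c not in target_cols]
extra_in_target = [c for c in target_cols if c not in source_cols]

print(missing_in_target)  # ['name', 'phoneno']
print(extra_in_target)    # ['full_name', 'phone']
```

With Spark DataFrames the same comparison works on df.columns, which is just a Python list of strings.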
Getting an error while running it in PyCharm: for i in range(splitable_df.select(max(size("Splitable_col"))).collect([0][0])): throws TypeError: 'int' object is not callable. What could be the reason for this?
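Two likely culprits, both demonstrable in plain Python. First, the parentheses are misplaced: it should be .collect()[0][0], not .collect([0][0]), because [0][0] just evaluates to the integer 0. Second, this exact error message appears whenever a name that should be a function (such as max, if Python's builtin shadows or is shadowed by another binding) has been rebound to an int earlier in the script. A sketch:

```python
# [0][0] builds the one-element list [0] and indexes it, yielding the
# int 0 -- so .collect([0][0]) calls collect(0), not collect()[0][0].
assert [0][0] == 0

# Rebinding a callable name to an int reproduces the exact error:
max = 10          # accidentally shadowing the name 'max'
try:
    max(1, 2)     # now calling the int 10
except TypeError as e:
    print(e)      # 'int' object is not callable
```

So the fix is to write range(splitable_df.select(...).collect()[0][0]) and to make sure max and size really refer to pyspark.sql.functions (for example via import pyspark.sql.functions as F, then F.max and F.size) rather than a variable defined elsewhere.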
Try this: you can create a dictionary of old and new column names and use it dynamically, just like below.

from pyspark.sql.functions import col

colDict = {"col0": "id", "col1": "name", "col2": "address", "col3": "email", "col4": "phoneno"}
df1.select([col(c).alias(colDict.get(c, c)) for c in df1.columns]).display()
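The core of that trick is dict.get(key, default), which falls back to the original column name when there is no mapping, so unmapped columns pass through unchanged. A pure-Python sketch of the same lookup (the column names are hypothetical):

```python
col_dict = {"col0": "id", "col1": "name", "col2": "address",
            "col3": "email", "col4": "phoneno"}

# Hypothetical DataFrame columns; "col5" has no mapping and is kept as-is
df_columns = ["col0", "col1", "col2", "col5"]

# dict.get(key, default) returns the mapped name, or the original
# name when the key is absent from the dictionary
renamed = [col_dict.get(c, c) for c in df_columns]
print(renamed)  # ['id', 'name', 'address', 'col5']
```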
Hi Sir, I have been working in the IBM TSM backup domain for the past 6 years. I'm planning to switch my career to Azure Data Engineering. Please suggest the best way and training with job support. For Azure Data Engineer: 1. SQL Server & T-SQL queries 2. Azure fundamentals 3. Azure Active Directory 4. Azure Data Factory 5. Azure Synapse Analytics 6. Synapse Studio 7. Azure Storage (Blob) 8. Big data analytics (ADLS) 9. ADLA 10. U-SQL 11. Azure Databricks 12. Azure Key Vault. Are you covering all these topics? Please answer and share complete details of the course. Kindly do the needful. I have also sent a mail, sir, please check and reply 🙏.
You should go for the DP-900 and AZ-900 certifications first; you will gain knowledge about Azure resources by doing those two certifications. After that, you should either go for DP-203 or start learning Azure Data Factory. You could also learn Apache Spark/Hadoop to become a data engineer, but if you specifically want to be an Azure data engineer then you should go for Azure Data Factory and Azure Synapse Analytics. Azure Key Vault, Azure Functions, and Azure Active Directory are basic things, so don't panic about those. A real Azure data engineer mainly works with Azure Data Factory, Azure Databricks, and Azure Data Lake Gen2. I would suggest going for Apache Spark and Databricks, which are popular nowadays. Azure Data Factory + Azure Databricks is a must for an Azure data engineer. Apache Spark + Databricks + the basics of Hadoop and HDFS should be sufficient for a data engineer as a starter. Please note that in the data engineering field you must have strong SQL knowledge.
@@jaydeeppatidar4189 Thank you so much for the reply. Sir, are you providing any training? Please share your contact details via mail; I already sent one, please check and reply sir 🙏.
@@srikantha7290 No, I am a fresher, but I went through this situation myself. I was also confused by these kinds of questions in the early stage of my career, so I thought I would share my knowledge with you so that you can at least start learning. I would suggest learning from Udemy for a better experience and well-structured learning.