Data Engineering Interview | System Design 

The Big Data Show
Subscribers: 111K
Views: 26K

Published: 27 Oct 2024

Comments: 28
@vineetjain7518 7 months ago
This is quality data engineering. There is something for everyone.
@TheBigDataShow 7 months ago
Thanks a lot Vineet. Keep learning 👏🎉
@abdulsami-xn6ss 7 months ago
❤❤❤
@ishwarkoki1119 1 month ago
I have compiled all the questions covered in the interview. Hope that helps some folks in their prep. Thanks, Ankur bhai, for the interview sessions.

Data modeling:
What is a fact table, what is a dimension table, and what is the difference between the two?
What are slowly changing dimensions? What is Type 2, and what are the other types?
What is the difference between a row-based file format and a columnar file format?
What is the difference between CSV and TSV versus Parquet, ORC, and Avro?
Disk-level storage understanding?
What are the different logging strategies? What is a write-ahead log?
What is a data lake (ADLS Gen2, S3) and what is a data warehouse (AWS Redshift, Snowflake)? Why do we need two different things?
What is Delta Lake, Apache Hudi, or Apache Iceberg?

Apache Spark:
What is the difference between the RDD, DataFrame, and Dataset APIs in Spark?
What is the difference between SparkContext and SparkSession?
What optimizations do we use with PySpark? (Convert to Parquet, ORC, or Avro, then use predicate pushdown.)
What is partition pruning?
What is a partition in Spark?
What is caching in Spark?
What are the different join strategies in Spark?
What is adaptive query execution in Spark?
How do we know that a certain part of the code is creating a problem for us in Spark? In short, how do we debug a Spark application? Say you have 1000 lines of code and you are getting an error during a join or a transformation. Should we wait for the error to be produced, or do we do some analysis while writing the code?
How do we handle future data growth and make sure our code is robust, so that it doesn't throw an error if the data size increases?
Spark works in memory only, so why do we need caching in Spark? What are the different caching techniques in Spark? What is persist?
Because a DataFrame is immutable in Spark, every transformation we apply produces a new DataFrame. Spark cannot hold the computed data of every DataFrame throughout the job execution, because that would soon eat up all the memory, so intermediate results that are no longer in use are discarded (reclaimed by garbage collection). If we don't explicitly ask Spark to cache a DataFrame, i.e. persist it in memory, its data is not kept, and the next action has to execute the entire DAG from the beginning: reading the data from disk and redoing all the transformations that were already executed for previous actions. To avoid recomputing the entire DAG and instead continue from where we left off, we cache the DataFrame.
How do we decide which form of persist to use: in-memory caching or disk caching?
Suppose you cached the result in memory, but in future runs the data gets huge. Will Spark keep caching, spill to disk, or throw an error?
What is lazy evaluation in Spark?
What is the Catalyst optimizer and how does it work?
Why do we need optimization techniques when there is already lazy evaluation, the Catalyst optimizer, and so on?

AWS:
What are SQS and SNS?
What is AWS Step Functions? Why is it used? How do we trigger a step function?
What is a Lambda function?
What is AWS EMR? What is it used for? What are its limitations?
What is AWS Glue, how is it better than, say, EMR, and why do we use it?
How do we handle metadata using Glue? Explain how a data catalog works.
What is AWS Athena? Why is it used? What problem does it solve?
What is AWS DynamoDB? Why is it used? What problem does it solve?
How do we decide when to use an RDBMS or a NoSQL database such as DynamoDB?
What is Elasticsearch? How is it different from other databases?

SQL:
Window functions in SQL: why are they used? What are the different types of window functions?
We also have GROUP BY for creating data partitions in SQL, so what is the need for window functions?

System Design:
Problem statement: We have a request from the HR department. They want to tackle harassment in the workplace, and for that they need conversation data from Teams calls, Teams chat, Slack, Zoom chat, etc., so that a data science or gen AI team can consume the data and build data models and solutions around it.
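The caching explanation in the comment above (without a cache, every action re-runs the lineage from the source) can be illustrated with a toy model. This is a minimal sketch in plain Python, not Spark's API: `LazyFrame`, `read_source`, and `run` are hypothetical names invented for the illustration, and the "source read" counter stands in for re-reading data from disk.

```python
from typing import Callable, List, Optional


class LazyFrame:
    """Toy stand-in for a DataFrame: it holds a recipe (lineage), not data."""

    def __init__(self, compute: Callable[[], List[int]]):
        self._compute = compute              # how to rebuild the rows
        self._cached: Optional[List[int]] = None
        self._cache_enabled = False

    def map(self, fn: Callable[[int], int]) -> "LazyFrame":
        # A transformation only records a new recipe; nothing runs yet.
        return LazyFrame(lambda: [fn(x) for x in self._rows()])

    def cache(self) -> "LazyFrame":
        # Mark for caching; rows are stored when the first action runs.
        self._cache_enabled = True
        return self

    def _rows(self) -> List[int]:
        if self._cached is not None:
            return self._cached
        rows = self._compute()
        if self._cache_enabled:
            self._cached = rows
        return rows

    def count(self) -> int:                  # action
        return len(self._rows())

    def total(self) -> int:                  # action
        return sum(self._rows())


source_reads = 0


def read_source() -> List[int]:
    """Pretend to read from disk and count how often that happens."""
    global source_reads
    source_reads += 1
    return [1, 2, 3, 4]


def run(use_cache: bool) -> int:
    """Run two actions on one derived frame; return the number of source reads."""
    global source_reads
    source_reads = 0
    frame = LazyFrame(read_source).map(lambda x: x * 10)
    if use_cache:
        frame.cache()
    frame.count()
    frame.total()
    return source_reads
```

In this model `run(False)` reads the source once per action, while `run(True)` reads it only for the first action. In PySpark the corresponding calls on a real DataFrame are `df.cache()` and `df.persist(...)` with a storage level.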
@brownwolf05 7 months ago
For the last question, about sharing data with external systems: we could create a public portal with role-based access and a data retrieval request form. The form takes the email address to which the data is sent through an SMTP server, and the data is also available for download from the portal. The data is always filtered, because the request form has a field for the user whose data is needed and the date range to retrieve for that user. Kindly share your thoughts on this approach.
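The request flow described in the comment above (role check first, then data filtered by user and date range) could look roughly like this. It is a minimal sketch under stated assumptions: `Message`, `RetrievalRequest`, `ALLOWED_ROLES`, the role name `hr_investigator`, and `fulfil` are all hypothetical, and the email/SMTP delivery step is left out.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class Message:
    user: str
    sent_on: date
    text: str


@dataclass(frozen=True)
class RetrievalRequest:
    requester_role: str   # role of the person filling in the form
    user: str             # user whose data is requested
    start: date           # inclusive start of the date range
    end: date             # inclusive end of the date range


# Hypothetical set of roles allowed to request conversation data.
ALLOWED_ROLES = {"hr_investigator"}


def fulfil(request: RetrievalRequest, messages: List[Message]) -> List[Message]:
    """Return only the messages the request is entitled to see."""
    if request.requester_role not in ALLOWED_ROLES:
        raise PermissionError(f"role {request.requester_role!r} may not request data")
    if request.start > request.end:
        raise ValueError("start date must not be after end date")
    return [
        m for m in messages
        if m.user == request.user and request.start <= m.sent_on <= request.end
    ]
```

Filtering inside `fulfil` rather than in the caller means no code path can return unfiltered data, whether the result is emailed or downloaded from the portal.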
@TheBigDataShow 7 months ago
This is a really good approach and answer Arnav✨👏👍
@brownwolf05 7 months ago
We can optimise this a bit more by digging deeper and filtering the data at an early stage. This is text data that will mostly be consumed in JSON format through APIs, which are slow to read. So rather than taking a one-time dump from the API, we can have an incremental load pipeline that writes the data to a document database, so that we can fetch it with lower latency when needed and then continue with the rest of the process. Correct me wherever I can improve my observations and thinking. @TheBigDataShow
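The incremental load suggested in the comment above is commonly built around a watermark: each run fetches only records changed since the previous run and upserts them into the document store. The following is a minimal sketch, assuming a source API that can filter by an `updated_at` timestamp. `Record`, `incremental_load`, `FAKE_API`, and `fake_fetch_since` are hypothetical names, and a plain dict stands in for the document database.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Record:
    id: str
    updated_at: int   # e.g. epoch seconds of the last change
    text: str


def incremental_load(
    fetch_since: Callable[[int], List[Record]],
    store: Dict[str, Record],
    watermark: int,
) -> int:
    """Upsert records changed after `watermark`; return the new watermark."""
    new_watermark = watermark
    for record in fetch_since(watermark):
        store[record.id] = record                      # upsert by document id
        new_watermark = max(new_watermark, record.updated_at)
    return new_watermark


# Hypothetical in-memory stand-in for the slow chat API.
FAKE_API: List[Record] = [
    Record("m1", 100, "hello"),
    Record("m2", 200, "hi"),
]


def fake_fetch_since(ts: int) -> List[Record]:
    """Return only records changed strictly after `ts`."""
    return [r for r in FAKE_API if r.updated_at > ts]
```

A scheduler would persist the returned watermark between runs, so each run touches only new or edited messages instead of re-reading the full history.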
@priyankashaw2956 7 months ago
All the available interviews on this channel are great and helpful. Thank you for uploading, and keep going.
@TheBigDataShow 7 months ago
Thank you Priyanka 🙇🙏🎊
@Sandip_Patle 7 months ago
Got a great understanding of Big Data interviews again today. Thanks for such useful content.
@TheBigDataShow 7 months ago
Thank you, @Sandip_Patle. Thank you for your kind words.
@Sandip_Patle 7 months ago
Sir, can I get a video where the candidate explains their data engineering project in the retail domain specifically? I've been following this channel for a long time now, but I haven't seen such a project in any mock interview so far. Perhaps I missed it. Could you please share the relevant link if possible?
@vikram--krishna 5 months ago
Can you please share your approach to the question below?
Design a data pipeline for a news broadcast app.
Considerations:
1. Active users: 1 million
2. News will be sent as push notifications
3. Users can comment on each news item
4. Users can like or dislike the news
5. Based on the reactions, customize the news type for a specific user group by running the ML model
6. The pipeline should be fast / near real time
7. Users should also get messages based on their current location (local news)
@savirawat6671 6 months ago
Great video, in-depth knowledge shared.
@TheBigDataShow 6 months ago
Thank you for your kind words 😊 Keep learning
@TheBigDataShow 6 months ago
We have more than 25 other Data Engineering Mock Interview videos. Do watch them in your free time and let me know your thoughts.
@lakshaychopra 7 months ago
God bless your channel 🙏🏻
@TheBigDataShow 7 months ago
Thank you for your kind words Lakshay :)
@louisxuan-em6lk 7 months ago
Why can't I find the break points of the video?
@TheBigDataShow 7 months ago
There are chapters listed in the video description. You can click on them and use them as break points.
@bhaveshchavan6075 7 months ago
Hello, can you conduct a data engineer interview for a fresher? This is a very advanced interview for us freshers who are doing the CDAC PG-DBDA course.
@TheBigDataShow 7 months ago
There are many videos for freshers in the Data Engineering playlist. Try watching the older videos in the playlist. All of them are available on our channel.
@sharankarthick3364 6 months ago
Insightful!!
@TheBigDataShow 6 months ago
We are also posting multiple Data Engineering interview questions for practice in the community section of our YouTube channel. Visit our channel and go to the community tab to find all the practice questions. And do watch our other Data Engineering mock interviews in the Mock Interview playlist too. We have more than 25 Data Engineering mock interviews.
@Jalabulajunx 7 months ago
The candidate did not state why they need a data lake vs a warehouse. A lake can store semi-structured and unstructured data, but a warehouse can't.
@TheBigDataShow 7 months ago
Please check 10:41. But your answer is also good 👏 Keep learning.
@Jalabulajunx 7 months ago
@TheBigDataShow Now I see you discussed it. Thanks for your response.