I have watched or read many explanations of the differences among these 3 terms, but so far this video is the simplest, clearest and easiest to understand. Thanks a lot!!!
Thanks Chandu for making these concepts so simple to understand. Whenever I get confused I just refer to your videos for quick and accurate understanding of the concepts.
Very clear and simple explanation, thank you :) Just one point: BigQuery is not a data lake, it is a data warehouse. I thought data lakes are called so when the architecture behind them is based on Hadoop. What do you think?
Good stuff. What's your take on BAM rediscovered with Activity Schema as Time Series and LTE (mostly materialised views)? Do you see it as something in the middle between databases and data warehouses for analytical workloads, or just another modelling approach?
You could keep it in a database. If you end up doing analysis or asking questions where the structure of your stored data is not working for you, then you can reshape it and store it (and call it a warehouse).
You cannot imagine how often I have talked with senior management and totally disillusioned them by explaining what a data lake is. It's just the next buzzword, not THE solution to all our problems .. sure, it's useful for its specific use case .. but that's it ^^
I don't understand why historic information should be put in a different system (the data warehouse). If you wish to remove a product (in this case, a chocolate) from your active product line, why do you even need to delete the item from the db? You could just keep the product and maintain the info about the active portfolio in the attributes, or set up a different table for discontinued/active products. Having a separate system seems like an overly complex way to maintain this information. Can someone explain?
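For anyone who wants to see the "keep the product and flag it" idea from the comment above in concrete terms, here is a minimal sketch using Python's built-in sqlite3 module. The table name `products`, the `is_active` column and the product names are all made up for illustration; this is not taken from the video.

```python
import sqlite3


def build_catalog():
    """Create an in-memory product table with a hypothetical is_active flag."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE products ("
        " id INTEGER PRIMARY KEY,"
        " name TEXT NOT NULL,"
        " is_active INTEGER NOT NULL DEFAULT 1)"
    )
    # Illustrative rows only; every product starts out active.
    conn.executemany(
        "INSERT INTO products (name) VALUES (?)",
        [("Dark Chocolate",), ("Milk Chocolate",), ("Mint Chocolate",)],
    )
    conn.commit()
    return conn


def discontinue(conn, name):
    """Mark a product as discontinued instead of deleting its row."""
    conn.execute("UPDATE products SET is_active = 0 WHERE name = ?", (name,))
    conn.commit()


def active_products(conn):
    """Names of products that are still in the active portfolio."""
    rows = conn.execute(
        "SELECT name FROM products WHERE is_active = 1 ORDER BY name"
    )
    return [row[0] for row in rows]


def all_products(conn):
    """Names of every product, including discontinued ones."""
    rows = conn.execute("SELECT name FROM products ORDER BY name")
    return [row[0] for row in rows]
```

After `discontinue(conn, "Mint Chocolate")` the row is still there for historical questions, and only `active_products` stops returning it. Whether this is enough, or whether a separate warehouse is worth it, depends on the kind of analysis you end up doing.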
AWS and Azure are not data lakes, they are cloud platforms. S3 and Blob Storage are examples of data lakes on these platforms. On GCP the example is Cloud Storage and NOT BigQuery; BQ is not a data lake.
There is no easy answer to this kind of question. ETL choices depend on existing systems, DWH architecture, technical competencies and personal preferences. If you are learning, then I say learn Power BI first as it has wider applicability. All the best.
Little correction: a data warehouse is a system and/or db where hundreds of heterogeneous dbs (e.g. chocolate db, biscuits db, candy and ice cream dbs) or file-based sources like Excel and XML are modelled, stored or streamed together using an ETL tool, for data analytics, downstream applications, data science and AI builds.
I don't think any other video on the internet explains this difference as clearly as this video. Thank you brother. Keep posting more videos to educate us.
In a typical database there will be transactions taking place, like the insert, update or read of a table row, that are in line with a set of business cases. In a data warehouse there will be analysis taking place across multiple rows from multiple tables. A data lake is where data goes to get drowned.
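To make the contrast above a bit more concrete, here is a rough sketch in Python with the built-in sqlite3 module. The first three functions touch one row at a time (the transactional side), while the last one aggregates across many rows from two tables (the analytical side). The `products` and `orders` tables, their columns and the sample values are assumptions for illustration. A real warehouse would normally be a separate system; a single SQLite file is used here only to show the shape of the two workloads.

```python
import sqlite3


def setup():
    """Create two small illustrative tables in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE products (
            id INTEGER PRIMARY KEY,
            name TEXT,
            price REAL
        );
        CREATE TABLE orders (
            id INTEGER PRIMARY KEY,
            product_id INTEGER REFERENCES products(id),
            quantity INTEGER
        );
        """
    )
    conn.executemany(
        "INSERT INTO products (id, name, price) VALUES (?, ?, ?)",
        [(1, "Dark Chocolate", 2.0), (2, "Milk Chocolate", 1.5)],
    )
    conn.commit()
    return conn


def place_order(conn, product_id, quantity):
    """Transactional: insert a single row and return its id."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT INTO orders (product_id, quantity) VALUES (?, ?)",
            (product_id, quantity),
        )
    return cur.lastrowid


def change_quantity(conn, order_id, quantity):
    """Transactional: update a single row."""
    with conn:
        conn.execute(
            "UPDATE orders SET quantity = ? WHERE id = ?", (quantity, order_id)
        )


def read_order(conn, order_id):
    """Transactional: read a single row."""
    return conn.execute(
        "SELECT product_id, quantity FROM orders WHERE id = ?", (order_id,)
    ).fetchone()


def revenue_by_product(conn):
    """Analytical: aggregate across many rows from both tables."""
    rows = conn.execute(
        """
        SELECT p.name, SUM(o.quantity * p.price)
        FROM orders AS o
        JOIN products AS p ON p.id = o.product_id
        GROUP BY p.name
        ORDER BY p.name
        """
    )
    return rows.fetchall()
```

The transactional functions each answer a question about one business event, whereas `revenue_by_product` only makes sense once many such events have piled up.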
Super! In just 8 minutes, you have painted such a clear picture of database, data warehouse and data lake that I can never forget it, and in future, any time I deal with this terminology, I will have a crystal clear idea of what I am dealing with! You are a GREAT teacher Chandoo and I really appreciate your effort!
Great job, clear explanation and I also enjoy your humor. Would be great if you could create a video describing the difference between data scientist, engineer, analyst and architect. Kudos on your excellent work!
If you're starting in I.T. doing analysis type work, you'll start as an Analyst. This can be anything from reporting and automated feed maintenance/RCA to development. Most of the other 3 roles (maybe save for Data Scientist) start here. Data Engineer is probably the most logical next step from Analyst. You'll definitely be doing more development and analytical work as an Analyst prior to this. The move shifts your scope from retrieving data from a data warehouse/db/lake (lake is quite rare for a run of the mill analyst) to actually designing, and possibly doing some light architecting of, table/schema structures for data to be imported into from other sources (typically starting as transactional information into a database from an app, or maybe an external source of some sort). Typically as an Engineer you won't start on data warehouse modelling until you've had some experience with general transactional architecting/engineering, since the data within a warehouse shouldn't be updated/deleted, only inserted. It may possibly be deleted if you've archived it in some situation (like data that's over x years old, based on specific policies), but even then it probably wouldn't be. If the architecture allows, you may just duplicate the tables, or partition them in some way and then archive the older pages. Engineers may also determine certain structural recommendations (rowstore vs columnstore table structures, for example, or using NoSQL vs relational databases), but usually in concert with an Architect if the process being designed is large enough or has significant impact, especially in terms of performance. However, after discussions between Engineers and Architects, the Engineers (and to a lesser extent, Analysts) will IMPLEMENT the requirements of the agreed architecture.
Engineers are typically more hands on than Architects, but Architects may get their hands dirty if something is largely conceptual and they want to start plugging away earlier in the phase to ensure design solidity. A Data Architect does anything from designing the schema for your transactional infrastructure (your primary database), data warehouse, or even data lake, to helping navigate and determine how to import data into those repositories, to even more expansive things such as CI/CD pipelines, *maybe* networking tasks if you're familiar enough with that (usually system administrators do that, though), or even helping implement connection string/authentication against your cloud resource targets originating from nearly any source caller (an on-premises machine like a developer computer, a VM hosting an app service, a CI/CD agent, or a completely separate cloud service not native to your cloud service, even on a completely different domain or client server). An Architect is going to be responsible for HOW disparate system objects are going to interact with each other, and for any potential issues given certain implementations or design sequences. Typically Architects are going to have some knowledge of what different approaches are available and determine which makes sense given what's required for the need or problem that needs resolution. As an Architect you're not expected to know how to implement everything as if you were doing all the work yourself. However, having a basic understanding of the limitations of each element in the design will definitely help you determine what is possible and what may not be earlier in the design phase, which helps mitigate wasted developer time later during spikes (Proof of Concept phases) and helps with further engineering alignment tasks.
Most people consider Scientists the babies in the room because the data they require should be perfect, in that it should not need any changes to its representation outside of what the algorithmic modelling is concerned with. It's entirely possible a Scientist will ask the Engineer to modify schema and data to accommodate some sort of analysis or data modelling they're trying to complete. It's not atypical for an Engineer to work closely with a Scientist, but it is not typical for the Scientist to work with the Architect, aside from the initial standing up of a new Data Warehouse or Data Lake. Typically the Engineer maintains or may make the every-day changes to those structures once the inputs/outputs/transformational processes have already been established. Scientists are typically Statisticians or people from anything having to do with applied mathematics. They will also typically work with code that isn't strictly SQL, such as Python, R, Power BI, DAX (maybe MDX, but I think that's fallen largely by the wayside), etc. Scientists are tasked with supplying the answers to complex problems for the business using quantitative analysis. These are the people that determine what Ads you may see given your previous and most recent search history. Something you searched for 3 years ago may not be as relevant as something you searched for yesterday. That would be a typical example of what a Scientist may do. Also, things like Google Translate will be developed by the Scientist, but the Architect will design the bridges to source that data, whereas the Engineer will make that design a reality. The Analyst will make sure the data makes sense as it starts trickling through the design process, and if there are any issues, the Analyst, maybe working with the Engineer, will troubleshoot the why/how and determine a fix, which either of them may implement to ensure it works as intended.
If you look at it as a decision tree, it may look something like:
Analyst > Engineer > Architect
Analyst > Engineer > Scientist
Analyst > Scientist (again, typically short cut by a Masters in Statistics or similar)
Hope that helps!