I was way too distracted by the way he was writing in reverse. I kept switching between being in awe at how well he was writing in reverse and following his explanation. So I had to watch it again after accepting the fact that he is a genius, and then I finally understood everything. Great explanation.
Why is it that every data lake explanation is full of theory without any concrete examples? Aren't we all here because we're SQL or cube programmers and want to know what's so great about data lakes? All I see is the same thing I do with SQL databases: import the data, prep and transform it, and then query it directly or build dashboard applications.
I agree. This video is very similar to what we do with SQL. He hasn't really told us what a data lake is, though whatever he did say about data lakes is true. Here is a list of things found in a data lake that are not possible in standard SQL-based RDBMSs:
1. Big Data
2. On the cloud (this is possible with an RDBMS too)
3. Separation of data from the data processing engine
4. Self-service model
5. ML (this can also be done with an RDBMS)
6. Data in native format (CSV/Parquet/JSON/Avro/...)
All of the above are common to Big Data in general. Here is the real data lake differentiator:
7. Central repository, meaning a single source of truth.
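Point 6 (data in native format) is the "schema-on-read" idea: files land in the lake as-is, and structure is imposed only when somebody reads them. A minimal stdlib-Python sketch of that idea; the lake directory and file names here are made up for illustration:

```python
import csv
import json
import tempfile
from pathlib import Path

# Hypothetical "lake" directory: files arrive in their native format,
# with no schema declared up front.
lake = Path(tempfile.mkdtemp())

# Ingest: land raw data as-is (JSON lines and CSV side by side).
(lake / "clicks.jsonl").write_text(
    '{"user": "a", "page": "/home"}\n{"user": "b", "page": "/pricing"}\n'
)
(lake / "orders.csv").write_text("order_id,amount\n1,9.99\n2,25.00\n")

# Schema-on-read: each consumer applies structure only at query time.
clicks = [json.loads(line) for line in (lake / "clicks.jsonl").read_text().splitlines()]
orders = list(csv.DictReader((lake / "orders.csv").open()))

print(clicks[0]["page"])                         # /home
total = sum(float(o["amount"]) for o in orders)  # ~34.99
print(total)
```

Contrast this with an RDBMS, where you would have to `CREATE TABLE` with a fixed schema before a single row could be loaded.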
Interesting, but a data lake is not only used for ML. It is usually used to store unstructured raw data. Some governance can be applied; however, you don't build dashboards directly out of the data lake. You first need to model that data into a data warehouse using dimensional modeling (allowing you to extract the different dimensions of your data). These dimensions, represented by a few tables, let you slice the data in multiple ways, making reporting, and thus dashboards, easy to build. This is why Airbyte/Fivetran + Snowflake + DBT is the most popular data stack on the market right now.
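The dimensional-modeling idea above can be sketched in a few lines: a fact table keyed to a dimension table, then "sliced" by a dimension attribute, which is the join a dashboard query would run against the warehouse. All table contents here are invented examples:

```python
from collections import defaultdict

# Hypothetical star schema: one dimension table, one fact table.
dim_product = {
    1: {"name": "Widget", "category": "Hardware"},
    2: {"name": "Gadget", "category": "Hardware"},
    3: {"name": "License", "category": "Software"},
}
fact_sales = [
    {"product_id": 1, "amount": 100},
    {"product_id": 2, "amount": 250},
    {"product_id": 3, "amount": 400},
]

# Slice revenue by a dimension attribute (category): join each fact row
# to its dimension row, then aggregate.
revenue_by_category = defaultdict(int)
for row in fact_sales:
    category = dim_product[row["product_id"]]["category"]
    revenue_by_category[category] += row["amount"]

print(dict(revenue_by_category))  # {'Hardware': 350, 'Software': 400}
```

Swap "category" for any other dimension attribute and the same loop slices the same facts a different way; that flexibility is what makes dashboards cheap to build on top of a dimensional model.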
Good explanation, thank you. However, the talk could have started with "Big Data": data lakes are intended to store, manage, and serve data of big volume, variability, and velocity. Data is ingested in native format. It needs to be kept organized, controlled, and managed - governance. Data needs to be served natively or processed further for other needs - reporting and visualization, recommendations, process automation, and more. Some real-life use cases would be a good way to start the discussion. If the viewer already knows bits of the data world (databases, data warehouses, data lakes, etc.), this helps consolidate that understanding.
Infuse it into business decisions for managers (Dashboard), consume it from other parts of the service in an app (Application), or automate to make the entire process smarter with AI (Automation).
Hi Scott...actually that is what we did :) Check out this blog post for the details: ru-vid.com/show-UCKWaEZ-_VweaEx1j62do_vQcommunity?lb=Ugzf5SL_yh9NglCJzgF4AaABCQ
Actually it doesn't; what makes it largely different is the kind of features a data lake gives you. It catalogues data and makes it more usable and traceable for external data operations. So yes, you can say you could simply extract data out of your warehouse/ETL system and then operationalize it for your Spark jobs... Chances are, in a data lake solution that capability is already built in, with its own UI or API for easy operationalization (Spark-job-related transformation of data, munging, cleaning, etc.). A data lake is a full-blown solution, and more importantly an overlay over the existing data infrastructure you have - maybe an on-prem Hadoop, or a clustered MongoDB. Data lake software should primarily be able to create a single view of these and make sense of it. It's a thin line, but data lakes are supposed to be more organized.
I would say it depends on the underlying tech. Data warehouses (DWHs) and Extract-Transform-Load (ETL) are focused on relational databases (Postgres/Oracle/Microsoft/MariaDB/MySQL/SQLite), whereas a data lake also includes "Not Only SQL" (NoSQL) technologies like Kafka (data streams), Hadoop (document store/CSV file storage), Impala (SQL query engine for Hadoop), etc. When it comes to concepts, it *heavily* overlaps, IMO.
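The conceptual overlap is easy to demonstrate on one tiny dataset: the same raw file can be queried lake-style (schema applied on read) or ETL'd into a relational table and queried with SQL, warehouse-style. A stdlib-only sketch using SQLite as a stand-in for the relational side; the CSV contents are made up:

```python
import csv
import io
import sqlite3

# A raw CSV file as it might sit in a lake (invented sample data).
raw = "region,amount\neu,10\nus,30\neu,5\n"

# Lake-style: query the file directly, schema applied only on read.
lake_total = sum(
    int(r["amount"]) for r in csv.DictReader(io.StringIO(raw)) if r["region"] == "eu"
)

# Warehouse-style: ETL the same rows into a relational table, then SQL.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
db.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(r["region"], int(r["amount"])) for r in csv.DictReader(io.StringIO(raw))],
)
(wh_total,) = db.execute(
    "SELECT SUM(amount) FROM sales WHERE region = 'eu'"
).fetchone()

print(lake_total, wh_total)  # 15 15
```

Same question, same answer; the difference is *when* the schema is applied (at read time vs. at load time), which is really the thin line between the two camps.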