What Is Apache Druid And Why Do Companies Like Netflix And Reddit Use It? 

Seattle Data Guy
96K subscribers
7K views

The past few decades have driven a growing need for more data, delivered faster. One of the biggest catalysts was the push for better data and better decisions around advertising. In fact, adtech has driven much of the real-time data technology we have today.
For example, Metamarkets created a real-time analytics database that has continued to grow in popularity over the last decade.
Druid.
Druid has expanded from a solution that was developed for a single company to an open source solution that is relied upon by thousands of companies like Reddit, Walmart, and Netflix.
If you want to learn more about Druid, you can check out the Druid developer portal below!
bit.ly/40nxrk4
Also, the Druid Summit is coming up, and I will be speaking so do sign up below!
bit.ly/47hqO51
If you enjoyed this video, check out some of my other top videos.
Top Courses To Become A Data Engineer
• Top Courses To Become ...
What Is The Modern Data Stack - Intro To Data Infrastructure Part 1
• What Is The Modern Dat...
If you would like to learn more about data engineering, then check out Google's GCP certificate
bit.ly/3NQVn7V
If you'd like to read up on my updates about the data field, then you can sign up for our newsletter here.
seattledataguy.substack.com/​​
Or check out my blog
www.theseattledataguy.com/
And if you want to support the channel, then you can become a paid member of my newsletter
seattledataguy.substack.com/s...
Tags: Data engineering projects, Data engineer project ideas, data project sources, data analytics project sources, data project portfolio
_____________________________________________________________
Subscribe: / @seattledataguy
_____________________________________________________________
About me:
I have spent my career focused on all forms of data. I have focused on developing algorithms to detect fraud, reduce patient readmission and redesign insurance provider policy to help reduce the overall cost of healthcare. I have also helped develop analytics for marketing and IT operations in order to optimize limited resources such as employees and budget. I privately consult on data science and engineering problems both solo as well as with a company called Acheron Analytics. I have experience both working hands-on with technical problems as well as helping leadership teams develop strategies to maximize their data.
*I do participate in affiliate programs, if a link has an "*" by it, then I may receive a small portion of the proceeds at no extra cost to you.

Published: 29 Jul 2024

Comments: 15
@palgun. 8 months ago
Can you make videos on building projects with modern data stack pls
@thedatadoctor 7 months ago
Really informative. Thanks Ben!
@SeattleDataGuy 5 months ago
you're welcome!
@michelleflood8871 5 months ago
Great content, absolutely love it
@SeattleDataGuy 5 months ago
Thank you so much!
@zacharythatcher7328 9 months ago
Why use Druid instead of Postgres or timescaledb? Is it cheaper to store archival data with Druid since it goes to s3?
@keenshibe7529 9 months ago
I'm not an expert, but I might be able to give a couple of insights. Isn't Druid mainly optimized for ingesting streaming data and for queries involving multiple dimensions, whereas TimescaleDB is an extension of Postgres that can also be used for real-time analytics, but has higher latency and is optimized mainly for time-based aggregations? Also, the question of storing data with Druid since it connects to S3 had me confused, as Druid doesn't only support S3. And doesn't comparing S3 and Postgres not quite make sense, since S3 is an object storage service while Postgres is an RDBMS? Sorry if I made some incorrect statements, I'm still learning. Please correct me if I'm wrong.
@ahmetlekesiz9948 9 months ago
Because Druid is columnar. It is not cheaper; it is designed for analytics usage, not for storage like, for example, Postgres.
@davidwang4632 9 months ago
Druid and Postgres design centers are built for different core use cases. Druid is for low-latency analytics queries (e.g. group-bys, filters, aggregations, etc.). Postgres is a great general-purpose SQL DB but will run into performance/timeout issues when scanning larger data sets or in highly concurrent use cases. Re archival - no, an S3 object bucket or data lake would make more sense here. Druid's architecture incorporates S3 for data durability and scale-out, but it's not meant simply for long-term retention. Folks using Druid want its architecture when hitting latency SLAs is crucial.
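The row-vs-column tradeoff the commenters are describing can be sketched in a few lines of Python. This is a toy illustration only, not Druid's actual storage engine; all names and data here are made up:

```python
# Toy sketch of why a columnar layout helps filter + aggregation queries
# (the "WHERE color = blue" + SUM style queries mentioned above).

rows = [
    {"color": "blue", "clicks": 10},
    {"color": "red", "clicks": 5},
    {"color": "blue", "clicks": 7},
]

# Row-oriented: every query must touch whole rows, even unused fields.
def sum_clicks_row(rows, color):
    return sum(r["clicks"] for r in rows if r["color"] == color)

# Column-oriented: each column is a contiguous array, so the filter only
# reads the "color" column and the aggregation only reads "clicks".
columns = {
    "color": [r["color"] for r in rows],
    "clicks": [r["clicks"] for r in rows],
}

def sum_clicks_col(columns, color):
    matches = [i for i, c in enumerate(columns["color"]) if c == color]
    return sum(columns["clicks"][i] for i in matches)

print(sum_clicks_row(rows, "blue"))   # 17
print(sum_clicks_col(columns, "blue"))  # 17
```

Both return the same answer; the win in a real columnar store comes from only reading the columns a query touches, plus per-column compression and indexing.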
@keenshibe7529 9 months ago
@@davidwang4632 Thank you for your comment. When you say Postgres encounters performance/timeout issues when scanning, do you mean when using complex queries for real-time analytics? Also, when you say Druid uses S3 for data durability and scale-out, not for long-term retention - do you mean storing older immutable segments from Druid into S3 as historical analytical data? I'm not very well versed in the terminology but keen on the subject.
@davidwang4632 9 months ago
@@keenshibe7529 Re Postgres - you're right. As a database that stores data in rows, Postgres is slower than Druid for queries that require filtering (WHERE color = blue) and aggregations, as it requires either scanning the entire row or creating and maintaining an index. This is also true of other databases designed for transactions (MySQL, MariaDB, Oracle, DB2, SQL Server, MongoDB, etc.). Druid has a highly optimized data format that goes beyond OLAP columnar storage, with automatic indexing and compression that makes read-intensive queries highly performant, even for data sets with trillions of rows.

Re S3 - as an object store, it's designed to be cheap and durable, but not very performant. It's a great place to store data that you want to keep available but don't need to query very much (e.g. long-term archival, or non-performance-sensitive queries via a query engine like Trino or Presto). Druid uses S3 (or other object stores on other clouds, like Azure Blob Storage or Google GCS) for durability and reliability, but to get sub-second performance, Druid also keeps data on high-speed local storage. Upon ingestion, Druid organizes data into files (called "segments") of its highly optimized data format, places a copy into S3, and places copies for high-speed use onto data nodes that include both computing power and fast storage for fast queries. The copy in deep storage (S3 or another object store) serves as a continuous backup for high availability and is used to rebalance the cluster when you add or remove nodes.
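For readers curious what such a low-latency group-by looks like in practice, here is a minimal sketch against Druid's SQL-over-HTTP endpoint (`/druid/v2/sql`, which takes a JSON body with a `query` field). The broker URL and the `web_events` datasource are hypothetical, and the commented-out request needs a live Druid cluster:

```python
import json

# Hypothetical datasource and query; __time is Druid's built-in timestamp column.
query = """
SELECT color, SUM(clicks) AS total_clicks
FROM web_events
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR
GROUP BY color
"""

# Druid's SQL API expects a JSON object with a "query" key.
payload = json.dumps({"query": query})

# To actually run it against a broker (requires a live cluster):
# import requests
# resp = requests.post("http://localhost:8888/druid/v2/sql",
#                      data=payload,
#                      headers={"Content-Type": "application/json"})
# print(resp.json())

print(payload)
```

The broker fans this query out to the data nodes holding the relevant segments, which is where the locally cached copies described above come into play.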
@moussaelaqqaoui A month ago
Hello Ben, can we have a discussion please!
@HoneyCombAI 9 months ago
69 Views. Nice.
@SeattleDataGuy 8 months ago
nice
Up next
Apache Druid Explained | Core Concepts
22:52
8K views
What Is A Data Catalog And Why Do People Use Them?
10:26
Apache Druid 101
44:11
17K views