
Realtime Advertisement Clicks Aggregator | System Design 

Code with Irtiza
12K subscribers
19K views

Let’s design a real-time advertisement clicks aggregator with Kafka, Flink, and Cassandra. We start with a simple design and gradually make it scalable while discussing the different trade-offs.
Note: pdfhost.io/edit?doc=8a32143c-...
System Design Playlist: • System Design Beginner...
🥹 If you found this helpful, follow me online here:
✍️ Blog / irtizahafiz
👨‍💻 Website irtizahafiz.com?
📲 Instagram / irtiza.hafiz
00:00 Why Track & Aggregate Clicks?
01:07 Simple System
02:12 Will it scale?
04:00 Logs, Kafka & Stream Processing
12:02 Database Bottlenecks
17:13 Replace MySQL
18:59 Data Model
25:45 Data Reconciliation
29:00 Offline Batch Process
32:10 Future Videos
#systemDesign #programming #softwareDevelopment

Published: Jul 4, 2024

Comments: 86
@kevindebruyne17 • 1 year ago
Your videos are really, really great: no fluff, straight to the topic, and they cover a lot of details. Thank you and keep it up!
@sarthakgupta290 • 3 months ago
Thank you so much for this perfect explanation!!
@ranaafifi5487 • 1 year ago
Very organized and neat! Thank you
@irtizahafiz • 1 year ago
Glad it was helpful!
@nosh3019 • 10 months ago
Thanks for the effort in making this! Very informative and a perfect companion to the System Design Interview Volume 2 book.
@freezefrancis • 1 year ago
I think you deserve a much bigger audience! The quality of the content is really good. Thanks for sharing.
@irtizahafiz • 1 year ago
Thank you for watching! Please like and share to reach more people.
@sergiim5601 • 1 year ago
Great explanation and examples!
@irtizahafiz • 1 year ago
Thank you!
@indavarapuaneesh2871 • 3 months ago
What I would've done differently: have both warm and cold storage. If your data access pattern is mostly reading data from the last 90 days (pick your number), then store that data in warm storage like Vitess (sharded MySQL or some other distributed relational DB). Then run a background process periodically that vacuums the stale data from the warm tier and exports it to a cold tier like a data lake. This way you're optimizing both read query latency and storage cost. Best of both worlds.
@pratikjain2264 • 3 months ago
This was by far the best video... thanks for doing it.
@irtizahafiz • 2 months ago
Most welcome!
@rajeshkishore7119 • 5 months ago
Excellent
@weixing8985 • 1 year ago
Thanks for the tutorials! I think you're following the topics of the book System Design Interview Volume 2, but in a way that's a lot easier to understand. I struggled a lot with those topics in the book until I came across your tutorials!
@irtizahafiz • 7 months ago
Yes! I refer to both volumes of that book. I mention it in the video and have it linked in most descriptions.
@protyaybanerjee5051 • 4 months ago
Some notes about this design:
- "Adding more topics" is a very vague statement. We have to define the data model that captures each click event and then partition the data based on advertisement_id and some form of timestamp.
- Not sure why replication lag is stated as an issue here. The read patterns for this design don't require reading consistent data, so this should not be an issue.
- "Relational DBs won't do well with aggregation queries" is a little misleading. Doing aggregation queries efficiently requires storing the data in a column-major format, which unlocks efficient compression and data loading.
- Why provision stream-processing infra just to upload data to cold storage? Once a log file reaches X MB, we can place an event in Kafka with a (file_id, offset) pair. A consumer reads this and uploads the data to S3. This avoids unnecessary dollar cost as well as the operational cost of maintaining stream infrastructure.
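To make that last point concrete, here is a rough sketch of what such an uploader consumer could look like in Python, assuming kafka-python and boto3; the topic name, bucket name, and message format are made up for illustration and are not taken from the video:

```python
import json
import boto3                      # pip install boto3
from kafka import KafkaConsumer   # pip install kafka-python

# Hypothetical topic carrying "this log segment is finished" events.
consumer = KafkaConsumer(
    "finished-log-segments",
    bootstrap_servers="localhost:9092",
    group_id="s3-uploader",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
s3 = boto3.client("s3")

for message in consumer:
    event = message.value  # assumed shape: {"file_path": "/var/log/clicks/000123.log", "offset": 0}
    file_path = event["file_path"]
    # Ship the finished segment to cold storage; the object key keeps the file name.
    s3.upload_file(file_path, "ad-clicks-raw", file_path.split("/")[-1])
    print(f"uploaded {file_path} to s3://ad-clicks-raw/")
```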
@jkl89966 • 1 year ago
awesome
@TheZoneTrader • 1 year ago
Each day's click storage = 0.1 KB * 3B = 0.3 TB/day, and not 3 TB/day? Correct me if I am wrong.
@ramin5021 • 4 months ago
I think you missed the K in KB. BTW, he didn't calculate it correctly either. I think the calculation below is correct: 0.1 KB = 100 bytes; 100 bytes * 3B = 300 TB. Correct me if I am wrong.
@erickadinugraha5906 • 2 months ago
@ramin5021 You're wrong: 100 bytes * 3B = 300 GB = 0.3 TB.
@srawat1212 • 2 months ago
It should be 0.3 TB or 300 GB.
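For anyone checking the math with the figures quoted in this thread (roughly 0.1 KB per click event and 3 billion clicks per day): 0.1 KB = 100 bytes, and 100 bytes × 3 × 10^9 clicks = 3 × 10^11 bytes = 300 GB ≈ 0.3 TB per day, so the 300 GB / 0.3 TB figure is the correct one.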
@rishabhjain2404 • 2 months ago
We should use a count-min sketch for real-time click aggregation on the stream processor; it is going to be very fast, and you can query data at last-minute granularity. A MapReduce system can be useful for exact click information: clicks can be batched, put into HDFS, reduced into aggregates, and saved to the DB.
@monodeepdas2778 • 2 months ago
AFAIK the count-min sketch is applicable to the top-K problem. I know we can use faster, less accurate algorithms, but that is what stream processors can do.
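For readers unfamiliar with the data structure being debated above, here is a minimal count-min sketch counter in Python; the width, depth, and hashing scheme are arbitrary illustrative choices, not anything prescribed in the video:

```python
import hashlib

class CountMinSketch:
    """Approximate per-key counters in fixed memory; estimates never under-count."""

    def __init__(self, width: int = 2048, depth: int = 5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, key: str, row: int) -> int:
        # One cheap hash per row, salted with the row number.
        digest = hashlib.md5(f"{row}:{key}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, key: str, count: int = 1) -> None:
        for row in range(self.depth):
            self.table[row][self._index(key, row)] += count

    def estimate(self, key: str) -> int:
        return min(self.table[row][self._index(key, row)] for row in range(self.depth))


# Usage: count clicks per advertisement id as events stream in.
sketch = CountMinSketch()
for ad_id in ["ad_42", "ad_42", "ad_7", "ad_42"]:
    sketch.add(ad_id)
print(sketch.estimate("ad_42"))  # >= 3; collisions can over-count, never under-count
```

Because estimates can only over-count, a sketch like this suits approximate real-time counters, while exact (billing-grade) numbers still need the batch path discussed in the video.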
@mohsanabbas6835 • 1 year ago
Event data stream platform. It's a more complex system, where data is processed either in real-time streams or in batches: ETL, data pipelines, etc.
@karthikbidder • 16 days ago
Since I'm working in adtech and looking to modernize our approach, I fortunately came across your video and it helped me a lot. My question: how about using ClickHouse instead of Cassandra? Will it work well or lead to any issues?
@sumonmal009 • 3 months ago
Capture the click with application logging, good idea. Main crux: 6:30 21:30
@roopashastri9908 • 2 months ago
Your videos are great! Very clearly articulated! I was curious why we have to use a NoSQL DB if we are storing only the aggregated data based on advertiser ID. What are the drawbacks of using a columnar DB like Snowflake in this case?
@irtizahafiz • 2 months ago
You can use whatever DB works best for your case. I think Snowflake will work just as well here.
@PrateekSaini • 3 months ago
Thanks for such a clear and detailed explanation. Could you please share a couple of blogs/articles for reference where companies are using this kind of system?
@irtizahafiz • 2 months ago
You should find some in the Uber engineering blog.
@kirillkirooha3848 • 1 year ago
Thanks for the video, that's brilliant! But I didn't quite understand the problem with inaccurate data in the stream processor. Late data can arrive at the stream processor, but I suppose it has a timestamp from the Apache log. So can't we just insert that late data into Cassandra?
@irtizahafiz • 1 year ago
Hi! Yes, you can insert the late data when storing individual clicks. However, when you are computing aggregations, such as "total clicks every hour", you will already emit the count at the end of the hour (with some buffer). Then, when late data arrives, you won't be able to "correct" your aggregate.
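To make the above concrete, here is a toy sketch (plain Python, not the Flink code from the video) of hourly tumbling-window aggregation with a small lateness buffer; once a window has been emitted, a late click can no longer change the stored count, which is exactly the gap the offline reconciliation job closes later. The window size and buffer are arbitrary:

```python
from collections import defaultdict

WINDOW_SEC = 3600          # tumbling window size: one hour
ALLOWED_LATENESS_SEC = 60  # small buffer before a window is finalized

open_windows = defaultdict(int)  # window_start -> click count
emitted = set()                  # windows already written downstream


def window_start(event_ts: int) -> int:
    return event_ts - (event_ts % WINDOW_SEC)


def on_click(event_ts: int, watermark: int) -> None:
    """Process one click event; `watermark` approximates the max event time seen so far."""
    w = window_start(event_ts)
    if w in emitted:
        # The hourly count was already emitted; this late click cannot
        # "correct" the aggregate anymore. Only the offline batch job fixes it.
        print(f"dropped late click for window {w}")
        return
    open_windows[w] += 1
    flush(watermark)


def flush(watermark: int) -> None:
    """Emit any window whose end (plus the lateness buffer) has passed."""
    for w in sorted(open_windows):
        if w + WINDOW_SEC + ALLOWED_LATENESS_SEC <= watermark:
            print(f"emit window {w}: {open_windows[w]} clicks")  # write to Cassandra here
            emitted.add(w)
            del open_windows[w]
```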
@parthmahajan6535 • 2 months ago
That was an awesome video; I had a similar approach and got it validated. I was wondering if you could also start a code series on building such systems (as demonstrated in the video).
@irtizahafiz • 2 months ago
Thank you for watching! I plan on building similar sub-systems, but TBH, building an e2e system like this without an actual use case (and traffic) is not really worth it. I don't think it will add much value either. Thank you for the suggestion though. Appreciate it.
@parthmahajan6535 • 1 month ago
@irtizahafiz It would make sense though, if someone's just starting. I was thinking we could use some dataset of clickstream logs, create a stream of the logs coming in (simulate a stream through Python), and then build the system.
@code-commenter-vs7lb • 6 months ago
Hello, when we introduce a log file, how do we ensure the aggregates are still "near real-time"? IMO when you introduce log files in the middle, you have append-only logs, which we will probably only publish once the log file has finished appending and a new file has started. So there is a delay of some time, maybe a minute or so (depending on how big your rolling window is).
@irtizahafiz • 3 months ago
That depends on how you are reading the log file. You can add a "watcher" that tails the log file and publishes a message downstream whenever a new line is appended. Alternatively, as a first step, you can write to Kafka, and the Kafka consumer can both process the data and add it to the log file.
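A minimal sketch of the "watcher" option mentioned above, assuming kafka-python; the broker address, topic name, and log path are placeholders, not values from the video:

```python
import time
from kafka import KafkaProducer  # pip install kafka-python

# Hypothetical broker address; adjust for your deployment.
producer = KafkaProducer(bootstrap_servers="localhost:9092")


def tail_and_publish(log_path: str, topic: str = "ad_clicks") -> None:
    """Follow the access log (like `tail -f`) and publish every new line to Kafka."""
    with open(log_path, "r") as f:
        f.seek(0, 2)  # start at the end of the file
        while True:
            line = f.readline()
            if not line:
                time.sleep(0.5)  # nothing new yet; back off briefly
                continue
            producer.send(topic, line.strip().encode("utf-8"))


if __name__ == "__main__":
    tail_and_publish("/var/log/apache2/access.log")
```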
@KShi-vq4mg • 9 months ago
Big fan! Hopefully a lot more people find these. Just one piece of feedback: you covered what kind of data we store; would it also be worth going a little deeper into the data model?
@irtizahafiz • 8 months ago
Hi! That's really good feedback! It's definitely worth diving deeper, but it's difficult to do that in a high-level SD video. If that's something you would find valuable, I can definitely create videos on data models.
@shubhamdalvi6424 • 8 months ago
Great video! A minor error: 0.1 KB x 3B = 300 GB.
@lytung1532 • 1 year ago
The tutorial is helpful. Could you give a sample of the format of the records stored in Cassandra? Do we need to store all the data from the original click in Cassandra?
@irtizahafiz • 1 year ago
Depends on your use case. You could have two Cassandra tables, one for individual records and one for aggregations.
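Purely as an illustration of that split (not the exact data model from the video), the two tables could look something like this with the Python cassandra-driver; the keyspace, table, and column names are assumptions:

```python
from cassandra.cluster import Cluster  # pip install cassandra-driver

session = Cluster(["127.0.0.1"]).connect("ads")  # assumes an `ads` keyspace already exists

# Raw clicks: partitioned by ad and day so one hot ad doesn't grow an unbounded partition.
session.execute("""
CREATE TABLE IF NOT EXISTS clicks_raw (
    ad_id      text,
    click_date date,
    click_ts   timestamp,
    click_id   timeuuid,
    user_id    text,
    country    text,
    PRIMARY KEY ((ad_id, click_date), click_ts, click_id)
) WITH CLUSTERING ORDER BY (click_ts DESC, click_id DESC)
""")

# Pre-aggregated counts: one row per ad per minute, which the dashboards would read.
session.execute("""
CREATE TABLE IF NOT EXISTS clicks_per_minute (
    ad_id        text,
    window_start timestamp,
    click_count  bigint,
    PRIMARY KEY (ad_id, window_start)
) WITH CLUSTERING ORDER BY (window_start DESC)
""")
```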
@tonyliu2858 • 1 year ago
Can we use MapReduce for stream processing? Will it meet the latency requirement? Or do we have to use other stream processors such as Flink/Spark?
@irtizahafiz • 8 months ago
I think it all depends on your application. If you want the lowest latency, I know Flink or Spark (with micro-batches) can get you that.
@utkarshgupta2909 • 1 year ago
Correct me if I am wrong, but this seems to me more like the Lambda architecture: the aggregation path being fast but inaccurate, whereas the S3 path is slow but accurate.
@irtizahafiz • 1 year ago
Yup I think you can use the term for this.
@freezefrancis • 1 year ago
Yes. That's right.
@ankitagarwal4022 • 5 months ago
Excellent video, going from a basic design to a scalable design. A few questions: 1. What technology can we actually use for the log watcher? 2. Can you please correct me about the stream processors: for saving data in S3, we are using a stream processor. Here the stream processor can be any consumer of Kafka events, like a simple Java service that pushes data to S3, and for the stream processor that aggregates the data and saves it in Cassandra we can use Flink.
@irtizahafiz • 2 months ago
1. I don't remember off the top of my head, but you can even write a small cron job to read the file every few minutes. There are better tools if you Google around. 2. Yes, that should work. Flink automatically checks data in S3 when doing the aggregation.
@VishalThakur-wo1vx • 4 months ago
We could also keep state in the Kafka Streams application (local or global state) and use Interactive Queries to fetch the result of the aggregation. Can you please share how to decide when to offload the aggregation result to an external DB vs. when to use Interactive Queries? I understand that durability can be one factor, but what are the others?
@irtizahafiz • 4 months ago
If your tables are on a relational DB and they are relatively large, aggregations will do poorly. Instead, either store precomputed aggregations in a different table, or compute on the fly using something like Flink.
@dhruvgarg722 • 6 months ago
Great video! Did you create these diagrams in Obsidian, or are they images?
@irtizahafiz • 5 months ago
I use a combination of Obsidian and Miro to create the notes.
@ax5344 • 4 months ago
@5:35 0.1 KB * 3B = 3 TB. Hi, how is this computation done? I thought 3B has 9 zeros; multiplying by 0.1 leaves 8 zeros, and 1 TB is 1e9 KB, so I thought it would be 0.3 TB. Did I get something wrong?
@krutikpatel906 • 4 months ago
For the batch job, I think you need a separate stream processor; you can't mix real-time data and old data. Please share your thoughts.
@irtizahafiz • 4 months ago
You don't need a stream processor for batch jobs. You can do it offline.
@vikramreddy7586 • 2 months ago
Correct me if I'm wrong: we are processing the data fetched via the batch job, so essentially we are processing the data twice to get rid of the inaccuracies?
@irtizahafiz • 2 months ago
It's been a while since I uploaded the video, but I believe we are only processing it once with the stream processing pipeline. However, it's pretty common to run some kind of a nightly job to "correct" some of the small inaccuracies coming from the real-time aggregations.
@shadytanny • 2 years ago
How are we using log files? Is reading from log files the best way to source the data?
@irtizahafiz • 2 years ago
It totally depends on your system. In this example we are choosing to read from log files because of their simplicity. If you want, you can also run a Spark job on some logs for every website visit.
@chetanyaahuja1241 • 2 months ago
Thanks for clearly explaining the end-to-end design. Just a couple of questions: 1) Could you explain a little bit about how the Apache log files get the click information and how it is real-time? 2) Also, do you have a link to these notes/diagrams? The one in the description doesn't work.
@irtizahafiz • 2 months ago
The simplest would be to write a cron job or something similar that executes every couple of minutes, reads the log file, and writes new data to Kafka. You can also poll using a continuously running Python program: it would run in, say, a "while" loop and read from the file every couple of minutes to write to Kafka. These are two solutions you can quickly prototype. For more comprehensive solutions, there are dedicated file-watcher daemons that you could use.
@irtizahafiz • 2 months ago
Right, about the links. Unfortunately, they expired. Even I don't have access to most of them anymore. Sorry!
@chetanyaahuja1241 • 2 months ago
@irtizahafiz Thank you for the explanation.
@matthayden1979 • 3 months ago
Is it correct to say that Kafka is a data ingestion platform, since the data is stored in topics and later processed by the stream processor?
@user-eq4oy6bk5p • 2 years ago
What do we store in S3? Apache log files?
@irtizahafiz • 2 years ago
No. In S3 we store the individual click events. In Cassandra we store the aggregations. We are doing that under the assumption that individual clicks are rarely accessed, only aggregations are accessed regularly.
@srishtiganjoo8549 • 4 months ago
Given that Kafka is durable, why are we storing the clicks in log files? Would this not hamper system performance? If we need logs, maybe the Kafka consumer can log them to files.
@irtizahafiz • 3 months ago
Kafka is durable for a short time (14 to 30 days depending on how it's configured). You also cannot do analytical queries easily when your data is in Kafka, or connect it to reporting software like Power BI and Tableau. Because of all that and more, most of the time you only use Kafka as a buffer rather than permanent storage.
@bobuputheeckal2693 • 10 months ago
Is it 300 GB or 3 TB?
@unbox2359 • 2 months ago
Can someone help me with the Cassandra database schema design? Like, what tables will there be and what columns will they have?
@irtizahafiz • 2 months ago
It depends on the type of application you are trying to build.
@unbox2359 • 2 months ago
@irtizahafiz I'm asking for this application only.
@siarheikaravai • 1 year ago
What about cost calculations? How do we estimate how much money this system will consume?
@irtizahafiz • 7 months ago
That depends on your AWS, GCP, etc. platform costs. Most of these systems and DBs will be hosted on instances with different billing requirements.
@eternalsunshine313 • 9 months ago
Why can't we process the clicks a bit later than they happen, so we capture late data and avoid the batch job? This wouldn't be 100% real-time, but do most systems need super-fast real-time processing for ad clicks?
@irtizahafiz • 8 months ago
Totally depends on your use case. There are real-life use cases where you need it to be real-time; in those cases the tradeoff of dropping "some" late data is acceptable.
@szyulian • 2 months ago
Watched.
@irtizahafiz • 2 months ago
Thank you!
@neelakshisoni6118 • 2 months ago
I can't download your PDF notes.
@irtizahafiz • 2 months ago
Sorry, some of the links have expired!