Thanks for the videos. I think one important functional requirement that most logging solutions offer (e.g. GCP Logging) is text search. So potentially having a text search engine (e.g. Elasticsearch) is something to consider.
Once we have the data in the time series DB, how do you suppose we go about hooking up a monitoring/alerting service to it? I'm not sure what the optimal route is between 1. a push-based model where for every new metric (or batch) in the time series DB we query an alarms/rules DB, or 2. a pull-based model where the alerting service periodically queries the time series DB for all alarms/rules in the DB. Option 1 seems excessive since the majority of real-time metrics aren't going to fire an alarm; option 2 seems excessive in that most alarms aren't firing at any given instant.
I feel like 1 is probably more practical, especially since you can just build an additional change data capture queue off of the time series DB for the alerting service to read from, but truthfully I'm not entirely sure.
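Roughly what I'm picturing for option 1, as a hedged sketch in Java (the Metric/Rule shapes, the threshold semantics, and the cached rules map are all made up for illustration). The key point is that each metric coming off the CDC queue only does work if a rule is actually registered for its name, which addresses the "most metrics never fire an alarm" concern:

```java
import java.util.List;
import java.util.Map;

// Hypothetical shapes for illustration (requires Java 16+ for records)
record Metric(String name, double value) {}
record Rule(String metricName, double threshold) {}

public class AlertEvaluator {
    // Rules cached in memory, refreshed from the rules DB on some interval,
    // so we never hit that DB once per metric.
    private final Map<String, List<Rule>> rulesByMetric;

    public AlertEvaluator(Map<String, List<Rule>> rulesByMetric) {
        this.rulesByMetric = rulesByMetric;
    }

    // Called for each metric read off the CDC queue; metrics with no
    // registered rules fall through immediately.
    public void onMetric(Metric m) {
        for (Rule r : rulesByMetric.getOrDefault(m.name(), List.of())) {
            if (m.value() > r.threshold()) {
                fireAlarm(r, m);
            }
        }
    }

    private void fireAlarm(Rule r, Metric m) {
        // In a real system this would page/notify; printing as a stand-in
        System.out.printf("ALARM: %s=%f exceeded %f%n",
                m.name(), m.value(), r.threshold());
    }
}
```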
Thanks for this! Is a Flink consumer just like a normal Java/Spring queue consumer that is monitoring an AWS Kinesis stream? (I've never used Flink/Kafka.) Do we have to use Flink in conjunction with Kafka queues, or would any service work?
I have a dedicated video about this, but I'm pretty sure Flink is flexible with multiple types of message queue. Ideally you should be using a replayable one though.
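For a feel of what the consumer side looks like, here's a minimal sketch of a Flink job reading from Kafka (the broker address, topic name, and the trim-as-a-stand-in-for-enrichment step are placeholder assumptions, and the connector API varies a bit by Flink version):

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class LogEnrichmentJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Hypothetical topic/broker names for illustration
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("kafka:9092")
                .setTopics("raw-logs")
                .setGroupId("log-enrichment")
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-raw-logs")
           .map(line -> line.trim()) // placeholder for real enrichment logic
           .print();                 // placeholder sink; real job writes to the TSDB/S3

        env.execute("log-enrichment");
    }
}
```

So it's less "a Spring consumer polling a queue" and more a job submitted to a Flink cluster, which handles the parallelism, checkpointing, and replay for you.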
Good one Jordan. You are very clear in your thoughts. Keep this going!! :) A metrics/logging system is challenging because of both high-scale writes and reads. It looks like the write scale here depends on how much Kafka can scale. If we are looking at a very active public service receiving 100 billion msgs/day (over a million msgs/sec), I am guessing Kafka can handle that? What about read load? Since a lot of people may use the logs for customer investigations, there could be a lot of read load on the time series DB, since the other path is for batch insights. As I am typing this, I am thinking about Splunk. Could you make a video on how to design a Splunk-like system? (Maybe these are the building blocks.)
Splunk is just a distributed search index as far as I know, which I already have a video on. It's called Twitter search/Elasticsearch, and you could just connect it to Kafka.
Hey Jordan, what prevents us from sending the unstructured data directly from the client to S3? If we do not care about data enrichment we might as well just send it straight from the client, unless I'm missing something? Also a couple of follow-up questions just to clarify for myself:
- Why do we need a logging service? Why can't we just push the data from the client straight to the queue?
- As far as I understand, we leverage the time series DB for queries on relevant "recent" data, so I assume we would need some sort of cleanup jobs that run periodically? And we use a data warehouse (like Snowflake) to enable analytical queries that would be too big to run on our main DB?
1) I agree, though in this case I was assuming that we are doing some sort of data enrichment via the Flink consumers. 2) Similar to your first question, but at the end of the day there are reasons we want some sort of gateway before we process every request to send data to Kafka. Examples might be rate limiting or some sort of validation on the messages to ensure that a bad actor can't spam our Kafka queues. 3) Yep! Time series DBs make dropping old data very simple; that's one of the main benefits of them.
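To make point 2 concrete, here's a hedged sketch of what that gateway check could look like (the topic name, broker address, and the 64 KB cap are all assumptions; a real gateway would also do auth and rate limiting):

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class LogGateway {
    // Assumed per-message cap (chars as a rough proxy for bytes)
    private static final int MAX_SIZE = 64 * 1024;

    private final KafkaProducer<String, String> producer;

    public LogGateway() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092"); // hypothetical broker
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        producer = new KafkaProducer<>(props);
    }

    // Reject empty or oversized payloads before they ever reach Kafka,
    // so a bad actor can't spam the queue with junk.
    public boolean ingest(String clientId, String payload) {
        if (payload == null || payload.isEmpty() || payload.length() > MAX_SIZE) {
            return false;
        }
        // Keying by client ID keeps one client's logs on one partition
        producer.send(new ProducerRecord<>("raw-logs", clientId, payload));
        return true;
    }
}
```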
Hey, I'd have to look into these more, but if it allows you to query directly from Kafka itself in addition to the DB, then I suppose that could speed things up a bit. Depends on whether we need it!
From my understanding it's not necessarily cheaper if you're just using it for data storage. If you want to do processing of the data, that's different, but I'm not sure we need to do any in the Hadoop cluster here.
Thanks for the amazing video. In one of my interviews I was asked to design a flight recorder to record the data within a flight. Could you please make a video on that?
Interesting! FWIW, flight recordings are just taking various samples of the program at slightly randomized intervals, so I think the design would be very similar to this one, except you can partition events based on the thread they were published from in your time series database.
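Something like this minimal sketch is what I mean by sampling (the interval, the jitter, and the publish target are all assumptions for illustration):

```java
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

public class FlightSampler implements Runnable {
    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            // Snapshot every live thread's stack; the thread name becomes
            // the partition key downstream in the time series DB.
            for (Map.Entry<Thread, StackTraceElement[]> e :
                    Thread.getAllStackTraces().entrySet()) {
                publish(e.getKey().getName(), e.getValue());
            }
            try {
                // Slightly randomized interval (90-110ms here) to avoid
                // aliasing with periodic work in the program being sampled
                Thread.sleep(90 + ThreadLocalRandom.current().nextInt(20));
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
            }
        }
    }

    private void publish(String threadName, StackTraceElement[] stack) {
        // Stand-in: a real recorder would ship (threadName, timestamp, stack)
        // to the ingestion queue from the main design
    }
}
```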
@Piyush-ky9ee It would, but since we're just ingesting the data it's really not a big deal in my opinion. If you split the TSDB by the source of the metric and, say, send all of the metrics to that one shard of the TSDB, I don't think it should be overwhelmed, as the writes are being buffered by a queue and the entire index can be cached for fast ingestion. Perhaps I'm wrong!
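The shard routing itself can be as simple as this hedged sketch (the shard count and hashing scheme are assumptions; a real deployment would likely want consistent hashing so resharding doesn't move every key):

```java
public final class ShardRouter {
    private final int numShards;

    public ShardRouter(int numShards) {
        this.numShards = numShards;
    }

    // Every metric from the same source lands on the same TSDB shard,
    // so one noisy source stays on a single shard instead of fanning out.
    public int shardFor(String sourceId) {
        // floorMod keeps the result non-negative even for negative hash codes
        return Math.floorMod(sourceId.hashCode(), numShards);
    }
}
```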
I suppose we could, though unless we want structured data I'm not sure why we would (I'm assuming these are just logs for now). Presumably it would be more expensive to run a managed Cassandra DB than just dumping into S3.
Hi! I still have much to learn, but my approach was to read as much as I could, starting from Designing Data-Intensive Applications. From that point on, whenever I'd see a piece of technology I hadn't heard of, I'd make note of it and look it up later. I also take notes on all of this so I retain the information better.
Kafka and Spark are used for different things. Kafka at its core is a message queue, whereas Spark is a big data engine for batch processing. Although there may be Kafka stream-processing software now, when I hear Kafka I think of a message queue.