Working with time-series data at scale by Javier Ramírez

Подписаться 3,8 тыс.

50% 1

As individuals, we use time series data in everyday life all the time; If you’re trying to improve your health, you may track how many steps you take daily, and relate that to your body weight or size over time to understand how well you’re doing.
This is clearly a small-scale example, but on the other end of the spectrum, large-scale time series use cases abound in our current technological landscape. Be it tracking the price of a stock or cryptocurrency that changes every millisecond, performance and health metrics of a video streaming application, sensors for reading temperature, pressure and humidity, or the information generated from millions of IoT devices.
Modern digital applications require collecting, storing, and analyzing time series data at extreme scale, and with performance that a relational database simply cannot provide. We have all seen very creative solutions built to work around this problem, but as throughput needs increase, scaling them becomes a major challenge.
To get the job done, developers end up landing, transforming, and moving data around repeatedly, using multiple components pipelined together. Looking at these solutions really feels like looking at Rube Goldberg machines. It’s staggering to see how complex architectures become in order to satisfy the needs of these workloads.
Most importantly, all of this is something that needed to be built, managed, and maintained, and it still doesn’t meet very high scale and performance needs. Many time series applications can generate enormous volumes of data. One common example here is video streaming.
The act of delivering high quality video content is a very complex process. Understanding load latency, video frame drops, and user activity is something that needs to happen at massive scale and in real time. This process alone can generate several GBs of data every second, while easily running hundreds of thousands, sometimes over a million, queries per hour.
A relational database certainly isn’t the right choice here. Which is exactly why we built Timestream at AWS. Timestream started out by decoupling data ingestion, storage, and query such that each can scale independently. The design keeps each sub-system simple, making it easier to achieve unwavering reliability, while also eliminating scaling bottlenecks, and reducing the chances of correlated system failures which becomes more important as the system grows.
At the same time, in order to manage overall growth, the system is cell based - rather than scale the system as a whole, we segment the system into multiple smaller copies of itself so that these cells can be tested at full scale, and a system problem in one cell can’t affect activity in any of the other cells. In this session, I will introduce the problem of time-series, I will take a look at some architectures that have been used it the past to work around the problem, and I will then introduce Amazon Timestream, a purpose-built database to process and analyze time-series data at scale.
In this session I will describe the time-series problem, discuss the architecture of Amazon Timestream, and demo how it can be used to ingest and process time-series data at scale as a fully managed service. I will also demo how it can be easily integrated with open source tools like Apache Flink or Grafana.