Isolating Streaming Ingest and Queries Using RocksDB

In a real-time analytics architecture, streaming data ingestion (from a source like Kafka) and query serving typically run on the same compute unit so that queries can reflect newly ingested data. These two functions compete for the same compute resources, so an unexpected burst of either streaming ingestion or queries can slow down the whole system. We will examine common approaches to the problem of compute contention, such as scaling, replication, and querying from shared storage, and discuss their tradeoffs and why they remain incomplete solutions.
In this talk, we will present a real-time analytics architecture we implemented in the Rockset database, based on RocksDB, that effectively isolates streaming data ingestion from query serving. RocksDB is a popular log-structured merge-tree storage engine that writes to an in-memory memtable and periodically flushes to disk.
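To make the memtable-and-flush write path concrete, here is a minimal Python sketch of a log-structured merge-tree store. It is a toy illustration, not RocksDB's API or implementation: the class `TinyLSM`, the `memtable_limit` parameter, and the in-memory list standing in for on-disk files are all assumptions made for the example.

```python
from bisect import bisect_left


class TinyLSM:
    """Toy LSM store (hypothetical, for illustration only).

    Writes land in an in-memory memtable. When the memtable reaches
    memtable_limit entries it is flushed to an immutable sorted run,
    which stands in for an on-disk file.
    """

    def __init__(self, memtable_limit=3):
        self.memtable_limit = memtable_limit
        self.memtable = {}
        self.sorted_runs = []  # oldest first, newest last

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Freeze the memtable as a sorted, immutable run and start a new one.
        if self.memtable:
            self.sorted_runs.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        # The memtable holds the freshest data, so check it first.
        if key in self.memtable:
            return self.memtable[key]
        # Then search runs from newest to oldest with binary search.
        for run in reversed(self.sorted_runs):
            keys = [k for k, _ in run]
            i = bisect_left(keys, key)
            if i < len(keys) and keys[i] == key:
                return run[i][1]
        return None
```

In this sketch a read always consults the memtable before any flushed run, which is why the freshest data lives only in memory until the next flush.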
Core to our architecture is the separation of compute and storage. This allows multiple RocksDB instances to query from the same shared storage. We use cloud object storage to ensure durability and use SSD as a shared hot storage tier for low-latency reads. On the compute side, we designed our query processing engine to be completely separate from all the modules that perform data ingestion.
For fresh data to be available to multiple compute units, it is essential that the in-memory state of the ingester's RocksDB memtable be replicated to other RocksDB instances. We built a RocksDB memtable replicator that propagates changes to remote instances in single-digit milliseconds. This architecture enables compute isolation so that real-time streaming ingestion does not interfere with queries, while still allowing the most recent data to be queried.
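The idea of replicating memtable state can be sketched in a few lines of Python. This is a simplified, single-process illustration under stated assumptions, not the replicator described in the talk: the names `Ingester`, `Replica`, `apply`, and `catch_up` are hypothetical, and direct method calls stand in for network transport.

```python
class Replica:
    """Query-side instance holding a copy of the ingester's memtable (hypothetical)."""

    def __init__(self):
        self.memtable = {}
        self.applied_seq = 0  # sequence number of the last applied write

    def apply(self, seq, key, value):
        # Accept only the next write in sequence; reject duplicates and gaps.
        if seq != self.applied_seq + 1:
            return False
        self.memtable[key] = value
        self.applied_seq = seq
        return True


class Ingester:
    """Ingest-side instance that numbers each write and forwards it to replicas."""

    def __init__(self, replicas):
        self.memtable = {}
        self.seq = 0
        self.log = []  # (seq, key, value); the write with seq n sits at index n - 1
        self.replicas = replicas

    def put(self, key, value):
        self.seq += 1
        self.memtable[key] = value
        self.log.append((self.seq, key, value))
        for replica in self.replicas:
            self.catch_up(replica)

    def catch_up(self, replica):
        # Send every logged write the replica has not applied yet, in order.
        for seq, key, value in self.log[replica.applied_seq:]:
            replica.apply(seq, key, value)
```

Sequence numbers let a replica apply writes in order and ignore repeats, so a replica that joins late can be brought up to date by replaying the log from its last applied position.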
ABOUT CONFLUENT
Confluent is pioneering a fundamentally new category of data infrastructure focused on data in motion. Confluent’s cloud-native offering is the foundational platform for data in motion - designed to be the intelligent connective tissue enabling real-time data, from multiple sources, to constantly stream across the organization. With Confluent, organizations can meet the new business imperative of delivering rich, digital front-end customer experiences and transitioning to sophisticated, real-time, software-driven backend operations. To learn more, please visit www.confluent.io.
#confluent #apachekafka #kafka