Streaming Ingestion in Apache Pinot
Streaming ingestion in Apache Pinot enables real-time data to be ingested, indexed, and queried within seconds of arrival. This capability forms the backbone of low-latency analytics use cases such as anomaly detection, dashboards, personalization engines, and monitoring platforms. Unlike batch ingestion, which processes static historical datasets, streaming ingestion is designed for continuously evolving data, ingesting millions of events per second from modern event platforms like Apache Kafka, Amazon Kinesis, and Apache Pulsar.
Streaming ingestion populates REALTIME tables in Pinot, which are optimized for append-only data streams. These tables ingest data into in-memory consuming segments, which are sealed and persisted periodically based on time or size thresholds. This architecture lets Pinot serve low-latency queries over streaming data while seamlessly managing freshness and durability.
How Streaming Ingestion Works in Pinot
The ingestion process starts with configuring a REALTIME table to connect to a streaming source, from which Pinot fetches and processes records. Each partition of a Kafka topic (or its equivalent in other platforms) maps to its own consuming segment in Pinot, giving fine-grained, per-partition control over offset tracking and recovery.
As events flow in, Pinot uses a stream ingestion plugin that parses, transforms, and indexes records according to the table's schema and ingestion config. Events accumulate in an in-memory consuming segment until a threshold (such as realtime.segment.flush.threshold.rows or realtime.segment.flush.threshold.time) is met. The segment is then sealed and flushed to disk, and a new consuming segment is created for continued ingestion. Sealed segments are pushed to deep storage and become part of the table's queryable data.
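For reference, these thresholds live inside the table's streamConfigs map. A minimal sketch, using the standard Apache Pinot property names (the values here are illustrative, not recommendations):

```json
{
  "realtime.segment.flush.threshold.rows": "1000000",
  "realtime.segment.flush.threshold.time": "6h",
  "realtime.segment.flush.threshold.segment.size": "200M"
}
```

In recent Pinot versions, setting the rows threshold to 0 tells Pinot to auto-size segments toward the target segment size instead. The full table config in which this map lives is shown under Configuring Realtime Ingestion below.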
Supported Stream Sources
Apache Pinot natively supports several popular streaming platforms for real-time ingestion, each wired in through a pluggable consumer factory (sketched after this list):
- Apache Kafka: The most widely used source, with mature integration and offset tracking.
- Amazon Kinesis: Supported via plugins; suitable for AWS-native architectures.
- Apache Pulsar: Increasingly popular for event-driven systems with built-in multitenancy.
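Each source is selected via the streamType value and a consumer factory class in streamConfigs, set through the stream.<type>.consumer.factory.class.name property (e.g. stream.kafka.consumer.factory.class.name for Kafka). The mapping below sketches the factory classes shipped in Pinot's plugin modules; exact class names can vary by release, so verify against your Pinot version:

```json
{
  "kafka": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
  "kinesis": "org.apache.pinot.plugin.stream.kinesis.KinesisConsumerFactory",
  "pulsar": "org.apache.pinot.plugin.stream.pulsar.PulsarConsumerFactory"
}
```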
Streaming Table Configuration
A streaming table in Pinot is defined by a REALTIME table config, which includes ingestion-specific details like stream type, topic name, consumer type, and decoder class. For example, a Kafka-backed REALTIME table specifies the Kafka broker list, the topic, and stream-level properties such as the offset reset policy, flush thresholds, and pass-through consumer settings; a complete example appears under Configuring Realtime Ingestion below.
Segment Lifecycle and Stream Consumption
Each streaming segment goes through a defined lifecycle. While a segment is actively ingesting data, it is in the CONSUMING state. After reaching a size or time threshold, it is sealed, persisted, and moved to the ONLINE state, and the system creates a new consuming segment for the same stream partition.
In the event of a failure or restart, Pinot recovers by resuming consumption from the last committed offsets, providing at-least-once delivery by default; combined with upserts or deduplication, this yields effectively exactly-once query results. Helix manages segment assignments, state transitions, and failure recovery across the cluster, ensuring high availability and continuity.
Configuring Realtime Ingestion
Below is a sample configuration for a Pinot REALTIME table that consumes data from a Kafka topic named events_topic. This is a minimal sketch: the table and schema names, broker address, and threshold values are illustrative, and the decoder shown assumes JSON-encoded messages.
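```json
{
  "tableName": "events",
  "tableType": "REALTIME",
  "segmentsConfig": {
    "schemaName": "events",
    "timeColumnName": "eventTime",
    "replication": "1",
    "retentionTimeUnit": "DAYS",
    "retentionTimeValue": "7"
  },
  "tenants": {},
  "tableIndexConfig": {
    "loadMode": "MMAP",
    "streamConfigs": {
      "streamType": "kafka",
      "stream.kafka.topic.name": "events_topic",
      "stream.kafka.broker.list": "localhost:9092",
      "stream.kafka.consumer.type": "lowlevel",
      "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
      "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
      "stream.kafka.consumer.prop.auto.offset.reset": "smallest",
      "realtime.segment.flush.threshold.rows": "1000000",
      "realtime.segment.flush.threshold.time": "6h"
    }
  },
  "metadata": {}
}
```

The matching schema (not shown) defines the columns and declares eventTime as the time column; Avro and other decoders can be substituted for the JSON decoder as needed.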
Support for Upsert
Pinot supports full upserts and partial upserts on REALTIME tables. This enables record deduplication and fine-grained mutation of streaming records using a primary key.
Full Upsert
When using full upsert, the latest record (determined by a comparison column, usually a timestamp) entirely replaces the older one for a given primary key, as sketched below.
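As a sketch, full upsert is enabled in the table config via upsertConfig; the column name below is illustrative, and upsert additionally requires primaryKeyColumns to be declared in the schema (field names follow recent Pinot releases, where the comparison column is given as a list):

```json
{
  "upsertConfig": {
    "mode": "FULL",
    "comparisonColumns": ["updatedAt"]
  },
  "routing": {
    "instanceSelectorType": "strictReplicaGroup"
  }
}
```

The strictReplicaGroup instance selector keeps all segments of a stream partition served from the same replica group, which upsert relies on for consistent primary-key lookups.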
Partial Upsert
Partial upserts merge only certain columns based on a per-column strategy (such as INCREMENT, APPEND, or OVERWRITE), while the remaining columns follow a default strategy. Partial upsert is commonly used in metrics systems or user state tracking, where some fields are additive and others replace previous values; a sketch follows.
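A minimal sketch of a partial-upsert configuration, assuming hypothetical columns clickCount, deviceIds, and lastSeen (the key and strategy names follow Pinot's documented upsertConfig fields):

```json
{
  "upsertConfig": {
    "mode": "PARTIAL",
    "defaultPartialUpsertStrategy": "OVERWRITE",
    "partialUpsertStrategies": {
      "clickCount": "INCREMENT",
      "deviceIds": "APPEND",
      "lastSeen": "OVERWRITE"
    }
  }
}
```

Here clickCount is summed across updates, deviceIds accumulates values into a multi-value column, and any column without an explicit strategy falls back to OVERWRITE.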