How Streaming Ingestion Works in Pinot
The ingestion process starts with configuring a REALTIME table that connects to a streaming source, after which Pinot fetches and processes records from the stream. Each partition of a Kafka topic (or its equivalent in other platforms) maps to its own consuming segment in Pinot, giving fine-grained control over offset tracking and recovery. As events flow in, a stream ingestion plugin parses, transforms, and indexes records according to the table's schema and ingestion config. Incoming events accumulate in in-memory consuming segments. When a flush threshold is met (such as realtime.segment.flush.threshold.rows or realtime.segment.flush.threshold.time), the consuming segment is sealed and flushed to disk, and a new consuming segment is created for continued ingestion. Sealed segments are pushed to deep storage and become part of the table's queryable data.
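As a concrete illustration, here is a minimal sketch of a REALTIME table config with a Kafka streamConfigs block. The table name, schema name, topic, broker address, and threshold values are hypothetical placeholders; the property keys follow Pinot's documented stream configuration format:

```json
{
  "tableName": "events",
  "tableType": "REALTIME",
  "segmentsConfig": {
    "timeColumnName": "eventTime",
    "schemaName": "events",
    "replication": "1"
  },
  "tableIndexConfig": {
    "streamConfigs": {
      "streamType": "kafka",
      "stream.kafka.topic.name": "events",
      "stream.kafka.broker.list": "localhost:9092",
      "stream.kafka.consumer.type": "lowlevel",
      "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
      "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
      "realtime.segment.flush.threshold.rows": "1000000",
      "realtime.segment.flush.threshold.time": "6h"
    }
  },
  "tenants": {},
  "metadata": {}
}
```

With a config along these lines, each Kafka partition gets its own consuming segment; once a segment accumulates roughly one million rows or has been consuming for six hours, it is sealed and a fresh consuming segment takes over.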
Supported Stream Sources
Apache Pinot natively supports several popular streaming platforms for real-time ingestion:
- Apache Kafka: The most widely used source, with mature integration and offset tracking.
- Amazon Kinesis: Supported via plugins; suitable for AWS-native architectures (see the sketch after this list for the plugin swap).
- Apache Pulsar: Increasingly popular for event-driven systems with built-in multitenancy.
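Switching sources mostly means changing the streamType and the plugin-specific keys inside streamConfigs. The fragment below is a hedged sketch of what the Kinesis variant might look like; the stream name and region are placeholders, and the consumer factory class is the one shipped with Pinot's Kinesis plugin (verify the exact keys against your Pinot version):

```json
"streamConfigs": {
  "streamType": "kinesis",
  "stream.kinesis.topic.name": "events",
  "region": "us-east-1",
  "stream.kinesis.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kinesis.KinesisConsumerFactory",
  "realtime.segment.flush.threshold.rows": "1000000"
}
```

The flush-threshold properties are source-agnostic, so the segment lifecycle described above behaves the same regardless of which streaming platform feeds the table.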