Streaming Ingestion in Apache Pinot
Streaming ingestion in Apache Pinot enables real-time data to be ingested, indexed, and queried within seconds of arrival. This capability forms the backbone of low-latency analytics use cases such as anomaly detection, dashboards, personalization engines, and monitoring platforms. Unlike batch ingestion, which processes static historical datasets, streaming ingestion is designed for continuously evolving data, ingesting millions of events per second from modern event platforms like Apache Kafka, Amazon Kinesis, and Apache Pulsar.
Streaming ingestion populates REALTIME tables in Pinot, which are optimized for append-only data streams. These tables ingest data into in-memory consuming segments, which are sealed and persisted periodically based on time or size thresholds. This architecture lets Pinot serve low-latency queries over streaming data while seamlessly managing freshness and durability.
How Streaming Ingestion Works in Pinot
The ingestion process starts with configuring a REALTIME table to connect to a streaming source, from which Pinot fetches and processes records. Each partition of a Kafka topic (or its equivalent in other platforms) maps to its own consuming segment in Pinot, giving fine-grained, per-partition control over offset tracking and recovery.
As events flow in, Pinot uses a stream ingestion plugin that parses, transforms, and indexes records according to the table's schema and ingestion config. Events accumulate in an in-memory consuming segment until a threshold (such as realtime.segment.flush.threshold.rows or realtime.segment.flush.threshold.time) is met. The segment is then sealed and flushed to disk, and a new consuming segment is created for continued ingestion. Sealed segments are pushed to deep storage and become part of the table's queryable data.
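For reference, these thresholds live inside the table's streamConfigs map. A minimal sketch, using the standard Apache Pinot property names (the values here are illustrative, not recommendations):

```json
{
  "realtime.segment.flush.threshold.rows": "1000000",
  "realtime.segment.flush.threshold.time": "6h",
  "realtime.segment.flush.threshold.segment.size": "200M"
}
```

In recent Pinot versions, setting the rows threshold to 0 tells Pinot to auto-size segments toward the target segment size instead. The full table config in which this map lives is shown under Configuring Realtime Ingestion below.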
Supported Stream Sources
Apache Pinot natively supports several popular streaming platforms for real-time ingestion, each wired in through a pluggable consumer factory (sketched after this list):
- Apache Kafka: The most widely used source, with mature integration and offset tracking.
- Amazon Kinesis: Supported via plugins; suitable for AWS-native architectures.
- Apache Pulsar: Increasingly popular for event-driven systems with built-in multitenancy.
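Each source is selected via the streamType value and a consumer factory class in streamConfigs, set through the stream.<type>.consumer.factory.class.name property (e.g. stream.kafka.consumer.factory.class.name for Kafka). The mapping below sketches the factory classes shipped in Pinot's plugin modules; exact class names can vary by release, so verify against your Pinot version:

```json
{
  "kafka": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
  "kinesis": "org.apache.pinot.plugin.stream.kinesis.KinesisConsumerFactory",
  "pulsar": "org.apache.pinot.plugin.stream.pulsar.PulsarConsumerFactory"
}
```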
Streaming Table Configuration
A streaming table in Pinot is defined by a REALTIME table config, which includes ingestion-specific details like stream type, topic name, consumer type, and decoder class. For example, a Kafka-backed REALTIME table specifies the Kafka broker list, the topic, and stream-level properties such as the offset reset policy, flush thresholds, and pass-through consumer settings; a complete example appears under Configuring Realtime Ingestion below.
Segment Lifecycle and Stream Consumption
Each streaming segment goes through a defined lifecycle. While a segment is actively ingesting data, it is in the CONSUMING state. After reaching a size or time threshold, it is sealed, persisted, and moved to the ONLINE state, and the system creates a new consuming segment for the same stream partition.
In the event of a failure or restart, Pinot recovers by resuming consumption from the last committed offsets, providing at-least-once delivery by default; combined with upserts or deduplication, this yields effectively exactly-once query results. Helix manages segment assignments, state transitions, and failure recovery across the cluster, ensuring high availability and continuity.
Configuring Realtime Ingestion
Below is a sample configuration for a Pinot REALTIME table that consumes data from a Kafka topic named events_topic. This is a minimal sketch: the table and schema names, broker address, and threshold values are illustrative, and the decoder shown assumes JSON-encoded messages.
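```json
{
  "tableName": "events",
  "tableType": "REALTIME",
  "segmentsConfig": {
    "schemaName": "events",
    "timeColumnName": "eventTime",
    "replication": "1",
    "retentionTimeUnit": "DAYS",
    "retentionTimeValue": "7"
  },
  "tenants": {},
  "tableIndexConfig": {
    "loadMode": "MMAP",
    "streamConfigs": {
      "streamType": "kafka",
      "stream.kafka.topic.name": "events_topic",
      "stream.kafka.broker.list": "localhost:9092",
      "stream.kafka.consumer.type": "lowlevel",
      "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
      "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
      "stream.kafka.consumer.prop.auto.offset.reset": "smallest",
      "realtime.segment.flush.threshold.rows": "1000000",
      "realtime.segment.flush.threshold.time": "6h"
    }
  },
  "metadata": {}
}
```

The matching schema (not shown) defines the columns and declares eventTime as the time column; Avro and other decoders can be substituted for the JSON decoder as needed.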
Support for Upsert
Pinot supports full upserts and partial upserts on REALTIME tables. This enables record deduplication and fine-grained mutation of streaming records using a primary key.
Full Upsert
When using full upsert, the latest record (determined by a comparison column, usually a timestamp) entirely replaces the older one for a given primary key, as sketched below.
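As a sketch, full upsert is enabled in the table config via upsertConfig; the column name below is illustrative, and upsert additionally requires primaryKeyColumns to be declared in the schema (field names follow recent Pinot releases, where the comparison column is given as a list):

```json
{
  "upsertConfig": {
    "mode": "FULL",
    "comparisonColumns": ["updatedAt"]
  },
  "routing": {
    "instanceSelectorType": "strictReplicaGroup"
  }
}
```

The strictReplicaGroup instance selector keeps all segments of a stream partition served from the same replica group, which upsert relies on for consistent primary-key lookups.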
Partial Upsert
Partial upserts merge only certain columns based on a per-column strategy (such as INCREMENT, APPEND, or OVERWRITE), while the remaining columns follow a default strategy. Partial upsert is commonly used in metrics systems or user state tracking, where some fields are additive and others replace previous values; a sketch follows.
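A minimal sketch of a partial-upsert configuration, assuming hypothetical columns clickCount, deviceIds, and lastSeen (the key and strategy names follow Pinot's documented upsertConfig fields):

```json
{
  "upsertConfig": {
    "mode": "PARTIAL",
    "defaultPartialUpsertStrategy": "OVERWRITE",
    "partialUpsertStrategies": {
      "clickCount": "INCREMENT",
      "deviceIds": "APPEND",
      "lastSeen": "OVERWRITE"
    }
  }
}
```

Here clickCount is summed across updates, deviceIds accumulates values into a multi-value column, and any column without an explicit strategy falls back to OVERWRITE.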