Batch Ingestion in Apache Pinot with Minions
Batch ingestion in Apache Pinot is the process of importing large volumes of static or historical data from sources like cloud storage (S3, GCS, ADLS), distributed file systems (HDFS), or SQL-based data warehouses (Snowflake, BigQuery) into Pinot’s offline tables, which are optimized for high-performance analytical queries. Traditionally, batch ingestion into Pinot relied on external orchestration frameworks such as Apache Spark or Hadoop. With the introduction of Pinot Minions, however, this process can be handled natively within the Pinot cluster itself: Minions allow ingestion workflows and post-processing tasks to be automated, distributed, and managed internally, without requiring external job schedulers.
Pinot Minions: Cluster-Native Ingestion & Processing
Pinot Minions are lightweight, background workers that execute asynchronous tasks on Pinot segments. They are orchestrated by the Pinot Controller and coordinated using Apache Helix, which handles task assignments, execution tracking, and failure recovery. Minions enable Pinot to run ingestion and data lifecycle operations, such as segment generation, merging, transformation, and movement, without interrupting query-serving processes. They provide a scalable and fault-tolerant mechanism to handle ingestion and processing tasks in a distributed manner.
Common Pinot Minion Ingestion Tasks
The File Ingestion Task offers a simple way to ingest files from cloud storage (S3, GCS, ADLS) directly into Pinot without external orchestration. It reads input files from configured directories in a wide range of formats, including JSON, CSV, Avro, and Parquet, and transforms them into segments ready for ingestion. This task handles small to large ingestion jobs: lightweight, frequent ingestion such as periodically picking up new logs or exports dropped into a cloud storage bucket, as well as one-time ingestion of historical data. More details on the File Ingestion Task can be found here.
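As a rough sketch, a File Ingestion Task is enabled through the task configuration of the offline table. The property names below (inputDirURI, inputFormat, and the input.fs.className filesystem plugin) are illustrative assumptions; refer to the File Ingestion Task documentation for the exact properties it supports.

```json
{
  "tableName": "weblogs_OFFLINE",
  "task": {
    "taskTypeConfigsMap": {
      "FileIngestionTask": {
        "inputDirURI": "s3://my-bucket/weblogs/",
        "inputFormat": "parquet",
        "input.fs.className": "org.apache.pinot.plugin.filesystem.S3PinotFS"
      }
    }
  }
}
```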
The Delta Ingestion Task is particularly useful for ingesting data from Delta Lake, especially when only a subset of the data needs to be ingested, such as newly arrived records for a given time range. Instead of reprocessing the entire dataset, the task ingests only the delta, defined by a start and end timestamp, which improves efficiency and reduces processing time. It is well suited for append-only datasets like event logs, transactions, or metric streams that arrive periodically. More details on the Delta Ingestion Task can be found here.
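A hypothetical configuration sketch is shown below; the task type name and the property names for the Delta table location and the ingestion window are assumptions for illustration and should be checked against the Delta Ingestion Task documentation.

```json
{
  "task": {
    "taskTypeConfigsMap": {
      "DeltaTableIngestionTask": {
        "delta.table.uri": "s3://my-bucket/delta/transactions/",
        "window.startTime": "2024-06-01T00:00:00Z",
        "window.endTime": "2024-06-02T00:00:00Z"
      }
    }
  }
}
```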
Another critical task is the SQL Connector Batch Push Task, which allows Pinot to fetch data directly from SQL-based data warehouses and databases such as Snowflake or Google BigQuery. With this connector, you write a SQL query that selects a subset of rows from a source table, and Pinot uses that result set to generate and ingest offline segments. This provides a seamless pipeline for importing analytical data from structured databases into Pinot without intermediate ETL systems. More details on the SQL Connector Batch Push Task can be found here.
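The sketch below shows what such a task configuration might look like for Snowflake; the connector, query, and connection property names are illustrative assumptions rather than the connector's exact keys.

```json
{
  "task": {
    "taskTypeConfigsMap": {
      "SqlConnectorBatchPushTask": {
        "sql.connector": "snowflake",
        "sql.query": "SELECT order_id, amount, order_ts FROM orders WHERE order_ts >= '2024-06-01'",
        "sql.connection.url": "jdbc:snowflake://<account>.snowflakecomputing.com/"
      }
    }
  }
}
```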
How to Schedule or Trigger Minion Tasks
Minion tasks can be scheduled automatically or triggered manually. For scheduled execution, tasks are defined in the Pinot table configuration under the task section. Each task type can be associated with a schedule, for example running a File Ingestion Task every 12 hours. The Pinot Controller parses these schedules, launches tasks at the configured intervals, and assigns them to available Minion instances.
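In Apache Pinot, per-table task schedules are expressed as Quartz cron expressions via the schedule key inside taskTypeConfigsMap. The snippet below, reusing the File Ingestion Task shown earlier as an example, would run it every 12 hours:

```json
{
  "task": {
    "taskTypeConfigsMap": {
      "FileIngestionTask": {
        "schedule": "0 0 */12 * * ?"
      }
    }
  }
}
```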
For manual execution, Pinot exposes REST endpoints that can be used to trigger tasks on demand. By sending a POST request to the Controller’s /tasks/schedule endpoint and specifying the task type (and optionally the target table), operators can initiate ingestion or transformation processes at any time.
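For example, a request like the following (host, port, task type, and table name are placeholders) asks the Controller to schedule tasks of the given type for a table:

```bash
curl -X POST \
  "http://<controller-host>:9000/tasks/schedule?taskType=FileIngestionTask&tableName=weblogs_OFFLINE"
```

The Controller responds with the names of the generated tasks, which can then be tracked through the task-related endpoints in the Controller API.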