Minion-Based Backfill Task
Overview
As data governance, compliance, and quality requirements evolve, organizations need precise tools to maintain data in real-time analytics platforms. The Minion-Based Backfill Task in StarTree Pinot addresses this need by selectively deleting and backfilling data, without complex manual workarounds.
Key Capabilities
Selective Data Deletion
The Backfill Task supports multiple modes to select data for deletion:
- Time range (e.g., a specific day or hour)
- Dimension filters (e.g., currency = EUR, region = APAC)
- Combination of time range and dimension filters
It is recommended to configure one of these criteria for data deletion. If deletion is not required, use the File Ingestion Task instead.
Selective Data Backfill
Users can re-ingest targeted subsets of data, such as corrected records or missing slices, without reprocessing entire partitions or tables.
Atomic and Consistent Updates
Backfill operations are atomic by design. When segment replacement is involved, either all related segments are updated or none are, maintaining table integrity and avoiding partial updates or broken transitions.
How Does it Work?
The Backfill Task is initiated via an ad-hoc execution API and operates on time-partitioned data with all-or-none consistency. It consists of four phases:
1. Segment Selector
Filters segments based on deletion or purge criteria. It identifies full deletions and partial purges, and generates tasks of type Purge, Delete, or Ingest.
2. Segment Purger
Removes specific records from segments using defined filters, without deleting entire segments.
3. Segment Deletor
Completely deletes segments that the Segment Selector marked for removal.
4. Data Ingestor
Uses the FileIngestionTask to bring in new data by creating segments that replace or append to existing ones.
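The Segment Selector phase described above can be illustrated with a minimal sketch. This is not the actual implementation: the `Segment` class, the `select_task_type` and `plan_tasks` functions, the half-open `[start, end)` window, and the rule that a fully covered segment is only deleted when no dimension filter is set are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Segment:
    """Hypothetical view of a segment: a name and its min/max record timestamps."""
    name: str
    start_ms: int
    end_ms: int


def select_task_type(seg: Segment, start_ms: int, end_ms: int,
                     has_dimension_filter: bool) -> Optional[str]:
    """Classify one segment against the backfill window [start_ms, end_ms)."""
    # Assumption: segments that do not overlap the window are left untouched.
    if seg.end_ms < start_ms or seg.start_ms >= end_ms:
        return None
    fully_covered = start_ms <= seg.start_ms and seg.end_ms < end_ms
    # Assumption: a whole segment can be dropped only when every record in it
    # matches, i.e. it lies inside the window and no dimension filter applies.
    if fully_covered and not has_dimension_filter:
        return "Delete"
    # Otherwise only the matching records are removed from the segment.
    return "Purge"


def plan_tasks(segments: List[Segment], start_ms: int, end_ms: int,
               has_dimension_filter: bool,
               input_dir: Optional[str] = None) -> List[Tuple[str, str]]:
    """Return (task type, target) pairs in the order they were planned."""
    plan = []
    for seg in segments:
        task_type = select_task_type(seg, start_ms, end_ms, has_dimension_filter)
        if task_type is not None:
            plan.append((task_type, seg.name))
    # Assumption: one Ingest task is planned when replacement data is supplied.
    if input_dir is not None:
        plan.append(("Ingest", input_dir))
    return plan
```

In this sketch, a segment that only partly overlaps the window, or any overlapping segment when a dimension filter is present, becomes a Purge task rather than a Delete task.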
Limitations
Constraint | Description |
---|---|
No Native Scheduling Support | Triggered on demand via API; users must manage automation. |
No Preview or Dry Run Mode | No simulation capability; filters should be validated thoroughly. |
No Support for Upsert-Enabled Tables | Not supported currently. |
Task Interference | Disable other tasks (FileIngestion, SegmentRefresh, AlterTableTask) during Backfill execution. |
Configuration Parameters
The following table lists the important configuration parameters:
Parameter | Description | Accepted Values / Format | Example |
---|---|---|---|
backfill.start.time.ms | Start timestamp for purge/replace | Epoch ms | 1727748000000 |
backfill.end.time.ms | End timestamp | Epoch ms | 1727755200000 |
backfill.segment.field.names | Multi-dimensional filters | Comma-separated | industry,country |
backfill.segment.field.values | Values for field names | Comma-separated | ENERGY,Germany |
backfill.comparison.operator | Field comparison logic | =, != | "!=" |
backfill.input.dir | Data source path | File path | /data/clean/stocks/2024-10-01 |
backfill.logical.operator | Combines multiple filters | &&, \|\| | \|\| |
backfill.input.format | Input file format | CSV, JSON, etc. | CSV |
Sample Task Configuration
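As a minimal sketch, the snippet below assembles a configuration map from the parameters in the table above. Only the `backfill.*` keys and the accepted operator values come from that table. The helper names `to_epoch_ms` and `build_backfill_config`, the validation rules, and the choice to send every value as a string are assumptions, and the ad-hoc execution API call that would submit the map is not shown.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Sequence

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_ms(dt: datetime) -> int:
    """Convert a timezone-aware datetime to epoch milliseconds."""
    if dt.tzinfo is None:
        raise ValueError("use a timezone-aware datetime")
    return (dt - EPOCH) // timedelta(milliseconds=1)


def build_backfill_config(start: datetime, end: datetime,
                          input_dir: str, input_format: str,
                          field_names: Sequence[str] = (),
                          field_values: Sequence[str] = (),
                          comparison_operator: str = "=",
                          logical_operator: str = "&&") -> Dict[str, str]:
    """Build a hypothetical task config map from the documented parameters."""
    if start >= end:
        raise ValueError("start must be before end")
    # Assumption: names and values are paired by position.
    if len(field_names) != len(field_values):
        raise ValueError("field names and values must have the same length")
    if comparison_operator not in ("=", "!="):
        raise ValueError("comparison operator must be = or !=")
    if logical_operator not in ("&&", "||"):
        raise ValueError("logical operator must be && or ||")

    config = {
        "backfill.start.time.ms": str(to_epoch_ms(start)),
        "backfill.end.time.ms": str(to_epoch_ms(end)),
        "backfill.input.dir": input_dir,
        "backfill.input.format": input_format,
    }
    if field_names:
        config["backfill.segment.field.names"] = ",".join(field_names)
        config["backfill.segment.field.values"] = ",".join(field_values)
        config["backfill.comparison.operator"] = comparison_operator
        config["backfill.logical.operator"] = logical_operator
    return config


if __name__ == "__main__":
    cfg = build_backfill_config(
        datetime(2024, 10, 1, 2, tzinfo=timezone.utc),
        datetime(2024, 10, 1, 4, tzinfo=timezone.utc),
        "/data/clean/stocks/2024-10-01", "CSV",
        ["industry", "country"], ["ENERGY", "Germany"], "=", "||")
    for key, value in cfg.items():
        print(f"{key} = {value}")
```

The example timestamps in the parameter table correspond to 02:00 and 04:00 UTC on 2024-10-01, which is the window used in the usage block.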
Key Use Case Scenarios
1. Delete a Subset of Table
Remove data/segments for a specific time range.
2. Replace a Subset of the Table
Delete existing data and replace it with clean data for a specific time range.
3. Mixed Operation: Delete Some, Replace Others, and Add New Segments
A table has inconsistent data: some segments are invalid and should be removed, some are outdated and need replacing, and new records need to be appended. Perform a combination of:
- Deleting some segments outright
- Replacing a set of segments
- Ingesting entirely new segments
4. Replace Data Matching Multiple Values
Example: currency is EUR or USD
Example: industry is ENERGY or country is Germany
Example: currency is not EUR
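The three examples above can be read as combinations of the field names, field values, comparison operator, and logical operator parameters. The sketch below shows one plausible interpretation and is not the task's actual matching logic: the `matches` function is hypothetical, names and values are assumed to pair by position, and repeating a field name to test several values is an assumption.

```python
import operator
from typing import Dict

COMPARATORS = {"=": operator.eq, "!=": operator.ne}


def matches(record: Dict[str, str], field_names: str, field_values: str,
            comparison_operator: str = "=",
            logical_operator: str = "||") -> bool:
    """Return True if the record is selected by the filter settings."""
    names = field_names.split(",")
    values = field_values.split(",")
    # Assumption: the i-th name is compared against the i-th value.
    if len(names) != len(values):
        raise ValueError("field names and values must have the same length")
    if logical_operator not in ("&&", "||"):
        raise ValueError("logical operator must be && or ||")
    compare = COMPARATORS[comparison_operator]
    results = [compare(record.get(name), value)
               for name, value in zip(names, values)]
    # "||" selects the record if any comparison holds, "&&" only if all hold.
    return any(results) if logical_operator == "||" else all(results)


if __name__ == "__main__":
    # currency is EUR or USD
    print(matches({"currency": "USD"}, "currency,currency", "EUR,USD"))
    # industry is ENERGY or country is Germany
    print(matches({"industry": "TECH", "country": "Germany"},
                  "industry,country", "ENERGY,Germany"))
    # currency is not EUR
    print(matches({"currency": "USD"}, "currency", "EUR", "!=", "&&"))
```

Because no dry-run mode is available, checking filter settings against a few sample records in this way before triggering the task is one option for validating them.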
5. Replace with Time and Multi-dimension Filter
6. Add New Data or Segments Only
Use the File Ingestion Task when no deletion is needed.