Overview

As data governance, compliance, and quality expectations evolve, organizations need more precise tools to maintain data in real-time analytics platforms. The Minion-Based Backfill Task in StarTree Pinot addresses this need by enabling selective deletion and backfill of data, without complex manual workarounds.

Key Capabilities

Selective Data Deletion

The Backfill Task supports multiple modes to select data for deletion:

  • Time range (e.g., a specific day or hour)
  • Dimension filters (e.g., currency = EUR, region = APAC)
  • Combination of time range and dimension filters

Configure one of these criteria whenever data needs to be deleted. If deletion is not required, use the File Ingestion Task instead.

Selective Data Backfill

Users can re-ingest targeted subsets of data, such as corrected records or missing slices, without reprocessing entire partitions or tables.

Atomic and Consistent Updates

Backfill operations are atomic by design. When segment replacement is involved, either all related segments are updated or none are, maintaining table integrity and avoiding partial updates or broken transitions.

How Does it Work?

The Backfill Task is initiated via an ad-hoc execution API and operates on time-partitioned data with all-or-none consistency. It consists of four phases:

1. Segment Selector

Filters segments against the deletion or purge criteria. It distinguishes segments to delete in full from segments that need only a partial purge, and generates tasks of type Purge, Delete, or Ingest.

2. Segment Purger

Removes specific records from segments using defined filters, without deleting entire segments.

3. Segment Deletor

Completely deletes segments that the Segment Selector marked for removal.

4. Data Ingestor

Uses the FileIngestionTask to bring in new data by creating segments that replace or append to existing ones.
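
The hand-off between the phases can be pictured with a small Python sketch. It is an illustration only: the real Segment Selector's rules are not spelled out here, so the sketch assumes a purely time-based decision (a segment fully inside the window is deleted, a segment that only overlaps it is purged) and ignores dimension filters. The `Segment` class and `plan_tasks` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """Hypothetical stand-in for a segment's time metadata."""
    name: str
    start_ms: int  # first timestamp covered by the segment
    end_ms: int    # last timestamp covered by the segment


def plan_tasks(segments, window_start_ms, window_end_ms, input_dir=None):
    """Return (task_type, target) pairs for a backfill window.

    Assumed rule: fully covered segments are deleted, partially
    covered segments are purged, untouched segments are skipped.
    """
    tasks = []
    for seg in segments:
        if seg.end_ms < window_start_ms or seg.start_ms > window_end_ms:
            continue  # no overlap with the window
        if window_start_ms <= seg.start_ms and seg.end_ms <= window_end_ms:
            tasks.append(("Delete", seg.name))
        else:
            tasks.append(("Purge", seg.name))
    if input_dir is not None:
        # One ingest task for the replacement data, if any was supplied.
        tasks.append(("Ingest", input_dir))
    return tasks


segments = [
    Segment("a", 0, 9),
    Segment("b", 5, 15),
    Segment("c", 10, 20),
    Segment("d", 30, 40),
]
plan = plan_tasks(segments, 10, 20, input_dir="/path/to/data")
```

With the window 10 to 20, segment `b` is purged, segment `c` is deleted, and segments `a` and `d` are left alone.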

Limitations

Constraint                            Description
No Native Scheduling Support          Triggered on demand via API; users must manage automation.
No Preview or Dry Run Mode            No simulation capability; filters should be validated thoroughly.
No Support for Upsert-Enabled Tables  Not supported currently.
Task Interference                     Disable other tasks (FileIngestion, SegmentRefresh, AlterTableTask) during Backfill execution.

Configuration Parameters

A reference table of important configuration parameters:

Parameter                      Description                        Accepted Values / Format  Example
backfill.start.time.ms         Start timestamp for purge/replace  Epoch ms                  1727748000000
backfill.end.time.ms           End timestamp                      Epoch ms                  1727755200000
backfill.segment.field.names   Multi-dimensional filters          Comma-separated           industry,country
backfill.segment.field.values  Values for field names             Comma-separated           ENERGY,Germany
backfill.comparison.operator   Field comparison logic             =, !=                     "!="
backfill.input.dir             Data source path                   File path                 /data/clean/stocks/2024-10-01
backfill.logical.operator      Combines multiple filters          &&, ||                    ||
backfill.input.format          Input file format                  CSV, JSON, etc.           CSV
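
Because the task has no dry-run mode, it can help to check a configuration locally before submitting it. The Python sketch below is one possible client-side check, not part of the product. The function name `validate_backfill_config` is hypothetical, and two of its rules are assumptions rather than documented behavior: that start and end times must be set together, and that field names and values must pair up one to one.

```python
ALLOWED_COMPARISON = {"=", "!="}
ALLOWED_LOGICAL = {"&&", "||"}


def _split(raw):
    """Split a comma-separated setting into trimmed parts."""
    return [part.strip() for part in raw.split(",")] if raw else []


def validate_backfill_config(cfg):
    """Return a list of problems found; an empty list means none were found."""
    problems = []

    start = cfg.get("backfill.start.time.ms")
    end = cfg.get("backfill.end.time.ms")
    if (start is None) != (end is None):
        # Assumption: a window needs both bounds.
        problems.append("start and end time must be set together")
    elif start is not None and int(start) >= int(end):
        problems.append("start time must be before end time")

    names = _split(cfg.get("backfill.segment.field.names"))
    values = _split(cfg.get("backfill.segment.field.values"))
    if len(names) != len(values):
        # Assumption: names and values are matched by position.
        problems.append("field names and values differ in length")

    comparison = cfg.get("backfill.comparison.operator")
    if comparison is not None and comparison not in ALLOWED_COMPARISON:
        problems.append("unknown comparison operator: " + comparison)

    logical = cfg.get("backfill.logical.operator")
    if logical is not None and logical not in ALLOWED_LOGICAL:
        problems.append("unknown logical operator: " + logical)

    if start is None and end is None and not names:
        problems.append("no deletion criteria; consider the File Ingestion Task")

    return problems
```

An empty result only means that none of these particular checks failed; it does not guarantee that the filters select the intended data.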

Sample Task Configuration

{
  "task": {
    "taskTypeConfigsMap": {
      "SegmentBackfillTask": {
        "backfill.start.time.ms": 1735689601000,
        "backfill.end.time.ms": 1735775999000,
        "backfill.input.dir": "/path/to/data",
        "backfill.input.format": "json",
        "input.fs.className": "org.apache.pinot.plugin.filesystem.S3PinotFS",
        "input.fs.prop.accessKey": "MY_ACCESS_KEY",
        "input.fs.prop.secretKey": "MY_SECRET_KEY",
        "input.fs.prop.region": "us-west-2"
      }
    }
  }
}
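
Since the task is triggered through an ad-hoc execution API, a request could be assembled along the following lines. This Python sketch rests on an assumption that should be checked against your deployment: that the ad-hoc API is the Pinot controller's `POST /tasks/execute` endpoint, taking `taskType`, `tableName`, and `taskConfigs` fields. The controller URL and the table name `stocks` are placeholders. The request is only built here, not sent.

```python
import json
import urllib.request


def build_adhoc_request(controller_url, table_name, task_configs):
    """Build (but do not send) an ad-hoc task execution request.

    Assumption: the controller exposes POST /tasks/execute and expects
    string-valued task configs, so every value is converted to str.
    """
    body = {
        "taskType": "SegmentBackfillTask",
        "tableName": table_name,
        "taskConfigs": {key: str(value) for key, value in task_configs.items()},
    }
    return urllib.request.Request(
        url=controller_url.rstrip("/") + "/tasks/execute",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_adhoc_request(
    "http://localhost:9000/",  # placeholder controller address
    "stocks",                  # placeholder table name
    {
        "backfill.start.time.ms": 1735689601000,
        "backfill.end.time.ms": 1735775999000,
        "backfill.input.dir": "/path/to/data",
        "backfill.input.format": "json",
    },
)
```

Passing `request` to `urllib.request.urlopen` would submit it; authentication headers, if your cluster requires them, would need to be added first.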

Key Use Case Scenarios

1. Delete a Subset of Table

Remove data/segments for a specific time range.

{
  "backfill.start.time.ms": "<start_timestamp_ms>",
  "backfill.end.time.ms": "<end_timestamp_ms>"
}
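
The timestamp placeholders are epoch milliseconds. As a convenience, the Python sketch below derives them from timezone-aware datetimes; `to_epoch_ms` and `window_config` are hypothetical helper names. Whether the end timestamp is treated as inclusive or exclusive is not stated in this document, so choose the boundary with care.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_ms(dt):
    """Convert a timezone-aware datetime to whole epoch milliseconds."""
    if dt.tzinfo is None:
        raise ValueError("pass a timezone-aware datetime")
    # Integer division of timedeltas avoids floating-point rounding.
    return (dt - EPOCH) // timedelta(milliseconds=1)


def window_config(start, end):
    """Build the two time-window settings as strings."""
    return {
        "backfill.start.time.ms": str(to_epoch_ms(start)),
        "backfill.end.time.ms": str(to_epoch_ms(end)),
    }


# 02:00 to 04:00 UTC on 2024-10-01
cfg = window_config(
    datetime(2024, 10, 1, 2, tzinfo=timezone.utc),
    datetime(2024, 10, 1, 4, tzinfo=timezone.utc),
)
```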

2. Replace a Subset of the Table

Delete existing data and replace it with clean data for a specific time range.

{
  "backfill.start.time.ms": "<start_timestamp_ms>",
  "backfill.end.time.ms": "<end_timestamp_ms>",
  "backfill.input.dir": "/path/to/clean/data",
  "backfill.input.format": "CSV"
}

3. Mixed Operation: Delete Some, Replace Others, and Add New Segments

A table has inconsistent data: some segments are invalid and should be removed, some are outdated and need replacing, and new records need to be appended. Perform a combination of:

  • Deleting some segments outright
  • Replacing a set of segments
  • Ingesting entirely new segments

{
  "backfill.segment.list": "segment_old1, segment_old2",
  "backfill.start.time.ms": "<start_timestamp_ms>",
  "backfill.end.time.ms": "<end_timestamp_ms>",
  "backfill.input.dir": "/path/to/new/data",
  "backfill.input.format": "CSV"
}

4. Replace Data Matching Multiple Values

Example: currency is EUR or USD

{
  "backfill.segment.field.names": "currency,currency",
  "backfill.segment.field.values": "EUR,USD",
  "backfill.logical.operator": "||",
  "backfill.input.dir": "/path/to/data",
  "backfill.input.format": "CSV"
}

Example: industry is ENERGY or country is Germany

{
  "backfill.segment.field.names": "industry,country",
  "backfill.segment.field.values": "ENERGY,Germany",
  "backfill.logical.operator": "||",
  "backfill.input.dir": "/path/to/data",
  "backfill.input.format": "CSV"
}

Example: currency is not EUR

{
  "backfill.segment.field.names": "currency",
  "backfill.segment.field.values": "EUR",
  "backfill.comparison.operator": "!=",
  "backfill.input.dir": "/path/to/data",
  "backfill.input.format": "CSV"
}
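
One way to read the three filter examples above is as a per-record predicate. The Python sketch below models that reading; it is not the task's actual implementation. It assumes that names and values are paired by position, that the comparison operator defaults to `=`, and that the logical operator defaults to `&&`. None of these defaults is confirmed by this document, and `build_predicate` is a hypothetical name.

```python
import operator

COMPARATORS = {"=": operator.eq, "!=": operator.ne}


def build_predicate(cfg):
    """Return a function that tells whether a record matches the filters."""
    names = [n.strip() for n in cfg["backfill.segment.field.names"].split(",")]
    values = [v.strip() for v in cfg["backfill.segment.field.values"].split(",")]
    if len(names) != len(values):
        raise ValueError("field names and values must pair up")

    # Assumed defaults: "=" for comparison, "&&" for combination.
    compare = COMPARATORS[cfg.get("backfill.comparison.operator", "=")]
    combine = all if cfg.get("backfill.logical.operator", "&&") == "&&" else any
    pairs = list(zip(names, values))

    def matches(record):
        return combine(compare(record.get(name), value) for name, value in pairs)

    return matches


eur_or_usd = build_predicate({
    "backfill.segment.field.names": "currency,currency",
    "backfill.segment.field.values": "EUR,USD",
    "backfill.logical.operator": "||",
})
not_eur = build_predicate({
    "backfill.segment.field.names": "currency",
    "backfill.segment.field.values": "EUR",
    "backfill.comparison.operator": "!=",
})
```

Under these assumptions, `eur_or_usd` matches records whose currency is EUR or USD, and `not_eur` matches every record whose currency is anything else.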

5. Replace with Time and Multi-dimension Filter

{
  "backfill.start.time.ms": "<start_timestamp_ms>",
  "backfill.end.time.ms": "<end_timestamp_ms>",
  "backfill.segment.field.names": "currency,currency",
  "backfill.segment.field.values": "EUR,USD",
  "backfill.logical.operator": "||",
  "backfill.input.dir": "/path/to/data",
  "backfill.input.format": "CSV"
}

6. Add New Data or Segments Only

Use the File Ingestion Task when no deletion is needed.

FAQs & Recommendations