Offline Upserts - StarTree Docs

Introduction

Offline upserts allow an OFFLINE Pinot table to upsert records by primary key across multiple segments, presenting query results as if only the most-recent version of each record exists. In a standard OFFLINE table every segment is independent — pushing two segments that each contain a row for the same primary key means queries will return both rows. Offline upserts fix this: when a new segment is pushed, the server resolves conflicts with all previously loaded segments and surfaces only the winning record per primary key.

Record precedence ordering

When no explicit comparison column is configured, segment push time determines which record wins: the record from the segment pushed most recently always wins. This is the simplest setup for offline tables where each push represents a more-complete or corrected snapshot of the data. You can also configure an explicit comparison column (e.g., an eventTime timestamp). In that case the row with the highest value in the comparison column wins, regardless of push order.

Use Cases

Scenario	Why offline upserts help
Incremental corrections	A small fraction of rows in the table has changed (e.g., late-arriving corrections, status updates) and you want to push only the changed rows as a new segment without re-ingesting the whole table.
Snapshot uploads with overlap	An upstream job exports daily snapshots that may overlap (the same primary key appears in yesterday’s and today’s export). Offline upserts deduplicate automatically.

Offline upserts are a good fit when the number of rows being updated is a small-to-moderate fraction of the total table size (roughly less than ~40%). For larger fractions — where you are effectively replacing the whole table — consider Atomic Sync (see below).

How to Configure Offline Upserts

Enabling offline upserts requires changes to three places: the schema, the table config, and (optionally) the ingestion task config.

1. Schema — declare primary key columns

{
  "schemaName": "my_table",
  "primaryKeyColumns": ["userId"],
  "dimensionFieldSpecs": [
    { "name": "userId",  "dataType": "INT" },
    { "name": "status",  "dataType": "STRING" }
  ],
  "metricFieldSpecs": [
    { "name": "score", "dataType": "FLOAT" }
  ],
  "dateTimeFieldSpecs": [
    {
      "name": "eventTime",
      "dataType": "LONG",
      "format": "1:MILLISECONDS:EPOCH",
      "granularity": "1:MILLISECONDS"
    }
  ]
}

primaryKeyColumns can contain multiple columns — Pinot hashes them together as a composite key.

2. Table config

Three sections of the table config must be set together: upsertConfig, segmentPartitionConfig, and routing.

{
  "tableName": "my_table",
  "tableType": "OFFLINE",
  "segmentsConfig": {
    "timeColumnName": "eventTime"
  },
  "upsertConfig": {
    "mode": "FULL",
    "metadataManagerClass": "ai.startree.pinot.upsert.rocksdb.RocksDBTableUpsertMetadataManager"
  },
  "indexingConfig": {
    "segmentPartitionConfig": {
      "columnPartitionMap": {
        "userId": {
          "functionName": "Murmur",
          "numPartitions": 4
        }
      }
    }
  },
  "routing": {
    "instanceSelectorType": "strictReplicaGroup"
  }
}

Key points:

upsertConfig.mode must be "FULL". Partial upsert mode is not supported for offline tables (see Limitations).
upsertConfig.metadataManagerClass must point to the StarTree RocksDB implementation. If omitted the in-heap OSS default is used, which does not persist metadata across server restarts and is not recommended for production.
segmentPartitionConfig on the primary key column is recommended. The partition function and count define how the data is partitioned and also governs the overall scalability. Every segment for a given primary key must land on the same server — without this the server cannot see all versions of a key and deduplication will be incorrect. In general, the recommendation is to use a high partition count (eg: 128) to account for organic growth. Note that you cannot change this post table creation.
instanceSelectorType: strictReplicaGroup ensures that a query is routed to exactly one replica group. This is required so that the server’s local RocksDB view (which is per-partition) is authoritative for the keys it owns.

Optional: explicit comparison column

"upsertConfig": {
  "mode": "FULL",
  "metadataManagerClass": "ai.startree.pinot.upsert.rocksdb.RocksDBTableUpsertMetadataManager",
  "comparisonColumns": ["eventTime"]
}

When comparisonColumns is set, the row with the highest value in that column wins. When it is omitted, the row from the most-recently-pushed segment wins.

3. Partitioning upstream data with FileIngestionTask

For offline upserts to work correctly, every segment must be partitioned on the primary key: all rows with the same primary key value must reside in the same server. FileIngestionTask handles repartitioning automatically, even when the upstream S3 data is not pre-partitioned by primary key. On each execution the task executor reads segmentPartitionConfig from the table config and passes it to SegmentProcessorFramework, which physically sorts all ingested rows into per-partition buckets by applying the configured partition function (e.g., Murmur) across the configured number of partitions. Each output segment contains only rows that hash to a single partition bucket, satisfying the upsert requirement without any extra config. The segmentPartitionConfig in the table config acts as the contract: the task executor reads it to know how many partitions to create and which hash function to use, producing pre-partitioned segments that satisfy the upsert requirement automatically.

Offline Upserts vs. Atomic Sync

StarTree also offers Atomic Sync (via consistentPushSwapEnabled=true on FileIngestionTask), which replaces the entire table atomically. Both features update an offline table from S3, but they are designed for different scenarios.

Dimension	Offline Upserts	Atomic Sync
What it does	Merges new/updated rows into existing segments. Only changed rows need to be pushed.	Rebuilds and atomically swaps all segments in the table from the input source.
Granularity	Row-level upsert/deduplication by primary key.	File / Partition / Table-level replacement.
Data volume per push	Small-to-moderate fraction of the table (works well under ~40% of total rows being upserted).	Optimised for full partition or table refreshes (essentially 100% of rows).
Query consistency during push	New segments become visible as they are pushed; there is a brief window where both old and new versions of a row are in the table and the upsert filter resolves them.	Old and new sets of segments are swapped atomically — queries see either the old or the new version of the table, never a mix.
Index support	No StarTree index support (see Limitations).	StarTree indexes are supported which significantly improves query performance
Partial upsert	Not supported (FULL mode only).	Not applicable — full re-ingestion replaces all data.
When to choose	You are pushing incremental corrections, late data, or a small update batch. The primary key is known and stable.	You are regenerating the entire table (e.g., reprocessing pipeline, full restatement), or you require atomic cutover.

Rule of thumb: if less than ~40% of the table’s rows are changing in a given push cycle, offline upserts are the better fit — they avoid re-ingesting unchanged data and are cheaper to operate. Above ~40%, or when you need atomic visibility guarantees, use Atomic Sync.

Limitations

No partial upsert support. Only mode: FULL is supported for offline tables. FULL mode means the entire winning row replaces the entire losing row. Column-level merge strategies available in PARTIAL mode for realtime upsert tables are not available. Note that this is generally ok since column level update pattern is typically found in CDC/streaming use cases not batch. No StarTree index. Columns in an offline upsert table cannot have a StarTree (multi-dimensional pre-aggregation) index configured. The controller will reject such a table config at creation time. Segment push time ordering is implicit and irreversible. When comparisonColumns is not set, the server uses segment push time to break ties. Pushing an older segment after a newer one will cause the older data to win. Use an explicit comparisonColumns (e.g., a business event timestamp) if push order cannot be guaranteed.

​Introduction

​Record precedence ordering

​Use Cases

​How to Configure Offline Upserts

​1. Schema — declare primary key columns

​2. Table config

​3. Partitioning upstream data with FileIngestionTask

​Offline Upserts vs. Atomic Sync

​Limitations