Context

Workload isolation is a new feature in StarTree Cloud (starting with the 0.14 release) that allows users to physically isolate queries from different workloads on the same Pinot table. For instance, let’s assume we have the following two workloads:
  1. Workload A: user-facing (application) queries, characterized by a strict SLA and high impact to the business
  2. Workload B: ad-hoc queries, required for triaging/debugging and with low impact to the business
With this feature, we can isolate Workload A and Workload B using replica-group-based routing isolation.

Prerequisite

For this feature to work, the table must be configured to use replica groups and the ‘StarTreeReplicaGroupInstanceSelector’ instance selector type. For instance, let’s consider this sample table config:
{
  "instanceAssignmentConfigMap": {
    "OFFLINE": {
      "tagPoolConfig": {
        "tag": "Tag1_OFFLINE"
      },
      "replicaGroupPartitionConfig": {
        "replicaGroupBased": true,
        "numReplicaGroups": 3,
        "numInstancesPerReplicaGroup": 3
      }
    }
  },
  "routing": {
    "instanceSelectorType": "ai.startree.pinot.broker.routing.instanceselector.StarTreeReplicaGroupInstanceSelector"
  }
}
In this case, we’ve configured replica-group-based instance assignment and routing with a total of 3 replica groups (RG 0, RG 1, and RG 2), each containing 3 instances.

Note: If the current setup does not include replica groups, they must be enabled before using this feature. This involves changing the table config for all relevant tables and performing a rebalance operation.

How to use this feature

Preferred Replicas

Once enabled, you can configure isolation at the query level via the preferredReplicas query option. It accepts a comma-separated list of replica group IDs (derived from ideal state, not from instance partitions) that the broker should use to route the query. Suppose we want the following configuration:
  • Workload A: allowed to use both replica groups (0 and 1)
  • Workload B: restricted to replica group 2
Queries from Workload A can be written like this
SET preferredReplicas='0,1';
SELECT ... FROM table WHERE ...
Whereas queries from Workload B can be written like this
SET preferredReplicas='2';
SELECT ... FROM table WHERE ...

Preferred Replica list semantics

When multiple replica groups are listed, the broker rotates across them by request ID for load balancing; it does not treat the first entry as primary and the rest as failover. So preferredReplicas='0,1' routes roughly half of the requests entirely through RG 0 and the other half entirely through RG 1. To express “prefer RG 0, and fall back to RG 1 only when RG 0 cannot serve a segment”, use preferredReplicas='0' together with fallbackReplicas='1' (see the next section).

In the single-replica case, preferredReplicas='2' means the query is routed only to instances in replica group 2. If any segment is unavailable on RG 2, it is reported as unavailable (partial result) unless fallbackReplicas is set.

Therefore, in the happy path, Workload A will be served from RG 0 and RG 1 (load balanced), and Workload B will be served exclusively from RG 2.
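As a concrete sketch of the difference (the table name and replica-group IDs are illustrative, taken from the sample setup above):
-- Load balanced: roughly half of the requests are served entirely by RG 0,
-- the other half entirely by RG 1.
SET preferredReplicas='0,1';
SELECT ... FROM table WHERE ...

-- Strict preference: every segment is served from RG 0 when possible, and
-- only segments that RG 0 cannot serve are routed to RG 1.
SET preferredReplicas='0';
SET fallbackReplicas='1';
SELECT ... FROM table WHERE ...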

Fallback replicas

The fallbackReplicas query option is an optional companion to preferredReplicas. It accepts a comma-separated list of replica group IDs that the broker should consider only when none of the preferredReplicas can serve a given segment. Example:
SET preferredReplicas = '0';
SET fallbackReplicas  = '1,2';
SELECT ... FROM table WHERE ...
In the example above, the broker tries to serve every segment from RG 0; for any segment that RG 0 cannot serve, it tries RG 1, then RG 2, before declaring the segment unavailable.

Selection order

For each segment, the broker walks the preferred replicas first (rotated by request ID for load balancing within the preferred set), and only falls through to the fallbackReplicas (in the listed order, no rotation) when no preferred replica has the segment online. Cross-preferred-replica failover takes priority over fallback: with preferredReplicas='0,1' and fallbackReplicas='2', a segment that is offline on RG 0 but online on RG 1 is served from RG 1, and the fallback metric is not incremented.
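For instance, the priority described above applies to an illustrative query like this (same sample table as before):
SET preferredReplicas='0,1';
SET fallbackReplicas='2';
SELECT ... FROM table WHERE ...
Here, a segment that is offline on RG 0 but online on RG 1 is still served from the preferred set (RG 1); only a segment missing from both RG 0 and RG 1 is routed to RG 2 and counted against the fallback metric.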

Upsert / dedup tables

Falling back to a different replica group is safe even for upsert / dedup tables: if any segment in a partition is unavailable on an instance, the underlying selector removes that instance as a candidate for all segments in that partition, so the fallback replica is chosen consistently across the whole partition.

Observability

Whenever any segment in a query is served by a fallback replica, the broker increments the per-table meter pinot.broker.<table>.queriesUsingFallbackReplica (exposed to Prometheus as pinot_broker_queriesUsingFallbackReplica_* with a “table” label). Use this metric to alert on workload isolation being silently violated.

Recommended pattern: for strict isolation, set preferredReplicas only. For best-effort isolation with availability under partial outages, set preferredReplicas to your isolation target and fallbackReplicas to the rest, and alert on the queriesUsingFallbackReplica meter so you know when isolation is being broken. A concrete example of both patterns follows the table below.

Here are the key failover scenarios:
| Failure scenario | Implication (no fallbackReplicas) | Implication (with fallbackReplicas set) |
| --- | --- | --- |
| Segment replica failure within a preferred RG, but another preferred RG has the segment | Served from the other preferred RG (no metric increment) | Same as the previous column (fallback is not used) |
| Segment replica failure across all preferred RGs | Segment reported unavailable → partial result | Served from the first fallback RG that has the segment; the queriesUsingFallbackReplica meter is incremented |
| Segment replica failure on a preferred RG with upserts enabled | Entire affected partition is failed over to the next preferred RG | Entire partition fails over to the next available RG (preferred or fallback) that has it (consistent across the partition) |
| Segment unavailable on every preferred and fallback RG | Partial result | Partial result |
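As a concrete illustration of the recommended pattern above, here is what the two approaches might look like for Workload B, using the replica-group layout from the sample config (the queries themselves are placeholders):
-- Strict isolation: never leave RG 2; segments unavailable on RG 2 produce
-- a partial result instead of spilling onto RG 0 or RG 1.
SET preferredReplicas='2';
SELECT ... FROM table WHERE ...

-- Best-effort isolation: prefer RG 2, but spill onto RG 0 or RG 1 during
-- partial outages; the queriesUsingFallbackReplica meter is incremented
-- whenever that happens, so alert on it.
SET preferredReplicas='2';
SET fallbackReplicas='0,1';
SELECT ... FROM table WHERE ...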

Limitations

  • During any kind of segment movement (e.g. a rebalance, or segment relocation such as realtime-to-offline movement or movement across tiers), workload isolation guarantees are lost. This is a consequence of how segment movement works internally in Apache Pinot today.
  • Only homogeneous replica groups are supported today. In other words, each replica group must contain the same number of instances.
  • When preferredReplicas lists more than one replica group, request-ID-based rotation is the only load-balancing strategy — adaptive server selection is not consulted within the preferred set.