Index-Based DISTINCT in Apache Pinot

From StarTree release 0.15 onwards, we have added support for two index-based execution paths for SELECT DISTINCT that skip the normal scan + projection pipeline and read distinct values directly from existing indexes. Both are opt-in behind a single query option. In many cases, this will lead to a significant increase in performance for distinct operators.

Enabling

SET useIndexBasedDistinctOperator = true;

This is off by default. Set it per query (or session) to turn on the optimization. When enabled, Pinot automatically picks the cheapest path for each query — index-based or the regular scan — based on filter selectivity and column cardinality. Ineligible queries silently fall back to the regular DISTINCT path, so it is safe to set globally. The following two features share this flag:

Inverted-index DISTINCT — for single-column SELECT DISTINCT on a column with a dictionary plus an inverted (or sorted) index.
JSON-index DISTINCT — for SELECT DISTINCT jsonExtractIndex(...) on a column with a JSON index.

1. Inverted-Index DISTINCT

What it does

Computes SELECT DISTINCT col by walking the column’s dictionary and checking which values still have matching docs after the filter, instead of scanning every row.

When it helps

Single-column SELECT DISTINCT on a column with dictionary + inverted index, or a sorted column.
Selective filters (the filter eliminates most rows).
Low-to-moderate cardinality columns (cities, country codes, status codes, tenant IDs, etc.).
ORDER BY <col> LIMIT N on the DISTINCT column — the operator can short-circuit on sorted columns.

When it does not help (and Pinot falls back automatically)

Multiple columns in the SELECT DISTINCT list.
Very high cardinality columns with a weak or no filter — scanning is cheaper.
Column without a dictionary or without an inverted/sorted index.
Multi-value columns.

Sample queries

SET useIndexBasedDistinctOperator = true;

SELECT DISTINCT city FROM myTable;

SELECT DISTINCT city FROM myTable WHERE country = 'US';

SELECT DISTINCT city FROM myTable
WHERE country = 'US'
ORDER BY city
LIMIT 100;

Tuning (optional)

If the default cost-based pick is too conservative or too aggressive for your workload, you can nudge it:

SET invertedIndexDistinctCostRatio = 10;

Higher value → bias toward the index path.
Lower value → bias toward scanning.
0 → always use the inverted-index bitmap path whenever the filter has any match.

Defaults already scale with cardinality (30 for ≤1K, 10 for ≤10K, 6 above), so most workloads do not need to set this.

2. JSON-Index DISTINCT

What it does

Computes SELECT DISTINCT jsonExtractIndex(col, '$.path', ...) by reading the JSON index’s value-to-docId map directly, instead of parsing the JSON for every matching row.

When it helps

High-cardinality JSON fields where parsing dominates cost (e.g., distinct user IDs, distinct event types out of a JSON blob).
Faceting / dropdown style queries that just need the set of distinct values for a path.
Queries where the JSON path is indexed.

When it does not help (and Pinot falls back automatically)

The JSON path is not covered by the index.
Multi-value arrays inside the JSON where each matching row contributes many values.
Multiple expressions in the SELECT DISTINCT list.
Multi-value return types like STRING_ARRAY.

Prerequisite — JSON index on the column

{
  "fieldConfigList": [
    {
      "name": "myJsonCol",
      "encodingType": "RAW",
      "indexTypes": ["JSON"]
    }
  ]
}

Sample queries

SET useIndexBasedDistinctOperator = true;

SELECT DISTINCT jsonExtractIndex(myJsonCol, '$.user.country', 'STRING')
FROM myTable;

SELECT DISTINCT jsonExtractIndex(myJsonCol, '$.event.type', 'STRING')
FROM myTable
WHERE timestamp > 1700000000000
ORDER BY jsonExtractIndex(myJsonCol, '$.event.type', 'STRING')
LIMIT 1000;

Supported patterns

Pattern	Supported
`DISTINCT jsonExtractIndex(col, '$.path', 'STRING')` and all single-value primitive types	Yes
`defaultValue` argument for missing paths	Yes
Inline JSON filter expression (5th arg of `jsonExtractIndex`)	Yes
`WHERE` clause filters	Yes
`ORDER BY` + `LIMIT`	Yes
Multi-value return types (`STRING_ARRAY`, etc.)	No — falls back
Multiple columns in `SELECT DISTINCT`	No — falls back

Verifying the optimization kicked in

EXPLAIN PLAN FOR <query> shows InvertedIndexDistinct or JsonIndexDistinct in place of the default distinct node.
Query response metadata: numEntriesScannedPostFilter drops to 0 for the JSON path, and near-zero for the inverted-index bitmap path.
If the plan still shows the regular DISTINCT, either the query is ineligible (see “When it does not help” above) or the cost-based picker chose scan because that was cheaper.

Background — how the path is picked

Pinot does cost-based selection for the inverted-index path at query time. For each segment it estimates the cost of scanning vs. probing the index using the rule:

dictionaryCardinality × costRatio ≤ filteredDocCount  →  use index path

Otherwise it falls back to scanning. This means the same query can use different paths on different segments depending on data distribution, and you do not have to know in advance which one is cheaper. The JSON-index path is unconditional once eligible — reading from the index is always cheaper than parsing JSON per row.

Get Started

Ingestion

StarTree Iceberg/S3 Tables

Query Data

Manage Data

Visualize Data

Manage Security

Release Notes

Reference

Index-Based DISTINCT in Apache Pinot

Enabling

1. Inverted-Index DISTINCT

What it does

When it helps

When it does not help (and Pinot falls back automatically)

Sample queries

Tuning (optional)

2. JSON-Index DISTINCT

What it does

When it helps

When it does not help (and Pinot falls back automatically)

Prerequisite — JSON index on the column

Sample queries

Supported patterns

Verifying the optimization kicked in

Background — how the path is picked

Get Started

Ingestion

StarTree Iceberg/S3 Tables

Query Data

Manage Data

Visualize Data

Manage Security

Release Notes

Reference

Documentation Index

​Enabling

​1. Inverted-Index DISTINCT

​What it does

​When it helps

​When it does not help (and Pinot falls back automatically)

​Sample queries

​Tuning (optional)

​2. JSON-Index DISTINCT

​What it does

​When it helps

​When it does not help (and Pinot falls back automatically)

​Prerequisite — JSON index on the column

​Sample queries

​Supported patterns

​Verifying the optimization kicked in

​Background — how the path is picked

Enabling

1. Inverted-Index DISTINCT

What it does

When it helps

When it does not help (and Pinot falls back automatically)

Sample queries

Tuning (optional)

2. JSON-Index DISTINCT

What it does

When it helps

When it does not help (and Pinot falls back automatically)

Prerequisite — JSON index on the column

Sample queries

Supported patterns

Verifying the optimization kicked in

Background — how the path is picked