From StarTree release 0.15 onwards, we have added support for two index-based execution paths forDocumentation Index
Fetch the complete documentation index at: https://docs.startree.ai/llms.txt
Use this file to discover all available pages before exploring further.
SELECT DISTINCT that skip the normal scan + projection pipeline and read distinct values directly from existing indexes. Both are opt-in behind a single query option. In many cases, this will lead to a significant increase in performance for distinct operators.
Enabling
- Inverted-index DISTINCT — for single-column
SELECT DISTINCTon a column with a dictionary plus an inverted (or sorted) index. - JSON-index DISTINCT — for
SELECT DISTINCT jsonExtractIndex(...)on a column with a JSON index.
1. Inverted-Index DISTINCT
What it does
ComputesSELECT DISTINCT col by walking the column’s dictionary and checking which values still have matching docs after the filter, instead of scanning every row.
When it helps
- Single-column
SELECT DISTINCTon a column with dictionary + inverted index, or a sorted column. - Selective filters (the filter eliminates most rows).
- Low-to-moderate cardinality columns (cities, country codes, status codes, tenant IDs, etc.).
ORDER BY <col> LIMIT Non the DISTINCT column — the operator can short-circuit on sorted columns.
When it does not help (and Pinot falls back automatically)
- Multiple columns in the
SELECT DISTINCTlist. - Very high cardinality columns with a weak or no filter — scanning is cheaper.
- Column without a dictionary or without an inverted/sorted index.
- Multi-value columns.
Sample queries
Tuning (optional)
If the default cost-based pick is too conservative or too aggressive for your workload, you can nudge it:- Higher value → bias toward the index path.
- Lower value → bias toward scanning.
0→ always use the inverted-index bitmap path whenever the filter has any match.
30 for ≤1K, 10 for ≤10K, 6 above), so most workloads do not need to set this.
2. JSON-Index DISTINCT
What it does
ComputesSELECT DISTINCT jsonExtractIndex(col, '$.path', ...) by reading the JSON index’s value-to-docId map directly, instead of parsing the JSON for every matching row.
When it helps
- High-cardinality JSON fields where parsing dominates cost (e.g., distinct user IDs, distinct event types out of a JSON blob).
- Faceting / dropdown style queries that just need the set of distinct values for a path.
- Queries where the JSON path is indexed.
When it does not help (and Pinot falls back automatically)
- The JSON path is not covered by the index.
- Multi-value arrays inside the JSON where each matching row contributes many values.
- Multiple expressions in the
SELECT DISTINCTlist. - Multi-value return types like
STRING_ARRAY.
Prerequisite — JSON index on the column
Sample queries
Supported patterns
| Pattern | Supported |
|---|---|
DISTINCT jsonExtractIndex(col, '$.path', 'STRING') and all single-value primitive types | Yes |
defaultValue argument for missing paths | Yes |
Inline JSON filter expression (5th arg of jsonExtractIndex) | Yes |
WHERE clause filters | Yes |
ORDER BY + LIMIT | Yes |
Multi-value return types (STRING_ARRAY, etc.) | No — falls back |
Multiple columns in SELECT DISTINCT | No — falls back |
Verifying the optimization kicked in
EXPLAIN PLAN FOR <query>showsInvertedIndexDistinctorJsonIndexDistinctin place of the default distinct node.- Query response metadata:
numEntriesScannedPostFilterdrops to0for the JSON path, and near-zero for the inverted-index bitmap path. - If the plan still shows the regular DISTINCT, either the query is ineligible (see “When it does not help” above) or the cost-based picker chose scan because that was cheaper.

