Overview
The Composite JSON index is an enhanced version of the JSON index. It indexes flattened JSON records and can add specialized per-path indexes for common JSON_MATCH predicates.
The Composite JSON index is available from StarTree version 0.11.0. Compact doc ID mapping and the related JSON_MATCH query-path performance improvements are available after the StarTree 0.15 release.
Use the Composite JSON index when you need to query nested JSON paths with one or more of the following patterns:
- Equality,
IN, NOT IN, IS NULL, or IS NOT NULL filters on selected JSON paths.
- Range filters on numeric JSON fields.
- Text search with
TEXT_MATCH on selected JSON paths.
- Regex or prefix-style filters with
REGEXP_LIKE on selected paths accelerated by an FST index.
- Count filters over the number of matching flattened records in a JSON document.
The Composite JSON index can also reduce index size by limiting the paths included in the inverted index, disabling array-position indexing, and, for eligible array-free columns, storing a compact doc ID mapping.
Creating both a JSON index and a Composite JSON index on the same column is not recommended. When evaluating JSON_MATCH, Pinot checks for the regular JSON index first and ignores the Composite JSON index.
Add the Composite JSON index under the column’s fieldConfigList[].indexes entry.
{
"fieldConfigList": [
{
"name": "json",
"encodingType": "RAW",
"indexes": {
"compositeJson": {
"flushThreshold": 10000,
"enablePositionalIndexing": true,
"invertedIndexConfigs": [
{
"path": "$.id"
}
],
"rangeIndexConfigs": [
{
"path": "$[*].timestamp",
"name": "eventTimestamp",
"dataType": "LONG",
"createDictionary": "true",
"dictionaryType": "fixedLength",
"defaultValue": "0"
}
],
"textIndexConfigs": [
{
"path": "$[*].message",
"name": "messageText"
}
],
"fstIndexConfigs": [
{
"path": "$.email",
"name": "emailFst"
}
]
}
}
}
]
}
To index every path in the inverted index, use includeAllPaths:
{
"fieldConfigList": [
{
"name": "json",
"encodingType": "RAW",
"indexes": {
"compositeJson": {
"invertedIndexConfigs": [
{
"includeAllPaths": "true"
}
]
}
}
}
]
}
Configuration Parameters
The Composite JSON index supports the JSON index options maxLevels, excludeArray, disableCrossArrayUnnest, includePaths, excludePaths, excludeFields, indexPaths, maxValueLength, and skipInvalidJson.
It also supports these Composite JSON options:
| Parameter | Default | Description |
|---|
flushThreshold | unset | Flushes off-heap posting lists every N documents to limit temporary memory usage during index creation. |
enablePositionalIndexing | true | Includes array positions such as $[0] in the inverted index. Set to false to reduce index size when queries only use wildcard paths such as $[*]. |
invertedIndexConfigs | empty | Lists paths to add to the inverted index. Use objects with path, or one object with includeAllPaths: "true" to include all paths. |
invertedIndexPartitionCount | 1 | Splits the inverted index into multiple path partitions. Partitioning can reduce large index overhead, but it affects compact doc ID mapping eligibility and FST availability. |
rangeIndexConfigs | empty | Lists paths to index with an internal range index. |
textIndexConfigs | empty | Lists paths to index with an internal Lucene text index. |
fstIndexConfigs | empty | Lists paths to index with an FST for faster REGEXP_LIKE filters. |
enableCompactDocIdMapping | false | Allows eligible new immutable segments to use a compact doc ID mapping instead of the legacy per-document int[] mapping. Enable only after all nodes that read the index support the new format. |
Range Indexes
Use rangeIndexConfigs for JSON paths that appear in range predicates.
{
"path": "$[*].timestamp",
"name": "eventTimestamp",
"dataType": "LONG",
"createDictionary": "true",
"dictionaryType": "fixedLength",
"defaultValue": "0"
}
Range index fields:
| Field | Required | Description |
|---|
path | yes | JSON path to index. |
name | yes | Unique index name. Use only letters, digits, and underscores. |
dataType | yes | JSON field type. With createDictionary: "true", supported types are INT, LONG, FLOAT, DOUBLE, STRING, and BIG_DECIMAL. Without a dictionary, supported types are INT, LONG, FLOAT, and DOUBLE. |
createDictionary | no | Set to "true" to build a dictionary-encoded range index. Default is "false". |
dictionaryType | no | fixedLength or variableLength. Default is variableLength. BIG_DECIMAL supports only variableLength. |
defaultValue | no | Value used for flattened records that do not contain the field or contain an unparseable value. |
The range index has one value for each flattened record. If a value is missing or unparseable, Pinot uses the configured defaultValue, or the type default when defaultValue is omitted. This can affect unbounded range queries. For example, if the default value is 0, this query can include documents without $.value:
SELECT count(*)
FROM users
WHERE JSON_MATCH(json, '"$.value" < 3');
Text Indexes
Use textIndexConfigs for JSON paths queried with TEXT_MATCH.
{
"path": "$[*].message",
"name": "messageText"
}
The name must be unique. Text index configs also accept the same fields supported by a regular text index, including:
rawValue, queryCache, useANDForMultiTermQueries, stopWordsInclude, stopWordsExclude,
luceneUseCompoundFile, luceneMaxBufferSizeMB, luceneAnalyzerClass,
luceneAnalyzerClassArgs, luceneAnalyzerClassArgTypes, luceneQueryParserClass,
enablePrefixSuffixMatchingInPhraseQueries, reuseMutableIndex,
luceneNRTCachingDirectoryMaxBufferSizeMB
For more information, see the text index documentation.
FST Indexes
FST indexes within Composite JSON were added in StarTree 0.12.
Use fstIndexConfigs to accelerate REGEXP_LIKE predicates on selected JSON paths.
{
"fieldConfigList": [
{
"name": "json",
"encodingType": "RAW",
"indexes": {
"compositeJson": {
"invertedIndexConfigs": [
{
"path": "$.email"
}
],
"fstIndexConfigs": [
{
"path": "$.email",
"name": "emailFst"
}
]
}
}
}
]
}
FST indexes require the path to also be available in the Composite JSON inverted index. FST acceleration is not used when invertedIndexPartitionCount is greater than 1; queries fall back to dictionary scanning.
Compact Doc ID Mapping
Compact doc ID mapping is available after the StarTree 0.15 release.
Composite JSON indexes store matches over flattened JSON records and then translate them back to table document IDs. Historically, each immutable segment stored a full (numDocs + 1) * 4 byte int[] mapping for this translation, even when the JSON column was array-free and each document produced at most one flattened record.
For array-free columns, the mapping can dominate index size and query latency. On an 11.3M-row segment with about 8% empty JSON documents, the full mapping was about 45 MB. With compact doc ID mapping, the same shape can store a small null-docs bitmap instead, around 125 KB in that measured case.
Enable compact mapping with enableCompactDocIdMapping:
{
"fieldConfigList": [
{
"name": "tagsJSON",
"encodingType": "RAW",
"indexes": {
"compositeJson": {
"enableCompactDocIdMapping": true,
"invertedIndexConfigs": [
{
"path": "$.dir"
}
]
}
}
}
]
}
enableCompactDocIdMapping writes a newer Composite JSON on-disk format for eligible segments. Keep it disabled during rolling upgrades. Enable it only after every server, peer-download path, and validation path that may read the index has been upgraded to a version that supports compact Composite JSON doc ID mappings.
Eligibility
The flag allows compact mapping, but the index creator still chooses the smallest safe representation for each new immutable segment.
Segment shape with enableCompactDocIdMapping: true | Mapping written |
|---|
| Array-free, no empty JSON documents | No mapping. Flattened IDs already equal real document IDs. |
Array-free, has empty documents, no range or text index, and invertedIndexPartitionCount: 1 | Compact null-docs bitmap. Inverted-index postings are relabeled to real document IDs at segment seal time. |
Array-free, has empty documents, and has a range index, text index, or invertedIndexPartitionCount > 1 | Legacy full int[] mapping. |
| Contains array values | Legacy full int[] mapping. |
| Empty segment | Legacy full int[] mapping. |
With enableCompactDocIdMapping omitted or set to false, all new segments keep the legacy full int[] mapping.
Rollout and Rollback
enableCompactDocIdMapping is read only while creating new immutable segments. Changing the setting does not rewrite existing segments by itself.
Use this rollout order:
- Upgrade every component that may read Composite JSON indexes.
- Set
enableCompactDocIdMapping: true on the target column.
- Rebuild, refresh, or re-commit the affected segments so new segments are written with the compact representation when eligible.
To roll back to a version that does not support compact mappings:
- Set
enableCompactDocIdMapping: false.
- Refresh or re-commit any affected compact segments so they are written with the legacy mapping.
- Downgrade only after no compact-mapping segments remain.
Query Examples
Equality
SELECT count(*)
FROM users
WHERE JSON_MATCH(json, '"$.dir" = ''upstream''');
Range Query
SELECT count(*)
FROM users
WHERE JSON_MATCH(json, '"$[*].timestamp" BETWEEN 1577836800000 AND 1735689599999');
Text Search
SELECT count(*)
FROM users
WHERE JSON_MATCH(json, 'TEXT_MATCH("$[*].message", ''timeout OR exception'')');
Regex Search with FST
SELECT count(*)
FROM users
WHERE JSON_MATCH(json, 'REGEXP_LIKE("$.email", ''.*@example[.]com'')');
Count Filter
Use the optional third argument to filter by the number of matching flattened records per document.
SELECT count(*)
FROM users
WHERE JSON_MATCH(json, '"$[*].id" = 10', '"cnt" >= 5');
For IS NULL, a count filter can distinguish empty JSON documents from present documents where the path is absent:
SELECT count(*)
FROM users
WHERE JSON_MATCH(json, '"$.dir" IS NULL', '"cnt" >= 1');
Array Position Indexing
Queries that specify a concrete array position require enablePositionalIndexing: true, even when the same path is also range-indexed.
SELECT count(*)
FROM users
WHERE JSON_MATCH(json, '"$.grades[1].value" BETWEEN 1 AND 10');
When enablePositionalIndexing is false, use wildcard paths such as $.grades[*].value instead.
Range Queries on Real-Time Tables
For committed immutable segments, range predicates can use the internal Composite JSON range index. Mutable consuming segments rely on the inverted index path.
When you use Composite JSON range predicates on a real-time or hybrid table, include the same JSON path in invertedIndexConfigs. Otherwise, mutable consuming segments can return no rows for that predicate until they commit.
{
"fieldConfigList": [
{
"name": "json",
"encodingType": "RAW",
"indexes": {
"compositeJson": {
"invertedIndexConfigs": [
{
"path": "$.grades[*].value"
}
],
"rangeIndexConfigs": [
{
"path": "$.grades[*].value",
"name": "gradeValue",
"dataType": "LONG",
"createDictionary": "true",
"dictionaryType": "fixedLength",
"defaultValue": "0"
}
]
}
}
}
]
}
The following comparisons were measured on internal Composite JSON benchmarks for the improvements available after the StarTree 0.15 release.
| Scenario | Before | After | Impact |
|---|
Array-free, inverted-index-only segment with compact mapping enabled: non-selective equality JSON_MATCH on 5M documents | 93.2 ms | 0.21 ms | About 450x faster |
| Same shape: Composite JSON index size | 25.7 MB | 9.6 MB | About 63% smaller |
| Same shape: doc ID mapping size | 19 MB | 0.6 MB | About 97% smaller |
Real 11.3M-row tagsJSON segment: doc ID mapping size | 45 MB | 125 KB | About 99.7% smaller |
Identity-mapped segment: reader-level getMatchingDocIds on a 359K-match posting | 0.052 ms | 0.003 ms | About 17x faster |
| Legacy full-mapping segment: dense non-selective remap on 2M rows | 33 ms | 11 ms | About 3x faster |
| Legacy full-mapping segment: sparse selective remap on 2M rows | 1.4 ms | 0.9 ms | About 36% faster |
Recent Composite JSON query-path improvements also avoid cloning read-only postings for compact identity segments and speed up the legacy remap path for array and full-mapping segments. Those improvements require no table-config changes.