Skip to main content

Overview

The Composite JSON index is an enhanced version of the JSON index. It indexes flattened JSON records and can add specialized per-path indexes for common JSON_MATCH predicates.
The Composite JSON index is available from StarTree version 0.11.0. Compact doc ID mapping and the related JSON_MATCH query-path performance improvements are available after the StarTree 0.15 release.
Use the Composite JSON index when you need to query nested JSON paths with one or more of the following patterns:
  • Equality, IN, NOT IN, IS NULL, or IS NOT NULL filters on selected JSON paths.
  • Range filters on numeric JSON fields.
  • Text search with TEXT_MATCH on selected JSON paths.
  • Regex or prefix-style filters with REGEXP_LIKE on selected paths accelerated by an FST index.
  • Count filters over the number of matching flattened records in a JSON document.
The Composite JSON index can also reduce index size by limiting the paths included in the inverted index, disabling array-position indexing, and, for eligible array-free columns, storing a compact doc ID mapping.
Creating both a JSON index and a Composite JSON index on the same column is not recommended. When evaluating JSON_MATCH, Pinot checks for the regular JSON index first and ignores the Composite JSON index.

Configure the Index

Add the Composite JSON index under the column’s fieldConfigList[].indexes entry.
{
  "fieldConfigList": [
    {
      "name": "json",
      "encodingType": "RAW",
      "indexes": {
        "compositeJson": {
          "flushThreshold": 10000,
          "enablePositionalIndexing": true,
          "invertedIndexConfigs": [
            {
              "path": "$.id"
            }
          ],
          "rangeIndexConfigs": [
            {
              "path": "$[*].timestamp",
              "name": "eventTimestamp",
              "dataType": "LONG",
              "createDictionary": "true",
              "dictionaryType": "fixedLength",
              "defaultValue": "0"
            }
          ],
          "textIndexConfigs": [
            {
              "path": "$[*].message",
              "name": "messageText"
            }
          ],
          "fstIndexConfigs": [
            {
              "path": "$.email",
              "name": "emailFst"
            }
          ]
        }
      }
    }
  ]
}
To index every path in the inverted index, use includeAllPaths:
{
  "fieldConfigList": [
    {
      "name": "json",
      "encodingType": "RAW",
      "indexes": {
        "compositeJson": {
          "invertedIndexConfigs": [
            {
              "includeAllPaths": "true"
            }
          ]
        }
      }
    }
  ]
}

Configuration Parameters

The Composite JSON index supports the JSON index options maxLevels, excludeArray, disableCrossArrayUnnest, includePaths, excludePaths, excludeFields, indexPaths, maxValueLength, and skipInvalidJson. It also supports these Composite JSON options:
ParameterDefaultDescription
flushThresholdunsetFlushes off-heap posting lists every N documents to limit temporary memory usage during index creation.
enablePositionalIndexingtrueIncludes array positions such as $[0] in the inverted index. Set to false to reduce index size when queries only use wildcard paths such as $[*].
invertedIndexConfigsemptyLists paths to add to the inverted index. Use objects with path, or one object with includeAllPaths: "true" to include all paths.
invertedIndexPartitionCount1Splits the inverted index into multiple path partitions. Partitioning can reduce large index overhead, but it affects compact doc ID mapping eligibility and FST availability.
rangeIndexConfigsemptyLists paths to index with an internal range index.
textIndexConfigsemptyLists paths to index with an internal Lucene text index.
fstIndexConfigsemptyLists paths to index with an FST for faster REGEXP_LIKE filters.
enableCompactDocIdMappingfalseAllows eligible new immutable segments to use a compact doc ID mapping instead of the legacy per-document int[] mapping. Enable only after all nodes that read the index support the new format.

Range Indexes

Use rangeIndexConfigs for JSON paths that appear in range predicates.
{
  "path": "$[*].timestamp",
  "name": "eventTimestamp",
  "dataType": "LONG",
  "createDictionary": "true",
  "dictionaryType": "fixedLength",
  "defaultValue": "0"
}
Range index fields:
FieldRequiredDescription
pathyesJSON path to index.
nameyesUnique index name. Use only letters, digits, and underscores.
dataTypeyesJSON field type. With createDictionary: "true", supported types are INT, LONG, FLOAT, DOUBLE, STRING, and BIG_DECIMAL. Without a dictionary, supported types are INT, LONG, FLOAT, and DOUBLE.
createDictionarynoSet to "true" to build a dictionary-encoded range index. Default is "false".
dictionaryTypenofixedLength or variableLength. Default is variableLength. BIG_DECIMAL supports only variableLength.
defaultValuenoValue used for flattened records that do not contain the field or contain an unparseable value.
The range index has one value for each flattened record. If a value is missing or unparseable, Pinot uses the configured defaultValue, or the type default when defaultValue is omitted. This can affect unbounded range queries. For example, if the default value is 0, this query can include documents without $.value:
SELECT count(*)
FROM users
WHERE JSON_MATCH(json, '"$.value" < 3');

Text Indexes

Use textIndexConfigs for JSON paths queried with TEXT_MATCH.
{
  "path": "$[*].message",
  "name": "messageText"
}
The name must be unique. Text index configs also accept the same fields supported by a regular text index, including:
rawValue, queryCache, useANDForMultiTermQueries, stopWordsInclude, stopWordsExclude,
luceneUseCompoundFile, luceneMaxBufferSizeMB, luceneAnalyzerClass,
luceneAnalyzerClassArgs, luceneAnalyzerClassArgTypes, luceneQueryParserClass,
enablePrefixSuffixMatchingInPhraseQueries, reuseMutableIndex,
luceneNRTCachingDirectoryMaxBufferSizeMB
For more information, see the text index documentation.

FST Indexes

FST indexes within Composite JSON were added in StarTree 0.12.
Use fstIndexConfigs to accelerate REGEXP_LIKE predicates on selected JSON paths.
{
  "fieldConfigList": [
    {
      "name": "json",
      "encodingType": "RAW",
      "indexes": {
        "compositeJson": {
          "invertedIndexConfigs": [
            {
              "path": "$.email"
            }
          ],
          "fstIndexConfigs": [
            {
              "path": "$.email",
              "name": "emailFst"
            }
          ]
        }
      }
    }
  ]
}
FST indexes require the path to also be available in the Composite JSON inverted index. FST acceleration is not used when invertedIndexPartitionCount is greater than 1; queries fall back to dictionary scanning.

Compact Doc ID Mapping

Compact doc ID mapping is available after the StarTree 0.15 release.
Composite JSON indexes store matches over flattened JSON records and then translate them back to table document IDs. Historically, each immutable segment stored a full (numDocs + 1) * 4 byte int[] mapping for this translation, even when the JSON column was array-free and each document produced at most one flattened record. For array-free columns, the mapping can dominate index size and query latency. On an 11.3M-row segment with about 8% empty JSON documents, the full mapping was about 45 MB. With compact doc ID mapping, the same shape can store a small null-docs bitmap instead, around 125 KB in that measured case. Enable compact mapping with enableCompactDocIdMapping:
{
  "fieldConfigList": [
    {
      "name": "tagsJSON",
      "encodingType": "RAW",
      "indexes": {
        "compositeJson": {
          "enableCompactDocIdMapping": true,
          "invertedIndexConfigs": [
            {
              "path": "$.dir"
            }
          ]
        }
      }
    }
  ]
}
enableCompactDocIdMapping writes a newer Composite JSON on-disk format for eligible segments. Keep it disabled during rolling upgrades. Enable it only after every server, peer-download path, and validation path that may read the index has been upgraded to a version that supports compact Composite JSON doc ID mappings.

Eligibility

The flag allows compact mapping, but the index creator still chooses the smallest safe representation for each new immutable segment.
Segment shape with enableCompactDocIdMapping: trueMapping written
Array-free, no empty JSON documentsNo mapping. Flattened IDs already equal real document IDs.
Array-free, has empty documents, no range or text index, and invertedIndexPartitionCount: 1Compact null-docs bitmap. Inverted-index postings are relabeled to real document IDs at segment seal time.
Array-free, has empty documents, and has a range index, text index, or invertedIndexPartitionCount > 1Legacy full int[] mapping.
Contains array valuesLegacy full int[] mapping.
Empty segmentLegacy full int[] mapping.
With enableCompactDocIdMapping omitted or set to false, all new segments keep the legacy full int[] mapping.

Rollout and Rollback

enableCompactDocIdMapping is read only while creating new immutable segments. Changing the setting does not rewrite existing segments by itself. Use this rollout order:
  1. Upgrade every component that may read Composite JSON indexes.
  2. Set enableCompactDocIdMapping: true on the target column.
  3. Rebuild, refresh, or re-commit the affected segments so new segments are written with the compact representation when eligible.
To roll back to a version that does not support compact mappings:
  1. Set enableCompactDocIdMapping: false.
  2. Refresh or re-commit any affected compact segments so they are written with the legacy mapping.
  3. Downgrade only after no compact-mapping segments remain.

Query Examples

Equality

SELECT count(*)
FROM users
WHERE JSON_MATCH(json, '"$.dir" = ''upstream''');

Range Query

SELECT count(*)
FROM users
WHERE JSON_MATCH(json, '"$[*].timestamp" BETWEEN 1577836800000 AND 1735689599999');
SELECT count(*)
FROM users
WHERE JSON_MATCH(json, 'TEXT_MATCH("$[*].message", ''timeout OR exception'')');

Regex Search with FST

SELECT count(*)
FROM users
WHERE JSON_MATCH(json, 'REGEXP_LIKE("$.email", ''.*@example[.]com'')');

Count Filter

Use the optional third argument to filter by the number of matching flattened records per document.
SELECT count(*)
FROM users
WHERE JSON_MATCH(json, '"$[*].id" = 10', '"cnt" >= 5');
For IS NULL, a count filter can distinguish empty JSON documents from present documents where the path is absent:
SELECT count(*)
FROM users
WHERE JSON_MATCH(json, '"$.dir" IS NULL', '"cnt" >= 1');

Array Position Indexing

Queries that specify a concrete array position require enablePositionalIndexing: true, even when the same path is also range-indexed.
SELECT count(*)
FROM users
WHERE JSON_MATCH(json, '"$.grades[1].value" BETWEEN 1 AND 10');
When enablePositionalIndexing is false, use wildcard paths such as $.grades[*].value instead.

Range Queries on Real-Time Tables

For committed immutable segments, range predicates can use the internal Composite JSON range index. Mutable consuming segments rely on the inverted index path. When you use Composite JSON range predicates on a real-time or hybrid table, include the same JSON path in invertedIndexConfigs. Otherwise, mutable consuming segments can return no rows for that predicate until they commit.
{
  "fieldConfigList": [
    {
      "name": "json",
      "encodingType": "RAW",
      "indexes": {
        "compositeJson": {
          "invertedIndexConfigs": [
            {
              "path": "$.grades[*].value"
            }
          ],
          "rangeIndexConfigs": [
            {
              "path": "$.grades[*].value",
              "name": "gradeValue",
              "dataType": "LONG",
              "createDictionary": "true",
              "dictionaryType": "fixedLength",
              "defaultValue": "0"
            }
          ]
        }
      }
    }
  ]
}

Performance Notes

The following comparisons were measured on internal Composite JSON benchmarks for the improvements available after the StarTree 0.15 release.
ScenarioBeforeAfterImpact
Array-free, inverted-index-only segment with compact mapping enabled: non-selective equality JSON_MATCH on 5M documents93.2 ms0.21 msAbout 450x faster
Same shape: Composite JSON index size25.7 MB9.6 MBAbout 63% smaller
Same shape: doc ID mapping size19 MB0.6 MBAbout 97% smaller
Real 11.3M-row tagsJSON segment: doc ID mapping size45 MB125 KBAbout 99.7% smaller
Identity-mapped segment: reader-level getMatchingDocIds on a 359K-match posting0.052 ms0.003 msAbout 17x faster
Legacy full-mapping segment: dense non-selective remap on 2M rows33 ms11 msAbout 3x faster
Legacy full-mapping segment: sparse selective remap on 2M rows1.4 ms0.9 msAbout 36% faster
Recent Composite JSON query-path improvements also avoid cloning read-only postings for compact identity segments and speed up the legacy remap path for array and full-mapping segments. Those improvements require no table-config changes.