Composite JSON Index
Overview and Purpose
The Composite JSON Index is an enhanced version of the JSON Index.
The Composite JSON Index is available from Apache Pinot version 1.4.0.
In addition to the features of the JSON index, this index supports:
- Indexing select path(s) with internal range and/or text index.
- Controlling which paths are included in inverted index (to make the index smaller, speed up index build).
- Applying a filter to the number of matching flattened documents (per JSON document).
- Controlling how often to flush with the off-heap index creator and also limiting temporary memory usage.
- Disabling the indexing of array positions (to make the index smaller).
The composite JSON index heavily optimizes document id mappings to reduce the index size and improve performance.
Creating both the JSON Index and the Composite JSON Index on the same column is not recommended. When evaluating the JSON_MATCH function, Pinot first checks for the regular JSON Index, ignoring the Composite JSON Index.
Example Illustration
Benchmark testing shows a significant response time reduction for range expressions.
Consider the query:
The query takes 1200ms with a regular JSON index, but only 5ms with the composite JSON index.
Configuration
Enabling Composite JSON Index
To enable a composite JSON index on a column in your StarTree Cloud table, add the following configuration to your table definition:
Configuration Parameters
The composite JSON index configuration supports the JSON index options, as well as:
- flushThreshold: Flush off-heap posting list every n documents to limit memory usage.
- enablePositionalIndexing: Indicates whether to include array indexes or not.
- Optional
- Default: true
- When set to false, then Pinot won’t include array indexes in the inverted index.
- invertedIndexConfigs: A list of paths to add to the inverted index.
- Default: empty
- Can be a single element with
"includeAllPaths":true
, to include everything
- rangeIndexConfigs: A list of paths to include in internal range index(es), along with settings.
Supports the following parameters:- path: JSON path of field to index
- Required
- name: The name for the field.
- Required and must be unique.
- dataType: Data type of the JSON field.
- Required
- When
createDictionary=true
, this parameter accepts accepts INT, LONG, FLOAT, DOUBLE, STRING, BIG_DECIMAL. - Otherwise this parameter accepts INT, LONG, FLOAT, DOUBLE.
- createDictionary: Indicates whether to create the dictionary for the field.
- Optional
- Default: false
- dictionaryType: Indicates the type, fixedLength or variableLength.
- Optional
- Default: variableLength
- Due to dictionary limitations, Pinot currently supports only variableLength for BIG_DECIMAL.
- defaultValue: A value for flattened records that don’t include the field, or that contain a badly formed value.
- Optional
- path: JSON path of field to index
- textIndexConfigs: A list of paths to include in the internal Lucene text index(es).
- Accepts path and name parameters.
- The name parameter must be unique.
Example Queries
Range-index-based range query against a JSON field
Text-index-based TEXT_MATCH against JSON field
See the Pinot documentation for more information.
Select only documents with at least 5 matching flattened documents
Range Index Default Values
The range index contains value for each flattened row. If value is missing or unparseable then either user-set or fixed - ‘0’ - default is used. That default might cause unexpected results to appear when range-querying without upper or lower bound. For example, with no defaultValue
set, the following query will include documents without a value
field:
Array Position Indexing
A query that specifies an array index in the JSON path will require the enablePositionalIndexing=true
parameter configuration, even when path is range-indexed. Otherwise, the query will return an empty result. For example, the following query:
That is because array index values are stored as separate bitmaps in the inverted index.
Range Queries against Mutable Segments
A range index is used for immutable/commited segments, while mutable segments rely on the inverted index. When indexing hybrid or real-time tables, you must include paths used for range queries in the inverted index. Otherwise queries will return empty results for the mutable segments.
Consider the following example configuration:
The inverted index configuration is empty. In such a case, the following query will return correct results only for for the immutable segments:
The mutable/realtime segments will produce no rows because they can’t use the range index and the inverted index is empty.
Text Index Configuration
Apart from the path and name, the textIndexConfig can contain fields allowed in regular text index configuration:
“rawValue”, “queryCache”, “useANDForMultiTermQueries”, “stopWordsInclude”, “stopWordsExclude”,
“luceneUseCompoundFile”, “luceneMaxBufferSizeMB”, “luceneAnalyzerClass”, “luceneAnalyzerClassArgs”,
“luceneAnalyzerClassArgTypes”, “luceneQueryParserClass”, “enablePrefixSuffixMatchingInPhraseQueries”,
“reuseMutableIndex”, “luceneNRTCachingDirectoryMaxBufferSizeMB”
For more information on the text index and configuration, see the Pinot documentation.