JSON Index
The JSON index flattens and indexes JSON structures to allow fast, path-based filtering without the need for full document parsing.
Overview and Purpose
A JSON index is a specialized data structure that accelerates filtering operations on JSON string columns. It enables efficient value lookups within JSON documents, without requiring the system to parse and analyze the entire JSON at query time.
JSON strings provide flexibility for representing complex data structures such as arrays, maps, and nested fields, without enforcing a rigid schema. However, without proper indexing, queries that filter on JSON content can be computationally expensive as they require scanning and reconstructing the entire JSON object for each record.
JSON indexes are particularly valuable for:
- Applications using JSON to store semi-structured or schema-flexible data
- Queries that filter on specific fields or paths within JSON documents
- Nested data structures where some fields are frequently used for filtering
- Use cases requiring fast lookups within arrays or nested objects
The JSON index optimizes query performance by pre-processing JSON documents during indexing, creating a flattened and searchable representation that dramatically reduces query-time computation.
How the Index Works
Core Concepts
Traditional approaches to JSON filtering require the system to:
- Scan every record in the table
- Parse each JSON string into a structured object
- Extract the required value based on the specified path
- Apply the filter condition to determine matches
This process is computationally expensive, especially for large datasets or complex JSON structures.
The JSON index in StarTree Cloud optimizes this process, by:
- Flattening: During indexing, the system “flattens” the JSON structure, extracting key-value pairs using paths that identify each value’s location in the original structure.
- Path Indexing: The flattened key-value pairs are indexed, allowing fast lookups based on specific JSON paths.
- Array Handling: Arrays are processed by either:
- Creating separate entries for each array element
- Combining elements from multiple arrays using cross-array unnesting
- Query Optimization: At query time, the system can directly access the indexed paths without parsing the entire JSON structure.
Example Illustration
Consider a JSON document representing a person:
Without a JSON index, to find all persons with addresses in “ca”, the system would need to parse and analyze every JSON document.
With a JSON index, the document is flattened during indexing into entries like:
name
: “adam”age
: 30addresses[0].number
: 112addresses[0].street
: “main st”addresses[0].country
: “us”addresses[1].number
: 2addresses[1].street
: “second st”addresses[1].country
: “ca”
At query time, the system can directly check the indexed values for addresses[*].country
= “ca” without parsing the entire JSON structure.
Configuration
Enabling JSON Index
To enable a JSON index on a column in your StarTree Cloud table, add the following configuration to your table definition:
For a simple configuration with default values, you can use:
Configuration Parameters
- maxLevels: Maximum levels to flatten in the JSON object (arrays count as one level)
- Default: 5
- Higher values allow indexing deeper nested structures
- Lower values reduce index size but limit deep path accessibility
- excludeArray: Whether to exclude arrays when flattening the object
- Default: false
- When true, array elements are not indexed
- Useful to reduce index size when arrays aren’t used in filters
- disableCrossArrayUnnest: Controls array unnesting behavior
- Default: false (calculate unique combinations of all elements)
- When true, prevents cross-product expansion of multiple arrays
- Recommended for documents with multiple large arrays to avoid combinatorial explosion
- includePaths: Only include the specified JSON paths (mutually exclusive with excludePaths)
- Example: [“.addresses[*].country”]
- Paths under included paths are automatically included
- Useful to precisely control which paths are indexed
- excludePaths: Exclude the specified JSON paths (mutually exclusive with includePaths)
- Example: [“.addresses[*].number”]
- Paths under excluded paths are automatically excluded
- Useful to omit rarely filtered paths
- excludeFields: Exclude specific field names regardless of path
- Example: [“street”, “zip”]
- Applies even if the field is under included paths
- Useful for excluding specific fields across all objects
- indexPaths: Control indexed paths using glob patterns
- Example: [”.”, “addresses.**”]
- More flexible than includePaths/excludePaths
- Can be combined with other configuration options
- maxValueLength: Maximum length for indexed values
- Values longer than this are replaced with ""
- Useful for limiting index size when some values are very large
- skipInvalidJson: Handle invalid JSON documents gracefully
- When true, replaces invalid JSON with empty key/path and "" value
- When false (default), invalid JSON causes errors during indexing
Alternative Configuration Methods
While the above method is recommended, you can also configure JSON indexes using legacy approaches:
Or more simply:
Important Configuration Considerations
- Column Type: JSON indexes can only be applied to columns with STRING or JSON data types that contain valid JSON strings.
- Dictionary Encoding: To reduce storage overhead, consider disabling the dictionary indexes on JSON-indexed columns.
- Array Handling: For documents with multiple large arrays, setting disableCrossArrayUnnest to true can prevent the “Got too many combinations” error, which occurs when more than 100,000 flattened documents would be created.
- Index Size: The index size grows with JSON complexity, depth, and the number of paths indexed. Use configuration parameters to control this.