Star-tree Index

Overview and Purpose

A star-tree index is an advanced multi-dimensional indexing structure that significantly accelerates aggregation and group-by queries by using pre-computed aggregation results. Unlike single-column indexes, star-tree indexes work across multiple dimensions to dramatically reduce the number of records that need to be processed at query time.

One of the biggest challenges in real-time analytics is maintaining low query latencies on large datasets while efficiently using storage resources. Traditional indexes improve query performance but are still limited by the number of records that must be processed. Pure pre-aggregation strategies guarantee fast query responses but can lead to storage explosion.

Star-tree indexes in StarTree Cloud provide an optimal balance between these approaches by selectively pre-aggregating data based on common query patterns, offering:

Predictable, low query latencies for aggregation operations
Efficient use of storage space compared to full pre-aggregation
Significant performance improvements for multi-dimensional queries
Configurable trade-offs between query speed and storage requirements

Star-tree indexes are particularly valuable for analytical workloads with high query volumes on large datasets where aggregation queries are common.

How the Index Works

Core Concepts

The star-tree index creates a hierarchical tree structure that organizes data based on multiple dimensions. The key elements of this structure include:

Dimension-based Organization: Data is organized hierarchically based on an ordered list of dimensions, with each level in the tree representing a particular dimension.
Pre-aggregation: For each node in the tree, metrics are pre-aggregated, allowing the system to directly use these pre-computed results when possible.
Star Nodes: Special nodes that contain pre-aggregated results after removing a specific dimension, enabling efficient handling of queries that don’t filter on certain dimensions.
Configurable Depth: The tree depth and node size can be configured to balance between storage requirements and query performance.

Example Illustration

Consider a dataset with dimensions Country, Browser, and Locale, and a metric Impressions:

Traditional Approach: To find the sum of Impressions for USA, Chrome across all Locales, the system would need to scan all matching records and compute the sum at query time.
Star-tree Approach:
- The data is organized hierarchically by Country, then Browser, then Locale
- At the Country=USA, Browser=Chrome level, a star node pre-computes the sum across all Locales
- The query directly uses this pre-computed result without scanning individual records

This approach significantly reduces query time by eliminating the need to process individual records when aggregate values can be used instead.

Tree Structure and Components

Node Types

Root Node: The single starting point of the tree from which all other nodes can be traversed.
Leaf Node: Contains at most T records (where T is configurable), representing the most granular level of data.
Non-leaf Node: Contains more than T records and is further split into child nodes based on the next dimension in the split order.
Star Node: A special child node that contains pre-aggregated records after removing the dimension on which data was split at that level.

Node Properties

Each node in the star-tree contains:

Dimension: The dimension the node is split on
Document ID Range: The range of documents this node encompasses
Aggregated Document ID: A single document containing the pre-aggregated results for all documents in this node

Configuration

Enabling Star-tree Index

To enable a star-tree index in your StarTree Cloud table, add the following configuration to your table definition:

{
  "tableIndexConfig": {
    "starTreeIndexConfigs": [
      {
        "dimensionsSplitOrder": [
          "country",
          "browser",
          "locale"
        ],
        "skipStarNodeCreationForDimensions": [],
        "functionColumnPairs": [
          "SUM__impressions",
          "COUNT__*"
        ],
        "maxLeafRecords": 10000
      }
    ]
  }
}

Alternative Configuration Using Aggregation Configs

You can also use the more detailed aggregationConfigs approach with compression settings:

{
  "tableIndexConfig": {
    "starTreeIndexConfigs": [
      {
        "dimensionsSplitOrder": [
          "country",
          "browser",
          "locale"
        ],
        "skipStarNodeCreationForDimensions": [],
        "aggregationConfigs": [
          {
            "columnName": "impressions",
            "aggregationFunction": "SUM",
            "compressionCodec": "LZ4"
          },
          {
            "columnName": "*",
            "aggregationFunction": "COUNT"
          }
        ],
        "maxLeafRecords": 10000
      }
    ]
  }
}

Default Star-tree Index

For a simpler approach, you can enable the default star-tree index:

{
  "tableIndexConfig": {
    "enableDefaultStarTree": true
  }
}

The default configuration:

Includes all single-value dictionary-encoded dimensions with cardinality ≤ 10,000
Sorts dimensions by cardinality in descending order
Appends time columns at the end of the dimension split order
Includes COUNT(*) and SUM for all numeric metrics
Uses a default maxLeafRecords value of 10,000

Configuration Parameters

dimensionsSplitOrder: An ordered list of dimension names that determines the split order in the tree.
- Only these dimensions are included in the star-tree index
- The order is important as it determines the hierarchy of the tree
- All columns in the filter and group-by clauses should be included for the index to be used
skipStarNodeCreationForDimensions: (Optional) Dimensions for which star nodes should not be created.
functionColumnPairs / aggregationConfigs: The metrics and aggregation functions to pre-compute.
- Format for functionColumnPairs: FUNCTION__COLUMN (e.g., SUM__impressions)
- All aggregations in a query must be included here for the star-tree to be used
maxLeafRecords: (Optional, default 10,000) The threshold for splitting nodes further.
compressionCodec: (Optional, via aggregationConfigs) The compression type for aggregated values.

Supported Aggregation Functions

The star-tree index supports aggregation functions that have bounded-sized intermediate results:

SUM, MIN, MAX, COUNT
AVG (implemented as SUM/COUNT)
DISTINCTCOUNT variants with bounded intermediate size:
- DISTINCTCOUNT_HLL
- DISTINCTCOUNT_THETA_SKETCH
- DISTINCTCOUNT_CPC_SKETCH
- DISTINCTCOUNT_TUPLE_SKETCH
PERCENTILE implementations with bounded size:
- PERCENTILE_TDIGEST

Unsupported functions include:

Basic DISTINCTCOUNT (unbounded Set)
PERCENTILE (unbounded List)

Query Behavior and Optimization

When Star-tree Index Is Used

The star-tree index is automatically used when:

All predicates in the query are supported by the star-tree index
All group-by columns are included in the dimensionsSplitOrder
All aggregation functions are included in the index configuration

How Queries Are Optimized

When a query is executed with a star-tree index:

The system first checks if the query can be answered using the star-tree
It traverses the tree to identify nodes that satisfy the query predicates
For each level in the tree:
- If no predicate or group-by exists on the split dimension, the star node is selected if available
- If predicates exist on the split dimension, only matching child nodes are selected
- If no predicate but a group-by exists on the split dimension, all child nodes except the star node are selected
The system collects documents from selected nodes:
- If all predicates and group-bys are satisfied, it uses the single pre-aggregated document from each node
- Otherwise, it processes all documents in the range

Supported Predicates

Equality: column = value
Range: >, >=, <, <=, BETWEEN
NOT: Applied to simple predicates (added in 1.2.0)
OR: Only for predicates on the same dimension

Unsupported Predicates

REGEXP_LIKE: Not supported as it requires scanning the entire dictionary
IS NULL/IS NOT NULL: Limited support (use column = default or column != default as workaround)
OR across dimensions: Cannot be used due to double-counting risks
Complex NOT: NOT on top of AND/OR is not supported

Get Started

Ingestion

Query Data

Manage Data

Visualize Data

Manage Security

Release Notes

Reference

Overview and Purpose

How the Index Works

Core Concepts

Example Illustration

Tree Structure and Components

Node Types

Node Properties

Configuration

Enabling Star-tree Index

Alternative Configuration Using Aggregation Configs

Default Star-tree Index

Configuration Parameters

Supported Aggregation Functions

Query Behavior and Optimization

When Star-tree Index Is Used

How Queries Are Optimized

Supported Predicates

Unsupported Predicates

Get Started

Ingestion

Query Data

Manage Data

Visualize Data

Manage Security

Release Notes

Reference

​Overview and Purpose

​How the Index Works

​Core Concepts

​Example Illustration

​Tree Structure and Components

​Node Types

​Node Properties

​Configuration

​Enabling Star-tree Index

​Alternative Configuration Using Aggregation Configs

​Default Star-tree Index

​Configuration Parameters

​Supported Aggregation Functions

​Query Behavior and Optimization

​When Star-tree Index Is Used

​How Queries Are Optimized

​Supported Predicates

​Unsupported Predicates

Overview and Purpose

How the Index Works

Core Concepts

Example Illustration

Tree Structure and Components

Node Types

Node Properties

Configuration

Enabling Star-tree Index

Alternative Configuration Using Aggregation Configs

Default Star-tree Index

Configuration Parameters

Supported Aggregation Functions

Query Behavior and Optimization

When Star-tree Index Is Used

How Queries Are Optimized

Supported Predicates

Unsupported Predicates