Dictionary Index
Overview and Purpose
A dictionary index is a fundamental indexing structure that improves storage efficiency and query performance by encoding repetitive data. It replaces actual column values with compact dictionary IDs, which is particularly beneficial for columns with a moderate number of unique values repeated across many rows.
In StarTree Cloud (powered by Apache Pinot), dictionary indexes serve as both an encoding mechanism and an indexing structure, enabling:
- Efficient storage of repetitive data through value deduplication
- Faster query processing through compact integer representations
- Reduced memory footprint for columns with moderate cardinality
- Optimized performance for string and other variable-length data types
- Support for other index types that rely on dictionary encoding
Dictionary indexes are enabled by default in StarTree Cloud for most columns, as they typically provide storage and performance benefits. However, they may be less efficient for columns with very high cardinality (where unique values approach the number of rows).
How the Index Works
Core Concepts
When data contains repetitive values across many rows, storing each value individually wastes space and reduces query efficiency. Dictionary encoding addresses this challenge through value deduplication.
The dictionary index in StarTree Cloud:
- Creates a value mapping table (dictionary) that assigns a unique integer ID to each unique value in the column.
- Performs ID substitution, replacing actual values in the data with their corresponding dictionary IDs, which are typically much smaller.
- Offers bidirectional lookup by maintaining structures for both:
- Forward lookup: ID → Value (for retrieving original values)
- Reverse lookup: Value → ID (for encoding and filtering)
- Serves a dual purpose, by functioning as both a storage optimization technique and an indexing structure that enables efficient value lookups.
Example Illustration
For a column storing countries with many repeated values:
-
Original Data:
-
Dictionary Approach:
-
Create a dictionary mapping unique values to IDs:
-
Store the column using the more compact IDs:
-
This approach significantly reduces storage requirements, especially for strings and other variable-length data types, while enabling efficient lookups and transformations.
Relationship with Other Indexes
Dictionary indexes influence the behavior and implementation of many other index types:
Index Type | Relationship | Description |
---|---|---|
Forward Index | Conditional | Implementation depends on whether dictionary is enabled |
Range Index | Conditional | Implementation depends on whether dictionary is enabled |
Inverted Index | Required | Requires dictionary index to be enabled |
JSON Index | May disable | When optimizeDictionary is enabled, dictionary may be disabled |
Text Index | May disable | When optimizeDictionary is enabled, dictionary may be disabled |
FST Index | Required | Requires dictionary index to be enabled |
H3/Geospatial Index | Incompatible | Cannot use dictionary index |
Configuration
Default Behavior
Dictionary indexes are enabled by default for all columns, based on the assumption that the number of unique values will be significantly lower than the number of rows.
Explicitly Disabling Dictionary Index
To disable a dictionary for a column with high cardinality, use one of these approaches:
Method 1: Using the dictionary
property in indexes
:
Method 2: Setting encodingType
to RAW
:
Method 3: Using the noDictionaryColumns
table configuration:
Heuristic Dictionary Configuration
For columns where the decision to use a dictionary isn’t clear, you can enable heuristic-based dictionaries at the table level:
Heuristic Parameters
- optimizeDictionary: When set to true, enables heuristic dictionary creation for all eligible columns.
- Default: false
- optimizeDictionaryForMetrics: When set to true, enables heuristic dictionary creation only for metric columns.
- Default: false
- Note: optimizeDictionary takes precedence if both are specified
- noDictionarySizeRatioThreshold: The ratio threshold used to determine if a dictionary should be created.
- Default: 0.85
- Meaning: If the ratio of (dictionary-encoded size)/(raw-encoded size) is lower than this threshold, the dictionary is created
Advanced Configuration Parameters
You can use advanced configuration parameters for fine-tuning the dictionary behavior.
- onHeap: Determines whether the dictionary is stored in Java heap memory.
- Default: false (off-heap)
- When true: Dictionary is loaded into Java heap memory
- When false: Dictionary is kept in off-heap memory (recommended)
- Impact: May slightly improve performance, but increases heap usage and GC pressure
- useVarLengthDictionary: Controls storage approach for variable-length data (strings, bytes, etc.).
- Default: false
- When false: All values are padded to the length of the longest value
- When true: Values are stored with their actual length plus an offset
- Recommended: Set to true when the length of values varies significantly
- intern: Controls string interning for on-heap dictionaries.
- Only applies to string and byte columns with onHeap=true
- Purpose: Reduces memory usage when the same values appear across segments
- intern.capacity: Maximum number of distinct values to intern.
- Default: null (no interning)
- Recommended: Set based on the estimated unique values across segments
Performance Considerations
- Cardinality Trade-offs:
- Low cardinality (few unique values): Dictionary provides significant benefits
- High cardinality (many unique values): Raw encoding may be more efficient
- Rule of thumb: If the ratio of unique values is greater than 30% of total rows, consider disabling dictionary
- Variable Length Data:
- For string columns with widely varying lengths, set useVarLengthDictionary to true
- For string columns with similar lengths, keep the default false for better performance
- Memory Management:
- Keep dictionaries off-heap (default) in most cases
- Use on-heap dictionaries with caution, only for low-cardinality columns
- When using on-heap dictionaries, consider enabling interning to reduce the memory footprint
- Segment Considerations:
- Dictionary scope is per segment
- Very large tables may have many segments, each with its own dictionary
- This can impact memory usage, especially with on-heap dictionaries