Text Index (Native)
Overview and Purpose
A native text index is an experimental indexing solution, built from the ground up specifically for Apache Pinot, that accelerates text search operations without relying on external libraries like Lucene. It provides optimized performance for the most common text search patterns while reducing storage requirements.
In StarTree Cloud (powered by Apache Pinot), native text indexes are designed to address the specific text search needs of OLAP systems rather than implementing a full-text search DSL as traditional search engines do. These indexes are particularly valuable for:
- Prefix wildcard queries (like “pino*”)
- Suffix wildcard queries (like “*inot”)
- Term queries (like “pinot”)
- Real-time text search on streaming data
- Applications requiring lower storage overhead for text search capabilities
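To illustrate how the three query patterns above might be expressed in SQL, here is a small sketch. It assumes the `TEXT_CONTAINS` predicate and the regex-style wildcard (`.*`) used by open-source Apache Pinot's native text index; the syntax exposed by StarTree Cloud may differ. The table name `myTable`, the column name `text_col_1`, and both helper functions are hypothetical.

```python
def wildcard_to_regex(pattern: str) -> str:
    """Translate a simple '*' wildcard into the regex-style '.*' form.

    Assumption: the index accepts regex-style expressions. Other regex
    metacharacters in the pattern are passed through unchanged.
    """
    return pattern.replace("*", ".*")


def text_contains_query(table: str, column: str, pattern: str) -> str:
    """Build a COUNT(*) query that filters with TEXT_CONTAINS (assumed name)."""
    # Double any single quotes so the expression is a valid SQL string literal.
    expr = wildcard_to_regex(pattern).replace("'", "''")
    return f"SELECT COUNT(*) FROM {table} WHERE TEXT_CONTAINS({column}, '{expr}')"


# Prefix, suffix, and term queries from the list above.
for pattern in ("pino*", "*inot", "pinot"):
    print(text_contains_query("myTable", "text_col_1", pattern))
```

The helper only builds the query text; submitting it to a broker is left out because the client setup depends on the deployment.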
Native text indexes are currently experimental and recommended only for testing. For production environments, use the standard (Lucene-based) text index until the native implementation matures.
How the Index Works
Core Concepts
Traditional text indexing in Pinot uses Lucene as a sidecar to the main Pinot segments. This approach is effective, but it limits how far the index can be optimized for Pinot-specific use cases.
The native text index in StarTree Cloud:
- Custom Indexing Engine: Built from the ground up specifically for Pinot’s access patterns and workloads.
- Integration with Inverted Indexes: Leverages Pinot’s powerful inverted index capabilities rather than relying on external libraries.
- Optimized for Common Patterns: Specialized for the most frequent text search patterns in SQL-based systems rather than supporting a full-text search DSL.
- Real-time Capability: Supports concurrent indexing and searching, enabling true real-time text search on streaming data.
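The concepts above can be made concrete with a toy model. The sketch below is not Pinot's implementation. It only illustrates, under simplifying assumptions (whitespace tokenization, in-memory Python sets as posting lists), how term, prefix, and suffix lookups can be answered from a term dictionary that maps to document IDs, and how a document becomes searchable as soon as it is added. The class `ToyTextIndex` and its methods are invented for this illustration.

```python
import bisect
from collections import defaultdict


class ToyTextIndex:
    """Toy term -> doc-id index. Illustrative only; not Pinot's data structure."""

    def __init__(self):
        self._postings = defaultdict(set)  # term -> set of doc ids
        self._terms = []                   # sorted terms, used for prefix scans
        self._reversed = []                # sorted reversed terms, used for suffix scans

    def add(self, doc_id, text):
        """Index a document; it is searchable immediately after this call."""
        for term in text.lower().split():
            if term not in self._postings:
                bisect.insort(self._terms, term)
                bisect.insort(self._reversed, term[::-1])
            self._postings[term].add(doc_id)

    @staticmethod
    def _scan(sorted_terms, prefix):
        """Return all entries of a sorted list that start with `prefix`."""
        start = bisect.bisect_left(sorted_terms, prefix)
        matches = []
        for candidate in sorted_terms[start:]:
            if not candidate.startswith(prefix):
                break
            matches.append(candidate)
        return matches

    def term(self, term):
        """Exact term lookup, e.g. 'pinot'."""
        return set(self._postings.get(term.lower(), set()))

    def prefix(self, prefix):
        """Prefix lookup, e.g. 'pino*'."""
        terms = self._scan(self._terms, prefix.lower())
        return set().union(*(self._postings[t] for t in terms))

    def suffix(self, suffix):
        """Suffix lookup, e.g. '*inot', done as a prefix scan over reversed terms."""
        terms = self._scan(self._reversed, suffix.lower()[::-1])
        return set().union(*(self._postings[t[::-1]] for t in terms))


index = ToyTextIndex()
index.add(0, "Apache Pinot")
index.add(1, "pinots everywhere")
index.add(2, "the merlot")

print(index.term("pinot"))    # {0}
print(index.prefix("pino"))   # {0, 1}
print(index.suffix("inot"))   # {0}
```

Storing the reversed terms is one simple way to turn a suffix query into a prefix scan; a production index would use more compact structures and handle concurrent readers and writers safely.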
Key Benefits
- Performance: Runs 80-120% faster than Lucene-based indexes for common text search patterns.
- Storage Efficiency: Requires approximately 40% less disk space compared to Lucene-based indexes.
- Real-time Text Search: Unlike traditional text indexes that require sealing before searching, native text indexes support concurrent indexing and searching.
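The storage figure above can be turned into a rough sizing estimate. The sketch below simply applies the approximate 40% reduction quoted on this page to a hypothetical Lucene-based index size; the function name and the 250 GB input are made up for illustration, and actual savings depend on the data and should be measured.

```python
def estimate_native_index_gb(lucene_index_gb, reduction_pct=40.0):
    """Estimate native text index size from a Lucene-based index size.

    reduction_pct defaults to the approximate 40% figure quoted above.
    """
    if not 0 <= reduction_pct <= 100:
        raise ValueError("reduction_pct must be between 0 and 100")
    return lucene_index_gb * (100.0 - reduction_pct) / 100.0


lucene_gb = 250.0  # hypothetical size of an existing Lucene-based text index
native_gb = estimate_native_index_gb(lucene_gb)
print(f"native ~ {native_gb:.0f} GB, saving ~ {lucene_gb - native_gb:.0f} GB")
# prints: native ~ 150 GB, saving ~ 100 GB
```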
Configuration
Enabling Native Text Index
To enable a native text index on a column in your StarTree Cloud table, add a text index configuration for that column to your table definition.
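As a minimal sketch of what such a configuration might look like, the snippet below builds a `fieldConfigList` entry and prints it as JSON. It assumes StarTree Cloud follows the open-source Apache Pinot convention of selecting the native implementation with an `fstType` property set to `native` on a RAW-encoded column with a `TEXT` index type; verify the exact keys against the documentation for your release. The column name `text_col_1` and the helper function are placeholders.

```python
import json


def native_text_field_config(column):
    """Return a fieldConfigList entry for a native text index.

    Assumption: keys follow the open-source Apache Pinot table config format.
    """
    return {
        "name": column,
        "encodingType": "RAW",                # raw (non-dictionary) encoding
        "indexTypes": ["TEXT"],               # request a text index on the column
        "properties": {"fstType": "native"},  # select the native implementation
    }


# Fragment to merge into the table config; 'text_col_1' is a placeholder column.
table_config_fragment = {
    "fieldConfigList": [native_text_field_config("text_col_1")]
}
print(json.dumps(table_config_fragment, indent=2))
```

The printed JSON is only the relevant fragment; the rest of the table definition (schema name, segment settings, and so on) is unchanged.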
Important Configuration Considerations
- Experimental Status: The native text index is currently experimental and recommended only for testing environments.
- Column Encoding: Like the standard text index, the native text index works with RAW-encoded columns, not dictionary-encoded ones.
- Real-time Tables: Native text indexes are particularly beneficial for real-time tables where concurrent indexing and searching are required.
Performance Considerations
- Experimental Status: As an experimental feature, performance characteristics may change in future releases.
- Use Case Fit: Native text indexes excel at common SQL text search patterns (prefix, suffix, term) but may not support all features of full-text search engines.
- Real-time Performance: For real-time tables with text search requirements, native text indexes offer true real-time capabilities without the near-real-time limitation of traditional approaches.
- Storage Benefits: The 40% reduction in index size can be significant for large text columns or tables with many text-indexed columns.
- Development Trajectory: As this feature matures, it’s expected to eventually replace the Lucene-based approach as the recommended text index for StarTree Cloud.