Documentation Index
Fetch the complete documentation index at: https://docs.startree.ai/llms.txt
Use this file to discover all available pages before exploring further.
Overview and Purpose
Pinot supports super-fast query processing through its indexes on non-BLOB like columns. Queries with exact match filters are run efficiently through a combination of dictionary encoding, inverted index, and sorted index. This is useful for a query like the following, which looks for exact matches on two columns of type STRING and INT respectively:TEXT_MATCH function, like this:
<column_name> is the column text index is created on and <search_expression> conforms to one of the following:
| Search Expression Type | Example |
|---|---|
| Phrase query | TEXT_MATCH (<column_name>, ‘“distributed system”‘) |
| Term Query | TEXT_MATCH (<column_name>, ‘Java’) |
| Boolean Query | TEXT_MATCH (<column_name>, ‘Java AND c++‘) |
| Prefix Query | TEXT_MATCH (<column_name>, ‘stream*‘) |
| Regex Query | TEXT_MATCH (<column_name>, ‘/Exception.*/‘) |
| Not Query | TEXT_MATCH (<column_name>, ’*:* NOT c%’) NOT TEXT_MATCH (<column_name>, ‘c%’) |
- Log analysis where each log entry contains large chunks of text
- Document search across free-form text fields
- Search across JSON blobs stored as strings
- Any use case involving queries on large text values where exact matching is insufficient
Text indexes provide significant performance benefits for search operations on STRING columns containing large text values (BLOB/CLOB-like data) that don’t fit the pattern of dictionary-encoded exact matches.
How the Index Works
Core Concepts
Traditional indexing approaches in Pinot (dictionary encoding, inverted index, sorted index) excel at exact match and range queries but are inefficient for complex text search operations on large text values. The text index in StarTree Cloud:- Text Processing Pipeline: During indexing, each text value is processed through a pipeline that includes:
- Tokenization: Breaking the text into individual terms
- Stop word removal: Filtering out common words with low search value
- Term normalization: Converting terms to a standard form
- Inverted Index Creation: An inverted index is built that maps terms to the documents containing them, enabling fast lookup.
- Query Processing: At query time, search expressions are analyzed and used to efficiently locate matching documents without full table scans.
Example Illustration
Consider a column containing Apache access logs where each row stores an entire log entry:- To find all log entries containing “Firefox”, the system would need to scan every row and check if the string “Firefox” appears in the text.
- During indexing, the log entry is tokenized into terms including “Firefox”
- The term “Firefox” is linked to the documents containing it
- At query time, the system can directly retrieve the rows containing “Firefox” without scanning all data
Enable a per-column text index
Enable a text index on a column in the table configuration by adding a new section with the name “fieldConfigList”.noDictionaryColumns in tableIndexConfig:
- Adding a new table with text index enabled on one or more columns.
- Adding a new column with text index enabled to an existing table.
- Enabling a text index on an existing column.
Important Configuration Considerations
- Column Requirements:
- The column must be of type STRING
- The column must be single-valued (not multi-valued)
- The column should use RAW encoding (no dictionary)
- noDictionaryColumns: Text-indexed columns should be added to the noDictionaryColumns list in tableIndexConfig to reduce storage overhead.
- Index Coexistence: Using a text index in coexistence with other Pinot indexes on the same column is not supported.
Customizing Stop Words
You can customize the stop words used during indexing:- stopWordInclude: Comma-separated list of words to include as stop words (in addition to defaults)
- stopWordExclude: Comma-separated list of words to exclude from the default stop words list
Enabling Prefix/Suffix Matching in Phrase Queries
For more flexible phrase matching across term boundaries:Tuning Lucene Index Creation
For optimizing index creation performance:TEXT_MATCH Query Options
TheTEXT_MATCH function supports an optional third parameter for specifying Lucene query parser options at query time. This allows for flexible and advanced text search without changing table configuration.
Function Signature:
text_column_name: Name of the column to perform text search on.search_expression: The query string for text search.options(optional): Comma-separated string of key-value pairs to control query parsing and search behavior.
| Option | Values | Description |
|---|---|---|
parser | CLASSIC, STANDARD, COMPLEX, MATCHPHRASE, MATCH | Selects the Lucene query parser to use. Default is CLASSIC. |
allowLeadingWildcard | true, false | Allows queries to start with a wildcard (e.g., *term). Default is false. |
defaultOperator | AND, OR | Sets the default boolean operator for multi-term queries. Default is OR. |
| Parser | Best For |
|---|---|
CLASSIC (Default) | Standard Lucene syntax with wildcards and boolean operators. |
STANDARD | Advanced regex pattern matching and field-specific queries. |
COMPLEX | Enhanced phrase matching that allows wildcards inside quotes. |
MATCHPHRASE | Autocomplete-style phrase search with prefix matching. |
MATCH | Queries requiring a specific number of term matches (minimum_should_match). |
Parser-Specific Options
1. CLASSIC Parser (Default) The standard Lucene query parser. It is robust and handles most boolean and wildcard requirements. Options Supported:- allowLeadingWildcard: (Boolean) Enable to allow queries like *term. Default: false.
- defaultOperator: (String) Logic between terms if no operator is provided. Values: OR (default), AND.
- analyzeWildcard: (Boolean) Whether to analyze terms within wildcard queries. Default: false.
- fuzzyMinSim: (Float) Minimum similarity for fuzzy matching (~). Default: 2.0 (edit distance).
- Options Supported:
- allowLeadingWildcard: (Boolean) Default: false.
- defaultOperator: (String) Values: OR, AND.
- Options Supported:
- allowLeadingWildcard: (Boolean) Default: false.
- defaultOperator: (String) Values: OR, AND.
- inOrder: (Boolean) If true, terms in a phrase must appear in the exact order specified. Default: true.
- Options Supported:
- enablePrefixMatch: (Boolean) When true, the last term of the phrase is treated as a prefix. Default: true.
- slop: (Integer) The number of “jumps” allowed between words in the phrase. Default: 0.
- Options Supported:
- minimumShouldMatch: (String) Defines how many terms must match for a document to be returned. Supports:
- Integer: 2 (exactly two terms).
- Percentage: 75% (at least 75% of terms).
- Negative: -1 (all but one term).
- minimumShouldMatch: (String) Defines how many terms must match for a document to be returned. Supports:
Examples of Advanced Search Patterns
1. Regex Pattern Matching (.*text.*)
For highly specific patterns that go beyond simple wildcards, you can use regular expressions. This requires the STANDARD parser.- Matches: “NullPointerException”, “Runtime_exception”, “exception_handler”.
- Note: Regular expressions are wrapped in forward slashes (/pattern/).
2. Field and Anchor Search (^)
The STANDARD parser allows for more granular control, such as boosting specific terms or using anchors if supported by the underlying analyzer. In Lucene-based searches, the ^ symbol is also frequently used for Term Boosting, allowing you to give more weight to certain words.3. Prefix and Suffix Search
Find documents containing words that start or end with a specific sequence. Prefix (Starts With):4. Contains Search
Find a term regardless of its position or surrounding characters.5. Exact Match
Search for a specific term or a specific phrase in its exact order.- Single Term Exact Match:
- Exact Phrase Match: Use double quotes within the query string to ensure terms appear together and in order.
Other Miscellaneous examples
Technical Syntax Reference
For deep-dive syntax on how to write query_string expressions (e.g., using ~ for fuzzy, ^ for boosting, or [] for ranges), please refer to the officialApache Lucene Query Parser Syntax . Note: Apache Pinot implements Lucene 8.x/9.x logic. While the core syntax remains the same, advanced field-specific features may behave differently based on your Pinot Table Index configuration.Best Practices
- Leading Wildcards: Using term is slower than term because it requires a more intensive index scan.
- Index Requirement: TEXT_MATCH only works with Lucene-based text indexes, not native Pinot text indexes.
TooManyClauses exceptions, causing query failures. This commonly occurs with:
- Complex boolean queries with many OR conditions
- Wildcard queries that expand to many terms
- Queries with large numbers of search terms To handle such cases, you can increase the maximum clause count at the cluster level. See the cluster configuration reference for the
pinot.lucene.max.clause.countsetting.

