Overview and Purpose

Pinot supports super-fast query processing through its indexes on non-BLOB-like columns. Queries with exact-match filters run efficiently through a combination of dictionary encoding, inverted index, and sorted index. This is useful for a query like the following, which applies an exact-match filter and a range filter on columns of type STRING and INT, respectively:
SELECT COUNT(*) 
FROM Foo 
WHERE STRING_COL = 'ABCDCD' 
AND INT_COL > 2000
For arbitrary text data that falls into BLOB/CLOB territory, we need more than exact matches: regex, phrase, and fuzzy queries over BLOB-like data. Text indexes can efficiently perform arbitrary search on STRING columns where each column value is a large blob of text, using the TEXT_MATCH function, like this:
SELECT COUNT(*) 
FROM Foo 
WHERE TEXT_MATCH (<column_name>, '<search_expression>')
where <column_name> is the column the text index is created on and <search_expression> conforms to one of the following:
  • Phrase query: TEXT_MATCH (<column_name>, '"distributed system"')
  • Term query: TEXT_MATCH (<column_name>, 'Java')
  • Boolean query: TEXT_MATCH (<column_name>, 'Java AND c++')
  • Prefix query: TEXT_MATCH (<column_name>, 'stream*')
  • Regex query: TEXT_MATCH (<column_name>, '/Exception.*/')
  • Not query: TEXT_MATCH (<column_name>, '*:* NOT c%') or NOT TEXT_MATCH (<column_name>, 'c%')
A text index is a specialized data structure that enables efficient text search operations on large text data stored in STRING columns. Unlike traditional indexes that support exact matches or range queries, text indexes facilitate complex text search capabilities including phrase matching, term queries, regex patterns, and prefix searches. In StarTree Cloud (powered by Apache Pinot), the text index bridges the gap between structured query processing and text search functionality. It allows users to perform flexible search operations on text data that would otherwise require expensive full table scans and string manipulations. Text indexes are particularly valuable for:
  • Log analysis where each log entry contains large chunks of text
  • Document search across free-form text fields
  • Search across JSON blobs stored as strings
  • Any use case involving queries on large text values where exact matching is insufficient
Text indexes provide significant performance benefits for search operations on STRING columns containing large text values (BLOB/CLOB-like data) that don’t fit the pattern of dictionary-encoded exact matches.

How the Index Works

Core Concepts

Traditional indexing approaches in Pinot (dictionary encoding, inverted index, sorted index) excel at exact-match and range queries but are inefficient for complex text search operations on large text values. The text index in StarTree Cloud addresses this as follows:
  1. Text Processing Pipeline: During indexing, each text value is processed through a pipeline that includes:
    • Tokenization: Breaking the text into individual terms
    • Stop word removal: Filtering out common words with low search value
    • Term normalization: Converting terms to a standard form
  2. Inverted Index Creation: An inverted index is built that maps terms to the documents containing them, enabling fast lookup.
  3. Query Processing: At query time, search expressions are analyzed and used to efficiently locate matching documents without full table scans.
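The indexing steps above can be sketched in a few lines of Python. This is an illustration of the concept only — Pinot's actual text index delegates analysis and index construction to Lucene, and the stop-word list here is a stand-in:

```python
# Conceptual sketch of the text indexing pipeline: tokenization,
# term normalization, stop-word removal, then inverted-index construction.
import re

STOP_WORDS = {"a", "an", "and", "the", "of", "to"}  # illustrative subset

def analyze(text):
    tokens = re.findall(r"[A-Za-z0-9]+", text)        # tokenization
    tokens = [t.lower() for t in tokens]              # normalization
    return [t for t in tokens if t not in STOP_WORDS] # stop-word removal

def build_inverted_index(docs):
    """Map each term to the set of document IDs containing it."""
    index = {}
    for doc_id, text in enumerate(docs):
        for term in analyze(text):
            index.setdefault(term, set()).add(doc_id)
    return index

docs = [
    "Java is a distributed system language",
    "The system logged an Exception in Java",
]
index = build_inverted_index(docs)
print(sorted(index["java"]))    # [0, 1] -- term lookup, no scan needed
print(sorted(index["system"]))  # [0, 1]
```

At query time, locating the rows for a term is a dictionary lookup into this structure rather than a scan over every row's text.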

Example Illustration

Consider a column containing Apache access logs where each row stores an entire log entry:
109.169.248.247 - - [12/Dec/2015:18:25:11 +0100] "GET /administrator/ HTTP/1.1" 200 4263 "-" "Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0" "-"
Without a text index:
  • To find all log entries containing “Firefox”, the system would need to scan every row and check if the string “Firefox” appears in the text.
With a text index:
  • During indexing, the log entry is tokenized into terms including “Firefox”
  • The term “Firefox” is linked to the documents containing it
  • At query time, the system can directly retrieve the rows containing “Firefox” without scanning all data
This approach significantly improves query performance for text search operations, especially on large datasets.
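Assuming a table accessLogs with a text index on a column log_line (both names hypothetical), the Firefox lookup described above could be expressed as:

```sql
-- Counts log entries whose text contains the term "Firefox"
SELECT COUNT(*)
FROM accessLogs
WHERE TEXT_MATCH(log_line, 'Firefox')
```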

Enable a per-column text index

Enable a text index on a column in the table configuration by adding a new section with the name "fieldConfigList".
"fieldConfigList":[
  {
     "name":"text_col_1",
     "encodingType":"RAW",
     "indexTypes":["TEXT"]
  },
  {
     "name":"text_col_2",
     "encodingType":"RAW",
     "indexTypes":["TEXT"]
  }
]
Each column that has a text index should also be specified as noDictionaryColumns in tableIndexConfig:
"tableIndexConfig": {
   "noDictionaryColumns": [
     "text_col_1",
     "text_col_2"
 ]}
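Putting the two pieces together, the relevant portion of a table config might look like the following sketch (the table name and column names are illustrative):

```json
{
  "tableName": "Foo",
  "tableIndexConfig": {
    "noDictionaryColumns": ["text_col_1", "text_col_2"]
  },
  "fieldConfigList": [
    { "name": "text_col_1", "encodingType": "RAW", "indexTypes": ["TEXT"] },
    { "name": "text_col_2", "encodingType": "RAW", "indexTypes": ["TEXT"] }
  ]
}
```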
You can configure text indexes in the following scenarios:
  • Adding a new table with text index enabled on one or more columns.
  • Adding a new column with text index enabled to an existing table.
  • Enabling a text index on an existing column.

Important Configuration Considerations

  1. Column Requirements:
    • The column must be of type STRING
    • The column must be single-valued (not multi-valued)
    • The column should use RAW encoding (no dictionary)
  2. noDictionaryColumns: Text-indexed columns should be added to the noDictionaryColumns list in tableIndexConfig to reduce storage overhead.
  3. Index Coexistence: Using a text index in coexistence with other Pinot indexes on the same column is not supported.

Customizing Stop Words

You can customize the stop words used during indexing:
{
  "fieldConfigList": [
    {
      "name": "log_content",
      "encodingType": "RAW",
      "indexes": {
        "text": {}
      },
      "properties": {
        "stopWordInclude": "incl1, incl2, incl3",
        "stopWordExclude": "it"
      }
    }
  ]
}
  • stopWordInclude: Comma-separated list of words to include as stop words (in addition to defaults)
  • stopWordExclude: Comma-separated list of words to exclude from the default stop words list
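Conceptually, the two properties adjust the stop-word set before analysis: includes are added to the defaults, excludes are removed. A sketch (illustrative only — the default list here is a stand-in, not Pinot's actual implementation):

```python
# Stand-in for the default English stop-word list (illustrative).
DEFAULT_STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such",
    "that", "the", "their", "then", "there", "these", "they", "this",
    "to", "was", "will", "with",
}

def effective_stop_words(include="", exclude=""):
    """Apply stopWordInclude/stopWordExclude semantics to the defaults."""
    words = set(DEFAULT_STOP_WORDS)
    words |= {w.strip() for w in include.split(",") if w.strip()}  # add
    words -= {w.strip() for w in exclude.split(",") if w.strip()}  # remove
    return words

stop = effective_stop_words(include="incl1, incl2, incl3", exclude="it")
print("incl1" in stop)  # True  -- now treated as a stop word
print("it" in stop)     # False -- "it" will now be indexed and searchable
```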

Enabling Prefix/Suffix Matching in Phrase Queries

For more flexible phrase matching across term boundaries:
{
  "fieldConfigList": [
    {
      "name": "log_content",
      "encodingType": "RAW",
      "indexes": {
        "text": {}
      },
      "properties": {
        "enablePrefixSuffixMatchingInPhraseQueries": "true"
      }
    }
  ]
}
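With this property enabled, a wildcard can appear at the edge of a phrase so the phrase matches on a term prefix or suffix. For example (hypothetical column, following the phrase-query syntax above):

```sql
-- Matches phrases starting with "distributed sys", e.g. "distributed system"
SELECT COUNT(*)
FROM Foo
WHERE TEXT_MATCH(log_content, '"distributed sys*"')
```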

Tuning Lucene Index Creation

To tune Lucene index creation performance, the following properties are available:
  • luceneUseCompoundFile: whether Lucene writes the index in its compound file format
  • luceneMaxBufferSizeMB: the maximum in-memory buffer (in MB) used during indexing before flushing to disk
For example:
{
  "fieldConfigList": [
    {
      "name": "log_content",
      "encodingType": "RAW",
      "indexes": {
        "text": {}
      },
      "properties": {
        "luceneUseCompoundFile": "false",
        "luceneMaxBufferSizeMB": "128"
      }
    }
  ]
}

TEXT_MATCH Query Options

The TEXT_MATCH function supports an optional third parameter for specifying Lucene query parser options at query time. This allows for flexible and advanced text search without changing the table configuration.
Function signature:
TEXT_MATCH(text_column_name, search_expression [, options])
  • text_column_name: Name of the column to perform text search on.
  • search_expression: The query string for text search.
  • options (optional): Comma-separated string of key-value pairs to control query parsing and search behavior.
Available Options:
  • parser (CLASSIC, STANDARD, COMPLEX): Selects the Lucene query parser to use. Default is CLASSIC.
  • allowLeadingWildcard (true, false): Allows queries to start with a wildcard (e.g., *term). Default is false.
  • defaultOperator (AND, OR): Sets the default boolean operator for multi-term queries. Default is OR.
Examples:
-- Use CLASSIC parser with leading wildcard support
SELECT * FROM myTable WHERE TEXT_MATCH(myCol, '*search*', 'parser=CLASSIC, allowLeadingWildcard=true')

-- Use STANDARD parser with AND operator
SELECT * FROM myTable WHERE TEXT_MATCH(myCol, 'term1 term2', 'parser=STANDARD, defaultOperator=AND')

-- Use COMPLEX parser for advanced queries
SELECT * FROM myTable WHERE TEXT_MATCH(myCol, 'complex query', 'parser=COMPLEX')
When text search queries contain too many terms or clauses, Lucene may throw TooManyClauses exceptions, causing query failures. This commonly occurs with:
  • Complex boolean queries with many OR conditions
  • Wildcard queries that expand to many terms
  • Queries with large numbers of search terms
To handle such cases, you can increase the maximum clause count at the cluster level. See the cluster configuration reference for the pinot.lucene.max.clause.count setting.
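As a sketch, such a cluster-level setting can be applied through the Pinot controller's cluster config API; the host and the value shown here are illustrative:

```shell
# Raise the Lucene max clause count cluster-wide (value is illustrative)
curl -X POST "http://localhost:9000/cluster/configs" \
  -H "Content-Type: application/json" \
  -d '{"pinot.lucene.max.clause.count": "4096"}'
```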