Overview and Purpose

A text index is a specialized data structure that enables efficient text search operations on large text data stored in STRING columns. Unlike traditional indexes that support exact matches or range queries, text indexes facilitate complex text search capabilities including phrase matching, term queries, regex patterns, and prefix searches.

In StarTree Cloud (powered by Apache Pinot), the text index bridges the gap between structured query processing and text search functionality. It allows users to perform flexible search operations on text data that would otherwise require expensive full table scans and string manipulations.

Text indexes are particularly valuable for:

  • Log analysis where each log entry contains large chunks of text
  • Document search across free-form text fields
  • Search across JSON blobs stored as strings
  • Any use case involving queries on large text values where exact matching is insufficient

Text indexes provide significant performance benefits for search operations on STRING columns containing large text values (BLOB/CLOB-like data) that don’t fit the pattern of dictionary-encoded exact matches.

How the Index Works

Core Concepts

Traditional indexing approaches in Pinot (dictionary encoding, inverted index, sorted index) excel at exact match and range queries but are inefficient for complex text search operations on large text values.

The text index in StarTree Cloud:

  1. Text Processing Pipeline: During indexing, each text value is processed through a pipeline that includes:
    • Tokenization: Breaking the text into individual terms
    • Stop word removal: Filtering out common words with low search value
    • Term normalization: Converting terms to a standard form
  2. Inverted Index Creation: An inverted index is built that maps terms to the documents containing them, enabling fast lookup.
  3. Query Processing: At query time, search expressions are analyzed and used to efficiently locate matching documents without full table scans.

Example Illustration

Consider a column containing Apache access logs where each row stores an entire log entry:

109.169.248.247 - - [12/Dec/2015:18:25:11 +0100] "GET /administrator/ HTTP/1.1" 200 4263 "-" "Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0" "-"

Without a text index:

  • To find all log entries containing “Firefox”, the system would need to scan every row and check if the string “Firefox” appears in the text.

With a text index:

  • During indexing, the log entry is tokenized into terms including “Firefox”
  • The term “Firefox” is linked to the documents containing it
  • At query time, the system can directly retrieve the rows containing “Firefox” without scanning all data

This approach significantly improves query performance for text search operations, especially on large datasets.

Configuration (How to Enable and Use)

Enabling Text Index

To enable a text index on a column in your StarTree Cloud table, add the following configuration to your table definition:

{
  "fieldConfigList": [
    {
      "name": "log_content",
      "encodingType": "RAW",
      "indexes": {
        "text": {}
      }
    }
  ],
  "tableIndexConfig": {
    "noDictionaryColumns": [
      "log_content"
    ]
  }
}

Alternative Configuration (Legacy Method)

You can also use the older configuration format:

{
  "fieldConfigList": [
    {
      "name": "log_content",
      "encodingType": "RAW",
      "indexTypes": ["TEXT"]
    }
  ],
  "tableIndexConfig": {
    "noDictionaryColumns": [
      "log_content"
    ]
  }
}

Important Configuration Considerations

  1. Column Requirements:
    • The column must be of type STRING
    • The column must be single-valued (not multi-valued)
    • The column should use RAW encoding (no dictionary)
  2. noDictionaryColumns: Text-indexed columns should be added to the noDictionaryColumns list in tableIndexConfig to reduce storage overhead.
  3. Index Coexistence: Using a text index in coexistence with other Pinot indexes on the same column is not supported.

Customizing Stop Words

You can customize the stop words used during indexing:

{
  "fieldConfigList": [
    {
      "name": "log_content",
      "encodingType": "RAW",
      "indexes": {
        "text": {}
      },
      "properties": {
        "stopWordInclude": "incl1, incl2, incl3",
        "stopWordExclude": "it"
      }
    }
  ]
}
  • stopWordInclude: Comma-separated list of words to include as stop words (in addition to defaults)
  • stopWordExclude: Comma-separated list of words to exclude from the default stop words list

Enabling Prefix/Suffix Matching in Phrase Queries

For more flexible phrase matching across term boundaries:

{
  "fieldConfigList": [
    {
      "name": "log_content",
      "encodingType": "RAW",
      "indexes": {
        "text": {}
      },
      "properties": {
        "enablePrefixSuffixMatchingInPhraseQueries": "true"
      }
    }
  ]
}

Tuning Lucene Index Creation

For optimizing index creation performance:

{
  "fieldConfigList": [
    {
      "name": "log_content",
      "encodingType": "RAW",
      "indexes": {
        "text": {}
      },
      "properties": {
        "luceneUseCompoundFile": "false",
        "luceneMaxBufferSizeMB": "128"
      }
    }
  ]
}