StarTree Cloud highlights

New Features

Composite JSON index enhancements

Added support for the following indexes within the Composite JSON index:
  • FST Index
  • Text index
This lets users configure the Composite JSON index on a JSON column together with these sub-indexes. Tiered storage support was also added for these sub-indexes. For more details, refer to the documentation.

Segment Backfill & Purge

  • Segment Backfill Dry Run Mode
    A new capability was added to preview backfill actions before execution. It shows which segments would be purged and what data would be ingested, without making any actual changes, which reduces the risk of data loss.
  • Support Segment Purge for Upsert Tables
    Segment purge tasks now work with upsert tables by marking rows as deleted, rather than removing them.

Consolidated Preload for Tiered Storage

A new mode downloads all preloaded indexes into a single consolidated mmap file.
Config:
preload.enable.index.consolidation (default: false)
By default, the preload index maintains one mmap mapping per index per column per segment. The Linux kernel typically limits the number of mappings to about 65536, so when the number of segments and configured pinned indexes grows close to that limit, this feature can be turned on to reduce the count to one mapping per segment. Consolidated preload puts all preloaded indexes in one file and slices the buffers of the different indexes out of a single mmap buffer.
The config is dynamic: after preload.enable.index.consolidation is changed (to either true or false), the preloaded indexes are fetched from S3.
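The effect on mapping counts, and the idea of slicing one mapped buffer into per-index views, can be sketched in Python. This is only an illustration and not StarTree's implementation: the index names, the back-to-back file layout, the (offset, length) directory, and the mapping_count helper are all assumptions made for the example.

```python
import mmap
import os
import tempfile

# Hypothetical per-index payloads for one segment (names are illustrative only).
indexes = {
    "colA.forward": b"AAAA",
    "colA.inverted": b"BBBBBB",
    "colB.forward": b"CC",
}

# Write all payloads back to back into one file and record where each one lives.
directory = {}
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    offset = 0
    for name, payload in indexes.items():
        f.write(payload)
        directory[name] = (offset, len(payload))
        offset += len(payload)

# One mmap for the whole file; each index is a zero-copy slice of that mapping.
with open(path, "rb") as f:
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    view = memoryview(mapped)
    slices = {n: view[off:off + length] for n, (off, length) in directory.items()}
    recovered = {n: bytes(s) for n, s in slices.items()}
    # Release the views before closing the mapping.
    for s in slices.values():
        s.release()
    view.release()
    mapped.close()
os.remove(path)


def mapping_count(segments, columns, indexes_per_column, consolidated):
    """Rough number of mmap mappings held for preloaded indexes."""
    if consolidated:
        return segments  # one mapping per segment
    return segments * columns * indexes_per_column


# 2000 segments with 8 pinned columns and 4 indexes each is close to the limit.
print(mapping_count(2000, 8, 4, consolidated=False))
print(mapping_count(2000, 8, 4, consolidated=True))
```

With these made-up numbers, the unconsolidated layout needs 64000 mappings while the consolidated layout needs 2000.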

New Table Storage Usage API

Unexpected growth in table storage can increase infrastructure and object storage costs. Pinot stores table data across multiple locations, such as server disk, deep store, and remote object stores used by tiered storage. The Table Storage Usage API reports table size with a breakdown by storage location and highlights mismatches between the expected size and the actual size. For more information, see the documentation.

Tiered Storage: Ease of table evolution

Evolving the table schema or config requires running minion tasks (SegmentRefreshTask or AlterTableTask). This release makes it easy to run these tasks and update the remote tier directly, without going through Pinot servers, which makes such changes faster and more efficient.

Embedded Schema Extraction for Parquet Files

When creating a table, Data Portal now automatically extracts the schema from Parquet file metadata, eliminating manual schema inference. It supports primitive types, logical type annotations, and timestamp handling, with a new priority order for schemas: Provided → Embedded → Inferred.
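The priority order can be read as a simple fallback chain. The sketch below assumes the priority applies to whole schemas (the first available schema wins) rather than column by column, which the notes do not specify. The resolve_schema function and the dictionary-based schema shape are hypothetical, not Data Portal APIs.

```python
from typing import Dict, Optional, Tuple

Schema = Dict[str, str]  # column name -> type name (illustrative shape only)


def resolve_schema(
    provided: Optional[Schema],
    embedded: Optional[Schema],
    inferred: Optional[Schema],
) -> Tuple[str, Schema]:
    """Return the first available schema in Provided -> Embedded -> Inferred order."""
    candidates = (
        ("provided", provided),
        ("embedded", embedded),
        ("inferred", inferred),
    )
    for source, schema in candidates:
        if schema:  # skip missing or empty schemas
            return source, schema
    raise ValueError("no schema available")


# The Parquet file carries embedded metadata and the user provided nothing,
# so the embedded schema is chosen over the inferred one.
source, schema = resolve_schema(
    provided=None,
    embedded={"ts": "TIMESTAMP", "clicks": "LONG"},
    inferred={"ts": "STRING", "clicks": "LONG"},
)
print(source, schema)
```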

Improvements

Minion Task Execution enhancements & Guardrails

  • Adaptive Disk Usage
    Stops the mapper phase when disk usage exceeds maxDiskUsagePercentage (default: 85%). This prevents disk-full failures during large ingestion jobs.
  • Faster reduce phase task generation in Alter Table Task
    Added support for parallelizing metadata file downloads in the Alter Table Task reduce phase. This was previously single-threaded and caused a bottleneck. It is controlled by the table-level config numMetadataFileDownloadThreads (default: 4).
  • Default Max Concurrent Tasks per Minion Based on Memory
    Automatically derives safe concurrency levels from system memory. This helps prevent out-of-memory errors.
  • Soft File Count Limit for Delta Ingestion
    Enforces a ~200k soft limit on file count to prevent excessive segment generation.
  • Conflict prevention
    Prevents conflicting scenarios, such as running a refresh or alter task during Delta ingestion, or running SIT/Delta/SRT during an ongoing backfill task.
  • Improve File Ingestion Task (FIT) Documentation, Defaults, and Validations
    Includes new defaults for consistent push retries and validation of critical config fields.
  • Cluster-Level Max Subtask Limit Enforcement
    Ensures subtask count respects cluster safety thresholds.
  • New Metrics
    Added metrics to capture consistent push failures and tasks skipped due to conflicts, and improved task metric accuracy.
  • Tenant rebalance cancellation
    Added the ability to cancel a tenant rebalance.
  • Manual or ad-hoc trigger of the controller’s periodic tasks
    Added support for triggering controller tasks on demand (one-time execution), rather than requiring them to be scheduled.
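The Adaptive Disk Usage guardrail in the list above amounts to checking disk usage between units of mapper work and stopping once the threshold is exceeded. A minimal sketch of that idea follows, assuming the check happens once per input file. The function names and the per-file granularity are assumptions; only the 85% default comes from these notes.

```python
import shutil

MAX_DISK_USAGE_PERCENTAGE = 85.0  # default stated in the release notes


def disk_usage_percent(path="."):
    """Percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def should_stop_mapper(used_percent, max_percent=MAX_DISK_USAGE_PERCENTAGE):
    """Stop only when usage exceeds (not merely reaches) the threshold."""
    return used_percent > max_percent


def run_mapper(files, process, usage_fn=disk_usage_percent):
    """Process files until the disk guardrail trips; return what was processed."""
    processed = []
    for name in files:
        if should_stop_mapper(usage_fn()):
            break  # leave the remaining files for a later run
        process(name)
        processed.append(name)
    return processed


# Simulated usage readings: the third check exceeds 85%, so only two files run.
readings = iter([10.0, 50.0, 90.0])
done = run_mapper(["a.parquet", "b.parquet", "c.parquet"],
                  process=lambda name: None,
                  usage_fn=lambda: next(readings))
print(done)
```

Passing the usage reader in as a parameter keeps the guard testable without filling a real disk.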

Performance & Stability

  • Controller Thread Pool Defaults
    Sets bounded defaults (general executor: 1000, rebalance: 200) to avoid stability issues.
  • Tiered Storage Enhancements
    • Cleanup dangling deep store sessions and stale reduce outputs
      New configs:
      • bufferDaysToPurgeOutputSegments (default: 3 days)
      • cleanupDanglingIntermediateFiles (default: true)
    • Avoid Prefetching Forward Index When JSON or Text Index Exists
      Reduces unnecessary I/O for JSON_EXTRACT_INDEX and text_match operations.

Bug Fixes

  • Fix IndexOutOfBounds in Backfill for Empty Predicates
  • Improved backwards compatibility in dangling file cleanup
  • Fixes related to task limit validation, config overrides, and inconsistent retry configs
  • Segment Import Task: Allows changing bucket duration (e.g., 1h → 6h) without data loss or duplicates.

Apache Pinot (OSS) highlights

New Features

  • Robust OOM Protection
    Unifies lifecycle and metadata handling for all query execution threads, which gives safer cancellation, better resource tracking, and improved observability. For more information, see the documentation.
  • Apache Arrow Decoder (Experimental)
    Adds initial Arrow-format ingestion support. This is intended to improve ingestion efficiency by reducing processing overhead.
  • N-gram Filtering Index (Experimental)
    Added support for a realtime n-gram index that efficiently pre-filters non-matching strings.
  • Kafka Client Default Upgraded to Kafka 3
  • IP Address Functions
    Added helper functions such as ipPrefix, ipSubnetMin, ipSubnetMax, etc.
  • Array Manipulation Functions
    Added helper functions to push elements to front/back for all primitive/string array types.
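To illustrate how an n-gram index can pre-filter non-matching strings, the sketch below builds a toy trigram index and uses it to narrow the candidates for a substring search before verifying each one. It shows the general technique only. Pinot's actual index format, n-gram length, and query integration are not described in these notes, and every name here is hypothetical.

```python
def ngrams(text, n=3):
    """All distinct substrings of length n in text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def build_index(values, n=3):
    """Map each n-gram to the set of document ids whose value contains it."""
    index = {}
    for doc_id, value in enumerate(values):
        for gram in ngrams(value, n):
            index.setdefault(gram, set()).add(doc_id)
    return index


def candidates(index, pattern, num_docs, n=3):
    """Doc ids that contain every n-gram of the pattern (may include false positives)."""
    grams = ngrams(pattern, n)
    if not grams:
        return set(range(num_docs))  # pattern shorter than n: cannot pre-filter
    result = None
    for gram in grams:
        ids = index.get(gram, set())
        result = ids if result is None else result & ids
    return result


def search(values, index, pattern, n=3):
    """Pre-filter with the index, then verify each candidate with a real match."""
    return [i for i in sorted(candidates(index, pattern, len(values), n))
            if pattern in values[i]]


values = ["pinot server", "pinot broker", "minion task"]
index = build_index(values)
print(search(values, index, "broker"))
```

The verification step is needed because a value can contain all of a pattern's n-grams without containing the pattern itself. The index only rules rows out cheaply.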

Improvements

  • Support MAP Type for Derived Columns During Reload
  • Partial Upserts Stability
    Disable reload on consuming segments; force commit to avoid corruption.
  • Segment Reload Failure Tracking
    This release adds in-memory tracking for failed reloads.
  • Automatic Rewrite of MIN/MAX/SUM on Long/String Types
    Rewrites to type-correct variants to avoid precision loss.
  • Star-tree Index Build Robustness
    • Skip star-tree creation if index build fails
    • Roll back to existing index when updates fail
    • New metrics to track failures
  • Star-tree MV Aggregation Support
    The SUMMV, COUNTMV, and AVGMV functions are now supported on multi-value columns.
  • Async Segment Refresh Message Processing
    Enabled by default in StarTree Cloud.
  • Audit logging filtering improvements
    Added URL-based filtering support when collecting audit logs.
  • Perf Improvements in MSE (join optimization)
    Improved query performance by optimizing hash function usage in the query planner (https://github.com/apache/pinot/pull/16830).
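The precision loss that the MIN/MAX/SUM rewrite above guards against can be reproduced with plain arithmetic. The sketch below assumes the loss comes from aggregating 64-bit integer values through double-precision floats, which the notes do not state explicitly. Python's int stands in for an exact long, and the helper names are illustrative.

```python
def sum_via_double(values):
    """Accumulate through a double, as a generic floating-point SUM would."""
    total = 0.0
    for v in values:
        total += float(v)
    return total


def sum_exact(values):
    """Accumulate with integer arithmetic, keeping every bit."""
    total = 0
    for v in values:
        total += v
    return total


# 2**53 is the point past which a double can no longer represent every integer.
values = [2**53, 1, 1]

exact = sum_exact(values)
lossy = int(sum_via_double(values))
print(exact - lossy)  # the two added 1s are lost in the double-based sum
```

Each addition of 1 to 2**53 rounds back to 2**53 in double precision, so the floating-point total is off by 2 while the integer total is exact.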