
Executive Summary

This release focuses on making StarTree more lakehouse-native, faster, and more reliable in production.
  • Lakehouse Integration: Deeper support for Iceberg and Parquet (including complex types and improved readers) makes it easier to run analytics directly on data lakes.
  • Performance & Query Execution: Enhancements like replica-aware routing, broker-based execution, and improved caching reduce query latency and improve efficiency.
  • Streaming & CDC: Expanded Debezium support strengthens real-time ingestion pipelines.
  • Observability & Reliability: New metrics, retry mechanisms, and safeguards improve stability and operational visibility.
Bottom line: Faster queries, better lakehouse integration, and stronger production readiness.

StarTree Cloud highlights

New Features

External Tables: Iceberg (Glue & S3 Tables)

StarTree Cloud can now query Apache Iceberg tables in place — no re-ingestion required. Iceberg tables registered in Nessie, AWS Glue, or S3 Tables are queried directly using Pinot’s query engine. For more details, see the External Tables documentation.
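Once registered, an external Iceberg table is queried like any other Pinot table. A minimal sketch (the table and column names below are illustrative, not from the release):

```sql
-- "orders_iceberg" is a hypothetical external Iceberg table registered in
-- Glue or S3 Tables; the query runs in place, with no ingestion step.
SELECT customer_id, SUM(amount) AS total_spend
FROM orders_iceberg
WHERE order_date >= '2024-01-01'
GROUP BY customer_id
ORDER BY total_spend DESC
LIMIT 10
```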

External Tables: S3 Remote Parquet Files

StarTree Cloud can now register an S3 location as a Pinot table and query Parquet files in place without ingestion. For more details, see the External Tables documentation.

Query Engine & Execution

Two new query engine features improve performance and isolation:
  • Replica-group-aware query routing, enabling better isolation and workload-aware query execution.
  • Support for brokers acting as intermediate-stage workers in distributed query execution. This improves elasticity for multi-stage engine queries, since brokers are simpler to scale than servers. The number of brokers used, and which brokers are selected, can be controlled via Helix tags and query options.
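In Pinot's multi-stage engine, query options are typically supplied via `SET` statements ahead of the query. A sketch of what enabling broker stage-workers could look like — the option names here are illustrative assumptions, since the release notes do not name them; consult the StarTree documentation for the exact options:

```sql
-- Hypothetical option names; see the release documentation for the real ones.
SET useBrokerAsIntermediateStageWorker = true;
SET numBrokersForIntermediateStages = 2;
SELECT o.customer_id, COUNT(*) AS cnt
FROM orders o
JOIN customers c ON o.customer_id = c.id
GROUP BY o.customer_id
```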

Columnar Segment Processing Framework (CSPF)

A new columnar-first processing framework for segment operations that works column-by-column instead of row-by-row, significantly reducing CPU and memory overhead for large segments. Segment reload, File Ingestion Task, Segment Refresh Task, Alter Table Task, and Segment Import Task are all integrated with CSPF. Supports expression transformations, sorting, sanitization, and time column handling.
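The core idea can be illustrated with a toy sketch (this is not StarTree's CSPF implementation): applying a transform once per column avoids per-row dispatch overhead and keeps values of one type together in memory, which is why columnar processing is cheaper for large segments.

```python
# Toy contrast of row-by-row vs column-by-column segment transformation.
# Not StarTree's implementation — just the shape of the idea.

def transform_row_by_row(rows, transforms):
    """Apply per-column transforms while iterating row by row."""
    out = []
    for row in rows:
        out.append({col: transforms.get(col, lambda v: v)(val)
                    for col, val in row.items()})
    return out

def transform_columnar(columns, transforms):
    """Apply each transform to a whole column at once."""
    return {col: [transforms.get(col, lambda v: v)(v) for v in vals]
            for col, vals in columns.items()}

rows = [{"ts": 1, "price": 10.0}, {"ts": 2, "price": 12.5}]
columns = {"ts": [1, 2], "price": [10.0, 12.5]}
transforms = {"price": lambda v: round(v * 1.1, 2)}

assert transform_row_by_row(rows, transforms)[1]["price"] == 13.75
assert transform_columnar(columns, transforms)["price"] == [11.0, 13.75]
```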

Minion Task Orchestration Framework

A new plan-based orchestration framework for managing complex, multi-step Minion task workflows. Task plans define sequences of tasks with dependencies, enabling safer and more predictable execution. Includes REST APIs for managing task plans and support for ad-hoc (one-time) task triggers. The File Ingestion Task and the Segment Purge Task have been onboarded onto this framework.
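Conceptually, a task plan is a small DAG of task steps. A sketch of what such a plan could look like — the field names below are illustrative assumptions, not the actual schema:

```json
{
  "planName": "fileIngestionWithRefresh",
  "tasks": [
    { "id": "ingest", "taskType": "FileIngestionTask" },
    { "id": "refresh", "taskType": "SegmentRefreshTask", "dependsOn": ["ingest"] }
  ]
}
```

The `dependsOn` edges are what make execution predictable: the refresh step runs only after ingestion completes.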

Segment Purge Task Enhancements

The segment purge capability received major improvements across the board:
  • Flexible purge criteria — Supports column-level predicates and SQL query-based selectors for targeting rows to purge
  • Dry run mode — Preview which segments and rows are affected before executing, with verbose reporting of skipped segments
  • Ad-hoc API — Trigger purge tasks on demand without a scheduled task config
  • Performance — Replaced DISTINCT with GROUP BY in purge queries for significantly better performance on large tables
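Putting the first three items together, a purge task configuration could look roughly like the sketch below. The config keys are illustrative assumptions (the release notes do not spell them out); refer to the Segment Purge Task documentation for the real ones:

```json
{
  "task": {
    "taskTypeConfigsMap": {
      "PurgeTask": {
        "purgeCriteria.sqlSelector": "SELECT * FROM myTable WHERE is_deleted = true",
        "dryRun": "true"
      }
    }
  }
}
```

With `dryRun` enabled, the task reports the affected segments and rows (including skipped segments) without modifying any data.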

Composite JSON Index Enhancements

  • Added FST and Text index as sub-indexes within the Composite JSON index, with tiered storage support
  • Added partitioned inverted index — the dictionary and posting lists are split into N sub-indexes by JSON path, reducing per-partition memory pressure and enabling read parallelism
  • Added PromQL label query support via the JSON index
  • TEXT_MATCH predicates now work correctly on consuming (real-time) segments
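The partitioned inverted index idea can be shown with a toy sketch (not StarTree's implementation): posting lists are split into N sub-indexes keyed by a hash of the JSON path, so each partition holds a smaller dictionary and can be searched independently, and in parallel.

```python
# Toy partitioned inverted index: posting lists split into N sub-indexes
# by JSON path. Illustration only — not StarTree's implementation.
from collections import defaultdict

NUM_PARTITIONS = 4

def partition_of(json_path: str) -> int:
    return hash(json_path) % NUM_PARTITIONS

class PartitionedInvertedIndex:
    def __init__(self):
        # One dictionary of posting lists per partition.
        self.partitions = [defaultdict(list) for _ in range(NUM_PARTITIONS)]

    def add(self, doc_id: int, json_path: str, value: str):
        self.partitions[partition_of(json_path)][(json_path, value)].append(doc_id)

    def lookup(self, json_path: str, value: str):
        # Only the owning partition is touched, keeping memory pressure local.
        return self.partitions[partition_of(json_path)].get((json_path, value), [])

idx = PartitionedInvertedIndex()
idx.add(0, "$.user.name", "alice")
idx.add(1, "$.user.name", "bob")
idx.add(2, "$.user.name", "alice")
assert idx.lookup("$.user.name", "alice") == [0, 2]
assert idx.lookup("$.user.city", "paris") == []
```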

Query Analyzer

Query Analyzer is a new AI-powered tool embedded in the Data Portal Query Console that helps you understand and optimize Apache Pinot multi-stage engine (MSE) queries. It analyzes your SQL query alongside table metadata, explain plans, and execution statistics to produce prioritized, evidence-backed optimization recommendations — without requiring deep Pinot expertise. This is a beta feature, disabled by default. Contact your StarTree account team to have it enabled for your environment.

Ingestion: New Decoder Support

  • Debezium CDC — Added Debezium decoder support for all Confluent Schema Registry formats (JSON, Protobuf, Avro), enabling CDC pipelines to ingest directly into StarTree without custom transformation
  • AWS Glue Schema Registry — New stream decoder for Kafka topics encoded with the AWS Glue Schema Registry Avro wire format
  • Kafka 3.0 Confluent Consumer — Added ConfluentKafkaConsumerFactory for Kafka 3.x clients
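For context, decoders and consumer factories are wired into a real-time table through its `streamConfigs`. A sketch, with the fully qualified class names left as placeholders since the release notes only give the short names:

```json
{
  "streamConfigs": {
    "streamType": "kafka",
    "stream.kafka.topic.name": "orders",
    "stream.kafka.consumer.factory.class.name": "<fully-qualified ConfluentKafkaConsumerFactory>",
    "stream.kafka.decoder.class.name": "<fully-qualified Debezium JSON/Avro/Protobuf decoder>"
  }
}
```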

Additional New Features

  • Tiered Storage Caching Integration — Prefetched tiered storage segments are now stored in the Parquet disk cache, reducing repeated S3 access for hot segments
  • Table Storage Usage API — Added percentile-based segment size stats (p50, p90, p99) and verbose level controls
  • Deep Store Stale Segment Detection — New API GET /tables/{tableName}/deepstoreStaleSegmentInfo to estimate segments out of sync between servers and the deep store
  • Cluster Cloner Controls — Added options to conditionally skip deep store copying, table deletion, and schema/config change checks during cluster migration
  • Delta Ingestion Reliability — Upfront config validation, auto-default S3/GCS parameters, and correct failure propagation when files fail during segment generation
  • Batch File Listing Optimizations — Narrowed S3 listing scope using glob pattern prefixes; paginated PinotFS listing with filter push-down in the Preview API
  • gRPC Authentication & Authorization — RBAC enforcement for gRPC requests in broker access control
  • Introduced an Arrow-based Parquet column reader for improved performance
  • Added support for:
    • Null handling in remote Parquet tables
    • Complex types (Struct, List) in Parquet ingestion
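The percentile-based segment size stats mentioned above (p50, p90, p99) can be sketched with the nearest-rank method; the API's exact percentile method is an assumption here.

```python
import math

def percentile(sorted_values, p):
    """Nearest-rank percentile: smallest value with >= p% of data at or below it."""
    rank = max(1, math.ceil(p / 100 * len(sorted_values)))
    return sorted_values[rank - 1]

# Hypothetical per-segment sizes in MB for one table.
segment_sizes_mb = sorted([120, 95, 240, 130, 110, 480, 125, 100, 118, 122])
stats = {f"p{p}": percentile(segment_sizes_mb, p) for p in (50, 90, 99)}
assert stats == {"p50": 120, "p90": 240, "p99": 480}
```

Note how p99 surfaces the outlier 480 MB segment that a plain average would hide — the motivation for reporting percentiles rather than a single mean.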

Improvements

  • OOM Protection — OOM resource accounting is now enabled by default, with query kill for queries that exceed resource limits
  • Upsert Stability — Fixed SIGSEGV errors on upsert table startup/shutdown; prevented RocksDB state reuse for partial upsert tables
  • Query Performance — Improved query performance with a prefetchable forward index reader
  • Parquet Read Efficiency — Enhanced Parquet reads with page-level caching and cache eviction mechanisms
  • Audit Identity — Added StarTreeTokenResolver for consistent identity attribution in audit logs
  • Observability — New metric for long segment replacement durations; Parquet page cache Prometheus metrics with per-layer hit/miss tracking; new preload cache size and buffer usage metrics
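The page-level caching and eviction mentioned above follow a familiar pattern; a toy LRU page cache sketches the mechanics (this is an illustration, not the actual StarTree implementation):

```python
# Toy LRU page cache with hit/miss counters, illustrating page-level
# caching with eviction for a Parquet-style read path. Sketch only.
from collections import OrderedDict

class PageCache:
    def __init__(self, capacity_pages: int):
        self.capacity = capacity_pages
        self.pages = OrderedDict()  # (file, page_index) -> bytes
        self.hits = self.misses = 0

    def get(self, file: str, page_index: int, load_page):
        key = (file, page_index)
        if key in self.pages:
            self.pages.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self.pages[key]
        self.misses += 1
        data = load_page(file, page_index)  # e.g. fetch from S3 or local disk
        self.pages[key] = data
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)  # evict least recently used page
        return data

cache = PageCache(capacity_pages=2)
loader = lambda f, i: f"page-{i}".encode()
cache.get("a.parquet", 0, loader)
cache.get("a.parquet", 1, loader)
cache.get("a.parquet", 0, loader)  # hit
cache.get("a.parquet", 2, loader)  # evicts page 1
assert cache.hits == 1 and cache.misses == 3
assert ("a.parquet", 1) not in cache.pages
```

The per-layer hit/miss counters are exactly the kind of signal exported to Prometheus in this release.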

Bug Fixes

  • Purge task no longer uploads empty segments or throws exceptions at the max segment limit
  • Fixed Iceberg segment name normalization and segment name conflict on partition collisions
  • Fixed Delta ingestion silent success on file processing errors; fixed preserveNullValues not being honored consistently
  • Fixed timestamp column min/max values being incorrectly converted to milliseconds
  • Fixed Parquet reader errors: multi-value string UnsupportedOperationException, INT96 timestamp ClassCastException, and incorrect cache lookups
  • Fixed gRPC connection failures — now falls back to HTTP gracefully
  • Fixed Parquet disk cache write position bug on reload; disabled snapshot recovery by default
  • Fixed GlobPrefixExtractor URI scheme restoration for S3 and GCS paths
  • Fixed filterColumnsForRow dropping incomplete and sanitized flags in dedup processing
  • Fixed InstancePoolsNReplicaGroupsCheck health check failure for tables using instancePartitionsMap
  • Fixed controller crash-loop on restart when schemas are missing

Apache Pinot Highlights

The following changes were made to the open source Apache Pinot project since the last release.

New Features

  • Enhancements to multi-stage query engine (v2), including better stage execution and scalability.
  • Improved query routing and server selection, enabling more efficient distributed execution.
  • Expanded JSON and text indexing capabilities for richer query support.
  • Added improvements to stream ingestion (Kafka/CDC) for better handling of real-time pipelines.
  • New/updated minion tasks and table management APIs for maintenance workflows (e.g., purge, dedup).

Improvements

  • Faster queries via improvements in filter pushdown, segment pruning, and index utilization.
  • Enhanced upsert and dedup performance, including better handling of edge cases.
  • Improved Parquet/deep storage integration, making lakehouse-style querying more efficient.
  • Better metrics and observability for query execution, segment lifecycle, and system health.
  • Improved broker/server memory and resource usage.
  • General stability and scalability improvements in distributed query execution.

Bug Fixes

  • Fixed query correctness issues in joins, aggregations, and edge-case filters.
  • Resolved upsert inconsistencies and state management issues.
  • Fixed stream ingestion edge cases (offset handling, schema mismatches).
  • Addressed segment loading, replacement, and metadata inconsistencies.
  • Fixed NPEs, race conditions, and build/dependency issues.