FAQ - StarTree Docs

This feature requires StarTree release 0.14.0 or later, and must be enabled on demand — contact StarTree support to activate it.

General

What is an External Table?
A Pinot table whose data stays in Parquet files in your object store (S3 Data Lake, AWS Glue, or AWS S3 Tables) instead of being copied into Pinot’s own segment format. Pinot reads the remote data at query time, with server-side caching to keep queries fast.

Which sources are supported?
S3 Data Lake (raw Parquet), AWS Glue (Iceberg REST), and AWS S3 Tables (Iceberg REST).

How is it different from a regular Pinot table?
A regular table ingests and stores a local copy of the data as Pinot segments. An External Table leaves the data in place and reads it remotely, so there’s no data duplication and onboarding is fast, at the cost of relying on the object store plus cache for read performance.

When should I use one, and when not?
Use it when you already have data in S3/Iceberg and want to query it without copying or re-ingesting: large or infrequently queried datasets, or data shared with other engines. Prefer a regular table for the lowest, most predictable query latency, or when you need an unsupported capability such as a sorted index.

Does Pinot copy the Parquet into Pinot, or read it from S3 directly?
It reads directly from the Parquet files in S3; the data is never duplicated into Pinot’s segment format. Servers cache the bytes they read (data, index, and footer caches) so repeat queries are fast. See Data and Index Caching.

Onboarding & sync

How do I onboard one?
Two ways: the Data Portal UI or the REST APIs.

Do I have to trigger the first sync?
No. After the table is created, the controller’s External Table watcher runs the first sync automatically and then re-syncs on the table’s schedule. You can trigger a run manually to start sooner; see Observability.

How often does it sync? Can I change the schedule?
It syncs on the schedule (Quartz cron) in the table’s ExternalTableSyncTask config. The Data Portal applies a default of every 5 minutes; via the API you set schedule yourself. Match it to your source’s commit cadence, because polling faster than the source changes just adds overhead.

Can I pause and resume syncing?
Yes. In the Data Portal, open the table and select Pause Sync. In-progress runs finish normally and data is preserved; resume when ready.

Connection validation failed. What do I check?
Almost always a permission or region mismatch. Confirm the credentials can read both the catalog and the underlying S3 data, and that the region matches. For AWS Glue, the warehouse must be the numeric AWS account ID.

Can multiple tables share the same credentials?
Yes. Reuse the same catalog connection parameters across as many External Tables as you like.

Do I put s3:// in the bucket field?
No. The bucket field takes the bucket name only, not an s3://... URL, and the prefix should not have a trailing slash. Make sure the AWS region is set; a missing region is a common cause of onboarding errors. See Troubleshooting.

How do I grant access to the source bucket?
Three options: an assumed IAM role (set roleArn, plus externalId if required), the cluster’s node IAM role, or static AWS access keys. The role/keys need s3:GetObject and s3:ListBucket on the source bucket and prefix. Verify access from the cluster with aws s3 ls s3://<bucket>/<prefix>/ before onboarding.

Can I onboard many tables at once?
Onboard each table through the Data Portal or the REST APIs. There is no single bulk table-creation API today; script the per-table API flow to automate it.
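The bucket-field rules and the permissions listed above can be checked before onboarding. The following is a minimal sketch, not part of any StarTree tooling: the helper names `normalize_location` and `read_only_policy` and the example bucket are hypothetical, and the policy assumes a non-empty prefix. It rejects an s3:// URL in the bucket field, drops a trailing slash from the prefix, and builds a standard IAM policy document granting only s3:ListBucket and s3:GetObject on that prefix.

```python
import json


def normalize_location(bucket: str, prefix: str, region: str) -> dict:
    """Validate the S3 location fields (hypothetical helper, for illustration)."""
    if "://" in bucket or "/" in bucket:
        raise ValueError("bucket must be the bucket name only, not an s3:// URL or path")
    if not region:
        raise ValueError("region is required")
    # The prefix should not end with a slash.
    return {"bucket": bucket, "prefix": prefix.rstrip("/"), "region": region}


def read_only_policy(bucket: str, prefix: str) -> dict:
    """Build an IAM policy allowing list and read on one (non-empty) prefix."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is granted on the bucket, restricted to the prefix.
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                # Reads are granted on the objects under the prefix.
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
        ],
    }


# "example-bucket" and the prefix are placeholder values.
loc = normalize_location("example-bucket", "warehouse/events/", "us-east-1")
policy = read_only_policy(loc["bucket"], loc["prefix"])
print(json.dumps(policy, indent=2))
```

The printed document can be attached to the role or user whose credentials the table uses; the aws s3 ls check mentioned above remains the quickest way to confirm access from the cluster.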

Schema & data types

How are source types mapped to Pinot types?
See Data Type Mapping for the Parquet (raw S3) and Iceberg type tables. Complex types (struct, map, list) become JSON or multi-value columns.

Can I edit the inferred schema or change a column’s data type?
Avoid changing the data types that the preview/onboarding step infers. Forcing a different type (for example, switching a binary/string column to INT) breaks segment generation. Adjusting names, the time column, or null handling is fine.

How do I add an inverted index (or another dictionary-backed index)?
Keep the column’s forward index RAW and add an explicit dictionary block alongside the index. Dictionary-backed indexes (inverted, FST, IFST) need it on External Tables:

"indexes": {
  "forward":    { "encodingType": "RAW" },
  "dictionary": {},
  "inverted":   { "disabled": false }
}
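The rule behind the fragment above can be expressed as a small pre-flight check. This is a sketch under stated assumptions, not StarTree's validation logic: the function `check_indexes` is hypothetical, and the JSON keys "fst" and "ifst" are assumed spellings for the FST and IFST indexes. It is also stricter than the FAQ requires, because it flags a missing forward block rather than relying on defaults.

```python
# Index types that need a dictionary, per the FAQ (inverted, FST, IFST).
# The keys "fst" and "ifst" are assumed spellings, used only for illustration.
DICTIONARY_BACKED = {"inverted", "fst", "ifst"}


def check_indexes(indexes: dict) -> list[str]:
    """Return a list of problems found in one column's "indexes" block."""
    problems = []
    # The forward index must be RAW on an External Table.
    if indexes.get("forward", {}).get("encodingType") != "RAW":
        problems.append("forward index must use RAW encoding")
    # Collect enabled indexes that rely on a dictionary.
    enabled = sorted(
        name for name, cfg in indexes.items()
        if name in DICTIONARY_BACKED and not cfg.get("disabled", False)
    )
    if enabled and "dictionary" not in indexes:
        problems.append(f"{enabled} need an explicit dictionary block")
    return problems


faq_fragment = {
    "forward": {"encodingType": "RAW"},
    "dictionary": {},
    "inverted": {"disabled": False},
}
print(check_indexes(faq_fragment))
```

Run against the fragment from this FAQ, the check reports no problems; removing the dictionary block makes it report the inverted index.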

Can I rename columns on an External Table?
Column rename is not currently supported.

How is the time column chosen?
Automatically, from the first suitable timestamp/date column, preserving the source granularity. See the time column section.

Why are my timestamps off by ~1000×?
Iceberg REST catalogs report timestamps without precision, so they’re inferred as milliseconds. If the underlying Parquet stores microseconds, set the time column’s granularity explicitly to match.

Why did I get a “Requires RAW encoding” error?
External Tables require every column’s forward index to use RAW (no-dictionary) encoding. The error means a column was configured with, or defaulted to, dictionary encoding. The preview/onboarding flow sets RAW automatically.
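The ~1000× timestamp discrepancy described above is plain unit arithmetic. A short worked sketch follows; the sample value and the unit labels are illustrative, not taken from any StarTree config. The same stored integer is read once as microseconds (what the Parquet file holds) and once as milliseconds (what is inferred).

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
# Illustrative unit labels; use whatever granularity names your config expects.
UNITS_PER_SECOND = {"MILLISECONDS": 1_000, "MICROSECONDS": 1_000_000}


def epoch_seconds(value: int, unit: str) -> float:
    """Convert a raw epoch value to seconds under the given unit."""
    return value / UNITS_PER_SECOND[unit]


stored = 1_700_000_000_000_000  # sample value: microseconds since the epoch

correct = epoch_seconds(stored, "MICROSECONDS")
misread = epoch_seconds(stored, "MILLISECONDS")

print(EPOCH + timedelta(seconds=correct))  # 2023-11-14 22:13:20+00:00
print(misread / correct)                   # 1000.0
```

Read as milliseconds, the value lands tens of thousands of years in the future, which is why time filters stop matching until the granularity is set to match the file.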

Indexes

Which indexes are supported?
Most of them, including inverted, range, timestamp, JSON, text, sparse, star-tree, bloom, FST, and IFST. Sorted, vector, and geospatial/H3 are not supported. See Supported Indexes.

Why isn’t the sorted index supported?
A sorted index is the column’s forward data physically stored in sorted order. External Tables read the Parquet files in place and don’t own that layout, so the data can’t be reordered.

Can I add an index after creating the table?
Yes. Update the table config to add a supported index, the same as any Pinot table. The index is built over the existing remote data.

Performance & caching

Where is the data cached?
On each server: a Parquet data cache, an index cache, and a Parquet footer cache. See Data and Index Caching.

The first query on a column is slow, then it’s fast. Why?
The first read populates the cache from object storage; later reads hit the cache. Enable pre-warm / preload to populate caches at segment load instead of on first query.

How do I make queries faster?
Add supported indexes for your filters, enable the page cache and preload for the table, and pre-warm at segment load. See Best Practices and Configs for the relevant knobs.

How do I clear a cache?
Use the server page cache endpoint; see Data and Index Caching.
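The slow-then-fast pattern above is what any read-through cache produces. The sketch below is a conceptual model only and does not reflect how Pinot servers implement their caches; the class `ReadThroughCache` and the stand-in fetch function are hypothetical. It shows a first read going to the backing store, later reads being served locally, and preloading moving that first-read cost ahead of the first query.

```python
class ReadThroughCache:
    """Toy read-through cache: fetch on a miss, serve from memory afterwards."""

    def __init__(self, fetch):
        self._fetch = fetch   # function that reads from the slow backing store
        self._store = {}
        self.misses = 0

    def get(self, key):
        if key not in self._store:
            self.misses += 1  # first read pays the remote fetch
            self._store[key] = self._fetch(key)
        return self._store[key]

    def preload(self, keys):
        # Warm the cache up front so the first query does not pay the miss.
        for key in keys:
            self.get(key)

    def clear(self):
        self._store.clear()


def fake_object_store_read(key):
    # Stand-in for a remote read; returns placeholder bytes.
    return f"bytes-for-{key}".encode()


cache = ReadThroughCache(fake_object_store_read)
cache.get("file-1/column-a")   # miss: goes to the backing store
cache.get("file-1/column-a")   # hit: served from the cache
print(cache.misses)            # 1
```

Calling `preload` with the keys a query will touch before the query runs leaves `misses` unchanged during the query itself, which is the effect pre-warm is meant to achieve.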

Operations

What happens when the source data changes or files are deleted?
Each scheduled sync advances to the latest Iceberg snapshot (or, for raw S3, scans for new files) and ingests what’s new; the watermark is visible via the checkpoint endpoint. Append-only sources are fully handled. Reconciliation of source-side deletes and compaction is still evolving. If your source mutates or compacts data, confirm the current behavior for your release with StarTree support.
