External Tables Overview

This feature requires StarTree release 0.14.0 or later, and must be enabled on demand — contact StarTree support to activate it.

An External Table is a Pinot table whose data stays in Parquet files in your object store (accessed as an S3 Data Lake, AWS Glue, or AWS S3 Tables source) instead of being copied into Pinot's own segment format. Pinot reads the remote Parquet at query time and exposes it through standard SQL. There is no ETL pipeline and no data duplication, and onboarding takes minutes instead of hours.

A watcher on the controller detects source-side changes on a schedule and builds segment index files (bloom filters, inverted indexes, range indexes, etc.) alongside the Parquet. Queries use three server-side caches (a Parquet data cache, an index cache, and a footer cache), so repeat reads avoid paying S3 round-trip latency. With the right indexes configured, query times on large datasets can drop from minutes to milliseconds.

How it works

At query time, the server checks its local caches first. On a miss, it fetches the required Parquet column pages or index byte ranges from object storage and stores them for subsequent queries. Index files (built by the watcher at sync time) live in tiered storage and are also cached locally, so filters and aggregations avoid full column scans.
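The cache-first read path described above can be sketched as a small read-through cache. This is an illustration under stated assumptions, not Pinot's implementation: the class name `ReadThroughCache`, the `(path, offset, length)` key, the LRU eviction policy, and the `fake_object_store_fetch` stand-in for a ranged read against object storage are all hypothetical.

```python
from collections import OrderedDict
from typing import Callable, Tuple

# Hypothetical cache key: (object path, byte offset, length).
RangeKey = Tuple[str, int, int]


class ReadThroughCache:
    """Tiny LRU read-through cache for byte ranges (illustrative only)."""

    def __init__(self, fetch: Callable[[str, int, int], bytes], max_entries: int = 128):
        self._fetch = fetch              # called only on a cache miss
        self._max_entries = max_entries
        self._entries: "OrderedDict[RangeKey, bytes]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read(self, path: str, offset: int, length: int) -> bytes:
        key = (path, offset, length)
        if key in self._entries:
            self._entries.move_to_end(key)        # mark as most recently used
            self.hits += 1
            return self._entries[key]
        self.misses += 1
        data = self._fetch(path, offset, length)  # simulated remote round trip
        self._entries[key] = data                 # keep it for later queries
        if len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)     # evict least recently used
        return data


def fake_object_store_fetch(path: str, offset: int, length: int) -> bytes:
    """Stand-in for a ranged read against object storage (assumption)."""
    blob = bytes(range(256)) * 4
    return blob[offset:offset + length]


cache = ReadThroughCache(fake_object_store_fetch, max_entries=2)
first = cache.read("s3://bucket/table/part-0.parquet", 0, 16)   # miss: fetched remotely
second = cache.read("s3://bucket/table/part-0.parquet", 0, 16)  # hit: served locally
```

The second read returns the same bytes without calling the fetch function, which is the effect the server-side caches have on repeat queries. The eviction policy and cache sizing that Pinot actually uses are covered on the Data and Index Caching page.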

Supported sources

Source	Protocol	`catalogType`
S3 Data Lake	Raw Parquet files under an S3 prefix	`s3`
AWS Glue	Iceberg REST	`iceberg-rest` (`serviceType=glue`)
AWS S3 Tables	Iceberg REST	`iceberg-rest` (`serviceType=s3Tables`)

`catalogType=iceberg-rest` works with any Iceberg REST–compliant catalog. Data files must be Parquet.
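For automation scripts, the table above can be captured as a small lookup. This is a minimal sketch: the `catalogType` and `serviceType` values come from the table, while the dictionary layout and the `catalog_settings` helper are hypothetical and do not mirror the actual table config schema (see Onboarding via API for the real request shape).

```python
from typing import Dict, Optional

# Values copied from the "Supported sources" table. The structure and the
# helper name below are illustrative assumptions, not a StarTree API.
SOURCE_SETTINGS: Dict[str, Dict[str, Optional[str]]] = {
    "S3 Data Lake": {"catalogType": "s3", "serviceType": None},
    "AWS Glue": {"catalogType": "iceberg-rest", "serviceType": "glue"},
    "AWS S3 Tables": {"catalogType": "iceberg-rest", "serviceType": "s3Tables"},
}


def catalog_settings(source: str) -> Dict[str, str]:
    """Return the settings for a source, omitting any that do not apply."""
    try:
        settings = SOURCE_SETTINGS[source]
    except KeyError:
        raise ValueError(f"unsupported source: {source!r}") from None
    return {key: value for key, value in settings.items() if value is not None}


glue = catalog_settings("AWS Glue")
data_lake = catalog_settings("S3 Data Lake")
```

Here `glue` carries both `catalogType` and `serviceType`, while `data_lake` carries only `catalogType`, matching the table.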

Where to start

Choose your path based on what you’re trying to do:

New user, prefer point-and-click → Onboarding via Data Portal — wizard-based setup, no API calls required
New user, prefer API / automation / IaC → Onboarding via API — 4-step REST flow with bash examples and a copy-paste quickstart script
Table created, monitoring sync progress → Observability — sync status, checkpoint watermark, and source file count APIs
Queries are slow → Indexes to add the right indexes, then Best Practices & Configs for caching and tuning
Something is broken → Troubleshooting for symptom-based fixes, or FAQ for common questions

Page map

Page	What it covers
Onboarding via Data Portal	Point-and-click wizard: connect, browse, configure, monitor
Onboarding via API	REST API 4-step flow with bash examples and a self-contained quickstart script
Observability	Sync run status, ingestion checkpoint, source file count, and manual trigger APIs
Data Type Mapping	Parquet → Pinot and Iceberg → Pinot type mapping tables, plus time column detection
Indexes	Supported indexes, why columns are RAW, and per-index config examples
Data and Index Caching	Three caches (data, index, footer), eviction, restart behavior, and how to clear them
Best Practices & Configs	Full config reference: sync task, tier backend, server/cluster, query options, and OOM protection
FAQ	Common questions by category: general, onboarding, schema, indexes, performance, operations
Troubleshooting	Symptom-based diagnostic playbook with exact error strings and escalation guidance
