This feature requires StarTree release 0.14.0 or later and is enabled only on request. Contact StarTree support to activate it.
An External Table is a Pinot table whose data stays in Parquet files in your object store — S3 Data Lake, AWS Glue, or AWS S3 Tables — instead of being copied into Pinot’s own segment format. Pinot reads the remote Parquet at query time and exposes it through standard SQL. There is no ETL pipeline, no data duplication, and onboarding takes minutes instead of hours.
A watcher on the controller detects source-side changes on a schedule and builds segment index files (bloom filters, inverted indexes, range indexes, and others) alongside the Parquet. Queries use three server-side caches (a Parquet data cache, an index cache, and a footer cache), so repeat reads avoid S3 round-trip latency. With the right indexes configured, query times on large datasets can drop from minutes to milliseconds.
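As a rough illustration of scheduled change detection, the sketch below compares two snapshots of a source prefix and reports which files were added, changed, or removed. It is a generic technique, not a description of how the StarTree watcher is implemented. The `diff_listings` function, the `SourceDiff` type, and the use of a version tag per object are all assumptions made for this example.

```python
from typing import Dict, List, NamedTuple


class SourceDiff(NamedTuple):
    added: List[str]
    changed: List[str]
    removed: List[str]


def diff_listings(previous: Dict[str, str], current: Dict[str, str]) -> SourceDiff:
    """Compare two {object key: version tag} snapshots of a source prefix.

    Hypothetical helper: the version tag could be an ETag or a
    last-modified timestamp, depending on what the source exposes.
    """
    added = sorted(k for k in current if k not in previous)
    removed = sorted(k for k in previous if k not in current)
    changed = sorted(
        k for k in current if k in previous and current[k] != previous[k]
    )
    return SourceDiff(added, changed, removed)


# Snapshot from the last sync run versus a fresh listing (made-up values).
previous = {"data/part-0.parquet": "etag-a", "data/part-1.parquet": "etag-b"}
current = {
    "data/part-0.parquet": "etag-a",
    "data/part-1.parquet": "etag-c",
    "data/part-2.parquet": "etag-d",
}
diff = diff_listings(previous, current)
```

In a flow like this, only the files in `added` and `changed` would need new index files, which is what keeps a scheduled sync cheap when most of the source is unchanged.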
How it works
At query time, the server checks its local caches first. On a miss, it fetches the required Parquet column pages or index byte ranges from object storage and stores them for subsequent queries. Index files (built by the watcher at sync time) live in tiered storage and are also cached locally, so filters and aggregations avoid full column scans.
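The lookup order described above is a read-through cache pattern. The following minimal sketch shows the idea for byte ranges. The class name, the cache key shape, the LRU eviction policy, and the `fake_object_store` stand-in are assumptions for illustration only. They do not reflect the actual server implementation or its eviction behavior.

```python
from collections import OrderedDict
from typing import Callable, Tuple

# Hypothetical cache key: (object path, byte offset, length).
RangeKey = Tuple[str, int, int]


class ReadThroughCache:
    """Tiny read-through cache for byte ranges (illustrative only)."""

    def __init__(self, fetch: Callable[[str, int, int], bytes], max_entries: int = 1024):
        self._fetch = fetch              # called only on a cache miss
        self._max_entries = max_entries
        self._entries: "OrderedDict[RangeKey, bytes]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read(self, path: str, offset: int, length: int) -> bytes:
        key = (path, offset, length)
        if key in self._entries:
            self._entries.move_to_end(key)     # mark as most recently used
            self.hits += 1
            return self._entries[key]
        self.misses += 1
        data = self._fetch(path, offset, length)   # remote round trip
        self._entries[key] = data                   # store for later queries
        if len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)       # assumed LRU eviction
        return data


def fake_object_store(path: str, offset: int, length: int) -> bytes:
    """Stand-in for a ranged GET against object storage."""
    blob = path.encode() * 64
    return blob[offset:offset + length]


cache = ReadThroughCache(fake_object_store, max_entries=2)
first = cache.read("s3://bucket/part-0.parquet", 0, 16)    # miss: fetched remotely
second = cache.read("s3://bucket/part-0.parquet", 0, 16)   # hit: served locally
```

The second read returns the same bytes without calling the fetch function, which is the effect the data, index, and footer caches aim for on repeat queries.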
Supported sources
| Source | Protocol | catalogType |
|---|---|---|
| S3 Data Lake | Raw Parquet files under an S3 prefix | s3 |
| AWS Glue | Iceberg REST | iceberg-rest (serviceType=glue) |
| AWS S3 Tables | Iceberg REST | iceberg-rest (serviceType=s3Tables) |
catalogType=iceberg-rest works with any Iceberg REST–compliant catalog. Data files must be Parquet.
Where to start
Choose your path based on what you’re trying to do:
- New user, prefer point-and-click → Onboarding via Data Portal — wizard-based setup, no API calls required
- New user, prefer API / automation / IaC → Onboarding via API — 4-step REST flow with bash examples and a copy-paste quickstart script
- Table created, monitoring sync progress → Observability — sync status, checkpoint watermark, and source file count APIs
- Queries are slow → Indexes to add the right indexes, then Best Practices & Configs for caching and tuning
- Something is broken → Troubleshooting for symptom-based fixes, or FAQ for common questions
Page map
| Page | What it covers |
|---|---|
| Onboarding via Data Portal | Point-and-click wizard: connect, browse, configure, monitor |
| Onboarding via API | REST API 4-step flow with bash examples and a self-contained quickstart script |
| Observability | Sync run status, ingestion checkpoint, source file count, and manual trigger APIs |
| Data Type Mapping | Parquet → Pinot and Iceberg → Pinot type mapping tables, plus time column detection |
| Indexes | Supported indexes, why columns are RAW, and per-index config examples |
| Data and Index Caching | Three caches (data, index, footer), eviction, restart behavior, and how to clear them |
| Best Practices & Configs | Full config reference: sync task, tier backend, server/cluster, query options, and OOM protection |
| FAQ | Common questions by category: general, onboarding, schema, indexes, performance, operations |
| Troubleshooting | Symptom-based diagnostic playbook with exact error strings and escalation guidance |