This feature requires StarTree release 0.14.0 or later, and must be enabled on demand — contact StarTree support to activate it.
Best practices for running External Tables, followed by the full reference for every configuration and query option: the table-level sync config, the S3 tier and caching properties, the server/cluster tuning knobs, and the per-query SET options.
Best practices
Onboarding
- Enter the bucket name only (no
s3://) and set the AWS region in the connection config — a region in config means you don’t depend on the AWS_REGION environment variable.
- Grant the cluster access to the source bucket with an assumed IAM role (
roleArn + optional externalId) or the cluster’s node role; verify with aws s3 ls s3://<bucket>/<prefix>/ before onboarding.
- For very large sources, start with a smaller prefix subset and scale the cluster before onboarding the full dataset.
- Set
continueOnFileError=true if a sync run should skip an unreadable file instead of failing the whole run (default is to fail).
Schema
- Keep the data types the preview step infers; overriding them can break segment generation.
- Set a time column where the data has one — it enables time-based segment pruning and retention. Recommended but not mandatory; tables without a time column still work.
Indexes
- Add indexes for the columns you filter and group by — without them, queries scan remote data.
- For dictionary-backed indexes (inverted, FST, IFST), keep the forward index
RAW and add an explicit dictionary block.
- Avoid grouping by derived/computed columns — it defeats segment pruning.
Tiered storage — data & index caching
- Enable the page cache (
enable.prefetch.page.cache) and preload.enable so index data is served locally — see Caching and index pinning.
- Pin small, hot indexes (bloom filter, dictionary) with
preload.index.keys.override; let the index cache serve the larger/colder ones.
- Turn on index consolidation to collapse a segment’s per-column index files into one mmap, avoiding the OS
max_map_count limit on wide tables — see Caching and index pinning.
- Enable pre-warm (
pinot.parquet.prewarm.enabled) to pay the cache-fill cost at segment load instead of on the first query.
Large-scan queries (OOM protection)
- For tables that get heavy scans or aggregations, enable the query OOM killer so a single runaway query is killed instead of crashing the server — see Query OOM protection.
- Optionally enable the pre-kill pause so the server briefly pauses to let GC reclaim memory before killing — this can save a query that would otherwise be killed.
Operations
Caching and index pinning
External Table queries read indexes from remote storage, so keeping index data local is what makes them fast. There are two ways to do that — pinning and the index cache — and they’re used together.
| Pinning (preload) | Index cache |
|---|
| Enabled by | preload.enable=true + preload.index.keys.override | enable.prefetch.page.cache=true |
| What it does | Downloads the listed index files into local mmap at segment load and keeps them resident | Serves index byte ranges through an on-demand, evictable (LRU) cache |
| When it loads | Eagerly, at segment load | Lazily, on first access |
| Eviction | Never — always local | Evicted under memory pressure |
| Memory cost | Fixed and predictable | Elastic, bounded by the cache size |
Pin the small, always-needed indexes; let the cache handle the rest. Pinning is resident memory, so reserve it for compact indexes you hit on most queries — bloom filters, dictionaries, and small range/inverted indexes. Everything larger or colder (JSON, text, wide columns) should go through the index cache, which fetches on demand and evicts what isn’t used.
"tierBackendProperties": {
"preload.enable": "true",
"preload.index.keys.override": "*.bloom_filter,*.dictionary,*.range_index",
"enable.prefetch.page.cache": "true"
}
Pin keys are <column>.<indexKey> (or * for all columns), where indexKey is the index’s file key — e.g. inverted_index, range_index, bloom_filter, or dictionary. Examples: payment_type.inverted_index, *.bloom_filter, *.dictionary.
Index consolidation. On a wide table, each column’s index is a separate memory-mapped file, and many segments × many columns can exhaust the OS memory-map limit (max_map_count) — surfacing as native-memory / mmap allocation failures and server restarts. Set preload.enable.index.consolidation=true to pack a segment’s index regions into a single file per segment, drastically cutting the mmap count. Onboarding enables this by default.
Where configs live
| Scope | Set in | Applies to |
|---|
Table config — ExternalTableSyncTask | task.taskTypeConfigsMap.ExternalTableSyncTask in the table config | Catalog connection, scheduling, snapshot processing. |
| Tier backend properties | tierConfigs[].tierBackendProperties (S3 tier) | Per-table caching and preload behavior. |
| Server / cluster config | Helix cluster config or server PinotConfiguration | Page cache sizing, pre-warm, segment cache. |
| Query option | SET key = 'value' per query | Per-query cache/prefetch behavior. |
Server and cluster keys resolve in the order cluster config (Helix) → JVM -D property → built-in default. Prefer cluster config so values can change without a restart.
Table config — ExternalTableSyncTask
Catalog & mode
| Key | Default | Description |
|---|
catalogType | (required) | s3 or iceberg-rest. Selects the catalog provider. |
executor | (minion) | Set to controller for the watcher path used throughout these docs. Onboarding sets controller; set it explicitly or you lose the watcher and all observability endpoints. |
enabled | true | Enable/disable the per-table sync. |
inputFormat | parquet | Source file format. |
Connection keys
The catalog connection keys are prefixed catalog.s3.* or catalog.iceberg-rest.* (REST URI, service type, warehouse, namespace/table, and auth.rest.* / auth.storage.* credentials). The full list with examples is in Onboarding via APIs.
Scheduling & task sizing
| Key | Default | Description |
|---|
schedule | — | Quartz cron for sync frequency, e.g. 0 */5 * * * ?. |
desiredSegmentSize | 500MB | Target size for generated segments. |
taskMaxNumFiles | 100 | Max source files per task run (minion mode). |
taskMaxDataSize | 1GB | Max source bytes per task run (minion mode). |
tableMaxNumTasks | 8 | Max concurrent tasks for this table (minion mode). |
maxParallelOps | (CPU cores) | Parallelism within a single run. |
continueOnFileError | false | Skip an unreadable file instead of failing the run. |
Snapshot processing (Iceberg)
| Key | Default | Description |
|---|
iceberg.snapshotProcessing.initialVersionSelector | LATEST | Where to start: LATEST, EARLIEST, or SNAPSHOT_ID. |
iceberg.snapshotProcessing.initialSnapshotId | — | Snapshot ID to start from (when selector is SNAPSHOT_ID). |
iceberg.snapshotProcessing.batchSize | 1000 | Max new files ingested per run. |
iceberg.snapshotProcessing.maxFilesScannedPerRun | 100000 | Upper bound on files enumerated per run (≥ batchSize). |
iceberg.snapshotProcessing.maxRetries | 3 | Consecutive failures before a snapshot is marked FAILED. |
Checkpointing & schema evolution
| Key | Default | Description |
|---|
enableCheckpointing | true | Persist the ingestion watermark between runs. |
checkpointTTLSec | 0 | Checkpoint TTL in seconds (0 = bootstrap mode). |
schemaEvolution.enabled | false | Refine the Pinot schema to the source schema before each sync. |
schemaEvolution.failFast | false | Abort (vs. log and continue) when a schema change is rejected. |
Tier backend properties (caching & preload)
Set on the S3 tier in tierConfigs[].tierBackendProperties. The onboarding/preview flow sets sensible defaults.
Common
| Key | Default | Description |
|---|
bucket / pathPrefix / region | (required) | S3 location of the tier storage. |
enable.prefetch.page.cache | false | Route this table’s index reads through the page cache. Set to true by onboarding. |
preload.enable | false | Eagerly download index files into mmap at segment load instead of on first query. |
preload.enable.index.consolidation | false¹ | Pack a segment’s index regions into one file to reduce mmap count on wide tables. |
enable.startree.index | — | Load StarTree index files for this tier. |
¹ Defaults to true for tables scaffolded through the onboarding API.
Preload tuning
| Key | Default | Description |
|---|
preload.dir | — | Directory for preloaded mmap files. |
preload.total.size | 100GB | Disk budget for the preload cache. |
preload.index.keys.override | — | Comma-separated index keys to preload, e.g. *.inverted_index,*.range_index. |
preload.load.existing.buffers | true | Reuse existing mmap files on restart. |
The S3 tier also exposes advanced on-demand buffer, mmap, and read-ahead sub-properties (ondemand.*, mmap.*, readAhead.*). These are internal tuning knobs — leave them at defaults unless StarTree support advises otherwise.
Server / cluster configs
Parquet page cache sizing
| Key | Default | Description |
|---|
pinot.parquet.page.cache.memory.data.size.mb | 256 | In-memory data page cache size. |
pinot.parquet.page.cache.memory.dict.size.mb | 64 | In-memory dictionary page cache size. |
pinot.parquet.page.cache.prefetch.size.mb | 1024 | Prefetch buffer size (decoded hot pages). |
pinot.parquet.page.cache.disk.storage.percent | 30 | Percent of the disk partition for the on-disk cache. 0 disables it. |
pinot.parquet.page.cache.disk.data.size.mb | (percent) | Absolute disk data cache size; overrides the percent. |
pinot.parquet.page.cache.disk.dict.size.mb | (percent) | Absolute disk dictionary cache size; overrides the percent. |
pinot.parquet.page.cache.disk.snapshot.enabled | false | Persist the on-disk cache across restarts (otherwise wiped on startup). |
Append .PARQUET_INDEX (data cache) or .SEGMENT_INDEX (index cache) to any disk.* key to size the two caches independently — e.g. pinot.parquet.page.cache.disk.storage.percent.SEGMENT_INDEX = 50.
Pre-warm at segment load
| Key | Default | Description |
|---|
pinot.parquet.prewarm.enabled | false | Pre-warm the first data page of each column when a segment loads. |
pinot.parquet.prewarm.max.inflight.per.table | 64 | Per-table cap on concurrent pre-warm reads. |
pinot.parquet.prewarm.max.bytes.inflight.per.table | 64MB | Per-table cap on concurrent pre-warm bytes. |
pinot.parquet.prewarm.max.inflight.per.table.override | "" | Per-table overrides: table1,N;table2,M. |
Page reader & prefetch
| Key | Default | Description |
|---|
pinot.parquet.pagereader.read.timeout.ms | 30000 | Timeout for a single page S3 GET. |
pinot.parquet.pagereader.max.in.flight.chunk.reads | 0 | Cap on concurrent chunk reads (0 = unbounded). |
pinot.parquet.pagereader.pool.size | max(4, 2×cores) | Thread-pool parallelism for async page reads. |
pinot.parquet.prefetch.deserialize | true | Decode prefetched pages to values (not just raw bytes). |
pinot.parquet.prefetch.segment.lookahead.enabled | false | Warm upcoming segments’ chunks ahead of time. |
pinot.parquet.prefetch.segment.lookahead.max.inflight.chunks | 12 | Server-wide cap on concurrent look-ahead chunk reads. |
Prefetch depth is bounded by the query option prefetch.projection.queue.size (default 10) — how many projection blocks are fetched ahead when the scan queue refills. The other prefetch-related knobs are the prefetch buffer size (...prefetch.size.mb), the look-ahead in-flight cap (...segment.lookahead.max.inflight.chunks), the in-flight chunk-read cap (...max.in.flight.chunk.reads), and the page-reader pool size (...pagereader.pool.size).
| Key | Default | Description |
|---|
pinot.server.instance.segment.cache.directory | (disabled if unset) | Directory for the per-segment cache (index headers + Parquet footer). |
pinot.server.instance.segment.cache.parquet.footer.enabled | true | Persist/reload the Parquet footer (parquet_footer.bin). |
pinot.server.instance.segment.cache.load.existing.cache | true | Reload the cache on server restart. |
Index reader
| Key | Default | Description |
|---|
pinot.server.index.inverted.enable.startree.reader | false | Use StarTree’s S3-optimized inverted index reader for eligible segments. |
Query OOM protection (large scans)
A query that scans a large amount of remote data can grow the server heap until the process OOMs. Server-side per-query memory accounting protects against this: when heap usage crosses a threshold, the most memory-hungry query is killed (or briefly paused first) instead of the whole server crashing. Recommended for tables that get heavy scans or aggregations.
| Key | Default | Description |
|---|
pinot.query.scheduler.accounting.enable.thread.memory.sampling | false | Track per-query memory. Required for memory-based killing to work. |
pinot.query.scheduler.accounting.oom.enable.killing.query | false | Enable the OOM killer — kill the biggest query when heap is critical, instead of crashing the server. |
pinot.query.scheduler.accounting.oom.critical.heap.usage.ratio | 0.96 | Heap ratio at which the heaviest query is killed. |
pinot.query.scheduler.accounting.oom.panic.heap.usage.ratio | 0.99 | Heap ratio for panic — kill even without per-query sampling. |
pinot.query.scheduler.accounting.oom.alarming.usage.ratio | 0.75 | Heap ratio at which usage is flagged (logging / back-pressure). |
pinot.query.scheduler.accounting.oom.panic.allow.pre.query.kill.pause | false | OOM pause: at panic level, pause briefly to let GC reclaim memory before killing — can save a query that would otherwise be killed. |
pinot.query.scheduler.accounting.oom.pre.query.kill.pause.duration.ms | -1 | How long the pre-kill pause waits (ms). -1 disables it. |
To protect a cluster running large External Table scans, set (full keys, all prefixed pinot.query.scheduler.accounting.): ...enable.thread.memory.sampling=true and ...oom.enable.killing.query=true. Add the pre-kill pause (...oom.panic.allow.pre.query.kill.pause=true with ...oom.pre.query.kill.pause.duration.ms=2000) so transient spikes recover via GC instead of killing the query.
Query options
Set per query with SET "key" = 'value'. Several of these are also tier or server configs — setting them as a query option overrides the config for that one query.
| Query option | Default | Also a config? | Effect |
|---|
enable.prefetch.page.cache | false | Tier + server | Route index reads through the page cache for this query (overrides read-ahead). |
prefetch.projection.queue.size | 10 | — | Projection blocks prefetched ahead on a scan queue refill (prefetch depth). |
readAhead.enable | false | Tier | Enable read-ahead buffering for index reads on this query. |
skip.remote.table.cache | false | Tier | Bypass the remote table cache and read straight from object storage. |
skip.columnIndex.cache | false | Tier | Bypass the column index cache for this query. |
snapshotVersionByTable | — | — | Time-travel: query specific Iceberg snapshots, as a JSON map {"table": snapshotId}. |
pinot.parquet.pagereader.max.in.flight.chunk.reads | 0 | Server | Per-query cap on concurrent chunk reads. |
pinot.parquet.pagereader.read.timeout.ms | 30000 | Cluster | Per-query page-read timeout. |
pinot.parquet.prefetch.segment.lookahead.enabled | false | Cluster | Enable segment look-ahead warming for this query. |
enable.tiered.storage.trace | false | — | Collect a tiered-storage trace for this query (diagnostics). |
Example — bypass the cache for one query:
SET "skip.remote.table.cache" = 'true';
SELECT count(*) FROM nyc_taxi_trips WHERE fare > 100;
Example — query a specific Iceberg snapshot (time travel):
SET "snapshotVersionByTable" = '{"nyc_taxi_trips_OFFLINE": 1234567890123456789}';
SELECT count(*) FROM nyc_taxi_trips;