Skip to main content
This feature requires StarTree release 0.14.0 or later, and must be enabled on demand — contact StarTree support to activate it.
Best practices for running External Tables, followed by the full reference for every configuration and query option: the table-level sync config, the S3 tier and caching properties, the server/cluster tuning knobs, and the per-query SET options.

Best practices

Onboarding
  • Enter the bucket name only (no s3://) and set the AWS region in the connection config — a region in config means you don’t depend on the AWS_REGION environment variable.
  • Grant the cluster access to the source bucket with an assumed IAM role (roleArn + optional externalId) or the cluster’s node role; verify with aws s3 ls s3://<bucket>/<prefix>/ before onboarding.
  • For very large sources, start with a smaller prefix subset and scale the cluster before onboarding the full dataset.
  • Set continueOnFileError=true if a sync run should skip an unreadable file instead of failing the whole run (default is to fail).
Schema
  • Keep the data types the preview step infers; overriding them can break segment generation.
  • Set a time column where the data has one — it enables time-based segment pruning and retention. Recommended but not mandatory; tables without a time column still work.
Indexes
  • Add indexes for the columns you filter and group by — without them, queries scan remote data.
  • For dictionary-backed indexes (inverted, FST, IFST), keep the forward index RAW and add an explicit dictionary block.
  • Avoid grouping by derived/computed columns — it defeats segment pruning.
Tiered storage — data & index caching
  • Enable the page cache (enable.prefetch.page.cache) and preload.enable so index data is served locally — see Caching and index pinning.
  • Pin small, hot indexes (bloom filter, dictionary) with preload.index.keys.override; let the index cache serve the larger/colder ones.
  • Turn on index consolidation to collapse a segment’s per-column index files into one mmap, avoiding the OS max_map_count limit on wide tables — see Caching and index pinning.
  • Enable pre-warm (pinot.parquet.prewarm.enabled) to pay the cache-fill cost at segment load instead of on the first query.
Large-scan queries (OOM protection)
  • For tables that get heavy scans or aggregations, enable the query OOM killer so a single runaway query is killed instead of crashing the server — see Query OOM protection.
  • Optionally enable the pre-kill pause so the server briefly pauses to let GC reclaim memory before killing — this can save a query that would otherwise be killed.
Operations

Caching and index pinning

External Table queries read indexes from remote storage, so keeping index data local is what makes them fast. There are two ways to do that — pinning and the index cache — and they’re used together.
Pinning (preload)Index cache
Enabled bypreload.enable=true + preload.index.keys.overrideenable.prefetch.page.cache=true
What it doesDownloads the listed index files into local mmap at segment load and keeps them residentServes index byte ranges through an on-demand, evictable (LRU) cache
When it loadsEagerly, at segment loadLazily, on first access
EvictionNever — always localEvicted under memory pressure
Memory costFixed and predictableElastic, bounded by the cache size
Pin the small, always-needed indexes; let the cache handle the rest. Pinning is resident memory, so reserve it for compact indexes you hit on most queries — bloom filters, dictionaries, and small range/inverted indexes. Everything larger or colder (JSON, text, wide columns) should go through the index cache, which fetches on demand and evicts what isn’t used.
"tierBackendProperties": {
  "preload.enable": "true",
  "preload.index.keys.override": "*.bloom_filter,*.dictionary,*.range_index",
  "enable.prefetch.page.cache": "true"
}
Pin keys are <column>.<indexKey> (or * for all columns), where indexKey is the index’s file key — e.g. inverted_index, range_index, bloom_filter, or dictionary. Examples: payment_type.inverted_index, *.bloom_filter, *.dictionary. Index consolidation. On a wide table, each column’s index is a separate memory-mapped file, and many segments × many columns can exhaust the OS memory-map limit (max_map_count) — surfacing as native-memory / mmap allocation failures and server restarts. Set preload.enable.index.consolidation=true to pack a segment’s index regions into a single file per segment, drastically cutting the mmap count. Onboarding enables this by default.

Where configs live

ScopeSet inApplies to
Table config — ExternalTableSyncTasktask.taskTypeConfigsMap.ExternalTableSyncTask in the table configCatalog connection, scheduling, snapshot processing.
Tier backend propertiestierConfigs[].tierBackendProperties (S3 tier)Per-table caching and preload behavior.
Server / cluster configHelix cluster config or server PinotConfigurationPage cache sizing, pre-warm, segment cache.
Query optionSET key = 'value' per queryPer-query cache/prefetch behavior.
Server and cluster keys resolve in the order cluster config (Helix) → JVM -D property → built-in default. Prefer cluster config so values can change without a restart.

Table config — ExternalTableSyncTask

Catalog & mode

KeyDefaultDescription
catalogType(required)s3 or iceberg-rest. Selects the catalog provider.
executor(minion)Set to controller for the watcher path used throughout these docs. Onboarding sets controller; set it explicitly or you lose the watcher and all observability endpoints.
enabledtrueEnable/disable the per-table sync.
inputFormatparquetSource file format.

Connection keys

The catalog connection keys are prefixed catalog.s3.* or catalog.iceberg-rest.* (REST URI, service type, warehouse, namespace/table, and auth.rest.* / auth.storage.* credentials). The full list with examples is in Onboarding via APIs.

Scheduling & task sizing

KeyDefaultDescription
scheduleQuartz cron for sync frequency, e.g. 0 */5 * * * ?.
desiredSegmentSize500MBTarget size for generated segments.
taskMaxNumFiles100Max source files per task run (minion mode).
taskMaxDataSize1GBMax source bytes per task run (minion mode).
tableMaxNumTasks8Max concurrent tasks for this table (minion mode).
maxParallelOps(CPU cores)Parallelism within a single run.
continueOnFileErrorfalseSkip an unreadable file instead of failing the run.

Snapshot processing (Iceberg)

KeyDefaultDescription
iceberg.snapshotProcessing.initialVersionSelectorLATESTWhere to start: LATEST, EARLIEST, or SNAPSHOT_ID.
iceberg.snapshotProcessing.initialSnapshotIdSnapshot ID to start from (when selector is SNAPSHOT_ID).
iceberg.snapshotProcessing.batchSize1000Max new files ingested per run.
iceberg.snapshotProcessing.maxFilesScannedPerRun100000Upper bound on files enumerated per run (≥ batchSize).
iceberg.snapshotProcessing.maxRetries3Consecutive failures before a snapshot is marked FAILED.

Checkpointing & schema evolution

KeyDefaultDescription
enableCheckpointingtruePersist the ingestion watermark between runs.
checkpointTTLSec0Checkpoint TTL in seconds (0 = bootstrap mode).
schemaEvolution.enabledfalseRefine the Pinot schema to the source schema before each sync.
schemaEvolution.failFastfalseAbort (vs. log and continue) when a schema change is rejected.

Tier backend properties (caching & preload)

Set on the S3 tier in tierConfigs[].tierBackendProperties. The onboarding/preview flow sets sensible defaults.

Common

KeyDefaultDescription
bucket / pathPrefix / region(required)S3 location of the tier storage.
enable.prefetch.page.cachefalseRoute this table’s index reads through the page cache. Set to true by onboarding.
preload.enablefalseEagerly download index files into mmap at segment load instead of on first query.
preload.enable.index.consolidationfalse¹Pack a segment’s index regions into one file to reduce mmap count on wide tables.
enable.startree.indexLoad StarTree index files for this tier.
¹ Defaults to true for tables scaffolded through the onboarding API.

Preload tuning

KeyDefaultDescription
preload.dirDirectory for preloaded mmap files.
preload.total.size100GBDisk budget for the preload cache.
preload.index.keys.overrideComma-separated index keys to preload, e.g. *.inverted_index,*.range_index.
preload.load.existing.bufferstrueReuse existing mmap files on restart.
The S3 tier also exposes advanced on-demand buffer, mmap, and read-ahead sub-properties (ondemand.*, mmap.*, readAhead.*). These are internal tuning knobs — leave them at defaults unless StarTree support advises otherwise.

Server / cluster configs

Parquet page cache sizing

KeyDefaultDescription
pinot.parquet.page.cache.memory.data.size.mb256In-memory data page cache size.
pinot.parquet.page.cache.memory.dict.size.mb64In-memory dictionary page cache size.
pinot.parquet.page.cache.prefetch.size.mb1024Prefetch buffer size (decoded hot pages).
pinot.parquet.page.cache.disk.storage.percent30Percent of the disk partition for the on-disk cache. 0 disables it.
pinot.parquet.page.cache.disk.data.size.mb(percent)Absolute disk data cache size; overrides the percent.
pinot.parquet.page.cache.disk.dict.size.mb(percent)Absolute disk dictionary cache size; overrides the percent.
pinot.parquet.page.cache.disk.snapshot.enabledfalsePersist the on-disk cache across restarts (otherwise wiped on startup).
Append .PARQUET_INDEX (data cache) or .SEGMENT_INDEX (index cache) to any disk.* key to size the two caches independently — e.g. pinot.parquet.page.cache.disk.storage.percent.SEGMENT_INDEX = 50.

Pre-warm at segment load

KeyDefaultDescription
pinot.parquet.prewarm.enabledfalsePre-warm the first data page of each column when a segment loads.
pinot.parquet.prewarm.max.inflight.per.table64Per-table cap on concurrent pre-warm reads.
pinot.parquet.prewarm.max.bytes.inflight.per.table64MBPer-table cap on concurrent pre-warm bytes.
pinot.parquet.prewarm.max.inflight.per.table.override""Per-table overrides: table1,N;table2,M.

Page reader & prefetch

KeyDefaultDescription
pinot.parquet.pagereader.read.timeout.ms30000Timeout for a single page S3 GET.
pinot.parquet.pagereader.max.in.flight.chunk.reads0Cap on concurrent chunk reads (0 = unbounded).
pinot.parquet.pagereader.pool.sizemax(4, 2×cores)Thread-pool parallelism for async page reads.
pinot.parquet.prefetch.deserializetrueDecode prefetched pages to values (not just raw bytes).
pinot.parquet.prefetch.segment.lookahead.enabledfalseWarm upcoming segments’ chunks ahead of time.
pinot.parquet.prefetch.segment.lookahead.max.inflight.chunks12Server-wide cap on concurrent look-ahead chunk reads.
Prefetch depth is bounded by the query option prefetch.projection.queue.size (default 10) — how many projection blocks are fetched ahead when the scan queue refills. The other prefetch-related knobs are the prefetch buffer size (...prefetch.size.mb), the look-ahead in-flight cap (...segment.lookahead.max.inflight.chunks), the in-flight chunk-read cap (...max.in.flight.chunk.reads), and the page-reader pool size (...pagereader.pool.size).
KeyDefaultDescription
pinot.server.instance.segment.cache.directory(disabled if unset)Directory for the per-segment cache (index headers + Parquet footer).
pinot.server.instance.segment.cache.parquet.footer.enabledtruePersist/reload the Parquet footer (parquet_footer.bin).
pinot.server.instance.segment.cache.load.existing.cachetrueReload the cache on server restart.

Index reader

KeyDefaultDescription
pinot.server.index.inverted.enable.startree.readerfalseUse StarTree’s S3-optimized inverted index reader for eligible segments.

Query OOM protection (large scans)

A query that scans a large amount of remote data can grow the server heap until the process OOMs. Server-side per-query memory accounting protects against this: when heap usage crosses a threshold, the most memory-hungry query is killed (or briefly paused first) instead of the whole server crashing. Recommended for tables that get heavy scans or aggregations.
KeyDefaultDescription
pinot.query.scheduler.accounting.enable.thread.memory.samplingfalseTrack per-query memory. Required for memory-based killing to work.
pinot.query.scheduler.accounting.oom.enable.killing.queryfalseEnable the OOM killer — kill the biggest query when heap is critical, instead of crashing the server.
pinot.query.scheduler.accounting.oom.critical.heap.usage.ratio0.96Heap ratio at which the heaviest query is killed.
pinot.query.scheduler.accounting.oom.panic.heap.usage.ratio0.99Heap ratio for panic — kill even without per-query sampling.
pinot.query.scheduler.accounting.oom.alarming.usage.ratio0.75Heap ratio at which usage is flagged (logging / back-pressure).
pinot.query.scheduler.accounting.oom.panic.allow.pre.query.kill.pausefalseOOM pause: at panic level, pause briefly to let GC reclaim memory before killing — can save a query that would otherwise be killed.
pinot.query.scheduler.accounting.oom.pre.query.kill.pause.duration.ms-1How long the pre-kill pause waits (ms). -1 disables it.
To protect a cluster running large External Table scans, set (full keys, all prefixed pinot.query.scheduler.accounting.): ...enable.thread.memory.sampling=true and ...oom.enable.killing.query=true. Add the pre-kill pause (...oom.panic.allow.pre.query.kill.pause=true with ...oom.pre.query.kill.pause.duration.ms=2000) so transient spikes recover via GC instead of killing the query.

Query options

Set per query with SET "key" = 'value'. Several of these are also tier or server configs — setting them as a query option overrides the config for that one query.
Query optionDefaultAlso a config?Effect
enable.prefetch.page.cachefalseTier + serverRoute index reads through the page cache for this query (overrides read-ahead).
prefetch.projection.queue.size10Projection blocks prefetched ahead on a scan queue refill (prefetch depth).
readAhead.enablefalseTierEnable read-ahead buffering for index reads on this query.
skip.remote.table.cachefalseTierBypass the remote table cache and read straight from object storage.
skip.columnIndex.cachefalseTierBypass the column index cache for this query.
snapshotVersionByTableTime-travel: query specific Iceberg snapshots, as a JSON map {"table": snapshotId}.
pinot.parquet.pagereader.max.in.flight.chunk.reads0ServerPer-query cap on concurrent chunk reads.
pinot.parquet.pagereader.read.timeout.ms30000ClusterPer-query page-read timeout.
pinot.parquet.prefetch.segment.lookahead.enabledfalseClusterEnable segment look-ahead warming for this query.
enable.tiered.storage.tracefalseCollect a tiered-storage trace for this query (diagnostics).
Example — bypass the cache for one query:
SET "skip.remote.table.cache" = 'true';
SELECT count(*) FROM nyc_taxi_trips WHERE fare > 100;
Example — query a specific Iceberg snapshot (time travel):
SET "snapshotVersionByTable" = '{"nyc_taxi_trips_OFFLINE": 1234567890123456789}';
SELECT count(*) FROM nyc_taxi_trips;