Best Practices and Configs

This feature requires StarTree release 0.14.0 or later, and must be enabled on demand — contact StarTree support to activate it.

Best practices for running External Tables, followed by the full reference for every configuration and query option: the table-level sync config, the S3 tier and caching properties, the server/cluster tuning knobs, and the per-query SET options.

Best practices

Onboarding

Enter the bucket name only (no s3://) and set the AWS region in the connection config — a region in config means you don’t depend on the AWS_REGION environment variable.
Grant the cluster access to the source bucket with an assumed IAM role (roleArn + optional externalId) or the cluster’s node role; verify with aws s3 ls s3://<bucket>/<prefix>/ before onboarding.
For very large sources, start with a smaller prefix subset and scale the cluster before onboarding the full dataset.
Set continueOnFileError=true if a sync run should skip an unreadable file instead of failing the whole run (default is to fail).

Schema

Keep the data types the preview step infers; overriding them can break segment generation.
Set a time column where the data has one — it enables time-based segment pruning and retention. Recommended but not mandatory; tables without a time column still work.

Indexes

Add indexes for the columns you filter and group by — without them, queries scan remote data.
For dictionary-backed indexes (inverted, FST, IFST), keep the forward index RAW and add an explicit dictionary block.
Avoid grouping by derived/computed columns — it defeats segment pruning.

Tiered storage — data & index caching

Enable the page cache (enable.prefetch.page.cache) and preload.enable so index data is served locally — see Caching and index pinning.
Pin small, hot indexes (bloom filter, dictionary) with preload.index.keys.override; let the index cache serve the larger/colder ones.
Turn on index consolidation to collapse a segment’s per-column index files into one mmap, avoiding the OS max_map_count limit on wide tables — see Caching and index pinning.
Enable pre-warm (pinot.parquet.prewarm.enabled) to pay the cache-fill cost at segment load instead of on the first query.

Large-scan queries (OOM protection)

For tables that get heavy scans or aggregations, enable the query OOM killer so a single runaway query is killed instead of crashing the server — see Query OOM protection.
Optionally enable the pre-kill pause so the server briefly pauses to let GC reclaim memory before killing — this can save a query that would otherwise be killed.

Operations

Let the controller watcher schedule syncs; monitor health with the observability endpoints.

Caching and index pinning

External Table queries read indexes from remote storage, so keeping index data local is what makes them fast. There are two ways to do that — pinning and the index cache — and they’re used together.

	Pinning (preload)	Index cache
Enabled by	`preload.enable=true` + `preload.index.keys.override`	`enable.prefetch.page.cache=true`
What it does	Downloads the listed index files into local mmap at segment load and keeps them resident	Serves index byte ranges through an on-demand, evictable (LRU) cache
When it loads	Eagerly, at segment load	Lazily, on first access
Eviction	Never — always local	Evicted under memory pressure
Memory cost	Fixed and predictable	Elastic, bounded by the cache size

Pin the small, always-needed indexes; let the cache handle the rest. Pinning is resident memory, so reserve it for compact indexes you hit on most queries — bloom filters, dictionaries, and small range/inverted indexes. Everything larger or colder (JSON, text, wide columns) should go through the index cache, which fetches on demand and evicts what isn’t used.

"tierBackendProperties": {
  "preload.enable": "true",
  "preload.index.keys.override": "*.bloom_filter,*.dictionary,*.range_index",
  "enable.prefetch.page.cache": "true"
}

Pin keys are <column>.<indexKey> (or * for all columns), where indexKey is the index’s file key — e.g. inverted_index, range_index, bloom_filter, or dictionary. Examples: payment_type.inverted_index, *.bloom_filter, *.dictionary. Index consolidation. On a wide table, each column’s index is a separate memory-mapped file, and many segments × many columns can exhaust the OS memory-map limit (max_map_count) — surfacing as native-memory / mmap allocation failures and server restarts. Set preload.enable.index.consolidation=true to pack a segment’s index regions into a single file per segment, drastically cutting the mmap count. Onboarding enables this by default.

Where configs live

Scope	Set in	Applies to
Table config — `ExternalTableSyncTask`	`task.taskTypeConfigsMap.ExternalTableSyncTask` in the table config	Catalog connection, scheduling, snapshot processing.
Tier backend properties	`tierConfigs[].tierBackendProperties` (S3 tier)	Per-table caching and preload behavior.
Server / cluster config	Helix cluster config or server `PinotConfiguration`	Page cache sizing, pre-warm, segment cache.
Query option	`SET key = 'value'` per query	Per-query cache/prefetch behavior.

Server and cluster keys resolve in the order cluster config (Helix) → JVM -D property → built-in default. Prefer cluster config so values can change without a restart.

Table config — `ExternalTableSyncTask`

Catalog & mode

Key	Default	Description
`catalogType`	(required)	`s3` or `iceberg-rest`. Selects the catalog provider.
`executor`	(minion)	Set to `controller` for the watcher path used throughout these docs. Onboarding sets `controller`; set it explicitly or you lose the watcher and all observability endpoints.
`enabled`	`true`	Enable/disable the per-table sync.
`inputFormat`	`parquet`	Source file format.

Connection keys

The catalog connection keys are prefixed catalog.s3.* or catalog.iceberg-rest.* (REST URI, service type, warehouse, namespace/table, and auth.rest.* / auth.storage.* credentials). The full list with examples is in Onboarding via APIs.

Scheduling & task sizing

Key	Default	Description
`schedule`	—	Quartz cron for sync frequency, e.g. `0 /5 * * ?`.
`desiredSegmentSize`	`500MB`	Target size for generated segments.
`taskMaxNumFiles`	`100`	Max source files per task run (minion mode).
`taskMaxDataSize`	`1GB`	Max source bytes per task run (minion mode).
`tableMaxNumTasks`	`8`	Max concurrent tasks for this table (minion mode).
`maxParallelOps`	(CPU cores)	Parallelism within a single run.
`continueOnFileError`	`false`	Skip an unreadable file instead of failing the run.

Snapshot processing (Iceberg)

Key	Default	Description
`iceberg.snapshotProcessing.initialVersionSelector`	`LATEST`	Where to start: `LATEST`, `EARLIEST`, or `SNAPSHOT_ID`.
`iceberg.snapshotProcessing.initialSnapshotId`	—	Snapshot ID to start from (when selector is `SNAPSHOT_ID`).
`iceberg.snapshotProcessing.batchSize`	`1000`	Max new files ingested per run.
`iceberg.snapshotProcessing.maxFilesScannedPerRun`	`100000`	Upper bound on files enumerated per run (≥ `batchSize`).
`iceberg.snapshotProcessing.maxRetries`	`3`	Consecutive failures before a snapshot is marked `FAILED`.

Checkpointing & schema evolution

Key	Default	Description
`enableCheckpointing`	`true`	Persist the ingestion watermark between runs.
`checkpointTTLSec`	`0`	Checkpoint TTL in seconds (`0` = bootstrap mode).
`schemaEvolution.enabled`	`false`	Refine the Pinot schema to the source schema before each sync.
`schemaEvolution.failFast`	`false`	Abort (vs. log and continue) when a schema change is rejected.

Tier backend properties (caching & preload)

Set on the S3 tier in tierConfigs[].tierBackendProperties. The onboarding/preview flow sets sensible defaults.

Common

Key	Default	Description
`bucket` / `pathPrefix` / `region`	(required)	S3 location of the tier storage.
`enable.prefetch.page.cache`	`false`	Route this table’s index reads through the page cache. Set to `true` by onboarding.
`preload.enable`	`false`	Eagerly download index files into mmap at segment load instead of on first query.
`preload.enable.index.consolidation`	`false`¹	Pack a segment’s index regions into one file to reduce mmap count on wide tables.
`enable.startree.index`	—	Load StarTree index files for this tier.

¹ Defaults to true for tables scaffolded through the onboarding API.

Preload tuning

Key	Default	Description
`preload.dir`	—	Directory for preloaded mmap files.
`preload.total.size`	`100GB`	Disk budget for the preload cache.
`preload.index.keys.override`	—	Comma-separated index keys to preload, e.g. `.inverted_index,.range_index`.
`preload.load.existing.buffers`	`true`	Reuse existing mmap files on restart.

The S3 tier also exposes advanced on-demand buffer, mmap, and read-ahead sub-properties (ondemand.*, mmap.*, readAhead.*). These are internal tuning knobs — leave them at defaults unless StarTree support advises otherwise.

Server / cluster configs

Parquet page cache sizing

Key	Default	Description
`pinot.parquet.page.cache.memory.data.size.mb`	`256`	In-memory data page cache size.
`pinot.parquet.page.cache.memory.dict.size.mb`	`64`	In-memory dictionary page cache size.
`pinot.parquet.page.cache.prefetch.size.mb`	`1024`	Prefetch buffer size (decoded hot pages).
`pinot.parquet.page.cache.disk.storage.percent`	`30`	Percent of the disk partition for the on-disk cache. `0` disables it.
`pinot.parquet.page.cache.disk.data.size.mb`	(percent)	Absolute disk data cache size; overrides the percent.
`pinot.parquet.page.cache.disk.dict.size.mb`	(percent)	Absolute disk dictionary cache size; overrides the percent.
`pinot.parquet.page.cache.disk.snapshot.enabled`	`false`	Persist the on-disk cache across restarts (otherwise wiped on startup).

Append .PARQUET_INDEX (data cache) or .SEGMENT_INDEX (index cache) to any disk.* key to size the two caches independently — e.g. pinot.parquet.page.cache.disk.storage.percent.SEGMENT_INDEX = 50.

Pre-warm at segment load

Key	Default	Description
`pinot.parquet.prewarm.enabled`	`false`	Pre-warm the first data page of each column when a segment loads.
`pinot.parquet.prewarm.max.inflight.per.table`	`64`	Per-table cap on concurrent pre-warm reads.
`pinot.parquet.prewarm.max.bytes.inflight.per.table`	`64MB`	Per-table cap on concurrent pre-warm bytes.
`pinot.parquet.prewarm.max.inflight.per.table.override`	`""`	Per-table overrides: `table1,N;table2,M`.

Page reader & prefetch

Key	Default	Description
`pinot.parquet.pagereader.read.timeout.ms`	`30000`	Timeout for a single page S3 GET.
`pinot.parquet.pagereader.max.in.flight.chunk.reads`	`0`	Cap on concurrent chunk reads (`0` = unbounded).
`pinot.parquet.pagereader.pool.size`	`max(4, 2×cores)`	Thread-pool parallelism for async page reads.
`pinot.parquet.prefetch.deserialize`	`true`	Decode prefetched pages to values (not just raw bytes).
`pinot.parquet.prefetch.segment.lookahead.enabled`	`false`	Warm upcoming segments’ chunks ahead of time.
`pinot.parquet.prefetch.segment.lookahead.max.inflight.chunks`	`12`	Server-wide cap on concurrent look-ahead chunk reads.

Prefetch depth is bounded by the query option prefetch.projection.queue.size (default 10) — how many projection blocks are fetched ahead when the scan queue refills. The other prefetch-related knobs are the prefetch buffer size (...prefetch.size.mb), the look-ahead in-flight cap (...segment.lookahead.max.inflight.chunks), the in-flight chunk-read cap (...max.in.flight.chunk.reads), and the page-reader pool size (...pagereader.pool.size).

Key	Default	Description
`pinot.server.instance.segment.cache.directory`	(disabled if unset)	Directory for the per-segment cache (index headers + Parquet footer).
`pinot.server.instance.segment.cache.parquet.footer.enabled`	`true`	Persist/reload the Parquet footer (`parquet_footer.bin`).
`pinot.server.instance.segment.cache.load.existing.cache`	`true`	Reload the cache on server restart.

Index reader

Key	Default	Description
`pinot.server.index.inverted.enable.startree.reader`	`false`	Use StarTree’s S3-optimized inverted index reader for eligible segments.

Query OOM protection (large scans)

A query that scans a large amount of remote data can grow the server heap until the process OOMs. Server-side per-query memory accounting protects against this: when heap usage crosses a threshold, the most memory-hungry query is killed (or briefly paused first) instead of the whole server crashing. Recommended for tables that get heavy scans or aggregations.

Key	Default	Description
`pinot.query.scheduler.accounting.enable.thread.memory.sampling`	`false`	Track per-query memory. Required for memory-based killing to work.
`pinot.query.scheduler.accounting.oom.enable.killing.query`	`false`	Enable the OOM killer — kill the biggest query when heap is critical, instead of crashing the server.
`pinot.query.scheduler.accounting.oom.critical.heap.usage.ratio`	`0.96`	Heap ratio at which the heaviest query is killed.
`pinot.query.scheduler.accounting.oom.panic.heap.usage.ratio`	`0.99`	Heap ratio for panic — kill even without per-query sampling.
`pinot.query.scheduler.accounting.oom.alarming.usage.ratio`	`0.75`	Heap ratio at which usage is flagged (logging / back-pressure).
`pinot.query.scheduler.accounting.oom.panic.allow.pre.query.kill.pause`	`false`	OOM pause: at panic level, pause briefly to let GC reclaim memory before killing — can save a query that would otherwise be killed.
`pinot.query.scheduler.accounting.oom.pre.query.kill.pause.duration.ms`	`-1`	How long the pre-kill pause waits (ms). `-1` disables it.

To protect a cluster running large External Table scans, set (full keys, all prefixed pinot.query.scheduler.accounting.): ...enable.thread.memory.sampling=true and ...oom.enable.killing.query=true. Add the pre-kill pause (...oom.panic.allow.pre.query.kill.pause=true with ...oom.pre.query.kill.pause.duration.ms=2000) so transient spikes recover via GC instead of killing the query.

Query options

Set per query with SET "key" = 'value'. Several of these are also tier or server configs — setting them as a query option overrides the config for that one query.

Query option	Default	Also a config?	Effect
`enable.prefetch.page.cache`	`false`	Tier + server	Route index reads through the page cache for this query (overrides read-ahead).
`prefetch.projection.queue.size`	`10`	—	Projection blocks prefetched ahead on a scan queue refill (prefetch depth).
`readAhead.enable`	`false`	Tier	Enable read-ahead buffering for index reads on this query.
`skip.remote.table.cache`	`false`	Tier	Bypass the remote table cache and read straight from object storage.
`skip.columnIndex.cache`	`false`	Tier	Bypass the column index cache for this query.
`snapshotVersionByTable`	—	—	Time-travel: query specific Iceberg snapshots, as a JSON map `{"table": snapshotId}`.
`pinot.parquet.pagereader.max.in.flight.chunk.reads`	`0`	Server	Per-query cap on concurrent chunk reads.
`pinot.parquet.pagereader.read.timeout.ms`	`30000`	Cluster	Per-query page-read timeout.
`pinot.parquet.prefetch.segment.lookahead.enabled`	`false`	Cluster	Enable segment look-ahead warming for this query.
`enable.tiered.storage.trace`	`false`	—	Collect a tiered-storage trace for this query (diagnostics).

Example — bypass the cache for one query:

SET "skip.remote.table.cache" = 'true';
SELECT count(*) FROM nyc_taxi_trips WHERE fare > 100;

Example — query a specific Iceberg snapshot (time travel):

SET "snapshotVersionByTable" = '{"nyc_taxi_trips_OFFLINE": 1234567890123456789}';
SELECT count(*) FROM nyc_taxi_trips;

​Best practices

​Caching and index pinning

​Where configs live

​Table config — ExternalTableSyncTask

​Catalog & mode

​Connection keys

​Scheduling & task sizing

​Snapshot processing (Iceberg)

​Checkpointing & schema evolution

​Tier backend properties (caching & preload)

​Common

​Preload tuning

​Server / cluster configs

​Parquet page cache sizing

​Pre-warm at segment load

​Page reader & prefetch

​Segment data cache (footer & index headers)

​Index reader

​Query OOM protection (large scans)

​Query options

Best practices

Caching and index pinning

Where configs live

Table config — `ExternalTableSyncTask`

Catalog & mode

Connection keys

Scheduling & task sizing

Snapshot processing (Iceberg)

Checkpointing & schema evolution

Tier backend properties (caching & preload)

Common

Preload tuning

Server / cluster configs

Parquet page cache sizing

Pre-warm at segment load

Page reader & prefetch

Segment data cache (footer & index headers)

Index reader

Query OOM protection (large scans)

Query options