Troubleshooting - StarTree Docs

This feature requires StarTree release 0.14.0 or later, and must be enabled on demand — contact StarTree support to activate it.

Common issues when onboarding and querying External Tables, grouped by symptom. Each entry lists the likely cause and the fix. If an issue isn’t covered here, reach out to StarTree support.

Quick lookup — find your error:

| Error message or symptom | Jump to |
| --- | --- |
| AWS region error / `s3://` in bucket field | Onboarding fails with a region error |
| `Access Denied` / HTTP 403 | Access Denied when creating the table |
| Preview succeeds but no data / wrong files | Preview can’t read the files |
| `NumberFormatException: For input string: "Binary{...}"` | Segment generation fails with NumberFormatException |
| `Cannot create inverted index on column ... without dictionary` | Cannot create inverted index without dictionary |
| `Failed to create FieldIndexConfigs` | Failed to create FieldIndexConfigs |
| `servers not responded` / query timeout | Query times out or returns servers not responded |
| First query slow, later queries fast | First query on a column is slow |
| `Native memory allocation (mmap) failed` / pod OOM restarts | Server OOM / pod restarts |
| `Could not acquire table level distributed lock` | Distributed lock error |
| Sync fails on one bad file | Sync run fails because of one unreadable file |
| Table created, `status` stays `IDLE` | Table created but no sync ever starts |
| `FAILED` status: what does `failurePhase` mean? | Sync not progressing |

Onboarding

Onboarding fails with a region error

Symptom: Validation or preview fails with an AWS region error, often after entering the bucket path.

Cause: The AWS region isn’t set, or the bucket field contains an `s3://...` URL or a trailing slash.

Fix:

- Enter only the bucket name in the bucket field, not `s3://bucket/...`.
- Remove any trailing `/` from the prefix.
- Set the AWS region in the connection config (`catalog.s3.region` / the tier region). When the region is set in the config, the `AWS_REGION` environment variable is only a fallback and isn’t required.
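The first two fixes amount to a small input cleanup. The sketch below shows one way to do it before pasting values into the form; `normalize_s3_location` is a hypothetical helper written for illustration, not part of any StarTree tooling.

```python
def normalize_s3_location(raw_bucket: str, raw_prefix: str = "") -> tuple[str, str]:
    """Return (bucket, prefix) with the URL scheme and trailing slashes removed."""
    bucket = raw_bucket.strip()
    prefix = raw_prefix.strip()
    # Drop a pasted URL scheme so only the bucket name remains.
    if bucket.startswith("s3://"):
        bucket = bucket[len("s3://"):]
    # If a path was pasted into the bucket field, move it in front of the prefix.
    if "/" in bucket:
        bucket, _, path = bucket.partition("/")
        parts = [p for p in (path.strip("/"), prefix.strip("/")) if p]
        prefix = "/".join(parts)
    # The prefix must not end with a slash.
    return bucket, prefix.rstrip("/")


print(normalize_s3_location("s3://my-bucket/events/2024/"))
```

Here `my-bucket` and `events/2024` are placeholder values; the call returns the bucket name and the prefix as two separate strings.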

`Access Denied` (HTTP 403) when creating the table

Symptom:

software.amazon.awssdk.services.s3.model.AccessDeniedException: Access Denied
(Service: S3, Status Code: 403, ...)

Cause: The cluster cannot read the source bucket. Credentials that work elsewhere are not necessarily the ones the cluster uses.

Fix:

- Confirm that whatever the table uses for access (an assumed IAM role with `roleArn`/`externalId`, the cluster’s node IAM role, or static access keys) has `s3:GetObject` and `s3:ListBucket` on the source bucket and prefix.
- Verify from inside the cluster (e.g. a debug pod):
```
aws s3 ls s3://<bucket>/<prefix>/
```
- If listing fails there, fix the bucket policy or role before retrying onboarding.
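For reference, the two permissions map to different resources: `s3:ListBucket` applies to the bucket, `s3:GetObject` to the objects under the prefix. The sketch below generates a minimal read-only policy document in standard IAM JSON. The helper name `read_only_policy` and the bucket and prefix values are illustrative assumptions, and your organization's policy may need additional statements or conditions.

```python
import json


def read_only_policy(bucket: str, prefix: str) -> dict:
    """Build a minimal IAM policy granting list + read on one bucket prefix."""
    prefix = prefix.strip("/")
    object_arn = f"arn:aws:s3:::{bucket}/{prefix}/*" if prefix else f"arn:aws:s3:::{bucket}/*"
    list_statement = {
        # Listing is granted on the bucket itself.
        "Effect": "Allow",
        "Action": "s3:ListBucket",
        "Resource": f"arn:aws:s3:::{bucket}",
    }
    if prefix:
        # Restrict listing to keys under the prefix.
        list_statement["Condition"] = {"StringLike": {"s3:prefix": [f"{prefix}/*"]}}
    get_statement = {
        # Object reads are granted on the keys under the prefix.
        "Effect": "Allow",
        "Action": "s3:GetObject",
        "Resource": object_arn,
    }
    return {"Version": "2012-10-17", "Statement": [list_statement, get_statement]}


print(json.dumps(read_only_policy("my-bucket", "events/2024"), indent=2))
```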

Preview can’t read the files / wrong path

Symptom: Validation succeeds but preview fails with an error, or preview looks for a default table.

Cause: The prefix points at the wrong level. For raw Parquet, it should point at the folder that directly contains the `.parquet` files; for Iceberg, at the table root.

Fix: Adjust the prefix to the correct level. Use `aws s3 ls` to confirm that the path you provide actually lists Parquet files (or an Iceberg `metadata/` + `data/` layout).
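The same check can be run against a list of object keys, for example keys copied from a bucket listing. This is a minimal sketch: `classify_prefix` and its return labels are hypothetical, and it only inspects key names, not file contents.

```python
def classify_prefix(keys: list[str], prefix: str) -> str:
    """Describe what sits under a prefix, given the object keys in the bucket."""
    cleaned = prefix.strip("/")
    base = (cleaned + "/") if cleaned else ""
    # Keys relative to the prefix, e.g. "2024/01/a.parquet".
    relative = [k[len(base):] for k in keys if k.startswith(base)]
    top_level_folders = {r.split("/", 1)[0] for r in relative if "/" in r}
    if {"metadata", "data"} <= top_level_folders:
        return "iceberg-table-root"
    if any("/" not in r and r.endswith(".parquet") for r in relative):
        return "parquet-folder"
    if any(r.endswith(".parquet") for r in relative):
        # Parquet files exist, but only in subfolders: the prefix is too high.
        return "parquet-in-subfolders"
    return "no-parquet-found"


keys = ["events/2024/01/a.parquet", "events/2024/01/b.parquet"]
print(classify_prefix(keys, "events/2024/01"))
print(classify_prefix(keys, "events"))
```

A result of `parquet-in-subfolders` or `no-parquet-found` suggests the prefix needs to move down or be corrected before retrying preview.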

Very large source (hundreds of thousands of files)

Symptom: Onboarding a path with hundreds of thousands or millions of files is slow or never produces segments.

Cause: Every file becomes work for the catalog scan and segment generation.

Fix: For an initial proof-of-concept, point at a smaller sub-prefix (for example, a single month’s partition). Scale the cluster for the full dataset. There is a per-table segment threshold; work with StarTree support for very large tables.

Schema & table creation

Segment generation fails with `NumberFormatException`

Symptom:

java.lang.NumberFormatException: For input string: "Binary{2 reused bytes [48 50]}"

Cause: A column’s data type was changed away from what preview inferred, for example a binary/string column was set to `INT`/`LONG`. The Parquet data no longer matches the declared Pinot type.

Fix: Use the data types the preview step produced. Don’t override inferred types in the schema. See Data Type Mapping.
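To find which columns were overridden, you can diff the schema that preview produced against the one you submitted. The sketch below assumes both are standard Pinot schema JSON (with `dimensionFieldSpecs`, `metricFieldSpecs`, and `dateTimeFieldSpecs`); the function names and sample columns are hypothetical.

```python
FIELD_SPEC_KEYS = ("dimensionFieldSpecs", "metricFieldSpecs", "dateTimeFieldSpecs")


def column_types(schema: dict) -> dict[str, str]:
    """Flatten a schema into {column name: data type}."""
    return {
        spec["name"]: spec["dataType"]
        for key in FIELD_SPEC_KEYS
        for spec in schema.get(key, [])
    }


def overridden_types(inferred: dict, submitted: dict) -> dict[str, tuple[str, str]]:
    """Return {column: (inferred type, submitted type)} for every changed column."""
    before, after = column_types(inferred), column_types(submitted)
    return {
        name: (before[name], after[name])
        for name in before
        if name in after and before[name] != after[name]
    }


inferred = {"dimensionFieldSpecs": [{"name": "code", "dataType": "STRING"}]}
submitted = {"metricFieldSpecs": [{"name": "code", "dataType": "INT"}]}
print(overridden_types(inferred, submitted))
```

Any column reported here is a candidate for the mismatch; reverting it to the inferred type is the fix described above.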

`Cannot create inverted index on column ... without dictionary`

Symptom: Table creation is rejected for an inverted (or FST/IFST) index on a RAW column.

Cause: Since release 2.164.0, dictionary-backed indexes require an explicit dictionary block; they no longer build an implicit dictionary.

Fix: Keep the forward index RAW and add a dictionary block alongside the index:

"indexes": {
  "forward":    { "encodingType": "RAW" },
  "dictionary": {},
  "inverted":   { "disabled": false }
}
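Before resubmitting, you can scan a table config for other columns with the same problem. This is a sketch under assumptions: it expects per-column `indexes` blocks inside `fieldConfigList`, and the key names `fst` and `ifst` are guesses modeled on `inverted`; adjust them to match your config.

```python
# Index keys that need a dictionary. "fst" and "ifst" are assumed key names.
DICTIONARY_BACKED = ("inverted", "fst", "ifst")


def columns_missing_dictionary(table_config: dict) -> list[str]:
    """List RAW columns that enable a dictionary-backed index without a dictionary block."""
    missing = []
    for field in table_config.get("fieldConfigList", []):
        indexes = field.get("indexes", {})
        is_raw = indexes.get("forward", {}).get("encodingType") == "RAW"
        needs_dictionary = any(
            name in indexes and not indexes[name].get("disabled", False)
            for name in DICTIONARY_BACKED
        )
        if is_raw and needs_dictionary and "dictionary" not in indexes:
            missing.append(field["name"])
    return missing


config = {
    "fieldConfigList": [
        {"name": "country", "indexes": {"forward": {"encodingType": "RAW"},
                                        "inverted": {"disabled": False}}},
        {"name": "city", "indexes": {"forward": {"encodingType": "RAW"},
                                     "dictionary": {},
                                     "inverted": {"disabled": False}}},
    ]
}
print(columns_missing_dictionary(config))
```

In this sample, only `country` lacks the `dictionary` block, so it is the only column reported.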

`Failed to create FieldIndexConfigs`

Symptom: Table creation fails with this generic message.

Cause: A malformed or conflicting index configuration, often a hand-edited combination of forward, dictionary, and an index entry.

Fix: Start from the table config the preview step generates and add only supported indexes. Don’t mix incompatible options on one column.

Inverted index on a multi-value column fails to build

Symptom: Segment build fails on a multi-value column that has an inverted index (e.g. `Cannot create inverted index for raw index column`, or a “raw inverted index not supported for multi-value columns” message).

Cause: The inverted index needs a dictionary, and a raw (no-dictionary) inverted index isn’t supported. Multi-value columns are especially likely to hit this.

Fix: Add a dictionary block to the column (as above). If the build still fails specifically on a multi-value column, remove the inverted index from it and reach out to support.

Queries

Query times out or returns `servers not responded`

Symptom:

427: N servers [...] not responded

or a group-by/aggregation query that never returns within the timeout.

Causes & fixes, in the order to check them:

- Missing or conflicting index config (most common). Aggregations and filters that scan remote data without the right index are slow. Add a supported index for your filter/group-by columns, and remove conflicting or leftover index configs.
- Caching not enabled. Turn on the page cache and preload so index data is local. See Best Practices and Configs (`enable.prefetch.page.cache`, `preload.enable`, `preload.index.keys.override`).
- Group-by on a derived/computed column, or a `$segmentName` filter. Both defeat pruning. Group by a real column and drop debugging filters.
- Under-provisioned servers. A single small server against a large dataset will be CPU-bound. Scale out.
- Raise the query timeout while debugging: `SET "timeoutMs" = '60000';`

First query on a column is slow, later queries are fast

Symptom: Cold query is slow; the same query is fast afterward.

Cause: The first read populates the cache from object storage.

Fix: Expected behavior. To pay this cost at load time instead of on the first query, enable pre-warm (`pinot.parquet.prewarm.enabled`) and `preload.enable`. See Data and Index Caching.

Tuning a large scan

For wide scans over big datasets, these query options help (see Best Practices and Configs):

SET "enable.prefetch.page.cache" = 'true';
SET "prefetch.projection.queue.size" = '10';
SET "readAhead.enable" = 'true';
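When testing these options from a script, it can help to prepend them to the query text programmatically. The sketch below only builds the statement string in the `SET "key" = 'value';` form shown above; `with_query_options` and the table name `events` are hypothetical, and how you submit the result depends on your client.

```python
def with_query_options(sql: str, options: dict[str, str]) -> str:
    """Prefix a query with one SET statement per option, in insertion order."""
    lines = [f"SET \"{key}\" = '{value}';" for key, value in options.items()]
    return "\n".join(lines + [sql])


scan_options = {
    "timeoutMs": "60000",
    "enable.prefetch.page.cache": "true",
    "prefetch.projection.queue.size": "10",
    "readAhead.enable": "true",
}
print(with_query_options("SELECT COUNT(*) FROM events", scan_options))
```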

Server OOM / pod restarts under query load

Symptom: A server runs out of native memory or is OOM-killed, often during a large scan (`Native memory allocation (mmap) failed`, or repeated pod restarts).

Causes & fixes:

- Too many memory maps. Wide tables create one mmap per column index, which can exhaust the OS `max_map_count`. Enable index consolidation (`preload.enable.index.consolidation=true`) to pack a segment’s indexes into one file.
- Cache / prefetch over-allocation. Cap the in-memory caches and prefetch buffer (`pinot.parquet.page.cache.memory.*`, `...prefetch.size.mb`) relative to server heap.
- A heavy query. Enable query OOM protection so one query is killed instead of the server.
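To check whether memory maps are the bottleneck, compare a process's mapping count with the kernel limit. On Linux, `/proc/<pid>/maps` has one line per mapping and `/proc/sys/vm/max_map_count` holds the limit. The sketch below is a diagnostic aid under those assumptions; the function names are hypothetical and it must run where the server process is visible (for example, inside the pod).

```python
from pathlib import Path


def count_mappings(maps_text: str) -> int:
    """Count mappings in the text of /proc/<pid>/maps (one per non-empty line)."""
    return sum(1 for line in maps_text.splitlines() if line.strip())


def map_usage(maps_text: str, max_map_count_text: str) -> float:
    """Return the fraction of the max_map_count limit currently in use."""
    return count_mappings(maps_text) / int(max_map_count_text.strip())


def server_map_usage(pid: int) -> float:
    """Read both /proc files for a live process (Linux only)."""
    maps_text = Path(f"/proc/{pid}/maps").read_text()
    limit_text = Path("/proc/sys/vm/max_map_count").read_text()
    return map_usage(maps_text, limit_text)
```

A usage value approaching 1.0 points at the first cause above, where index consolidation is the suggested fix.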

Sync & operations

`Could not acquire table level distributed lock ... ExternalTableSyncTask`

Symptom:

Could not acquire table level distributed lock for scheduled task type:
ExternalTableSyncTask, table: <name>_OFFLINE. Another controller is likely
generating tasks for this table. Please try again later.

Cause: Another controller is already running a sync for the table.

Fix: This is benign; retry later. The run in progress continues normally.

A sync run fails because of one unreadable file

Symptom: A sync run fails, and the failure traces back to a single problematic Parquet file.

Cause: By default a run fails if any file can’t be read, so the whole snapshot is rejected.

Fix: Set `continueOnFileError=true` in the `ExternalTableSyncTask` config to skip unreadable files and continue. The run still ends as `status=COMPLETED`; compare `filesDiscovered` vs `segmentsUploaded` in the status endpoint to spot skipped files, then investigate them separately.
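The comparison can be scripted against the status response. This sketch assumes the response is a JSON object carrying `status`, `filesDiscovered`, and `segmentsUploaded` as top-level fields, and that each readable file yields one uploaded segment; neither assumption is confirmed here, so treat the result as a hint rather than an exact count.

```python
def skipped_file_count(status: dict) -> int:
    """Estimate how many discovered files did not produce an uploaded segment."""
    if status.get("status") != "COMPLETED":
        raise ValueError("run has not completed; counts may still change")
    return max(status["filesDiscovered"] - status["segmentsUploaded"], 0)


# Sample payload with made-up numbers.
print(skipped_file_count({"status": "COMPLETED", "filesDiscovered": 120, "segmentsUploaded": 117}))
```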

Table created, but no sync ever starts

Symptom: The table exists, but `status` stays `IDLE` and the observability endpoints return nothing useful.

Cause: The table isn’t on the controller-watcher path. Usually `executor=controller` is missing from the `ExternalTableSyncTask` config, or the feature isn’t enabled on the cluster.

Fix: Confirm `executor: controller` is in the task config (the preview/onboarding flow sets it), and that the External Table feature is enabled on the cluster. Then trigger a run to start immediately: `POST /tasks/schedule?taskType=ExternalTableSyncTask&tableName=<name>_OFFLINE`.
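The trigger request can be built with the Python standard library. The sketch below only constructs the request; the controller URL `http://localhost:9000` and the table name `events` are placeholders, and any authentication headers your cluster requires are not shown.

```python
from urllib.parse import urlencode
from urllib.request import Request


def schedule_sync_request(controller_url: str, table_name: str) -> Request:
    """Build the POST request that schedules an ExternalTableSyncTask run."""
    query = urlencode({
        "taskType": "ExternalTableSyncTask",
        "tableName": f"{table_name}_OFFLINE",
    })
    url = f"{controller_url.rstrip('/')}/tasks/schedule?{query}"
    return Request(url, method="POST")


request = schedule_sync_request("http://localhost:9000", "events")
print(request.get_method(), request.full_url)
```

To send it, pass the request to `urllib.request.urlopen` from a host that can reach the controller.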

Sync not progressing

Use the status endpoint to diagnose:

- `IDLE` and never advancing: check `executor=controller`, a valid schedule cron, and that the feature is enabled (see above).
- `RUNNING` for a long time: large source or under-provisioned servers; reduce the prefix or scale out.
- `FAILED`: read `failurePhase` to find the stage that failed:
  - `FILE_LISTING` → credentials/path issue
  - `SEGMENT_GENERATION` → data-type mismatch (see NumberFormatException above; escalate if the config is valid)
  - `SEGMENT_COMPRESSION` → server resources
  - `SEGMENT_UPLOAD` → deep-store permissions
  - `CHECKPOINT_SAVE` → controller/ZooKeeper issue
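The phase-to-cause mapping above is easy to encode for use in a monitoring script. A minimal sketch, assuming the status response exposes `status` and `failurePhase` as top-level fields; the `diagnose` helper is hypothetical.

```python
FAILURE_PHASE_HINTS = {
    "FILE_LISTING": "credentials or path issue",
    "SEGMENT_GENERATION": "data-type mismatch; escalate if the config is valid",
    "SEGMENT_COMPRESSION": "server resources",
    "SEGMENT_UPLOAD": "deep-store permissions",
    "CHECKPOINT_SAVE": "controller/ZooKeeper issue",
}


def diagnose(status: dict) -> str:
    """Translate a sync status payload into the likely cause from the list above."""
    state = status.get("status")
    if state != "FAILED":
        return f"status is {state}; no failure phase to inspect"
    phase = status.get("failurePhase")
    return FAILURE_PHASE_HINTS.get(phase, f"unrecognized failurePhase {phase!r}; contact support")


print(diagnose({"status": "FAILED", "failurePhase": "SEGMENT_UPLOAD"}))
```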

Checking sync health

To see why ingestion isn’t progressing, use the observability endpoints: run status (and failurePhase on failure), the ingestion checkpoint, and the source file count.

When to escalate to engineering

Most issues are self-serviceable — auth/region/path, index and dictionary config, cron, and cache/OOM tuning are all covered above. Escalate to engineering when:

- `SEGMENT_GENERATION` or `CHECKPOINT_SAVE` keeps failing with a valid config.
- Servers still OOM after capping caches and enabling consolidation + OOM protection.
- `status=COMPLETED` but query results are wrong or missing.
- A specific Parquet type fails to read (e.g. a `getBytes`/decimal error), or you hit a very-large-table segment threshold.
