This feature requires StarTree release 0.14.0 or later, and must be enabled on demand — contact StarTree support to activate it.
This page shows how to onboard an S3 Data Lake External Table with the StarTree controller REST APIs instead of the Data Portal UI. Use it for automation, infrastructure-as-code, or onboarding many tables at once.
How it works
Onboarding is four calls. Each one feeds the next:
1. Validate & browse: POST /connections/browse → validate the connection and list files under the S3 prefix
2. Preview: POST /tables/preview → infer the schema and return an enriched table config
3. Create schema: POST /schemas → register the schema
4. Create table: POST /tables → register the External Table
The preview call (step 2) is the core step: it samples the source, infers a Pinot schema, and returns a ready-to-use table config. Steps 3 and 4 just persist its output.
Once the table exists, the controller’s watcher runs the first sync and re-syncs on schedule — no manual trigger. Track progress with the observability endpoints.
All paths are relative to your controller base URL. The examples assume export CONTROLLER=https://<your-controller>. On StarTree Cloud, the controller is reached through the data-plane proxy, so use export CONTROLLER=https://<data-plane-host>/api/pinot.

If your controller requires authentication, add an Authorization header to every request, e.g. -H "Authorization: Bearer <token>". The examples below omit it for brevity.
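To avoid repeating the base URL and headers on every call, these conventions can be wrapped in a small shell function. This is a minimal sketch, not part of any StarTree tooling: the api function name and the TOKEN variable are assumptions made for illustration, and CONTROLLER is the variable exported above.

```shell
# Hypothetical helper (not a StarTree tool): api METHOD PATH [extra curl args...]
# Assumes CONTROLLER holds the controller base URL.
# TOKEN is an assumed, optional variable holding a bearer token.
api() {
  method="$1"
  path="$2"
  shift 2
  if [ -n "${TOKEN:-}" ]; then
    # Token present: send it as a bearer Authorization header.
    curl -sf -X "$method" "${CONTROLLER}${path}" \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer ${TOKEN}" "$@"
  else
    # No token: omit the Authorization header entirely.
    curl -sf -X "$method" "${CONTROLLER}${path}" \
      -H "Content-Type: application/json" "$@"
  fi
}
```

With this defined, a call such as api POST /schemas -d @schema.json sends the same request as the longer curl form used on this page.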
Prerequisites
- StarTree 0.14.0 or later with the External Table Beta feature enabled, and tiered storage configured on the cluster. Contact StarTree support if unsure.
- Network access to the controller REST endpoint, plus an Authorization token if your cluster requires auth.
- AWS credentials (an IAM role, or an access key + secret key; see Authentication below) with s3:GetObject and s3:ListBucket on the source bucket and prefix.
- The S3 bucket name, key prefix, and region.
Authentication
For S3 access, choose one credential method:
| Method | Keys | Use when |
|---|---|---|
| Assumed IAM role | roleArn (+ externalId) | Recommended for production — no static secrets. |
| Cluster node role | (none — uses the instance role) | The cluster’s node role already has access to the bucket. |
| Static access keys | accessKey / secretKey | Quick tests, or when role-based access isn’t available. |
Verify access from the cluster before onboarding:
```shell
aws s3 ls s3://<bucket>/<prefix>/
```
Step 1: Validate and browse the connection
POST /connections/browse
There is no separate “validate” endpoint. Browsing the catalog is the validation step: a 200 with an items list (even an empty one) confirms your credentials and connectivity. Bad credentials or an unreachable prefix return an error.
- Set path to "" to browse the root prefix; this both validates the connection and lists files and directories.
Request
| Field | Type | Required | Description |
|---|---|---|---|
| connection.type | string | Yes | Use CATALOG for all External Table sources. |
| connection.params | object | Yes | S3 connection settings (see example below). |
| path | string | No | Where to browse. "" or omitted = root prefix. |
```shell
curl -X POST "$CONTROLLER/connections/browse" \
  -H "Content-Type: application/json" \
  -d '{
    "connection": {
      "type": "CATALOG",
      "params": {
        "catalogType": "s3",
        "catalog.s3.bucketName": "<bucket>",
        "catalog.s3.prefix": "<s3-prefix>",
        "catalog.s3.region": "<region>",
        "catalog.s3.accessKey": "<key>",
        "catalog.s3.secretKey": "<secret>"
      }
    },
    "path": ""
  }'
```
Response
```json
{
  "items": [
    { "name": "2024/01/", "type": "DIR" },
    { "name": "data.parquet", "type": "FILE" }
  ]
}
```
| items[].type | Meaning |
|---|---|
| DIR | A directory you can drill into; pass its name as path. |
| FILE | A Parquet file under the prefix. |
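To drill into a directory, the DIR names from one response become the path of the next request. The sketch below assumes jq is installed and that BUCKET, PREFIX, REGION, ACCESS_KEY, and SECRET_KEY are shell variables you have set; the function names list_dirs and browse_body are hypothetical. Building the body with jq --arg, rather than interpolating variables into a JSON string, keeps values that contain quotes or backslashes correctly escaped.

```shell
# list_dirs: reads a /connections/browse response on stdin,
# prints the name of every DIR entry, one per line.
list_dirs() {
  jq -r '.items[] | select(.type == "DIR") | .name'
}

# browse_body PATH: prints a /connections/browse request body for PATH.
# Connection settings come from the assumed shell variables listed above.
browse_body() {
  jq -n \
    --arg bucket "$BUCKET" \
    --arg prefix "$PREFIX" \
    --arg region "$REGION" \
    --arg accessKey "$ACCESS_KEY" \
    --arg secretKey "$SECRET_KEY" \
    --arg path "$1" \
    '{
      connection: {
        type: "CATALOG",
        params: {
          catalogType: "s3",
          "catalog.s3.bucketName": $bucket,
          "catalog.s3.prefix": $prefix,
          "catalog.s3.region": $region,
          "catalog.s3.accessKey": $accessKey,
          "catalog.s3.secretKey": $secretKey
        }
      },
      path: $path
    }'
}
```

The body can then be piped straight into curl, which reads the request data from stdin when given -d @-. For example: browse_body "2024/01/" | curl -sf -X POST "$CONTROLLER/connections/browse" -H "Content-Type: application/json" -d @- | list_dirs.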
Step 2: Preview the schema
POST /tables/preview
Samples the Parquet files, infers a Pinot schema, and returns an enriched table config (S3 tier, raw field configs, time column) plus sample rows. Review it, tweak if needed, then carry the schema and config forward to steps 3 and 4.
The request and response share the same JSON shape. You send a tableConfig describing the source; the response fills in schema, the enriched tableConfigs.offline, sampled rows, and a summary.
Request
Top-level fields:
| Field | Type | Required | Description |
|---|---|---|---|
| tableConfig | object | Yes | A single OFFLINE table config whose ExternalTableSyncTask block describes the source. |
| config | object | No | Sampling and inference controls (see below). Defaults are applied if omitted. |
| schema | object | No | An explicit schema to use instead of inferring one. |
config.inference — how the schema is derived:
| Field | Type | Default | Description |
|---|---|---|---|
| embeddedSchemaEnabled | boolean | false | Read the schema embedded in the Parquet metadata. Set this to true for External Tables. |
| schemaInferenceEnabled | boolean | false | Infer the schema from sampled rows. Fallback when no embedded schema exists. |
config.sampling — how rows are sampled:
| Field | Type | Default | Description |
|---|---|---|---|
| min | integer | 1 | Minimum records to sample. |
| max | integer | 100 | Maximum records to sample. |
To list source files without sampling any data, set config.previewFiles.previewFilesOnly = true. The response returns matching file URIs in sourceFiles and skips schema inference. (OFFLINE only.)
```shell
curl -X POST "$CONTROLLER/tables/preview" \
  -H "Content-Type: application/json" \
  -d '{
    "tableConfig": {
      "tableName": "nyc_taxi_trips_OFFLINE",
      "tableType": "OFFLINE",
      "task": {
        "taskTypeConfigsMap": {
          "ExternalTableSyncTask": {
            "catalogType": "s3",
            "executor": "controller",
            "inputFormat": "parquet",
            "catalog.s3.bucketName": "<bucket>",
            "catalog.s3.prefix": "<s3-prefix>",
            "catalog.s3.region": "<region>",
            "catalog.s3.accessKey": "<key>",
            "catalog.s3.secretKey": "<secret>",
            "catalog.s3.table.namespace": "default",
            "catalog.s3.table.tableName": "nyc_taxi_trips"
          }
        }
      }
    },
    "config": {
      "inference": { "embeddedSchemaEnabled": true },
      "sampling": { "min": 1, "max": 100 }
    }
  }'
```
Setting the namespace and table. For S3, browse only validates the connection and lists files; the namespace and table name are values you choose. Set catalog.s3.table.namespace (use default for a flat prefix) and catalog.s3.table.tableName (any logical name).

Use executor: controller; it is required for the controller-watcher flow and the observability endpoints. The input tableConfig is intentionally minimal; /tables/preview returns the complete tableConfigs.offline that you persist in Step 4.
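A files-only preview can serve as a cheap first check that the prefix matches the files you expect. The sketch below adds the previewFilesOnly flag described earlier in this step to an existing request body. It assumes jq is installed; the function name files_only and the file name preview-request.json (a saved copy of a request body like the one above) are hypothetical.

```shell
# files_only: reads a /tables/preview request body on stdin and sets
# config.previewFiles.previewFilesOnly to true, leaving all other fields untouched.
files_only() {
  jq '.config.previewFiles.previewFilesOnly = true'
}
```

For example, files_only < preview-request.json | curl -sf -X POST "$CONTROLLER/tables/preview" -H "Content-Type: application/json" -d @- | jq '.sourceFiles' prints the sourceFiles value from the response.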
Response
| Field | Type | Description |
|---|---|---|
| schema | object | The resolved Pinot schema. Use this in step 3. |
| tableConfigs.offline | object | The enriched OFFLINE table config, ready to persist. Use this in step 4. Secrets are masked as *****. |
| rows | array | Sample records after schema + ingestion transforms. |
| sourceRows | array | Raw records before transformation. |
| summary | object | Run summary: nSourceRows, nRows, nColumns, and summary.batch.nMatchingFiles (files found). |
Adjust the schema (time column, column names, null handling) before moving on.
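Before persisting anything, it can help to confirm that the preview actually matched files and resolved columns. A minimal sketch, assuming jq is installed and the preview response was saved to a file such as preview.json. The function names are hypothetical, the summary paths come from the response table above, and the field-spec keys are the standard Apache Pinot schema keys.

```shell
# check_preview FILE: exits 0 only if the preview matched at least one
# source file and resolved at least one column.
check_preview() {
  jq -e '.summary.batch.nMatchingFiles > 0 and .summary.nColumns > 0' "$1" >/dev/null
}

# list_columns FILE: prints "name<TAB>dataType" for every column in the
# resolved schema, across dimension, metric, and date-time field specs.
list_columns() {
  jq -r '.schema
    | (.dimensionFieldSpecs // []) + (.metricFieldSpecs // []) + (.dateTimeFieldSpecs // [])
    | .[] | "\(.name)\t\(.dataType)"' "$1"
}
```

Running check_preview preview.json && list_columns preview.json gives a quick view of what would be registered in step 3.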
Step 3: Create the schema
POST /schemas
Send the schema object from the preview response. Set its schemaName to the table name without the type suffix (for example nyc_taxi_trips, not nyc_taxi_trips_OFFLINE).
```shell
curl -X POST "$CONTROLLER/schemas" \
  -H "Content-Type: application/json" \
  -d @schema.json
```

```json
{ "status": "nyc_taxi_trips successfully added" }
```
Step 4: Create the table
POST /tables
Send the enriched tableConfigs.offline object from the preview response. Its ExternalTableSyncTask block is what marks the table as external.
```shell
curl -X POST "$CONTROLLER/tables" \
  -H "Content-Type: application/json" \
  -d @tableConfig.json
```

```json
{ "status": "Table nyc_taxi_trips_OFFLINE successfully added" }
```
The controller’s External Table watcher then discovers the table, runs the first sync, and re-syncs at the schedule (cron) interval. There is no separate start call.
The first sync runs on the watcher’s next tick. To kick it off immediately instead of waiting, you can manually trigger a run:

```shell
curl -X POST "$CONTROLLER/tasks/schedule?taskType=ExternalTableSyncTask&tableName=nyc_taxi_trips_OFFLINE"
```
Quickstart: onboard a table end-to-end
The whole flow as one script. Fill in the variables at the top, run the script, and the first sync starts automatically.
```shell
#!/usr/bin/env bash
set -euo pipefail

CONTROLLER="https://<your-controller>"
AUTH="Authorization: Bearer <token>"
BUCKET="<bucket>"
PREFIX="<s3-prefix>"
REGION="<region>"
ACCESS_KEY="<key>"
SECRET_KEY="<secret>"
TABLE="nyc_taxi_trips"

# 1. Validate & browse: confirms credentials and lists files.
curl -sf -X POST "$CONTROLLER/connections/browse" \
  -H "$AUTH" -H 'Content-Type: application/json' \
  --data-raw "{
    \"connection\": {
      \"type\": \"CATALOG\",
      \"params\": {
        \"catalogType\": \"s3\",
        \"catalog.s3.bucketName\": \"$BUCKET\",
        \"catalog.s3.prefix\": \"$PREFIX\",
        \"catalog.s3.region\": \"$REGION\",
        \"catalog.s3.accessKey\": \"$ACCESS_KEY\",
        \"catalog.s3.secretKey\": \"$SECRET_KEY\"
      }
    },
    \"path\": \"\"
  }" | jq .

# 2. Preview: infer schema + enriched table config.
curl -sf -X POST "$CONTROLLER/tables/preview" \
  -H "$AUTH" -H 'Content-Type: application/json' \
  --data-raw "{
    \"tableConfig\": {
      \"tableName\": \"${TABLE}_OFFLINE\",
      \"tableType\": \"OFFLINE\",
      \"task\": {
        \"taskTypeConfigsMap\": {
          \"ExternalTableSyncTask\": {
            \"catalogType\": \"s3\",
            \"executor\": \"controller\",
            \"inputFormat\": \"parquet\",
            \"catalog.s3.bucketName\": \"$BUCKET\",
            \"catalog.s3.prefix\": \"$PREFIX\",
            \"catalog.s3.region\": \"$REGION\",
            \"catalog.s3.accessKey\": \"$ACCESS_KEY\",
            \"catalog.s3.secretKey\": \"$SECRET_KEY\",
            \"catalog.s3.table.namespace\": \"default\",
            \"catalog.s3.table.tableName\": \"$TABLE\"
          }
        }
      }
    },
    \"config\": {
      \"inference\": { \"embeddedSchemaEnabled\": true },
      \"sampling\": { \"min\": 1, \"max\": 100 }
    }
  }" > preview.json

# 3. Extract schema and table config from the preview response.
jq --arg t "$TABLE" '.schema | .schemaName = $t' preview.json > schema.json
jq '.tableConfigs.offline' preview.json > tableConfig.json

# 4. Create the schema.
curl -sf -X POST "$CONTROLLER/schemas" \
  -H "$AUTH" -H 'Content-Type: application/json' -d @schema.json

# 5. Create the table; the watcher starts the first sync automatically.
curl -sf -X POST "$CONTROLLER/tables" \
  -H "$AUTH" -H 'Content-Type: application/json' -d @tableConfig.json

# 6. (optional) Check sync status once. This exits non-zero until the sync
#    has completed, so re-run it to poll.
curl -sf "$CONTROLLER/tables/${TABLE}_OFFLINE/externalTable/status" -H "$AUTH" \
  | jq -e '.status == "COMPLETED" and .segmentsUploaded == .filesDiscovered'
```
Once step 6 exits 0, verify it’s queryable.
Monitor onboarding
Three read-only endpoints report ingestion progress — run status, ingestion checkpoint, and source file count — and require executor=controller (set automatically). See Observability for full request and response details.
Verify it’s queryable
When status is COMPLETED and segmentsUploaded matches filesDiscovered, run a query against the broker (or the Data Portal query console) to confirm the data is live:
```sql
SELECT count(*) FROM nyc_taxi_trips;
```
The count will be much larger than the preview’s summary.nSourceRows (preview only samples up to ~100 rows) — confirm it’s non-zero and plausible for your dataset. If status is COMPLETED but the count is 0, give segments a moment to load on the servers, then recheck; if it persists, see Troubleshooting.
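The same check can be scripted against the broker's SQL endpoint. A minimal sketch, assuming jq is installed and that BROKER is a variable you set to your broker base URL (how the broker is exposed on your deployment is not covered on this page). POST /query/sql and the resultTable.rows response shape are the standard Apache Pinot broker API; the function name first_cell is hypothetical.

```shell
# first_cell: reads a Pinot query response on stdin and prints the first
# column of the first row. Exits non-zero if the response reports
# exceptions or contains no rows.
first_cell() {
  jq -e -r 'if (.exceptions // []) | length > 0
            then error("query failed")
            else .resultTable.rows[0][0]
            end'
}

# Usage (needs a reachable broker, so it is left commented out here):
# curl -sf -X POST "$BROKER/query/sql" \
#   -H "Content-Type: application/json" \
#   -d '{"sql": "SELECT count(*) FROM nyc_taxi_trips"}' | first_cell
```

Checking for exceptions matters because a Pinot broker can return HTTP 200 with the error details in the response body rather than in the status code.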
What’s next
Now that the table is created and data is loading, these are the highest-impact follow-up steps:
- Add indexes for your query patterns. Without indexes, every query scans all remote Parquet data. Add a range index on time/numeric columns, an inverted index on low-cardinality filter columns, and a bloom filter on high-cardinality ID columns. → Indexes
- Enable caching and preload. Set enable.prefetch.page.cache=true and preload.enable=true on the S3 tier so index data is served from local disk on repeated queries instead of re-fetched from S3. → Data and Index Caching
- Protect large-scan queries from OOM. For tables that receive heavy aggregations or wide scans, enable the query OOM killer so a runaway query is killed instead of crashing the server. → Best Practices & Configs: Query OOM protection
- Monitor ongoing syncs. Use the observability endpoints to check run status, ingestion checkpoint, and source file count after each scheduled sync. → Observability
For common questions and failures, see the FAQ and Troubleshooting.