This feature requires StarTree release 0.14.0 or later, and must be enabled on demand — contact StarTree support to activate it.
This page shows how to onboard an S3 Data Lake External Table with the StarTree controller REST APIs instead of the Data Portal UI. Use it for automation, infrastructure-as-code, or onboarding many tables at once.

How it works

Onboarding is four calls. Each one feeds the next:
1. Validate & browse   POST /connections/browse   →  validate; list files under the S3 prefix
2. Preview             POST /tables/preview       →  infer schema + enriched table config
3. Create schema       POST /schemas              →  register the schema
4. Create table        POST /tables               →  register the External Table
The preview call (step 2) does most of the work: it samples the source, infers a Pinot schema, and returns a ready-to-use table config. Steps 3 and 4 just persist its output. Once the table exists, the controller’s watcher runs the first sync and re-syncs on schedule, so no manual trigger is required. Track progress with the observability endpoints.
All paths are relative to your controller base URL. The examples assume export CONTROLLER=https://<your-controller>. On StarTree Cloud, the controller is reached through the data-plane proxy, so use export CONTROLLER=https://<data-plane-host>/api/pinot instead.
If your controller requires authentication, add an Authorization header to every request, for example -H "Authorization: Bearer <token>". The per-step examples below omit it for brevity; the Quickstart script includes it.

Prerequisites

  • StarTree 0.14.0 or later with the External Table Beta feature enabled, and tiered storage configured on the cluster. Contact StarTree support if unsure.
  • Network access to the controller REST endpoint, plus an Authorization token if your cluster requires auth.
  • AWS access to the source bucket and prefix with s3:GetObject and s3:ListBucket, using an access key and secret key or one of the role-based methods under Authentication.
  • The S3 bucket name, key prefix, and region.

Authentication

For S3 access, choose one credential method:
Method              Keys                            Use when
Assumed IAM role    roleArn (+ externalId)          Recommended for production; no static secrets.
Cluster node role   (none; uses the instance role)  The cluster’s node role already has access to the bucket.
Static access keys  accessKey / secretKey           Quick tests, or when role-based access isn’t available.
Verify access from the cluster before onboarding:
aws s3 ls s3://<bucket>/<prefix>/

Step 1: Validate and browse the connection

POST /connections/browse
There is no separate “validate” endpoint. Browsing the catalog is the validation step: a 200 with an items list (even an empty one) confirms your credentials and connectivity. Bad credentials or an unreachable prefix return an error.
  • Set path to "" to browse the root prefix — this both validates the connection and lists files/directories.

Request

Field              Type    Required  Description
connection.type    string  Yes       Use CATALOG for all External Table sources.
connection.params  object  Yes       S3 connection settings (see example below).
path               string  No        Where to browse. "" or omitted = root prefix.
curl -X POST "$CONTROLLER/connections/browse" \
  -H "Content-Type: application/json" \
  -d '{
    "connection": {
      "type": "CATALOG",
      "params": {
        "catalogType": "s3",
        "catalog.s3.bucketName": "<bucket>",
        "catalog.s3.prefix": "<s3-prefix>",
        "catalog.s3.region": "<region>",
        "catalog.s3.accessKey": "<key>",
        "catalog.s3.secretKey": "<secret>"
      }
    },
    "path": ""
  }'

Response

{
  "items": [
    { "name": "2024/01/", "type": "DIR" },
    { "name": "data.parquet", "type": "FILE" }
  ]
}
items[].type  Meaning
DIR           A directory you can drill into; pass its name as path.
FILE          A Parquet file under the prefix.
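To drill down, pass the name of a DIR entry as path in the next browse call. A minimal sketch for pulling those names out of a saved or piped response, assuming jq is installed (the Quickstart on this page already uses it) and that the response has the shape shown above. list_dirs is a hypothetical helper name, not part of any StarTree tooling.

```shell
# list_dirs: read a /connections/browse response on stdin and print the
# name of every DIR entry, one per line. Hypothetical helper; assumes jq.
list_dirs() {
  jq -r '.items[] | select(.type == "DIR") | .name'
}

# Example with the sample response shown above:
#   echo '{"items":[{"name":"2024/01/","type":"DIR"},{"name":"data.parquet","type":"FILE"}]}' | list_dirs
```

Each printed name can then be used as the path value of a follow-up request.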

Step 2: Preview the schema

POST /tables/preview
Samples the Parquet files, infers a Pinot schema, and returns an enriched table config (S3 tier, raw field configs, time column) plus sample rows. Review it, tweak if needed, then carry the schema and config forward to steps 3 and 4.
The request and response share the same JSON shape. You send a tableConfig describing the source; the response fills in schema, the enriched tableConfigs.offline, sampled rows, and a summary.

Request

Top-level fields:
Field        Type    Required  Description
tableConfig  object  Yes       A single OFFLINE table config whose ExternalTableSyncTask block describes the source.
config       object  No        Sampling and inference controls (see below). Defaults are applied if omitted.
schema       object  No        An explicit schema to use instead of inferring one.
config.inference — how the schema is derived:
Field                   Type     Default  Description
embeddedSchemaEnabled   boolean  false    Read the schema embedded in the Parquet metadata. Set this to true for External Tables.
schemaInferenceEnabled  boolean  false    Infer the schema from sampled rows. Fallback when no embedded schema exists.
config.sampling — how rows are sampled:
Field  Type     Default  Description
min    integer  1        Minimum records to sample.
max    integer  100      Maximum records to sample.
To list source files without sampling any data, set config.previewFiles.previewFilesOnly = true. The response returns matching file URIs in sourceFiles and skips schema inference. (OFFLINE only.)
curl -X POST "$CONTROLLER/tables/preview" \
  -H "Content-Type: application/json" \
  -d '{
    "tableConfig": {
      "tableName": "nyc_taxi_trips_OFFLINE",
      "tableType": "OFFLINE",
      "task": {
        "taskTypeConfigsMap": {
          "ExternalTableSyncTask": {
            "catalogType": "s3",
            "executor": "controller",
            "inputFormat": "parquet",
            "catalog.s3.bucketName": "<bucket>",
            "catalog.s3.prefix": "<s3-prefix>",
            "catalog.s3.region": "<region>",
            "catalog.s3.accessKey": "<key>",
            "catalog.s3.secretKey": "<secret>",
            "catalog.s3.table.namespace": "default",
            "catalog.s3.table.tableName": "nyc_taxi_trips"
          }
        }
      }
    },
    "config": {
      "inference": { "embeddedSchemaEnabled": true },
      "sampling":  { "min": 1, "max": 100 }
    }
  }'
Setting the namespace and table. For S3, browse only validates the connection and lists files; the names are values you choose. Set catalog.s3.table.namespace (use default for a flat prefix) and catalog.s3.table.tableName (any logical name).
Use executor: controller. It is required for the controller-watcher flow and the observability endpoints. The input tableConfig is intentionally minimal; /tables/preview returns the complete tableConfigs.offline you persist in Step 4.
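The file-listing mode (config.previewFiles.previewFilesOnly) uses the same source block as a full preview. The sketch below only writes such a request body to disk. The file name preview-files.json is an arbitrary choice, and the placeholder values must be replaced before sending.

```shell
# Write a /tables/preview request body that lists matching source files
# without sampling rows. preview-files.json is an arbitrary file name.
cat > preview-files.json <<'EOF'
{
  "tableConfig": {
    "tableName": "nyc_taxi_trips_OFFLINE",
    "tableType": "OFFLINE",
    "task": {
      "taskTypeConfigsMap": {
        "ExternalTableSyncTask": {
          "catalogType": "s3",
          "executor": "controller",
          "inputFormat": "parquet",
          "catalog.s3.bucketName": "<bucket>",
          "catalog.s3.prefix": "<s3-prefix>",
          "catalog.s3.region": "<region>",
          "catalog.s3.accessKey": "<key>",
          "catalog.s3.secretKey": "<secret>",
          "catalog.s3.table.namespace": "default",
          "catalog.s3.table.tableName": "nyc_taxi_trips"
        }
      }
    }
  },
  "config": {
    "previewFiles": { "previewFilesOnly": true }
  }
}
EOF

# Send it the same way as the full preview (not run here):
#   curl -X POST "$CONTROLLER/tables/preview" \
#     -H "Content-Type: application/json" -d @preview-files.json
```

The matching file URIs should then appear under sourceFiles in the response, with no schema inferred.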

Response

Field                 Type    Description
schema                object  The resolved Pinot schema. Use this in step 3.
tableConfigs.offline  object  The enriched OFFLINE table config, ready to persist. Use this in step 4. Secrets are masked as *****.
rows                  array   Sample records after schema + ingestion transforms.
sourceRows            array   Raw records before transformation.
summary               object  Run summary: nSourceRows, nRows, nColumns, and summary.batch.nMatchingFiles (files found).
Adjust the schema (time column, column names, null handling) before moving on.
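Before carrying the preview forward, it can help to confirm that it actually matched source files. A minimal sketch, assuming jq is available and the response was saved to a file. preview_matched_files is a hypothetical helper, and the field path follows the summary row in the table above.

```shell
# preview_matched_files FILE: exit 0 if the saved preview response reports
# at least one matching source file, non-zero otherwise.
# Hypothetical helper; a missing count is treated as 0.
preview_matched_files() {
  jq -e '(.summary.batch.nMatchingFiles // 0) > 0' "$1" > /dev/null
}

# Example:
#   preview_matched_files preview.json || echo "preview matched no files" >&2
```

A check like this stops an automated run from creating a schema and table for a prefix that is empty or mistyped.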

Step 3: Create the schema

POST /schemas
Send the schema object from the preview response. Set its schemaName to the table name without the type suffix (nyc_taxi_trips, not nyc_taxi_trips_OFFLINE).
curl -X POST "$CONTROLLER/schemas" \
  -H "Content-Type: application/json" \
  -d @schema.json
Response:
{ "status": "nyc_taxi_trips successfully added" }

Step 4: Create the table

POST /tables
Send the enriched tableConfigs.offline object from the preview response. Its ExternalTableSyncTask block is what marks the table as external.
curl -X POST "$CONTROLLER/tables" \
  -H "Content-Type: application/json" \
  -d @tableConfig.json
Response:
{ "status": "Table nyc_taxi_trips_OFFLINE successfully added" }
The controller’s External Table watcher then discovers the table, runs the first sync, and re-syncs at the schedule (cron) interval. There is no separate start call.
The first sync runs on the watcher’s next tick. To kick it off immediately instead of waiting, you can manually trigger a run:
curl -X POST "$CONTROLLER/tasks/schedule?taskType=ExternalTableSyncTask&tableName=nyc_taxi_trips_OFFLINE"

Quickstart: onboard a table end-to-end

The whole flow as one script. Fill in the variables at the top, run the script, and the first sync starts automatically.
#!/usr/bin/env bash
set -euo pipefail

CONTROLLER="https://<your-controller>"
AUTH="Authorization: Bearer <token>"
BUCKET="<bucket>"
PREFIX="<s3-prefix>"
REGION="<region>"
ACCESS_KEY="<key>"
SECRET_KEY="<secret>"
TABLE="nyc_taxi_trips"

# 1. Validate & browse — confirms credentials and lists files.
curl -sf -X POST "$CONTROLLER/connections/browse" \
  -H "$AUTH" -H 'Content-Type: application/json' \
  --data-raw "{
    \"connection\": {
      \"type\": \"CATALOG\",
      \"params\": {
        \"catalogType\": \"s3\",
        \"catalog.s3.bucketName\": \"$BUCKET\",
        \"catalog.s3.prefix\": \"$PREFIX\",
        \"catalog.s3.region\": \"$REGION\",
        \"catalog.s3.accessKey\": \"$ACCESS_KEY\",
        \"catalog.s3.secretKey\": \"$SECRET_KEY\"
      }
    },
    \"path\": \"\"
  }" | jq .

# 2. Preview — infer schema + enriched table config.
curl -sf -X POST "$CONTROLLER/tables/preview" \
  -H "$AUTH" -H 'Content-Type: application/json' \
  --data-raw "{
    \"tableConfig\": {
      \"tableName\": \"${TABLE}_OFFLINE\",
      \"tableType\": \"OFFLINE\",
      \"task\": {
        \"taskTypeConfigsMap\": {
          \"ExternalTableSyncTask\": {
            \"catalogType\": \"s3\",
            \"executor\": \"controller\",
            \"inputFormat\": \"parquet\",
            \"catalog.s3.bucketName\": \"$BUCKET\",
            \"catalog.s3.prefix\": \"$PREFIX\",
            \"catalog.s3.region\": \"$REGION\",
            \"catalog.s3.accessKey\": \"$ACCESS_KEY\",
            \"catalog.s3.secretKey\": \"$SECRET_KEY\",
            \"catalog.s3.table.namespace\": \"default\",
            \"catalog.s3.table.tableName\": \"$TABLE\"
          }
        }
      }
    },
    \"config\": {
      \"inference\": { \"embeddedSchemaEnabled\": true },
      \"sampling\":  { \"min\": 1, \"max\": 100 }
    }
  }" > preview.json

# 3. Extract schema and table config from preview response.
jq --arg t "$TABLE" '.schema | .schemaName = $t' preview.json > schema.json
jq '.tableConfigs.offline'                        preview.json > tableConfig.json

# 4. Create the schema.
curl -sf -X POST "$CONTROLLER/schemas" \
  -H "$AUTH" -H 'Content-Type: application/json' -d @schema.json

# 5. Create the table — watcher starts the first sync automatically.
curl -sf -X POST "$CONTROLLER/tables" \
  -H "$AUTH" -H 'Content-Type: application/json' -d @tableConfig.json

# 6. (optional) Poll every 30 seconds until the sync completes.
until curl -sf "$CONTROLLER/tables/${TABLE}_OFFLINE/externalTable/status" -H "$AUTH" \
  | jq -e '.status == "COMPLETED" and .segmentsUploaded == .filesDiscovered' > /dev/null; do
  sleep 30
done
Once step 6 exits 0, verify it’s queryable.
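The status check in step 6 can be wrapped in a bounded retry, so that an automated run gives up after a fixed number of attempts. A minimal sketch: wait_until and sync_done are hypothetical helper names, the attempt count and delay are arbitrary, and sync_done reuses the endpoint, fields, and variables from the script above.

```shell
# wait_until ATTEMPTS DELAY_SECONDS CMD [ARGS...]
# Runs CMD until it exits 0, sleeping DELAY_SECONDS between tries.
# Returns 0 on success, 1 if every attempt fails. Hypothetical helper.
wait_until() {
  wu_attempts="$1"
  wu_delay="$2"
  shift 2
  wu_i=1
  while [ "$wu_i" -le "$wu_attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Do not sleep after the final failed attempt.
    if [ "$wu_i" -lt "$wu_attempts" ]; then
      sleep "$wu_delay"
    fi
    wu_i=$((wu_i + 1))
  done
  return 1
}

# sync_done: the status check from step 6, using CONTROLLER, TABLE, and
# AUTH as defined at the top of the script.
sync_done() {
  curl -sf "$CONTROLLER/tables/${TABLE}_OFFLINE/externalTable/status" -H "$AUTH" \
    | jq -e '.status == "COMPLETED" and .segmentsUploaded == .filesDiscovered' > /dev/null
}

# Example: check every 30 seconds, give up after 20 attempts.
#   wait_until 20 30 sync_done || echo "sync did not complete in time" >&2
```

Capping the attempts means a sync that never reaches COMPLETED fails the run instead of blocking it.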

Monitor onboarding

Three read-only endpoints report ingestion progress — run status, ingestion checkpoint, and source file count — and require executor=controller (set automatically). See Observability for full request and response details.

Verify it’s queryable

When status is COMPLETED and segmentsUploaded matches filesDiscovered, run a query against the broker (or the Data Portal query console) to confirm the data is live:
SELECT count(*) FROM nyc_taxi_trips;
The count will be much larger than the preview’s summary.nSourceRows (preview only samples up to ~100 rows) — confirm it’s non-zero and plausible for your dataset. If status is COMPLETED but the count is 0, give segments a moment to load on the servers, then recheck; if it persists, see Troubleshooting.
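The count check can also be scripted. The sketch below assumes the standard Apache Pinot broker endpoint POST /query/sql is reachable at a base URL held in BROKER. That variable is hypothetical: this page does not state the broker URL for your deployment, and an Authorization header may be needed as with the controller. count_query and count_rows are illustrative names, and jq is assumed.

```shell
# count_query TABLE: print the JSON body for a count(*) query.
# TABLE is interpolated as-is, so pass only plain table names.
count_query() {
  printf '{"sql": "SELECT count(*) FROM %s"}' "$1"
}

# count_rows TABLE: send the query to the broker and print the count.
# BROKER is a hypothetical variable holding the broker base URL.
count_rows() {
  count_query "$1" \
    | curl -sf -X POST "$BROKER/query/sql" \
        -H "Content-Type: application/json" -d @- \
    | jq -r '.resultTable.rows[0][0]'
}

# Example:
#   count_rows nyc_taxi_trips
```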

What’s next

Now that the table is created and data is loading, these are the highest-impact follow-up steps:
  1. Add indexes for your query patterns. Without indexes, every query scans all remote Parquet data. Add a range index on time/numeric columns, an inverted index on low-cardinality filter columns, and a bloom filter on high-cardinality ID columns. → Indexes
  2. Enable caching and preload. Set enable.prefetch.page.cache=true and preload.enable=true on the S3 tier so index data is served from local disk on repeated queries instead of re-fetched from S3. → Data and Index Caching
  3. Protect large-scan queries from OOM. For tables that receive heavy aggregations or wide scans, enable the query OOM killer so a runaway query is killed instead of crashing the server. → Best Practices & Configs — Query OOM protection
  4. Monitor ongoing syncs. Use the observability endpoints to check run status, ingestion checkpoint, and source file count after each scheduled sync. → Observability

For common questions and failures, see the FAQ and Troubleshooting.