S3: Onboarding via API - StarTree Docs

This feature requires StarTree release 0.14.0 or later, and must be enabled on demand — contact StarTree support to activate it.

This page shows how to onboard an S3 Data Lake External Table with the StarTree controller REST APIs instead of the Data Portal UI. Use it for automation, infrastructure-as-code, or onboarding many tables at once.

How it works

Onboarding is four calls. Each one feeds the next:

Validate & browse   POST /connections/browse   →  validate; list files under the S3 prefix
Preview             POST /tables/preview       →  infer schema + enriched table config
Create schema       POST /schemas              →  register the schema
Create table        POST /tables               →  register the External Table

The preview call (step 2) is the core step: it samples the source, infers a Pinot schema, and returns a ready-to-use table config. Steps 3 and 4 just persist its output. Once the table exists, the controller’s watcher runs the first sync and re-syncs on schedule — no manual trigger. Track progress with the observability endpoints.

All paths are relative to your controller base URL. The examples assume export CONTROLLER=https://<your-controller>. On StarTree Cloud, the controller is reached through the data-plane proxy, so use export CONTROLLER=https://<data-plane-host>/api/pinot.

If your controller requires authentication, add an Authorization header to every request, e.g. -H "Authorization: Bearer <token>". The examples below omit it for brevity.
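The base URL and the optional header can be folded into a small wrapper so that later calls stay short. This is a minimal sketch, not part of any StarTree tooling: api_post and the TOKEN variable are hypothetical names, and it assumes a bearer token is the right scheme for your cluster.

```shell
# Hypothetical wrapper (not a StarTree command): POSTs a JSON body to a
# controller path and adds the Authorization header only when TOKEN is set.
# Assumes CONTROLLER is exported as described above; TOKEN is an assumed name.
api_post() {
  api_path="$1"
  api_body="$2"
  # Build the curl argument list. "${CONTROLLER%/}" strips one trailing
  # slash so "$CONTROLLER/" and "$CONTROLLER" yield the same request URL.
  set -- -sS -X POST "${CONTROLLER%/}${api_path}" \
         -H "Content-Type: application/json"
  if [ -n "${TOKEN:-}" ]; then
    set -- "$@" -H "Authorization: Bearer ${TOKEN}"
  fi
  # curl treats a -d value that starts with "@" as a file name, so both
  # inline JSON and "@schema.json" work as the second argument.
  curl "$@" -d "$api_body"
}
```

With this in place, a call such as api_post /schemas @schema.json would send the same request as the longer curl form used in the steps below.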

Prerequisites

StarTree 0.14.0 or later with the External Table Beta feature enabled, and tiered storage configured on the cluster. Contact StarTree support if unsure.
Network access to the controller REST endpoint, plus an Authorization token if your cluster requires auth.
AWS credentials (an IAM role, or an access key + secret key; see Authentication) with s3:GetObject and s3:ListBucket on the source bucket and prefix.
The S3 bucket name, key prefix, and region.

Authentication

For S3 access, choose one credential method:

Method	Keys	Use when
Assumed IAM role	`roleArn` (+ `externalId`)	Recommended for production — no static secrets.
Cluster node role	(none — uses the instance role)	The cluster’s node role already has access to the bucket.
Static access keys	`accessKey` / `secretKey`	Quick tests, or when role-based access isn’t available.

Verify access from the cluster before onboarding:

aws s3 ls s3://<bucket>/<prefix>/

Step 1: Validate and browse the connection

POST /connections/browse

There is no separate “validate” endpoint. Browsing the catalog is the validation step: a 200 with an items list (even an empty one) confirms your credentials and connectivity. Bad credentials or an unreachable prefix return an error.

Set path to "" to browse the root prefix — this both validates the connection and lists files/directories.

Request

Field	Type	Required	Description
`connection.type`	string	Yes	Use `CATALOG` for all External Table sources.
`connection.params`	object	Yes	S3 connection settings (see example below).
`path`	string	No	Where to browse. `""` or omitted = root prefix.

curl -X POST "$CONTROLLER/connections/browse" \
  -H "Content-Type: application/json" \
  -d '{
    "connection": {
      "type": "CATALOG",
      "params": {
        "catalogType": "s3",
        "catalog.s3.bucketName": "<bucket>",
        "catalog.s3.prefix": "<s3-prefix>",
        "catalog.s3.region": "<region>",
        "catalog.s3.accessKey": "<key>",
        "catalog.s3.secretKey": "<secret>"
      }
    },
    "path": ""
  }'

Response

{
  "items": [
    { "name": "2024/01/", "type": "DIR" },
    { "name": "data.parquet", "type": "FILE" }
  ]
}

`items[].type`	Meaning
`DIR`	A directory you can drill into — pass its `name` as `path`.
`FILE`	A Parquet file under the prefix.
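Drilling into a directory means repeating the same request with path set to the name returned for a DIR item. The sketch below builds that request body; browse_payload is a hypothetical helper, the variable names are borrowed from the Quickstart script, and it assumes none of the values contain characters that need JSON escaping.

```shell
# Hypothetical helper: prints the /connections/browse request body for a
# given path (empty or omitted = root prefix). BUCKET, PREFIX, REGION,
# ACCESS_KEY and SECRET_KEY are assumed to be set in the environment.
# Values are interpolated verbatim, so they must not contain '"' or '\'.
browse_payload() {
  cat <<EOF
{
  "connection": {
    "type": "CATALOG",
    "params": {
      "catalogType": "s3",
      "catalog.s3.bucketName": "${BUCKET}",
      "catalog.s3.prefix": "${PREFIX}",
      "catalog.s3.region": "${REGION}",
      "catalog.s3.accessKey": "${ACCESS_KEY}",
      "catalog.s3.secretKey": "${SECRET_KEY}"
    }
  },
  "path": "${1:-}"
}
EOF
}

# Usage: list the contents of the 2024/01/ directory from the response above.
#   curl -X POST "$CONTROLLER/connections/browse" \
#     -H "Content-Type: application/json" \
#     -d "$(browse_payload "2024/01/")"
```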

Step 2: Preview the schema

POST /tables/preview

Samples the Parquet files, infers a Pinot schema, and returns an enriched table config (S3 tier, raw field configs, time column) plus sample rows. Review it, tweak if needed, then carry the schema and config forward to steps 3 and 4.

The request and response share the same JSON shape. You send a tableConfig describing the source; the response fills in schema, the enriched tableConfigs.offline, sampled rows, and a summary.

Request

Top-level fields:

Field	Type	Required	Description
`tableConfig`	object	Yes	A single OFFLINE table config whose `ExternalTableSyncTask` block describes the source.
`config`	object	No	Sampling and inference controls (see below). Defaults are applied if omitted.
`schema`	object	No	An explicit schema to use instead of inferring one.

config.inference — how the schema is derived:

Field	Type	Default	Description
`embeddedSchemaEnabled`	boolean	`false`	Read the schema embedded in the Parquet metadata. Set this to `true` for External Tables.
`schemaInferenceEnabled`	boolean	`false`	Infer the schema from sampled rows. Fallback when no embedded schema exists.

config.sampling — how rows are sampled:

Field	Type	Default	Description
`min`	integer	`1`	Minimum records to sample.
`max`	integer	`100`	Maximum records to sample.

To list source files without sampling any data, set config.previewFiles.previewFilesOnly = true. The response returns matching file URIs in sourceFiles and skips schema inference. (OFFLINE only.)

curl -X POST "$CONTROLLER/tables/preview" \
  -H "Content-Type: application/json" \
  -d '{
    "tableConfig": {
      "tableName": "nyc_taxi_trips_OFFLINE",
      "tableType": "OFFLINE",
      "task": {
        "taskTypeConfigsMap": {
          "ExternalTableSyncTask": {
            "catalogType": "s3",
            "executor": "controller",
            "inputFormat": "parquet",
            "catalog.s3.bucketName": "<bucket>",
            "catalog.s3.prefix": "<s3-prefix>",
            "catalog.s3.region": "<region>",
            "catalog.s3.accessKey": "<key>",
            "catalog.s3.secretKey": "<secret>",
            "catalog.s3.table.namespace": "default",
            "catalog.s3.table.tableName": "nyc_taxi_trips"
          }
        }
      }
    },
    "config": {
      "inference": { "embeddedSchemaEnabled": true },
      "sampling":  { "min": 1, "max": 100 }
    }
  }'

Setting the namespace and table. For S3, browse only validates the connection and lists files; the names are values you choose: set catalog.s3.table.namespace (use default for a flat prefix) and catalog.s3.table.tableName (any logical name).

Use executor: controller. It’s required for the controller-watcher flow and the observability endpoints. The input tableConfig is intentionally minimal; /tables/preview returns the complete tableConfigs.offline you persist in Step 4.
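The files-only mode described before the example can reuse the same tableConfig with a different config block. The sketch below prints such a request body; files_only_payload is a hypothetical helper, the variable names come from the Quickstart script, and the values are assumed not to need JSON escaping.

```shell
# Hypothetical helper: prints a /tables/preview body that only lists source
# files. The tableConfig mirrors the example above; only "config" differs.
# TABLE, BUCKET, PREFIX, REGION, ACCESS_KEY and SECRET_KEY are assumed
# environment variables (same names as the Quickstart script).
files_only_payload() {
  cat <<EOF
{
  "tableConfig": {
    "tableName": "${TABLE}_OFFLINE",
    "tableType": "OFFLINE",
    "task": {
      "taskTypeConfigsMap": {
        "ExternalTableSyncTask": {
          "catalogType": "s3",
          "executor": "controller",
          "inputFormat": "parquet",
          "catalog.s3.bucketName": "${BUCKET}",
          "catalog.s3.prefix": "${PREFIX}",
          "catalog.s3.region": "${REGION}",
          "catalog.s3.accessKey": "${ACCESS_KEY}",
          "catalog.s3.secretKey": "${SECRET_KEY}",
          "catalog.s3.table.namespace": "default",
          "catalog.s3.table.tableName": "${TABLE}"
        }
      }
    }
  },
  "config": {
    "previewFiles": { "previewFilesOnly": true }
  }
}
EOF
}

# Usage: the matching file URIs come back in the sourceFiles field.
#   curl -X POST "$CONTROLLER/tables/preview" \
#     -H "Content-Type: application/json" -d "$(files_only_payload)"
```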

Response

Field	Type	Description
`schema`	object	The resolved Pinot schema. Use this in step 3.
`tableConfigs.offline`	object	The enriched OFFLINE table config, ready to persist. Use this in step 4. Secrets are masked as `*****`.
`rows`	array	Sample records after schema + ingestion transforms.
`sourceRows`	array	Raw records before transformation.
`summary`	object	Run summary: `nSourceRows`, `nRows`, `nColumns`, and `summary.batch.nMatchingFiles` (files found).

Adjust the schema (time column, column names, null handling) before moving on.
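Before persisting anything, it can help to confirm that the preview actually matched source files and returned a schema. The check below is a sketch under assumptions: preview_looks_usable is a hypothetical name, the response is assumed to have been saved to a file, and jq must be installed.

```shell
# Hypothetical check: succeeds only when the saved preview response reports
# at least one matching source file and contains a schema object.
# $1 is the path to the saved response (for example preview.json).
preview_looks_usable() {
  # jq -e sets the exit status from the result: 0 for true, 1 for false.
  # A missing summary field is treated as zero matching files.
  jq -e '(.summary.batch.nMatchingFiles // 0) > 0
         and (.schema | type == "object")' "$1" > /dev/null
}

# Usage:
#   preview_looks_usable preview.json || echo "preview matched no files" >&2
```

A failing check usually points back at the prefix or credentials used in Step 1 rather than at the schema itself.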

Step 3: Create the schema

POST /schemas

Send the schema object from the preview response, saved here as schema.json. Set its schemaName to the table name without the type suffix (nyc_taxi_trips for nyc_taxi_trips_OFFLINE).

curl -X POST "$CONTROLLER/schemas" \
  -H "Content-Type: application/json" \
  -d @schema.json

{ "status": "nyc_taxi_trips successfully added" }

Step 4: Create the table

POST /tables

Send the enriched tableConfigs.offline object from the preview response, saved here as tableConfig.json. Its ExternalTableSyncTask block is what marks the table as external.

curl -X POST "$CONTROLLER/tables" \
  -H "Content-Type: application/json" \
  -d @tableConfig.json

{ "status": "Table nyc_taxi_trips_OFFLINE successfully added" }

The controller’s External Table watcher then discovers the table, runs the first sync, and re-syncs at the schedule (cron) interval. There is no separate start call.

The first sync runs on the watcher’s next tick. To kick it off immediately instead of waiting, you can manually trigger a run:

curl -X POST "$CONTROLLER/tasks/schedule?taskType=ExternalTableSyncTask&tableName=nyc_taxi_trips_OFFLINE"

Quickstart: onboard a table end-to-end

The whole flow as one script. Fill in the variables at the top, run the script, and the first sync starts automatically.

#!/usr/bin/env bash
set -euo pipefail

CONTROLLER="https://<your-controller>"
AUTH="Authorization: Bearer <token>"
BUCKET="<bucket>"
PREFIX="<s3-prefix>"
REGION="<region>"
ACCESS_KEY="<key>"
SECRET_KEY="<secret>"
TABLE="nyc_taxi_trips"

# 1. Validate & browse — confirms credentials and lists files.
curl -sf -X POST "$CONTROLLER/connections/browse" \
  -H "$AUTH" -H 'Content-Type: application/json' \
  --data-raw "{
    \"connection\": {
      \"type\": \"CATALOG\",
      \"params\": {
        \"catalogType\": \"s3\",
        \"catalog.s3.bucketName\": \"$BUCKET\",
        \"catalog.s3.prefix\": \"$PREFIX\",
        \"catalog.s3.region\": \"$REGION\",
        \"catalog.s3.accessKey\": \"$ACCESS_KEY\",
        \"catalog.s3.secretKey\": \"$SECRET_KEY\"
      }
    },
    \"path\": \"\"
  }" | jq .

# 2. Preview — infer schema + enriched table config.
curl -sf -X POST "$CONTROLLER/tables/preview" \
  -H "$AUTH" -H 'Content-Type: application/json' \
  --data-raw "{
    \"tableConfig\": {
      \"tableName\": \"${TABLE}_OFFLINE\",
      \"tableType\": \"OFFLINE\",
      \"task\": {
        \"taskTypeConfigsMap\": {
          \"ExternalTableSyncTask\": {
            \"catalogType\": \"s3\",
            \"executor\": \"controller\",
            \"inputFormat\": \"parquet\",
            \"catalog.s3.bucketName\": \"$BUCKET\",
            \"catalog.s3.prefix\": \"$PREFIX\",
            \"catalog.s3.region\": \"$REGION\",
            \"catalog.s3.accessKey\": \"$ACCESS_KEY\",
            \"catalog.s3.secretKey\": \"$SECRET_KEY\",
            \"catalog.s3.table.namespace\": \"default\",
            \"catalog.s3.table.tableName\": \"$TABLE\"
          }
        }
      }
    },
    \"config\": {
      \"inference\": { \"embeddedSchemaEnabled\": true },
      \"sampling\":  { \"min\": 1, \"max\": 100 }
    }
  }" > preview.json

# 3. Extract schema and table config from preview response.
jq --arg t "$TABLE" '.schema | .schemaName = $t' preview.json > schema.json
jq '.tableConfigs.offline'                        preview.json > tableConfig.json

# 4. Create the schema.
curl -sf -X POST "$CONTROLLER/schemas" \
  -H "$AUTH" -H 'Content-Type: application/json' -d @schema.json

# 5. Create the table — watcher starts the first sync automatically.
curl -sf -X POST "$CONTROLLER/tables" \
  -H "$AUTH" -H 'Content-Type: application/json' -d @tableConfig.json

# 6. (optional) Poll every 10 s until the sync completes (Ctrl-C to stop).
until curl -sf "$CONTROLLER/tables/${TABLE}_OFFLINE/externalTable/status" -H "$AUTH" \
  | jq -e '.status == "COMPLETED" and .segmentsUploaded == .filesDiscovered' > /dev/null; do
  sleep 10
done

Once step 6 exits 0, verify it’s queryable.

Monitor onboarding

Three read-only endpoints report ingestion progress — run status, ingestion checkpoint, and source file count — and require executor=controller (set automatically). See Observability for full request and response details.

Verify it’s queryable

When status is COMPLETED and segmentsUploaded matches filesDiscovered, run a query against the broker (or the Data Portal query console) to confirm the data is live:

SELECT count(*) FROM nyc_taxi_trips;

The count will be much larger than the preview’s summary.nSourceRows (preview only samples up to ~100 rows) — confirm it’s non-zero and plausible for your dataset. If status is COMPLETED but the count is 0, give segments a moment to load on the servers, then recheck; if it persists, see Troubleshooting.
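The same check can be scripted against the broker's SQL endpoint. This is a sketch under assumptions: count_query_payload is a hypothetical helper, BROKER is an assumed variable holding your broker base URL, and /query/sql is the Apache Pinot broker query path, which may be exposed differently behind the StarTree Cloud proxy.

```shell
# Hypothetical helper: prints the JSON body for a row-count query.
# Assumes the table name contains no characters that need JSON escaping.
count_query_payload() {
  printf '{"sql": "SELECT count(*) FROM %s"}\n' "$1"
}

# Usage (BROKER is an assumption, not a name from this page; the count is
# the first cell of the first row in the Pinot response's resultTable):
#   curl -sS -X POST "$BROKER/query/sql" \
#     -H "Content-Type: application/json" \
#     -d "$(count_query_payload nyc_taxi_trips)" | jq '.resultTable.rows[0][0]'
```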

What’s next

Now that the table is created and data is loading, these are the highest-impact follow-up steps:

Add indexes for your query patterns. Without indexes, every query scans all remote Parquet data. Add a range index on time/numeric columns, an inverted index on low-cardinality filter columns, and a bloom filter on high-cardinality ID columns. → Indexes
Enable caching and preload. Set enable.prefetch.page.cache=true and preload.enable=true on the S3 tier so index data is served from local disk on repeated queries instead of re-fetched from S3. → Data and Index Caching
Protect large-scan queries from OOM. For tables that receive heavy aggregations or wide scans, enable the query OOM killer so a runaway query is killed instead of crashing the server. → Best Practices & Configs — Query OOM protection
Monitor ongoing syncs. Use the observability endpoints to check run status, ingestion checkpoint, and source file count after each scheduled sync. → Observability

For common questions and failures, see the FAQ and Troubleshooting.
