# Iceberg: Onboarding via API

<Warning>
  This feature requires StarTree release 0.14.0 or later, and must be enabled on demand — contact StarTree support to activate it.
</Warning>

This page shows how to onboard an Iceberg External Table with the StarTree controller REST APIs instead of the [Data Portal UI](./onboarding-data-portal). Use it for automation, infrastructure-as-code, or onboarding many tables at once.

## How it works

Onboarding is four calls. Each one feeds the next:

```text theme={null}
1. Validate & browse   POST /connections/browse   →  validate; list namespaces/tables
2. Preview             POST /tables/preview       →  infer schema + enriched table config
3. Create schema       POST /schemas              →  register the schema
4. Create table        POST /tables               →  register the External Table
```

The preview call (step 2) is the core step: it samples the source, infers a Pinot schema, and returns a ready-to-use table config. Steps 3 and 4 just persist its output.

Once the table exists, the controller's watcher runs the first sync and re-syncs on schedule — no manual trigger. Track progress with the [observability endpoints](#monitor-onboarding).

<Note>
  All paths are relative to your controller base URL. The examples assume `export CONTROLLER=https://<your-controller>`. On StarTree Cloud, the controller is reached through the data-plane proxy, so use `export CONTROLLER=https://<data-plane-host>/api/pinot`.

  If your controller requires authentication, add an `Authorization` header to every request — e.g. `-H "Authorization: Bearer <token>"`. The examples below omit it for brevity.
</Note>
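The note above asks you to add an `Authorization` header whenever the controller requires one. A minimal sketch of a wrapper that adds it only when a token is set follows. `api_post`, `TOKEN`, and the `CURL` dry-run override are hypothetical names introduced for illustration, not part of the StarTree API.

```bash
# Hypothetical helper: POST a JSON file to a controller path.
# Assumes CONTROLLER is exported as described above. TOKEN is optional.
# Set CURL=echo to print the arguments instead of sending the request (dry run).
api_post() {
  local path="$1" body_file="$2"
  set -- -sf -X POST "$CONTROLLER$path" \
    -H "Content-Type: application/json" -d @"$body_file"
  if [ -n "${TOKEN:-}" ]; then
    # Only add the header when a token is configured.
    set -- "$@" -H "Authorization: Bearer $TOKEN"
  fi
  "${CURL:-curl}" "$@"
}
```

With this in place, a call such as `api_post /schemas schema.json` behaves the same on clusters with and without authentication.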

## Supported Iceberg sources

| Source                           | Description                                                                             | `catalogType`                                                  |
| -------------------------------- | --------------------------------------------------------------------------------------- | -------------------------------------------------------------- |
| **AWS Glue (Iceberg REST)**      | Iceberg tables managed by AWS Glue, over the Iceberg REST protocol.                     | `iceberg-rest` (`serviceType=glue`)                            |
| **AWS S3 Tables (Iceberg REST)** | Iceberg tables in S3 Tables buckets, over the Iceberg REST protocol.                    | `iceberg-rest` (`serviceType=s3Tables`)                        |
| **Other Iceberg REST catalogs**  | Any catalog that speaks the Iceberg REST protocol (e.g. Nessie, Polaris, generic REST). | `iceberg-rest` (`serviceType=nessie` or omit for generic REST) |

`catalogType=iceberg-rest` works with **any Iceberg REST–compliant catalog**. Built-in service adapters exist for **AWS Glue** (`glue`), **AWS S3 Tables** (`s3Tables`), and **Nessie** (`nessie`); other REST catalogs use the generic REST path. Data files must be **Parquet**.

## Prerequisites

* StarTree 0.14.0 or later with the External Table Beta feature enabled, and **tiered storage configured** on the cluster. Contact StarTree support if unsure.
* Network access to the controller REST endpoint, plus an `Authorization` token if your cluster requires auth.
* AWS credentials (access key + secret key) with read access to the catalog **and** the underlying S3 data.
* **AWS Glue**: the AWS account ID (the catalog warehouse) and the Glue database name.
* **AWS S3 Tables**: the full table bucket ARN.

## Authentication

An External Table authenticates in two places, configured independently so the catalog and the data can use different credentials:

* **Catalog (REST)** — keys under `catalog.iceberg-rest.auth.rest.*`. Set `authType` to one of `aws-sigv4` (Glue / S3 Tables), `oauth2`, `token`, or `none`.
* **Storage (S3 data files)** — keys under `catalog.iceberg-rest.auth.storage.*`.

For S3 access, choose one credential method:

| Method             | Keys                                                             | Use when                                                  |
| ------------------ | ---------------------------------------------------------------- | --------------------------------------------------------- |
| Assumed IAM role   | `roleArn` (+ `externalId`)                                       | Recommended for production — no static secrets.           |
| Cluster node role  | *(none — uses the instance role)*                                | The cluster's node role already has access to the bucket. |
| Static access keys | `accessKey` / `secretKey` (or `accessKeyId` / `secretAccessKey`) | Quick tests, or when role-based access isn't available.   |

The role or keys need `s3:GetObject` and `s3:ListBucket` on the source bucket and prefix. Verify from the cluster before onboarding:

```bash theme={null}
aws s3 ls s3://<bucket>/<prefix>/
```

***

## Step 1: Validate and browse the connection

**`POST /connections/browse`**

There is no separate "validate" endpoint. Browsing the catalog *is* the validation step: a `200` with an `items` list (even an empty one) confirms your credentials and connectivity. Bad credentials, an unreachable host, or an unknown catalog type return an error.

* Set `path` to `""` to browse the **root** — this both validates the connection and lists namespaces.
* Set `path` to a namespace name to list the **tables** inside it.

### Request

| Field               | Type   | Required | Description                                                                          |
| ------------------- | ------ | -------- | ------------------------------------------------------------------------------------ |
| `connection.type`   | string | Yes      | Use `CATALOG` for all External Table sources.                                        |
| `connection.params` | object | Yes      | Catalog-specific connection settings (see examples below).                           |
| `path`              | string | No       | Where to browse. `""` or omitted = root. A namespace name = that namespace's tables. |

```bash theme={null}
curl -X POST "$CONTROLLER/connections/browse" \
  -H "Content-Type: application/json" \
  -d '{
    "connection": {
      "type": "CATALOG",
      "params": {
        "catalogType": "iceberg-rest",
        "catalog.iceberg-rest.restUri": "https://glue.us-west-1.amazonaws.com",
        "catalog.iceberg-rest.serviceType": "glue",
        "catalog.iceberg-rest.warehouse": "123456789012",
        "catalog.iceberg-rest.auth.rest.authType": "aws-sigv4",
        "catalog.iceberg-rest.auth.rest.accessKeyId": "<key>",
        "catalog.iceberg-rest.auth.rest.secretAccessKey": "<secret>",
        "catalog.iceberg-rest.auth.rest.region": "us-west-1",
        "catalog.iceberg-rest.auth.rest.service": "glue",
        "catalog.iceberg-rest.auth.storage.authType": "aws-sigv4",
        "catalog.iceberg-rest.auth.storage.accessKeyId": "<key>",
        "catalog.iceberg-rest.auth.storage.secretAccessKey": "<secret>",
        "catalog.iceberg-rest.auth.storage.region": "us-west-1"
      }
    },
    "path": ""
  }'
```

### Response

```json theme={null}
{
  "items": [
    { "name": "transportation", "type": "NAMESPACE" },
    { "name": "nyc_taxi_trips",  "type": "TABLE" }
  ]
}
```

| `items[].type` | Meaning                                                     |
| -------------- | ----------------------------------------------------------- |
| `NAMESPACE`    | A namespace you can drill into — pass its `name` as `path`. |
| `TABLE`        | A selectable table.                                         |
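When scripting the browse step, it helps to pull names of one type out of the response. The sketch below assumes the response shape shown above (`items[].name`, `items[].type`) and that `jq` is installed. `browse_names` is a hypothetical helper name.

```bash
# Print the names of browse items of one type (NAMESPACE or TABLE).
# Reads a /connections/browse response on stdin. Requires jq.
browse_names() {
  jq -r --arg type "$1" '.items[] | select(.type == $type) | .name'
}

# Usage (response saved to browse.json):
#   browse_names NAMESPACE < browse.json
#   browse_names TABLE < browse.json
```

Each namespace name printed this way can be passed back as `path` to list the tables inside it.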

***

## Step 2: Preview the schema

**`POST /tables/preview`**

Samples the source, infers a Pinot schema, and returns an enriched table config (S3 tier, raw field configs, time column) plus sample rows. Review it, tweak if needed, then carry the schema and config forward to steps 3 and 4.

<Note>
  The request and response share the same JSON shape. You send a `tableConfig` describing the source; the response fills in `schema`, the enriched `tableConfigs.offline`, sampled `rows`, and a `summary`.
</Note>

### Request

**Top-level fields:**

| Field         | Type   | Required | Description                                                                             |
| ------------- | ------ | -------- | --------------------------------------------------------------------------------------- |
| `tableConfig` | object | Yes      | A single OFFLINE table config whose `ExternalTableSyncTask` block describes the source. |
| `config`      | object | No       | Sampling and inference controls (see below). Defaults are applied if omitted.           |
| `schema`      | object | No       | An explicit schema to use instead of inferring one.                                     |

**`config.inference` — how the schema is derived:**

| Field                    | Type    | Default | Description                                                                                   |
| ------------------------ | ------- | ------- | --------------------------------------------------------------------------------------------- |
| `embeddedSchemaEnabled`  | boolean | `false` | Read the schema embedded in the Iceberg metadata. **Set this to `true` for External Tables.** |
| `schemaInferenceEnabled` | boolean | `false` | Infer the schema from sampled rows. Fallback when no embedded schema exists.                  |

Resolution order: explicit `schema` → embedded schema → inferred schema.

**`config.sampling` — how rows are sampled:**

| Field | Type    | Default | Description                |
| ----- | ------- | ------- | -------------------------- |
| `min` | integer | `1`     | Minimum records to sample. |
| `max` | integer | `100`   | Maximum records to sample. |
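As an illustration of how the inference and sampling fields nest, the sketch below prints a `config` fragment that prefers the embedded schema, keeps row-based inference as the fallback, and caps sampling at 50 rows. The field names come from the tables above. The value `50` and the helper name `preview_config` are arbitrary choices for the example, and enabling both inference flags together is an assumption based on the stated resolution order.

```bash
# Print a preview "config" fragment (field names from the tables above).
preview_config() {
  cat <<'EOF'
{
  "config": {
    "inference": {
      "embeddedSchemaEnabled": true,
      "schemaInferenceEnabled": true
    },
    "sampling": { "min": 1, "max": 50 }
  }
}
EOF
}

# Usage: preview_config > preview-config.json
```

The `config` object sits next to `tableConfig` at the top level of the preview request body.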

<Tip>
  To list source files without sampling any data, set `config.previewFiles.previewFilesOnly = true`. The response returns matching file URIs in `sourceFiles` and skips schema inference. (OFFLINE only.)
</Tip>

```bash theme={null}
curl -X POST "$CONTROLLER/tables/preview" \
  -H "Content-Type: application/json" \
  -d '{
    "tableConfig": {
      "tableName": "nyc_taxi_trips_OFFLINE",
      "tableType": "OFFLINE",
      "task": {
        "taskTypeConfigsMap": {
          "ExternalTableSyncTask": {
            "catalogType": "iceberg-rest",
            "executor": "controller",
            "inputFormat": "parquet",
            "catalog.iceberg-rest.restUri": "https://glue.us-west-1.amazonaws.com",
            "catalog.iceberg-rest.serviceType": "glue",
            "catalog.iceberg-rest.warehouse": "123456789012",
            "catalog.iceberg-rest.table.namespace": "transportation",
            "catalog.iceberg-rest.table.tableName": "nyc_taxi_trips",
            "catalog.iceberg-rest.auth.rest.authType": "aws-sigv4",
            "catalog.iceberg-rest.auth.rest.accessKeyId": "<key>",
            "catalog.iceberg-rest.auth.rest.secretAccessKey": "<secret>",
            "catalog.iceberg-rest.auth.rest.region": "us-west-1",
            "catalog.iceberg-rest.auth.rest.service": "glue",
            "catalog.iceberg-rest.auth.storage.authType": "aws-sigv4",
            "catalog.iceberg-rest.auth.storage.accessKeyId": "<key>",
            "catalog.iceberg-rest.auth.storage.secretAccessKey": "<secret>",
            "catalog.iceberg-rest.auth.storage.region": "us-west-1"
          }
        }
      }
    },
    "config": { "inference": { "embeddedSchemaEnabled": true } }
  }'
```

<Note>
  **Setting the namespace and table.** Take them from the [browse](#step-1-validate-and-browse-the-connection) response — set `catalog.iceberg-rest.table.namespace` / `.tableName` to the `NAMESPACE` and `TABLE` you selected.

  **Use `executor: controller`** — it's required for the controller-watcher flow and the [observability endpoints](../observability). The input `tableConfig` is intentionally minimal; `/tables/preview` returns the complete `tableConfigs.offline` you persist in Step 4.
</Note>

### Response

| Field                  | Type   | Description                                                                                                 |
| ---------------------- | ------ | ----------------------------------------------------------------------------------------------------------- |
| `schema`               | object | The resolved Pinot schema. **Use this in step 3.**                                                          |
| `tableConfigs.offline` | object | The enriched OFFLINE table config, ready to persist. **Use this in step 4.** Secrets are masked as `*****`. |
| `rows`                 | array  | Sample records after schema + ingestion transforms.                                                         |
| `sourceRows`           | array  | Raw records before transformation.                                                                          |
| `summary`              | object | Run summary: `nSourceRows`, `nRows`, `nColumns`, and `summary.batch.nMatchingFiles` (files found).          |

If needed, adjust the schema (time column, column names, null handling) before moving on. Steps 3 and 4 persist exactly what you send.
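A quick sanity check on the preview response can catch an empty or misconfigured source before anything is persisted. The sketch below assumes the `summary` fields listed above sit at `.summary.nRows` and `.summary.batch.nMatchingFiles`, and that `jq` is installed. `preview_found_data` is a hypothetical helper name.

```bash
# Exit 0 when a preview response reports at least one matching file and one sampled row.
# Reads the /tables/preview response on stdin. Requires jq.
preview_found_data() {
  jq -e '(.summary.batch.nMatchingFiles // 0) > 0 and (.summary.nRows // 0) > 0' > /dev/null
}

# Usage:
#   preview_found_data < preview.json || echo "preview matched no data" >&2
```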

***

## Step 3: Create the schema

**`POST /schemas`**

Save the `schema` object from the preview response as `schema.json`, set its `schemaName` to the table name without the `_OFFLINE` suffix (here, `nyc_taxi_trips`), and send it.

```bash theme={null}
curl -X POST "$CONTROLLER/schemas" \
  -H "Content-Type: application/json" \
  -d @schema.json
```

```json theme={null}
{ "status": "nyc_taxi_trips successfully added" }
```

***

## Step 4: Create the table

**`POST /tables`**

Save the enriched `tableConfigs.offline` object from the preview response as `tableConfig.json` and send it. Its `ExternalTableSyncTask` block is what marks the table as external.

```bash theme={null}
curl -X POST "$CONTROLLER/tables" \
  -H "Content-Type: application/json" \
  -d @tableConfig.json
```

```json theme={null}
{ "status": "Table nyc_taxi_trips_OFFLINE successfully added" }
```

The controller's External Table watcher then discovers the table, runs the first sync, and re-syncs at the `schedule` (cron) interval. There is no separate start call.

<Tip>
  The first sync runs on the watcher's next tick. To kick it off immediately instead of waiting, you can manually trigger a run:

  ```bash theme={null}
  curl -X POST "$CONTROLLER/tasks/schedule?taskType=ExternalTableSyncTask&tableName=nyc_taxi_trips_OFFLINE"
  ```
</Tip>
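The preview response masks secrets as `*****`. This page does not say whether the controller restores masked values when the config is posted back, so it may be worth checking the saved config before sending it. A minimal sketch of such a check follows. `has_masked_secrets` is a hypothetical helper name.

```bash
# Exit 0 if the file still contains a value masked as "*****".
has_masked_secrets() {
  grep -q '"\*\*\*\*\*"' "$1"
}

# Usage:
#   if has_masked_secrets tableConfig.json; then
#     echo "tableConfig.json contains masked secrets" >&2
#   fi
```

If the check matches and your cluster does not restore masked values, put the real credentials back into the `ExternalTableSyncTask` block before posting.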

***

## Quickstart: onboard a table end-to-end

The whole flow as one script. Fill in the variables at the top, run the script, and the first sync starts automatically.

```bash theme={null}
#!/usr/bin/env bash
set -euo pipefail

CONTROLLER="https://<your-controller>"
AUTH="Authorization: Bearer <token>"
GLUE_URI="https://glue.us-west-1.amazonaws.com"
WAREHOUSE="123456789012"          # AWS account ID
NAMESPACE="transportation"        # Glue database
TABLE="nyc_taxi_trips"
REGION="us-west-1"
ACCESS_KEY="<key>"
SECRET_KEY="<secret>"

# 1. Validate & browse root — lists Glue namespaces.
curl -sf -X POST "$CONTROLLER/connections/browse" \
  -H "$AUTH" -H 'Content-Type: application/json' \
  --data-raw "{
    \"connection\": {
      \"type\": \"CATALOG\",
      \"params\": {
        \"catalogType\": \"iceberg-rest\",
        \"catalog.iceberg-rest.restUri\": \"$GLUE_URI\",
        \"catalog.iceberg-rest.serviceType\": \"glue\",
        \"catalog.iceberg-rest.warehouse\": \"$WAREHOUSE\",
        \"catalog.iceberg-rest.auth.rest.authType\": \"aws-sigv4\",
        \"catalog.iceberg-rest.auth.rest.accessKeyId\": \"$ACCESS_KEY\",
        \"catalog.iceberg-rest.auth.rest.secretAccessKey\": \"$SECRET_KEY\",
        \"catalog.iceberg-rest.auth.rest.region\": \"$REGION\",
        \"catalog.iceberg-rest.auth.rest.service\": \"glue\",
        \"catalog.iceberg-rest.auth.storage.authType\": \"aws-sigv4\",
        \"catalog.iceberg-rest.auth.storage.accessKeyId\": \"$ACCESS_KEY\",
        \"catalog.iceberg-rest.auth.storage.secretAccessKey\": \"$SECRET_KEY\",
        \"catalog.iceberg-rest.auth.storage.region\": \"$REGION\"
      }
    },
    \"path\": \"\"
  }" | jq .

# 2. Preview — infer schema + enriched table config.
curl -sf -X POST "$CONTROLLER/tables/preview" \
  -H "$AUTH" -H 'Content-Type: application/json' \
  --data-raw "{
    \"tableConfig\": {
      \"tableName\": \"${TABLE}_OFFLINE\",
      \"tableType\": \"OFFLINE\",
      \"task\": {
        \"taskTypeConfigsMap\": {
          \"ExternalTableSyncTask\": {
            \"catalogType\": \"iceberg-rest\",
            \"executor\": \"controller\",
            \"inputFormat\": \"parquet\",
            \"catalog.iceberg-rest.restUri\": \"$GLUE_URI\",
            \"catalog.iceberg-rest.serviceType\": \"glue\",
            \"catalog.iceberg-rest.warehouse\": \"$WAREHOUSE\",
            \"catalog.iceberg-rest.table.namespace\": \"$NAMESPACE\",
            \"catalog.iceberg-rest.table.tableName\": \"$TABLE\",
            \"catalog.iceberg-rest.auth.rest.authType\": \"aws-sigv4\",
            \"catalog.iceberg-rest.auth.rest.accessKeyId\": \"$ACCESS_KEY\",
            \"catalog.iceberg-rest.auth.rest.secretAccessKey\": \"$SECRET_KEY\",
            \"catalog.iceberg-rest.auth.rest.region\": \"$REGION\",
            \"catalog.iceberg-rest.auth.rest.service\": \"glue\",
            \"catalog.iceberg-rest.auth.storage.authType\": \"aws-sigv4\",
            \"catalog.iceberg-rest.auth.storage.accessKeyId\": \"$ACCESS_KEY\",
            \"catalog.iceberg-rest.auth.storage.secretAccessKey\": \"$SECRET_KEY\",
            \"catalog.iceberg-rest.auth.storage.region\": \"$REGION\"
          }
        }
      }
    },
    \"config\": { \"inference\": { \"embeddedSchemaEnabled\": true } }
  }" > preview.json

# 3. Extract schema and table config from preview response.
jq --arg t "$TABLE" '.schema | .schemaName = $t' preview.json > schema.json
jq '.tableConfigs.offline'                        preview.json > tableConfig.json

# 4. Create the schema.
curl -sf -X POST "$CONTROLLER/schemas" \
  -H "$AUTH" -H 'Content-Type: application/json' -d @schema.json

# 5. Create the table — watcher starts the first sync automatically.
curl -sf -X POST "$CONTROLLER/tables" \
  -H "$AUTH" -H 'Content-Type: application/json' -d @tableConfig.json

# 6. (optional) Poll every 30 seconds until the sync completes (Ctrl-C to stop).
until curl -sf "$CONTROLLER/tables/${TABLE}_OFFLINE/externalTable/status" -H "$AUTH" \
  | jq -e '.status == "COMPLETED" and .segmentsUploaded == .filesDiscovered' > /dev/null; do
  sleep 30
done
```

Once step 6 exits `0`, [verify it's queryable](#verify-its-queryable).
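The script interpolates shell variables directly into hand-escaped JSON, which breaks if a value contains a quote or backslash. One alternative, sketched here under the assumption that `jq` is installed, is to let `jq -n --arg` build the body so every value is JSON-escaped. `browse_body` is a hypothetical helper name. The keys and variable names are the ones the script already uses.

```bash
# Build the /connections/browse body with jq so every value is JSON-escaped.
# Reads GLUE_URI, WAREHOUSE, REGION, ACCESS_KEY, SECRET_KEY (same names as the script).
browse_body() {
  jq -n \
    --arg uri "$GLUE_URI" --arg warehouse "$WAREHOUSE" --arg region "$REGION" \
    --arg key "$ACCESS_KEY" --arg secret "$SECRET_KEY" \
    '{
      connection: {
        type: "CATALOG",
        params: {
          "catalogType": "iceberg-rest",
          "catalog.iceberg-rest.restUri": $uri,
          "catalog.iceberg-rest.serviceType": "glue",
          "catalog.iceberg-rest.warehouse": $warehouse,
          "catalog.iceberg-rest.auth.rest.authType": "aws-sigv4",
          "catalog.iceberg-rest.auth.rest.accessKeyId": $key,
          "catalog.iceberg-rest.auth.rest.secretAccessKey": $secret,
          "catalog.iceberg-rest.auth.rest.region": $region,
          "catalog.iceberg-rest.auth.rest.service": "glue",
          "catalog.iceberg-rest.auth.storage.authType": "aws-sigv4",
          "catalog.iceberg-rest.auth.storage.accessKeyId": $key,
          "catalog.iceberg-rest.auth.storage.secretAccessKey": $secret,
          "catalog.iceberg-rest.auth.storage.region": $region
        }
      },
      path: ""
    }'
}

# Usage: pipe the body to curl, which reads it from stdin via "@-".
#   browse_body | curl -sf -X POST "$CONTROLLER/connections/browse" \
#     -H "$AUTH" -H 'Content-Type: application/json' --data-binary @-
```

The same pattern extends to the preview request by adding the `table.namespace` and `table.tableName` keys.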

## Monitor onboarding

Three read-only endpoints report ingestion progress: run status, ingestion checkpoint, and source file count. They require `executor=controller`, which the examples on this page already set in the `ExternalTableSyncTask` block. See [Observability](../observability) for full request and response details.
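For automation, a bounded wait is usually safer than polling forever. The sketch below reuses the status endpoint and the completion condition from the quickstart script. The helper names `sync_done` and `wait_for_sync`, and the default timeout and interval, are assumptions for illustration. It omits the `Authorization` header, like the other examples on this page, and requires `jq`.

```bash
# Exit 0 when a status response (on stdin) reports a finished, fully uploaded sync.
sync_done() {
  jq -e '.status == "COMPLETED" and .segmentsUploaded == .filesDiscovered' > /dev/null
}

# Poll the status endpoint until sync_done succeeds or the timeout expires.
# usage: wait_for_sync <tableName_OFFLINE> [timeout_seconds] [interval_seconds]
wait_for_sync() {
  local table="$1" timeout="${2:-1800}" interval="${3:-30}" waited=0
  while [ "$waited" -lt "$timeout" ]; do
    if curl -sf "$CONTROLLER/tables/$table/externalTable/status" | sync_done; then
      return 0
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  echo "timed out after ${timeout}s waiting for $table" >&2
  return 1
}

# Usage: wait_for_sync nyc_taxi_trips_OFFLINE 3600 60
```

A non-zero exit lets a CI job or provisioning script fail clearly instead of hanging.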

## Verify it's queryable

When `status` is `COMPLETED` and `segmentsUploaded` matches `filesDiscovered`, run a query against the broker (or the Data Portal query console) to confirm the data is live:

```sql theme={null}
SELECT count(*) FROM nyc_taxi_trips;
```

Unless the table is tiny, expect the count to be far larger than the preview's `summary.nSourceRows`, because preview samples at most `config.sampling.max` rows (100 by default). Confirm the count is non-zero and plausible for your dataset. If `status` is `COMPLETED` but the count is `0`, give segments a moment to load on the servers, then recheck; if it persists, see [Troubleshooting](../troubleshooting).

***

## What's next

Now that the table is created and data is loading, these are the highest-impact follow-up steps:

1. **Add indexes for your query patterns.** Without indexes, every query scans all remote Parquet data. Add a range index on time/numeric columns, an inverted index on low-cardinality filter columns, and a bloom filter on high-cardinality ID columns. → [Indexes](../indexes)

2. **Enable caching and preload.** Set `enable.prefetch.page.cache=true` and `preload.enable=true` on the S3 tier so index data is served from local disk on repeated queries instead of re-fetched from S3. → [Data and Index Caching](../data-and-index-caching)

3. **Protect large-scan queries from OOM.** For tables that receive heavy aggregations or wide scans, enable the query OOM killer so a runaway query is killed instead of crashing the server. → [Best Practices & Configs — Query OOM protection](../best-practices-and-configs#query-oom-protection-large-scans)

4. **Monitor ongoing syncs.** Use the observability endpoints to check run status, ingestion checkpoint, and source file count after each scheduled sync. → [Observability](../observability)

***

For common questions and failures, see the [FAQ](../faq) and [Troubleshooting](../troubleshooting).
