GCS Data Lake: Onboarding via API

This feature requires StarTree release 0.14.0 or later, and must be enabled on demand — contact StarTree support to activate it.

This page shows how to onboard a GCS Data Lake External Table with the StarTree controller REST APIs instead of the Data Portal UI. Use it for automation, infrastructure-as-code, or onboarding many tables at once. GCS Data Lake uses Google Cloud Storage accessed through its S3-compatible (“interop”) endpoint (storage.googleapis.com). The catalog type is gcs-interop and it reuses the same S3 wiring as S3 Data Lake — only the endpoint, addressing style, and credential source differ.

How it works

Onboarding is four calls. Each one feeds the next:

Validate & browse   POST /connections/browse   →  validate; list files under the GCS prefix
Preview             POST /tables/preview       →  infer schema + enriched table config
Create schema       POST /schemas              →  register the schema
Create table        POST /tables               →  register the External Table

The preview call (step 2) is the core step: it samples the source, infers a Pinot schema, and returns a ready-to-use table config. Steps 3 and 4 just persist its output. Once the table exists, the controller’s watcher runs the first sync and re-syncs on schedule, so no manual trigger is required. Track progress with the observability endpoints.

All paths are relative to your controller base URL. The examples assume export CONTROLLER=https://<your-controller>. On StarTree Cloud, the controller is reached through the data-plane proxy, so use export CONTROLLER=https://<data-plane-host>/api/pinot.

If your controller requires authentication, add an Authorization header to every request, for example -H "Authorization: Bearer <token>". The per-step examples below omit it for brevity; the quickstart script includes it.
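When scripting these calls, it can help to wrap curl once so the optional Authorization header is added in a single place. The sketch below is one possible way to do that in plain POSIX shell. pinot_curl and TOKEN are hypothetical names introduced here for illustration; only CONTROLLER comes from this page.

```shell
# Hypothetical helper: pinot_curl and TOKEN are illustrative names, not part of StarTree.
# CONTROLLER is the controller base URL described above.
CONTROLLER="${CONTROLLER:-https://<your-controller>}"

# Run curl with -sf, adding the Authorization header only when TOKEN is non-empty.
pinot_curl() {
  if [ -n "${TOKEN:-}" ]; then
    curl -sf -H "Authorization: Bearer $TOKEN" "$@"
  else
    curl -sf "$@"
  fi
}

# Example (not executed here):
#   pinot_curl -X POST "$CONTROLLER/connections/browse" \
#     -H "Content-Type: application/json" -d @browse.json
```

Leaving TOKEN unset or empty sends the request without the header, so the same wrapper works on clusters with and without authentication.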

Prerequisites

StarTree release with GCS External Table support enabled, and tiered storage configured on the cluster with a GCS_INTEROPERABLE tier backend. Contact StarTree support if unsure.
Network access to the controller REST endpoint, plus an Authorization token if your cluster requires auth.
A GCS bucket with Parquet files and a GCS HMAC key (access ID + secret) with storage.objects.get and storage.objects.list on the bucket. Generate HMAC keys in the Google Cloud Console → Cloud Storage → Settings → Interoperability.
The GCS bucket name, key prefix, and the project’s region (used as a placeholder — GCS ignores it but the field is required).

Authentication

GCS Data Lake authenticates using HMAC keys (Google’s S3-interop credentials). There are two ways to supply them:

Method	Keys	Use when
Inline HMAC	`accessKey` (HMAC access ID) + `secretKey` (HMAC secret)	Quick tests or when Secret Manager is not available.
GCP Secret Manager	`keyType=SECRET` + secret names for `accessKey`/`secretKey`	Recommended for production — no HMAC material stored in the table config.

For Secret Manager, provide:

keyType: SECRET
secretmanagertype: GCS
gcpprojectid: your GCP project ID
gcpkeypath: path to a service-account key JSON file with Secret Manager read access

disable.integrity.protections must be set to true for all GCS connections. GCS rejects the AWS SDK’s default request checksums on writes and returns whole-object checksums on range GETs that the SDK mis-validates. Setting this flag relaxes both to WHEN_REQUIRED, which is required for reads and writes to succeed.

Step 1: Validate and browse the connection

POST /connections/browse

There is no separate “validate” endpoint. Browsing the catalog is the validation step: a 200 with an items list (even an empty one) confirms your credentials and connectivity.

Set path to "" to browse the root prefix — this both validates the connection and lists files/directories.

Request

Field	Type	Required	Description
`connection.type`	string	Yes	Use `CATALOG` for all External Table sources.
`connection.params`	object	Yes	GCS connection settings (see examples below).
`path`	string	No	Where to browse. `""` or omitted = root prefix.

The first request below uses inline HMAC keys; the second resolves them through GCP Secret Manager.

curl -X POST "$CONTROLLER/connections/browse" \
  -H "Content-Type: application/json" \
  -d '{
    "connection": {
      "type": "CATALOG",
      "params": {
        "catalogType": "gcs-interop",
        "catalog.gcs-interop.bucketName": "<gcs-bucket>",
        "catalog.gcs-interop.prefix": "<prefix>",
        "catalog.gcs-interop.region": "<region>",
        "catalog.gcs-interop.endpoint": "https://storage.googleapis.com",
        "catalog.gcs-interop.accessKey": "<hmac-access-id>",
        "catalog.gcs-interop.secretKey": "<hmac-secret>",
        "catalog.gcs-interop.disable.integrity.protections": "true"
      }
    },
    "path": ""
  }'

curl -X POST "$CONTROLLER/connections/browse" \
  -H "Content-Type: application/json" \
  -d '{
    "connection": {
      "type": "CATALOG",
      "params": {
        "catalogType": "gcs-interop",
        "catalog.gcs-interop.bucketName": "<gcs-bucket>",
        "catalog.gcs-interop.prefix": "<prefix>",
        "catalog.gcs-interop.region": "<region>",
        "catalog.gcs-interop.endpoint": "https://storage.googleapis.com",
        "catalog.gcs-interop.accessKey": "<access-id-secret-name>",
        "catalog.gcs-interop.secretKey": "<secret-key-secret-name>",
        "catalog.gcs-interop.keyType": "SECRET",
        "catalog.gcs-interop.secretmanagertype": "GCS",
        "catalog.gcs-interop.gcpprojectid": "<gcp-project-id>",
        "catalog.gcs-interop.gcpkeypath": "<path-to-sa-key.json>",
        "catalog.gcs-interop.disable.integrity.protections": "true"
      }
    },
    "path": ""
  }'

Response

{
  "items": [
    { "name": "2024/01/", "type": "DIR" },
    { "name": "data.parquet", "type": "FILE" }
  ]
}

`items[].type`	Meaning
`DIR`	A directory you can drill into — pass its `name` as `path`.
`FILE`	A Parquet file under the prefix.
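To drill into a directory, re-send the same browse request with path set to the entry's name. A minimal sketch in POSIX shell follows, assuming the variable names used in the quickstart script on this page (BUCKET, PREFIX, REGION, HMAC_KEY, HMAC_SECRET). browse_body is a hypothetical helper, and it does no JSON escaping, so it only suits values without quotes or backslashes.

```shell
# Hypothetical helper: print a /connections/browse request body for a given path.
# Assumes BUCKET, PREFIX, REGION, HMAC_KEY and HMAC_SECRET are already set.
# No JSON escaping is done, so avoid values containing quotes or backslashes.
browse_body() {
  cat <<EOF
{
  "connection": {
    "type": "CATALOG",
    "params": {
      "catalogType": "gcs-interop",
      "catalog.gcs-interop.bucketName": "$BUCKET",
      "catalog.gcs-interop.prefix": "$PREFIX",
      "catalog.gcs-interop.region": "$REGION",
      "catalog.gcs-interop.endpoint": "https://storage.googleapis.com",
      "catalog.gcs-interop.accessKey": "$HMAC_KEY",
      "catalog.gcs-interop.secretKey": "$HMAC_SECRET",
      "catalog.gcs-interop.disable.integrity.protections": "true"
    }
  },
  "path": "$1"
}
EOF
}

# Example (not executed here): list the contents of the 2024/01/ directory.
#   browse_body "2024/01/" | curl -sf -X POST "$CONTROLLER/connections/browse" \
#     -H "Content-Type: application/json" -d @-
```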

Step 2: Preview the schema

POST /tables/preview

Samples the Parquet files, infers a Pinot schema, and returns an enriched table config plus sample rows. Review it, tweak if needed, then carry the schema and config forward to steps 3 and 4.

The first request below uses inline HMAC keys; the second resolves them through GCP Secret Manager.

curl -X POST "$CONTROLLER/tables/preview" \
  -H "Content-Type: application/json" \
  -d '{
    "tableConfig": {
      "tableName": "my_gcs_table_OFFLINE",
      "tableType": "OFFLINE",
      "task": {
        "taskTypeConfigsMap": {
          "ExternalTableSyncTask": {
            "catalogType": "gcs-interop",
            "executor": "controller",
            "inputFormat": "parquet",
            "catalog.gcs-interop.bucketName": "<gcs-bucket>",
            "catalog.gcs-interop.prefix": "<prefix>",
            "catalog.gcs-interop.region": "<region>",
            "catalog.gcs-interop.endpoint": "https://storage.googleapis.com",
            "catalog.gcs-interop.accessKey": "<hmac-access-id>",
            "catalog.gcs-interop.secretKey": "<hmac-secret>",
            "catalog.gcs-interop.table.namespace": "default",
            "catalog.gcs-interop.table.tableName": "my_gcs_table",
            "catalog.gcs-interop.disable.integrity.protections": "true"
          }
        }
      }
    },
    "config": {
      "inference": { "embeddedSchemaEnabled": true },
      "sampling":  { "min": 1, "max": 100 }
    }
  }'

curl -X POST "$CONTROLLER/tables/preview" \
  -H "Content-Type: application/json" \
  -d '{
    "tableConfig": {
      "tableName": "my_gcs_table_OFFLINE",
      "tableType": "OFFLINE",
      "task": {
        "taskTypeConfigsMap": {
          "ExternalTableSyncTask": {
            "catalogType": "gcs-interop",
            "executor": "controller",
            "inputFormat": "parquet",
            "catalog.gcs-interop.bucketName": "<gcs-bucket>",
            "catalog.gcs-interop.prefix": "<prefix>",
            "catalog.gcs-interop.region": "<region>",
            "catalog.gcs-interop.endpoint": "https://storage.googleapis.com",
            "catalog.gcs-interop.accessKey": "<access-id-secret-name>",
            "catalog.gcs-interop.secretKey": "<secret-key-secret-name>",
            "catalog.gcs-interop.keyType": "SECRET",
            "catalog.gcs-interop.secretmanagertype": "GCS",
            "catalog.gcs-interop.gcpprojectid": "<gcp-project-id>",
            "catalog.gcs-interop.gcpkeypath": "<path-to-sa-key.json>",
            "catalog.gcs-interop.table.namespace": "default",
            "catalog.gcs-interop.table.tableName": "my_gcs_table",
            "catalog.gcs-interop.disable.integrity.protections": "true"
          }
        }
      }
    },
    "config": {
      "inference": { "embeddedSchemaEnabled": true },
      "sampling":  { "min": 1, "max": 100 }
    }
  }'

Response

Field	Type	Description
`schema`	object	The resolved Pinot schema. Use this in step 3.
`tableConfigs.offline`	object	The enriched OFFLINE table config, ready to persist. Use this in step 4. Secrets are masked as `*****`.
`rows`	array	Sample records after schema + ingestion transforms.
`sourceRows`	array	Raw records before transformation.
`summary`	object	Run summary: `nSourceRows`, `nRows`, `nColumns`, and `summary.batch.nMatchingFiles` (files found).
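Before persisting anything, a script can check that the preview actually matched files under the prefix. The sketch below assumes the response was saved to a file and that the file count sits at .summary.batch.nMatchingFiles in the JSON. That path is a reading of the table above, not a confirmed one. preview_matched_files is a hypothetical helper and needs jq.

```shell
# Hypothetical helper: succeed only if the saved preview response reports matching files.
# Assumption: the count is at .summary.batch.nMatchingFiles; a missing field counts as 0.
preview_matched_files() {
  jq -e '(.summary.batch.nMatchingFiles // 0) > 0' "$1" > /dev/null
}

# Example (not executed here):
#   preview_matched_files preview.json || { echo "no files matched the prefix" >&2; exit 1; }
```

A zero count usually points at the bucket or prefix values rather than the credentials, since a credential problem would already have surfaced in step 1.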

Step 3: Create the schema

POST /schemas

Send the schema object from the preview response. Set its schemaName to the table name without the _OFFLINE suffix (my_gcs_table in these examples).

curl -X POST "$CONTROLLER/schemas" \
  -H "Content-Type: application/json" \
  -d @schema.json

{ "status": "my_gcs_table successfully added" }

Step 4: Create the table

POST /tables

Send the enriched tableConfigs.offline object from the preview response. The ExternalTableSyncTask block marks the table as external, and the tierConfigs array pins it to the GCS_INTEROPERABLE tier.

curl -X POST "$CONTROLLER/tables" \
  -H "Content-Type: application/json" \
  -d @tableConfig.json

{ "status": "Table my_gcs_table_OFFLINE successfully added" }

The controller’s External Table watcher then discovers the table, runs the first sync, and re-syncs at the schedule (cron) interval. There is no separate start call.

The tableConfigs.offline returned by /tables/preview includes a tierConfigs block pre-configured for GCS_INTEROPERABLE. Verify that the tier backend credentials match the source catalog credentials (they can differ if you use separate HMAC keys or Secret Manager secrets for the deep store).

The first sync runs on the watcher’s next tick. To kick it off immediately:

curl -X POST "$CONTROLLER/tasks/schedule?taskType=ExternalTableSyncTask&tableName=my_gcs_table_OFFLINE"

Quickstart: onboard a table end-to-end

The whole flow as one script (inline HMAC mode). Fill in the variables at the top, run the script, and the first sync starts automatically.

#!/usr/bin/env bash
set -euo pipefail

CONTROLLER="https://<your-controller>"
AUTH="Authorization: Bearer <token>"
BUCKET="<gcs-bucket>"
PREFIX="<prefix>"
REGION="<region>"       # Required field; GCS ignores the value
HMAC_KEY="<hmac-access-id>"
HMAC_SECRET="<hmac-secret>"
TABLE="my_gcs_table"

# 1. Validate & browse — confirms credentials and lists files.
curl -sf -X POST "$CONTROLLER/connections/browse" \
  -H "$AUTH" -H 'Content-Type: application/json' \
  --data-raw "{
    \"connection\": {
      \"type\": \"CATALOG\",
      \"params\": {
        \"catalogType\": \"gcs-interop\",
        \"catalog.gcs-interop.bucketName\": \"$BUCKET\",
        \"catalog.gcs-interop.prefix\": \"$PREFIX\",
        \"catalog.gcs-interop.region\": \"$REGION\",
        \"catalog.gcs-interop.endpoint\": \"https://storage.googleapis.com\",
        \"catalog.gcs-interop.accessKey\": \"$HMAC_KEY\",
        \"catalog.gcs-interop.secretKey\": \"$HMAC_SECRET\",
        \"catalog.gcs-interop.disable.integrity.protections\": \"true\"
      }
    },
    \"path\": \"\"
  }" | jq .

# 2. Preview — infer schema + enriched table config.
curl -sf -X POST "$CONTROLLER/tables/preview" \
  -H "$AUTH" -H 'Content-Type: application/json' \
  --data-raw "{
    \"tableConfig\": {
      \"tableName\": \"${TABLE}_OFFLINE\",
      \"tableType\": \"OFFLINE\",
      \"task\": {
        \"taskTypeConfigsMap\": {
          \"ExternalTableSyncTask\": {
            \"catalogType\": \"gcs-interop\",
            \"executor\": \"controller\",
            \"inputFormat\": \"parquet\",
            \"catalog.gcs-interop.bucketName\": \"$BUCKET\",
            \"catalog.gcs-interop.prefix\": \"$PREFIX\",
            \"catalog.gcs-interop.region\": \"$REGION\",
            \"catalog.gcs-interop.endpoint\": \"https://storage.googleapis.com\",
            \"catalog.gcs-interop.accessKey\": \"$HMAC_KEY\",
            \"catalog.gcs-interop.secretKey\": \"$HMAC_SECRET\",
            \"catalog.gcs-interop.table.namespace\": \"default\",
            \"catalog.gcs-interop.table.tableName\": \"$TABLE\",
            \"catalog.gcs-interop.disable.integrity.protections\": \"true\"
          }
        }
      }
    },
    \"config\": {
      \"inference\": { \"embeddedSchemaEnabled\": true },
      \"sampling\":  { \"min\": 1, \"max\": 100 }
    }
  }" > preview.json

# 3. Extract schema and table config from preview response.
jq --arg t "$TABLE" '.schema | .schemaName = $t' preview.json > schema.json
jq '.tableConfigs.offline'                        preview.json > tableConfig.json

# 4. Create the schema.
curl -sf -X POST "$CONTROLLER/schemas" \
  -H "$AUTH" -H 'Content-Type: application/json' -d @schema.json

# 5. Create the table — watcher starts the first sync automatically.
curl -sf -X POST "$CONTROLLER/tables" \
  -H "$AUTH" -H 'Content-Type: application/json' -d @tableConfig.json

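# 6. (optional) Poll until sync completes.
# Loops until the condition holds; interrupt it if the sync fails.
until curl -sf "$CONTROLLER/tables/${TABLE}_OFFLINE/externalTable/status" -H "$AUTH" \
  | jq -e '.status == "COMPLETED" and .segmentsUploaded == .filesDiscovered' > /dev/null
do
  sleep 10   # polling interval is arbitrary; adjust as needed
done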

Once step 6 exits 0, verify it’s queryable.

Monitor onboarding

Three read-only endpoints report ingestion progress — run status, ingestion checkpoint, and source file count. See Observability for full request and response details.

Verify it’s queryable

When status is COMPLETED and segmentsUploaded matches filesDiscovered, run a query to confirm the data is live:

SELECT count(*) FROM my_gcs_table;

If status is COMPLETED but the count is 0, give segments a moment to load on the servers, then recheck; if it persists, see Troubleshooting.

GCS-specific config reference

All keys go under the ExternalTableSyncTask block in taskTypeConfigsMap.

Key	Required	Description
`catalog.gcs-interop.bucketName`	Yes	GCS bucket name.
`catalog.gcs-interop.prefix`	Yes	Key prefix for the Parquet data (e.g. `path/to/data/`).
`catalog.gcs-interop.region`	Yes	Region string. GCS ignores the value but the field must be present.
`catalog.gcs-interop.endpoint`	Yes	Always `https://storage.googleapis.com`.
`catalog.gcs-interop.accessKey`	Yes	HMAC access ID, or GCP Secret Manager secret name if `keyType=SECRET`.
`catalog.gcs-interop.secretKey`	Yes	HMAC secret, or GCP Secret Manager secret name if `keyType=SECRET`.
`catalog.gcs-interop.disable.integrity.protections`	Yes	Must be `"true"`. Required for GCS range-GET and PUT compatibility with the AWS SDK.
`catalog.gcs-interop.keyType`	No	Set to `SECRET` to resolve `accessKey`/`secretKey` via GCP Secret Manager.
`catalog.gcs-interop.secretmanagertype`	If `keyType=SECRET`	Must be `GCS`.
`catalog.gcs-interop.gcpprojectid`	If `keyType=SECRET`	GCP project ID for Secret Manager lookups.
`catalog.gcs-interop.gcpkeypath`	If `keyType=SECRET`	Path to a service-account key JSON with Secret Manager read access.
`catalog.gcs-interop.table.namespace`	No	Logical namespace. Use `default` for a flat prefix.
`catalog.gcs-interop.table.tableName`	No	Logical table name (any value; used internally).
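For automation, a quick pre-flight check can catch a missing required key before the request is sent. The sketch below is a hypothetical helper (missing_gcs_keys is not a StarTree tool) that searches a JSON file for each key marked Yes in the table above. It uses a fixed-string search for the quoted key name, so it confirms presence only and does not validate values or nesting.

```shell
# Hypothetical helper: print each required gcs-interop key that is absent from a JSON file.
# Uses a fixed-string search for the quoted key, so it checks presence, not structure.
missing_gcs_keys() {
  for key in bucketName prefix region endpoint accessKey secretKey \
             disable.integrity.protections; do
    if ! grep -Fq "\"catalog.gcs-interop.$key\"" "$1"; then
      echo "catalog.gcs-interop.$key"
    fi
  done
}

# Example (not executed here):
#   missing="$(missing_gcs_keys tableConfig.json)"
#   [ -z "$missing" ] || { echo "missing keys: $missing" >&2; exit 1; }
```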

What’s next

Add indexes for your query patterns. → Indexes
Enable caching and preload. → Data and Index Caching
Protect large-scan queries from OOM. → Best Practices & Configs — Query OOM protection
Monitor ongoing syncs. → Observability

For common questions and failures, see the FAQ and Troubleshooting.
