This feature requires StarTree release 0.14.0 or later, and must be enabled on demand — contact StarTree support to activate it.
This page shows how to onboard a GCS Data Lake External Table with the StarTree controller REST APIs instead of the Data Portal UI. Use it for automation, infrastructure-as-code, or onboarding many tables at once.
GCS Data Lake uses Google Cloud Storage accessed through its S3-compatible (“interop”) endpoint (storage.googleapis.com). The catalog type is gcs-interop, and it reuses the same S3 wiring as S3 Data Lake: only the endpoint, addressing style, and credential source differ.

How it works

Onboarding is four calls. Each one feeds the next:
1. Validate & browse   POST /connections/browse   →  validate; list files under the GCS prefix
2. Preview             POST /tables/preview       →  infer schema + enriched table config
3. Create schema       POST /schemas              →  register the schema
4. Create table        POST /tables               →  register the External Table
The preview call (step 2) is the core step: it samples the source, infers a Pinot schema, and returns a ready-to-use table config. Steps 3 and 4 just persist its output. Once the table exists, the controller’s watcher runs the first sync and re-syncs on schedule — no manual trigger. Track progress with the observability endpoints.
All paths are relative to your controller base URL. The examples assume export CONTROLLER=https://<your-controller>. On StarTree Cloud, the controller is reached through the data-plane proxy, so use export CONTROLLER=https://<data-plane-host>/api/pinot.
If your controller requires authentication, add an Authorization header to every request, e.g. -H "Authorization: Bearer <token>". The per-step examples below omit it for brevity; the Quickstart script includes it.
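Because every call shares the same base URL and headers, a small wrapper can keep later commands short. The following is a minimal sketch, not part of StarTree's tooling: the function name controller_curl and the CONTROLLER_TOKEN variable are hypothetical names introduced here for illustration.

```shell
# Hypothetical helper: controller_curl METHOD PATH [extra curl args...]
# Assumes CONTROLLER is exported as described above. CONTROLLER_TOKEN is an
# assumed variable name; when it is unset, no Authorization header is sent.
controller_curl() {
  local method="$1" path="$2"
  shift 2
  local auth=()
  if [ -n "${CONTROLLER_TOKEN:-}" ]; then
    auth=(-H "Authorization: Bearer ${CONTROLLER_TOKEN}")
  fi
  # ${CONTROLLER%/} drops one trailing slash so the path joins cleanly.
  curl -sS -X "$method" "${CONTROLLER%/}${path}" \
    -H "Content-Type: application/json" \
    ${auth[@]+"${auth[@]}"} "$@"
}
```

With this in place, controller_curl POST /connections/browse -d @browse-request.json would send the same kind of request as the step 1 example, where browse-request.json is a file you create holding the request body.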

Prerequisites

  • StarTree release with GCS External Table support enabled, and tiered storage configured on the cluster with a GCS_INTEROPERABLE tier backend. Contact StarTree support if unsure.
  • Network access to the controller REST endpoint, plus an Authorization token if your cluster requires auth.
  • A GCS bucket with Parquet files and a GCS HMAC key (access ID + secret) with storage.objects.get and storage.objects.list on the bucket. Generate HMAC keys in the Google Cloud Console → Cloud Storage → Settings → Interoperability.
  • The GCS bucket name, key prefix, and the project’s region (used as a placeholder — GCS ignores it but the field is required).

Authentication

GCS Data Lake authenticates using HMAC keys (Google’s S3-interop credentials). There are two ways to supply them:
| Method | Keys | Use when |
| --- | --- | --- |
| Inline HMAC | accessKey (HMAC access ID) + secretKey (HMAC secret) | Quick tests or when Secret Manager is not available. |
| GCP Secret Manager | keyType=SECRET + secret names for accessKey/secretKey | Recommended for production: no HMAC material is stored in the table config. |
For Secret Manager, provide:
  • keyType: SECRET
  • secretmanagertype: GCS
  • gcpprojectid: your GCP project ID
  • gcpkeypath: path to a service-account key JSON file with Secret Manager read access
disable.integrity.protections must be set to true for all GCS connections. GCS rejects the AWS SDK’s default request checksums on writes and returns whole-object checksums on range GETs that the SDK mis-validates. Setting this flag relaxes both to WHEN_REQUIRED, which is required for reads and writes to succeed.
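As an illustration of Secret Manager mode, the connection params might look like the sketch below. It is assembled from the key names in the config reference on this page rather than copied from a verified request, and the output file name gcs-secret-params.json is a hypothetical choice. Every value in angle brackets is a placeholder.

```shell
# Writes example connection params for Secret Manager mode to a local file.
# All <...> values are placeholders. In this mode accessKey and secretKey hold
# Secret Manager secret NAMES, not the HMAC material itself.
cat > gcs-secret-params.json <<'EOF'
{
  "catalogType": "gcs-interop",
  "catalog.gcs-interop.bucketName": "<gcs-bucket>",
  "catalog.gcs-interop.prefix": "<prefix>",
  "catalog.gcs-interop.region": "<region>",
  "catalog.gcs-interop.endpoint": "https://storage.googleapis.com",
  "catalog.gcs-interop.keyType": "SECRET",
  "catalog.gcs-interop.secretmanagertype": "GCS",
  "catalog.gcs-interop.gcpprojectid": "<gcp-project-id>",
  "catalog.gcs-interop.gcpkeypath": "<path-to-service-account-key.json>",
  "catalog.gcs-interop.accessKey": "<secret-name-for-hmac-access-id>",
  "catalog.gcs-interop.secretKey": "<secret-name-for-hmac-secret>",
  "catalog.gcs-interop.disable.integrity.protections": "true"
}
EOF
```

The same object would take the place of connection.params in the browse request, or be merged into the ExternalTableSyncTask block for the preview request.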

Step 1: Validate and browse the connection

POST /connections/browse
There is no separate “validate” endpoint. Browsing the catalog is the validation step: a 200 with an items list (even an empty one) confirms your credentials and connectivity.
  • Set path to "" to browse the root prefix — this both validates the connection and lists files/directories.

Request

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| connection.type | string | Yes | Use CATALOG for all External Table sources. |
| connection.params | object | Yes | GCS connection settings (see examples below). |
| path | string | No | Where to browse. "" or omitted = root prefix. |
curl -X POST "$CONTROLLER/connections/browse" \
  -H "Content-Type: application/json" \
  -d '{
    "connection": {
      "type": "CATALOG",
      "params": {
        "catalogType": "gcs-interop",
        "catalog.gcs-interop.bucketName": "<gcs-bucket>",
        "catalog.gcs-interop.prefix": "<prefix>",
        "catalog.gcs-interop.region": "<region>",
        "catalog.gcs-interop.endpoint": "https://storage.googleapis.com",
        "catalog.gcs-interop.accessKey": "<hmac-access-id>",
        "catalog.gcs-interop.secretKey": "<hmac-secret>",
        "catalog.gcs-interop.disable.integrity.protections": "true"
      }
    },
    "path": ""
  }'

Response

{
  "items": [
    { "name": "2024/01/", "type": "DIR" },
    { "name": "data.parquet", "type": "FILE" }
  ]
}
| items[].type | Meaning |
| --- | --- |
| DIR | A directory you can drill into: pass its name as path. |
| FILE | A Parquet file under the prefix. |
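To drill into a directory, the DIR names from one response become the path of the next request. A minimal sketch of that step with jq (the tool the Quickstart script also relies on) follows; the helper names list_dirs and with_path are hypothetical, and both operate on JSON files you have saved locally.

```shell
# Hypothetical helpers for walking the prefix; both read a saved JSON file.

# Print the name of every DIR entry in a /connections/browse response.
list_dirs() {
  jq -r '.items[] | select(.type == "DIR") | .name' "$1"
}

# Print a copy of a browse request body with "path" set to the given value.
with_path() {
  jq --arg p "$2" '.path = $p' "$1"
}
```

For example, after saving a response as browse.json and the request body as browse-request.json, running with_path browse-request.json "$(list_dirs browse.json | head -n 1)" prints the request body for the first subdirectory.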

Step 2: Preview the schema

POST /tables/preview
Samples the Parquet files, infers a Pinot schema, and returns an enriched table config plus sample rows. Review it, tweak if needed, then carry the schema and config forward to steps 3 and 4.
curl -X POST "$CONTROLLER/tables/preview" \
  -H "Content-Type: application/json" \
  -d '{
    "tableConfig": {
      "tableName": "my_gcs_table_OFFLINE",
      "tableType": "OFFLINE",
      "task": {
        "taskTypeConfigsMap": {
          "ExternalTableSyncTask": {
            "catalogType": "gcs-interop",
            "executor": "controller",
            "inputFormat": "parquet",
            "catalog.gcs-interop.bucketName": "<gcs-bucket>",
            "catalog.gcs-interop.prefix": "<prefix>",
            "catalog.gcs-interop.region": "<region>",
            "catalog.gcs-interop.endpoint": "https://storage.googleapis.com",
            "catalog.gcs-interop.accessKey": "<hmac-access-id>",
            "catalog.gcs-interop.secretKey": "<hmac-secret>",
            "catalog.gcs-interop.table.namespace": "default",
            "catalog.gcs-interop.table.tableName": "my_gcs_table",
            "catalog.gcs-interop.disable.integrity.protections": "true"
          }
        }
      }
    },
    "config": {
      "inference": { "embeddedSchemaEnabled": true },
      "sampling":  { "min": 1, "max": 100 }
    }
  }'

Response

| Field | Type | Description |
| --- | --- | --- |
| schema | object | The resolved Pinot schema. Use this in step 3. |
| tableConfigs.offline | object | The enriched OFFLINE table config, ready to persist. Use this in step 4. Secrets are masked as *****. |
| rows | array | Sample records after schema + ingestion transforms. |
| sourceRows | array | Raw records before transformation. |
| summary | object | Run summary: nSourceRows, nRows, nColumns, and summary.batch.nMatchingFiles (files found). |
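Before persisting anything, it can help to confirm that the preview actually found data. The sketch below checks the summary fields listed above in a saved response; the function name preview_looks_sane is hypothetical, and the thresholds (at least one file, row, and column) are an assumption about what counts as a usable preview.

```shell
# Hypothetical check: succeeds only if the preview matched at least one file
# and produced at least one row and one column. Missing fields evaluate to
# null in jq, which compares as less than any number, so they fail the check.
preview_looks_sane() {
  jq -e '
    (.summary.batch.nMatchingFiles > 0)
    and (.summary.nRows > 0)
    and (.summary.nColumns > 0)
  ' "$1" > /dev/null
}
```

A typical use would be preview_looks_sane preview.json || echo "preview found no data; check bucket and prefix" >&2, run before steps 3 and 4.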

Step 3: Create the schema

POST /schemas
Send the schema object from the preview response. Set its schemaName to your table name without the _OFFLINE suffix (my_gcs_table in these examples).
curl -X POST "$CONTROLLER/schemas" \
  -H "Content-Type: application/json" \
  -d @schema.json
{ "status": "my_gcs_table successfully added" }

Step 4: Create the table

POST /tables
Send the enriched tableConfigs.offline object from the preview response. The ExternalTableSyncTask block marks the table as external, and the tierConfigs array pins it to the GCS_INTEROPERABLE tier.
curl -X POST "$CONTROLLER/tables" \
  -H "Content-Type: application/json" \
  -d @tableConfig.json
{ "status": "Table my_gcs_table_OFFLINE successfully added" }
The controller’s External Table watcher then discovers the table, runs the first sync, and re-syncs on the table’s schedule (cron). No separate start call is required.
The tableConfigs.offline returned by /tables/preview includes a tierConfigs block pre-configured for GCS_INTEROPERABLE. Verify that the tier backend credentials match the source catalog credentials (they can differ if you use separate HMAC keys or Secret Manager secrets for the deep store).
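One way to review the tier settings before sending the config is to print the tierConfigs array from the saved file. This is a minimal sketch: the function name show_tier_configs is hypothetical, and it only confirms that the array exists and is non-empty, not that its credentials are correct.

```shell
# Hypothetical pre-flight check for a saved table config file.
# Prints the tierConfigs value for review, then succeeds only when it is a
# non-empty array.
show_tier_configs() {
  jq '.tierConfigs' "$1"
  jq -e '.tierConfigs | type == "array" and length > 0' "$1" > /dev/null
}
```

Running show_tier_configs tableConfig.json before the POST /tables call prints the block so you can compare it against the credentials you intend to use for the tier backend.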
The first sync runs on the watcher’s next tick. To kick it off immediately:
curl -X POST "$CONTROLLER/tasks/schedule?taskType=ExternalTableSyncTask&tableName=my_gcs_table_OFFLINE"

Quickstart: onboard a table end-to-end

The whole flow as one script (inline HMAC mode). Fill in the variables at the top, run the script, and the first sync starts automatically.
#!/usr/bin/env bash
set -euo pipefail

CONTROLLER="https://<your-controller>"
AUTH="Authorization: Bearer <token>"
BUCKET="<gcs-bucket>"
PREFIX="<prefix>"
REGION="<region>"       # Required field; GCS ignores the value
HMAC_KEY="<hmac-access-id>"
HMAC_SECRET="<hmac-secret>"
TABLE="my_gcs_table"

# 1. Validate & browse — confirms credentials and lists files.
curl -sf -X POST "$CONTROLLER/connections/browse" \
  -H "$AUTH" -H 'Content-Type: application/json' \
  --data-raw "{
    \"connection\": {
      \"type\": \"CATALOG\",
      \"params\": {
        \"catalogType\": \"gcs-interop\",
        \"catalog.gcs-interop.bucketName\": \"$BUCKET\",
        \"catalog.gcs-interop.prefix\": \"$PREFIX\",
        \"catalog.gcs-interop.region\": \"$REGION\",
        \"catalog.gcs-interop.endpoint\": \"https://storage.googleapis.com\",
        \"catalog.gcs-interop.accessKey\": \"$HMAC_KEY\",
        \"catalog.gcs-interop.secretKey\": \"$HMAC_SECRET\",
        \"catalog.gcs-interop.disable.integrity.protections\": \"true\"
      }
    },
    \"path\": \"\"
  }" | jq .

# 2. Preview — infer schema + enriched table config.
curl -sf -X POST "$CONTROLLER/tables/preview" \
  -H "$AUTH" -H 'Content-Type: application/json' \
  --data-raw "{
    \"tableConfig\": {
      \"tableName\": \"${TABLE}_OFFLINE\",
      \"tableType\": \"OFFLINE\",
      \"task\": {
        \"taskTypeConfigsMap\": {
          \"ExternalTableSyncTask\": {
            \"catalogType\": \"gcs-interop\",
            \"executor\": \"controller\",
            \"inputFormat\": \"parquet\",
            \"catalog.gcs-interop.bucketName\": \"$BUCKET\",
            \"catalog.gcs-interop.prefix\": \"$PREFIX\",
            \"catalog.gcs-interop.region\": \"$REGION\",
            \"catalog.gcs-interop.endpoint\": \"https://storage.googleapis.com\",
            \"catalog.gcs-interop.accessKey\": \"$HMAC_KEY\",
            \"catalog.gcs-interop.secretKey\": \"$HMAC_SECRET\",
            \"catalog.gcs-interop.table.namespace\": \"default\",
            \"catalog.gcs-interop.table.tableName\": \"$TABLE\",
            \"catalog.gcs-interop.disable.integrity.protections\": \"true\"
          }
        }
      }
    },
    \"config\": {
      \"inference\": { \"embeddedSchemaEnabled\": true },
      \"sampling\":  { \"min\": 1, \"max\": 100 }
    }
  }" > preview.json

# 3. Extract schema and table config from preview response.
jq --arg t "$TABLE" '.schema | .schemaName = $t' preview.json > schema.json
jq '.tableConfigs.offline'                        preview.json > tableConfig.json

# 4. Create the schema.
curl -sf -X POST "$CONTROLLER/schemas" \
  -H "$AUTH" -H 'Content-Type: application/json' -d @schema.json

# 5. Create the table — watcher starts the first sync automatically.
curl -sf -X POST "$CONTROLLER/tables" \
  -H "$AUTH" -H 'Content-Type: application/json' -d @tableConfig.json

# 6. (optional) Poll every 15 seconds until the sync completes (Ctrl-C to stop).
until curl -sf "$CONTROLLER/tables/${TABLE}_OFFLINE/externalTable/status" -H "$AUTH" \
  | jq -e '.status == "COMPLETED" and .segmentsUploaded == .filesDiscovered' > /dev/null
do
  sleep 15
done
Once step 6 exits 0, verify it’s queryable.
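The script builds its JSON by interpolating shell variables into a string, which breaks if a value contains a double quote or backslash. A possible alternative, sketched below, is to let jq construct the body so every value is escaped. The function name build_browse_request is hypothetical; the keys are the same ones the script sends.

```shell
# Sketch: build the browse request with jq so special characters in values
# (quotes, backslashes) are escaped as valid JSON strings.
# Args: bucket prefix region hmac-access-id hmac-secret
build_browse_request() {
  jq -n \
    --arg bucket "$1" --arg prefix "$2" --arg region "$3" \
    --arg key "$4" --arg secret "$5" \
    '{
      connection: {
        type: "CATALOG",
        params: {
          "catalogType": "gcs-interop",
          "catalog.gcs-interop.bucketName": $bucket,
          "catalog.gcs-interop.prefix": $prefix,
          "catalog.gcs-interop.region": $region,
          "catalog.gcs-interop.endpoint": "https://storage.googleapis.com",
          "catalog.gcs-interop.accessKey": $key,
          "catalog.gcs-interop.secretKey": $secret,
          "catalog.gcs-interop.disable.integrity.protections": "true"
        }
      },
      path: ""
    }'
}
```

The output can be piped straight into curl by passing --data-binary @- in place of the --data-raw argument, which tells curl to read the request body from standard input.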

Monitor onboarding

Three read-only endpoints report ingestion progress — run status, ingestion checkpoint, and source file count. See Observability for full request and response details.
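For unattended runs, polling the status endpoint usually needs a deadline so a stuck sync does not block a pipeline forever. The sketch below shows one way to do that. The helper names retry_until and sync_done are hypothetical; the status path and field names are the ones used in the Quickstart script, and AUTH is assumed to be set as it is there.

```shell
# Hypothetical helper: retry_until TIMEOUT_SECONDS INTERVAL_SECONDS CMD [ARGS...]
# Re-runs CMD until it succeeds; returns 1 if the deadline passes first.
retry_until() {
  local timeout="$1" interval="$2"
  shift 2
  local deadline=$(( SECONDS + timeout ))
  until "$@"; do
    if [ "$SECONDS" -ge "$deadline" ]; then
      return 1
    fi
    sleep "$interval"
  done
}

# Succeeds when the status endpoint reports a finished sync for table $1.
# Assumes CONTROLLER and AUTH are set as in the Quickstart script.
sync_done() {
  curl -sf "$CONTROLLER/tables/$1/externalTable/status" -H "$AUTH" \
    | jq -e '.status == "COMPLETED" and .segmentsUploaded == .filesDiscovered' > /dev/null
}
```

For example, retry_until 600 15 sync_done my_gcs_table_OFFLINE checks every 15 seconds and gives up with a non-zero exit status after roughly 10 minutes.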

Verify it’s queryable

When status is COMPLETED and segmentsUploaded matches filesDiscovered, run a query to confirm the data is live:
SELECT count(*) FROM my_gcs_table;
If status is COMPLETED but the count is 0, give segments a moment to load on the servers, then recheck; if it persists, see Troubleshooting.
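The same check can be scripted. The sketch below assumes your controller exposes Apache Pinot's standard POST /sql query endpoint and returns the usual resultTable response shape; StarTree Cloud deployments may route queries differently, so treat the path as an assumption to verify. The function names first_cell and count_rows are hypothetical, and AUTH is assumed to be set as in the Quickstart script.

```shell
# Read a Pinot query response on stdin and print the first cell of the first
# row (resultTable.rows[0][0]). Exits non-zero if that cell is absent.
first_cell() {
  jq -e -r '.resultTable.rows[0][0]'
}

# Hypothetical wrapper: run count(*) against table $1.
# Assumes the controller serves Pinot's POST /sql endpoint.
count_rows() {
  curl -sf -X POST "$CONTROLLER/sql" \
    -H "$AUTH" -H 'Content-Type: application/json' \
    -d "{\"sql\": \"SELECT count(*) FROM $1\"}" \
    | first_cell
}
```

Calling count_rows my_gcs_table prints the row count, which can then be compared against the number of rows you expect from the source files.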

GCS-specific config reference

All keys go under the ExternalTableSyncTask block in taskTypeConfigsMap.
| Key | Required | Description |
| --- | --- | --- |
| catalog.gcs-interop.bucketName | Yes | GCS bucket name. |
| catalog.gcs-interop.prefix | Yes | Key prefix for the Parquet data (e.g. path/to/data/). |
| catalog.gcs-interop.region | Yes | Region string. GCS ignores the value but the field must be present. |
| catalog.gcs-interop.endpoint | Yes | Always https://storage.googleapis.com. |
| catalog.gcs-interop.accessKey | Yes | HMAC access ID, or GCP Secret Manager secret name if keyType=SECRET. |
| catalog.gcs-interop.secretKey | Yes | HMAC secret, or GCP Secret Manager secret name if keyType=SECRET. |
| catalog.gcs-interop.disable.integrity.protections | Yes | Must be "true". Required for GCS range-GET and PUT compatibility with the AWS SDK. |
| catalog.gcs-interop.keyType | No | Set to SECRET to resolve accessKey/secretKey via GCP Secret Manager. |
| catalog.gcs-interop.secretmanagertype | If keyType=SECRET | Must be GCS. |
| catalog.gcs-interop.gcpprojectid | If keyType=SECRET | GCP project ID for Secret Manager lookups. |
| catalog.gcs-interop.gcpkeypath | If keyType=SECRET | Path to a service-account key JSON with Secret Manager read access. |
| catalog.gcs-interop.table.namespace | No | Logical namespace. Use default for a flat prefix. |
| catalog.gcs-interop.table.tableName | No | Logical table name (any value; used internally). |
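When configs are generated by automation, a quick lint can catch a missing required key before the request is sent. The following is a rough sketch: the function name missing_gcs_keys is hypothetical, and it uses a fixed-string search, so it only confirms that each key name appears in the file, not that its value is valid or that it sits in the right block.

```shell
# Hypothetical lint: report required catalog.gcs-interop.* keys that do not
# appear in a saved config file. Returns 0 if all are present, 1 otherwise.
missing_gcs_keys() {
  local file="$1" key missing=0
  for key in bucketName prefix region endpoint accessKey secretKey \
             disable.integrity.protections; do
    if ! grep -qF "\"catalog.gcs-interop.${key}\"" "$file"; then
      echo "missing: catalog.gcs-interop.${key}"
      missing=1
    fi
  done
  return "$missing"
}
```

For example, missing_gcs_keys tableConfig.json prints one line per absent key and exits non-zero if any are missing.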

What’s next

  1. Add indexes for your query patterns. See Indexes.
  2. Enable caching and preload. See Data and Index Caching.
  3. Protect large-scan queries from OOM. See Best Practices & Configs (Query OOM protection).
  4. Monitor ongoing syncs. See Observability.

For common questions and failures, see the FAQ and Troubleshooting.