This feature requires StarTree release 0.14.0 or later, and must be enabled on demand — contact StarTree support to activate it.
This page shows how to onboard a GCS Data Lake External Table with the StarTree controller REST APIs instead of the Data Portal UI. Use it for automation, infrastructure-as-code, or onboarding many tables at once.
GCS Data Lake uses Google Cloud Storage accessed through its S3-compatible (“interop”) endpoint (storage.googleapis.com). The catalog type is gcs-interop and it reuses the same S3 wiring as S3 Data Lake — only the endpoint, addressing style, and credential source differ.
How it works
Onboarding is four calls. Each one feeds the next:
| Step | Call | Purpose |
|---|---|---|
| 1. Validate & browse | POST /connections/browse | Validate the connection; list files under the GCS prefix |
| 2. Preview | POST /tables/preview | Infer the schema and an enriched table config |
| 3. Create schema | POST /schemas | Register the schema |
| 4. Create table | POST /tables | Register the External Table |
The preview call (step 2) is the core step: it samples the source, infers a Pinot schema, and returns a ready-to-use table config. Steps 3 and 4 just persist its output.
Once the table exists, the controller’s watcher runs the first sync and re-syncs on schedule — no manual trigger. Track progress with the observability endpoints.
All paths are relative to your controller base URL. The examples assume export CONTROLLER=https://<your-controller>. On StarTree Cloud, the controller is reached through the data-plane proxy, so set CONTROLLER to https://<data-plane-host>/api/pinot instead.
If your controller requires authentication, add an Authorization header to every request, for example -H "Authorization: Bearer <token>". The per-step examples below omit it for brevity.
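The base-URL and header conventions above can be wrapped in a small helper so that later calls stay short. This is only a sketch: ctl_url, ctl, and the CONTROLLER_TOKEN variable are hypothetical names introduced here for illustration, not part of any StarTree tooling.

```shell
# Hypothetical helpers; the function names and CONTROLLER_TOKEN are illustrative.

# Join the controller base URL and an API path, tolerating one trailing
# slash on the base and one leading slash on the path.
ctl_url() {
  printf '%s/%s\n' "${CONTROLLER%/}" "${1#/}"
}

# Call the controller: ctl METHOD PATH [extra curl args...]
# The Authorization header is added only when CONTROLLER_TOKEN is set.
ctl() {
  local method="$1" path="$2"
  shift 2
  if [ -n "${CONTROLLER_TOKEN:-}" ]; then
    # Prepend the header to the remaining curl arguments.
    set -- -H "Authorization: Bearer ${CONTROLLER_TOKEN}" "$@"
  fi
  curl -sf -X "$method" "$(ctl_url "$path")" \
    -H 'Content-Type: application/json' "$@"
}
```

With these in place, a call such as ctl POST /schemas -d @schema.json behaves the same on a self-hosted controller and behind the data-plane proxy, as long as CONTROLLER is set accordingly.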
Prerequisites
- StarTree release with GCS External Table support enabled, and tiered storage configured on the cluster with a GCS_INTEROPERABLE tier backend. Contact StarTree support if unsure.
- Network access to the controller REST endpoint, plus an Authorization token if your cluster requires auth.
- A GCS bucket with Parquet files and a GCS HMAC key (access ID + secret) with storage.objects.get and storage.objects.list on the bucket. Generate HMAC keys in the Google Cloud Console → Cloud Storage → Settings → Interoperability.
- The GCS bucket name, key prefix, and the project’s region (used as a placeholder: GCS ignores it, but the field is required).
Authentication
GCS Data Lake authenticates using HMAC keys (Google’s S3-interop credentials). There are two ways to supply them:
| Method | Keys | Use when |
|---|---|---|
| Inline HMAC | accessKey (HMAC access ID) + secretKey (HMAC secret) | Quick tests or when Secret Manager is not available. |
| GCP Secret Manager | keyType=SECRET + secret names for accessKey/secretKey | Recommended for production: no HMAC material is stored in the table config. |
For Secret Manager, provide the following (in requests, each key carries the catalog.gcs-interop. prefix):
- keyType: SECRET
- secretmanagertype: GCS
- gcpprojectid: your GCP project ID
- gcpkeypath: path to a service-account key JSON file with Secret Manager read access
disable.integrity.protections must be set to true for all GCS connections. GCS rejects the AWS SDK’s default request checksums on writes and returns whole-object checksums on range GETs that the SDK mis-validates. Setting this flag relaxes both to WHEN_REQUIRED, which is required for reads and writes to succeed.
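The two credential modes differ only in a few keys, so the connection parameters can be generated rather than hand-edited. The sketch below builds them with jq, which also escapes values safely. gcs_params is a hypothetical helper name, and the argument order is an assumption made for this example.

```shell
# Hypothetical helper: prints the gcs-interop connection params as JSON.
# Usage: gcs_params BUCKET PREFIX REGION ACCESS SECRET [GCP_PROJECT_ID SA_KEY_PATH]
# With 5 args, ACCESS/SECRET are the HMAC access ID and secret (inline mode).
# With 7 args, they are Secret Manager secret names and keyType=SECRET is set.
gcs_params() {
  jq -n --arg p "catalog.gcs-interop" \
    --arg bucket "$1" --arg prefix "$2" --arg region "$3" \
    --arg access "$4" --arg secret "$5" \
    --arg project "${6:-}" --arg keypath "${7:-}" '
    {
      catalogType: "gcs-interop",
      ($p + ".bucketName"): $bucket,
      ($p + ".prefix"): $prefix,
      ($p + ".region"): $region,
      ($p + ".endpoint"): "https://storage.googleapis.com",
      ($p + ".accessKey"): $access,
      ($p + ".secretKey"): $secret,
      # Always set for GCS, as described above.
      ($p + ".disable.integrity.protections"): "true"
    }
    + (if $project != "" then {
        ($p + ".keyType"): "SECRET",
        ($p + ".secretmanagertype"): "GCS",
        ($p + ".gcpprojectid"): $project,
        ($p + ".gcpkeypath"): $keypath
      } else {} end)'
}
```

The output can be placed under connection.params in a browse request, or merged into the ExternalTableSyncTask block of a preview request.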
Step 1: Validate and browse the connection
POST /connections/browse
There is no separate “validate” endpoint. Browsing the catalog is the validation step: a 200 with an items list (even an empty one) confirms your credentials and connectivity.
- Set path to "" to browse the root prefix. This both validates the connection and lists files and directories.
Request
| Field | Type | Required | Description |
|---|---|---|---|
| connection.type | string | Yes | Use CATALOG for all External Table sources. |
| connection.params | object | Yes | GCS connection settings (see examples below). |
| path | string | No | Where to browse. "" or omitted = root prefix. |
The first request below supplies inline HMAC keys; the second resolves the keys through GCP Secret Manager.
curl -X POST "$CONTROLLER/connections/browse" \
-H "Content-Type: application/json" \
-d '{
"connection": {
"type": "CATALOG",
"params": {
"catalogType": "gcs-interop",
"catalog.gcs-interop.bucketName": "<gcs-bucket>",
"catalog.gcs-interop.prefix": "<prefix>",
"catalog.gcs-interop.region": "<region>",
"catalog.gcs-interop.endpoint": "https://storage.googleapis.com",
"catalog.gcs-interop.accessKey": "<hmac-access-id>",
"catalog.gcs-interop.secretKey": "<hmac-secret>",
"catalog.gcs-interop.disable.integrity.protections": "true"
}
},
"path": ""
}'
curl -X POST "$CONTROLLER/connections/browse" \
-H "Content-Type: application/json" \
-d '{
"connection": {
"type": "CATALOG",
"params": {
"catalogType": "gcs-interop",
"catalog.gcs-interop.bucketName": "<gcs-bucket>",
"catalog.gcs-interop.prefix": "<prefix>",
"catalog.gcs-interop.region": "<region>",
"catalog.gcs-interop.endpoint": "https://storage.googleapis.com",
"catalog.gcs-interop.accessKey": "<access-id-secret-name>",
"catalog.gcs-interop.secretKey": "<secret-key-secret-name>",
"catalog.gcs-interop.keyType": "SECRET",
"catalog.gcs-interop.secretmanagertype": "GCS",
"catalog.gcs-interop.gcpprojectid": "<gcp-project-id>",
"catalog.gcs-interop.gcpkeypath": "<path-to-sa-key.json>",
"catalog.gcs-interop.disable.integrity.protections": "true"
}
},
"path": ""
}'
Response
{
"items": [
{ "name": "2024/01/", "type": "DIR" },
{ "name": "data.parquet", "type": "FILE" }
]
}
| items[].type | Meaning |
|---|---|
| DIR | A directory you can drill into: pass its name as path. |
| FILE | A Parquet file under the prefix. |
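To drill into a bucket from a script, the browse response can be filtered by entry type. The following is a minimal sketch that assumes the response shape shown above; browse_dirs and browse_files are hypothetical helper names.

```shell
# Hypothetical helpers: read a /connections/browse response on stdin.

# Print one directory name per line (entries whose type is "DIR").
browse_dirs() {
  jq -r '.items[] | select(.type == "DIR") | .name'
}

# Print one file name per line (entries whose type is "FILE").
browse_files() {
  jq -r '.items[] | select(.type == "FILE") | .name'
}
```

Each name printed by browse_dirs can be sent back as the path field of the next browse request.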
Step 2: Preview the schema
POST /tables/preview
Samples the Parquet files, infers a Pinot schema, and returns an enriched table config plus sample rows. Review it, tweak if needed, then carry the schema and config forward to steps 3 and 4.
The first request below supplies inline HMAC keys; the second resolves the keys through GCP Secret Manager.
curl -X POST "$CONTROLLER/tables/preview" \
-H "Content-Type: application/json" \
-d '{
"tableConfig": {
"tableName": "my_gcs_table_OFFLINE",
"tableType": "OFFLINE",
"task": {
"taskTypeConfigsMap": {
"ExternalTableSyncTask": {
"catalogType": "gcs-interop",
"executor": "controller",
"inputFormat": "parquet",
"catalog.gcs-interop.bucketName": "<gcs-bucket>",
"catalog.gcs-interop.prefix": "<prefix>",
"catalog.gcs-interop.region": "<region>",
"catalog.gcs-interop.endpoint": "https://storage.googleapis.com",
"catalog.gcs-interop.accessKey": "<hmac-access-id>",
"catalog.gcs-interop.secretKey": "<hmac-secret>",
"catalog.gcs-interop.table.namespace": "default",
"catalog.gcs-interop.table.tableName": "my_gcs_table",
"catalog.gcs-interop.disable.integrity.protections": "true"
}
}
}
},
"config": {
"inference": { "embeddedSchemaEnabled": true },
"sampling": { "min": 1, "max": 100 }
}
}'
curl -X POST "$CONTROLLER/tables/preview" \
-H "Content-Type: application/json" \
-d '{
"tableConfig": {
"tableName": "my_gcs_table_OFFLINE",
"tableType": "OFFLINE",
"task": {
"taskTypeConfigsMap": {
"ExternalTableSyncTask": {
"catalogType": "gcs-interop",
"executor": "controller",
"inputFormat": "parquet",
"catalog.gcs-interop.bucketName": "<gcs-bucket>",
"catalog.gcs-interop.prefix": "<prefix>",
"catalog.gcs-interop.region": "<region>",
"catalog.gcs-interop.endpoint": "https://storage.googleapis.com",
"catalog.gcs-interop.accessKey": "<access-id-secret-name>",
"catalog.gcs-interop.secretKey": "<secret-key-secret-name>",
"catalog.gcs-interop.keyType": "SECRET",
"catalog.gcs-interop.secretmanagertype": "GCS",
"catalog.gcs-interop.gcpprojectid": "<gcp-project-id>",
"catalog.gcs-interop.gcpkeypath": "<path-to-sa-key.json>",
"catalog.gcs-interop.table.namespace": "default",
"catalog.gcs-interop.table.tableName": "my_gcs_table",
"catalog.gcs-interop.disable.integrity.protections": "true"
}
}
}
},
"config": {
"inference": { "embeddedSchemaEnabled": true },
"sampling": { "min": 1, "max": 100 }
}
}'
Response
| Field | Type | Description |
|---|---|---|
| schema | object | The resolved Pinot schema. Use this in step 3. |
| tableConfigs.offline | object | The enriched OFFLINE table config, ready to persist. Use this in step 4. Secrets are masked as *****. |
| rows | array | Sample records after schema + ingestion transforms. |
| sourceRows | array | Raw records before transformation. |
| summary | object | Run summary: nSourceRows, nRows, nColumns, and summary.batch.nMatchingFiles (files found). |
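Before persisting anything, it can help to confirm that the preview actually matched files and inferred columns. The sketch below checks a saved preview response. preview_ok is a hypothetical helper, and it assumes nColumns sits directly under summary and nMatchingFiles under summary.batch, as the field list above suggests.

```shell
# Hypothetical check: succeed only if the saved preview response looks usable.
# Usage: preview_ok preview.json
preview_ok() {
  jq -e '
    (.summary.batch.nMatchingFiles // 0) > 0      # at least one source file found
    and (.summary.nColumns // 0) > 0              # at least one column inferred
    and (.schema | type == "object")              # schema present for step 3
    and (.tableConfigs.offline | type == "object")  # config present for step 4
  ' "$1" > /dev/null
}
```

A typical use is preview_ok preview.json || exit 1 between the preview call and the schema creation call.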
Step 3: Create the schema
POST /schemas
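Send the schema object from the preview response, with its schemaName set to the table name without the _OFFLINE suffix (my_gcs_table in these examples). The request below assumes you saved it as schema.json.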
curl -X POST "$CONTROLLER/schemas" \
-H "Content-Type: application/json" \
-d @schema.json
{ "status": "my_gcs_table successfully added" }
Step 4: Create the table
POST /tables
Send the enriched tableConfigs.offline object from the preview response. The ExternalTableSyncTask block marks the table as external, and the tierConfigs array pins it to the GCS_INTEROPERABLE tier.
curl -X POST "$CONTROLLER/tables" \
-H "Content-Type: application/json" \
-d @tableConfig.json
{ "status": "Table my_gcs_table_OFFLINE successfully added" }
The controller’s External Table watcher then discovers the table, runs the first sync, and re-syncs at the schedule (cron) interval. There is no separate start call.
The tableConfigs.offline returned by /tables/preview includes a tierConfigs block pre-configured for GCS_INTEROPERABLE. Verify that the tier backend credentials match the source catalog credentials (they can differ if you use separate HMAC keys or Secret Manager secrets for the deep store).
The first sync runs on the watcher’s next tick. To kick it off immediately:
curl -X POST "$CONTROLLER/tasks/schedule?taskType=ExternalTableSyncTask&tableName=my_gcs_table_OFFLINE"
Quickstart: onboard a table end-to-end
The whole flow as one script (inline HMAC mode). Fill in the variables at the top, run the script, and the first sync starts automatically.
#!/usr/bin/env bash
set -euo pipefail
CONTROLLER="https://<your-controller>"
AUTH="Authorization: Bearer <token>"
BUCKET="<gcs-bucket>"
PREFIX="<prefix>"
REGION="<region>" # Required field; GCS ignores the value
HMAC_KEY="<hmac-access-id>"
HMAC_SECRET="<hmac-secret>"
TABLE="my_gcs_table"
# 1. Validate & browse — confirms credentials and lists files.
curl -sf -X POST "$CONTROLLER/connections/browse" \
-H "$AUTH" -H 'Content-Type: application/json' \
--data-raw "{
\"connection\": {
\"type\": \"CATALOG\",
\"params\": {
\"catalogType\": \"gcs-interop\",
\"catalog.gcs-interop.bucketName\": \"$BUCKET\",
\"catalog.gcs-interop.prefix\": \"$PREFIX\",
\"catalog.gcs-interop.region\": \"$REGION\",
\"catalog.gcs-interop.endpoint\": \"https://storage.googleapis.com\",
\"catalog.gcs-interop.accessKey\": \"$HMAC_KEY\",
\"catalog.gcs-interop.secretKey\": \"$HMAC_SECRET\",
\"catalog.gcs-interop.disable.integrity.protections\": \"true\"
}
},
\"path\": \"\"
}" | jq .
# 2. Preview — infer schema + enriched table config.
curl -sf -X POST "$CONTROLLER/tables/preview" \
-H "$AUTH" -H 'Content-Type: application/json' \
--data-raw "{
\"tableConfig\": {
\"tableName\": \"${TABLE}_OFFLINE\",
\"tableType\": \"OFFLINE\",
\"task\": {
\"taskTypeConfigsMap\": {
\"ExternalTableSyncTask\": {
\"catalogType\": \"gcs-interop\",
\"executor\": \"controller\",
\"inputFormat\": \"parquet\",
\"catalog.gcs-interop.bucketName\": \"$BUCKET\",
\"catalog.gcs-interop.prefix\": \"$PREFIX\",
\"catalog.gcs-interop.region\": \"$REGION\",
\"catalog.gcs-interop.endpoint\": \"https://storage.googleapis.com\",
\"catalog.gcs-interop.accessKey\": \"$HMAC_KEY\",
\"catalog.gcs-interop.secretKey\": \"$HMAC_SECRET\",
\"catalog.gcs-interop.table.namespace\": \"default\",
\"catalog.gcs-interop.table.tableName\": \"$TABLE\",
\"catalog.gcs-interop.disable.integrity.protections\": \"true\"
}
}
}
},
\"config\": {
\"inference\": { \"embeddedSchemaEnabled\": true },
\"sampling\": { \"min\": 1, \"max\": 100 }
}
}" > preview.json
# 3. Extract schema and table config from preview response.
jq --arg t "$TABLE" '.schema | .schemaName = $t' preview.json > schema.json
jq '.tableConfigs.offline' preview.json > tableConfig.json
# 4. Create the schema.
curl -sf -X POST "$CONTROLLER/schemas" \
-H "$AUTH" -H 'Content-Type: application/json' -d @schema.json
# 5. Create the table — watcher starts the first sync automatically.
curl -sf -X POST "$CONTROLLER/tables" \
-H "$AUTH" -H 'Content-Type: application/json' -d @tableConfig.json
# 6. (optional) Check sync status once. Exits non-zero until the sync has completed, so re-run until it succeeds.
curl -sf "$CONTROLLER/tables/${TABLE}_OFFLINE/externalTable/status" -H "$AUTH" \
| jq -e '.status == "COMPLETED" and .segmentsUploaded == .filesDiscovered'
Once step 6 exits 0, verify it’s queryable.
Monitor onboarding
Three read-only endpoints report ingestion progress — run status, ingestion checkpoint, and source file count. See Observability for full request and response details.
Verify it’s queryable
When status is COMPLETED and segmentsUploaded matches filesDiscovered, run a query to confirm the data is live:
SELECT count(*) FROM my_gcs_table;
If status is COMPLETED but the count is 0, give segments a moment to load on the servers, then recheck; if it persists, see Troubleshooting.
GCS-specific config reference
All keys go under the ExternalTableSyncTask block in taskTypeConfigsMap.
| Key | Required | Description |
|---|---|---|
| catalog.gcs-interop.bucketName | Yes | GCS bucket name. |
| catalog.gcs-interop.prefix | Yes | Key prefix for the Parquet data (e.g. path/to/data/). |
| catalog.gcs-interop.region | Yes | Region string. GCS ignores the value but the field must be present. |
| catalog.gcs-interop.endpoint | Yes | Always https://storage.googleapis.com. |
| catalog.gcs-interop.accessKey | Yes | HMAC access ID, or GCP Secret Manager secret name if keyType=SECRET. |
| catalog.gcs-interop.secretKey | Yes | HMAC secret, or GCP Secret Manager secret name if keyType=SECRET. |
| catalog.gcs-interop.disable.integrity.protections | Yes | Must be "true". Required for GCS range-GET and PUT compatibility with the AWS SDK. |
| catalog.gcs-interop.keyType | No | Set to SECRET to resolve accessKey/secretKey via GCP Secret Manager. |
| catalog.gcs-interop.secretmanagertype | If keyType=SECRET | Must be GCS. |
| catalog.gcs-interop.gcpprojectid | If keyType=SECRET | GCP project ID for Secret Manager lookups. |
| catalog.gcs-interop.gcpkeypath | If keyType=SECRET | Path to a service-account key JSON with Secret Manager read access. |
| catalog.gcs-interop.table.namespace | No | Logical namespace. Use default for a flat prefix. |
| catalog.gcs-interop.table.tableName | No | Logical table name (any value; used internally). |
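The required keys in this table can be checked mechanically before a table config is submitted. The sketch below lists any that are missing from a config file. missing_gcs_keys is a hypothetical helper, and it only tests for presence, not for valid values.

```shell
# Hypothetical check: print each required gcs-interop key that is absent
# from the ExternalTableSyncTask block of a table config file.
# Usage: missing_gcs_keys tableConfig.json   (prints nothing when complete)
missing_gcs_keys() {
  jq -r '
    (.task.taskTypeConfigsMap.ExternalTableSyncTask // {}) as $t
    | ["bucketName", "prefix", "region", "endpoint", "accessKey",
       "secretKey", "disable.integrity.protections"]
      # Secret Manager mode adds three conditionally required keys.
      + (if $t["catalog.gcs-interop.keyType"] == "SECRET"
         then ["secretmanagertype", "gcpprojectid", "gcpkeypath"]
         else [] end)
    | map("catalog.gcs-interop." + .)
    | map(select($t[.] == null))
    | .[]
  ' "$1"
}
```

An empty result means every required key is present; any printed key name points at a gap to fix before calling POST /tables.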
What’s next
- Add indexes for your query patterns. → Indexes
- Enable caching and preload. → Data and Index Caching
- Protect large-scan queries from OOM. → Best Practices & Configs — Query OOM protection
- Monitor ongoing syncs. → Observability
For common questions and failures, see the FAQ and Troubleshooting.