
# GCS Data Lake: Onboarding via API

<Warning>
  This feature requires StarTree release 0.14.0 or later, and must be enabled on demand — contact StarTree support to activate it.
</Warning>

This page shows how to onboard a GCS Data Lake External Table with the StarTree controller REST APIs instead of the [Data Portal UI](./onboarding-data-portal). Use it for automation, infrastructure-as-code, or onboarding many tables at once.

GCS Data Lake uses Google Cloud Storage accessed through its **S3-compatible ("interop") endpoint** (`storage.googleapis.com`). The catalog type is `gcs-interop` and it reuses the same S3 wiring as [S3 Data Lake](../s3/onboarding-api) — only the endpoint, addressing style, and credential source differ.

## How it works

Onboarding is four calls. Each one feeds the next:

```text theme={null}
1. Validate & browse   POST /connections/browse   →  validate; list files under the GCS prefix
2. Preview             POST /tables/preview       →  infer schema + enriched table config
3. Create schema       POST /schemas              →  register the schema
4. Create table        POST /tables               →  register the External Table
```

The preview call (step 2) is the core step: it samples the source, infers a Pinot schema, and returns a ready-to-use table config. Steps 3 and 4 just persist its output.

Once the table exists, the controller's watcher runs the first sync and re-syncs on schedule — no manual trigger. Track progress with the [observability endpoints](#monitor-onboarding).

<Note>
  All paths are relative to your controller base URL. The examples assume `export CONTROLLER=https://<your-controller>`. On StarTree Cloud, the controller is reached through the data-plane proxy, so use `export CONTROLLER=https://<data-plane-host>/api/pinot`.

  If your controller requires authentication, add an `Authorization` header to every request — e.g. `-H "Authorization: Bearer <token>"`. The examples below omit it for brevity.
</Note>
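
To avoid repeating the base URL and headers, the calls on this page can be wrapped in a small shell helper. The sketch below is illustrative only: `pinot_api` and the `TOKEN` variable are hypothetical names, not part of StarTree's tooling, and `CONTROLLER` is the variable exported above.

```bash
# Hypothetical helper: pinot_api METHOD PATH [extra curl args...]
# Assumes CONTROLLER is exported as described above. TOKEN is an assumed
# variable name; leave it unset if your controller needs no authentication.
pinot_api() {
  api_method="$1"
  api_path="$2"
  shift 2
  if [ -n "${TOKEN:-}" ]; then
    # Prepend the Authorization header to the remaining curl arguments.
    set -- -H "Authorization: Bearer ${TOKEN}" "$@"
  fi
  # -f makes curl exit non-zero on HTTP errors; -sS hides the progress bar
  # but still prints error messages.
  curl -sS -f -X "$api_method" "${CONTROLLER%/}${api_path}" \
    -H "Content-Type: application/json" "$@"
}
```

For example, `pinot_api POST /connections/browse -d @browse.json` would send a saved request body, where `browse.json` is a hypothetical file name.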

## Prerequisites

* StarTree release 0.14.0 or later with GCS External Table support enabled, and **tiered storage configured** on the cluster with a `GCS_INTEROPERABLE` tier backend. Contact StarTree support if unsure.
* Network access to the controller REST endpoint, plus an `Authorization` token if your cluster requires auth.
* A GCS bucket with Parquet files, and a GCS HMAC key (access ID + secret) whose owning account has the `storage.objects.get` and `storage.objects.list` permissions on the bucket. Generate HMAC keys in the [Google Cloud Console → Cloud Storage → Settings → Interoperability](https://console.cloud.google.com/storage/settings;tab=interoperability).
* The GCS bucket name, the key prefix, and a region string. GCS ignores the region value, but the field is required, so it acts only as a placeholder.

## Authentication

GCS Data Lake authenticates using HMAC keys (Google's S3-interop credentials). There are two ways to supply them:

| Method                 | Keys                                                        | Use when                                                                  |
| ---------------------- | ----------------------------------------------------------- | ------------------------------------------------------------------------- |
| **Inline HMAC**        | `accessKey` (HMAC access ID) + `secretKey` (HMAC secret)    | Quick tests or when Secret Manager is not available.                      |
| **GCP Secret Manager** | `keyType=SECRET` + secret names for `accessKey`/`secretKey` | Recommended for production — no HMAC material stored in the table config. |

For Secret Manager, also set the following keys, each prefixed with `catalog.gcs-interop.` as in the request examples:

* `keyType`: `SECRET`
* `secretmanagertype`: `GCS`
* `gcpprojectid`: your GCP project ID
* `gcpkeypath`: path to a service-account key JSON file with Secret Manager read access

<Note>
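  `disable.integrity.protections` (full key: `catalog.gcs-interop.disable.integrity.protections`) must be set to `true` for all GCS connections. Two AWS SDK defaults break against GCS: GCS rejects the SDK's default request checksums on writes, and on range GETs it returns whole-object checksums that the SDK mis-validates. Setting this flag relaxes both request checksum calculation and response checksum validation to `WHEN_REQUIRED`, which reads and writes need in order to succeed.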
</Note>
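
For the Secret Manager method, the two HMAC values have to exist as secrets before the connection can reference them by name. A minimal sketch using the `gcloud` CLI follows. It assumes each secret holds the raw key string as its payload, which this page does not confirm; `store_hmac_secret` and the secret names in the usage line are hypothetical.

```bash
# Hypothetical helper: store_hmac_secret SECRET_NAME VALUE GCP_PROJECT
# Assumption: the controller expects the raw HMAC string as the secret payload.
store_hmac_secret() {
  # printf '%s' avoids a trailing newline ending up inside the secret.
  printf '%s' "$2" | gcloud secrets create "$1" \
    --project="$3" \
    --replication-policy=automatic \
    --data-file=-
}
```

Calling `store_hmac_secret gcs-hmac-access-id "<hmac-access-id>" "<gcp-project-id>"`, and the same for the HMAC secret, would create the two secrets whose names you then pass as `accessKey` and `secretKey` with `keyType` set to `SECRET`.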

***

## Step 1: Validate and browse the connection

**`POST /connections/browse`**

There is no separate "validate" endpoint. Browsing the catalog *is* the validation step: a `200` with an `items` list (even an empty one) confirms your credentials and connectivity.

* Set `path` to `""` to browse the **root prefix** — this both validates the connection and lists files/directories.

### Request

| Field               | Type   | Required | Description                                     |
| ------------------- | ------ | -------- | ----------------------------------------------- |
| `connection.type`   | string | Yes      | Use `CATALOG` for all External Table sources.   |
| `connection.params` | object | Yes      | GCS connection settings (see examples below).   |
| `path`              | string | No       | Where to browse. `""` or omitted = root prefix. |

<Tabs>
  <Tab title="Inline HMAC">
    ```bash theme={null}
    curl -X POST "$CONTROLLER/connections/browse" \
      -H "Content-Type: application/json" \
      -d '{
        "connection": {
          "type": "CATALOG",
          "params": {
            "catalogType": "gcs-interop",
            "catalog.gcs-interop.bucketName": "<gcs-bucket>",
            "catalog.gcs-interop.prefix": "<prefix>",
            "catalog.gcs-interop.region": "<region>",
            "catalog.gcs-interop.endpoint": "https://storage.googleapis.com",
            "catalog.gcs-interop.accessKey": "<hmac-access-id>",
            "catalog.gcs-interop.secretKey": "<hmac-secret>",
            "catalog.gcs-interop.disable.integrity.protections": "true"
          }
        },
        "path": ""
      }'
    ```
  </Tab>

  <Tab title="GCP Secret Manager">
    ```bash theme={null}
    curl -X POST "$CONTROLLER/connections/browse" \
      -H "Content-Type: application/json" \
      -d '{
        "connection": {
          "type": "CATALOG",
          "params": {
            "catalogType": "gcs-interop",
            "catalog.gcs-interop.bucketName": "<gcs-bucket>",
            "catalog.gcs-interop.prefix": "<prefix>",
            "catalog.gcs-interop.region": "<region>",
            "catalog.gcs-interop.endpoint": "https://storage.googleapis.com",
            "catalog.gcs-interop.accessKey": "<access-id-secret-name>",
            "catalog.gcs-interop.secretKey": "<secret-key-secret-name>",
            "catalog.gcs-interop.keyType": "SECRET",
            "catalog.gcs-interop.secretmanagertype": "GCS",
            "catalog.gcs-interop.gcpprojectid": "<gcp-project-id>",
            "catalog.gcs-interop.gcpkeypath": "<path-to-sa-key.json>",
            "catalog.gcs-interop.disable.integrity.protections": "true"
          }
        },
        "path": ""
      }'
    ```
  </Tab>
</Tabs>

### Response

```json theme={null}
{
  "items": [
    { "name": "2024/01/", "type": "DIR" },
    { "name": "data.parquet", "type": "FILE" }
  ]
}
```

| `items[].type` | Meaning                                                     |
| -------------- | ----------------------------------------------------------- |
| `DIR`          | A directory you can drill into — pass its `name` as `path`. |
| `FILE`         | A Parquet file under the prefix.                            |

***

## Step 2: Preview the schema

**`POST /tables/preview`**

Samples the Parquet files, infers a Pinot schema, and returns an enriched table config plus sample rows. Review it, tweak if needed, then carry the schema and config forward to steps 3 and 4.

<Tabs>
  <Tab title="Inline HMAC">
    ```bash theme={null}
    curl -X POST "$CONTROLLER/tables/preview" \
      -H "Content-Type: application/json" \
      -d '{
        "tableConfig": {
          "tableName": "my_gcs_table_OFFLINE",
          "tableType": "OFFLINE",
          "task": {
            "taskTypeConfigsMap": {
              "ExternalTableSyncTask": {
                "catalogType": "gcs-interop",
                "executor": "controller",
                "inputFormat": "parquet",
                "catalog.gcs-interop.bucketName": "<gcs-bucket>",
                "catalog.gcs-interop.prefix": "<prefix>",
                "catalog.gcs-interop.region": "<region>",
                "catalog.gcs-interop.endpoint": "https://storage.googleapis.com",
                "catalog.gcs-interop.accessKey": "<hmac-access-id>",
                "catalog.gcs-interop.secretKey": "<hmac-secret>",
                "catalog.gcs-interop.table.namespace": "default",
                "catalog.gcs-interop.table.tableName": "my_gcs_table",
                "catalog.gcs-interop.disable.integrity.protections": "true"
              }
            }
          }
        },
        "config": {
          "inference": { "embeddedSchemaEnabled": true },
          "sampling":  { "min": 1, "max": 100 }
        }
      }'
    ```
  </Tab>

  <Tab title="GCP Secret Manager">
    ```bash theme={null}
    curl -X POST "$CONTROLLER/tables/preview" \
      -H "Content-Type: application/json" \
      -d '{
        "tableConfig": {
          "tableName": "my_gcs_table_OFFLINE",
          "tableType": "OFFLINE",
          "task": {
            "taskTypeConfigsMap": {
              "ExternalTableSyncTask": {
                "catalogType": "gcs-interop",
                "executor": "controller",
                "inputFormat": "parquet",
                "catalog.gcs-interop.bucketName": "<gcs-bucket>",
                "catalog.gcs-interop.prefix": "<prefix>",
                "catalog.gcs-interop.region": "<region>",
                "catalog.gcs-interop.endpoint": "https://storage.googleapis.com",
                "catalog.gcs-interop.accessKey": "<access-id-secret-name>",
                "catalog.gcs-interop.secretKey": "<secret-key-secret-name>",
                "catalog.gcs-interop.keyType": "SECRET",
                "catalog.gcs-interop.secretmanagertype": "GCS",
                "catalog.gcs-interop.gcpprojectid": "<gcp-project-id>",
                "catalog.gcs-interop.gcpkeypath": "<path-to-sa-key.json>",
                "catalog.gcs-interop.table.namespace": "default",
                "catalog.gcs-interop.table.tableName": "my_gcs_table",
                "catalog.gcs-interop.disable.integrity.protections": "true"
              }
            }
          }
        },
        "config": {
          "inference": { "embeddedSchemaEnabled": true },
          "sampling":  { "min": 1, "max": 100 }
        }
      }'
    ```
  </Tab>
</Tabs>

### Response

| Field                  | Type   | Description                                                                                                 |
| ---------------------- | ------ | ----------------------------------------------------------------------------------------------------------- |
| `schema`               | object | The resolved Pinot schema. **Use this in step 3.**                                                          |
| `tableConfigs.offline` | object | The enriched OFFLINE table config, ready to persist. **Use this in step 4.** Secrets are masked as `*****`. |
| `rows`                 | array  | Sample records after schema + ingestion transforms.                                                         |
| `sourceRows`           | array  | Raw records before transformation.                                                                          |
| `summary`              | object | Run summary: `nSourceRows`, `nRows`, `nColumns`, and `summary.batch.nMatchingFiles` (files found).          |
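
Before persisting anything, it can help to confirm that the preview actually found data. The sketch below assumes the preview response was saved to a file and that the summary fields sit at `.summary.nRows` and `.summary.batch.nMatchingFiles`, as the table above suggests; `check_preview` is a hypothetical helper.

```bash
# Hypothetical helper: check_preview PREVIEW_FILE
# Succeeds only if the preview matched at least one file and produced rows.
# Missing fields are treated as 0.
check_preview() {
  jq -e '
    (.summary.batch.nMatchingFiles // 0) > 0
    and (.summary.nRows // 0) > 0
  ' "$1" > /dev/null
}
```

A script could then run `check_preview preview.json || echo "preview found no data; check bucket and prefix" >&2` before moving on to steps 3 and 4.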

***

## Step 3: Create the schema

**`POST /schemas`**

Send the `schema` object from the preview response, saved here as `schema.json`. Set its `schemaName` to the table name without the `_OFFLINE` suffix (`my_gcs_table` in these examples).

```bash theme={null}
curl -X POST "$CONTROLLER/schemas" \
  -H "Content-Type: application/json" \
  -d @schema.json
```

```json theme={null}
{ "status": "my_gcs_table successfully added" }
```

***

## Step 4: Create the table

**`POST /tables`**

Send the enriched `tableConfigs.offline` object from the preview response, saved here as `tableConfig.json`. The `ExternalTableSyncTask` block marks the table as external, and the `tierConfigs` array pins it to the `GCS_INTEROPERABLE` tier.

```bash theme={null}
curl -X POST "$CONTROLLER/tables" \
  -H "Content-Type: application/json" \
  -d @tableConfig.json
```

```json theme={null}
{ "status": "Table my_gcs_table_OFFLINE successfully added" }
```

The controller's External Table watcher then discovers the table, runs the first sync, and re-syncs at the `schedule` (cron) interval. There is no separate start call.

<Note>
  The `tableConfigs.offline` returned by `/tables/preview` includes a `tierConfigs` block pre-configured for `GCS_INTEROPERABLE`. Verify the tier backend credentials before creating the table. They normally match the source catalog credentials, but can differ if you use separate HMAC keys or Secret Manager secrets for the deep store.
</Note>
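
Because the preview response masks secrets as `*****`, it may be worth checking the saved config for masked values before posting it. Whether the controller accepts or restores masked values is not stated on this page, so treat the sketch below as a precaution; `has_masked_values` is a hypothetical helper.

```bash
# Hypothetical helper: has_masked_values CONFIG_FILE
# Succeeds (exit 0) if any string value anywhere in the JSON is exactly "*****".
has_masked_values() {
  jq -e '[.. | strings | select(. == "*****")] | length > 0' "$1" > /dev/null
}
```

For example, `if has_masked_values tableConfig.json; then echo "tableConfig.json contains masked values" >&2; fi` flags a config that may need its real credentials (or Secret Manager names) restored.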

<Tip>
  The first sync runs on the watcher's next tick. To kick it off immediately:

  ```bash theme={null}
  curl -X POST "$CONTROLLER/tasks/schedule?taskType=ExternalTableSyncTask&tableName=my_gcs_table_OFFLINE"
  ```
</Tip>

***

## Quickstart: onboard a table end-to-end

The whole flow as one script (inline HMAC mode). It requires `curl` and `jq`. Fill in the variables at the top and run it; the first sync starts automatically once the table is created.

```bash theme={null}
#!/usr/bin/env bash
set -euo pipefail

CONTROLLER="https://<your-controller>"
AUTH="Authorization: Bearer <token>"
BUCKET="<gcs-bucket>"
PREFIX="<prefix>"
REGION="<region>"       # Required field; GCS ignores the value
HMAC_KEY="<hmac-access-id>"
HMAC_SECRET="<hmac-secret>"
TABLE="my_gcs_table"

# 1. Validate & browse — confirms credentials and lists files.
curl -sf -X POST "$CONTROLLER/connections/browse" \
  -H "$AUTH" -H 'Content-Type: application/json' \
  --data-raw "{
    \"connection\": {
      \"type\": \"CATALOG\",
      \"params\": {
        \"catalogType\": \"gcs-interop\",
        \"catalog.gcs-interop.bucketName\": \"$BUCKET\",
        \"catalog.gcs-interop.prefix\": \"$PREFIX\",
        \"catalog.gcs-interop.region\": \"$REGION\",
        \"catalog.gcs-interop.endpoint\": \"https://storage.googleapis.com\",
        \"catalog.gcs-interop.accessKey\": \"$HMAC_KEY\",
        \"catalog.gcs-interop.secretKey\": \"$HMAC_SECRET\",
        \"catalog.gcs-interop.disable.integrity.protections\": \"true\"
      }
    },
    \"path\": \"\"
  }" | jq .

# 2. Preview — infer schema + enriched table config.
curl -sf -X POST "$CONTROLLER/tables/preview" \
  -H "$AUTH" -H 'Content-Type: application/json' \
  --data-raw "{
    \"tableConfig\": {
      \"tableName\": \"${TABLE}_OFFLINE\",
      \"tableType\": \"OFFLINE\",
      \"task\": {
        \"taskTypeConfigsMap\": {
          \"ExternalTableSyncTask\": {
            \"catalogType\": \"gcs-interop\",
            \"executor\": \"controller\",
            \"inputFormat\": \"parquet\",
            \"catalog.gcs-interop.bucketName\": \"$BUCKET\",
            \"catalog.gcs-interop.prefix\": \"$PREFIX\",
            \"catalog.gcs-interop.region\": \"$REGION\",
            \"catalog.gcs-interop.endpoint\": \"https://storage.googleapis.com\",
            \"catalog.gcs-interop.accessKey\": \"$HMAC_KEY\",
            \"catalog.gcs-interop.secretKey\": \"$HMAC_SECRET\",
            \"catalog.gcs-interop.table.namespace\": \"default\",
            \"catalog.gcs-interop.table.tableName\": \"$TABLE\",
            \"catalog.gcs-interop.disable.integrity.protections\": \"true\"
          }
        }
      }
    },
    \"config\": {
      \"inference\": { \"embeddedSchemaEnabled\": true },
      \"sampling\":  { \"min\": 1, \"max\": 100 }
    }
  }" > preview.json

# 3. Extract schema and table config from preview response.
jq --arg t "$TABLE" '.schema | .schemaName = $t' preview.json > schema.json
jq '.tableConfigs.offline'                        preview.json > tableConfig.json

# 4. Create the schema.
curl -sf -X POST "$CONTROLLER/schemas" \
  -H "$AUTH" -H 'Content-Type: application/json' -d @schema.json

# 5. Create the table — watcher starts the first sync automatically.
curl -sf -X POST "$CONTROLLER/tables" \
  -H "$AUTH" -H 'Content-Type: application/json' -d @tableConfig.json

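# 6. (optional) Poll every 10 s, for up to ~10 minutes, until the sync completes.
for _ in $(seq 1 60); do
  if curl -sf "$CONTROLLER/tables/${TABLE}_OFFLINE/externalTable/status" -H "$AUTH" \
    | jq -e '.status == "COMPLETED" and .segmentsUploaded == .filesDiscovered' > /dev/null; then
    echo "Sync complete."
    exit 0
  fi
  sleep 10
done
echo "Sync did not complete within 10 minutes." >&2
exit 1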
```

Once step 6 exits `0`, [verify it's queryable](#verify-its-queryable).
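
The script interpolates shell variables directly into JSON, which breaks if a value contains a double quote or a backslash. One alternative is to let `jq` build the payload so each value is escaped for you. This is a sketch rather than part of the original script; `browse_payload` is a hypothetical helper that reuses the script's variable names.

```bash
# Hypothetical helper: browse_payload [PATH]
# Builds the /connections/browse body from BUCKET, PREFIX, REGION, HMAC_KEY
# and HMAC_SECRET; jq escapes each value as a JSON string.
browse_payload() {
  jq -n \
    --arg bucket "$BUCKET" --arg prefix "$PREFIX" --arg region "$REGION" \
    --arg key "$HMAC_KEY" --arg secret "$HMAC_SECRET" --arg p "${1:-}" '
    {
      connection: {
        type: "CATALOG",
        params: {
          "catalogType": "gcs-interop",
          "catalog.gcs-interop.bucketName": $bucket,
          "catalog.gcs-interop.prefix": $prefix,
          "catalog.gcs-interop.region": $region,
          "catalog.gcs-interop.endpoint": "https://storage.googleapis.com",
          "catalog.gcs-interop.accessKey": $key,
          "catalog.gcs-interop.secretKey": $secret,
          "catalog.gcs-interop.disable.integrity.protections": "true"
        }
      },
      path: $p
    }'
}
```

The output can be piped straight into the request, for example `browse_payload "2024/01/" | curl -sf -X POST "$CONTROLLER/connections/browse" -H "$AUTH" -H 'Content-Type: application/json' --data-binary @-`.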

## Monitor onboarding

Three read-only endpoints report ingestion progress — run status, ingestion checkpoint, and source file count. See [Observability](../observability) for full request and response details.

## Verify it's queryable

When `status` is `COMPLETED` and `segmentsUploaded` matches `filesDiscovered`, run a query to confirm the data is live:

```sql theme={null}
SELECT count(*) FROM my_gcs_table;
```

If `status` is `COMPLETED` but the count is `0`, give segments a moment to load on the servers, then recheck; if it persists, see [Troubleshooting](../troubleshooting).
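
The same check can be scripted over REST. The sketch below assumes your deployment exposes Apache Pinot's standard controller `POST /sql` endpoint and returns the usual broker response with `resultTable.rows`; this page does not confirm that for StarTree Cloud, and `count_rows` is a hypothetical helper. Add the `Authorization` header if your controller requires it.

```bash
# Hypothetical helper: count_rows TABLE_NAME
# Assumption: the controller exposes Apache Pinot's POST /sql endpoint and
# returns the standard broker response containing resultTable.rows.
count_rows() {
  curl -sS -f -X POST "$CONTROLLER/sql" \
    -H "Content-Type: application/json" \
    -d "{\"sql\": \"SELECT count(*) FROM $1\"}" \
    | jq -r '.resultTable.rows[0][0]'
}
```

Running `count_rows my_gcs_table` prints the row count, which a script can compare against `0` before declaring the onboarding done.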

***

## GCS-specific config reference

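All keys go under the `ExternalTableSyncTask` block in `taskTypeConfigsMap`. The connection keys (everything except `table.namespace` and `table.tableName`) also go in `connection.params` for `/connections/browse`.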

| Key                                                 | Required            | Description                                                                          |
| --------------------------------------------------- | ------------------- | ------------------------------------------------------------------------------------ |
| `catalog.gcs-interop.bucketName`                    | Yes                 | GCS bucket name.                                                                     |
| `catalog.gcs-interop.prefix`                        | Yes                 | Key prefix for the Parquet data (e.g. `path/to/data/`).                              |
| `catalog.gcs-interop.region`                        | Yes                 | Region string. GCS ignores the value but the field must be present.                  |
| `catalog.gcs-interop.endpoint`                      | Yes                 | Always `https://storage.googleapis.com`.                                             |
| `catalog.gcs-interop.accessKey`                     | Yes                 | HMAC access ID, or GCP Secret Manager secret name if `keyType=SECRET`.               |
| `catalog.gcs-interop.secretKey`                     | Yes                 | HMAC secret, or GCP Secret Manager secret name if `keyType=SECRET`.                  |
| `catalog.gcs-interop.disable.integrity.protections` | Yes                 | Must be `"true"`. Required for GCS range-GET and PUT compatibility with the AWS SDK. |
| `catalog.gcs-interop.keyType`                       | No                  | Set to `SECRET` to resolve `accessKey`/`secretKey` via GCP Secret Manager.           |
| `catalog.gcs-interop.secretmanagertype`             | If `keyType=SECRET` | Must be `GCS`.                                                                       |
| `catalog.gcs-interop.gcpprojectid`                  | If `keyType=SECRET` | GCP project ID for Secret Manager lookups.                                           |
| `catalog.gcs-interop.gcpkeypath`                    | If `keyType=SECRET` | Path to a service-account key JSON with Secret Manager read access.                  |
| `catalog.gcs-interop.table.namespace`               | No                  | Logical namespace. Use `default` for a flat prefix.                                  |
| `catalog.gcs-interop.table.tableName`               | No                  | Logical table name (any value; used internally).                                     |

***

## What's next

1. **Add indexes for your query patterns.** → [Indexes](../indexes)
2. **Enable caching and preload.** → [Data and Index Caching](../data-and-index-caching)
3. **Protect large-scan queries from OOM.** → [Best Practices & Configs — Query OOM protection](../best-practices-and-configs#query-oom-protection-large-scans)
4. **Monitor ongoing syncs.** → [Observability](../observability)

***

For common questions and failures, see the [FAQ](../faq) and [Troubleshooting](../troubleshooting).
