GCS Data Lake: Onboarding via Data Portal

This feature is available starting in StarTree release 0.14.0. It must be enabled on demand — contact your StarTree representative to have it activated for your environment.

This guide walks through connecting a GCS Data Lake to StarTree using the Data Portal UI. No API calls or JSON configuration are required: Data Portal guides you through catalog connection, table selection, and onboarding setup in a point-and-click interface.

Looking for the API-based approach? See GCS Data Lake: Onboarding via API.

Overview

GCS Data Lake connects to raw Parquet files stored in a Google Cloud Storage bucket, accessed through GCS’s S3-compatible (“interop”) endpoint (storage.googleapis.com). StarTree scans the specified GCS prefix, discovers Parquet files, and makes them queryable through Pinot. Authentication uses HMAC keys (Google’s S3-interop credentials). Generate HMAC keys in the Google Cloud Console → Cloud Storage → Settings → Interoperability.
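To make the interop mechanism concrete, the following is a minimal sketch of how a generic S3-compatible client can be pointed at GCS with HMAC credentials. It assumes the third-party boto3 library is installed, and it assumes Parquet files carry a `.parquet` suffix. The helper names `interop_client_kwargs` and `count_parquet_objects` are illustrative and are not part of StarTree. This is not how StarTree itself connects; it only shows the same endpoint and credential type in use.

```python
GCS_INTEROP_ENDPOINT = "https://storage.googleapis.com"


def interop_client_kwargs(access_id: str, secret: str, region: str) -> dict:
    """Build keyword arguments for boto3.client() targeting the GCS interop endpoint."""
    return {
        "service_name": "s3",
        "endpoint_url": GCS_INTEROP_ENDPOINT,
        "aws_access_key_id": access_id,      # HMAC access ID
        "aws_secret_access_key": secret,     # HMAC secret
        "region_name": region,
    }


def count_parquet_objects(bucket: str, prefix: str, **client_kwargs) -> int:
    """List objects under a prefix and count those whose key ends in .parquet."""
    import boto3  # third-party; imported lazily so the helper above has no dependencies

    client = boto3.client(**client_kwargs)
    paginator = client.get_paginator("list_objects_v2")
    total = 0
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if obj["Key"].endswith(".parquet"):
                total += 1
    return total
```

Calling `count_parquet_objects("my-bucket", "path/to/parquet/data/", **interop_client_kwargs(access_id, secret, region))` exercises both the list and the credential path, which can be a useful sanity check before using the wizard.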

Prerequisites

Before starting, ensure you have:

StarTree 0.14.0 or later with the external table Beta feature enabled and a GCS_INTEROPERABLE tier backend configured for your environment.
A GCS HMAC key (access ID + secret) with storage.objects.get and storage.objects.list permissions on the source bucket.
The GCS bucket name, key prefix, and GCP project region.
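The inputs listed above can be sanity-checked before opening the wizard. The sketch below is an illustrative helper, not a StarTree tool: `check_prerequisites` is a hypothetical name, and the bucket pattern is a simplified form of the GCS naming rules (it does not handle dotted names longer than 63 characters or reserved words).

```python
import re

# Simplified GCS bucket naming check: 3-63 characters, lowercase letters,
# digits, dashes, underscores and dots, starting and ending with a letter
# or digit. This is an approximation, not the full rule set.
_BUCKET_RE = re.compile(r"^[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]$")


def check_prerequisites(bucket, prefix, region, access_id, secret):
    """Return a list of problems; an empty list means the inputs look sane."""
    problems = []
    if not _BUCKET_RE.match(bucket):
        problems.append(f"bucket {bucket!r} is not a plain GCS bucket name (no gs:// scheme, no uppercase)")
    if prefix.startswith("/"):
        problems.append("prefix should not start with '/': object keys have no leading slash")
    if not region:
        problems.append("region is required even though GCS ignores it for routing")
    if not access_id:
        problems.append("HMAC access ID is empty")
    if not secret:
        problems.append("HMAC secret is empty")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites("gs://My_Bucket", "/data", "", "id", ""):
        print("-", problem)
```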

Step 1: Open the External Table Wizard

Log in to Data Portal.
In the left navigation, go to Tables.
Click + Connect External Table.

The wizard opens with a connection configuration screen.

Step 2: Select GCS Data Lake

Choose GCS Data Lake as the catalog type. Fill in the connection details:

Field	Description
GCS Bucket	Name of the GCS bucket containing your Parquet files.
Prefix	Key prefix (folder path) pointing to the Parquet data, e.g. `path/to/parquet/data/`. Matches all objects whose keys start with this string.
Region	GCP region. GCS ignores this value for routing but the field is required.
HMAC Access ID	The HMAC access ID from the Google Cloud Console Interoperability settings.
HMAC Secret	The HMAC secret corresponding to the access ID.
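Because the Prefix field is a plain string match on object keys, a missing trailing slash can pull in more data than intended. The short sketch below illustrates this with made-up keys; `matching_keys` is an illustrative helper, not a StarTree function.

```python
def matching_keys(keys, prefix):
    """Return the keys that a starts-with prefix match would select."""
    return [key for key in keys if key.startswith(prefix)]


keys = [
    "path/to/parquet/data/part-0000.parquet",
    "path/to/parquet/data/2024/part-0001.parquet",
    "path/to/parquet/data_backup/part-0000.parquet",
]

# With the trailing slash, only objects inside data/ match.
with_slash = matching_keys(keys, "path/to/parquet/data/")

# Without it, data_backup/ matches too, because its key starts with the same string.
without_slash = matching_keys(keys, "path/to/parquet/data")

print(len(with_slash), len(without_slash))
```

Ending the prefix with `/` is usually the safer choice when the intent is to select a single folder.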

Click Validate Connection. Data Portal confirms credentials and connectivity via the GCS interop endpoint before proceeding.

Step 3: Browse and Select a Table

Once the connection is validated:

Data Portal lists the available GCS prefixes (directories) under your configured prefix.
Select a prefix to view the Parquet files it contains.
Click the table (prefix) you want to onboard.
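Object storage has no real directories, so the "directories" shown in this step are derived from object keys. The sketch below shows one common way to do that, grouping keys by the next path segment after the configured prefix. It is an assumption about the general technique, not a description of Data Portal's implementation, and `child_prefixes` is a hypothetical helper.

```python
def child_prefixes(keys, base_prefix, delimiter="/"):
    """Return the immediate sub-prefixes ("directories") under base_prefix, sorted."""
    children = set()
    for key in keys:
        if not key.startswith(base_prefix):
            continue
        remainder = key[len(base_prefix):]
        head, sep, _ = remainder.partition(delimiter)
        if sep:  # a delimiter means the key sits inside a sub-prefix, not directly under base
            children.add(base_prefix + head + delimiter)
    return sorted(children)


keys = [
    "lake/orders/part-0.parquet",
    "lake/orders/2024/part-1.parquet",
    "lake/customers/part-0.parquet",
    "lake/README.txt",
    "other/events/part-0.parquet",
]
tables = child_prefixes(keys, "lake/")
print(tables)
```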

StarTree samples the Parquet files to derive a schema automatically.
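As a rough illustration of what schema derivation from samples involves, the sketch below merges per-file column types into a single set of Pinot data types. The type mapping and the widening rules are assumptions made for this example; StarTree's actual inference logic is not documented here and may differ. `derive_schema` is a hypothetical name.

```python
# Assumed mapping from Parquet physical types to Pinot data types.
PARQUET_TO_PINOT = {
    "BOOLEAN": "BOOLEAN",
    "INT32": "INT",
    "INT64": "LONG",
    "FLOAT": "FLOAT",
    "DOUBLE": "DOUBLE",
    "BYTE_ARRAY": "STRING",  # assumes the bytes hold UTF-8 text
}

# When sampled files disagree, widen numeric types; otherwise fall back to STRING.
_WIDENING = {
    frozenset({"INT", "LONG"}): "LONG",
    frozenset({"FLOAT", "DOUBLE"}): "DOUBLE",
}


def derive_schema(sampled_files):
    """Merge a list of {column: parquet_type} dicts into {column: pinot_type}."""
    schema = {}
    for columns in sampled_files:
        for name, parquet_type in columns.items():
            pinot_type = PARQUET_TO_PINOT.get(parquet_type, "BYTES")
            previous = schema.get(name)
            if previous is None or previous == pinot_type:
                schema[name] = pinot_type
            else:
                schema[name] = _WIDENING.get(frozenset({previous, pinot_type}), "STRING")
    return schema


samples = [
    {"id": "INT32", "amount": "FLOAT", "name": "BYTE_ARRAY"},
    {"id": "INT64", "amount": "DOUBLE", "created": "INT64"},
]
merged = derive_schema(samples)
print(merged)
```

Sampling several files rather than one matters when columns were added or widened over time, which is why the review step that follows is worth doing carefully.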

Step 4: Review the Schema

The auto-generated Pinot schema is displayed for review. You can:

Set a time column — select the column to use as the Pinot time dimension (optional; leave blank for no time partitioning).
Rename the schema — provide a custom schema name, or accept the default derived from the prefix.

Click Next when the schema looks correct.
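For orientation, the sketch below builds a Pinot schema document of the kind this step reviews, with an optional time column. The top-level keys follow the Apache Pinot schema format. Treating every non-time column as a dimension and assuming epoch-millisecond time values are simplifications for this example, and `build_pinot_schema` is a hypothetical helper rather than something Data Portal exposes.

```python
import json


def build_pinot_schema(schema_name, columns, time_column=None):
    """Build a Pinot schema dict from {column_name: pinot_data_type}."""
    schema = {"schemaName": schema_name, "dimensionFieldSpecs": []}
    for name, data_type in columns.items():
        if name == time_column:
            continue  # the time column is declared separately below
        schema["dimensionFieldSpecs"].append({"name": name, "dataType": data_type})
    if time_column is not None:
        schema["dateTimeFieldSpecs"] = [
            {
                "name": time_column,
                "dataType": columns[time_column],
                "format": "1:MILLISECONDS:EPOCH",  # assumes epoch-millisecond values
                "granularity": "1:MILLISECONDS",
            }
        ]
    return schema


columns = {"order_id": "LONG", "status": "STRING", "created_at": "LONG"}
with_time = build_pinot_schema("orders", columns, time_column="created_at")
without_time = build_pinot_schema("orders", columns)
print(json.dumps(with_time, indent=2))
```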

Step 5: Configure the Table

Review and adjust the table configuration:

Setting	Default	Notes
Onboarding schedule	Every 5 minutes	Cron expression controlling how often new Parquet files are onboarded.
Null handling	Enabled	Required for schemas that include nullable columns.
Segment push type	Append	Each new batch of Parquet files is ingested as a new set of Pinot segments.

Click Create Table to register the schema and table with Pinot. Data Portal automatically triggers the first onboarding run immediately after creation — no manual step required.
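If you change the onboarding schedule, it helps to know what a cron expression for "every N minutes" looks like. The sketch below assumes the field accepts Quartz-style cron syntax, which is what Apache Pinot's task scheduler uses; confirm the expected syntax in Data Portal before relying on it. Both helper names are illustrative.

```python
def every_n_minutes(n: int) -> str:
    """Return a Quartz-style cron expression that fires every n minutes."""
    if not 1 <= n <= 59:
        raise ValueError("n must be between 1 and 59")
    # Fields: second, minute, hour, day-of-month, month, day-of-week
    return f"0 0/{n} * * * ?"


def fire_minutes(n: int) -> list:
    """Minutes within each hour at which the expression above fires."""
    return list(range(0, 60, n))


print(every_n_minutes(5))
print(fire_minutes(25))
```

Note that intervals which do not divide 60 evenly restart at the top of each hour, so a 25-minute interval fires at minutes 0, 25, and 50, followed by a 10-minute gap.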

Step 6: Monitor Onboarding

Once the table is created, onboarding starts automatically. The table detail view shows the status in real time:

Running — the task is actively reading Parquet files and building Pinot segments.
Completed — onboarding finished successfully.
Failed — onboarding encountered an error. The error message and the number of files discovered vs. segments generated are surfaced to help diagnose the issue.
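The three states above form a simple lifecycle that is easy to poll from a script. The sketch below is a generic polling loop. The `fetch_status` callable and the field names `state`, `error`, `filesDiscovered`, and `segmentsGenerated` are hypothetical stand-ins; substitute whatever your environment's status source actually returns.

```python
import time

TERMINAL_STATES = {"Completed", "Failed"}


def wait_for_onboarding(fetch_status, poll_seconds=10.0, timeout_seconds=1800.0,
                        sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_status() until the run reaches a terminal state or the timeout expires."""
    deadline = clock() + timeout_seconds
    while True:
        status = fetch_status()
        if status["state"] in TERMINAL_STATES:
            return status
        if clock() >= deadline:
            raise TimeoutError("onboarding did not finish before the timeout")
        sleep(poll_seconds)


def summarize(status):
    """Render a one-line summary, including the diagnostic counts on failure."""
    if status["state"] == "Failed":
        return (
            f"failed: {status.get('error', 'unknown error')} "
            f"({status.get('filesDiscovered', '?')} files discovered, "
            f"{status.get('segmentsGenerated', '?')} segments generated)"
        )
    return status["state"].lower()
```

A large gap between files discovered and segments generated usually points at a problem reading or converting the files rather than at discovery or permissions.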

For deeper observability — watcher status, checkpoint values, and per-file counts — see the Observability page.

Pausing Onboarding

To pause scheduled onboarding from Data Portal:

Open the table in the Tables view.
Click Pause Sync. This sets "enabled": "false" on the ExternalTableSyncTask. Any run currently in progress completes normally. Existing segments and the last checkpoint are preserved — when you re-enable, onboarding resumes from where it left off.
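To show what the Pause Sync button changes, the sketch below flips the same flag on an in-memory table config. It assumes the task settings live under `task.taskTypeConfigsMap`, as in standard Apache Pinot table configs; the exact layout in your environment may differ. `set_sync_enabled` is a hypothetical helper and the example schedule value is made up.

```python
import copy

TASK_TYPE = "ExternalTableSyncTask"


def set_sync_enabled(table_config, enabled):
    """Return a copy of table_config with the sync task's "enabled" flag set."""
    updated = copy.deepcopy(table_config)
    task_configs = updated.setdefault("task", {}).setdefault("taskTypeConfigsMap", {})
    # Pinot task config values are strings, hence "true"/"false" rather than booleans.
    task_configs.setdefault(TASK_TYPE, {})["enabled"] = "true" if enabled else "false"
    return updated


original = {
    "tableName": "orders",
    "task": {"taskTypeConfigsMap": {TASK_TYPE: {"schedule": "0 0/5 * * * ?", "enabled": "true"}}},
}
paused = set_sync_enabled(original, False)
print(paused["task"]["taskTypeConfigsMap"][TASK_TYPE])
```

Only the flag changes; the schedule and every other setting are left as they were, which is consistent with onboarding resuming from the last checkpoint when re-enabled.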

Frequently Asked Questions

The Validate step fails — what should I check?

Confirm the HMAC key has storage.objects.get and storage.objects.list permissions on the bucket.
Verify that the bucket name and prefix are spelled correctly. GCS bucket names are globally unique, so a mistyped name refers to a different bucket or to none at all.
Ensure the HMAC key was generated under the correct GCP project.

Can I onboard multiple prefixes from the same bucket?

Yes. Each prefix is registered as an independent Pinot table with its own onboarding schedule. Start the wizard again and use the same bucket with a different prefix.

The table was created but onboarding hasn’t started. What should I check?

Data Portal triggers the first onboarding run automatically after table creation. If onboarding hasn’t started, check the table’s detail page for an error status and review the error message. You can also trigger a run manually via the trigger API.
