This feature is available starting in StarTree release 0.14.0. It must be enabled on demand — contact your StarTree representative to have it activated for your environment.
This guide walks through connecting an external data source to StarTree using the Data Portal UI. No API calls or JSON configuration are required: Data Portal handles catalog connection, table selection, and ingestion setup through a point-and-click interface.
Looking for the API-based approach? See the API Onboarding Guide.
Supported Sources
Data Portal currently supports the following external catalog types in Beta:
| Source | Description |
|---|---|
| S3 Data Lake | Raw Parquet files stored directly in an S3 bucket — no catalog service required. |
| AWS Glue (Iceberg REST) | Iceberg tables managed by AWS Glue, accessed via the Iceberg REST protocol with SigV4 authentication. |
| AWS S3 Tables (Iceberg REST) | Iceberg-compatible tables in AWS S3 Tables buckets, accessed via the Iceberg REST protocol. |
Prerequisites
Before starting, ensure you have:
- StarTree 0.14.0 or later with the external table Beta feature enabled for your environment.
- AWS credentials (access key + secret key) with read permissions on both the catalog service and the underlying S3 data.
- For S3 Tables: the full ARN of your S3 Tables bucket.
- For Glue: your AWS account ID (used as the Glue warehouse identifier) and the target Glue database name.
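The permissions in the prerequisites above can be granted with a standard IAM policy. The sketch below builds one as a Python dict, using the actions named later in this guide's FAQ (glue:GetTable, glue:GetDatabase, s3:GetObject, plus s3:ListBucket for browsing); the wildcard resources are placeholders you should scope to your own account, database, and bucket:

```python
import json

# Illustrative read-only policy for catalog plus data access.
# Resource ARNs are placeholders -- scope them before use.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "CatalogRead",
            "Effect": "Allow",
            "Action": ["glue:GetTable", "glue:GetDatabase"],
            "Resource": "*",
        },
        {
            "Sid": "DataRead",
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": "*",
        },
    ],
}

print(json.dumps(policy, indent=2))
```

For S3 Tables, substitute the equivalent s3tables:* read actions in the catalog statement.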
Step 1: Open the External Table Wizard
- Log in to Data Portal.
- In the left navigation, go to Tables.
- Click + Connect External Table.
The wizard opens with a connection configuration screen.
Step 2: Select a Catalog Provider
Choose the catalog type that matches your data source:
- S3 Data Lake — for raw Parquet files on S3.
- Iceberg REST — for Iceberg tables managed by AWS Glue and S3 Tables.
Fill in the credentials and connection details for your chosen catalog type. The fields for each of the three variants (S3 Data Lake, Glue REST, S3 Tables REST) are listed below.

S3 Data Lake
| Field | Description |
|---|---|
| S3 Bucket | Name of the S3 bucket containing your Parquet files. |
| Prefix | Key prefix (folder path) pointing to the Parquet data, e.g. path/to/parquet/data/. |
| Region | AWS region where the bucket is located. |
| Access Key | AWS access key ID. |
| Secret Key | AWS secret access key. |
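The S3 Bucket and Prefix fields together identify the data location. A minimal sketch of how the two combine (the helper and example values are illustrative; the prefix is a key path relative to the bucket root, as in the example above):

```python
def s3_location(bucket: str, prefix: str) -> str:
    """Combine the S3 Bucket and Prefix fields into the full data
    location. Strips any accidental leading slash from the prefix."""
    return f"s3://{bucket}/{prefix.lstrip('/')}"

print(s3_location("my-data-bucket", "path/to/parquet/data/"))
# s3://my-data-bucket/path/to/parquet/data/
```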
Glue REST

Details

| Field | Description |
|---|---|
| Catalog Connection Name | A unique name to identify this connection in Data Portal. |
Metastore — credentials for authenticating with the AWS Glue catalog API.

| Field | Description |
|---|---|
| REST Service | Set to Glue. |
| Warehouse | Your AWS account ID — used as the Glue warehouse identifier. |
| Access Key | AWS access key ID for Glue catalog API access. |
| Secret Key | AWS secret access key for Glue catalog API access. |
| Region | AWS region where your Glue catalog is located, e.g. us-east-1. |
Storage — credentials for reading the underlying Parquet data files from S3.

| Field | Description |
|---|---|
| Access Key | AWS access key ID for S3 data access. Can be the same as the Metastore key if the same principal has both permissions. |
| Secret Key | AWS secret access key for S3 data access. |
| Region | AWS region where the S3 data files are stored. |
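A common validation failure for Glue REST is using an account alias instead of the numeric account ID in the Warehouse field (see the FAQ below). A quick local sanity check, assuming the standard 12-digit AWS account ID format:

```python
import re

def looks_like_account_id(warehouse: str) -> bool:
    """Glue REST expects the numeric 12-digit AWS account ID as the
    warehouse identifier, not an account alias."""
    return re.fullmatch(r"\d{12}", warehouse) is not None

print(looks_like_account_id("123456789012"))  # True
print(looks_like_account_id("my-company"))    # False
```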
S3 Tables REST

Details

| Field | Description |
|---|---|
| Catalog Connection Name | A unique name to identify this connection in Data Portal. |
Metastore — credentials for authenticating with the S3 Tables REST catalog API.

| Field | Description |
|---|---|
| REST Service | Set to S3Tables. |
| Table Bucket ARN | Full ARN of the S3 Tables bucket, e.g. arn:aws:s3tables:<region>:<account-id>:bucket/<bucket-name>. |
| Access Key | AWS access key ID for S3 Tables catalog API access. |
| Secret Key | AWS secret access key for S3 Tables catalog API access. |
| Region | AWS region where the S3 Tables bucket is located. |
Storage — credentials for reading the underlying Parquet data files.

| Field | Description |
|---|---|
| Access Key | AWS access key ID for S3 data access. Can be the same as the Metastore key if the same principal has both permissions. |
| Secret Key | AWS secret access key for S3 data access. |
| Region | AWS region where the S3 data files are stored. |
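Malformed Table Bucket ARNs are easy to catch before clicking Validate. This sketch parses the ARN format shown above (arn:aws:s3tables:&lt;region&gt;:&lt;account-id&gt;:bucket/&lt;bucket-name&gt;); it is an illustrative pre-flight check, not something Data Portal requires:

```python
import re

# Mirrors arn:aws:s3tables:<region>:<account-id>:bucket/<bucket-name>
ARN_PATTERN = re.compile(
    r"^arn:aws:s3tables:(?P<region>[a-z0-9-]+)"
    r":(?P<account>\d{12}):bucket/(?P<bucket>[a-z0-9.-]+)$"
)

def parse_table_bucket_arn(arn: str) -> dict:
    """Split a Table Bucket ARN into its components, raising on
    anything that does not match the documented shape."""
    m = ARN_PATTERN.match(arn)
    if m is None:
        raise ValueError(f"not a valid S3 Tables bucket ARN: {arn!r}")
    return m.groupdict()

parts = parse_table_bucket_arn(
    "arn:aws:s3tables:us-east-1:123456789012:bucket/analytics-tables"
)
print(parts)
```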
Step 3: Validate the Connection
Click Validate Connection. Data Portal calls the catalog’s validate endpoint and confirms credentials and connectivity before proceeding.
Step 4: Browse and Select a Table
Once the connection is validated:
- Data Portal lists the available namespaces (Glue databases, S3 Tables namespaces, or S3 prefixes).
- Select a namespace to expand its tables.
- Click the table you want to onboard.
Data Portal reads the Iceberg schema and derives a Pinot schema automatically.
Step 5: Review the Schema
The auto-generated Pinot schema is displayed for review. You can:
- Set a time column — select the column to use as the Pinot time dimension (optional; leave blank for no time partitioning).
- Include or exclude partition columns — toggle whether Iceberg partition columns are added as Pinot dimension columns.
- Rename the schema — provide a custom schema name, or accept the default derived from the table name.
Click Next when the schema looks correct.
Step 6: Configure the Table
Review and adjust the table configuration:
| Setting | Default | Notes |
|---|---|---|
| Ingestion schedule | Every 5 minutes | Cron expression controlling how often new Iceberg snapshots are ingested. |
| Null handling | Enabled | Required for Iceberg schemas that include nullable columns. |
| Segment push type | Append | Each new Iceberg snapshot is ingested as a new set of Pinot segments. |
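The default schedule fires every 5 minutes. As a model of how that cadence behaves (the actual scheduling happens server-side from the cron expression you configure; this helper is only illustrative), the next runs land on the next 5-minute boundaries:

```python
from datetime import datetime, timedelta

def next_runs(now: datetime, every_min: int = 5, count: int = 3):
    """Model of an 'every N minutes' cadence: upcoming run times
    aligned to N-minute boundaries after `now`. Illustrative only."""
    base = now.replace(second=0, microsecond=0)
    # Round up to the next N-minute boundary (never the current minute).
    offset = (every_min - base.minute % every_min) % every_min or every_min
    first = base + timedelta(minutes=offset)
    return [first + timedelta(minutes=every_min * i) for i in range(count)]

runs = next_runs(datetime(2024, 1, 1, 12, 2), every_min=5, count=3)
print([r.strftime("%H:%M") for r in runs])  # ['12:05', '12:10', '12:15']
```

Each run checks for new Iceberg snapshots since the last checkpoint; if nothing changed, the run is a no-op.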
Click Create Table to register the schema and table with Pinot. Data Portal automatically triggers the first ingestion run immediately after creation — no manual step required.
Step 7: Monitor Ingestion
Once the table is created, ingestion starts automatically. The table detail view shows the status in real time:
- Running — the task is actively reading Iceberg snapshots and building Pinot segments.
- Completed — ingestion finished successfully. The last ingested snapshot ID is shown.
- Failed — ingestion encountered an error. The error message and the number of files discovered vs. segments generated are surfaced to help diagnose the issue.
For deeper observability — watcher status, checkpoint values, and per-snapshot file counts — see the Observability page.
Pausing Ingestion
To pause scheduled ingestion from Data Portal:
- Open the table in the Tables view.
- Click Pause Ingestion.
This sets "enabled": "false" on the IcebergIngestionTask. Any run currently in progress completes normally. Existing segments and the last checkpoint are preserved — when you re-enable, ingestion resumes from where it left off.
Frequently Asked Questions
The Validate step fails — what should I check?
- Confirm the access key has glue:GetTable, glue:GetDatabase, and s3:GetObject permissions (or equivalent for S3 Tables).
- Verify the region matches where your Glue database or S3 bucket lives.
- For Glue REST, ensure the warehouse value is your numeric AWS account ID, not an account alias.
Can I onboard multiple tables from the same catalog?
Yes. After creating the first table, start the wizard again and reuse the same connection credentials. Each table is registered as an independent Pinot table with its own ingestion schedule.
The table was created but ingestion hasn’t started — what should I check?
Data Portal triggers the first ingestion run automatically after table creation. If ingestion hasn’t started, check the table’s detail page for an error status and review the error message. You can also trigger a run manually via the trigger API.