This guide explains how to use the Pinot Controller’s external table catalog APIs to connect to a catalog, explore its tables, and register a Pinot schema and OFFLINE table with automated ingestion. Catalog discovery uses /externalTable/catalog/*; table creation uses the standard Controller POST /schemas and POST /tables APIs. Each section covers one catalog provider. Within each section, discovery calls use exact JSON bodies; the create sections document a reference payload for schema and table authoring.
Prefer a point-and-click setup? See the Data Portal Onboarding Guide.
How It Works
Onboarding an external table follows a linear discovery-to-ingestion workflow:
- Validate your catalog credentials and connectivity.
- Discover available namespaces and tables in the catalog.
- Create the Pinot schema and table — this also registers the ExternalTableSyncTask on a cron schedule.
- Trigger the first ingestion run manually, since the scheduled task does not fire immediately after table creation.
API Endpoints Quick Reference
| Step | Method | Endpoint | Purpose |
|---|---|---|---|
| 1 | POST | /externalTable/catalog/validate | Validate catalog connection |
| 2 | POST | /externalTable/catalog/namespaces | List namespaces |
| 3 | POST | /externalTable/catalog/tables/list | List tables in a namespace |
| 4a | POST | /schemas | Register Pinot schema (standard Controller API) |
| 4b | POST | /tables | Create OFFLINE table with ExternalTableSyncTask in task.taskTypeConfigsMap (standard Controller API) |
| 5 | POST | /periodictask/run | Manually trigger first ingestion run |
Recommended Workflow
Run these steps in order when onboarding a new external table: validate, list namespaces, list tables, create the schema and table, then trigger the first run.
Pausing Ingestion
To stop the ExternalTableSyncTask from running on its cron schedule, set enabled to false in the task config and update the table via the Pinot Controller API.
"enabled": "false" prevents the scheduler from creating new ingestion task instances. Any run currently in progress will complete normally. To resume ingestion, set "enabled": "true" and update the table again.
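As a sketch, the relevant fragment of the table config with ingestion paused might look like the following; the schedule and inputFormat values shown are placeholders, and all other task fields stay unchanged:

```json
{
  "task": {
    "taskTypeConfigsMap": {
      "ExternalTableSyncTask": {
        "enabled": "false",
        "schedule": "0 */30 * * * ?",
        "inputFormat": "parquet"
      }
    }
  }
}
```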
Note: Pausing does not delete existing segments or checkpoints. When you re-enable the task, ingestion resumes from the last recorded checkpoint — no data is re-ingested.
Catalog Providers
1. S3 Data Lake
What it is: For raw Parquet files on S3 that are not managed by an Iceberg catalog service. There is no catalog REST endpoint — Pinot reads files directly from the specified S3 bucket and prefix. The namespace and table discovery APIs still work but return values derived from the S3 path.
catalogType: "s3-catalog"
Note: Unlike the other catalog types, this provider uses accessKey / secretKey (not restAccessKeyId / restSecretAccessKey) for credentials.
1.1 Validate
POST /externalTable/catalog/validate
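As a sketch, a validate body for this provider built from the field reference later in this guide; the catalogConfig wrapper is an assumption, and the bucket, prefix, and credentials are illustrative placeholders:

```json
{
  "catalogConfig": {
    "catalogType": "s3-catalog",
    "bucketName": "my-data-bucket",
    "prefix": "events/parquet/",
    "accessKey": "<AWS_ACCESS_KEY_ID>",
    "secretKey": "<AWS_SECRET_ACCESS_KEY>"
  }
}
```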
1.2 Create Pinot Table
Register the logical table with POST /schemas (Pinot schema JSON) followed by POST /tables (body is the tableConfig object in the JSON below). Use POST /tables/preview for an enriched draft when needed.
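For reference, a minimal Pinot schema sketch for the POST /schemas step; the schema name, column names, and types are placeholders to be replaced with the columns of your Parquet files:

```json
{
  "schemaName": "events",
  "dimensionFieldSpecs": [
    { "name": "userId", "dataType": "STRING" },
    { "name": "country", "dataType": "STRING" }
  ],
  "metricFieldSpecs": [
    { "name": "amount", "dataType": "DOUBLE" }
  ],
  "dateTimeFieldSpecs": [
    {
      "name": "eventTime",
      "dataType": "TIMESTAMP",
      "format": "1:MILLISECONDS:TIMESTAMP",
      "granularity": "1:MILLISECONDS"
    }
  ]
}
```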
2. Glue REST
What it is: Connects to AWS Glue as an Iceberg catalog using the Iceberg REST protocol. Uses AWS SigV4 for both the Glue catalog API (rest* fields) and S3 data access (storage* fields).
catalogType: "iceberg-rest" with "serviceType": "glue"
2.1 Validate
POST /externalTable/catalog/validate
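As a sketch, a Glue REST validate body assembled from the field reference later in this guide; the endpoint URL, account ID, region, and credentials are placeholders, and the catalogConfig wrapper is an assumption:

```json
{
  "catalogConfig": {
    "catalogType": "iceberg-rest",
    "serviceType": "glue",
    "restUri": "https://glue.us-east-1.amazonaws.com/iceberg",
    "warehouse": "123456789012",
    "restAuthType": "aws-sigv4",
    "restAccessKeyId": "<AWS_ACCESS_KEY_ID>",
    "restSecretAccessKey": "<AWS_SECRET_ACCESS_KEY>",
    "restRegion": "us-east-1",
    "restService": "glue",
    "storageAuthType": "aws-sigv4",
    "storageAccessKeyId": "<AWS_ACCESS_KEY_ID>",
    "storageSecretAccessKey": "<AWS_SECRET_ACCESS_KEY>",
    "storageRegion": "us-east-1"
  }
}
```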
2.2 List Namespaces
POST /externalTable/catalog/namespaces
2.3 List Tables
POST /externalTable/catalog/tables/list
namespace — the Glue database name to list tables from.
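As a sketch, the list-tables body repeats the connection fields from the validate call and adds the namespace; values are placeholders, and the storage* fields are omitted here for brevity:

```json
{
  "catalogConfig": {
    "catalogType": "iceberg-rest",
    "serviceType": "glue",
    "namespace": "my_glue_database",
    "restAuthType": "aws-sigv4",
    "restAccessKeyId": "<AWS_ACCESS_KEY_ID>",
    "restSecretAccessKey": "<AWS_SECRET_ACCESS_KEY>",
    "restRegion": "us-east-1",
    "restService": "glue"
  }
}
```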
2.4 Create Pinot Table
Register the logical table with POST /schemas (Pinot schema JSON) followed by POST /tables (body is the tableConfig object in the JSON below). Use POST /tables/preview for an enriched draft when needed.
Note: The ExternalTableSyncTask config in create uses the hierarchical catalog.iceberg-rest.auth.* key format. Catalog-backed ingestion is selected when the task config includes a top-level catalogType.
3. Glue Native
What it is: Connects to AWS Glue using the native Glue SDK (as opposed to the Iceberg REST protocol). Simpler to configure — no restUri or serviceType needed. The Glue database to use is specified by the database field.
catalogType: "glue-catalog"
3.1 Validate
POST /externalTable/catalog/validate
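As a sketch, a Glue Native validate body based on the field reference later in this guide; the region, database, and credentials are placeholders, and the catalogConfig wrapper is an assumption:

```json
{
  "catalogConfig": {
    "catalogType": "glue-catalog",
    "region": "us-east-1",
    "database": "my_glue_database",
    "restAuthType": "aws-sigv4",
    "restAccessKeyId": "<AWS_ACCESS_KEY_ID>",
    "restSecretAccessKey": "<AWS_SECRET_ACCESS_KEY>",
    "restRegion": "us-east-1",
    "restService": "glue",
    "storageAuthType": "aws-sigv4",
    "storageAccessKeyId": "<AWS_ACCESS_KEY_ID>",
    "storageSecretAccessKey": "<AWS_SECRET_ACCESS_KEY>",
    "storageRegion": "us-east-1"
  }
}
```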
3.2 List Namespaces
POST /externalTable/catalog/namespaces
3.3 List Tables
POST /externalTable/catalog/tables/list
3.4 Create Pinot Table
Register the logical table with POST /schemas (Pinot schema JSON) followed by POST /tables (body is the tableConfig object in the JSON below). Use POST /tables/preview for an enriched draft when needed.
4. Nessie REST
What it is: Connects to a Project Nessie server using the Iceberg REST protocol. Nessie itself requires no authentication in this configuration (restAuthType: "none"); only S3 credentials are needed to read the underlying data files.
catalogType: "iceberg-rest-s3" with "serviceType": "nessie"
4.1 Validate
POST /externalTable/catalog/validate
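As a sketch, a Nessie validate body; no catalog credentials are needed (restAuthType is "none"), only S3 storage credentials. The server URL and credentials are placeholders, and the catalogConfig wrapper is an assumption:

```json
{
  "catalogConfig": {
    "catalogType": "iceberg-rest-s3",
    "serviceType": "nessie",
    "restUri": "http://nessie.example.com:19120/iceberg",
    "restAuthType": "none",
    "storageAuthType": "aws-sigv4",
    "storageAccessKeyId": "<AWS_ACCESS_KEY_ID>",
    "storageSecretAccessKey": "<AWS_SECRET_ACCESS_KEY>",
    "storageRegion": "us-east-1"
  }
}
```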
4.2 List Namespaces
POST /externalTable/catalog/namespaces
4.3 List Tables
POST /externalTable/catalog/tables/list
4.4 Create Pinot Table
Register the logical table with POST /schemas (Pinot schema JSON) followed by POST /tables (body is the tableConfig object in the JSON below). Use POST /tables/preview for an enriched draft when needed.
5. S3 Tables REST
What it is: Connects to AWS S3 Tables, a managed Iceberg-compatible table storage service. Uses the Iceberg REST protocol. Requires an additional tableBucketArn field that identifies the S3 Tables bucket.
catalogType: "iceberg-rest" with "serviceType": "s3Tables"
5.1 Validate
POST /externalTable/catalog/validate
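As a sketch, an S3 Tables validate body; the endpoint URL, ARN, region, and credentials are placeholders, and the catalogConfig wrapper is an assumption:

```json
{
  "catalogConfig": {
    "catalogType": "iceberg-rest",
    "serviceType": "s3Tables",
    "restUri": "https://s3tables.us-east-1.amazonaws.com/iceberg",
    "tableBucketArn": "arn:aws:s3tables:us-east-1:123456789012:bucket/my-table-bucket",
    "restAuthType": "aws-sigv4",
    "restAccessKeyId": "<AWS_ACCESS_KEY_ID>",
    "restSecretAccessKey": "<AWS_SECRET_ACCESS_KEY>",
    "restRegion": "us-east-1",
    "restService": "s3tables",
    "storageAuthType": "aws-sigv4",
    "storageAccessKeyId": "<AWS_ACCESS_KEY_ID>",
    "storageSecretAccessKey": "<AWS_SECRET_ACCESS_KEY>",
    "storageRegion": "us-east-1"
  }
}
```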
5.2 List Namespaces
POST /externalTable/catalog/namespaces
5.3 List Tables
POST /externalTable/catalog/tables/list
5.4 Create Pinot Table
Register the logical table with POST /schemas (Pinot schema JSON) followed by POST /tables (body is the tableConfig object in the JSON below). Use POST /tables/preview for an enriched draft when needed.
Request Body Field Reference
catalogConfig — Common Fields
| Field | Applies to | Description |
|---|---|---|
| namespace | All (steps 2–3) | Namespace (Glue database, Nessie namespace, etc.) containing the target table. Required for List Tables and Create. |
| tableName | Create | Name of the Iceberg table. Required for Create. |
| restUri | iceberg-rest, iceberg-rest-s3 | URL of the Iceberg REST catalog endpoint. |
| serviceType | iceberg-rest, iceberg-rest-s3 | Identifies the backing service: "glue", "nessie", or "s3Tables". |
| warehouse | iceberg-rest (glue) | AWS account ID — used as the Glue warehouse identifier. |
| tableBucketArn | iceberg-rest (s3Tables) | Full ARN of the S3 Tables bucket. |
| restAuthType | iceberg-rest, iceberg-rest-s3, glue-catalog | Auth type for the catalog REST API: "aws-sigv4" or "none". |
| restAccessKeyId / restSecretAccessKey | iceberg-rest, glue-catalog | AWS credentials for catalog API access. |
| restRegion | iceberg-rest, glue-catalog | AWS region for the catalog endpoint. |
| restService | iceberg-rest, glue-catalog | AWS service name for SigV4 signing: "glue" or "s3tables". |
| storageAuthType | iceberg-rest, iceberg-rest-s3, glue-catalog | Auth type for S3 data access: always "aws-sigv4". |
| storageAccessKeyId / storageSecretAccessKey | iceberg-rest, iceberg-rest-s3, glue-catalog | AWS credentials for reading Parquet data from S3. |
| storageRegion | iceberg-rest, iceberg-rest-s3, glue-catalog | AWS region for S3 data access. |
| region / database | glue-catalog | Region and default database for native Glue SDK access. |
| bucketName / prefix | s3-catalog | S3 bucket name and key prefix pointing to the Parquet files. The prefix is passed to the S3 ListObjectsV2 API and will match all objects whose keys start with this string. |
| accessKey / secretKey | s3-catalog | AWS credentials (note: different field names from other providers). |
tableConfig — Key Fields
| Field | Description |
|---|---|
| tableName | Pinot table name. Convention: <schemaName>_OFFLINE. |
| tableType | Always "OFFLINE" for Iceberg ingestion. |
| segmentsConfig.timeColumnName | Time column for Pinot segments. Can be null if no time dimension. |
| segmentsConfig.retentionTimeValue / retentionTimeUnit | How long Pinot retains segments. |
| segmentsConfig.segmentPushType | Always "APPEND" — new Iceberg snapshots are appended as new segments. |
| tableIndexConfig.nullHandlingEnabled | Set to true to handle nullable columns from Iceberg schemas. |
| task.taskTypeConfigsMap.ExternalTableSyncTask.schedule | Cron expression for ingestion frequency. "0 */30 * * * ?" runs every 30 minutes. |
| task.taskTypeConfigsMap.ExternalTableSyncTask.inputFormat | Always "parquet" — catalog data files use Parquet. |
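Putting the key fields together, a hedged tableConfig sketch for POST /tables; names and values are placeholders, and the catalog connection settings (credentials, restUri, etc.) that also belong in the task config block are omitted here:

```json
{
  "tableName": "events_OFFLINE",
  "tableType": "OFFLINE",
  "segmentsConfig": {
    "schemaName": "events",
    "timeColumnName": "eventTime",
    "retentionTimeValue": "365",
    "retentionTimeUnit": "DAYS",
    "segmentPushType": "APPEND",
    "replication": "1"
  },
  "tableIndexConfig": {
    "nullHandlingEnabled": true
  },
  "tenants": {},
  "task": {
    "taskTypeConfigsMap": {
      "ExternalTableSyncTask": {
        "enabled": "true",
        "schedule": "0 */30 * * * ?",
        "inputFormat": "parquet",
        "catalogType": "iceberg-rest"
      }
    }
  },
  "metadata": {}
}
```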
Frequently Asked Questions
Why isn’t my table ingesting data after creation? The ExternalTableSyncTask runs on a cron schedule (default every 30 minutes) and does not fire automatically on table creation. You must manually trigger the first run:
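A hedged sketch of that first trigger; the query parameter names shown (taskname, tableName) follow the open-source Pinot periodic-task API and are assumptions here, as is the table name:

```
POST /periodictask/run?taskname=ExternalTableSyncTask&tableName=events_OFFLINE
```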
Which catalog type should I use?
| My setup | Use |
|---|---|
| Raw Parquet files directly on S3 (no catalog service) | s3-catalog (S3 Data Lake) |
| AWS Glue via the Iceberg REST protocol | iceberg-rest with serviceType: "glue" (Glue REST) |
| AWS Glue via the native Glue SDK | glue-catalog (Glue Native) |
| Project Nessie server | iceberg-rest-s3 with serviceType: "nessie" (Nessie REST) |
| AWS S3 Tables | iceberg-rest with serviceType: "s3Tables" (S3 Tables REST) |
What credentials do I need? For all AWS-backed catalog types (iceberg-rest, glue-catalog, iceberg-rest-s3), you need two sets of AWS credentials:
- Catalog credentials (restAccessKeyId / restSecretAccessKey) — to authenticate against the catalog API (Glue, S3 Tables).
- Storage credentials (storageAccessKeyId / storageSecretAccessKey) — to read Parquet data files from S3.
For the S3 Data Lake provider (s3-catalog), only one set is needed, using the field names accessKey / secretKey.
What file format does Iceberg ingestion support? Only Parquet. Set "inputFormat": "parquet" in the ExternalTableSyncTask config. Iceberg-managed catalog tables use Parquet data files by default.
Can I set a custom ingestion schedule? Yes. The schedule field in ExternalTableSyncTask accepts a standard cron expression. The default "0 */30 * * * ?" runs every 30 minutes. To run every hour, use "0 0 * * * ?". The schedule applies to all subsequent automatic runs; the first run must always be triggered manually.
How do I pause ingestion without deleting the table? Set "enabled": "false" in the ExternalTableSyncTask config and update the table. This stops the scheduler from creating new ingestion runs while preserving all existing segments and the last checkpoint. Re-enable by setting "enabled": "true". See Pausing Ingestion for the full example.
What happens if timeColumnName is null? Pinot creates the table without a time dimension. Segments are still ingested and queryable, but time-based retention and time-partition pruning are disabled. Set timeColumnName to a timestamp column in your Iceberg schema if you need those features.
My Validate call succeeds but List Namespaces returns nothing — why? The validate endpoint only confirms connectivity and credential validity. An empty namespace list typically means the credentials have access to the catalog service but the account or region contains no databases, or the warehouse / database field points to the wrong scope. Double-check that the region, warehouse, and database fields match your actual Glue or S3 Tables configuration.
Can I use IAM roles instead of access keys? Cross-account IAM role-based access is not supported yet.
How do I change the ingestion schedule after table creation? Update the ExternalTableSyncTask.schedule field in the table config.
Where can I monitor ingestion health? See the Observability page for the Watcher Status, Checkpoint, and File Count APIs.

