Prefer a point-and-click setup? See the Data Portal Onboarding Guide.
How It Works
Onboarding an external table follows a linear discovery-to-ingestion workflow:

- Validate your catalog credentials and connectivity.
- Discover available namespaces and tables in the catalog.
- Create the Pinot schema and table, which also registers the `IcebergIngestionTask` on a cron schedule.
- Trigger the first ingestion run manually, since the scheduled task does not fire immediately after table creation.
API Endpoints Quick Reference
| Step | Method | Endpoint | Purpose |
|---|---|---|---|
| 1 | POST | /iceberg/catalog/validate | Validate catalog connection |
| 2 | POST | /iceberg/catalog/namespaces | List namespaces |
| 3 | POST | /iceberg/catalog/tables/list | List tables in a namespace |
| 4 | POST | /iceberg/catalog/tables | Create Pinot table and schema |
| 5 | POST | /periodictask/run | Manually trigger first ingestion run |
Recommended Workflow
Run these steps in order when onboarding a new External table.

Pausing Ingestion
To stop the `IcebergIngestionTask` from running on its cron schedule, set `enabled` to `"false"` in the task config and update the table via the Pinot Controller API.
Setting `"enabled": "false"` prevents the scheduler from creating new ingestion task instances. Any run currently in progress will complete normally. To resume ingestion, set `"enabled": "true"` and update the table again.
Note: Pausing does not delete existing segments or checkpoints. When you re-enable the task, ingestion resumes from the last recorded checkpoint — no data is re-ingested.
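The pause toggle lives inside the table config's task section. A minimal sketch of the relevant fragment, assuming the other task fields stay as they were at creation:

```json
{
  "task": {
    "taskTypeConfigsMap": {
      "IcebergIngestionTask": {
        "schedule": "0 */30 * * * ?",
        "inputFormat": "parquet",
        "enabled": "false"
      }
    }
  }
}
```

Flip `"enabled"` back to `"true"` and update the table again to resume scheduled runs.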
Catalog Providers
1. S3 Data Lake
What it is: For raw Parquet files on S3 that are not managed by an Iceberg catalog service. There is no catalog REST endpoint; Pinot reads files directly from the specified S3 bucket and prefix. The namespace and table discovery APIs still work but return values derived from the S3 path.
catalogType: "s3-catalog"
Note: Unlike the other catalog types, this provider uses `accessKey` / `secretKey` (not `restAccessKeyId` / `restSecretAccessKey`) for credentials.
1.1 Validate
POST /iceberg/catalog/validate
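The original request-body example is not shown above. A minimal sketch built from the documented `s3-catalog` fields; the bucket, prefix, and credential values are placeholders, and the exact request envelope may differ in your deployment:

```json
{
  "catalogConfig": {
    "catalogType": "s3-catalog",
    "bucketName": "my-data-lake-bucket",
    "prefix": "tables/orders/",
    "accessKey": "<AWS_ACCESS_KEY_ID>",
    "secretKey": "<AWS_SECRET_ACCESS_KEY>"
  }
}
```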
1.2 Create Pinot Table
POST /iceberg/catalog/tables
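A hedged sketch of a create request, combining the documented `catalogConfig` fields for `s3-catalog` with the key `tableConfig` fields from the reference table below. Names like `orders`, `order_ts`, and the credential values are placeholders; the top-level envelope (and any accompanying Pinot schema payload) is an assumption and may differ in your deployment:

```json
{
  "catalogConfig": {
    "catalogType": "s3-catalog",
    "bucketName": "my-data-lake-bucket",
    "prefix": "tables/orders/",
    "accessKey": "<AWS_ACCESS_KEY_ID>",
    "secretKey": "<AWS_SECRET_ACCESS_KEY>",
    "namespace": "orders_ns",
    "tableName": "orders"
  },
  "tableConfig": {
    "tableName": "orders_OFFLINE",
    "tableType": "OFFLINE",
    "segmentsConfig": {
      "timeColumnName": "order_ts",
      "retentionTimeValue": "365",
      "retentionTimeUnit": "DAYS",
      "segmentPushType": "APPEND"
    },
    "tableIndexConfig": {
      "nullHandlingEnabled": true
    },
    "task": {
      "taskTypeConfigsMap": {
        "IcebergIngestionTask": {
          "schedule": "0 */30 * * * ?",
          "inputFormat": "parquet"
        }
      }
    }
  }
}
```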
2. Glue REST
What it is: Connects to AWS Glue as an Iceberg catalog using the Iceberg REST protocol. Uses AWS SigV4 for both the Glue catalog API (rest* fields) and S3 data access (storage* fields).
catalogType: "iceberg-rest" with "serviceType": "glue"
2.1 Validate
POST /iceberg/catalog/validate
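The original request body is not shown above. A sketch assembled from the documented Glue REST fields; the `restUri` shown is the typical form of the Glue Iceberg REST endpoint but should be verified for your region, and all credential and region values are placeholders:

```json
{
  "catalogConfig": {
    "catalogType": "iceberg-rest",
    "serviceType": "glue",
    "restUri": "https://glue.us-east-1.amazonaws.com/iceberg",
    "warehouse": "<AWS_ACCOUNT_ID>",
    "restAuthType": "aws-sigv4",
    "restAccessKeyId": "<AWS_ACCESS_KEY_ID>",
    "restSecretAccessKey": "<AWS_SECRET_ACCESS_KEY>",
    "restRegion": "us-east-1",
    "restService": "glue",
    "storageAuthType": "aws-sigv4",
    "storageAccessKeyId": "<AWS_ACCESS_KEY_ID>",
    "storageSecretAccessKey": "<AWS_SECRET_ACCESS_KEY>",
    "storageRegion": "us-east-1"
  }
}
```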
2.2 List Namespaces
POST /iceberg/catalog/namespaces
2.3 List Tables
POST /iceberg/catalog/tables/list
`namespace`: the Glue database name to list tables from.
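A sketch of the list-tables body, assuming it carries the same connection fields as Validate (abridged here to the discriminating keys) plus the `namespace`; `my_glue_db` is a placeholder:

```json
{
  "catalogConfig": {
    "catalogType": "iceberg-rest",
    "serviceType": "glue",
    "namespace": "my_glue_db"
  }
}
```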
2.4 Create Pinot Table
POST /iceberg/catalog/tables
Note: The `IcebergIngestionTask` config in Create uses the hierarchical `catalog.iceberg-rest.auth.*` key format.
3. Glue Native
What it is: Connects to AWS Glue using the native Glue SDK (as opposed to the Iceberg REST protocol). Simpler to configure: no `restUri` or `serviceType` needed. The Glue database to use is specified by the `database` field.
catalogType: "glue-catalog"
3.1 Validate
POST /iceberg/catalog/validate
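The original request body is not shown above. A sketch using the documented `glue-catalog` fields; all region, database, and credential values are placeholders:

```json
{
  "catalogConfig": {
    "catalogType": "glue-catalog",
    "region": "us-east-1",
    "database": "my_glue_db",
    "restAuthType": "aws-sigv4",
    "restAccessKeyId": "<AWS_ACCESS_KEY_ID>",
    "restSecretAccessKey": "<AWS_SECRET_ACCESS_KEY>",
    "restRegion": "us-east-1",
    "restService": "glue",
    "storageAuthType": "aws-sigv4",
    "storageAccessKeyId": "<AWS_ACCESS_KEY_ID>",
    "storageSecretAccessKey": "<AWS_SECRET_ACCESS_KEY>",
    "storageRegion": "us-east-1"
  }
}
```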
3.2 List Namespaces
POST /iceberg/catalog/namespaces
3.3 List Tables
POST /iceberg/catalog/tables/list
3.4 Create Pinot Table
POST /iceberg/catalog/tables
4. Nessie REST
What it is: Connects to a Project Nessie server using the Iceberg REST protocol. Nessie itself requires no authentication in this configuration (restAuthType: "none"); only S3 credentials are needed to read the underlying data files.
catalogType: "iceberg-rest-s3" with "serviceType": "nessie"
4.1 Validate
POST /iceberg/catalog/validate
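The original request body is not shown above. A sketch using the documented Nessie fields; the `restUri` is a placeholder for your Nessie server's Iceberg REST endpoint, and the storage credentials are placeholders:

```json
{
  "catalogConfig": {
    "catalogType": "iceberg-rest-s3",
    "serviceType": "nessie",
    "restUri": "http://nessie.example.com:19120/iceberg",
    "restAuthType": "none",
    "storageAuthType": "aws-sigv4",
    "storageAccessKeyId": "<AWS_ACCESS_KEY_ID>",
    "storageSecretAccessKey": "<AWS_SECRET_ACCESS_KEY>",
    "storageRegion": "us-east-1"
  }
}
```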
4.2 List Namespaces
POST /iceberg/catalog/namespaces
4.3 List Tables
POST /iceberg/catalog/tables/list
4.4 Create Pinot Table
POST /iceberg/catalog/tables
5. S3 Tables REST
What it is: Connects to AWS S3 Tables, a managed Iceberg-compatible table storage service. Uses the Iceberg REST protocol. Requires an additional `tableBucketArn` field that identifies the S3 Tables bucket.
catalogType: "iceberg-rest" with "serviceType": "s3Tables"
5.1 Validate
POST /iceberg/catalog/validate
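The original request body is not shown above. A sketch using the documented S3 Tables fields; the `restUri`, the ARN, and all credential and region values are placeholders to be replaced with your own:

```json
{
  "catalogConfig": {
    "catalogType": "iceberg-rest",
    "serviceType": "s3Tables",
    "restUri": "https://s3tables.us-east-1.amazonaws.com/iceberg",
    "tableBucketArn": "arn:aws:s3tables:us-east-1:<AWS_ACCOUNT_ID>:bucket/my-table-bucket",
    "restAuthType": "aws-sigv4",
    "restAccessKeyId": "<AWS_ACCESS_KEY_ID>",
    "restSecretAccessKey": "<AWS_SECRET_ACCESS_KEY>",
    "restRegion": "us-east-1",
    "restService": "s3tables",
    "storageAuthType": "aws-sigv4",
    "storageAccessKeyId": "<AWS_ACCESS_KEY_ID>",
    "storageSecretAccessKey": "<AWS_SECRET_ACCESS_KEY>",
    "storageRegion": "us-east-1"
  }
}
```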
5.2 List Namespaces
POST /iceberg/catalog/namespaces
5.3 List Tables
POST /iceberg/catalog/tables/list
5.4 Create Pinot Table
POST /iceberg/catalog/tables
Request Body Field Reference
catalogConfig — Common Fields
| Field | Applies to | Description |
|---|---|---|
| `namespace` | All (steps 2–3) | Namespace (Glue database, Nessie namespace, etc.) containing the target table. Required for List Tables and Create. |
| `tableName` | Create | Name of the Iceberg table. Required for Create. |
| `restUri` | iceberg-rest, iceberg-rest-s3 | URL of the Iceberg REST catalog endpoint. |
| `serviceType` | iceberg-rest, iceberg-rest-s3 | Identifies the backing service: "glue", "nessie", or "s3Tables". |
| `warehouse` | iceberg-rest (glue) | AWS account ID, used as the Glue warehouse identifier. |
| `tableBucketArn` | iceberg-rest (s3Tables) | Full ARN of the S3 Tables bucket. |
| `restAuthType` | iceberg-rest, iceberg-rest-s3, glue-catalog | Auth type for the catalog REST API: "aws-sigv4" or "none". |
| `restAccessKeyId` / `restSecretAccessKey` | iceberg-rest, glue-catalog | AWS credentials for catalog API access. |
| `restRegion` | iceberg-rest, glue-catalog | AWS region for the catalog endpoint. |
| `restService` | iceberg-rest, glue-catalog | AWS service name for SigV4 signing: "glue" or "s3tables". |
| `storageAuthType` | iceberg-rest, iceberg-rest-s3, glue-catalog | Auth type for S3 data access: always "aws-sigv4". |
| `storageAccessKeyId` / `storageSecretAccessKey` | iceberg-rest, iceberg-rest-s3, glue-catalog | AWS credentials for reading Parquet data from S3. |
| `storageRegion` | iceberg-rest, iceberg-rest-s3, glue-catalog | AWS region for S3 data access. |
| `region` / `database` | glue-catalog | Region and default database for native Glue SDK access. |
| `bucketName` / `prefix` | s3-catalog | S3 bucket name and key prefix pointing to the Parquet files. |
| `accessKey` / `secretKey` | s3-catalog | AWS credentials (note: different field names from other providers). |
tableConfig — Key Fields
| Field | Description |
|---|---|
| `tableName` | Pinot table name. Convention: `<schemaName>_OFFLINE`. |
| `tableType` | Always "OFFLINE" for Iceberg ingestion. |
| `segmentsConfig.timeColumnName` | Time column for Pinot segments. Can be null if there is no time dimension. |
| `segmentsConfig.retentionTimeValue` / `retentionTimeUnit` | How long Pinot retains segments. |
| `segmentsConfig.segmentPushType` | Always "APPEND": new Iceberg snapshots are appended as new segments. |
| `tableIndexConfig.nullHandlingEnabled` | Set to true to handle nullable columns from Iceberg schemas. |
| `task.taskTypeConfigsMap.IcebergIngestionTask.schedule` | Cron expression for ingestion frequency. "0 */30 * * * ?" runs every 30 minutes. |
| `task.taskTypeConfigsMap.IcebergIngestionTask.inputFormat` | Always "parquet": Iceberg uses Parquet for data files. |
Frequently Asked Questions
Why isn’t my table ingesting data after creation? The `IcebergIngestionTask` runs on a cron schedule (default every 30 minutes) and does not fire automatically on table creation. You must manually trigger the first run:
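A sketch of the manual trigger call against the Pinot Controller; the query parameter names (`taskname`, `tableName`) are assumptions based on the Controller's periodic-task API and should be checked against your Pinot version:

```
POST /periodictask/run?taskname=IcebergIngestionTask&tableName=<yourTable>_OFFLINE
```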
Which catalog type should I use?
| My setup | Use |
|---|---|
| Raw Parquet files directly on S3 (no catalog service) | s3-catalog (S3 Data Lake) |
| AWS Glue via the Iceberg REST protocol | iceberg-rest with serviceType: "glue" (Glue REST) |
| AWS Glue via the native Glue SDK | glue-catalog (Glue Native) |
| Project Nessie server | iceberg-rest-s3 with serviceType: "nessie" (Nessie REST) |
| AWS S3 Tables | iceberg-rest with serviceType: "s3Tables" (S3 Tables REST) |
What credentials do I need? For all AWS-backed catalog types (`iceberg-rest`, `glue-catalog`, `iceberg-rest-s3`), you need two sets of AWS credentials:

- Catalog credentials (`restAccessKeyId` / `restSecretAccessKey`): authenticate against the catalog API (Glue, S3 Tables).
- Storage credentials (`storageAccessKeyId` / `storageSecretAccessKey`): read Parquet data files from S3.

For the S3 Data Lake provider (`s3-catalog`), only one set is needed, using the field names `accessKey` / `secretKey`.
What file format does Iceberg ingestion support? Only Parquet. Set `"inputFormat": "parquet"` in the `IcebergIngestionTask` config. All Iceberg-managed tables use Parquet by default.
Can I set a custom ingestion schedule? Yes. The `schedule` field in `IcebergIngestionTask` accepts a standard cron expression. The default "0 */30 * * * ?" runs every 30 minutes; "0 */5 * * * ?" runs every 5 minutes; to run every hour, use "0 0 * * * ?". The schedule applies to all subsequent automatic runs; the first run must always be triggered manually.
How do I pause ingestion without deleting the table? Set `"enabled": "false"` in the `IcebergIngestionTask` config and update the table. This stops the scheduler from creating new ingestion runs while preserving all existing segments and the last checkpoint. Re-enable by setting `"enabled": "true"`. See Pausing Ingestion for the full example.
What happens if `timeColumnName` is null? Pinot creates the table without a time dimension. Segments are still ingested and queryable, but time-based retention and time-partition pruning are disabled. Set `timeColumnName` to a timestamp column in your Iceberg schema if you need those features.
My Validate call succeeds but List Namespaces returns nothing — why? The validate endpoint only confirms connectivity and credential validity. An empty namespace list typically means the credentials have access to the catalog service but the account or region contains no databases, or the `warehouse` / `database` field points to the wrong scope. Double-check that the `region`, `warehouse`, and `database` fields match your actual Glue or S3 Tables configuration.
Can I use IAM roles instead of access keys? Cross-account IAM role-based access is not yet supported.
How do I change the ingestion schedule after table creation? Update the `IcebergIngestionTask.schedule` field in the table config.
Where can I monitor ingestion health? See the Observability page for the Watcher Status, Checkpoint, and File Count APIs.

