This feature requires StarTree Cloud with a minion tier enabled, and must be enabled on demand — contact StarTree support to activate it.
DataExportTask is a Pinot minion task that exports completed REALTIME segments to external storage as Parquet files. It is the complement of ExternalTableSyncTask: where that task reads external Parquet into Pinot, DataExportTask writes Pinot data back out to your data lake. Common uses include cold-tier archival, feeding downstream Iceberg catalogs, and cross-system data sharing without a separate ETL pipeline.
Two destinations are supported: a plain filesystem path (any S3-compatible URI via PinotFS) and an Iceberg REST catalog (AWS Glue, S3 Tables, or any Iceberg REST-compliant catalog).
How it works
Each cycle, the Generator runs on the controller. It loads a time-based watermark from ZooKeeper and selects the REALTIME segments that are completed, older than bufferTimePeriod, and neither already checkpointed nor part of a running task. It then emits one PinotTaskConfig per eligible segment, capped by tableMaxNumTasks.
The Executor runs on a minion: it downloads the segment from Pinot deep store, converts it to Parquet using the configured writer, and uploads the result to the destination. On success, it posts a COMPLETED checkpoint back so the segment is not re-exported.
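The Generator's selection step can be pictured with a small Python sketch. It is an illustration only, not the task's actual code: the Segment class, the eligible_segments function, and the oldest-first ordering are assumptions made for this example, as is comparing each segment's end time against the watermark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    # Hypothetical stand-in for the segment metadata the Generator inspects.
    name: str
    end_time_ms: int
    completed: bool


def eligible_segments(segments, now_ms, watermark_ms, buffer_ms,
                      checkpointed, running, table_max_num_tasks):
    """Return the segments one Generator cycle would schedule (sketch)."""
    cutoff_ms = now_ms - buffer_ms  # segments ending after this are too recent
    picked = [
        s for s in segments
        if s.completed                       # still-consuming segments are skipped
        and s.end_time_ms >= watermark_ms    # assumption: watermark compares end time
        and s.end_time_ms <= cutoff_ms       # outside the bufferTimePeriod window
        and s.name not in checkpointed       # not already exported
        and s.name not in running            # not in a running task
    ]
    picked.sort(key=lambda s: s.end_time_ms)  # assumption: oldest first
    return picked[:table_max_num_tasks]       # cap from tableMaxNumTasks


DAY_MS = 24 * 60 * 60 * 1000
segments = [
    Segment("seg_0", 1 * DAY_MS, True),      # older than the watermark
    Segment("seg_1", 5 * DAY_MS, True),      # already checkpointed
    Segment("seg_2", 6 * DAY_MS, True),      # eligible
    Segment("seg_3", 7 * DAY_MS, True),      # eligible
    Segment("seg_4", 8 * DAY_MS, False),     # not yet completed
    Segment("seg_5", 9 * DAY_MS + 1, True),  # inside the 1d buffer
]
chosen = eligible_segments(
    segments, now_ms=10 * DAY_MS, watermark_ms=2 * DAY_MS, buffer_ms=DAY_MS,
    checkpointed={"seg_1"}, running=set(), table_max_num_tasks=10,
)
print([s.name for s in chosen])  # prints ['seg_2', 'seg_3']
```

Lowering table_max_num_tasks to 1 in this sketch would leave only seg_2 for the current cycle, with seg_3 picked up on a later one.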
Supported destinations
| Destination | externalTable.target | Use with |
|---|---|---|
| Filesystem (PinotFS) | filesystem | Any S3-compatible URI, GCS, HDFS |
| Iceberg REST catalog | iceberg-rest | AWS Glue, AWS S3 Tables, any Iceberg REST catalog |
Prerequisites
- A REALTIME table with completed segments exists in your StarTree cluster.
- A minion tier is provisioned in the cluster.
- The minion has network access to the destination (S3 bucket, GCS bucket, or Iceberg REST endpoint).
- For Iceberg REST: the target table must already exist in the catalog, must not be partitioned, and must use Parquet as its file format.
Configuring the task
Add DataExportTask to the table’s task.taskTypeConfigsMap. The following example exports to a filesystem destination:
"task": {
  "taskTypeConfigsMap": {
    "DataExportTask": {
      "schedule": "0 0 * * * ?",
      "sourceTableName": "my_events_REALTIME",
      "fileFormat": "parquet",
      "externalTable.target": "filesystem",
      "outputDirURI": "s3://my-bucket/exports/my_events/",
      "output.fs.class": "org.apache.pinot.plugin.filesystem.S3PinotFS",
      "output.fs.prop.region": "us-east-1"
    }
  }
}
For an Iceberg REST destination, replace the filesystem keys with the catalog connection:
"task": {
  "taskTypeConfigsMap": {
    "DataExportTask": {
      "schedule": "0 0 * * * ?",
      "sourceTableName": "my_events_REALTIME",
      "fileFormat": "parquet",
      "externalTable.target": "iceberg-rest",
      "catalog.iceberg-rest.restUri": "https://glue.us-east-1.amazonaws.com/iceberg",
      "catalog.iceberg-rest.table.namespace": "my_database",
      "catalog.iceberg-rest.table.tableName": "my_events_archive",
      "iceberg.commitThreshold": "250"
    }
  }
}
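Before submitting a table config, it can help to check that the keys marked (required) in the config reference are present for the chosen target. The following Python sketch does that on a plain dict. The missing_keys helper is hypothetical and not part of Pinot or StarTree; it only mirrors the required keys listed on this page.

```python
# Keys marked (required) in the config reference, grouped by target.
REQUIRED_COMMON = ("sourceTableName",)
REQUIRED_BY_TARGET = {
    "filesystem": ("outputDirURI",),
    "iceberg-rest": (
        "catalog.iceberg-rest.restUri",
        "catalog.iceberg-rest.table.namespace",
        "catalog.iceberg-rest.table.tableName",
    ),
}


def missing_keys(task_config):
    """Return required DataExportTask keys that are absent or empty."""
    # externalTable.target defaults to filesystem per the config reference.
    target = task_config.get("externalTable.target", "filesystem")
    if target not in REQUIRED_BY_TARGET:
        raise ValueError(f"unsupported externalTable.target: {target!r}")
    required = REQUIRED_COMMON + REQUIRED_BY_TARGET[target]
    return [key for key in required if not task_config.get(key)]


iceberg_config = {
    "schedule": "0 0 * * * ?",
    "sourceTableName": "my_events_REALTIME",
    "externalTable.target": "iceberg-rest",
    "catalog.iceberg-rest.restUri": "https://glue.us-east-1.amazonaws.com/iceberg",
    "catalog.iceberg-rest.table.namespace": "my_database",
    # tableName intentionally left out to show the check firing
}
print(missing_keys(iceberg_config))
```

A check like this only catches missing keys; it says nothing about whether the catalog is reachable or the target table exists.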
Config reference
Core settings
| Key | Default | Description |
|---|---|---|
| schedule | — | Quartz cron expression for how often the Generator runs, e.g. 0 0 * * * ? (hourly). |
| sourceTableName | (required) | The REALTIME table to export from, with the _REALTIME suffix, e.g. my_events_REALTIME. |
| fileFormat | parquet | Output file format. Only parquet is supported. |
| externalTable.target | filesystem | Destination type. filesystem or iceberg-rest. |
| bufferTimePeriod | 1d | Segments whose end time falls within this window of the current time are excluded, which gives segments time to fully complete before export. |
| initialWatermarkMs | — | Epoch ms to use as the starting watermark on the first run. Segments older than this value are skipped. |
| tableMaxNumTasks | — | Maximum number of segments exported per Generator cycle. Limits concurrent minion load. |
Filesystem target
Set these when externalTable.target=filesystem.
| Key | Default | Description |
|---|---|---|
| outputDirURI | (required) | Destination URI, e.g. s3://my-bucket/exports/table/. Output files are written as <outputDirURI>/<segmentName>.parquet. |
| output.fs.class | — | Fully-qualified PinotFS class, e.g. org.apache.pinot.plugin.filesystem.S3PinotFS. Defaults to the cluster’s configured deep-store FS if unset. |
| output.fs.prop.<key> | — | PinotFS properties, e.g. output.fs.prop.region=us-east-1 or output.fs.prop.accessKey=.... Passed directly to the FS class. |
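To predict where a given segment will land, the <outputDirURI>/<segmentName>.parquet pattern can be applied by hand. The Python sketch below does so; the export_path helper and the segment name are illustrative, and collapsing a trailing slash on outputDirURI is an assumption of this sketch rather than documented behavior.

```python
def export_path(output_dir_uri, segment_name):
    """Build <outputDirURI>/<segmentName>.parquet for one segment (sketch)."""
    # Assumption: a trailing slash on outputDirURI does not produce "//".
    return output_dir_uri.rstrip("/") + "/" + segment_name + ".parquet"


# Illustrative segment name; real names come from the REALTIME table.
print(export_path("s3://my-bucket/exports/my_events/",
                  "my_events__0__12__20240101T0000Z"))
```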
The following Parquet writer settings apply to both destinations unless noted otherwise.
| Key | Default | Description |
|---|---|---|
| fileFormat.parquet.compressionCodec | SNAPPY | Parquet compression. SNAPPY, GZIP, ZSTD, or UNCOMPRESSED. |
| fileFormat.parquet.rowGroupSizeBytes | 134217728 (128 MB) | Target row group size in bytes. |
| fileFormat.parquet.pageSizeBytes | 1048576 (1 MB) | Target page size in bytes. |
| fileFormat.parquet.enableDictionary | true | Write Parquet dictionary pages. Disable for high-cardinality columns. |
| fileFormat.parquet.json.maxDepth | — | Maximum nesting depth when serializing JSON columns to Parquet group types. |
| fileFormat.parquet.decimal.precision | — | Override decimal precision for BIG_DECIMAL columns. |
| fileFormat.parquet.decimal.scale | — | Override decimal scale for BIG_DECIMAL columns. |
| fileFormat.parquet.json.unknownKeysBehavior | warn | What to do when a JSON column in the row data has keys not present in the Iceberg schema: warn (log and continue) or fail (abort the task). Applies to iceberg-rest target only. |
Iceberg REST target
Set these when externalTable.target=iceberg-rest.
| Key | Default | Description |
|---|---|---|
| catalog.iceberg-rest.restUri | (required) | Iceberg REST catalog URI, e.g. https://glue.us-east-1.amazonaws.com/iceberg. |
| catalog.iceberg-rest.table.namespace | (required) | Catalog namespace (database) containing the target table. |
| catalog.iceberg-rest.table.tableName | (required) | Target table name inside the namespace. |
| catalog.iceberg-rest.token | — | Static bearer token for catalog authentication. |
| catalog.iceberg-rest.credential | — | OAuth2 client credential (clientId:clientSecret). |
| catalog.iceberg-rest.rest.signing-region | — | AWS region for SigV4 request signing (e.g. us-east-1). |
| catalog.iceberg-rest.s3.access-key-id | — | Static AWS access key for S3 file I/O (overrides vended credentials). |
| catalog.iceberg-rest.s3.secret-access-key | — | Static AWS secret key. |
For AWS Glue and S3 Tables, the catalog vends short-lived S3 credentials automatically — do not set static s3.* keys unless you want to override them. The minion’s IAM role must have glue:GetTable and s3:PutObject (or equivalent S3 Tables) permissions on the target.
Iceberg batch commits
By default, each minion subtask uploads a Parquet file and stages it. The Generator accumulates staged files in ZooKeeper and issues a single Iceberg snapshot commit (AppendFiles.commit()) when the batch is ready — one snapshot for many segments instead of one per segment. This reduces Iceberg snapshot churn on high-throughput tables.
The Generator runs an EMIT / WAIT / COMMIT decision each cycle:
| State | Condition | Action |
|---|---|---|
| EMIT | Eligible segments exist and in-flight count is below threshold | Schedule new subtasks |
| WAIT | Running subtasks + staged queue ≥ iceberg.commitThreshold | Skip this cycle; wait for in-flight tasks to drain |
| COMMIT | No eligible segments remain, or staged files ≥ iceberg.commitThreshold | Flush the staged queue with one AppendFiles.commit() |
| Key | Default | Description |
|---|---|---|
| iceberg.commitThreshold | 250 | Number of staged files that triggers an immediate batch commit. Tune down for lower latency; tune up to reduce snapshot frequency on high-volume tables. |
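The EMIT / WAIT / COMMIT table can be read as a small decision function. The Python sketch below is one possible reading, not the Generator's actual code. Two points are assumptions: "in-flight count" is taken to mean running subtasks plus staged files, and when several conditions hold at once the checks are applied in the order COMMIT on a full staged queue, then WAIT, then EMIT.

```python
def decide(eligible, running, staged, commit_threshold=250):
    """Pick the Generator action for one cycle (sketch of the state table).

    eligible: segments that could be scheduled this cycle
    running:  subtasks currently executing on minions
    staged:   uploaded Parquet files awaiting an Iceberg commit
    """
    if staged >= commit_threshold:
        return "COMMIT"  # staged queue is full: flush with one commit
    if running + staged >= commit_threshold:
        return "WAIT"    # in-flight work would overfill the batch
    if eligible > 0:
        return "EMIT"    # room left in the batch: schedule more subtasks
    return "COMMIT"      # nothing left to schedule: flush what is staged


print(decide(eligible=10, running=0, staged=0))     # EMIT
print(decide(eligible=10, running=200, staged=50))  # WAIT
print(decide(eligible=10, running=0, staged=250))   # COMMIT
print(decide(eligible=0, running=0, staged=40))     # COMMIT
```

Under this reading, lowering iceberg.commitThreshold makes both WAIT and COMMIT trigger sooner, which matches the guidance to tune it down for lower latency.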
Scheduling and monitoring
DataExportTask uses the standard Pinot minion task scheduler. To trigger a run immediately without waiting for the next schedule tick, use the task schedule endpoint:
curl -X POST "$CONTROLLER/tasks/schedule?taskType=DataExportTask&tableName=my_events_REALTIME"
To check task status, use the standard Pinot tasks API:
curl "$CONTROLLER/tasks/DataExportTask/my_events_REALTIME/taskstates"
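The same two calls can be scripted. The Python sketch below uses only the standard library and mirrors the endpoints from the curl examples above. The helper names are hypothetical, and no authentication is shown, so a secured cluster would need the appropriate headers added.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request, urlopen


def schedule_url(controller, task_type, table_name):
    """URL for the task schedule endpoint used in the curl example."""
    query = urlencode({"taskType": task_type, "tableName": table_name})
    return f"{controller.rstrip('/')}/tasks/schedule?{query}"


def task_states_url(controller, task_type, table_name):
    """URL for the task status lookup used in the curl example."""
    base = controller.rstrip("/")
    task = quote(task_type, safe="")
    table = quote(table_name, safe="")
    return f"{base}/tasks/{task}/{table}/taskstates"


def call(url, method="GET", timeout_s=30):
    """Send the request and return the raw response body as text."""
    with urlopen(Request(url, method=method), timeout=timeout_s) as resp:
        return resp.read().decode("utf-8")


# Building the URLs needs no network access:
print(schedule_url("http://localhost:9000", "DataExportTask", "my_events_REALTIME"))
print(task_states_url("http://localhost:9000", "DataExportTask", "my_events_REALTIME"))
```

With a reachable controller, call(schedule_url(...), method="POST") would trigger a run and call(task_states_url(...)) would fetch the status; http://localhost:9000 is a placeholder for your controller address.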
For a broader view of sync health and the ingestion checkpoint, see the Observability page.