This feature requires StarTree Cloud with a minion tier enabled and must be enabled on demand. Contact StarTree support to activate it.
DataExportTask is a Pinot minion task that exports completed REALTIME segments to external storage as Parquet files. It is the complement of ExternalTableSyncTask: where that task reads external Parquet into Pinot, DataExportTask writes Pinot data back out to your data lake. Common uses include cold-tier archival, feeding downstream Iceberg catalogs, and cross-system data sharing without a separate ETL pipeline. Two destinations are supported: a filesystem path (any URI that PinotFS can write to, such as S3-compatible storage, GCS, or HDFS) and an Iceberg REST catalog (AWS Glue, AWS S3 Tables, or any other Iceberg REST-compliant catalog).

How it works

Each cycle, the Generator runs on the controller. It loads a time-based watermark from ZooKeeper and selects the REALTIME segments that are completed, whose end time is older than bufferTimePeriod, and that are neither already checkpointed nor part of a running task. It then emits one PinotTaskConfig per eligible segment, up to tableMaxNumTasks per cycle.

The Executor runs on a minion. It downloads the segment from the Pinot deep store, converts it to Parquet using the configured writer, and uploads the result to the destination. On success, it posts a COMPLETED checkpoint back so that the segment is not exported again.
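The selection step the Generator performs can be modeled roughly as follows. This is an illustrative Python sketch, not StarTree's implementation: the Segment class, the eligible_segments function, its parameter names, the exact comparison of segment end time against the watermark, and the oldest-first ordering are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """Hypothetical, simplified view of a REALTIME segment's metadata."""
    name: str
    completed: bool
    end_time_ms: int


def eligible_segments(segments, now_ms, watermark_ms, buffer_ms,
                      checkpointed, running, max_tasks):
    """Return the names of segments that one Generator cycle would export.

    A segment qualifies when it is completed, its end time is at or after the
    watermark, its end time is older than the buffer window, and it is neither
    checkpointed nor already being exported. The result is capped at max_tasks.
    """
    cutoff_ms = now_ms - buffer_ms
    picked = [
        s for s in segments
        if s.completed
        and watermark_ms <= s.end_time_ms <= cutoff_ms
        and s.name not in checkpointed
        and s.name not in running
    ]
    # Assumption: oldest segments are exported first.
    picked.sort(key=lambda s: s.end_time_ms)
    return [s.name for s in picked[:max_tasks]]
```

With a bufferTimePeriod of 1d, for example, buffer_ms would be 86400000, so a segment that ended twelve hours ago is left for a later cycle.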

Supported destinations

| Destination | externalTable.target | Use with |
| --- | --- | --- |
| Filesystem (PinotFS) | filesystem | Any S3-compatible URI, GCS, HDFS |
| Iceberg REST catalog | iceberg-rest | AWS Glue, AWS S3 Tables, any Iceberg REST catalog |

Prerequisites

  • A REALTIME table with completed segments exists in your StarTree cluster.
  • A minion tier is provisioned in the cluster.
  • The minion has network access to the destination (S3 bucket, GCS bucket, or Iceberg REST endpoint).
  • For Iceberg REST: the target table must already exist in the catalog, must not be partitioned, and must use Parquet as its file format.

Configuring the task

Add DataExportTask to the table’s task.taskTypeConfigsMap. The following example exports to a filesystem destination:
"task": {
  "taskTypeConfigsMap": {
    "DataExportTask": {
      "schedule": "0 0 * * * ?",
      "sourceTableName": "my_events_REALTIME",
      "fileFormat": "parquet",
      "externalTable.target": "filesystem",
      "outputDirURI": "s3://my-bucket/exports/my_events/",
      "output.fs.class": "org.apache.pinot.plugin.filesystem.S3PinotFS",
      "output.fs.prop.region": "us-east-1"
    }
  }
}
For an Iceberg REST destination, replace the filesystem keys with the catalog connection settings:
"task": {
  "taskTypeConfigsMap": {
    "DataExportTask": {
      "schedule": "0 0 * * * ?",
      "sourceTableName": "my_events_REALTIME",
      "fileFormat": "parquet",
      "externalTable.target": "iceberg-rest",
      "catalog.iceberg-rest.restUri": "https://glue.us-east-1.amazonaws.com/iceberg",
      "catalog.iceberg-rest.table.namespace": "my_database",
      "catalog.iceberg-rest.table.tableName": "my_events_archive",
      "iceberg.commitThreshold": "250"
    }
  }
}
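Before submitting a table config, it can help to check that the keys marked (required) in the config reference are present for the chosen target. The following Python sketch does that on a plain dictionary. The missing_keys helper and the two constants are hypothetical names for this example and are not part of Pinot or StarTree; the check also assumes that an unset externalTable.target falls back to filesystem, as the reference table's default suggests.

```python
# Keys the config reference marks as (required), grouped by destination type.
REQUIRED_COMMON = ("sourceTableName",)
REQUIRED_BY_TARGET = {
    "filesystem": ("outputDirURI",),
    "iceberg-rest": (
        "catalog.iceberg-rest.restUri",
        "catalog.iceberg-rest.table.namespace",
        "catalog.iceberg-rest.table.tableName",
    ),
}


def missing_keys(task_config):
    """Return the required DataExportTask keys that are absent or empty."""
    # Assumption: the target defaults to "filesystem" when it is not set.
    target = task_config.get("externalTable.target", "filesystem")
    if target not in REQUIRED_BY_TARGET:
        raise ValueError(f"unsupported externalTable.target: {target!r}")
    required = REQUIRED_COMMON + REQUIRED_BY_TARGET[target]
    return [key for key in required if not task_config.get(key)]
```

Passing the inner DataExportTask object from either example above should return an empty list; dropping catalog.iceberg-rest.table.tableName from the Iceberg example returns that key.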

Config reference

Core settings

| Key | Default | Description |
| --- | --- | --- |
| schedule | | Quartz cron expression for how often the Generator runs, e.g. 0 0 * * * ? (hourly). |
| sourceTableName | (required) | The REALTIME table to export from, with the _REALTIME suffix, e.g. my_events_REALTIME. |
| fileFormat | parquet | Output file format. Only parquet is supported. |
| externalTable.target | filesystem | Destination type. filesystem or iceberg-rest. |
| bufferTimePeriod | 1d | Segments whose end time is within this window of now are excluded, which allows segments to fully complete before export. |
| initialWatermarkMs | | Epoch ms to use as the starting watermark on the first run. Segments older than this value are skipped. |
| tableMaxNumTasks | | Maximum number of segments exported per Generator cycle. Limits concurrent minion load. |

Filesystem target

Set these when externalTable.target=filesystem.
| Key | Default | Description |
| --- | --- | --- |
| outputDirURI | (required) | Destination URI, e.g. s3://my-bucket/exports/table/. Output files are written as <outputDirURI>/<segmentName>.parquet. |
| output.fs.class | | Fully-qualified PinotFS class, e.g. org.apache.pinot.plugin.filesystem.S3PinotFS. Defaults to the cluster’s configured deep-store FS if unset. |
| output.fs.prop.<key> | | PinotFS properties, e.g. output.fs.prop.region=us-east-1 or output.fs.prop.accessKey=.... Passed directly to the FS class. |
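To locate an exported file downstream, the output path can be derived from the naming pattern given for outputDirURI. The Python helper below is a small sketch of that pattern; export_path is a hypothetical name, and collapsing a trailing slash on the directory URI is an assumption about how the path is joined.

```python
def export_path(output_dir_uri, segment_name):
    """Build <outputDirURI>/<segmentName>.parquet for one exported segment."""
    # Assumption: a trailing slash on the directory URI does not produce "//".
    return f"{output_dir_uri.rstrip('/')}/{segment_name}.parquet"
```

For the filesystem example earlier, a segment named my_segment (a placeholder name) would map to s3://my-bucket/exports/my_events/my_segment.parquet.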

Parquet format

These apply to both destinations unless noted otherwise.
| Key | Default | Description |
| --- | --- | --- |
| fileFormat.parquet.compressionCodec | SNAPPY | Parquet compression. SNAPPY, GZIP, ZSTD, or UNCOMPRESSED. |
| fileFormat.parquet.rowGroupSizeBytes | 134217728 (128 MB) | Target row group size in bytes. |
| fileFormat.parquet.pageSizeBytes | 1048576 (1 MB) | Target page size in bytes. |
| fileFormat.parquet.enableDictionary | true | Write Parquet dictionary pages. Disable for high-cardinality columns. |
| fileFormat.parquet.json.maxDepth | | Maximum nesting depth when serializing JSON columns to Parquet group types. |
| fileFormat.parquet.decimal.precision | | Override decimal precision for BIG_DECIMAL columns. |
| fileFormat.parquet.decimal.scale | | Override decimal scale for BIG_DECIMAL columns. |
| fileFormat.parquet.json.unknownKeysBehavior | warn | What to do when a JSON column in the row data has keys not present in the Iceberg schema: warn (log and continue) or fail (abort the task). Applies to iceberg-rest target only. |

Iceberg REST target

Set these when externalTable.target=iceberg-rest.
| Key | Default | Description |
| --- | --- | --- |
| catalog.iceberg-rest.restUri | (required) | Iceberg REST catalog URI, e.g. https://glue.us-east-1.amazonaws.com/iceberg. |
| catalog.iceberg-rest.table.namespace | (required) | Catalog namespace (database) containing the target table. |
| catalog.iceberg-rest.table.tableName | (required) | Target table name inside the namespace. |
| catalog.iceberg-rest.token | | Static bearer token for catalog authentication. |
| catalog.iceberg-rest.credential | | OAuth2 client credential (clientId:clientSecret). |
| catalog.iceberg-rest.rest.signing-region | | AWS region for SigV4 request signing (e.g. us-east-1). |
| catalog.iceberg-rest.s3.access-key-id | | Static AWS access key for S3 file I/O (overrides vended credentials). |
| catalog.iceberg-rest.s3.secret-access-key | | Static AWS secret key. |
For AWS Glue and S3 Tables, the catalog vends short-lived S3 credentials automatically, so do not set the static s3.* keys unless you want to override the vended credentials. The minion’s IAM role must have glue:GetTable and s3:PutObject (or the equivalent S3 Tables) permissions on the target.

Iceberg batch commits

By default, each minion subtask uploads a Parquet file and stages it. The Generator accumulates staged files in ZooKeeper and issues a single Iceberg snapshot commit (AppendFiles.commit()) when the batch is ready, producing one snapshot for many segments instead of one per segment. This reduces Iceberg snapshot churn on high-throughput tables.

The Generator makes an EMIT / WAIT / COMMIT decision each cycle:
| State | Condition | Action |
| --- | --- | --- |
| EMIT | Eligible segments exist and in-flight count is below threshold | Schedule new subtasks |
| WAIT | Running subtasks + staged queue ≥ iceberg.commitThreshold | Skip this cycle; wait for in-flight tasks to drain |
| COMMIT | No eligible segments remain, or staged files ≥ iceberg.commitThreshold | Flush the staged queue with one AppendFiles.commit() |

| Key | Default | Description |
| --- | --- | --- |
| iceberg.commitThreshold | 250 | Number of staged files that triggers an immediate batch commit. Tune down for lower latency; tune up to reduce snapshot frequency on high-volume tables. |
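The per-cycle decision can be read as a small function of three counts. The Python sketch below is one possible interpretation, not the actual Generator code: the decide function and its parameters are hypothetical, and the order in which the conditions are checked (COMMIT on a full staged queue first, then WAIT, then EMIT) is an assumption, since the table does not state a precedence.

```python
def decide(eligible, running, staged, commit_threshold=250):
    """Pick EMIT, WAIT, or COMMIT for one Generator cycle.

    eligible: segments ready to be scheduled this cycle
    running:  subtasks currently in flight
    staged:   uploaded Parquet files waiting for an Iceberg commit
    """
    # Enough staged files: flush them in a single snapshot commit.
    if staged >= commit_threshold:
        return "COMMIT"
    # In-flight work plus staged files would fill the batch: let tasks drain.
    if running + staged >= commit_threshold:
        return "WAIT"
    # Room left in the batch and work available: schedule more subtasks.
    if eligible > 0:
        return "EMIT"
    # Nothing left to schedule: flush whatever has been staged so far.
    return "COMMIT"
```

Under this reading, lowering iceberg.commitThreshold makes both WAIT and COMMIT fire sooner, which matches the note that smaller values trade more snapshots for lower latency.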

Scheduling and monitoring

DataExportTask uses the standard Pinot minion task scheduler. To trigger a run immediately without waiting for the next schedule tick, use the task schedule endpoint:
curl -X POST "$CONTROLLER/tasks/schedule?taskType=DataExportTask&tableName=my_events_REALTIME"
To check task status, use the standard Pinot tasks API:
curl "$CONTROLLER/tasks/DataExportTask/my_events_REALTIME/taskstates"
For a broader view of sync health and the ingestion checkpoint, see the Observability page.
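The two curl calls above can also be prepared from Python using only the standard library. This is a minimal sketch under a few assumptions: schedule_request and task_states_url are hypothetical helper names, http://localhost:9000 stands in for your controller address, and any authentication your cluster requires is omitted. The functions only build the request and URL; sending them (for example with urllib.request.urlopen) is left to the caller.

```python
from urllib.parse import urlencode
from urllib.request import Request


def schedule_request(controller, table_name, task_type="DataExportTask"):
    """Build the POST request that triggers an immediate task run."""
    query = urlencode({"taskType": task_type, "tableName": table_name})
    url = f"{controller.rstrip('/')}/tasks/schedule?{query}"
    return Request(url, method="POST")


def task_states_url(controller, table_name, task_type="DataExportTask"):
    """Build the URL that reports task states for one table."""
    return f"{controller.rstrip('/')}/tasks/{task_type}/{table_name}/taskstates"


# Placeholder controller address; replace with your own endpoint.
request = schedule_request("http://localhost:9000", "my_events_REALTIME")
states_url = task_states_url("http://localhost:9000", "my_events_REALTIME")
```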