Data Export Task: Parquet Writer

This feature requires StarTree Cloud with a minion tier enabled. It is enabled on demand: contact StarTree support to activate it.

DataExportTask is a Pinot minion task that exports completed REALTIME segments to external storage as Parquet files. It is the complement of ExternalTableSyncTask: where that task reads external Parquet into Pinot, DataExportTask writes Pinot data back out to your data lake. Common uses include cold-tier archival, feeding downstream Iceberg catalogs, and cross-system data sharing without a separate ETL pipeline. Two destinations are supported: a filesystem path (any URI reachable through PinotFS, such as S3-compatible storage, GCS, or HDFS) and an Iceberg REST catalog (AWS Glue, AWS S3 Tables, or any other Iceberg REST-compliant catalog).

How it works

Each cycle, the Generator runs on the controller. It loads a time-based watermark from ZooKeeper and selects REALTIME segments that are completed, whose end time is older than the bufferTimePeriod window, and that are neither already checkpointed nor assigned to a running task. It then emits one PinotTaskConfig per eligible segment, up to tableMaxNumTasks per cycle.

The Executor runs on a minion. It downloads the segment from the Pinot deep store, converts it to Parquet using the configured writer, and uploads the result to the destination. On success, it posts a COMPLETED checkpoint back so that the segment is not exported again.
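The Generator's selection step can be illustrated with a small Python sketch. This is not the actual implementation: the `Segment` class, the `select_segments` function, and all parameter names are hypothetical. It also assumes that the watermark is compared against each segment's end time and that the oldest segments are scheduled first.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class Segment:
    """Hypothetical stand-in for segment metadata read from ZooKeeper."""
    name: str
    end_time_ms: int
    completed: bool


def select_segments(
    segments: Iterable[Segment],
    watermark_ms: int,
    now_ms: int,
    buffer_ms: int,
    checkpointed: Set[str],
    running: Set[str],
    max_tasks: Optional[int] = None,
) -> List[Segment]:
    """Return the segments that would each get one task this cycle."""
    eligible = [
        s for s in segments
        if s.completed                              # only completed segments
        and s.end_time_ms >= watermark_ms           # assumption: older segments are skipped
        and s.end_time_ms <= now_ms - buffer_ms     # outside the bufferTimePeriod window
        and s.name not in checkpointed              # not already exported
        and s.name not in running                   # not in a running task
    ]
    # Assumption: oldest segments first, so the cap drops the newest ones.
    eligible.sort(key=lambda s: s.end_time_ms)
    return eligible if max_tasks is None else eligible[:max_tasks]
```

Capping the list after sorting means that segments left out of one cycle are picked up by a later one, which matches the role the text gives to tableMaxNumTasks as a per-cycle limit.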

Supported destinations

Destination	`externalTable.target`	Use with
Filesystem (PinotFS)	`filesystem`	Any S3-compatible URI, GCS, HDFS
Iceberg REST catalog	`iceberg-rest`	AWS Glue, AWS S3 Tables, any Iceberg REST catalog

Prerequisites

A REALTIME table with completed segments exists in your StarTree cluster.
A minion tier is provisioned in the cluster.
The minion has network access to the destination (S3 bucket, GCS bucket, or Iceberg REST endpoint).
For Iceberg REST: the target table must already exist in the catalog, must not be partitioned, and must use Parquet as its file format.

Configuring the task

Add DataExportTask to the table’s task.taskTypeConfigsMap. The following example exports to a filesystem destination:

"task": {
  "taskTypeConfigsMap": {
    "DataExportTask": {
      "schedule": "0 0 * * * ?",
      "sourceTableName": "my_events_REALTIME",
      "fileFormat": "parquet",
      "externalTable.target": "filesystem",
      "outputDirURI": "s3://my-bucket/exports/my_events/",
      "output.fs.class": "org.apache.pinot.plugin.filesystem.S3PinotFS",
      "output.fs.prop.region": "us-east-1"
    }
  }
}

For an Iceberg REST destination, replace the filesystem keys with the catalog connection:

"task": {
  "taskTypeConfigsMap": {
    "DataExportTask": {
      "schedule": "0 0 * * * ?",
      "sourceTableName": "my_events_REALTIME",
      "fileFormat": "parquet",
      "externalTable.target": "iceberg-rest",
      "catalog.iceberg-rest.restUri": "https://glue.us-east-1.amazonaws.com/iceberg",
      "catalog.iceberg-rest.table.namespace": "my_database",
      "catalog.iceberg-rest.table.tableName": "my_events_archive",
      "iceberg.commitThreshold": "250"
    }
  }
}

Config reference

Core settings

Key	Default	Description
`schedule`	—	Quartz cron expression for how often the Generator runs, e.g. `0 0 * * * ?` (hourly).
`sourceTableName`	(required)	The REALTIME table to export from, with the `_REALTIME` suffix, e.g. `my_events_REALTIME`.
`fileFormat`	`parquet`	Output file format. Only `parquet` is supported.
`externalTable.target`	`filesystem`	Destination type. `filesystem` or `iceberg-rest`.
`bufferTimePeriod`	`1d`	Segments whose end time falls within this window before the current time are excluded, which gives segments time to fully complete before export.
`initialWatermarkMs`	—	Epoch ms to use as the starting watermark on the first run. Segments older than this value are skipped.
`tableMaxNumTasks`	—	Maximum number of segments exported per Generator cycle. Limits concurrent minion load.

Filesystem target

Set these when externalTable.target=filesystem.

Key	Default	Description
`outputDirURI`	(required)	Destination URI, e.g. `s3://my-bucket/exports/table/`. Output files are written as `<outputDirURI>/<segmentName>.parquet`.
`output.fs.class`	—	Fully-qualified PinotFS class, e.g. `org.apache.pinot.plugin.filesystem.S3PinotFS`. Defaults to the cluster’s configured deep-store FS if unset.
`output.fs.prop.<key>`	—	PinotFS properties, e.g. `output.fs.prop.region=us-east-1` or `output.fs.prop.accessKey=...`. Passed directly to the FS class.

Parquet format

These apply to both destinations unless noted otherwise.

Key	Default	Description
`fileFormat.parquet.compressionCodec`	`SNAPPY`	Parquet compression. `SNAPPY`, `GZIP`, `ZSTD`, or `UNCOMPRESSED`.
`fileFormat.parquet.rowGroupSizeBytes`	`134217728` (128 MB)	Target row group size in bytes.
`fileFormat.parquet.pageSizeBytes`	`1048576` (1 MB)	Target page size in bytes.
`fileFormat.parquet.enableDictionary`	`true`	Write Parquet dictionary pages. Disable for high-cardinality columns.
`fileFormat.parquet.json.maxDepth`	—	Maximum nesting depth when serializing JSON columns to Parquet `group` types.
`fileFormat.parquet.decimal.precision`	—	Override decimal precision for `BIG_DECIMAL` columns.
`fileFormat.parquet.decimal.scale`	—	Override decimal scale for `BIG_DECIMAL` columns.
`fileFormat.parquet.json.unknownKeysBehavior`	`warn`	What to do when a JSON column in the row data has keys not present in the Iceberg schema: `warn` (log and continue) or `fail` (abort the task). Applies to `iceberg-rest` target only.

Iceberg REST target

Set these when externalTable.target=iceberg-rest.

Key	Default	Description
`catalog.iceberg-rest.restUri`	(required)	Iceberg REST catalog URI, e.g. `https://glue.us-east-1.amazonaws.com/iceberg`.
`catalog.iceberg-rest.table.namespace`	(required)	Catalog namespace (database) containing the target table.
`catalog.iceberg-rest.table.tableName`	(required)	Target table name inside the namespace.
`catalog.iceberg-rest.token`	—	Static bearer token for catalog authentication.
`catalog.iceberg-rest.credential`	—	OAuth2 client credential (`clientId:clientSecret`).
`catalog.iceberg-rest.rest.signing-region`	—	AWS region for SigV4 request signing (e.g. `us-east-1`).
`catalog.iceberg-rest.s3.access-key-id`	—	Static AWS access key for S3 file I/O (overrides vended credentials).
`catalog.iceberg-rest.s3.secret-access-key`	—	Static AWS secret key.

For AWS Glue and S3 Tables, the catalog vends short-lived S3 credentials automatically, so do not set the static s3.* keys unless you want to override the vended credentials. The minion’s IAM role must have glue:GetTable and s3:PutObject permissions (or the equivalent S3 Tables permissions) on the target.
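Before submitting a table config, it can help to check the DataExportTask block against the required keys listed in the tables above. The following Python sketch does that. The function name `check_export_config` is hypothetical and is not part of any StarTree or Pinot tooling; the required keys, defaults, and allowed values are taken from the config reference on this page.

```python
from __future__ import annotations

# Required keys per destination, as listed in the config reference.
REQUIRED_BY_TARGET = {
    "filesystem": ["outputDirURI"],
    "iceberg-rest": [
        "catalog.iceberg-rest.restUri",
        "catalog.iceberg-rest.table.namespace",
        "catalog.iceberg-rest.table.tableName",
    ],
}
ALLOWED_CODECS = {"SNAPPY", "GZIP", "ZSTD", "UNCOMPRESSED"}


def check_export_config(cfg: dict) -> list[str]:
    """Return a list of problems found in a DataExportTask config (empty if none)."""
    problems = []
    if not cfg.get("sourceTableName"):
        problems.append("missing sourceTableName")
    if cfg.get("fileFormat", "parquet") != "parquet":
        problems.append("fileFormat must be parquet")

    # externalTable.target defaults to filesystem when unset.
    target = cfg.get("externalTable.target", "filesystem")
    if target not in REQUIRED_BY_TARGET:
        problems.append(f"unknown externalTable.target: {target}")
    else:
        for key in REQUIRED_BY_TARGET[target]:
            if not cfg.get(key):
                problems.append(f"missing {key}")

    codec = cfg.get("fileFormat.parquet.compressionCodec", "SNAPPY")
    if codec not in ALLOWED_CODECS:
        problems.append(f"unsupported compressionCodec: {codec}")
    return problems
```

For example, an `iceberg-rest` config that sets only `sourceTableName` and `catalog.iceberg-rest.restUri` would be reported as missing the namespace and table name keys.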

Iceberg batch commits

By default, each minion subtask uploads a Parquet file and stages it. The Generator accumulates staged files in ZooKeeper and issues a single Iceberg snapshot commit (AppendFiles.commit()) when the batch is ready — one snapshot for many segments instead of one per segment. This reduces Iceberg snapshot churn on high-throughput tables. The Generator runs an EMIT / WAIT / COMMIT decision each cycle:

State	Condition	Action
`EMIT`	Eligible segments exist and in-flight count is below threshold	Schedule new subtasks
`WAIT`	Running subtasks + staged queue ≥ `iceberg.commitThreshold`	Skip this cycle; wait for in-flight tasks to drain
`COMMIT`	No eligible segments remain, or staged files ≥ `iceberg.commitThreshold`	Flush the staged queue with one `AppendFiles.commit()`

Key	Default	Description
`iceberg.commitThreshold`	`250`	Number of staged files that triggers an immediate batch commit. Tune down for lower latency; tune up to reduce snapshot frequency on high-volume tables.
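The EMIT / WAIT / COMMIT table can be read as a small decision function. The Python sketch below is one possible reading, not the actual Generator code. The `decide` function and its parameter names are hypothetical, and the order in which the conditions are checked is an assumption: the staged-file check comes first because a full staged queue also satisfies the WAIT condition.

```python
from enum import Enum


class Decision(Enum):
    EMIT = "EMIT"
    WAIT = "WAIT"
    COMMIT = "COMMIT"


def decide(eligible: int, running: int, staged: int, commit_threshold: int = 250) -> Decision:
    """Pick the Generator action for one cycle.

    eligible: segments that could be scheduled now
    running:  subtasks currently in flight
    staged:   uploaded files waiting for an Iceberg commit
    """
    # Assumption: COMMIT takes precedence when the staged queue alone is full.
    if staged >= commit_threshold:
        return Decision.COMMIT
    # In-flight work (running + staged) has reached the threshold: let it drain.
    if running + staged >= commit_threshold:
        return Decision.WAIT
    # Below the threshold with work available: schedule more subtasks.
    if eligible > 0:
        return Decision.EMIT
    # No eligible segments remain: flush whatever is staged.
    return Decision.COMMIT
```

With the default threshold of 250, `decide(eligible=10, running=200, staged=50)` returns `Decision.WAIT`, while `decide(eligible=0, running=0, staged=5)` returns `Decision.COMMIT` so that a partial batch is not left staged indefinitely.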

Scheduling and monitoring

DataExportTask uses the standard Pinot minion task scheduler. To trigger a run immediately without waiting for the next schedule tick, use the task schedule endpoint:

curl -X POST "$CONTROLLER/tasks/schedule?taskType=DataExportTask&tableName=my_events_REALTIME"

To check task status, use the standard Pinot tasks API:

curl "$CONTROLLER/tasks/DataExportTask/my_events_REALTIME/taskstates"

For a broader view of task health and checkpoint state, see the Observability page.
