# Data Export Task: Parquet Writer

<Warning>
  This feature requires StarTree Cloud with a minion tier enabled and is activated on request. Contact StarTree support to turn it on.
</Warning>

`DataExportTask` is a Pinot minion task that exports completed **REALTIME segments** to external storage as Parquet files. It is the complement of `ExternalTableSyncTask`: where that task reads external Parquet into Pinot, `DataExportTask` writes Pinot data back out to your data lake. Common uses include cold-tier archival, feeding downstream Iceberg catalogs, and cross-system data sharing without a separate ETL pipeline.

Two destinations are supported: a plain filesystem path (any URI that PinotFS can write to, such as S3-compatible storage, GCS, or HDFS) and an Iceberg REST catalog (AWS Glue, S3 Tables, or any Iceberg REST-compliant catalog).

***

## How it works

```mermaid
flowchart LR
    REALTIME["REALTIME Table\n(Pinot)"] --> Generator["Generator\n(Controller)"]
    Generator --> |"Watermark +\nZK checkpoint"| EligibleSegments["Eligible\nCompleted Segments"]
    EligibleSegments --> Executor["Executor\n(Minion)"]
    Executor --> Download["Download segment\nfrom deep store"]
    Download --> Convert["Convert to\nParquet"]
    Convert --> Upload["Upload to\ndestination"]
    Upload --> FilesystemDest["Filesystem\n(S3/GCS/HDFS)"]
    Upload --> IcebergDest["Iceberg REST\n(Glue / S3 Tables)"]
```

Each cycle, the **Generator** runs on the controller. It loads a time-based watermark from ZooKeeper, then selects REALTIME segments that are completed, have an end time older than `bufferTimePeriod`, and are neither already checkpointed nor assigned to a running task. It emits one `PinotTaskConfig` per eligible segment, up to `tableMaxNumTasks` per cycle.

The **Executor** runs on a minion: it downloads the segment from Pinot deep store, converts it to Parquet using the configured writer, and uploads the result to the destination. On success, it posts a COMPLETED checkpoint back so the segment is not re-exported.
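
The Generator's selection step described above can be sketched in a few lines of Python. This is an illustrative model, not StarTree's implementation: the `Segment` class, the `select_eligible` function, the oldest-first ordering, and the inclusive boundary comparisons are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """Hypothetical stand-in for segment metadata; not a Pinot class."""
    name: str
    end_time_ms: int
    completed: bool


def select_eligible(segments, now_ms, buffer_ms, watermark_ms,
                    checkpointed, running, max_tasks=None):
    """Return the segments one Generator cycle would turn into subtasks."""
    # Segments ending inside the buffer window (too close to "now") are skipped.
    cutoff_ms = now_ms - buffer_ms
    eligible = [
        s for s in segments
        if s.completed
        # Assumption: both bounds are inclusive; the real comparison may differ.
        and watermark_ms <= s.end_time_ms <= cutoff_ms
        and s.name not in checkpointed   # already exported
        and s.name not in running        # already assigned to a subtask
    ]
    # Assumption: oldest segments are exported first.
    eligible.sort(key=lambda s: s.end_time_ms)
    # max_tasks models the tableMaxNumTasks cap; None means no cap.
    return eligible if max_tasks is None else eligible[:max_tasks]
```

In this model, segments cut off by `max_tasks` are not checkpointed, so they would remain eligible on the next cycle.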

***

## Supported destinations

| Destination          | `externalTable.target` | Use with                                          |
| -------------------- | ---------------------- | ------------------------------------------------- |
| Filesystem (PinotFS) | `filesystem`           | Any S3-compatible URI, GCS, HDFS                  |
| Iceberg REST catalog | `iceberg-rest`         | AWS Glue, AWS S3 Tables, any Iceberg REST catalog |

***

## Prerequisites

* A REALTIME table with completed segments exists in your StarTree cluster.
* A minion tier is provisioned in the cluster.
* The minion has network access to the destination (S3 bucket, GCS bucket, or Iceberg REST endpoint).
* For Iceberg REST: the target table must already exist in the catalog, must not be partitioned, and must use Parquet as its file format.

***

## Configuring the task

Add `DataExportTask` to the table's `task.taskTypeConfigsMap`. The following example exports to a filesystem destination:

```json
"task": {
  "taskTypeConfigsMap": {
    "DataExportTask": {
      "schedule": "0 0 * * * ?",
      "sourceTableName": "my_events_REALTIME",
      "fileFormat": "parquet",
      "externalTable.target": "filesystem",
      "outputDirURI": "s3://my-bucket/exports/my_events/",
      "output.fs.class": "org.apache.pinot.plugin.filesystem.S3PinotFS",
      "output.fs.prop.region": "us-east-1"
    }
  }
}
```

For an Iceberg REST destination, replace the filesystem keys with the catalog connection:

```json
"task": {
  "taskTypeConfigsMap": {
    "DataExportTask": {
      "schedule": "0 0 * * * ?",
      "sourceTableName": "my_events_REALTIME",
      "fileFormat": "parquet",
      "externalTable.target": "iceberg-rest",
      "catalog.iceberg-rest.restUri": "https://glue.us-east-1.amazonaws.com/iceberg",
      "catalog.iceberg-rest.table.namespace": "my_database",
      "catalog.iceberg-rest.table.tableName": "my_events_archive",
      "iceberg.commitThreshold": "250"
    }
  }
}
```

***

## Config reference

### Core settings

| Key                    | Default      | Description                                                                                                          |
| ---------------------- | ------------ | -------------------------------------------------------------------------------------------------------------------- |
| `schedule`             | —            | Quartz cron expression for how often the Generator runs, e.g. `0 0 * * * ?` (hourly).                                |
| `sourceTableName`      | *(required)* | The REALTIME table to export from, with the `_REALTIME` suffix, e.g. `my_events_REALTIME`.                           |
| `fileFormat`           | `parquet`    | Output file format. Only `parquet` is supported.                                                                     |
| `externalTable.target` | `filesystem` | Destination type. `filesystem` or `iceberg-rest`.                                                                    |
| `bufferTimePeriod`     | `1d`         | Segments whose end time falls within this period before now are excluded, so they can fully complete before export.  |
| `initialWatermarkMs`   | —            | Epoch ms to use as the starting watermark on the first run. Segments older than this value are skipped.              |
| `tableMaxNumTasks`     | —            | Maximum number of segments exported per Generator cycle. Limits concurrent minion load.                              |

***

### Filesystem target

Set these when `externalTable.target=filesystem`.

| Key                    | Default      | Description                                                                                                                                      |
| ---------------------- | ------------ | ------------------------------------------------------------------------------------------------------------------------------------------------ |
| `outputDirURI`         | *(required)* | Destination URI, e.g. `s3://my-bucket/exports/table/`. Output files are written as `<outputDirURI>/<segmentName>.parquet`.                       |
| `output.fs.class`      | —            | Fully-qualified PinotFS class, e.g. `org.apache.pinot.plugin.filesystem.S3PinotFS`. Defaults to the cluster's configured deep-store FS if unset. |
| `output.fs.prop.<key>` | —            | PinotFS properties, e.g. `output.fs.prop.region=us-east-1` or `output.fs.prop.accessKey=...`. Passed directly to the FS class.                   |
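
Because output files follow the `<outputDirURI>/<segmentName>.parquet` pattern, a downstream consumer can predict where a given segment will land. The following is a minimal sketch: the helper name and the segment name in the usage note are hypothetical, and the sketch assumes that a trailing slash on `outputDirURI` does not produce a doubled separator.

```python
def export_object_uri(output_dir_uri: str, segment_name: str) -> str:
    """Build the expected Parquet file URI for one exported segment."""
    # Assumption: a trailing "/" on outputDirURI is normalized away.
    return f"{output_dir_uri.rstrip('/')}/{segment_name}.parquet"
```

For example, `export_object_uri("s3://my-bucket/exports/my_events/", "my_segment")` yields `s3://my-bucket/exports/my_events/my_segment.parquet`.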

***

### Parquet format

These apply to both destinations unless noted otherwise.

| Key                                           | Default              | Description                                                                                                                                                                            |
| --------------------------------------------- | -------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `fileFormat.parquet.compressionCodec`         | `SNAPPY`             | Parquet compression. `SNAPPY`, `GZIP`, `ZSTD`, or `UNCOMPRESSED`.                                                                                                                      |
| `fileFormat.parquet.rowGroupSizeBytes`        | `134217728` (128 MB) | Target row group size in bytes.                                                                                                                                                        |
| `fileFormat.parquet.pageSizeBytes`            | `1048576` (1 MB)     | Target page size in bytes.                                                                                                                                                             |
| `fileFormat.parquet.enableDictionary`         | `true`               | Write Parquet dictionary pages. Disable for high-cardinality columns.                                                                                                                  |
| `fileFormat.parquet.json.maxDepth`            | —                    | Maximum nesting depth when serializing JSON columns to Parquet `group` types.                                                                                                          |
| `fileFormat.parquet.decimal.precision`        | —                    | Override decimal precision for `BIG_DECIMAL` columns.                                                                                                                                  |
| `fileFormat.parquet.decimal.scale`            | —                    | Override decimal scale for `BIG_DECIMAL` columns.                                                                                                                                      |
| `fileFormat.parquet.json.unknownKeysBehavior` | `warn`               | What to do when a JSON column in the row data has keys not present in the Iceberg schema: `warn` (log and continue) or `fail` (abort the task). Applies to `iceberg-rest` target only. |
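
When task configs are generated by a script, the Parquet keys above can be assembled and sanity-checked before they are merged into the `DataExportTask` map. The sketch below is one way to do that. The `parquet_overrides` helper is hypothetical, and writing every value as a string is an assumption based on the `"250"` style used in the configuration examples.

```python
# Codec names and defaults are taken from the Parquet format table.
ALLOWED_CODECS = {"SNAPPY", "GZIP", "ZSTD", "UNCOMPRESSED"}


def parquet_overrides(codec="SNAPPY", row_group_bytes=134217728,
                      page_bytes=1048576, enable_dictionary=True):
    """Return Parquet writer settings as a string-valued config fragment."""
    if codec not in ALLOWED_CODECS:
        raise ValueError(f"unsupported compression codec: {codec!r}")
    return {
        "fileFormat.parquet.compressionCodec": codec,
        "fileFormat.parquet.rowGroupSizeBytes": str(row_group_bytes),
        "fileFormat.parquet.pageSizeBytes": str(page_bytes),
        # Python's True/False become the JSON-style "true"/"false".
        "fileFormat.parquet.enableDictionary": str(enable_dictionary).lower(),
    }
```

The returned dictionary can be merged into the task config, for example `{**base_config, **parquet_overrides(codec="ZSTD")}`, where `base_config` is whatever map the script already holds.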

***

### Iceberg REST target

Set these when `externalTable.target=iceberg-rest`.

| Key                                         | Default      | Description                                                                    |
| ------------------------------------------- | ------------ | ------------------------------------------------------------------------------ |
| `catalog.iceberg-rest.restUri`              | *(required)* | Iceberg REST catalog URI, e.g. `https://glue.us-east-1.amazonaws.com/iceberg`. |
| `catalog.iceberg-rest.table.namespace`      | *(required)* | Catalog namespace (database) containing the target table.                      |
| `catalog.iceberg-rest.table.tableName`      | *(required)* | Target table name inside the namespace.                                        |
| `catalog.iceberg-rest.token`                | —            | Static bearer token for catalog authentication.                                |
| `catalog.iceberg-rest.credential`           | —            | OAuth2 client credential (`clientId:clientSecret`).                            |
| `catalog.iceberg-rest.rest.signing-region`  | —            | AWS region for SigV4 request signing (e.g. `us-east-1`).                       |
| `catalog.iceberg-rest.s3.access-key-id`     | —            | Static AWS access key for S3 file I/O (overrides vended credentials).          |
| `catalog.iceberg-rest.s3.secret-access-key` | —            | Static AWS secret key.                                                         |

<Note>
  For AWS Glue and S3 Tables, the catalog vends short-lived S3 credentials automatically — do not set static `s3.*` keys unless you want to override them. The minion's IAM role must have `glue:GetTable` and `s3:PutObject` (or equivalent S3 Tables) permissions on the target.
</Note>
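
The *(required)* markers in the reference tables above lend themselves to a quick pre-flight check before a table config is submitted. The sketch below is a client-side convenience only. The `missing_required_keys` function is hypothetical, and the controller's own validation may be stricter or differ in detail.

```python
# Required keys per destination, as marked in the reference tables.
REQUIRED_BY_TARGET = {
    "filesystem": ["outputDirURI"],
    "iceberg-rest": [
        "catalog.iceberg-rest.restUri",
        "catalog.iceberg-rest.table.namespace",
        "catalog.iceberg-rest.table.tableName",
    ],
}


def missing_required_keys(task_config: dict) -> list:
    """Return the required keys that are absent or empty in a DataExportTask config."""
    # externalTable.target defaults to "filesystem" per the core settings table.
    target = task_config.get("externalTable.target", "filesystem")
    if target not in REQUIRED_BY_TARGET:
        raise ValueError(f"unsupported externalTable.target: {target!r}")
    required = ["sourceTableName", *REQUIRED_BY_TARGET[target]]
    return [key for key in required if not task_config.get(key)]
```

An empty list means every required key for the chosen destination is present. The check does not verify that the values themselves are valid.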

***

### Iceberg batch commits

By default, each minion subtask uploads a Parquet file and stages it. The Generator tracks staged files in ZooKeeper and issues a single Iceberg snapshot commit (`AppendFiles.commit()`) when the batch is ready, producing one snapshot for many segments instead of one per segment. This reduces Iceberg snapshot churn on high-throughput tables.

The Generator runs an **EMIT / WAIT / COMMIT** decision each cycle:

| State    | Condition                                                                      | Action                                                 |
| -------- | ------------------------------------------------------------------------------ | ------------------------------------------------------ |
| `EMIT`   | Eligible segments exist and in-flight count is below `iceberg.commitThreshold` | Schedule new subtasks                                  |
| `WAIT`   | Running subtasks + staged queue ≥ `iceberg.commitThreshold`                    | Skip this cycle; wait for in-flight tasks to drain     |
| `COMMIT` | No eligible segments remain, or staged files ≥ `iceberg.commitThreshold`       | Flush the staged queue with one `AppendFiles.commit()` |

| Key                       | Default | Description                                                                                                                                              |
| ------------------------- | ------- | -------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `iceberg.commitThreshold` | `250`   | Number of staged files that triggers an immediate batch commit. Tune down for lower latency; tune up to reduce snapshot frequency on high-volume tables. |
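
The decision table above can be read as a small function. The sketch below is one possible interpretation rather than the actual Generator code. It assumes that `COMMIT` is checked first, then `WAIT`, then `EMIT`, and that the in-flight count means running subtasks plus staged files. The table does not state either, and the names `Decision` and `decide` are hypothetical.

```python
from enum import Enum


class Decision(Enum):
    EMIT = "EMIT"
    WAIT = "WAIT"
    COMMIT = "COMMIT"


def decide(eligible: int, running: int, staged: int,
           commit_threshold: int = 250) -> Decision:
    """Pick the Generator action for one cycle from three counters."""
    # Assumption: COMMIT takes precedence when the staged queue is full
    # or when there is nothing left to schedule.
    if staged >= commit_threshold or eligible == 0:
        return Decision.COMMIT
    # Running plus staged work would overshoot the batch size: wait for it to drain.
    if running + staged >= commit_threshold:
        return Decision.WAIT
    # Eligible segments exist and there is room in the batch.
    return Decision.EMIT
```

Under these assumptions, lowering `commit_threshold` makes both `WAIT` and `COMMIT` trigger sooner, which matches the tuning advice for `iceberg.commitThreshold`: smaller batches commit earlier but produce more snapshots.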

***

## Scheduling and monitoring

`DataExportTask` uses the standard Pinot minion task scheduler. To trigger a run immediately without waiting for the next `schedule` tick, use the task schedule endpoint:

```bash
curl -X POST "$CONTROLLER/tasks/schedule?taskType=DataExportTask&tableName=my_events_REALTIME"
```

To check task status, use the standard Pinot tasks API:

```bash
curl "$CONTROLLER/tasks/DataExportTask/my_events_REALTIME/taskstates"
```
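
For scripted monitoring, the same calls can be wrapped in a couple of small helpers. The sketch below only builds the schedule URL shown above and summarizes a status payload; it performs no network I/O. The helper names are hypothetical, and the response shape (a JSON object mapping task names to state strings) is an assumption that should be checked against your controller's actual output.

```python
from collections import Counter
from urllib.parse import urlencode


def schedule_url(controller: str, task_type: str, table_name_with_type: str) -> str:
    """Build the POST URL that triggers an immediate task run."""
    query = urlencode({"taskType": task_type, "tableName": table_name_with_type})
    return f"{controller.rstrip('/')}/tasks/schedule?{query}"


def summarize_states(task_states: dict) -> Counter:
    """Count tasks per state, e.g. how many are COMPLETED or IN_PROGRESS."""
    # Assumption: task_states maps task name -> state string.
    return Counter(task_states.values())
```

A monitoring script could alert when the summary contains any state other than the ones it expects, rather than inspecting tasks one by one.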

For a broader view of export health and checkpoint state, see the [Observability](./observability) page.