Usage
The off-heap upsert is enabled by default in new StarTree releases, and both FULL and PARTIAL upserts are supported. We have used RocksDB to implement the off-heap upsert. Different storage systems can be supported if needed. When using the off-heap upsert, the server creates one RocksDB store to be shared by all the upsert tables for better efficiency.
ColumnFamily
in the shared RocksDB store. To customize the table’s ColumnFamily
add the following RocksDB configs in the metadataManagerConfigs
section. The config names are kept consistent with those available for RocksDB
Enable async removal of upsert metadata
This is enabled by default in new StarTree releases, so you can skip this section unless there is a need to customize its behavior. When a table has a lot of primary keys, its upsert metadata in RocksDB can be huge. Cleaning them up can take a long time. As this cleanup is done on the HelixTaskExecutor threads by default, those threads may be occupied for a long time, blocking the Helix state transitions from the other tables, which can potentially block the real-time data ingestion. So this async removal feature was added to move cleanup of upsert metadata to another thread pool, releasing the HelixTaskExecutor threads as soon as possible. This feature can be enabled in themetadataManagerConfigs
section and customized with the configs listed below.
Use UpsertSnapshotCreation minion task
The open source implementation of the preloading feature uses the validDocIds snapshots kept on server local disk to identify valid docs and write their upsert metadata into the RocksDB store. As preloading is write-only, it can finish pretty fast in most cases. But for very large upsert tables, like those with hundreds of millions or billions of primary keys per server, this can still take a long time to finish. Besides, there are cases when servers may lose their local disks and have to download raw segments from the deep store. The raw segments don’t have validDocIds snapshots, so servers have to load them with cpu intensive read/check/write operations. To handle those issues, we have built a minion task calledUpsertSnapshotCreationTask
to prebuild the upsert metadata for table partitions on the minion workers and upload the metadata to the deep store. The servers can simply download and import the prebuilt metadata into RocksDB when loading upsert tables. The prebuilt upsert metadata contains certain upsert configs and segment information for consistency check before importing to ensure data correctness.
The minion task runs periodically to keep updating the prebuilt upsert metadata incrementally.
To enable the minion task, add the following task configs. Please make sure enableSnapshot
and enablePreload
are set to true in the tableConfig for this minion task to execute.
More about how to operate the minion tasks can be found in the Pinot docs.
segmentPartitionConfig
field and defines a single partition column, then the UpsertSnapshotCreation task will read the number of partitions as set in the numPartitions
field below.
segmentPartitionConfig
field or it has two or more partition columns specified there, you must explicitly tell the UpsertSnapshotCreation task how many partitions are in the table by using the num_partition_overwrite
field in the upsertConfig section like below.