How does StarTree Cloud solve it?
To simplify this process and reduce the burden on operators, StarTree Cloud provides the Segment Refresh Task. Instead of requiring manual compaction and cleanup, this task automates the entire process of segment replacement. It takes a set of input segments, processes them to generate new segments by merging, transforming, or rolling up the data, and then atomically replaces the old segments with the new ones. With this mechanism, queries always see a consistent dataset, even when replacement is in progress. The Segment Refresh Task, therefore, offers a better, more reliable way to manage upserts in Pinot, ensuring correctness while optimizing storage and performance.
What is Segment Refresh Task and How It Works
The Segment Refresh Task runs as a background minion task within StarTree Cloud and is designed to improve both storage and query efficiency. For example, imagine a table with three segments: S1, S2, and S3. Over time, updates have invalidated many rows across these segments. When the Segment Refresh Task is executed, it collects S1, S2, and S3 as input and creates a new merged segment, say S3_refreshed.
This new segment is immediately made visible to queries, while the old segments with no valid documents are removed eventually. The result is that queries only need to scan the refreshed segment, which reduces latency and saves storage space. Importantly, this process guarantees atomicity: queries never encounter a partial replacement where some data is missing or duplicated.
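The S1, S2, S3 example can be sketched as a toy in-memory model. This is not how StarTree Cloud or Pinot implements the task; the `Doc`, `Segment`, and `Table` classes are hypothetical names introduced only for illustration. The sketch keeps only valid documents when merging and swaps the segment list in a single assignment, so a reader sees either the old set or the new set. As a simplification, it drops the input segments in the same swap, whereas the real task removes them eventually.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    key: str      # primary key of the upserted record
    value: int
    valid: bool   # False once a newer record for the same key has arrived


@dataclass(frozen=True)
class Segment:
    name: str
    docs: tuple


class Table:
    """Hypothetical stand-in for a table's segment list (illustration only)."""

    def __init__(self, segments):
        # Immutable tuple: a query grabs one reference and works on that snapshot.
        self._segments = tuple(segments)

    def segment_names(self):
        return [s.name for s in self._segments]

    def total_docs(self):
        return sum(len(s.docs) for s in self._segments)

    def query_valid(self):
        snapshot = self._segments
        return {d.key: d.value for s in snapshot for d in s.docs if d.valid}

    def refresh(self, input_names, output_name):
        inputs = [s for s in self._segments if s.name in input_names]
        # Merge step: carry over only the documents that are still valid.
        merged = tuple(d for s in inputs for d in s.docs if d.valid)
        refreshed = Segment(output_name, merged)
        kept = [s for s in self._segments if s.name not in input_names]
        # Replacement step: one reference assignment, never a partial state.
        self._segments = tuple(kept + [refreshed])
        return refreshed


s1 = Segment("S1", (Doc("a", 1, False), Doc("b", 2, True)))
s2 = Segment("S2", (Doc("a", 3, False), Doc("c", 4, False)))
s3 = Segment("S3", (Doc("a", 5, True), Doc("c", 6, True)))
table = Table([s1, s2, s3])

docs_before = table.total_docs()
result_before = table.query_valid()
refreshed = table.refresh({"S1", "S2", "S3"}, "S3_refreshed")
result_after = table.query_valid()
docs_after = table.total_docs()
```

In this toy run the query result is the same before and after the refresh, while the number of stored documents drops from six to three, which mirrors the storage and scan savings described above.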
Read more about Segment Refresh Task
Things to Keep an Eye On While Configuring the Task
While the Segment Refresh Task is powerful, careful configuration is key to getting the best results. Running the task too frequently can put unnecessary strain on the system, as it may end up processing nearly all segments repeatedly. On the other hand, running it too infrequently can cause storage to grow and query performance to degrade. It is important to configure thresholds that determine when segments should be refreshed, ensuring that only segments with a significant number of invalid documents are selected. Another consideration is the buffer period, which allows the system to skip the most recent segments that are still ingesting new records. This avoids conflicts between ingestion and refresh. Segment naming conventions also matter, as consistently naming refreshed segments makes tie-breaking deterministic when conflicts arise.
Example: Configuring Segment Refresh Task for Upsert Table
You can configure the Segment Refresh Task as shown below. Include this in your table config:
Explanation of Key Parameters
- schedule: Defines when the task should run. In the above example, it runs every day at 2 AM.
- bufferTimePeriod: Ensures the most recent segments are skipped, so ingestion is not disrupted.
- invalidRecordsThresholdPercent: Only triggers a refresh if a large enough portion of the segment’s records have been invalidated by upserts.
- maxNumRecordsPerSegment: Prevents the creation of overly large output segments by setting a record count limit.
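To make the interaction between these parameters concrete, here is a minimal sketch in Python. It builds a config dictionary using the four parameter names listed above and then applies the buffer and threshold checks to a few made-up segments. Several things here are assumptions rather than facts from this article: the task type key `SegmentRefreshTask`, the `task.taskTypeConfigsMap` nesting, every value shown, the `parse_period` and `select_segments` helpers, the segment metadata fields, and the choice of `>=` for the threshold comparison. The schedule string is a Quartz-style cron expression for 2 AM daily.

```python
import json
from datetime import datetime, timedelta, timezone

# Illustrative values only; the task type key and nesting are assumptions.
table_task_config = {
    "task": {
        "taskTypeConfigsMap": {
            "SegmentRefreshTask": {
                "schedule": "0 0 2 * * ?",               # every day at 2 AM
                "bufferTimePeriod": "2d",                # skip the last 2 days
                "invalidRecordsThresholdPercent": "30",  # refresh at 30% invalid
                "maxNumRecordsPerSegment": "5000000",    # cap on output size
            }
        }
    }
}


def parse_period(text):
    """Hypothetical helper: turn '2d', '12h', or '30m' into a timedelta."""
    units = {"d": "days", "h": "hours", "m": "minutes"}
    return timedelta(**{units[text[-1]]: int(text[:-1])})


def select_segments(segments, cfg, now):
    """Hypothetical selection logic: apply the buffer, then the threshold."""
    cutoff = now - parse_period(cfg["bufferTimePeriod"])
    threshold = float(cfg["invalidRecordsThresholdPercent"])
    picked = []
    for seg in segments:
        if seg["end_time"] > cutoff:
            continue  # too recent: may still be receiving new records
        if seg["total_docs"] == 0:
            continue  # nothing to measure
        invalid_pct = 100.0 * seg["invalid_docs"] / seg["total_docs"]
        if invalid_pct >= threshold:
            picked.append(seg["name"])
    return picked


now = datetime(2024, 1, 10, tzinfo=timezone.utc)
segments = [
    {"name": "S1", "end_time": datetime(2024, 1, 1, tzinfo=timezone.utc),
     "total_docs": 1000, "invalid_docs": 600},   # old, 60% invalid
    {"name": "S2", "end_time": datetime(2024, 1, 5, tzinfo=timezone.utc),
     "total_docs": 1000, "invalid_docs": 100},   # old, only 10% invalid
    {"name": "S3", "end_time": datetime(2024, 1, 9, 12, tzinfo=timezone.utc),
     "total_docs": 1000, "invalid_docs": 900},   # 90% invalid but inside buffer
]

refresh_cfg = table_task_config["task"]["taskTypeConfigsMap"]["SegmentRefreshTask"]
selected = select_segments(segments, refresh_cfg, now)
config_json = json.dumps(table_task_config, indent=2)
```

With these made-up numbers only S1 is selected: S2 falls below the invalid-record threshold, and S3 is skipped because it ended inside the buffer window even though most of its records are invalid.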