To reduce disk usage and improve query performance, we recommend merging segments in real-time Pinot tables using the Minion merge rollup task.
To learn how to merge segments in real-time Pinot tables, watch the following video, or complete the tutorial below.
If your use case supports aggregating data, you may also want to roll up segments in real-time tables. For more information, check out the following video.
You can spin up a Pinot cluster by running the following command:
docker-compose up
This command will run a single instance of the Pinot Controller, Pinot Server, Pinot Broker, and Zookeeper. You can find the docker-compose.yml file on GitHub.
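To give a sense of what that file contains, a stripped-down sketch might look like the following. The image tag, ports, and command flags here are illustrative assumptions; the actual file on GitHub is the source of truth and may also define extra services (such as Kafka and a Minion) that this tutorial relies on.

```yaml
# Sketch of a minimal Pinot cluster: one ZooKeeper node plus a Controller, Broker, and Server.
version: "3"
services:
  zookeeper:
    image: zookeeper:3.8
    ports:
      - "2181:2181"

  pinot-controller:
    image: apachepinot/pinot:0.12.1
    command: "StartController -zkAddress zookeeper:2181"
    ports:
      - "9000:9000"
    depends_on:
      - zookeeper

  pinot-broker:
    image: apachepinot/pinot:0.12.1
    command: "StartBroker -zkAddress zookeeper:2181"
    ports:
      - "8099:8099"
    depends_on:
      - pinot-controller

  pinot-server:
    image: apachepinot/pinot:0.12.1
    command: "StartServer -zkAddress zookeeper:2181"
    depends_on:
      - pinot-broker
```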
The segment threshold in the table's streamConfigs means that a segment will be committed once it contains one million records or after one minute has elapsed, whichever comes first. For more on configuring the segment threshold, see the segment threshold guide.
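Concretely, that threshold is expressed through the flush threshold properties in streamConfigs, roughly along these lines (the values mirror the description above; treat this as a sketch rather than the exact config used in this tutorial):

```json
"streamConfigs": {
  "realtime.segment.flush.threshold.rows": "1000000",
  "realtime.segment.flush.threshold.time": "1m"
}
```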
The table config also includes a property that's usually only used for offline tables. Strictly speaking it shouldn't be necessary here, but the merge/roll-up logic in the current version (0.12.1) relies on it being present. That reliance has already been fixed on the main branch, so the property won't be required once 0.13.0 is released.
We’re also interested in the MergeRollupTask, which is extracted below:
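In outline, that task configuration looks something like the sketch below. The 5m_2m merge level name and the 5-minute and 2-minute periods match the values that show up in the logs later in this tutorial; the mergeType is an assumption (concat merges rows as-is, while rollup additionally aggregates metric columns).

```json
"task": {
  "taskTypeConfigsMap": {
    "MergeRollupTask": {
      "5m_2m.mergeType": "concat",
      "5m_2m.bucketTimePeriod": "5m",
      "5m_2m.bufferTimePeriod": "2m"
    }
  }
}
```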
This configuration will bucket records from the same 5-minute period (bucketTimePeriod) and will only process records whose timestamps are more than 2 minutes old (bufferTimePeriod).
We are intentionally using very small values for the bucketTimePeriod and bufferTimePeriod for the purposes of this example. You’ll want to use larger values for production systems.
You can create the table and schema by running the following command:
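The exact invocation depends on where the schema and table config live; assuming they're saved as config/schema.json and config/table.json and that directory is mounted into the controller container (named pinot-controller here), the command would look something like this:

```bash
docker exec -it pinot-controller \
  /opt/pinot/bin/pinot-admin.sh AddTable \
  -schemaFile /config/schema.json \
  -tableConfigFile /config/table.json \
  -exec
```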
We can navigate to the Pinot UI and run the following query to see the segments that have been created and the number of records that they contain:
select $segmentName, count(*),
       ToDateTime(min(ts), 'YYYY-MM-dd HH:mm:ss') AS minDate,
       ToDateTime(max(ts), 'YYYY-MM-dd HH:mm:ss') AS maxDate
from events
group by $segmentName
order by max(ts) DESC
limit 50
2023/03/31 11:19:15.651 INFO [MergeRollupTaskGenerator] [pool-10-thread-5] Start generating task configs for table: events_REALTIME for task: MergeRollupTask
2023/03/31 11:19:15.658 INFO [MergeRollupTaskGenerator] [pool-10-thread-5] Creating the gauge metric for tracking the merge/roll-up task delay for table: events_REALTIME and mergeLevel: 5m_2m.(watermarkMs=1680261300000, bufferTimeMs=120000, bucketTimeMs=300000, taskDelayInNumTimeBuckets=0)
2023/03/31 11:19:15.658 INFO [MergeRollupTaskGenerator] [pool-10-thread-5] Bucket with start: 1680261300000 and end: 1680261600000 (table : events_REALTIME, mergeLevel : 5m_2m) cannot be merged yet
2023/03/31 11:19:15.663 INFO [MergeRollupTaskGenerator] [pool-10-thread-5] Finished generating task configs for table: events_REALTIME for task: MergeRollupTask, numTasks: 0
2023/03/31 11:24:15.672 INFO [MergeRollupTaskGenerator] [pool-10-thread-3] Start generating task configs for table: events_REALTIME for task: MergeRollupTask
2023/03/31 11:24:15.682 INFO [MergeRollupTaskGenerator] [pool-10-thread-3] Update watermark for table: events_REALTIME, mergeLevel: 5m_2m from: 1680261300000 to: 1680261300000
2023/03/31 11:24:15.696 INFO [MergeRollupTaskGenerator] [pool-10-thread-3] Finished generating task configs for table: events_REALTIME for task: MergeRollupTask, numTasks: 1
And we can check the Pinot Minion logs to see if the job has run:
We can see that the merge job has rolled up segments 0-4 into two larger segments, each containing data from a 5-minute window.
If we wait a few more minutes, the next time it runs, it will roll up the rest of the data for the 11:20 to 11:25 window, as well as some of the following window.
We can see the output of the next few runs below: