Merge Small Segments in Offline Tables
In this recipe we’ll learn how to merge small segments into larger ones. Having fewer, larger segments can reduce storage overhead and improve query performance. The merge is done using the Minion merge rollup task.
Pinot Version | 1.0.0 |
---|---|
Code | startreedata/pinot-recipes/merge-small-segments |
You can only merge segments in offline tables that have a time column.
You can also merge segments in real-time tables. For more information, see the merge segments in real-time tables guide.
Prerequisites
To follow the code examples in this guide, you must install Docker locally and download recipes.
Navigate to recipe
- If you haven’t already, download recipes.
- In a terminal, go to the recipe by running the following command:
Launch Pinot Cluster
You can spin up a Pinot Cluster by running the following command:
This command will run a single instance of the Pinot Controller, Pinot Server, Pinot Broker, and Zookeeper. You can find the docker-compose.yml file on GitHub.
Dataset
We’re going to import a couple of CSVs that contain results from the Australian Open tennis tournament that was held in January 2022. The contents of the files are shown below:
input/matches0.csv
input/matches1.csv
Pinot Schema and Table
Now let’s create a Pinot Schema and Table.
First, the schema:
config/schema.json
We’ll also have the following table config:
config/table.json
Our table must specify segmentsConfig.timeColumnName, otherwise the merge process won’t merge any segments.
The main thing that we’re interested in is the MergeRollupTask, which is extracted below:
This configuration will bucket records from the same 1 day period into the same segment. It will only process records with a timestamp from more than 5 minutes ago.
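To make the bucket and buffer behaviour concrete, here is a minimal Python sketch of the rule as described above. It is an illustration of the prose, not Pinot's implementation, and the names `bucket_start` and `group_into_buckets` are hypothetical. It assumes timestamps are epoch milliseconds and that buckets are aligned to UTC days.

```python
from collections import defaultdict

DAY_MS = 24 * 60 * 60 * 1000   # 1 day bucket, as described in the text
BUFFER_MS = 5 * 60 * 1000      # 5 minute buffer, as described in the text


def bucket_start(ts_ms, bucket_ms=DAY_MS):
    """Return the start (epoch millis) of the bucket that contains ts_ms."""
    return ts_ms - (ts_ms % bucket_ms)


def group_into_buckets(timestamps_ms, now_ms, bucket_ms=DAY_MS, buffer_ms=BUFFER_MS):
    """Group timestamps by bucket, skipping any that are newer than the buffer."""
    buckets = defaultdict(list)
    for ts in timestamps_ms:
        if ts < now_ms - buffer_ms:  # only records older than the buffer window
            buckets[bucket_start(ts, bucket_ms)].append(ts)
    return dict(buckets)
```

For example, two matches played at 11:00 and 18:14 UTC on 2022-01-17 (`1642417200000` and `1642443240000`) both map to the bucket starting at `1642377600000`, so they would be candidates for the same merged segment.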
You can create the table and schema by running the following command:
Remove the -arm64 suffix if you’re not using a Mac M1/M2.
Import Data
Now let’s import those CSV files into Pinot, using the following ingestion spec:
config/job-spec.yml
You can run the following command to run the import:
Once this job has run, we can list the created segments by running the following command:
Output
Let’s wrap this command in a function so that we can use it again later:
We could then call the function like this:
We can check the contents of these segments by writing the following function:
We can call it like this:
Output
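If you prefer Python to shell functions, segment listing can be sketched against the Controller REST API. This is a hedged sketch, not the recipe's own helper. It assumes the Controller is reachable at `http://localhost:9000`, that the table is named `matches`, and that `GET /segments/{tableName}` returns a list of objects keyed by table type. The function names are hypothetical.

```python
import json
from urllib.request import urlopen


def segment_names(payload, table_type="OFFLINE"):
    """Extract segment names for one table type from a /segments response."""
    names = []
    for entry in payload:
        names.extend(entry.get(table_type, []))
    return names


def list_segments(table="matches", controller="http://localhost:9000"):
    """Fetch and return the segment names for a table (assumed Controller URL)."""
    with urlopen(f"{controller}/segments/{table}") as resp:
        return segment_names(json.load(resp))
```

Calling `list_segments()` before and after the merge gives a quick way to compare the segment lists.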
Merge segments
Now we’re going to merge these segments using the Minion merge rollup task.
The configuration that we defined in the matches table is going to bucket records from a 1 day period into the same bucket. Since our events all happened on the same day, we would expect all records to be merged into a single segment.
We can run the merge rollup task by running the following:
Output
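For reference, triggering the task from Python can be sketched as below. This assumes the Controller is reachable at `http://localhost:9000` and uses the Controller's `POST /tasks/schedule` endpoint with `taskType` and `tableName` query parameters. The helper name `merge_rollup_request` is hypothetical, and the function only builds the request, it does not send it.

```python
from urllib.parse import urlencode
from urllib.request import Request


def merge_rollup_request(table="matches_OFFLINE", controller="http://localhost:9000"):
    """Build (but do not send) a POST request that schedules MergeRollupTask."""
    query = urlencode({"taskType": "MergeRollupTask", "tableName": table})
    return Request(f"{controller}/tasks/schedule?{query}", method="POST")
```

To send it, pass the result to `urllib.request.urlopen`, for example `urlopen(merge_rollup_request())`.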
We can then check the Pinot Controller logs to see that it’s been triggered:
Output
And we can check the Pinot Minion logs to see if the job has run:
Output
Let’s now check the list of segments again:
Output
We can see the new segment, but the initial segments are still there as well. The Pinot broker knows to use the new segment when it processes queries, so this isn’t a problem. We can confirm this by running the following query in the Pinot UI:
$segmentName | loser | matchTime | round | score | winner |
---|---|---|---|---|---|
merged_1day_1680183805991_0_matches_1642417200000_1642443240000_0 | Salvatore Caruso | 2022-01-17 11:00:00.0 | R128 | 6-4 6-2 6-1 | Miomir Kecmanovic |
merged_1day_1680183805991_0_matches_1642417200000_1642443240000_0 | Mikhail Kukushkin | 2022-01-17 11:10:00.0 | R128 | 6-3 6-4 6-2 | Tommy Paul |
merged_1day_1680183805991_0_matches_1642417200000_1642443240000_0 | Chun Hsin Tseng | 2022-01-17 12:00:00.0 | R128 | 6-4 6-3 6-2 | Oscar Otte |
merged_1day_1680183805991_0_matches_1642417200000_1642443240000_0 | Sam Querrey | 2022-01-17 13:14:00.0 | R128 | 7-5 6-3 6-3 | Lorenzo Sonego |
merged_1day_1680183805991_0_matches_1642417200000_1642443240000_0 | Federico Coria | 2022-01-17 15:32:00.0 | R128 | 6-1 6-1 6-3 | Gael Monfils |
merged_1day_1680183805991_0_matches_1642417200000_1642443240000_0 | Cameron Norrie | 2022-01-17 11:10:00.0 | R128 | 6-3 6-0 6-4 | Sebastian Korda |
merged_1day_1680183805991_0_matches_1642417200000_1642443240000_0 | Lucas Pouille | 2022-01-17 11:03:00.0 | R128 | 3-6 6-3 6-4 6-3 | Corentin Moutet |
merged_1day_1680183805991_0_matches_1642417200000_1642443240000_0 | Fabio Fognini | 2022-01-17 13:08:00.0 | R128 | 6-1 6-4 6-4 | Tallon Griekspoor |
merged_1day_1680183805991_0_matches_1642417200000_1642443240000_0 | Tomas Martin Etcheverry | 2022-01-17 18:14:00.0 | R128 | 6-1 6-2 7-6(2) | Pablo Carreno Busta |
merged_1day_1680183805991_0_matches_1642417200000_1642443240000_0 | Alejandro Tabilo | 2022-01-17 16:51:00.0 | R128 | 6-2 6-2 6-3 | Carlos Alcaraz |
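The same check can be sketched in Python by querying the broker and collecting the distinct `$segmentName` values. This is an illustration under assumptions: the broker is taken to be at `http://localhost:8099`, the query text is up to you (for instance `select $segmentName, * from matches limit 10`), and the helper names are hypothetical. The response layout assumed here is the broker's `resultTable` with `dataSchema.columnNames` and `rows`.

```python
import json
from urllib.request import Request


def sql_request(sql, broker="http://localhost:8099"):
    """Build (but do not send) a POST request for the broker's /query/sql endpoint."""
    body = json.dumps({"sql": sql}).encode("utf-8")
    return Request(
        f"{broker}/query/sql",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def segments_in_result(response):
    """Return the set of $segmentName values found in a broker query response."""
    table = response["resultTable"]
    idx = table["dataSchema"]["columnNames"].index("$segmentName")
    return {row[idx] for row in table["rows"]}
```

If only the merged segment is serving queries, `segments_in_result` should return a set with a single name that starts with `merged_1day_`.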
Also, Pinot’s retention manager will take care of removing the old segments the next time that it runs.
We’ll see the following messages in the Pinot Controller’s logs when the retention manager has run:
Output
We can then check the contents of that segment by running the following:
Output
We now have a single segment that contains all the records from the two CSV files that we ingested.
Automatic scheduling
We can also automatically schedule the merge task by adding the following configuration to the Pinot Controller:
Controller configuration