Pinot Version | 0.9.3 |
---|---|
Code | startreedata/pinot-recipes/managed-offline-flow |
The `realtime.segment.flush.threshold.rows` config is intentionally set to an extremely small value so that the segment will be committed after 10,000 records have been ingested. In a production system this value should be set much higher, as described in the configuring segment threshold guide.

The table config also defines the `RealtimeToOfflineSegmentsTask`, which is extracted below:
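The extracted configuration isn't preserved in this copy of the page, but based on the values discussed below it looks roughly like this (the `bufferTimePeriod` and `maxNumRecordsPerSegment` values here are assumptions; the other values are taken from the surrounding text):

```json
"task": {
  "taskTypeConfigsMap": {
    "RealtimeToOfflineSegmentsTask": {
      "bufferTimePeriod": "1m",
      "bucketTimePeriod": "5m",
      "roundBucketTimePeriod": "1m",
      "mergeType": "rollup",
      "count.aggregationType": "max",
      "maxNumRecordsPerSegment": "100000"
    }
  }
}
```

This block goes in the real-time table config; the Pinot Minion picks it up when the `RealtimeToOfflineSegmentsTask` is scheduled.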
- `bufferTimePeriod` - Tasks won't be scheduled unless the time window is older than this buffer.
- `bucketTimePeriod` - The time window size/amount of data processed for each run.
- `roundBucketTimePeriod` - Round the time value before merging rows in the offline table. The value of `1m` that we've used means that values in the `ts` column will be rounded to the nearest minute.
- `mergeType` - The type of aggregation to apply when creating rows in the offline table. Valid values are:
  - `concat` - Don't aggregate anything.
  - `rollup` - Perform metric aggregations across common dimensions and time.
  - `dedup` - Deduplicate rows that have the same values.
- `{metricName}.aggregationType` - Aggregation function to apply to the metric for aggregations. Only applicable for `rollup`. Allowed values are `sum`, `max`, and `min`. We are selecting the `max` value for the `count` column.
- `maxNumRecordsPerSegment` - The number of records to include in a segment.

`bufferTimePeriod` and `bucketTimePeriod` are intentionally set to very low values so that it's easier to see how they work. In a production setup, these values would usually be set to a time of 1 day or more.

Next, let's ingest some data into the `events` Kafka topic, by running the following:
## Real-Time and Offline Tables
The broker uses a time boundary to decide which records to read from the real-time table and which to read from the offline table. The time boundary is derived from `ingestionConfig.batchIngestionConfig.segmentIngestionFrequency` in our offline table:

- If it is set to `HOURLY`, then `timeBoundary = Maximum end time of offline segments - 1 HOUR`
- Otherwise, `timeBoundary = Maximum end time of offline segments - 1 DAY`
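For reference, the relevant part of the offline table config looks something like this (a sketch; `HOURLY` is an assumed value, not confirmed by this extract):

```json
{
  "ingestionConfig": {
    "batchIngestionConfig": {
      "segmentIngestionType": "APPEND",
      "segmentIngestionFrequency": "HOURLY"
    }
  }
}
```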
In this case the time boundary is `1646661240000`, or `2022-03-07T13:54:00` in a more friendly format. This value will increase in increments of 5 minutes every time that the RT2OFF job runs, since we set `bucketTimePeriod` to `5m`. You can increase this value in the real-time table config to have the job process more data.
If we temporarily pause ingestion, we can check which queries actually get executed against each table.
Let’s say we run the following query to count the number of records:
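Assuming the hybrid table shares its name with the Kafka topic (`events` — an assumption, since the query itself isn't preserved in this extract), the count query looks like this:

```sql
SELECT count(*)
FROM events
```

Run against the hybrid table, this returns the combined result shown below.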
count(*) |
---|
1450197 |
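The next two results come from querying the physical tables directly. Pinot names these with `_OFFLINE` and `_REALTIME` suffixes, so the queries would look something like this (the table name, and which count belongs to which table, are assumptions — though the two per-table counts do sum to the hybrid total: 26,595 + 1,423,602 = 1,450,197):

```sql
SELECT count(*)
FROM events_OFFLINE;

SELECT count(*)
FROM events_REALTIME;
```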
count(*) |
---|
26595 |
count(*) |
---|
1423602 |
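Finally, grouping by timestamp shows the effect of `roundBucketTimePeriod`: rows that have been moved to the offline table have had their `ts` values rounded to the minute, while real-time rows keep their raw millisecond timestamps. A query along these lines (table name and ordering are assumptions) produces the result below:

```sql
SELECT ts, count(*)
FROM events
GROUP BY ts
ORDER BY count(*) DESC
LIMIT 5
```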
ts | count(*) |
---|---|
2022-03-04 15:54:00.0 | 17166 |
2022-03-04 15:53:00.0 | 9429 |
2022-03-07 16:04:56.059 | 1 |
2022-03-07 15:29:59.193 | 1 |
2022-03-07 14:55:02.075 | 1 |