Learn how to set up tiered storage with a cloud object store, effectively decoupling storage from compute in your Pinot cluster.
Decide which servers will host the remote segment data:

- Setup 2.1: Reuse the table's servers, i.e. the tenant already set in `tableConfig->tenants->server` in your Pinot table configuration. This option is a no-op, requiring you to do nothing.
- Setup 2.2: Use a dedicated tenant. Tag the servers for the remote tier as `RemoteTenant_OFFLINE` (or whatever you prefer) using the `/instances/{instanceName}/updateTags` API in the Swagger UI on the host where Pinot is running. This link is only accessible when Pinot is running.
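For reference, a minimal sketch of where the serving tenant lives in a Pinot table config (the tenant names here are illustrative):

```json
{
  "tenants": {
    "broker": "DefaultTenant",
    "server": "DefaultTenant"
  }
}
```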
Use the `POST /cluster/configs` API to add the following cluster-level tier storage configurations.

Property | Description | Default value |
---|---|---|
tier.enabled | Configuration to enable tier storage for the cluster | false |
segment.cache.directory | Where to cache segment headers to speed up server restarts | empty (disabled) |
ondemand.total.size | Max memory for remote data access. Ensure this is smaller than the direct memory available on the server, or you'll get OutOfDirectMemory exceptions | 24G |
ondemand.reserved.size | A portion of the total size above. Ensure this is larger than the size of the largest uncompressed segment | 4G |
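As a sketch, the request body for `POST /cluster/configs` might look like the following. Note that the fully qualified property names here are an assumption, based on the `pinot.server.instance.*` names that appear elsewhere on this page:

```json
{
  "pinot.server.instance.tier.enabled": "true",
  "pinot.server.instance.buffer.reference.manager.ondemand.total.size": "24G",
  "pinot.server.instance.buffer.reference.manager.ondemand.reserved.size": "4G"
}
```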
The `pinot.server.instance.segment.cache.directory` config enables caching so that servers restart more quickly, but it does not affect query execution speed. The two `ondemand` configs are shown here because of their importance to query performance. More information about these configs and others can be found in the sections below. You may want to set some of them before restarting servers.

Next, add `tierConfigs` to your Pinot table configuration (example below). The `segmentAge` will be calculated using the primary `timeColumnName` as set in the Pinot table configuration. Note that the text index is under development and not functional.

A periodic controller job, the `SegmentRelocator`, will check and migrate segments to the proper tier, such as when segments cross the `segmentAge` boundary. See the SegmentRelocator configuration documentation for more information.
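A minimal sketch of such a `tierConfigs` entry, assuming the remote tenant from setup 2.2 and an illustrative region and bucket; each field is described in the table below:

```json
{
  "tierConfigs": [
    {
      "name": "remoteTier",
      "segmentSelectorType": "time",
      "segmentAge": "7d",
      "storageType": "pinot_server",
      "serverTag": "RemoteTenant_OFFLINE",
      "tierBackend": "s3",
      "tierBackendProperties": {
        "region": "us-west-2",
        "bucket": "my-pinot-tier-bucket",
        "pathPrefix": "pinot-tier-data"
      }
    }
  ]
}
```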
Property | Description |
---|---|
name | Set this to anything you want; it is used to refer to this tier uniquely. |
segmentSelectorType | Can be time, for moving segments by age, or fixed, for moving specific segments by name |
segmentAge | Applicable if segmentSelectorType is time. Set this to the segment age boundary you picked in Step 1. This takes any period string, for example: 0s, 12h, 30m, 60d |
segmentList | Applicable if segmentSelectorType is fixed. Set this to a comma-separated list of segments you wish to move to the remote tier. This is a good way to do a quick test. |
storageType | Has to be pinot_server |
serverTag | If using setup 2.1, set this to the table's server tag. If using setup 2.2, set this to the tag of the remote tier created in Step 2.2 |
tierBackend | For an AWS S3 tier backend, use s3. For a Google Cloud Storage (GCS) tier backend using the interoperable API, use gcs_interoperable. |
tierBackendProperties (for S3) | Set these as per your S3 details. Supported properties are region (required), bucket (required, but can be the same as the deep store bucket; if reusing the deep store bucket, be sure to use pathPrefix to avoid collisions), and pathPrefix (an optional relative path prefix to use in the bucket). The accessKey, secretKey and sessionToken can be omitted if servers use role-based access control to access S3. |
tierBackendProperties (for GCS) | For detail about tiered storage backend properties, see Tiered storage with GCS. |
The `segmentAge` setting is independent of the table's retention. Segments will still be purged from Pinot once the table's retention is reached, regardless of what tier the segments are on.

For GCS, credentials are configured via `tierBackendProperties`, as shown in the following example. In these `tierBackendProperties`, `gcpprojectid` and `gcpkeypath` are required to be able to access the GCP secret manager. The `gcpkeypath` should be set to the absolute path where the GCP key is mounted on the server (typically `/home/pinot/gcp/credentials.json`).
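A sketch of a GCS tier backend configuration; the project ID and bucket are placeholders, and carrying over bucket/pathPrefix from the S3 properties is an assumption:

```json
{
  "tierBackend": "gcs_interoperable",
  "tierBackendProperties": {
    "gcpprojectid": "my-gcp-project",
    "gcpkeypath": "/home/pinot/gcp/credentials.json",
    "bucket": "my-pinot-tier-bucket",
    "pathPrefix": "pinot-tier-data"
  }
}
```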
New segments are placed on the proper tier based on `segmentAge`. For existing segments, the `SegmentRelocator` task moves them to the remote tier periodically, as configured.
Use the `segments/{tableName}/tiers` API to check segments' current tier.

The preload feature is controlled by the following configs:

Config | Description | Default value |
---|---|---|
preload.enable | Whether to use preload feature | false |
preload.total.size | How much disk space to hold preloaded index | 100G |
preload.dir | Where to keep the preloaded index. Be sure this is different from the dataDir, where the server processes segments while loading them | required |
preload.load.existing.buffers | Whether to keep preloaded index across server restarts | true |
preload.index.keys | What index types to pin, applied to all tables | empty |
preload.index.keys.override | Same as above, but for a specific table. Be sure to use the tableNameWithType, like foo_OFFLINE | empty |
Use the `<columnName>.<indexType>` convention to specify the index buffers to preload. Use `*` as a wildcard for the column name to preload all index buffers of a given index type.
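For example, a sketch of the preload cluster configs; the column name memberId is illustrative, and the comma-separated value format is an assumption:

```json
{
  "pinot.server.instance.buffer.reference.manager.preload.enable": "true",
  "pinot.server.instance.buffer.reference.manager.preload.dir": "/home/pinot/data/preload",
  "pinot.server.instance.buffer.reference.manager.preload.index.keys": "memberId.inverted_index,*.dictionary"
}
```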
To override the preloaded index keys for a specific table, set the `pinot.server.instance.buffer.reference.manager.preload.index.keys.override` cluster configuration option as follows, using the `tableNameWithType`, like `foo_OFFLINE`.
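A sketch of the override; the exact value format that pairs a tableNameWithType with its key list is an assumption:

```json
{
  "pinot.server.instance.buffer.reference.manager.preload.index.keys.override": "foo_OFFLINE:memberId.inverted_index,*.dictionary"
}
```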
Alternatively, set the `preload.index.keys.override` property inside `tierBackendProperties`, as follows:
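A sketch of the same override inside the table's tier backend properties; since this setting is already table-scoped, no table prefix should be needed here (an assumption):

```json
{
  "tierBackendProperties": {
    "preload.index.keys.override": "memberId.inverted_index,*.dictionary"
  }
}
```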
The following configs control the on-disk mmap cache for remote segment data:

Config | Description | Default value |
---|---|---|
mmap.enable | Whether to use this caching feature | false |
mmap.total.size | How much disk space to hold cached index | 100G |
mmap.dir | Where to keep the cached index. Be sure this is different from the dataDir, where the server processes segments while loading them | required |
mmap.load.existing.buffers | Whether to keep cached index across server restarts | true |
mmap.num.buffers | How many index buffers to cache | unlimited |
mmap.cache.refresh.policy | Only LFU available for now | LFU |
mmap.cache.refresh.initial.delay.seconds | When to start cache refresher | 30 |
mmap.cache.refresh.frequency.seconds | How often to refresh the cache | 60 |
The size of the `ondemand` space is an important configuration for query performance. If it's too small, it may become a bottleneck, preventing the server from prefetching segment data effectively or fully parallelizing the remote reads. The `reserved` size ensures query execution can always proceed, so it should be larger than the max segment size in the table. If query execution tends to wait for `ondemand` space, you should increase the `reserved` space. You've seen these two configs above in Add cluster configs and restart servers.
Property | Description | Default value |
---|---|---|
ondemand.total.size | Max memory for remote data access. Ensure this is smaller than the direct memory available on the server, or you'll get OutOfDirectMemory exceptions | 24G |
ondemand.reserved.size | A portion of the total size above. Ensure this is larger than the size of the largest uncompressed segment | 4G |
Segment data on the remote tier can also be accessed via `readAhead` to speed up the query. The access methods can be turned on/off as a query option, making it easy to experiment with both access methods, as in this example:
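For instance, a sketch of a broker query request that flips the access method per query, assuming the query option key matches the readAhead.enable config listed below (the table and columns are illustrative):

```json
{
  "sql": "SELECT memberId, COUNT(*) FROM myTable WHERE daysSinceEpoch > 19000 GROUP BY memberId",
  "queryOptions": "readAhead.enable=true"
}
```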
If set in `tierBackendProperties`, the option will apply to all queries against this table.
For columns used in the non-filter clauses (projection, groupBy, and orderBy): if they are dictionary encoded, their dictionaries are prefetched; but in any case, their forward index data is accessed via the `readAhead` method, because only part of the data is needed to evaluate the clauses.

Config | Description | Default value |
---|---|---|
readAhead.enable | Whether to enable the read ahead access method | false, i.e. all data is accessed by prefetching |
prefetch.column.index.list.processing.phase | Force prefetching to access certain indexes | empty |
readAhead.column.index.list.processing.phase | Force readAhead to access certain indexes | empty |
initialBytes (readAhead specific) | Initial buffer size to access data by read ahead | 10KB |
maxBytes (readAhead specific) | Max buffer size to access data by read ahead | 1MB |
numTriesBeforeDouble (readAhead specific) | How many reads before doubling the buffer size | 1 |
downloadAll (readAhead specific) | Whether to download all index data by a single read, avoiding the cost of growing buffer | false |
readAhead.use.blockCache | Whether to use block cache prefetch read instead of read ahead (need to set readAhead.enable=true first) | false |
blockCache.blockSize.bytes | The size of the block to prefetch in block cache prefetch read | 10000 |
To preload the forward index blocks, add `*.forward_block` to the `pinot.server.instance.buffer.reference.manager.preload.index.keys` cluster configuration. Additionally, set `pinot.server.instance.index.sparse.enabled` to true in the cluster config to enable prefetching support.
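Putting the two together, a sketch of the cluster config body:

```json
{
  "pinot.server.instance.buffer.reference.manager.preload.index.keys": "*.forward_block",
  "pinot.server.instance.index.sparse.enabled": "true"
}
```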
Star-tree index support on the tier backend is controlled by the `enable.startree.index` configuration in `tierBackendProperties` (shown below). Changes here require you to reload the table or restart the servers. This is disabled by default.
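A sketch of that setting inside the table's tier backend properties:

```json
{
  "tierBackendProperties": {
    "enable.startree.index": "true"
  }
}
```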
The star-tree index data is accessed via the `readAhead` method during query.
Here is an example of preloading the tree nodes of the star-tree index. The count suffix is aligned with the order of the star-tree index configurations set in the table configuration.
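A sketch, under the assumption that star-tree buffer keys follow a `star_tree_index` name with a numeric count suffix (the exact key naming is an assumption):

```json
{
  "pinot.server.instance.buffer.reference.manager.preload.index.keys": "*.star_tree_index_0"
}
```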
The S3 client used by the tier backend is tuned via configs in `tierBackendProperties`. Most of them need a table reload or a server restart to take effect.

Pinot supports both sync and async S3 clients. By default, the sync S3 client is used, and a thread pool is created to fetch data via the sync S3 clients in parallel, without blocking the query processing threads. The configs for the sync S3 client are prefixed with `s3client.http`.

The async S3 client can reduce the size of the thread pool considerably. It uses async I/O to fetch data in parallel in a non-blocking manner, and uses the thread pool mentioned above only to process the I/O completion callbacks. The configs for the async S3 client are prefixed with `s3client.asynchttp`. Configs prefixed with `s3client.general` apply to both kinds of clients.
Review the related configuration options here:
Config | Description | Default value |
---|---|---|
s3client.general.apiCallTimeout | Timeout for e2e API call, which may involve a few http request retries. | no limit |
s3client.general.apiCallAttemptTimeout | Timeout for a single http request. | no limit |
s3client.general.numRetries | How many times to retry the http request, with exponential backoff and jitter. | 3 |
s3client.general.readBufferSize | Size of the buffer used to stream data into the buffers for query processing. | 8KB |
Config | Description | Default value |
---|---|---|
s3client.http.maxConnections | The max number of connections to pool for reuse | 50 |
s3client.http.socketTimeout | Timeout to read/write data | 30s |
s3client.http.connectionTimeout | Timeout to establish connection | 2s |
s3client.http.connectionTimeToLive | Timeout to reclaim the idle connection | 0s |
s3client.http.connectionAcquireTimeout | Timeout to get a connection from pool | 10s |
s3client.http.connectionMaxIdleTimeout | Idle time after which a connection becomes a candidate for reclamation | 60s |
s3client.http.reapIdleConnections | Whether to reclaim idle connections in the pool | true |
Config | Description | Default value |
---|---|---|
s3client.asynchttp.maxConcurrency | The max number of concurrent requests allowed. We use HTTP/1.1, so this controls max connections. | 50 |
s3client.asynchttp.readTimeout | Timeout to read data | 30s |
s3client.asynchttp.writeTimeout | Timeout to write data | 30s |
s3client.asynchttp.connectionTimeout | Timeout to establish connection | 2s |
s3client.asynchttp.connectionTimeToLive | Timeout to reclaim the idle connection | 0s |
s3client.asynchttp.connectionAcquireTimeout | Timeout to get a connection from pool | 10s |
s3client.asynchttp.connectionMaxIdleTimeout | Idle time after which a connection becomes a candidate for reclamation | 60s |
s3client.asynchttp.reapIdleConnections | Whether to reclaim idle connections in the pool | true |
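As a sketch, a few of these client settings inside the table's tier backend properties (the values shown are simply the defaults from the tables above):

```json
{
  "tierBackendProperties": {
    "s3client.general.numRetries": "3",
    "s3client.general.readBufferSize": "8KB",
    "s3client.http.maxConnections": "50",
    "s3client.http.socketTimeout": "30s"
  }
}
```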
To speed up server restarts, set the `pinot.server.instance.segment.cache.directory` cluster configuration to a path. This path must be different from the dataDir. Setting the property causes segment metadata files, such as `creation.meta`, `metadata.properties`, and `index_map`, along with certain byte slices of the `columns.psf` file (specifically, the header bytes of each of the index buffers), to be cached onto the disk.

These cached files are used during server startup and eliminate the expensive object store calls, which helps reduce the overall server startup time.
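For example (the path is an illustrative choice; it must differ from the dataDir):

```json
{
  "pinot.server.instance.segment.cache.directory": "/home/pinot/data/segment-cache"
}
```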
"directTierUpload": "true"
in the individual table’s minion task config and adding the cluster config "pinot.server.instance.table.data.manager.provider.class" : "ai.startree.pinot.data.tier.table.StarTreeDefaultTableDataManagerProvider"
but will be made default in later releases.
A server restart is required for the cluster config to take effect.
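A sketch of the table-side setting; the task type name under taskTypeConfigsMap is illustrative and depends on which minion task you run:

```json
{
  "task": {
    "taskTypeConfigsMap": {
      "SegmentGenerationAndPushTask": {
        "directTierUpload": "true"
      }
    }
  }
}
```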
You can trigger the task right away with the `POST /tasks/schedule` API call.
Config | Description | Default value |
---|---|---|
controller.tieredStorage.segment.cleanup.frequencyPeriod | The frequency at which the controller should check for stale segment directories | -1 (disabled by default) |
controller.tieredStorage.segment.cleanup.tombstoneTtlMillis | The amount of time in milliseconds before an unused segment directory can be tombstoned | 2h |
controller.tieredStorage.segment.cleanup.tombstoneToDeletionTtlMillis | The amount of time in milliseconds before a tombstoned segment directory is deleted | 6h |
A higher `controller.tieredStorage.segment.cleanup.tombstoneTtlMillis` allows more time for reusing old segment directories if the table config reverts. However, it prolongs the persistence of stale directories in the object store.

The `controller.tieredStorage.segment.cleanup.tombstoneToDeletionTtlMillis` config specifies the wait time before deleting a tombstoned directory. This prevents the wrongful deletion of a segment directory if any server starts using it concurrently during the tombstoning process.
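As a sketch, enabling the cleanup with explicit TTLs; the values are illustrative (2h and 6h expressed in milliseconds), and the frequencyPeriod value format as a period string is an assumption:

```json
{
  "controller.tieredStorage.segment.cleanup.frequencyPeriod": "1h",
  "controller.tieredStorage.segment.cleanup.tombstoneTtlMillis": "7200000",
  "controller.tieredStorage.segment.cleanup.tombstoneToDeletionTtlMillis": "21600000"
}
```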
Metrics | Description |
---|---|
pinot_server_s3Segments_UploadedCount | How many segments were uploaded to the S3 tier backend when reloading segments. |
pinot_server_s3Segments_UploadedSizeBytes | How many bytes in total were uploaded to the S3 tier backend when reloading segments. |
pinot_server_s3SegmentUploadTimeMs | How long segment uploading takes when reloading segments. |
pinot_server_s3Segments_DownloadedCount | How many segments were downloaded to the server when reloading segments. |
pinot_server_s3Segments_DownloadedSizeBytes | How many bytes in total were downloaded to the server when reloading segments. |
pinot_server_s3SegmentDownloadTimeMs | How long segment downloading takes when reloading segments (this is not on the query path). |
pinot_server_totalPrefetch_99thPercentile/OneMinuteRate | Rate of, and time spent, kicking off data prefetching; this should be low, as the prefetching is done asynchronously. The metric is also broken down for cache layers: mmapCacheLayer and ondemandCacheLayer. |
pinot_server_totalAcquire_99thPercentile/OneMinuteRate | Rate of, and time spent, waiting for segment data to be available; this may vary based on the effectiveness of prefetching. The metric is also broken down for cache layers: mmapCacheLayer and ondemandCacheLayer. |
pinot_server_totalRelease_99thPercentile/OneMinuteRate | Rate of, and time spent, releasing the segment data; this should be low. The metric is also broken down for cache layers: mmapCacheLayer and ondemandCacheLayer. |
pinot_server_bufferPrefetchExceptions_OneMinuteRate | Error rate when prefetching segment data. This metric is also broken down for cache layers: mmapCacheLayer and ondemandCacheLayer . |
pinot_server_bufferAcquireExceptions_OneMinuteRate | Error rate when waiting to acquire segment data. This metric is also broken down for cache layers: mmapCacheLayer and ondemandCacheLayer . |
pinot_server_bufferReleaseExceptions_OneMinuteRate | Error rate when releasing the prefetched segment data. This metric is also broken down for cache layers: mmapCacheLayer and ondemandCacheLayer. |
pinot_server_ondemandCacheLayerReserve_99thPercentile | Time prefetching threads spend waiting for ondemand memory space to become available to fetch data. If this is high, try configuring more on-demand memory, or adjust the query to access fewer segments, e.g. by making query predicates more selective. |
pinot_server_ondemandCacheLayerBufferDownloadBytes_OneMinuteRate | Bytes rate of data prefetching. |
pinot_server_ondemandCacheLayerSingleBufferFetchersCount_OneMinuteRate | Number of column index pairs being prefetched. |