Learn about off-heap dedup and how to use it in StarTree.
Off-heap dedup is controlled by the enablePreload and metadataManagerClass parameters in the dedupConfig. These parameters are enabled by default starting from January 2025.
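As a rough sketch of where these two parameters sit in a table config, a dedupConfig might look like the fragment below. The class name is deliberately left as a placeholder because it is not given here, and the dedupEnabled field is assumed from the open source Pinot dedupConfig layout.

```json
{
  "dedupConfig": {
    "dedupEnabled": true,
    "enablePreload": true,
    "metadataManagerClass": "<placeholder: off-heap dedup metadata manager class>"
  }
}
```

Since both parameters are enabled by default from January 2025, setting them explicitly should only be needed when overriding the defaults.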
With metadataTTL in open source Pinot, metadata is removed when the next consuming segment starts, which can take minutes or even hours depending on the segment commit frequency. At StarTree, the dedup metadata is cleaned up by the async removal process as mentioned above. This is particularly useful for scenarios where duplicates are unlikely after a certain time.
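For illustration, a dedupConfig with a TTL could be sketched as follows. The column name eventTimeMs and the 24-hour value are hypothetical, and the sketch assumes the TTL is expressed in the same unit as the values of the time column; check this against the Pinot version in use.

```json
{
  "dedupConfig": {
    "dedupEnabled": true,
    "metadataTTL": 86400000,
    "dedupTimeColumn": "eventTimeMs"
  }
}
```

Under these assumptions, a primary key stops being tracked once its time value falls more than 24 hours behind the newest ingested time, so a duplicate arriving later than that would not be caught.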
Each table has its own ColumnFamily in the shared RocksDB store. To customize the table's ColumnFamily, add the following RocksDB configs in the metadataManagerConfigs section. The config names are kept consistent with those available for RocksDB.
Snapshots are generated by the DedupSnapshotCreationTask, persisted in deep store, and later imported by servers during restart. This mechanism minimizes restart delays by eliminating the need to rebuild metadata with a read and a write for each record.
DedupSnapshotCreationTask requires partition information to schedule tasks effectively. If the table has segmentPartitionConfig with a single partition column, the task uses the numPartitions field:
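A minimal sketch of such a segmentPartitionConfig, following the open source Pinot table config layout, is shown here. The column name memberId, the Murmur function, and the partition count of 8 are illustrative assumptions, not values from this page.

```json
{
  "tableIndexConfig": {
    "segmentPartitionConfig": {
      "columnPartitionMap": {
        "memberId": {
          "functionName": "Murmur",
          "numPartitions": 8
        }
      }
    }
  }
}
```

With one partition column defined this way, the numPartitions value (8 in this sketch) is what the task would read.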
If the table does not have such a segmentPartitionConfig, explicitly specify the partitions:
With PREBUILT_SNAPSHOT_CHECK, we can identify the dedup tables in a cluster whose snapshot generation task is disabled.
Can we enable deduplication on existing real-time tables?
Enabling dedupConfig on existing real-time tables is not advised. It will not eliminate duplicate records from the existing segments, but newly ingested data will be deduplicated once dedup is enabled. If you still wish to enable it on an existing table, add the dedupConfig and restart the server. Be aware that duplicates may remain in existing segments until those segments are deleted from the table, for example due to data retention.
Is it possible to convert an upsert table to a dedup table?
What if duplicates are seen in a Dedup table?
Ensure that metadataTTL in dedupConfig is set to an appropriate value and that duplicate primary keys are not ingested beyond this TTL window.