Segment Purge Task
Use the SegmentPurgeTask to purge records from a Pinot table, for example, to ensure GDPR (General Data Protection Regulation) compliance. The SegmentPurgeTask reads input files from a remote location, such as AWS S3, and builds hash keys that uniquely identify the records to purge from a Pinot table.
Pinot Version: 0.13.0-ST.58+
The SegmentBackfillTask offers extended functionality and better performance than the SegmentPurgeTask and will eventually replace it as the default.
Configure SegmentPurgeTask
Configure SegmentPurgeTask under taskConfig in the table configuration.
Property Name | Required | Description |
---|---|---|
input.fs.className | Yes | The class name used to read the files from source location. |
inputFormat | Yes | The input file format. |
inputDirURI | Yes | The input dir containing the purge input files. |
input.fs.prop.accessKey | No | Required if reading from AWS S3. |
input.fs.prop.secretKey | No | Required if reading from AWS S3. |
input.fs.prop.region | No | Required if reading from AWS S3. |
recordReader.prop.delimiter | No | Defaults to ",". Supported values are "," and ";". |
max.num.purge.input.files | No | Maximum number of input files processed at a time. Default value is 10. |
max.total.purge.input.file.size | No | Maximum total size of all input files combined. Default value is 100_000_000 bytes. A single file cannot exceed this value. |
table.max.num.tasks | No | Maximum number of minion subtasks generated per task invocation. Default value is 10. |
Example Table Configuration
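As a minimal sketch, the properties from the table above might be assembled as shown below. Several details are assumptions rather than confirmed values: the nesting under task.taskTypeConfigsMap follows the usual Pinot minion task layout, org.apache.pinot.plugin.filesystem.S3PinotFS is assumed as the file system class, "csv" is assumed as the format value, and the bucket path, region, and credential placeholders are hypothetical.

```python
import json

# Properties marked "Yes" in the Required column of the table above.
REQUIRED_KEYS = ("input.fs.className", "inputFormat", "inputDirURI")


def build_purge_task_config(input_dir_uri, access_key, secret_key, region):
    """Return a SegmentPurgeTask property map (all values are strings)."""
    return {
        # Assumption: Pinot's S3 file system plugin class.
        "input.fs.className": "org.apache.pinot.plugin.filesystem.S3PinotFS",
        "inputFormat": "csv",  # assumption: lowercase format name
        "inputDirURI": input_dir_uri,
        "input.fs.prop.accessKey": access_key,
        "input.fs.prop.secretKey": secret_key,
        "input.fs.prop.region": region,
        # Optional properties, set here to their documented defaults.
        "recordReader.prop.delimiter": ",",
        "max.num.purge.input.files": "10",
        "max.total.purge.input.file.size": "100000000",
        "table.max.num.tasks": "10",
    }


def missing_required(config):
    """List required properties that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not config.get(key)]


purge_config = build_purge_task_config(
    "s3://example-bucket/purge-input/",  # hypothetical location
    "<access-key>",
    "<secret-key>",
    "us-west-2",  # hypothetical region
)

# Assumption: task configs nest under task.taskTypeConfigsMap.
table_config_fragment = {
    "task": {"taskTypeConfigsMap": {"SegmentPurgeTask": purge_config}}
}

print(json.dumps(table_config_fragment, indent=2))
```

Running the script prints a JSON fragment that could be compared against an existing table configuration; the missing_required helper is only a convenience for catching an omitted required property before submitting the configuration.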
Example Input Files
File#1
File Name: purgeRecords1.csv
If the above input file were processed against a table named users, all records in the Pinot table that match the userId field would be deleted.
File#2
File Name: purgeRecords2.csv
If the above input file were processed against a table named users, all records in the Pinot table that match both the firstName and lastName fields would be deleted.
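The matching behavior described for the two files can be modeled with the sketch below. It is illustrative only and not the task's implementation: the CSV contents, the sample rows, and the helper names are hypothetical, plain tuples stand in for the hash keys the task builds, and the first line of each file is assumed to name the fields to match.

```python
import csv
import io


def load_purge_keys(csv_text, delimiter=","):
    """Parse a purge input file into (field names, set of value tuples)."""
    reader = csv.reader(io.StringIO(csv_text), delimiter=delimiter)
    fields = next(reader)  # assumption: header row names the match fields
    keys = {tuple(row) for row in reader if row}
    return fields, keys


def purge(records, fields, keys):
    """Keep only records whose values for the listed fields match no input row."""
    return [
        rec for rec in records
        if tuple(str(rec[f]) for f in fields) not in keys
    ]


# Hypothetical contents of the users table.
users = [
    {"userId": 1, "firstName": "Ada", "lastName": "Lovelace"},
    {"userId": 2, "firstName": "Alan", "lastName": "Turing"},
    {"userId": 3, "firstName": "Ada", "lastName": "Turing"},
]

# Hypothetical purgeRecords1.csv: match on userId only.
fields1, keys1 = load_purge_keys("userId\n2\n")
after_file1 = purge(users, fields1, keys1)

# Hypothetical purgeRecords2.csv: firstName and lastName must both match.
fields2, keys2 = load_purge_keys("firstName,lastName\nAda,Lovelace\n")
after_file2 = purge(users, fields2, keys2)

print([u["userId"] for u in after_file1])
print([u["userId"] for u in after_file2])
```

In the second case the record with firstName Ada and lastName Turing is kept, because a record is purged only when every field in the input row matches.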
Guidance on input files
- An input file must not be appended to or overwritten.
- New data should not be generated for records that are to be purged. If this does happen, new input files must be dropped in the input directory so that the new data is purged as well.
- We recommend periodically cleaning up the input directory to remove files that have been processed.
Limits
Note the following limits.
Item | Limit | Description |
---|---|---|
Input file format | CSV | Contents within the input file should conform to CSV format. |
Input files source | AWS S3 | Only S3 is supported as of now. |
Data Types | int, long, boolean, string | The input files must contain fields that conform to one of these types. |
Field Values | Single Values | All fields in the input file must be of single value dimension. Multi values are not supported. |
Comparison Type | == | Fields from the input record are matched against the Pinot record using the equals operator. |
Null Value | Not Supported | Matching against null values is not supported. |
Empty Value | Not Supported | Matching against empty values is not supported. |
Input file Name Extension | Not Required | File extension is not required. |
Input file field delimiter | "," and ";" | Other delimiters, such as space or tab, are not supported. Default value is ",". |
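Some of these limits can be checked before a file is uploaded. The following is a minimal pre-flight sketch, not part of the task itself: the function name and the sample contents are hypothetical, and it assumes the first row of the file holds the field names.

```python
import csv
import io

# Delimiters listed as supported in the limits table.
SUPPORTED_DELIMITERS = {",", ";"}


def check_purge_file(csv_text, delimiter=","):
    """Return a list of problems found; an empty list means none were found."""
    if delimiter not in SUPPORTED_DELIMITERS:
        return [f"unsupported delimiter {delimiter!r}"]
    rows = list(csv.reader(io.StringIO(csv_text), delimiter=delimiter))
    if not rows:
        return ["file is empty"]
    problems = []
    header = rows[0]  # assumption: first row holds the field names
    for line_no, row in enumerate(rows[1:], start=2):
        if len(row) != len(header):
            problems.append(f"line {line_no}: expected {len(header)} fields")
        elif any(value.strip() == "" for value in row):
            # Empty values cannot be matched, so flag them early.
            problems.append(f"line {line_no}: empty value")
    return problems


good = "firstName;lastName\nAda;Lovelace\n"
bad = "firstName,lastName\nAda,\nAlan\n"

print(check_purge_file(good, delimiter=";"))
print(check_purge_file(bad))
print(check_purge_file(good, delimiter="\t"))
```

The check covers only the delimiter, field count, and empty-value limits; data types and single-value constraints depend on the table schema and are not modeled here.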
Limitations
- SegmentPurgeTask works with both OFFLINE and REALTIME tables. The task has not been tested against upsert-enabled REALTIME tables.
- Running other tasks, such as SegmentRefresh or MergeRollup, along with the SegmentPurgeTask can lead to issues. To avoid race conditions, disable the SegmentRefresh and MergeRollup tasks while the SegmentPurgeTask is running. You can re-enable them after the SegmentPurgeTask completes.
Field Values
The string literal null is not supported as a field value. Records with this value are not purged.
As an example, consider the input file purgeRecords.csv.
This input record would be skipped, and no record in the corresponding Pinot table would be deleted. You can use alternate fields in the table to delete these records.
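The skip behavior can be pictured with the short sketch below. The rows and field names are hypothetical, and the sketch assumes an exact, case-sensitive comparison with the string "null", which this page does not confirm.

```python
def split_null_rows(rows):
    """Separate usable rows from rows containing the literal string 'null'."""
    usable, skipped = [], []
    for row in rows:
        # Assumption: exact, case-sensitive match on "null".
        if "null" in row.values():
            skipped.append(row)
        else:
            usable.append(row)
    return usable, skipped


# Hypothetical input rows keyed on userId, with email as an alternate field.
rows = [
    {"userId": "42", "email": "a@example.com"},
    {"userId": "null", "email": "b@example.com"},
]

usable, skipped = split_null_rows(rows)

# A skipped row can still be targeted through another field, such as email.
alternate_keys = [{"email": row["email"]} for row in skipped]

print(len(usable), len(skipped))
print(alternate_keys)
```

Here the second row is skipped because its userId is the literal null, and a follow-up input file keyed on email would be one way to reach the same record.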