Replicating DynamoDB to Apache Pinot
In this developer guide, you’ll learn how to ingest change data capture (CDC) records from DynamoDB into Apache Pinot.
Introduction
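Imagine having the robust storage capabilities of DynamoDB combined with the lightning-fast analytics of Apache Pinot. This guide builds exactly that: DynamoDB publishes its changes to a Kinesis data stream, and a Pinot upsert table consumes them in real time.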
What You’ll Need
- AWS account (with DynamoDB and Kinesis access)
- Apache Pinot cluster
- Your favorite code editor
Setting Up the Replication Pipeline
Step 1: Create a DynamoDB Table
Let’s start by creating our source of truth - a DynamoDB table.
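Here is a minimal sketch of this step using boto3. The table name orders and the key order_id are hypothetical placeholders, not names from this guide; the request builder is kept separate from the network call so you can inspect the parameters before sending anything to AWS.

```python
def create_table_params(table_name="orders", key_name="order_id"):
    """Build keyword arguments for boto3's DynamoDB create_table call.

    The table and key names are illustrative assumptions.
    """
    return {
        "TableName": table_name,
        # Only key attributes are declared up front; DynamoDB is schemaless otherwise.
        "AttributeDefinitions": [{"AttributeName": key_name, "AttributeType": "S"}],
        "KeySchema": [{"AttributeName": key_name, "KeyType": "HASH"}],
        "BillingMode": "PAY_PER_REQUEST",
    }


def create_table(params):
    """Send the request. Needs boto3 installed and AWS credentials configured."""
    import boto3  # imported lazily so the builder works without boto3

    return boto3.client("dynamodb").create_table(**params)
```

Calling create_table(create_table_params()) would create the table in the default region of your AWS profile.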
Step 2: Create a Kinesis Data Stream
Time to create a highway for our data - a Kinesis data stream where DynamoDB will push its CDC records.
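One way to create the stream from Python is sketched below. The stream name orders-cdc and the single shard are assumptions chosen for illustration; size the stream for your own write volume.

```python
def create_stream_params(stream_name="orders-cdc", shard_count=1):
    """Build keyword arguments for boto3's Kinesis create_stream call."""
    return {"StreamName": stream_name, "ShardCount": shard_count}


def create_stream(params):
    """Create the stream, wait until it is ACTIVE, and return its ARN.

    Needs boto3 installed and AWS credentials configured.
    """
    import boto3  # imported lazily so the builder works without boto3

    kinesis = boto3.client("kinesis")
    kinesis.create_stream(**params)
    # The stream is created asynchronously, so wait before using it.
    kinesis.get_waiter("stream_exists").wait(StreamName=params["StreamName"])
    summary = kinesis.describe_stream_summary(StreamName=params["StreamName"])
    return summary["StreamDescriptionSummary"]["StreamARN"]
```

The returned ARN is what DynamoDB needs in order to find the stream.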
Step 3: Enable DynamoDB-Kinesis stream
Now, let’s turn on the data faucet by connecting DynamoDB to Kinesis.
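The connection can be made with a single DynamoDB API call. The sketch below assumes a table called orders and a made-up stream ARN; substitute the ARN of your own Kinesis stream.

```python
def streaming_destination_params(table_name, stream_arn):
    """Build keyword arguments for enable_kinesis_streaming_destination."""
    return {"TableName": table_name, "StreamArn": stream_arn}


def enable_streaming(params):
    """Ask DynamoDB to start publishing item-level changes to the stream.

    Needs boto3 installed and AWS credentials configured.
    """
    import boto3  # imported lazily so the builder works without boto3

    return boto3.client("dynamodb").enable_kinesis_streaming_destination(**params)


# Hypothetical ARN, shown only to illustrate the expected shape.
EXAMPLE_ARN = "arn:aws:kinesis:us-east-1:123456789012:stream/orders-cdc"
example_params = streaming_destination_params("orders", EXAMPLE_ARN)
```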
Step 4: Create Pinot Schema
Let’s tell Pinot what our data looks like:
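As an illustration, a schema for a hypothetical orders table might be assembled like this. The column names order_id, status and amount are assumptions; is_delete and ApproximateCreationDateTime are the two helper columns this guide relies on for deletes and ordering.

```python
import json


def build_schema(name="orders"):
    """Return a Pinot schema as a dict (column names are illustrative)."""
    return {
        "schemaName": name,
        # Upsert tables need a primary key to match incoming changes to rows.
        "primaryKeyColumns": ["order_id"],
        "dimensionFieldSpecs": [
            {"name": "order_id", "dataType": "STRING"},
            {"name": "status", "dataType": "STRING"},
            {"name": "is_delete", "dataType": "BOOLEAN"},
        ],
        "metricFieldSpecs": [{"name": "amount", "dataType": "DOUBLE"}],
        "dateTimeFieldSpecs": [
            {
                "name": "ApproximateCreationDateTime",
                "dataType": "LONG",
                "format": "1:MILLISECONDS:EPOCH",
                "granularity": "1:MILLISECONDS",
            }
        ],
    }


schema_json = json.dumps(build_schema(), indent=2)
```

Printing schema_json gives a document you can save to a file and upload to the Pinot controller.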
Step 5: Create Pinot Table Configuration
Now, let’s set the table for our data feast!
Why do we have so many configurations?
Let’s look at which of these configs are necessary, and why. When you enable CDC on a DynamoDB table, it starts sending records in the following format:
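To make the format concrete, here is a sketch of one record as it typically arrives on Kinesis, parsed with Python. All values (IDs, region, table and attribute names) are invented for illustration; the field layout reflects the JSON envelope DynamoDB uses for Kinesis streaming as far as we know it, so compare it with a record from your own stream.

```python
import json

# An illustrative INSERT record; every value here is made up.
SAMPLE_RECORD = """
{
  "awsRegion": "us-east-1",
  "eventID": "00000000-0000-0000-0000-000000000000",
  "eventName": "INSERT",
  "userIdentity": null,
  "recordFormat": "application/json",
  "tableName": "orders",
  "dynamodb": {
    "ApproximateCreationDateTime": 1700000000000,
    "Keys": {"order_id": {"S": "o-1"}},
    "NewImage": {
      "order_id": {"S": "o-1"},
      "status": {"S": "NEW"},
      "amount": {"N": "25.5"}
    },
    "SizeBytes": 48
  },
  "eventSource": "aws:dynamodb"
}
"""

record = json.loads(SAMPLE_RECORD)
event_name = record["eventName"]  # INSERT, MODIFY or REMOVE
created_ms = record["dynamodb"]["ApproximateCreationDateTime"]
new_image = record["dynamodb"]["NewImage"]  # typed attributes, e.g. {"S": "NEW"}
```

Note that the actual column values are nested two levels deep and wrapped in type tags, which is why a plain JSON decoder is not enough on its own.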
Decoder Configuration
To help Pinot understand the DynamoDB data format, we need to add decoder configs to our table:
- decoder.class.name specifies our primary decoder.
- timeColumnName specifies the column that should be filled with the ApproximateCreationDateTime from the DynamoDB JSON record.
- deleteColumnName specifies the column that should be set to true when we receive a REMOVE record from DynamoDB.
- envelope.decoder.class.name specifies the plain decoder used to parse the raw message. Since the DynamoDB messages arrive in JSON format, we specify the JSONMessageDecoder here.
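The following Python sketch mimics what these decoder settings describe. It is not the Pinot decoder itself, just a toy model under stated assumptions: the envelope step is plain JSON parsing, the row comes from NewImage (or Keys for a REMOVE), and only a few attribute types are handled.

```python
import json


def unwrap(attr):
    """Convert one typed attribute such as {"N": "25.5"} to a Python value."""
    ((type_tag, value),) = attr.items()
    if type_tag == "N":
        return float(value)
    if type_tag in ("S", "BOOL"):
        return value
    if type_tag == "NULL":
        return None
    raise ValueError(f"unsupported attribute type: {type_tag}")


def decode(raw_message,
           time_column="ApproximateCreationDateTime",
           delete_column="is_delete"):
    """Turn a raw DynamoDB CDC message into a flat row."""
    record = json.loads(raw_message)  # the "envelope" step: parse plain JSON
    body = record["dynamodb"]
    # REMOVE events carry no NewImage, so fall back to the key attributes.
    image = body.get("NewImage") or body["Keys"]
    row = {name: unwrap(attr) for name, attr in image.items()}
    row[time_column] = body["ApproximateCreationDateTime"]
    row[delete_column] = record["eventName"] == "REMOVE"
    return row
```

For a REMOVE message the resulting row holds only the key, the timestamp and is_delete set to True, which is all the upsert logic needs to hide the row.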
Upserts Configuration
To handle updates properly, you need to enable upserts in Pinot. This is done in the upsertConfig section of the table configuration:
Key points:
- mode: Set to “PARTIAL” for partial updates.
- deleteRecordColumn: Specifies the column that indicates if a record should be deleted.
- comparisonColumns: Uses ApproximateCreationDateTime to determine the order of changes.
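Putting the three key points together, the upsertConfig section could look roughly like the dict below. The defaultPartialUpsertStrategy entry is an assumption added for completeness (Pinot lets you choose how columns are merged in PARTIAL mode); the other three fields are the ones listed above.

```python
import json

upsert_config = {
    "mode": "PARTIAL",
    # Rows whose is_delete value is true are hidden from query results.
    "deleteRecordColumn": "is_delete",
    # The row with the larger comparison value wins for a given primary key.
    "comparisonColumns": ["ApproximateCreationDateTime"],
    # Assumption: overwrite each column with the newest value.
    "defaultPartialUpsertStrategy": "OVERWRITE",
}

# This fragment would sit at the top level of the table config.
table_config_fragment = json.dumps({"upsertConfig": upsert_config}, indent=2)
```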
Derived Column for Deletions
A new derived column, is_delete, is created in the schema to signify whether a key needs to be removed from the upsert metadata. This column is set to true when the eventName in the DynamoDB stream event is “REMOVE”.
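Because the schema and the upsert configuration both have to name this column, a small consistency check can catch typos before you upload anything. The sketch below assumes the column is declared as a BOOLEAN dimension; the helper name is hypothetical.

```python
# Schema entry for the delete marker (assumed to live in dimensionFieldSpecs).
is_delete_field = {"name": "is_delete", "dataType": "BOOLEAN"}


def delete_column_is_valid(schema, upsert_config):
    """Return True if deleteRecordColumn names a BOOLEAN dimension in the schema."""
    target = upsert_config.get("deleteRecordColumn")
    return any(
        field["name"] == target and field["dataType"] == "BOOLEAN"
        for field in schema.get("dimensionFieldSpecs", [])
    )


example_schema = {"dimensionFieldSpecs": [is_delete_field]}
example_upsert = {"deleteRecordColumn": "is_delete"}
```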
Handling Different Event Types
The configuration handles different event types as follows:
- INSERT: New records are added to Pinot.
- MODIFY: Existing records are updated using the upsert configuration.
- REMOVE: Records are marked for deletion using the is_delete column.
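The three cases above can be modelled in a few lines of Python. This is a toy simulation of upsert behaviour, not Pinot’s implementation: it assumes rows are already decoded, keyed by a hypothetical order_id, ordered by ApproximateCreationDateTime, and merged column by column as in PARTIAL mode.

```python
def apply_events(rows):
    """Replay decoded rows and return the visible state keyed by order_id."""
    latest = {}  # primary key -> (comparison value, merged row)
    for row in rows:
        key, ts = row["order_id"], row["ApproximateCreationDateTime"]
        seen = latest.get(key)
        if seen is not None and ts < seen[0]:
            continue  # out-of-order change: a newer one has already won
        # Start from scratch after a delete; otherwise merge onto the old row.
        base = {} if seen is None or seen[1]["is_delete"] else seen[1]
        latest[key] = (ts, {**base, **row})
    # Rows flagged as deleted are hidden from the result.
    return {k: r for k, (_, r) in latest.items() if not r["is_delete"]}


events = [
    {"order_id": "o-1", "status": "NEW", "amount": 25.5,
     "ApproximateCreationDateTime": 1000, "is_delete": False},   # INSERT
    {"order_id": "o-1", "status": "SHIPPED",
     "ApproximateCreationDateTime": 2000, "is_delete": False},   # MODIFY
    {"order_id": "o-1", "status": "STALE",
     "ApproximateCreationDateTime": 1500, "is_delete": False},   # arrives late
    {"order_id": "o-2", "status": "NEW", "amount": 5.0,
     "ApproximateCreationDateTime": 3000, "is_delete": False},   # INSERT
    {"order_id": "o-2",
     "ApproximateCreationDateTime": 4000, "is_delete": True},    # REMOVE
]
state = apply_events(events)
```

After the replay, only o-1 is visible, with status SHIPPED and its original amount; the late event is ignored and o-2 is hidden by the REMOVE.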
ApproximateCreationDateTime Usage
The ApproximateCreationDateTime from the DynamoDB payload is used in the comparisonColumns of the upsert configuration. This ensures that changes are applied in the correct order, as it represents the sequence of events in DynamoDB.
A corresponding column is added to the schema:
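A plausible definition for that column is sketched below. It assumes the timestamp arrives as epoch milliseconds, which is the default precision for DynamoDB’s Kinesis streaming as far as we know; adjust the format if your stream is configured differently. The small helper only makes the raw values readable while debugging.

```python
from datetime import datetime, timezone

# Assumed entry for the schema's dateTimeFieldSpecs list.
time_field = {
    "name": "ApproximateCreationDateTime",
    "dataType": "LONG",
    "format": "1:MILLISECONDS:EPOCH",
    "granularity": "1:MILLISECONDS",
}


def to_utc(epoch_millis):
    """Convert an epoch-milliseconds value to a timezone-aware datetime."""
    return datetime.fromtimestamp(epoch_millis / 1000, tz=timezone.utc)
```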
Step 6: Create Pinot Table
Let’s bring our table to life!
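One way to do this is through the Pinot controller’s REST API, which accepts schemas at /schemas and table configs at /tables. The sketch below uses only the standard library; the controller URL is an assumption (9000 is the usual default port), and the payloads are whatever schema and table config you prepared in the previous steps.

```python
import json
import urllib.request


def build_request(controller_url, path, payload):
    """Prepare a JSON POST request without sending it."""
    return urllib.request.Request(
        url=controller_url.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def create_pinot_table(controller_url, schema, table_config):
    """Upload the schema first, then the table config that refers to it."""
    for path, payload in (("/schemas", schema), ("/tables", table_config)):
        request = build_request(controller_url, path, payload)
        with urllib.request.urlopen(request) as response:
            print(path, response.status)
```

For example, create_pinot_table("http://localhost:9000", schema, table_config) would register both documents on a local cluster.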
Insert, Update, Delete!
Now that we’ve set the stage, let’s watch our data perform!
Insert
Let’s add some data to our DynamoDB table:
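As a sketch, inserting a row with boto3 could look like this. The table name and attributes are hypothetical; the helper converts plain Python values into DynamoDB’s typed attribute format for a few common types only.

```python
def to_attr(value):
    """Wrap a Python value in a DynamoDB type tag (str, number, bool only)."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return {"BOOL": value}
    if isinstance(value, (int, float)):
        return {"N": str(value)}
    return {"S": str(value)}


def put_item_params(table_name, item):
    """Build keyword arguments for boto3's DynamoDB put_item call."""
    return {
        "TableName": table_name,
        "Item": {name: to_attr(value) for name, value in item.items()},
    }


def put_item(params):
    """Send the request. Needs boto3 installed and AWS credentials configured."""
    import boto3  # imported lazily so the builder works without boto3

    return boto3.client("dynamodb").put_item(**params)


example_put = put_item_params(
    "orders", {"order_id": "o-1", "status": "NEW", "amount": 25.5}
)
```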
Check your Pinot table, and you’ll see these rows magically appear!
Update
Let’s update a row:
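An update might be issued as sketched here, again with hypothetical names. Because status is a reserved word in DynamoDB expressions, the sketch refers to it through the placeholder #s.

```python
def update_status_params(table_name, order_id, new_status):
    """Build keyword arguments for boto3's DynamoDB update_item call."""
    return {
        "TableName": table_name,
        "Key": {"order_id": {"S": order_id}},
        "UpdateExpression": "SET #s = :s",
        "ExpressionAttributeNames": {"#s": "status"},
        "ExpressionAttributeValues": {":s": {"S": new_status}},
    }


def update_item(params):
    """Send the request. Needs boto3 installed and AWS credentials configured."""
    import boto3  # imported lazily so the builder works without boto3

    return boto3.client("dynamodb").update_item(**params)


example_update = update_status_params("orders", "o-1", "SHIPPED")
```

DynamoDB emits this change as a MODIFY event, which Pinot then applies as an upsert on the same primary key.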
Query Pinot, and witness the transformation!
Row before update
Row after update
Delete
To remove a row:
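A delete only needs the primary key. As before, the table and key names in this sketch are placeholders.

```python
def delete_item_params(table_name, order_id):
    """Build keyword arguments for boto3's DynamoDB delete_item call."""
    return {"TableName": table_name, "Key": {"order_id": {"S": order_id}}}


def delete_item(params):
    """Send the request. Needs boto3 installed and AWS credentials configured."""
    import boto3  # imported lazily so the builder works without boto3

    return boto3.client("dynamodb").delete_item(**params)


example_delete = delete_item_params("orders", "o-1")
```

DynamoDB publishes this as a REMOVE event, which is what sets is_delete to true on the Pinot side.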
Check Pinot:
Behind the Scenes: Viewing Operation Order
Use the following in your Pinot queries:
This will reveal the entire history of your data’s journey.
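The setting referred to above is most likely Pinot’s skipUpsert query option, which returns every ingested record instead of only the latest one per key; treat that as our assumption. A small helper (hypothetical name, hypothetical table) can add it to any query:

```python
def with_history(sql):
    """Prefix a query with skipUpsert so superseded rows are returned too."""
    return "SET skipUpsert=true; " + sql.strip()


history_query = with_history(
    """
    SELECT order_id, status, is_delete, ApproximateCreationDateTime
    FROM orders
    WHERE order_id = 'o-1'
    ORDER BY ApproximateCreationDateTime
    """
)
```

Ordering by ApproximateCreationDateTime lists the INSERT, MODIFY and REMOVE versions of a row in the order DynamoDB recorded them.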
Conclusion
You’ve successfully created a real-time replication pipeline from DynamoDB to Apache Pinot. Your data is now ready for lightning-fast analytics while maintaining the reliability of DynamoDB.