CREATED, PROCESSING, IN_TRANSIT, and DELIVERED. A change data capture (CDC) stream capturing an orders table may emit change events containing different values for the order status.
But, from an analytics perspective, you may only be interested in the most up-to-date version and state of each order. For example, consider writing a SQL query to retrieve orders that took more than two days to deliver. To enable that, we need to merge all change events belonging to a particular order into its latest value.
Apache Pinot supports that by enabling upserts on a real-time table.
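As a rough sketch of what enabling upserts involves, the snippet below builds the relevant configuration fragments as Python dictionaries and prints them as JSON. The table name `orders` and the primary key `order_id` are assumptions taken from the e-commerce example, and the fragments are deliberately incomplete: a real schema and table config need further fields (column specs, stream ingestion settings, and so on), so the recipe's own config files remain the authoritative version.

```python
import json

# Schema fragment: upserts require a primary key declared on the schema.
# "orders" and "order_id" are assumed names for illustration only.
schema_fragment = {
    "schemaName": "orders",
    "primaryKeyColumns": ["order_id"],
}

# Table config fragment for a real-time table with full upserts.
# Only the upsert-related keys are shown; this is not a complete config.
table_config_fragment = {
    "tableName": "orders",
    "tableType": "REALTIME",
    # Upsert tables are expected to use strict replica-group routing.
    "routing": {"instanceSelectorType": "strictReplicaGroup"},
    "upsertConfig": {"mode": "FULL"},
}

print(json.dumps(schema_fragment, indent=2))
print(json.dumps(table_config_fragment, indent=2))
```

The input Kafka topic is also expected to be partitioned by the primary key, so that all change events for one order land in the same partition.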
Understanding upserts in Pinot
Pinot, by default, allows querying all events ingested from a Kafka topic by a particular primary key (a dimension). Revisiting the e-commerce example above, that kind of query returns every state change for every order. In some cases, we need only the most up-to-date version and state of each order. Pinot is an immutable datastore, which means that there is no genuine concept of upsert as you stream data into it from Kafka. To understand the upsert implementation, it’s essential to know that an individual record is not updated via a write; instead, updates are appended to a log, and a pointer tracks the most recent version of each record. Pinot upserts work in two modes:
- Full upserts
- Partial upserts
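The append-and-pointer idea can be illustrated with a small toy model. This is not Pinot's implementation; it is a minimal sketch in which the class `UpsertLog`, the field names `order_id`, `status`, and `ts`, and the rule "the record with the larger `ts` wins" are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class UpsertLog:
    """Toy append-only store: records are never rewritten in place."""

    log: List[Dict[str, Any]] = field(default_factory=list)
    # Maps each primary key to the log position of its latest record.
    latest: Dict[Any, int] = field(default_factory=dict)

    def append(self, record: Dict[str, Any]) -> None:
        # Every change event is appended; nothing is overwritten.
        self.log.append(record)
        pos = len(self.log) - 1
        pk = record["order_id"]
        prev = self.latest.get(pk)
        # Move the pointer only if the new record is at least as recent.
        if prev is None or record["ts"] >= self.log[prev]["ts"]:
            self.latest[pk] = pos

    def all_versions(self, pk: Any) -> List[Dict[str, Any]]:
        # Behaviour without upserts: every state change is visible.
        return [r for r in self.log if r["order_id"] == pk]

    def current(self) -> List[Dict[str, Any]]:
        # Behaviour with full upserts: one latest record per key.
        return [self.log[i] for i in self.latest.values()]


store = UpsertLog()
store.append({"order_id": 1, "status": "CREATED", "ts": 1})
store.append({"order_id": 2, "status": "CREATED", "ts": 2})
store.append({"order_id": 1, "status": "DELIVERED", "ts": 3})

print(store.all_versions(1))  # both versions of order 1
print(store.current())        # only the latest record for each order
```

All three events stay in the log, but a query over `current()` sees one row per order, which mirrors how a full upsert replaces the visible record entirely with the newest version.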
Pinot Version | 1.0.0 |
---|---|
Code | startreedata/pinot-recipes/full-upserts |
Prerequisites
To follow the code examples in this guide, you must install Docker locally and download recipes.
Navigate to recipe
- If you haven’t already, download recipes.
- In a terminal, go to the recipe by running the following command:
Start Pinot and learn
Spin up a Pinot cluster using Docker Compose:
The order_status for order_id 5 is now set to CANCELLED.