How to use Google Cloud Storage as a Deep Store
In this recipe we’ll learn how to use Google Cloud Storage as a Deep Store for Apache Pinot segments. The deep store (or deep storage) is the permanent store for segment files and is used for backup and restore operations.
| Pinot Version | 0.9.3 |
|---|---|
| Code | startreedata/pinot-recipes/google-cloud-storage |
Prerequisites
To follow the code examples in this guide, do the following:
- Install Docker and the Google Cloud CLI locally.
- Create a GCP project and a user or service account that has permission to list and create buckets, then navigate to https://console.cloud.google.com/storage/browser and create a bucket, for example `pinot-deepstore.yourdomain.com`.
- Download recipes
Navigate to recipe
- If you haven’t already, download recipes.
- In a terminal, go to the recipe directory by running the following command:
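A minimal sketch, assuming the recipes live in the public startreedata/pinot-recipes GitHub repository and that this recipe sits in a directory named `google-cloud-storage`:

```bash
# Clone the recipes repository and move into this recipe's directory
git clone https://github.com/startreedata/pinot-recipes.git
cd pinot-recipes/recipes/google-cloud-storage
```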
Launch Pinot Cluster
You can spin up a Pinot Cluster by running the following command:
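Assuming the recipe’s `docker-compose.yml` is in the current directory:

```bash
# Start all of the containers defined in docker-compose.yml
docker-compose up
```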
This command will run a single instance of the Pinot Controller, Pinot Server, Pinot Broker, Kafka, and Zookeeper. You can find the docker-compose.yml file on GitHub.
Controller configuration
We need to provide configuration parameters to the Pinot Controller to configure Google Cloud Storage as the Deep Store. This is done in the following section of the Docker Compose file:
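The full file is on GitHub; as a sketch, the relevant part of the controller service definition might look like this (the image tag, container name, and mount paths are assumptions for illustration):

```yaml
pinot-controller:
  image: apachepinot/pinot:0.9.3
  # Pass the deep store settings to the controller via -configFileName
  command: "StartController -zkAddress zookeeper:2181 -configFileName /config/controller-conf.conf"
  container_name: pinot-controller
  volumes:
    # Mount the recipe's config directory into the container
    - ./config:/config
  ports:
    - "9000:9000"
  depends_on:
    - zookeeper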
The configuration is specified in `/config/controller-conf.conf`, the contents of which are shown below:
/config/controller-conf.conf
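The exact file ships with the recipe; the sketch below uses the property names from Pinot’s GCS filesystem plugin, with `<bucket-name>` and `<project-id>` as placeholders:

```properties
# Segments are persisted to this GCS bucket (the deep store)
controller.data.dir=gs://<bucket-name>
controller.local.temp.dir=/tmp/pinot-tmp-data

# Register the GCS PinotFS implementation for the gs:// scheme
pinot.controller.storage.factory.class.gs=org.apache.pinot.plugin.filesystem.GcsPinotFS
pinot.controller.storage.factory.gs.projectId=<project-id>
pinot.controller.storage.factory.gs.gcpKey=/config/service-account.json

# Allow segments to be fetched over the gs:// scheme
pinot.controller.segment.fetcher.protocols=file,http,gs
pinot.controller.segment.fetcher.gs.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
```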
Let’s go through some of these properties:
- `controller.data.dir` contains the name of our bucket.
- `pinot.controller.storage.factory.gs.projectId` contains the name of our GCP project.
- `pinot.controller.storage.factory.gs.gcpKey` contains the path to our GCP JSON key file.
You’ll need to update the following lines:
- Replace `<bucket-name>` with the name of your bucket.
- Replace `<project-id>` with the name of your GCP project.
You should also paste the contents of your GCP JSON key file into `config/service-account.json`.
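If you created a service account with the gcloud CLI, one way to generate that key file is shown below (the service account email is only an example):

```bash
# Create a JSON key for the service account and write it to config/service-account.json
gcloud iam service-accounts keys create config/service-account.json \
  --iam-account=pinot-deepstore@<project-id>.iam.gserviceaccount.com
```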
Pinot Schema and Tables
Now let’s create a Pinot Schema and real-time table.
Schema
Our schema is going to capture some simple events, and looks like this:
config/schema.json
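The exact fields ship with the recipe; a representative sketch with a timestamp, an identifier, and a count (the field names are illustrative) might look like this:

```json
{
  "schemaName": "events",
  "dimensionFieldSpecs": [
    {"name": "uuid", "dataType": "STRING"}
  ],
  "metricFieldSpecs": [
    {"name": "count", "dataType": "INT"}
  ],
  "dateTimeFieldSpecs": [
    {
      "name": "ts",
      "dataType": "LONG",
      "format": "1:MILLISECONDS:EPOCH",
      "granularity": "1:MILLISECONDS"
    }
  ]
}
```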
You can create the schema by running the following command:
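A sketch using `pinot-admin` inside the controller container (the container name and `/config` mount path come from the Compose sketch above and may differ in the recipe):

```bash
docker exec -it pinot-controller bin/pinot-admin.sh AddSchema \
  -schemaFile /config/schema.json \
  -exec
```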
Real-Time Table
And the real-time table is defined below:
config/table-realtime.json
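A sketch of the table config, assuming Kafka is reachable from the Pinot containers at `kafka:9092` and the topic is called `events` (the stream settings may differ from the recipe’s actual file):

```json
{
  "tableName": "events",
  "tableType": "REALTIME",
  "segmentsConfig": {
    "schemaName": "events",
    "timeColumnName": "ts",
    "replication": "1",
    "replicasPerPartition": "1"
  },
  "tableIndexConfig": {
    "loadMode": "MMAP",
    "streamConfigs": {
      "streamType": "kafka",
      "stream.kafka.topic.name": "events",
      "stream.kafka.broker.list": "kafka:9092",
      "stream.kafka.consumer.type": "lowlevel",
      "stream.kafka.consumer.prop.auto.offset.reset": "smallest",
      "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
      "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
      "realtime.segment.flush.threshold.rows": "10000",
      "realtime.segment.flush.threshold.time": "24h"
    }
  },
  "tenants": {},
  "metadata": {}
}
```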
The `realtime.segment.flush.threshold.rows` config is intentionally set to an extremely small value so that a segment is committed after only 10,000 records have been ingested. In a production system this value should be set much higher, as described in the configuring segment threshold guide.

You can create the table by running the following command:
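Again assuming the container name and mount path from the Compose sketch above:

```bash
docker exec -it pinot-controller bin/pinot-admin.sh AddTable \
  -tableConfigFile /config/table-realtime.json \
  -schemaFile /config/schema.json \
  -exec
```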
Ingesting Data
Let’s ingest data into the `events` Kafka topic by running the following:
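The recipe ships its own data generator, so treat this as a sketch: it pipes randomly generated JSON messages matching the schema above into the Kafka container’s console producer (the container name, script path, and field names are assumptions):

```bash
# Generate JSON events and publish them to the events topic
while true; do
  echo "{\"ts\": $(date +%s000), \"uuid\": \"$(uuidgen)\", \"count\": $RANDOM}"
done | docker exec -i kafka kafka-console-producer.sh \
  --bootstrap-server localhost:9092 --topic events
```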
Data will make its way into the real-time table. We can see how many records have been ingested by running the following query:
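For example, from the Pinot query console (usually available at http://localhost:9000) you can run:

```sql
select count(*)
from events
```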
Exploring Deep Store
Now we’re going to check what segments we have and where they’re stored.
You can get a list of all segments by running the following:
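One way is via the controller’s REST API, assuming the controller is exposed on localhost:9000:

```bash
# List all segments for the events table
curl -X GET "http://localhost:9000/segments/events" \
  -H "accept: application/json"
```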
The output is shown below:
Output
Let’s pick one of these segments, `events__0__3__20220505T1343Z`, and get its metadata by running the following:
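Again via the controller REST API (same localhost:9000 assumption):

```bash
# Fetch the metadata for a single segment
curl -X GET "http://localhost:9000/segments/events/events__0__3__20220505T1343Z/metadata" \
  -H "accept: application/json"
```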
The output is shown below:
Output
We can see from the highlighted line that this segment is persisted at `gs://pinot-events/events/events__0__3__20220505T1343Z`.
Let’s go back to the terminal and list all the segments in the bucket:
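For example, with the gsutil tool that ships with the Google Cloud CLI (the bucket here matches the `gs://pinot-events` path above; substitute your own bucket name):

```bash
# List the segments that Pinot has written to the deep store
gsutil ls gs://pinot-events/events/
```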
The output is shown below:
Output