- pinot-batch-ingestion-spark
- pinot-s3
- pinot-parquet
| Pinot Version | 0.9.3 |
| --- | --- |
| Code | startreedata/pinot-recipes/ingest-parquet-files-from-s3-using-spark |
Prerequisites
This tutorial requires the following on your local workstation:
- JDK version 11 or higher
- Apache Spark 2.4.8 with Hadoop 2.7
- Apache Pinot 0.9.3 binary distribution
- An AWS account with S3 read and write permissions
- The AWS CLI installed with properly configured access key credentials
Overview
To understand how these plugins work together, let’s take a fictitious example. The company Foo recently launched a mobile application, and the engineering team is curious to learn the user engagement statistics of the application. They plan to calculate the daily active users (DAU) metric to quantify that information. The mobile app emits an event similar to the following whenever a user performs an activity in the application.
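A sketch of what such an event might look like; the field names below (uuid, country, ts) are assumptions for illustration and may differ from the sample data set used later in this recipe.

```json
{
  "uuid": "0a51f5e2-7a2c-4a3b-9d1e-6c3f2b8a9e10",
  "country": "US",
  "ts": 1639137263000
}
```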
Launch Pinot Cluster
Once you have everything ready, follow this guide to spin up a single-node Pinot cluster. You can start Pinot using the quick-start-batch launcher script.
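For example, from the root of the Apache Pinot binary distribution (the script path below assumes the layout of the 0.9.x distribution):

```bash
# Start a single-node Pinot cluster (controller, broker, server) with the batch quick start
./bin/quick-start-batch.sh
```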
Prepare the input data set
To simulate the mobile app events, we will use a sample CSV file that you can download from here. Having this file in CSV format helps us to spot errors and correct them quickly.
Convert the CSV file into Parquet
Once you have downloaded the file, we will convert it into Parquet format using Apache Spark. To do that, open up a terminal and navigate to the location where you’ve extracted the Spark distribution. Then execute the following command to launch the Spark Scala shell. We are using the Spark shell here as we don’t want to write a fully-fledged Spark script to get this done.
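For example, launch the shell with ./bin/spark-shell and paste a snippet along the following lines. The local file paths and CSV options here are assumptions for illustration; adjust them to wherever you saved the sample file.

```scala
// Read the sample CSV file (assumed to have a header row) and infer column types
val events = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/tmp/events.csv")

// Write the data back out as a single Parquet file into a local staging directory
events.coalesce(1).write.parquet("/tmp/events-parquet")
```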
Upload the Parquet file to S3
Now we have our Parquet file in place. Let’s go ahead and upload it to an S3 bucket. You can use the AWS CLI or the AWS console for that, based on your preference. We will use the AWS CLI to upload the Parquet files into an S3 bucket called pinot-spark-demo:
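For example (the bucket name comes from the recipe; the region and local path are assumptions):

```bash
# Create the bucket if it doesn't exist yet; pick the region that suits you
aws s3 mb s3://pinot-spark-demo --region us-east-1

# Upload the generated Parquet file(s) into a rawdata/ prefix
aws s3 cp /tmp/events-parquet/ s3://pinot-spark-demo/rawdata/ --recursive
```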
Create a table and schema
We need to create a table to query the data that will be ingested. All tables in Pinot are associated with a schema. You can check out the Table configuration and Schema configuration docs for more details. Create a file called events_schema.json with a text editor and add the following content.
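A sketch of what the schema might contain, assuming the illustrative event fields used earlier (uuid, country, and an epoch-millisecond timestamp ts); the actual recipe’s schema may define different columns.

```json
{
  "schemaName": "events",
  "dimensionFieldSpecs": [
    { "name": "uuid", "dataType": "STRING" },
    { "name": "country", "dataType": "STRING" }
  ],
  "dateTimeFieldSpecs": [
    {
      "name": "ts",
      "dataType": "LONG",
      "format": "1:MILLISECONDS:EPOCH",
      "granularity": "1:MILLISECONDS"
    }
  ]
}
```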
That represents the schema definition to capture events.
Similarly, create another file called events_table.json with the following content. That represents the events table.
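A sketch of a minimal offline table configuration that points at the schema above; the replication and index settings here are assumptions suited to a local single-node setup.

```json
{
  "tableName": "events",
  "tableType": "OFFLINE",
  "segmentsConfig": {
    "schemaName": "events",
    "timeColumnName": "ts",
    "replication": "1"
  },
  "tenants": {},
  "tableIndexConfig": {
    "loadMode": "MMAP"
  },
  "metadata": {}
}
```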
Run the following command from the <PINOT_HOME>/bin directory to define the events schema and the table.
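A sketch of the command, assuming a local controller on the default port; adjust the file paths to wherever you saved the two JSON files.

```bash
# Register the schema and the offline table with the Pinot controller
./pinot-admin.sh AddTable \
  -schemaFile /path/to/events_schema.json \
  -tableConfigFile /path/to/events_table.json \
  -controllerHost localhost \
  -controllerPort 9000 \
  -exec
```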

Ingest data from S3
Now that we have an empty table inside Pinot, let’s go ahead and ingest the Parquet files from the S3 bucket into the running Pinot instance. Batch data ingestion in Pinot involves the following steps:
- Read data and generate compressed segment files from the input
- Upload the compressed segment files to the output location
- Push the location of the segment files to the controller

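To run these steps on Spark, we describe the ingestion job in a spec file. A minimal sketch of such a spec is shown below; the bucket paths, region, controller URI, and file name pattern are assumptions, and the runner class names follow the pinot-batch-ingestion-spark plugin, so verify them against your Pinot distribution.

```yaml
executionFrameworkSpec:
  name: 'spark'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentGenerationJobRunner'
  segmentMetadataPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentMetadataPushJobRunner'
jobType: SegmentCreationAndMetadataPush
inputDirURI: 's3://pinot-spark-demo/rawdata/'
includeFileNamePattern: 'glob:**/*.parquet'
outputDirURI: 's3://pinot-spark-demo/segments/'
overwriteOutput: true
pinotFSSpecs:
  - scheme: s3
    className: org.apache.pinot.plugin.filesystem.S3PinotFS
    configs:
      region: 'us-east-1'
recordReaderSpec:
  dataFormat: 'parquet'
  className: 'org.apache.pinot.plugin.inputformat.parquet.ParquetRecordReader'
tableSpec:
  tableName: 'events'
  schemaURI: 'http://localhost:9000/tables/events/schema'
  tableConfigURI: 'http://localhost:9000/tables/events'
pinotClusterSpecs:
  - controllerURI: 'http://localhost:9000'
```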
Note that the spec sets the jobType as SegmentCreationAndMetadataPush, which is more performant and lightweight for the controller. When using this jobType, make sure that controllers and servers have access to the outputDirURI so that they can download segments from it directly.
Other supported jobTypes are:
- SegmentCreation
- SegmentTarPush
- SegmentUriPush
- SegmentCreationAndTarPush
- SegmentCreationAndUriPush
- SegmentCreationAndMetadataPush
Set PINOT_DISTRIBUTION_DIR and SPARK_HOME to match your local environment, then launch the ingestion job with spark-submit.
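A sketch of the spark-submit invocation, following the pattern used in the Pinot documentation for Spark batch ingestion; the plugin JAR paths assume the layout of the Apache Pinot 0.9.3 binary distribution, and the job spec path is an assumption.

```bash
export PINOT_VERSION=0.9.3
export PINOT_DISTRIBUTION_DIR=/path/to/apache-pinot-${PINOT_VERSION}-bin
export SPARK_HOME=/path/to/spark-2.4.8-bin-hadoop2.7

${SPARK_HOME}/bin/spark-submit \
  --class org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand \
  --master "local[2]" \
  --deploy-mode client \
  --conf "spark.driver.extraJavaOptions=-Dplugins.dir=${PINOT_DISTRIBUTION_DIR}/plugins" \
  --conf "spark.driver.extraClassPath=${PINOT_DISTRIBUTION_DIR}/plugins-external/pinot-batch-ingestion/pinot-batch-ingestion-spark/pinot-batch-ingestion-spark-${PINOT_VERSION}-shaded.jar:${PINOT_DISTRIBUTION_DIR}/lib/pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar:${PINOT_DISTRIBUTION_DIR}/plugins/pinot-file-system/pinot-s3/pinot-s3-${PINOT_VERSION}-shaded.jar:${PINOT_DISTRIBUTION_DIR}/plugins/pinot-input-format/pinot-parquet/pinot-parquet-${PINOT_VERSION}-shaded.jar" \
  "${PINOT_DISTRIBUTION_DIR}/lib/pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar" \
  -jobSpecFile /path/to/sparkIngestionJobSpec.yaml
```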
The above command includes the JARs of all the required plugins in Spark’s driver classpath. In practice, you only need to do this if you get a ClassNotFoundException.
The command will take a few seconds to complete, depending on your machine’s performance.
If the ingestion job doesn’t throw any errors to the console, you have successfully populated the events table with data coming from the S3 bucket.
The Pinot data explorer should now show the populated events table.
Querying Pinot to calculate the daily active users (DAU)
Here comes the exciting part. Now that we have data ingested into Pinot, let’s write some SQL queries to calculate DAU based on the user engagement events. The mobile app has been released to the public, and its audience will grow daily. After users install the app for the first time, some will continue to use it regularly or from time to time, while others will quickly lose interest. All of these user segments are considered active as long as they had at least one session in the app during a given time period. Let’s use SQL to calculate the number of unique users in the last three months. Copy and paste the following query into the Query console and run it.
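A sketch of such a query, assuming the illustrative column names used earlier (uuid as the user identifier and ts as an epoch-millisecond timestamp); the time boundary below is an assumed epoch-millisecond value standing in for "three months ago".

```sql
-- Count the distinct users that produced at least one event in the last three months
SELECT DISTINCTCOUNT(uuid) AS activeUsers
FROM events
WHERE ts > 1630454400000
```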