How to unnest arrays in JSON documents
In this recipe we’ll learn how to unnest or explode values when ingesting JSON documents into Apache Pinot. When we unnest an array it means that we’ll create one row in Pinot for each item in the array rather than one row for the whole JSON document.
Pinot Version | 0.10.0 |
---|---|
Code | startreedata/pinot-recipes/json-unnest |
Prerequisites
To follow the code examples in this guide, you must install Docker locally and download recipes.
Navigate to recipe
- If you haven’t already, download recipes.
- In terminal, go to the recipe by running the following command:
Launch Pinot Cluster
You can spin up a Pinot Cluster by running the following command:
This command will run a single instance of the Pinot Controller, Pinot Server, Pinot Broker, and Zookeeper. You can find the docker-compose.yml file on GitHub.
Dataset
We’re going to import the following JSON file:
data/movies.json
We’re particularly interested in the Ratings
property of each movie. We’re going to unnest or explode the ratings so that we have one row per rating in our database after the data is ingested.
Pinot Schema and Table
Now let’s create a Pinot Schema and Table.
First, the schema:
config/schema.json
When we unnest a JSON list in a document, column names will be prefixed with the parent property name, in this case Ratings
.
We’ll also have the following table config:
config/table.json
The highlighted config specifies which field in the data source should be unnested.
The Ratings
field contains an array of JSON objects and that array will be exploded to create one row for each item in the array.
If you want to rename the unnested fields, see How to rename fields when unnesting arrays in JSON documents
You can create the table and schema by running the following command:`
You should see a message similar to the following if everything is working correctly:
Ingestion Job
Now we’re going to import the JSON file into Pinot. We’ll do this with the following ingestion spec:
config/job-spec.yml
The import job will map fields in each JSON document to a corresponding column in the movie_ratings
schema. If one of the fields doesn’t exist in the schema it will be skipped.
You can run the following command to run the import:
Querying
Once that’s completed, navigate to localhost:9000/#/query and click on the movie_ratings
table or copy/paste the following query:
You will see the following output:
Rated | Ratings.Source | Ratings.Value | Title |
---|---|---|---|
R | Internet Movie Database | 8.7/10 | The Matrix |
R | Rotten Tomatoes | 88% | The Matrix |
R | Metacritic | 73/100 | The Matrix |
PG-13 | Internet Movie Database | 7.8/10 | Avatar |
PG-13 | Rotten Tomatoes | 82% | Avatar |
PG-13 | Metacritic | 83/100 | Avatar |
PG | Internet Movie Database | 6.3/10 | The Boss Baby |
PG | Rotten Tomatoes | 53% | The Boss Baby |
PG | Metacritic | 50/100 | The Boss Baby |
TV-PG | Internet Movie Database | 5.2/10 | Tall Girl |
TV-PG | Rotten Tomatoes | 38% | Tall Girl |
TV-MA | Internet Movie Database | 6.3/10 | Love Hard |
TV-MA | Rotten Tomatoes | 57% | Love Hard |
TV-MA | Metacritic | 42/100 | Love Hard |
Query Results
We can see that each movie has multiple rows, one for each rating given to that movie.