How to index a JSON column
To learn how to configure an index for a JSON column, watch the following video, or complete the tutorial below.
Pinot Version | 1.0.0 |
---|---|
Code | startreedata/pinot-recipes/json-index |
Prerequisites
To follow the code examples in this guide, you must install Docker locally and download recipes.
Navigate to recipe
- If you haven’t already, download recipes.
- In terminal, go to the recipe by running the following command:
Launch Pinot Cluster
You can spin up a Pinot Cluster by running the following command:
This command will run a single instance of the Pinot Controller, Pinot Server, Pinot Broker, Kafka, and Zookeeper. You can find the docker-compose.yml file on GitHub.
Data generator
This recipe contains a data generator that creates events with data about people.
It uses the Faker library, so you’ll first need to install that:
You can generate data by running the following command:
Output is shown below:
Kafka ingestion
We’re going to ingest this data into an Apache Kafka topic using the kcat command line tool.
We’ll also use jq
to structure the data in the key:payload
structure that Kafka expects:
We can check that Kafka has some data by running the following command:
We’ll see something like the following:
Pinot Schema and Table
Now let’s create a Pinot Schema and Table.
First, the schema:
Our schema has only two columns - one for the timestamp and another one that stores the person.
Now for the table config:
This highlighted section contains the config for the JSON index. An explanation of each of the config parameters is shown below:
Config Key | Description | Type | Default |
---|---|---|---|
maxLevels | Max levels to flatten the json object (array is also counted as one level) | int | -1 (unlimited) |
excludeArray | Whether to exclude array when flattening the object | boolean | false (include array) |
disableCrossArrayUnnest | Whether to not unnest multiple arrays (unique combination of all elements) | boolean | false (calculate unique combination of all elements) |
includePaths | Only include the given paths, e.g. “.a.c[*]” (mutual exclusive with excludePaths). Paths under the included paths will be included, e.g. “.a.b” is configured to be included. | Set<String\> | null (include all paths) |
excludePaths | Exclude the given paths, e.g. “.a.c[*]” (mutual exclusive with includePaths). Paths under the excluded paths will also be excluded, e.g. “.a.b” is configured to be excluded. | Set<String\> | null (include all paths) |
excludeFields | Exclude the given fields, e.g. “b”, “c”, even if it is under the included paths. | Set<String\> | null (include all fields) |
So we are including all fields except age
in the JSON index.
Fields included in the JSON index can be filtered using the JSON_MATCH
function.
If you use this function with a field that isn’t included in the index, it won’t return any records.
We’ll create the table by running the following:
Querying by JSON index
Let’s now try to query this table using the JSON index. JSON indexes support the following predicates: =
, <>
, IN
, and NOT IN
State = Kentucky
ts | person |
---|---|
2023-03-30 09:40:59.788 | {"address":{"street_address":"8173 Diaz Hollow Apt. 175","country":"British Indian Ocean Territory (Chagos Archipelago)","city":"East Carlos","state":"Kentucky"},"friend_ids":["de27ec51-2895-4044-98bc-5643853d7d0d","84080eb3-1c17-4b4a-ada7-a764ed44cef7"],"name":"Robert Alvarado","phone_number":"085.136.9039x6810","id":"874a4543-f528-49e2-85a8-d31a6cbae269","job":{"company":"Gray, Huerta and Moyer","position":"Chemical engineer","department":"transition collaborative networks"},"interests":["Swimming","Meditation","Reading","Cycling","Reading"],"email":"jamiebarker@example.net","age":52} |
2023-03-30 09:40:59.823 | {"address":{"street_address":"277 Robert Crossroad Apt. 622","country":"Haiti","city":"Port Karenshire","state":"Kentucky"},"friend_ids":["2285322d-0049-4386-b41c-9f96a0ed5038"],"name":"Diana Thomas","phone_number":"171-966-4122","id":"b996770e-478c-412d-89bd-d3a085140016","job":{"company":"Diaz LLC","position":"Scientist, research (maths)","department":"deploy granular convergence"},"interests":["Sports","Photography","Art","Photography","Reading"],"email":"ncook@example.org","age":40} |
2023-03-30 09:40:59.843 | {"address":{"street_address":"85628 William Curve","country":"Djibouti","city":"Gregoryborough","state":"Kentucky"},"friend_ids":["1a7ee9e4-6daa-4994-a077-3170bffab04f"],"name":"Monica Harrell","phone_number":"(120)253-7632","id":"c71df52f-49c4-4d61-b6d6-d94134c92e8b","job":{"company":"Larson-Mcdaniel","position":"Ranger/warden","department":"seize leading-edge web services"},"interests":["Traveling","Music","Art"],"email":"allen60@example.net","age":33} |
Query Results
State <\>
Kentucky
count(*) |
---|
319584 |
Query Results
State IN Kentucky or Alabama
state | count(*) |
---|---|
Kentucky | 6682 |
Alabama | 6414 |
Query Results
State NOT IN Kentucky or Alabama
state | count(*) |
---|---|
Montana | 6743 |
Kansas | 6718 |
Maryland | 6693 |
Ohio | 6677 |
Connecticut | 6662 |
Mississippi | 6639 |
Rhode Island | 6630 |
Arkansas | 6621 |
Query Results
Unindexed field
What if we try to query by the age
column, which hasn’t been indexed?
count(*) |
---|
0 |
Query Results
We get back no results! We can even run the following query to confirm that the table contains records with an age of 59:
count(*) |
---|
5018 |
Query Results