Optimize geospatial query performance with Apache Pinot. This guide covers everything from basic performance principles to tuning techniques.

Performance Fundamentals

How H3 Index Acceleration Works

Understanding the two-phase optimization approach is key to writing efficient queries:
1

Coarse Filtering (H3 Index)

The H3 index quickly identifies candidate records within nearby hexagons, eliminating the vast majority of data from consideration.
2

Precise Filtering (Geospatial Functions)

Only records in candidate hexagons undergo expensive precise geospatial calculations (ST_Distance, ST_Within, etc.).
Example: Finding locations within 5km
  • Without Index: Calculate exact distance for ALL records (expensive)
  • With H3 Index:
    1. Find hexagon containing given point
    2. Find hexagons fully contained within the searched area
    3. Include all points associated with those hexagons
    4. Find hexagons overlapping searched area
    5. Calculate exact distance for records in those hexagons
Important: If query distance is >100x the hexagon size, the index is automatically bypassed and falls back to full scan for performance reasons.

Query Plan Analysis

Using EXPLAIN PLAN FOR

The most important tool for geospatial performance optimization: Single-stage query engine (SSQE):
EXPLAIN PLAN FOR
SELECT store_name, address, 
       ST_Distance(location, ST_Point(-122.4194, 37.7749)) as distance
FROM stores 
WHERE ST_Distance(location, ST_Point(-122.4194, 37.7749)) < 5000
ORDER BY distance;
Multi-stage query engine MSQE:
set explainAskingServers=true;
EXPLAIN PLAN FOR
SELECT store_name, address,
       ST_Distance(location, ST_Point(-122.4194, 37.7749)) as distance
FROM stores
WHERE ST_Distance(location, ST_Point(-122.4194, 37.7749)) < 5000
ORDER BY distance;
Setting explainAskingServers query option is necessary for MSQE to show usage of segment-level indexes.

Index Usage Operators

Look for these operators in your query plans:
Shows as FilterH3Index in MSQE. When: ST_Distance with range conditions
WHERE ST_Distance(location_column, literal_point) < distance
WHERE ST_Distance(location_column, literal_point) > distance  
WHERE ST_Distance(location_column, literal_point) BETWEEN min_dist AND max_dist
Query Plan Example: SSQE:
BROKER_REDUCE(limit:1000)
└── COMBINE_SELECT
    └── SELECT(selectList:[store_name, address, distance])
        └── TRANSFORM_PASSTHROUGH(distance:ST_DISTANCE(...))
            └── FILTER_H3_INDEX(predicate:ST_DISTANCE(...) < 5000)
                └── SEGMENT_SCAN(table:stores)
MSQE:
PinotLogicalExchange(distribution=[broadcast])
  LeafStageCombineOperator(table=[stores])
    StreamingInstanceResponse
      StreamingCombineSelect
        SelectStreaming(segment=[stores_OFFLINE_0], table=[starbucksStores])
          Project(columns=[[store_name, address, distance]])
            DocIdSet(maxDocs=[10000])
              FilterH3Index(predicate=[stdistance(...) < '5000.0'], indexLookUp=[h3_index], operator=[RANGE])

EXPLAIN PLAN Examples

Here are real examples of what you should see when H3 indexes are working: FILTER_H3_INDEX Example (ST_Distance):
EXPLAIN PLAN FOR
SELECT * FROM starbucksStores
WHERE ST_Distance(location_st_point, ST_Point(-122.4194, 37.7749)) < 5000;
Output:
BROKER_REDUCE(limit:10)
COMBINE_SELECT
PLAN_START(numSegmentsForThisPlan:1)
SELECT(selectList:address, lat, location_st_point, lon, name)
PROJECT(name, lon, address, location_st_point, lat)
DOC_ID_SET
FILTER_H3_INDEX
INCLUSION_FILTER_H3_INDEX Example (ST_Contains):
EXPLAIN PLAN FOR
SELECT * FROM starbucksStores
WHERE ST_Contains(ST_GeomFromText('POLYGON((0 0, 4 0, 4 4, 0 4, 0 0))'), location_st_point);
Output:
BROKER_REDUCE(limit:10)
COMBINE_SELECT
PLAN_START(numSegmentsForThisPlan:1)
SELECT(selectList:address, lat, location_st_point, lon, name)
PROJECT(name, lon, address, location_st_point, lat)
DOC_ID_SET
INCLUSION_FILTER_H3_INDEX
Key Indicators:
  • FILTER_H3_INDEX for distance-based queries
  • INCLUSION_FILTER_H3_INDEX for containment queries
  • DOC_ID_SET showing efficient document filtering
  • No FILTER operator (which would indicate full scan)

Index Optimization

Resolution Selection Strategy

Choose H3 resolutions based on your query patterns and performance requirements:

Single Resolution Approach

When to use: Uniform query patterns with consistent search radius
{
  "indexes": {
    "h3": {
      "resolution": [8]  // ~500m precision, good for city-wide queries
    }
  }
}
Query Performance:
  • Excellent for queries around the target resolution
  • Suboptimal for very different search radii
When to use: Varied query patterns with different search radii
{
  "indexes": {
    "h3": {
      "resolution": [5, 8, 11]  // Coarse, medium, fine
    }
  }
}
Resolution Selection Algorithm:
  • Resolution 5 (8km): For queries >20km radius
  • Resolution 8 (500m): For queries 1-20km radius
  • Resolution 11 (25m): For queries <1km radius

Performance vs Storage Analysis

ResolutionsQuery PerformanceStorage OverheadMemory UsageUse Case
[8]⭐⭐⭐⭐⭐⭐⭐⭐LowUniform patterns
[5, 8]⭐⭐⭐⭐⭐⭐⭐⭐MediumMost applications
[5, 8, 11]⭐⭐⭐⭐⭐⭐⭐⭐Medium-HighVaried patterns
[3, 6, 9, 12]⭐⭐⭐⭐⭐⭐⭐HighComplex analytics
[1, 4, 7, 10, 13]⭐⭐⭐⭐Very HighOver-optimization

Resolution Tuning Examples

Retail Store Locator:
{
  "resolution": [7, 10]  // 1.2km and 65m
}
  • Resolution 7: “Stores within 5km” queries
  • Resolution 10: “Nearest store” precision queries
Logistics & Delivery:
{
  "resolution": [4, 7, 10]  // 22km, 1.2km, 65m
}
  • Resolution 4: Regional route planning
  • Resolution 7: City-wide delivery zones
  • Resolution 10: Precise delivery locations
Global IoT Network:
{
  "resolution": [2, 5, 8, 11]  // 158km, 8km, 500m, 25m
}
  • Resolution 2: Country/region analysis
  • Resolution 5: City-level aggregation
  • Resolution 8: Local area monitoring
  • Resolution 11: Device-level precision

Optimized Query Patterns

Distance-Based Queries

Index-Optimized Patterns:
-- Correct argument order (column first, literal second)
WHERE ST_Distance(location_column, ST_Point(-122.4194, 37.7749)) < 5000

-- Range queries work well
WHERE ST_Distance(location_column, ST_Point(-122.4194, 37.7749)) BETWEEN 1000 AND 5000

-- Use with ORDER BY for nearest neighbor
SELECT * FROM stores 
WHERE ST_Distance(location, ST_Point(-122.4194, 37.7749)) < 10000
ORDER BY ST_Distance(location, ST_Point(-122.4194, 37.7749))
LIMIT 10;
Non-Optimized Patterns:
-- Wrong argument order (literal first)
WHERE ST_Distance(ST_Point(-122.4194, 37.7749), location_column) < 5000

-- Both arguments are columns (can't use index)
WHERE ST_Distance(pickup_location, dropoff_location) < 5000

-- Complex expressions in spatial function
WHERE ST_Distance(location, ST_Point(lon_column, lat_column)) < 5000

Containment Queries

ST_Within Optimization:
-- Optimized: column first, literal second  
SELECT * FROM events
WHERE ST_Within(
  event_location,
  ST_GeomFromText('POLYGON((-122.5 37.7, -122.4 37.7, -122.4 37.8, -122.5 37.8, -122.5 37.7))')
);

-- Not optimized: literal first
WHERE ST_Within(
  ST_GeomFromText('POLYGON(...)'),
  event_location
);
ST_Contains Optimization:
-- Optimized: literal first, column second
SELECT * FROM delivery_addresses
WHERE ST_Contains(
  ST_GeomFromText('POLYGON((-122.5 37.7, -122.4 37.7, -122.4 37.8, -122.5 37.8, -122.5 37.7))'),
  address_location
);

-- Not optimized: column first  
WHERE ST_Contains(service_area, fixed_point);

Advanced Query Optimization

Multiple Spatial Conditions:
-- Combine spatial and non-spatial filters efficiently
SELECT store_name, category
FROM stores 
WHERE category = 'restaurant'  -- Fast filter first
  AND ST_Distance(location, ST_Point(-122.4194, 37.7749)) < 2000  -- Spatial filter
  AND rating > 4.0;  -- Additional filters after spatial
Subquery Optimization:
-- Use spatial filtering in subquery to reduce dataset
SELECT s.store_name, s.address, r.avg_rating
FROM (
  SELECT store_id, store_name, address 
  FROM stores 
  WHERE ST_Distance(location, ST_Point(-122.4194, 37.7749)) < 5000
) s
JOIN ratings r ON s.store_id = r.store_id;
Spatial Aggregations:
-- Efficient spatial grouping
SELECT 
  region,
  COUNT(*) as store_count,
  AVG(ST_Distance(location, ST_Point(-122.4194, 37.7749))) as avg_distance
FROM stores 
WHERE ST_Distance(location, ST_Point(-122.4194, 37.7749)) < 20000
GROUP BY region
ORDER BY avg_distance;

Performance Monitoring

Key Performance Metrics

Monitor:
  • Overall query response time
  • Spatial filter execution time
  • Index effectiveness ratio
Tools:
-- Enable query execution metrics
SET enableQueryExecutionMetrics = true;

-- Run your spatial query and check execution time
SELECT store_name FROM stores 
WHERE ST_Distance(location, ST_Point(-122.4194, 37.7749)) < 5000;

Performance Benchmarking

Create Test Scenarios:
-- Benchmark different query patterns
-- Small radius (high precision)
SELECT 'Small Radius' as test_type, COUNT(*) as result_count
FROM stores 
WHERE ST_Distance(location, ST_Point(-122.4194, 37.7749)) < 1000;

-- Medium radius (typical use case)
SELECT 'Medium Radius' as test_type, COUNT(*) as result_count  
FROM stores
WHERE ST_Distance(location, ST_Point(-122.4194, 37.7749)) < 5000;

-- Large radius (coarse filtering)
SELECT 'Large Radius' as test_type, COUNT(*) as result_count
FROM stores 
WHERE ST_Distance(location, ST_Point(-122.4194, 37.7749)) < 25000;
Performance Comparison:
-- Compare indexed vs non-indexed performance
-- (Temporarily disable index for comparison)

-- Time indexed query
SELECT store_name FROM stores_indexed
WHERE ST_Distance(location, ST_Point(-122.4194, 37.7749)) < 5000;

-- Time non-indexed query  
SELECT store_name FROM stores_no_index
WHERE ST_Distance(location, ST_Point(-122.4194, 37.7749)) < 5000;

Advanced Optimization Techniques

Query Rewriting for Performance

Distance Approximation:
-- For very large datasets, use approximate filtering first
WITH nearby_candidates AS (
  SELECT store_id, location
  FROM stores 
  WHERE ST_Distance(location, ST_Point(-122.4194, 37.7749)) < 6000  -- Slightly larger radius
)
SELECT s.store_name, s.address
FROM nearby_candidates nc
JOIN stores s ON nc.store_id = s.store_id
WHERE ST_Distance(nc.location, ST_Point(-122.4194, 37.7749)) < 5000;  -- Exact filter
Batch Spatial Operations:
-- Process multiple locations efficiently
WITH locations_of_interest(location_name, lon, lat) AS (
  VALUES 
    ('SF', -122.4194, 37.7749),
    ('Oakland', -122.2711, 37.8044),
    ('San Jose', -121.8863, 37.3382)
)
SELECT 
  loi.location_name,
  COUNT(*) as nearby_stores
FROM locations_of_interest loi
CROSS JOIN stores s
WHERE ST_Distance(s.location, ST_Point(loi.lon, loi.lat)) < 5000
GROUP BY loi.location_name;

Memory Optimization

Column Selection:
-- Select only needed columns
SELECT store_name, address 
FROM stores 
WHERE ST_Distance(location, ST_Point(-122.4194, 37.7749)) < 5000;

-- Avoid SELECT * with spatial data
SELECT * FROM stores  -- Includes large spatial columns
WHERE ST_Distance(location, ST_Point(-122.4194, 37.7749)) < 5000;
Result Set Limiting:
-- Use LIMIT for interactive queries
SELECT store_name, 
       ST_Distance(location, ST_Point(-122.4194, 37.7749)) as distance
FROM stores 
WHERE ST_Distance(location, ST_Point(-122.4194, 37.7749)) < 10000
ORDER BY distance
LIMIT 20;  -- Only get top 20 results

Troubleshooting Performance Issues

Common Performance Problems

Performance Tuning Workflow

1

Baseline Measurement

Measure current performance and identify bottlenecks using EXPLAIN PLAN and query timing.
2

Index Configuration

Optimize H3 resolutions based on query patterns and test different configurations.
3

Query Optimization

Rewrite queries to use indexes effectively and add non-spatial filters where possible.
4

Validation

Verify improvements using query plans and performance benchmarks.
5

Monitoring

Set up ongoing monitoring of query performance and index effectiveness.

Best Practices Summary

Configuration Best Practices

  • Use 2-3 H3 resolutions for most applications
  • Choose resolutions based on typical query radius
  • Always disable dictionary encoding (encodingType: "RAW")
  • Test resolution configurations with real query patterns

Query Best Practices

  • Use correct argument order in spatial functions
  • Combine spatial and non-spatial filters efficiently
  • Use EXPLAIN PLAN to verify index usage
  • Limit result sets in interactive applications

Performance Monitoring

  • Monitor query execution times regularly
  • Check index effectiveness ratios
  • Benchmark different query patterns
  • Set up alerts for performance degradation