
DuckDB Query Optimization: Speed Up Your Queries

DuckDB query optimization techniques: EXPLAIN ANALYZE, column pruning, predicate pushdown, indexes, and partition strategies. Make your queries faster.

Mark Rachapoom
·8 min read

DuckDB is already fast by default — its vectorized execution engine, parallel processing, and columnar storage handle most queries efficiently without tuning. But when you're dealing with large datasets or complex queries, understanding how to optimize makes a real difference.

Here's the systematic approach.

Step 1: Understand What's Happening with EXPLAIN#

Before optimizing, understand what DuckDB is actually doing:

-- See the query plan
EXPLAIN SELECT * FROM events WHERE user_id = 'user-123';
 
-- See the plan with actual execution statistics
EXPLAIN ANALYZE SELECT * FROM events WHERE user_id = 'user-123';

The output shows:

  • Physical plan — what operations DuckDB will perform
  • Estimated rows — how many rows DuckDB thinks each step processes
  • Actual rows (with ANALYZE) — how many rows were actually processed
  • Execution time — time spent in each operator

Look for:

  • Large row counts at filter steps (means the filter is applied late)
  • Full table scans where you expect an index scan
  • Nested loop joins on large tables (usually should be hash joins)
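
For recurring analysis, DuckDB can also write a per-query profile to a file instead of printing it to the console (the output path here is illustrative):

-- Write operator timings and cardinalities as JSON for every query in the session
SET enable_profiling = 'json';
SET profiling_output = '/tmp/duckdb_profile.json';
 
SELECT count(*) FROM events WHERE user_id = 'user-123';

Inspect the JSON afterwards to compare operator timings across runs.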

Step 2: Column Pruning — Only Read What You Need#

DuckDB's columnar storage means it only reads columns you reference. Don't SELECT * when you only need a few columns:

-- Slow: reads all columns from disk
SELECT * FROM events WHERE event_type = 'purchase';
 
-- Fast: reads only 3 columns
SELECT user_id, revenue, occurred_at FROM events WHERE event_type = 'purchase';

This matters most for wide tables (many columns) and Parquet files (columnar storage maximizes this benefit).
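
When you genuinely need most of a wide table, DuckDB's EXCLUDE clause keeps the query readable while still pruning the heavy columns (the column names below are illustrative):

-- Read everything except the large payload columns
SELECT * EXCLUDE (raw_payload, headers)
FROM events
WHERE event_type = 'purchase';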

Step 3: Predicate Pushdown — Filter Early#

DuckDB automatically pushes filters down to the scan level, but you can help by:

1. Put the most selective filter first:

-- Less optimal: broader filter first
SELECT * FROM events 
WHERE year = 2026 AND user_id = 'specific-user';
 
-- More optimal: most selective first (DuckDB may optimize automatically, but be explicit)
SELECT * FROM events 
WHERE user_id = 'specific-user' AND year = 2026;

2. Use range predicates on sorted columns:

-- DuckDB can skip zones of data when filtering sorted timestamp columns
SELECT * FROM events
WHERE occurred_at BETWEEN '2026-03-01' AND '2026-03-31';

3. Push filters into CTEs and subqueries:

-- Less optimal: filter after join
SELECT e.*, u.segment
FROM events e
JOIN users u ON e.user_id = u.id
WHERE e.event_type = 'purchase' AND e.occurred_at > '2026-01-01';
 
-- More optimal: filter before join
WITH recent_purchases AS (
    SELECT * FROM events
    WHERE event_type = 'purchase' AND occurred_at > '2026-01-01'
)
SELECT rp.*, u.segment
FROM recent_purchases rp
JOIN users u ON rp.user_id = u.id;

Step 4: Min/Max Zonemaps (Automatic)#

DuckDB automatically creates min/max statistics (zonemaps) for each column in each row group. These let DuckDB skip entire chunks of data without reading them.

To benefit from zonemaps:

  • Insert data in sorted order when possible (especially for timestamp columns)
  • Avoid random inserts that scatter values across row groups

-- When loading data, sort by the most commonly filtered column
INSERT INTO events
SELECT * FROM staging_events
ORDER BY occurred_at;

If your data is pre-sorted by date (common for event logs), DuckDB can skip entire months of data when you filter by date range.
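
You can inspect the per-segment statistics DuckDB keeps for a native table with the storage_info table function — the stats column shows the min/max values used for skipping (a quick sanity check, not something you need in production):

-- One row per column segment, including its min/max statistics
SELECT row_group_id, column_name, stats
FROM pragma_storage_info('events')
WHERE column_name = 'occurred_at';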

Step 5: ART Indexes for Point Lookups#

DuckDB supports Adaptive Radix Tree (ART) indexes for point lookups and range scans:

-- Create an index
CREATE INDEX idx_events_user_id ON events(user_id);
CREATE INDEX idx_events_occurred_at ON events(occurred_at);
 
-- Composite index
CREATE INDEX idx_events_user_type ON events(user_id, event_type);

When indexes help:

  • Point lookups: WHERE user_id = 'specific-user'
  • Highly selective range scans (few matching rows): WHERE user_id = 'x' AND occurred_at > '2026-01-01'
  • JOIN conditions on the indexed column

When indexes don't help:

  • Aggregations that scan the whole table anyway
  • Queries that return more than ~10% of rows
  • Low-cardinality columns like event_type with only a few distinct values

-- Check if an index exists
SELECT * FROM duckdb_indexes() WHERE table_name = 'events';

Step 6: Join Optimization#

DuckDB chooses join algorithms automatically, but you can influence them:

Build Side Selection#

For hash joins, DuckDB builds an in-memory hash table on one input (the build side) and streams the other past it (the probe side). Building on the smaller relation is cheaper, and DuckDB's cost-based optimizer reorders joins to make that happen automatically — so explicit ordering is mostly about readability:

-- Both forms perform the same in practice (DuckDB picks the build side),
-- but writing the small table as the build input makes the intent clear
SELECT * FROM events  -- large (probe side)
JOIN users ON events.user_id = users.id;  -- small (build side)

Avoid Cross Joins#

-- Accidental cross join (missing join condition)
SELECT * FROM orders, customers;  -- Every order × every customer!
 
-- Correct
SELECT * FROM orders JOIN customers ON orders.customer_id = customers.id;

Use Appropriate Join Types#

-- Inner join: only matching rows
SELECT * FROM orders INNER JOIN customers ON orders.customer_id = customers.id;
 
-- Left join: all orders, NULL for unmatched customers
SELECT * FROM orders LEFT JOIN customers ON orders.customer_id = customers.id;
 
-- Semi join (exists): use WHERE EXISTS instead of JOIN + DISTINCT
SELECT DISTINCT o.customer_id FROM orders o
WHERE EXISTS (SELECT 1 FROM customers c WHERE c.id = o.customer_id AND c.segment = 'enterprise');

Step 7: Aggregate Optimization#

Pre-aggregate When Joining#

-- Less optimal: join then aggregate
SELECT u.segment, SUM(e.revenue)
FROM events e
JOIN users u ON e.user_id = u.id
WHERE e.event_type = 'purchase'
GROUP BY u.segment;
 
-- More optimal: aggregate before join (fewer rows to join)
WITH user_revenue AS (
    SELECT user_id, SUM(revenue) AS total_revenue
    FROM events
    WHERE event_type = 'purchase'
    GROUP BY user_id
)
SELECT u.segment, SUM(ur.total_revenue)
FROM user_revenue ur
JOIN users u ON ur.user_id = u.id
GROUP BY u.segment;

GROUPING SETS for Multiple Aggregations#

Instead of multiple queries:

-- Multiple queries (runs 3 table scans)
SELECT 'total' AS level, SUM(revenue) FROM events;
SELECT region, SUM(revenue) FROM events GROUP BY region;
SELECT region, product_id, SUM(revenue) FROM events GROUP BY region, product_id;
 
-- One query with GROUPING SETS (one table scan)
SELECT region, product_id, SUM(revenue)
FROM events
GROUP BY GROUPING SETS ((), (region), (region, product_id));

Step 8: Memory Management#

For large queries, tune memory settings:

-- Set memory limit (default: 80% of RAM)
SET memory_limit = '8GB';
 
-- Set temp directory for spilling to disk
SET temp_directory = '/fast-ssd/duckdb_temp';
 
-- Control parallel threads (default: number of CPU cores)
SET threads = 8;

Monitor memory usage:

-- Per-component memory usage (DuckDB 0.10+)
SELECT tag, memory_usage_bytes, temporary_storage_bytes FROM duckdb_memory();

Step 9: Partitioned Files#

For Parquet data, use Hive-style partitioning to enable partition pruning:

-- Write partitioned Parquet
COPY (SELECT * FROM events) TO 'events_partitioned/' 
(FORMAT PARQUET, PARTITION_BY (year, month));
 
-- Read with partition pruning (only reads 2026/03/)
SELECT * FROM read_parquet('events_partitioned/**/*.parquet', hive_partitioning=true)
WHERE year = 2026 AND month = 3;

DuckDB skips the partitions that don't match the filter — no reads from other years/months.

DenchClaw Query Optimization#

DenchClaw uses a PIVOT view pattern in DuckDB where EAV data is reshaped into readable views. For analytics queries, the key optimization is filtering before the PIVOT:

-- Faster: filter in the EAV layer before pivoting
WITH filtered_entries AS (
    SELECT entry_id FROM entries e
    JOIN entry_fields ef ON e.id = ef.entry_id
    JOIN fields f ON ef.field_id = f.id
    WHERE f.name = 'Stage' AND ef.value = 'Proposal'
)
SELECT * FROM v_deals
WHERE id IN (SELECT entry_id FROM filtered_entries);

Frequently Asked Questions#

How do I know if my query is using an index?#

Use EXPLAIN SELECT ... and look for an INDEX_SCAN operator in the plan. If you see SEQ_SCAN (a full table scan) instead, the index is not being used.

When should I add an index in DuckDB?#

Add indexes for columns that appear in WHERE clauses for point lookups, JOIN conditions, and high-cardinality columns you filter frequently. Don't add indexes on columns you never filter on.

Does DuckDB use parallel execution by default?#

Yes. DuckDB automatically uses all available CPU cores for query execution.

How can I make GROUP BY queries faster?#

Pre-filter rows before grouping and project only the columns you aggregate — DuckDB already parallelizes and partially aggregates internally. Very high-cardinality GROUP BY (millions of groups) is inherently expensive because every group must be materialized.
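
When an exact distinct count isn't required, DuckDB's approx_count_distinct (HyperLogLog-based) avoids materializing a huge set of distinct values:

-- Approximate distinct users per region — much cheaper than COUNT(DISTINCT ...)
SELECT region, approx_count_distinct(user_id) AS approx_users
FROM events
GROUP BY region;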

What's the biggest performance gain available?#

Eliminating unnecessary columns (SELECT *) and adding date-range filters to large time-series tables typically give the biggest wins.

Ready to try DenchClaw? Install in one command: npx denchclaw. Full setup guide →

Written by Mark Rachapoom

Building the future of AI CRM software.


© 2026 DenchHQ · San Francisco, CA