
DuckDB for Data Scientists: Better Than Pandas?

DuckDB processes large datasets faster than Pandas with less memory. Here's an honest comparison and when to use each for data science workflows.

Mark Rachapoom · 6 min read

DuckDB is faster than Pandas for most analytical workloads. It uses less memory, handles larger-than-RAM datasets, and often runs the same analysis in a fraction of the time. But it's not a Pandas replacement — it's a different tool with different strengths.

Here's the honest comparison and practical guide for using DuckDB in data science workflows.

The Performance Reality

Let's be direct about the numbers, using a typical analytical task: a groupby aggregation on a 10-million-row dataset.

Tool     Time    Memory Used
Pandas   8.2s    4.1 GB
DuckDB   0.4s    312 MB
Polars   0.6s    280 MB

DuckDB wins on both speed and memory. The gap widens as datasets grow. At 100 million rows, Pandas often runs out of memory entirely while DuckDB finishes in seconds.

Why? DuckDB uses vectorized columnar execution. It processes entire columns at once using SIMD instructions, parallelizes across cores automatically, and only reads the columns a query needs. Pandas also stores columns as NumPy arrays, but it executes single-threaded, materializes a full intermediate copy at each step, and pays Python-level overhead on every operation.

Setting Up DuckDB in Python

pip install duckdb
import duckdb
 
# In-memory database
con = duckdb.connect()
 
# Persistent database
con = duckdb.connect('analysis.duckdb')

That's it. No server, no configuration, no connection string.

DuckDB vs. Pandas: Side-by-Side

Reading Data

# Pandas
import pandas as pd
df = pd.read_csv('events.csv')
df = pd.read_parquet('events.parquet')
 
# DuckDB
import duckdb
df = duckdb.sql("SELECT * FROM 'events.csv'").df()
df = duckdb.sql("SELECT * FROM 'events.parquet'").df()
# Or read directly without loading into memory:
result = duckdb.sql("SELECT user_id, SUM(revenue) FROM 'events.parquet' GROUP BY user_id")

DuckDB can query files directly without loading them into memory first. For large files, this is the key advantage.

Filtering and Selecting

# Pandas
result = df[df['status'] == 'active'][['name', 'email', 'revenue']]
 
# DuckDB
result = duckdb.sql("""
    SELECT name, email, revenue 
    FROM df 
    WHERE status = 'active'
""").df()

DuckDB can query Pandas DataFrames directly — FROM df resolves against any DataFrame variable in the calling scope.

GroupBy Aggregation

# Pandas
result = df.groupby('region')['revenue'].agg(['sum', 'mean', 'count']).reset_index()
 
# DuckDB
result = duckdb.sql("""
    SELECT 
        region,
        SUM(revenue) AS total_revenue,
        AVG(revenue) AS avg_revenue,
        COUNT(*) AS count
    FROM df
    GROUP BY region
    ORDER BY total_revenue DESC
""").df()

Joins

# Pandas
result = pd.merge(orders, customers, on='customer_id', how='left')
 
# DuckDB
result = duckdb.sql("""
    SELECT o.*, c.name, c.email, c.segment
    FROM orders o
    LEFT JOIN customers c ON o.customer_id = c.id
""").df()

Multi-table joins in DuckDB are significantly faster than Pandas merge for large datasets.

Window Functions

Pandas emulates window functions through separate groupby, rank, and shift calls. DuckDB supports full SQL window functions in a single query:

# DuckDB - running total, rank, lag all in one query
result = duckdb.sql("""
    SELECT 
        user_id,
        occurred_at,
        revenue,
        SUM(revenue) OVER (PARTITION BY user_id ORDER BY occurred_at) AS lifetime_value,
        RANK() OVER (ORDER BY revenue DESC) AS revenue_rank,
        LAG(revenue) OVER (PARTITION BY user_id ORDER BY occurred_at) AS prev_purchase
    FROM df
""").df()

When to Use DuckDB vs. Pandas

Use DuckDB when:

  • Dataset is larger than a few hundred MB
  • You need complex aggregations, window functions, or multi-table joins
  • You want to query CSV/Parquet files directly without loading
  • You want SQL syntax (more readable for complex logic)
  • Performance matters

Use Pandas when:

  • You need row-level manipulation (iterating, applying custom functions)
  • You're doing matrix operations with NumPy integration
  • You're doing time-series resampling with resample()
  • You're feeding libraries that expect DataFrames (scikit-learn, matplotlib)
  • Dataset is small enough that performance doesn't matter

The pragmatic answer: Use both. DuckDB for the heavy lifting (read, filter, aggregate), Pandas for the final steps (plotting, ML model input, custom transformations).

The DuckDB + Pandas Workflow

import duckdb
import pandas as pd
import matplotlib.pyplot as plt
 
# Heavy lifting in DuckDB (fast)
monthly_revenue = duckdb.sql("""
    SELECT 
        DATE_TRUNC('month', occurred_at) AS month,
        SUM(revenue) AS revenue,
        COUNT(DISTINCT user_id) AS active_users
    FROM read_parquet('events/*.parquet')
    WHERE occurred_at >= '2025-01-01'
    GROUP BY month
    ORDER BY month
""").df()
 
# Final analysis in Pandas
monthly_revenue['month'] = pd.to_datetime(monthly_revenue['month'])
monthly_revenue['revenue_per_user'] = monthly_revenue['revenue'] / monthly_revenue['active_users']
 
# Visualization
monthly_revenue.plot(x='month', y='revenue', kind='bar', figsize=(12, 4))
plt.title('Monthly Revenue 2025')
plt.tight_layout()
plt.savefig('revenue_chart.png')

Working with Large Files

DuckDB's killer feature for data scientists: query files larger than RAM.

# This works even if the file is 100GB and your machine has 16GB RAM
result = duckdb.sql("""
    SELECT 
        user_segment,
        AVG(session_duration) AS avg_duration,
        COUNT(*) AS sessions
    FROM read_parquet('s3://my-bucket/sessions/2026/**/*.parquet')
    WHERE country = 'US'
    GROUP BY user_segment
""").df()

DuckDB streams data from S3, processes it in chunks, and returns only the aggregated result — which is tiny.

DuckDB in Jupyter Notebooks

# Magic commands (install: pip install duckdb-engine jupysql)
%load_ext sql
%sql duckdb:///:memory:
 
%%sql
SELECT 
    region,
    SUM(revenue) AS total_revenue
FROM df
GROUP BY region
ORDER BY total_revenue DESC

Or use DuckDB's built-in Jupyter display:

import duckdb
 
con = duckdb.connect()
con.execute("CREATE TABLE sales AS SELECT * FROM 'sales.csv'")
 
# Renders as a formatted table in Jupyter; call .df() for a pandas DataFrame
con.sql("SELECT * FROM sales LIMIT 10")

DuckDB with scikit-learn

from sklearn.ensemble import RandomForestClassifier
import duckdb
 
# Feature engineering in DuckDB (fast)
features = duckdb.sql("""
    SELECT 
        user_id,
        COUNT(*) AS purchase_count,
        SUM(revenue) AS lifetime_value,
        AVG(revenue) AS avg_order_value,
        MAX(occurred_at) AS last_purchase,
        DATEDIFF('day', MAX(occurred_at), CURRENT_DATE) AS days_since_purchase
    FROM events
    WHERE event_type = 'purchase'
    GROUP BY user_id
""").df()
 
# Labels
labels = duckdb.sql("""
    SELECT user_id, churned FROM user_labels
""").df()
 
# Merge and train
data = features.merge(labels, on='user_id')
X = data.drop(['user_id', 'churned', 'last_purchase'], axis=1).fillna(0)
y = data['churned']
 
model = RandomForestClassifier()
model.fit(X, y)

DuckDB and DenchClaw

DenchClaw uses DuckDB as its embedded database, which means data scientists working with DenchClaw CRM data can query it directly with DuckDB:

import duckdb
 
# Query your DenchClaw workspace
con = duckdb.connect('/Users/you/.openclaw-dench/workspace/workspace.duckdb', read_only=True)
 
# Analyze your pipeline
pipeline = con.sql("""
    SELECT 
        "Stage",
        COUNT(*) AS deals,
        SUM("Deal Value") AS pipeline_value,
        AVG(DATEDIFF('day', "Created At", CURRENT_DATE)) AS avg_age_days
    FROM v_deals
    WHERE "Status" = 'Active'
    GROUP BY "Stage"
""").df()
 
print(pipeline)

This is the DuckDB for business analytics pattern — your CRM data is queryable with standard Python data science tools.

Frequently Asked Questions

Is DuckDB faster than Polars?

They're comparable. DuckDB often wins on complex joins and aggregations; Polars sometimes wins on simple operations with its lazy evaluation engine. Both are dramatically faster than Pandas.

Can I use DuckDB with NumPy?

Yes. duckdb.sql(...).fetchnumpy() returns a dict mapping column names to NumPy arrays. Or use .df() for a Pandas DataFrame that NumPy can work with.

Does DuckDB support GPU acceleration?

No. DuckDB runs on CPU. For GPU-accelerated analytics, look at cuDF (RAPIDS).

Can I run DuckDB in a cloud notebook (Colab, Kaggle)?

Yes. pip install duckdb works in any Python environment including Colab and Kaggle notebooks.

What's the max dataset size DuckDB can handle?

DuckDB can process datasets larger than memory by streaming and spilling intermediate results to disk. In practice, performance is best when the working set (the data being actively processed) fits in RAM. For multi-terabyte datasets, consider ClickHouse or BigQuery.

Ready to try DenchClaw? Install in one command: npx denchclaw. Full setup guide →

Written by Mark Rachapoom

Building the future of AI CRM software.

© 2026 DenchHQ · San Francisco, CA