
DuckDB for Data Scientists: Better Than Pandas?

DuckDB processes large datasets faster than Pandas with less memory. Here's an honest comparison and when to use each for data science workflows.

Mark Rachapoom · 6 min read

DuckDB is faster than Pandas for most analytical workloads. It uses less memory, handles larger-than-RAM datasets, and often runs the same analysis in a fraction of the time. But it's not a Pandas replacement — it's a different tool with different strengths.

Here's the honest comparison and practical guide for using DuckDB in data science workflows.

The Performance Reality

Let's be direct about the numbers, using a typical analytical task: a groupby aggregation on a 10-million-row dataset.

Tool     Time    Memory Used
Pandas   8.2s    4.1 GB
DuckDB   0.4s    312 MB
Polars   0.6s    280 MB

DuckDB wins on both speed and memory. The gap widens as datasets grow. At 100 million rows, Pandas often runs out of memory entirely while DuckDB finishes in seconds.

Why? DuckDB uses vectorized columnar execution. It processes entire columns at once using SIMD instructions, parallelizes across cores automatically, and only reads the columns a query needs. Pandas also stores columns as NumPy arrays, but it executes single-threaded, materializes a full intermediate copy at each step, and pays Python-level overhead on every operation.

Setting Up DuckDB in Python

pip install duckdb
import duckdb
 
# In-memory database
con = duckdb.connect()
 
# Persistent database
con = duckdb.connect('analysis.duckdb')

That's it. No server, no configuration, no connection string.

DuckDB vs. Pandas: Side-by-Side

Reading Data

# Pandas
import pandas as pd
df = pd.read_csv('events.csv')
df = pd.read_parquet('events.parquet')
 
# DuckDB
import duckdb
df = duckdb.sql("SELECT * FROM 'events.csv'").df()
df = duckdb.sql("SELECT * FROM 'events.parquet'").df()
# Or read directly without loading into memory:
result = duckdb.sql("SELECT user_id, SUM(revenue) FROM 'events.parquet' GROUP BY user_id")

DuckDB can query files directly without loading them into memory first. For large files, this is the key advantage.

Filtering and Selecting

# Pandas
result = df[df['status'] == 'active'][['name', 'email', 'revenue']]
 
# DuckDB
result = duckdb.sql("""
    SELECT name, email, revenue 
    FROM df 
    WHERE status = 'active'
""").df()

DuckDB can query Pandas DataFrames directly — FROM df resolves against any DataFrame variable in the calling scope.

GroupBy Aggregation

# Pandas
result = df.groupby('region')['revenue'].agg(['sum', 'mean', 'count']).reset_index()
 
# DuckDB
result = duckdb.sql("""
    SELECT 
        region,
        SUM(revenue) AS total_revenue,
        AVG(revenue) AS avg_revenue,
        COUNT(*) AS count
    FROM df
    GROUP BY region
    ORDER BY total_revenue DESC
""").df()

Joins

# Pandas
result = pd.merge(orders, customers, on='customer_id', how='left')
 
# DuckDB
result = duckdb.sql("""
    SELECT o.*, c.name, c.email, c.segment
    FROM orders o
    LEFT JOIN customers c ON o.customer_id = c.id
""").df()

Multi-table joins in DuckDB are significantly faster than Pandas merge for large datasets.

Window Functions

Pandas emulates window functions through separate groupby, rank, and shift calls. DuckDB supports full SQL window functions in a single query:

# DuckDB - running total, rank, lag all in one query
result = duckdb.sql("""
    SELECT 
        user_id,
        occurred_at,
        revenue,
        SUM(revenue) OVER (PARTITION BY user_id ORDER BY occurred_at) AS lifetime_value,
        RANK() OVER (ORDER BY revenue DESC) AS revenue_rank,
        LAG(revenue) OVER (PARTITION BY user_id ORDER BY occurred_at) AS prev_purchase
    FROM df
""").df()

When to Use DuckDB vs. Pandas

Use DuckDB when:

  • Dataset is larger than a few hundred MB
  • You need complex aggregations, window functions, or multi-table joins
  • You want to query CSV/Parquet files directly without loading
  • You want SQL syntax (more readable for complex logic)
  • Performance matters

Use Pandas when:

  • You need row-level manipulation (iterating, applying custom functions)
  • You're doing matrix operations with NumPy integration
  • You're doing time-series resampling with resample()
  • You're feeding libraries that expect DataFrames (scikit-learn, matplotlib)
  • Dataset is small enough that performance doesn't matter

The pragmatic answer: Use both. DuckDB for the heavy lifting (read, filter, aggregate), Pandas for the final steps (plotting, ML model input, custom transformations).

The DuckDB + Pandas Workflow

import duckdb
import pandas as pd
import matplotlib.pyplot as plt
 
# Heavy lifting in DuckDB (fast)
monthly_revenue = duckdb.sql("""
    SELECT 
        DATE_TRUNC('month', occurred_at) AS month,
        SUM(revenue) AS revenue,
        COUNT(DISTINCT user_id) AS active_users
    FROM read_parquet('events/*.parquet')
    WHERE occurred_at >= '2025-01-01'
    GROUP BY month
    ORDER BY month
""").df()
 
# Final analysis in Pandas
monthly_revenue['month'] = pd.to_datetime(monthly_revenue['month'])
monthly_revenue['revenue_per_user'] = monthly_revenue['revenue'] / monthly_revenue['active_users']
 
# Visualization
monthly_revenue.plot(x='month', y='revenue', kind='bar', figsize=(12, 4))
plt.title('Monthly Revenue 2025')
plt.tight_layout()
plt.savefig('revenue_chart.png')

Working with Large Files

DuckDB's killer feature for data scientists: query files larger than RAM.

# This works even if the file is 100GB and your machine has 16GB RAM
result = duckdb.sql("""
    SELECT 
        user_segment,
        AVG(session_duration) AS avg_duration,
        COUNT(*) AS sessions
    FROM read_parquet('s3://my-bucket/sessions/2026/**/*.parquet')
    WHERE country = 'US'
    GROUP BY user_segment
""").df()

DuckDB streams data from S3, processes it in chunks, and returns only the aggregated result — which is tiny.

DuckDB in Jupyter Notebooks

# Magic commands (install: pip install duckdb-engine jupysql)
%load_ext sql
%sql duckdb:///:memory:
 
%%sql
SELECT 
    region,
    SUM(revenue) AS total_revenue
FROM df
GROUP BY region
ORDER BY total_revenue DESC

Or use DuckDB's built-in Jupyter display:

import duckdb
 
con = duckdb.connect()
con.execute("CREATE TABLE sales AS SELECT * FROM 'sales.csv'")
 
# Renders as a formatted table in Jupyter; call .df() for a pandas DataFrame
con.sql("SELECT * FROM sales LIMIT 10")

DuckDB with scikit-learn

from sklearn.ensemble import RandomForestClassifier
import duckdb
 
# Feature engineering in DuckDB (fast)
features = duckdb.sql("""
    SELECT 
        user_id,
        COUNT(*) AS purchase_count,
        SUM(revenue) AS lifetime_value,
        AVG(revenue) AS avg_order_value,
        MAX(occurred_at) AS last_purchase,
        DATEDIFF('day', MAX(occurred_at), CURRENT_DATE) AS days_since_purchase
    FROM events
    WHERE event_type = 'purchase'
    GROUP BY user_id
""").df()
 
# Labels
labels = duckdb.sql("""
    SELECT user_id, churned FROM user_labels
""").df()
 
# Merge and train
data = features.merge(labels, on='user_id')
X = data.drop(['user_id', 'churned', 'last_purchase'], axis=1).fillna(0)
y = data['churned']
 
model = RandomForestClassifier()
model.fit(X, y)

DuckDB and DenchClaw

DenchClaw uses DuckDB as its embedded database, which means data scientists working with DenchClaw CRM data can query it directly with DuckDB:

import duckdb
 
# Query your DenchClaw workspace
con = duckdb.connect('/Users/you/.openclaw-dench/workspace/workspace.duckdb', read_only=True)
 
# Analyze your pipeline
pipeline = con.sql("""
    SELECT 
        "Stage",
        COUNT(*) AS deals,
        SUM("Deal Value") AS pipeline_value,
        AVG(DATEDIFF('day', "Created At", CURRENT_DATE)) AS avg_age_days
    FROM v_deals
    WHERE "Status" = 'Active'
    GROUP BY "Stage"
""").df()
 
print(pipeline)

This is the DuckDB for business analytics pattern — your CRM data is queryable with standard Python data science tools.

Frequently Asked Questions

Is DuckDB faster than Polars?

They're comparable. DuckDB often wins on complex joins and aggregations; Polars sometimes wins on simple operations with its lazy evaluation engine. Both are dramatically faster than Pandas.

Can I use DuckDB with NumPy?

Yes. duckdb.sql(...).fetchnumpy() returns a dict mapping column names to NumPy arrays. Or use .df() for a Pandas DataFrame that NumPy can work with.

Does DuckDB support GPU acceleration?

No. DuckDB runs on CPU. For GPU-accelerated analytics, look at cuDF (RAPIDS).

Can I run DuckDB in a cloud notebook (Colab, Kaggle)?

Yes. pip install duckdb works in any Python environment including Colab and Kaggle notebooks.

What's the max dataset size DuckDB can handle?

DuckDB can process datasets larger than memory by streaming and spilling intermediate results to disk. In practice, performance is best when the working set (the data being actively processed) fits in RAM. For multi-terabyte datasets, consider ClickHouse or BigQuery.

Ready to try DenchClaw? Install in one command: npx denchclaw. Full setup guide →

Written by Mark Rachapoom

Building the future of AI CRM software.

© 2026 DenchHQ · San Francisco, CA