
DuckDB with Apache Arrow: Zero-Copy Data Processing

DuckDB and Apache Arrow integrate natively for zero-copy data exchange. Here's how to use them together for high-performance data pipelines.

Mark Rachapoom · 6 min read

DuckDB and Apache Arrow are designed to work together. Arrow provides an in-memory columnar format; DuckDB provides fast analytical SQL. When you combine them, you get zero-copy data exchange — DuckDB can read Arrow tables without deserializing or copying data.

This matters for performance-critical pipelines where moving data between tools is usually the bottleneck.

What Is Apache Arrow?#

Apache Arrow is a language-independent columnar memory format. It defines how columnar data is laid out in memory so that:

  1. Different tools can share data without serialization/deserialization
  2. SIMD CPU instructions can operate directly on Arrow-format data
  3. Data transfers between Python, Java, C++, and Rust are zero-copy

It's the interchange format used by Polars, PyArrow, Spark, and DuckDB, and optionally by Pandas 2.0+ via Arrow-backed dtypes.

The DuckDB + Arrow Integration#

DuckDB can:

  • Read from Arrow tables without copying data
  • Write to Arrow tables without copying data
  • Execute SQL against Arrow data via replacement scans, which resolve a Python variable by name
  • Return results as Arrow tables or record batches via the Python API

This means: compute in DuckDB, pass results to Pandas or Polars as Arrow, and avoid the copy.

Python Setup#

pip install duckdb pyarrow pandas

import duckdb
import pyarrow as pa
import pyarrow.compute as pc

Reading Arrow Tables in DuckDB#

import duckdb
import pyarrow as pa
 
# Create an Arrow table
table = pa.table({
    'user_id': ['u1', 'u2', 'u3', 'u4'],
    'revenue': [100.0, 250.0, 75.0, 500.0],
    'segment': ['pro', 'free', 'pro', 'enterprise']
})
 
# Query it with DuckDB SQL — zero copy!
result = duckdb.sql("""
    SELECT segment, SUM(revenue) AS total, COUNT(*) AS users
    FROM table
    GROUP BY segment
    ORDER BY total DESC
""").arrow()  # Returns Arrow table
 
print(result.to_pandas())

The FROM table clause references the Python variable table directly — DuckDB's replacement scans resolve the name at query time and read the underlying Arrow buffers without copying them.

Writing DuckDB Results as Arrow#

import duckdb
 
con = duckdb.connect('analytics.duckdb')
 
# Execute query and get Arrow result
arrow_result = con.execute("""
    SELECT 
        user_id,
        SUM(revenue) AS lifetime_value,
        COUNT(*) AS purchase_count
    FROM events
    WHERE event_type = 'purchase'
    GROUP BY user_id
""").arrow()
 
# arrow_result is a pyarrow.Table
print(type(arrow_result))  # <class 'pyarrow.lib.Table'>
print(arrow_result.schema)

This is zero-copy from DuckDB to Arrow — no serialization to bytes, no deserialization.

DuckDB + Arrow + Pandas Pipeline#

import duckdb
import pandas as pd
import pyarrow as pa
 
# Step 1: Load data with Pandas (existing workflow)
df = pd.read_parquet('events.parquet')
 
# Step 2: Heavy aggregation in DuckDB (fast!)
# DuckDB reads the Pandas DataFrame via Arrow
agg = duckdb.sql("""
    SELECT 
        DATE_TRUNC('month', occurred_at) AS month,
        segment,
        SUM(revenue) AS revenue,
        COUNT(DISTINCT user_id) AS users
    FROM df
    GROUP BY month, segment
    ORDER BY month, revenue DESC
""").df()  # Returns Pandas DataFrame
 
# Step 3: Visualization in Pandas/matplotlib
agg.pivot(index='month', columns='segment', values='revenue').plot(kind='bar')

The .df() method returns a Pandas DataFrame. .arrow() returns an Arrow table. .fetchnumpy() returns NumPy arrays.

Arrow Flight SQL#

For distributed systems, Arrow Flight SQL is a high-performance RPC protocol for transferring Arrow data between services. DuckDB does not ship a Flight SQL server itself, but it can participate in Arrow-native pipelines through its ADBC driver, and community projects expose DuckDB behind Flight SQL endpoints:

# This is more advanced — for distributed data pipelines
# where DuckDB acts as a query engine over Arrow streams
import pyarrow.flight as flight

# Serving DuckDB over Flight SQL requires additional setup
# (e.g. the ADBC DuckDB driver or a community Flight SQL wrapper)

DuckDB + Polars via Arrow#

Polars is natively Arrow-based, making the DuckDB integration seamless:

import duckdb
import polars as pl
 
# Create Polars DataFrame
df = pl.read_parquet('events.parquet')
 
# Query with DuckDB
result = duckdb.sql("""
    SELECT 
        user_id,
        SUM(value) AS total
    FROM df
    GROUP BY user_id
""").pl()  # Returns Polars DataFrame
 
print(result.head())

The .pl() method returns a Polars DataFrame via Arrow — zero-copy from DuckDB to Polars.

DuckDB + Ray via Arrow#

For distributed machine learning pipelines:

import ray
import duckdb
import pyarrow as pa
 
@ray.remote
def analyze_partition(parquet_path: str) -> pa.Table:
    con = duckdb.connect()
    return con.execute(f"""
        SELECT user_id, SUM(revenue) AS ltv
        FROM read_parquet('{parquet_path}')
        GROUP BY user_id
    """).arrow()
 
# Distribute across Ray cluster
results = ray.get([
    analyze_partition.remote(f's3://bucket/shard_{i}.parquet')
    for i in range(10)
])
 
# Combine Arrow tables
combined = pa.concat_tables(results)

Arrow Dataset Integration#

PyArrow Datasets provide lazy loading — DuckDB can query them with predicate pushdown:

import pyarrow.dataset as ds
import duckdb
 
# Create a Dataset (lazy — doesn't load data yet)
dataset = ds.dataset('s3://my-bucket/events/', format='parquet')
 
# DuckDB queries the dataset with pushdown
result = duckdb.sql("""
    SELECT 
        user_id,
        SUM(revenue) AS total
    FROM dataset
    WHERE occurred_at > '2026-01-01'
    GROUP BY user_id
""").arrow()

DuckDB pushes the WHERE occurred_at > '2026-01-01' filter down to the Parquet file reads — only relevant row groups are read.

Performance Benchmarks#

On a 10M row dataset:

| Operation | Approach | Time |
| --- | --- | --- |
| Pandas groupby | Pure Pandas | 8.2s |
| DuckDB on Pandas (copy) | con.register() | 0.6s |
| DuckDB on Arrow (zero-copy) | Direct Arrow read | 0.4s |
| DuckDB on Parquet | read_parquet() | 0.5s |

The zero-copy Arrow path is consistently the fastest for large DataFrames because it avoids both the Pandas overhead and the copy cost.

DenchClaw and Arrow#

DenchClaw uses DuckDB natively, and the App Builder supports returning query results as Arrow tables for use in analytical apps:

// In a DenchClaw app
const arrowData = await dench.db.queryArrow(`
    SELECT "Stage", SUM("Deal Value") AS pipeline_value
    FROM v_deals
    GROUP BY "Stage"
`);
 
// Use with Apache Arrow JS library
import { tableFromIPC } from 'apache-arrow';
const table = tableFromIPC(arrowData);

This is particularly useful for building high-performance data visualizations where you need to process millions of data points in the browser.

Frequently Asked Questions#

Does DuckDB's Arrow integration work with R?#

Yes. The duckdb R package supports Arrow via the arrow R package. Use to_arrow() to convert DuckDB results to Arrow.

What's the difference between .df() and .arrow() in Python?#

.df() returns a Pandas DataFrame, which typically involves converting the result into NumPy-backed columns. .arrow() returns a PyArrow Table without that conversion. Use .arrow() when performance matters or when passing results to other Arrow-native tools.

Can I write Arrow tables to DuckDB?#

Yes: con.execute("INSERT INTO mytable SELECT * FROM arrow_table") where arrow_table is a PyArrow Table in scope.

Is zero-copy truly zero bytes copied?#

For read operations, yes — DuckDB reads the Arrow buffer in place. For write operations back to Arrow, DuckDB constructs a new Arrow buffer, which requires some allocation but avoids deserialization overhead.

Does Arrow support all DuckDB data types?#

Most types map cleanly. A few, such as HUGEINT and ENUM, are converted to the closest Arrow equivalent rather than mapped one-to-one, so check the resulting schema when exact types matter.

Ready to try DenchClaw? Install in one command: npx denchclaw. Full setup guide →


Written by

Mark Rachapoom

Building the future of AI CRM software.


© 2026 DenchHQ · San Francisco, CA