DuckDB with dbt: Local Data Transformation
Use dbt with DuckDB for local data transformation pipelines. No cloud warehouse needed — model, test, and document your data transforms entirely locally.
dbt (data build tool) is the standard tool for SQL-based data transformations, and DuckDB is one of its fastest, cheapest backends: no cloud warehouse, no monthly bill, and transforms that run in seconds on your laptop.
This guide sets up a complete dbt + DuckDB pipeline from scratch.
Why dbt + DuckDB?#
The combination makes sense at multiple levels:
For individual analysts: dbt gives you version-controlled, tested SQL transforms. DuckDB gives you fast local execution. Together, you get a production-quality data pipeline that runs on your laptop for free.
For teams: Develop locally with DuckDB, deploy against Snowflake or BigQuery in production. Same dbt models, different backend.
For startups: Skip the $2,000/month data warehouse. Run everything on DuckDB until you genuinely need the scale.
Setup#
# Install dbt with DuckDB adapter
pip install dbt-duckdb
# Verify
dbt --version
Initialize a dbt Project#
# Create new project
dbt init my_analytics
# Navigate to project
cd my_analytics
Configure the DuckDB Profile#
Edit ~/.dbt/profiles.yml:
my_analytics:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: /Users/yourname/data/analytics.duckdb
      threads: 4

    # Production profile (Snowflake example)
    prod:
      type: snowflake
      account: myaccount
      user: analytics_user
      password: "{{ env_var('SNOWFLAKE_PASSWORD') }}"
      role: analytics
      database: analytics
      warehouse: analytics_wh
      schema: public
      threads: 8
# Test connection
dbt debug
Project Structure#
my_analytics/
├── dbt_project.yml
├── profiles.yml (reference only)
├── models/
│   ├── staging/          # Raw data cleaning
│   │   ├── stg_events.sql
│   │   ├── stg_users.sql
│   │   └── sources.yml
│   ├── intermediate/     # Business logic
│   │   ├── int_user_sessions.sql
│   │   └── int_deal_stages.sql
│   └── marts/            # Final analytical tables
│       ├── fct_revenue.sql
│       ├── dim_customers.sql
│       └── rpt_pipeline.sql
├── seeds/                # Static reference data (CSV)
│   └── deal_stage_weights.csv
└── tests/                # Data quality tests
    └── assert_revenue_positive.sql
Building Models#
Staging Model (Raw → Clean)#
-- models/staging/stg_events.sql
{{ config(materialized='view') }}
SELECT
    event_id,
    user_id,
    event_name,
    -- Parse JSON properties
    json_extract_string(properties, '$.page') AS page,
    json_extract_string(properties, '$.feature') AS feature,
    CAST(json_extract(properties, '$.revenue') AS DECIMAL(10,2)) AS revenue,
    occurred_at::TIMESTAMP AS occurred_at,
    DATE_TRUNC('day', occurred_at::TIMESTAMP) AS event_date
FROM {{ source('raw', 'events') }}
WHERE occurred_at IS NOT NULL
  AND user_id IS NOT NULL
Intermediate Model (Business Logic)#
-- models/intermediate/int_user_sessions.sql
{{ config(materialized='table') }}
WITH session_boundaries AS (
    SELECT
        user_id,
        occurred_at,
        event_name,
        -- New session if >30 min since last event
        CASE
            WHEN DATEDIFF('minute',
                     LAG(occurred_at) OVER (PARTITION BY user_id ORDER BY occurred_at),
                     occurred_at) > 30
                 OR LAG(occurred_at) OVER (PARTITION BY user_id ORDER BY occurred_at) IS NULL
            THEN 1 ELSE 0
        END AS is_new_session
    FROM {{ ref('stg_events') }}
),

sessions AS (
    SELECT
        user_id,
        occurred_at,
        SUM(is_new_session) OVER (PARTITION BY user_id ORDER BY occurred_at) AS session_num
    FROM session_boundaries
)

SELECT
    user_id,
    session_num,
    MIN(occurred_at) AS session_start,
    MAX(occurred_at) AS session_end,
    COUNT(*) AS event_count,
    DATEDIFF('minute', MIN(occurred_at), MAX(occurred_at)) AS duration_minutes
FROM sessions
GROUP BY user_id, session_num
Mart Model (Final Analytics Table)#
-- models/marts/fct_revenue.sql
{{ config(
    materialized='incremental',
    unique_key='transaction_id'
) }}

SELECT
    t.id AS transaction_id,
    t.customer_id,
    c.segment AS customer_segment,
    c.country,
    t.type AS transaction_type,
    t.mrr_delta,
    t.arr_delta,
    t.occurred_at,
    DATE_TRUNC('month', t.occurred_at) AS revenue_month,
    DATE_TRUNC('quarter', t.occurred_at) AS revenue_quarter
FROM {{ ref('stg_transactions') }} t
LEFT JOIN {{ ref('dim_customers') }} c ON t.customer_id = c.customer_id

{% if is_incremental() %}
WHERE t.occurred_at >= (SELECT MAX(occurred_at) FROM {{ this }})
{% endif %}
Sources Configuration#
# models/staging/sources.yml
version: 2

sources:
  - name: raw
    schema: main  # DuckDB schema
    tables:
      - name: events
        description: "Raw user events from the application"
      - name: transactions
        description: "Revenue transactions"
        loaded_at_field: occurred_at  # dbt requires this for freshness checks
        freshness:
          warn_after: {count: 24, period: hour}
          error_after: {count: 48, period: hour}
Testing Models#
dbt has built-in tests and supports custom SQL tests:
# models/marts/fct_revenue.yml
version: 2

models:
  - name: fct_revenue
    description: "Daily revenue transactions"
    columns:
      - name: transaction_id
        tests:
          - unique
          - not_null
      - name: mrr_delta
        tests:
          - not_null
      - name: customer_segment
        tests:
          - accepted_values:
              values: ['free', 'pro', 'enterprise']
-- tests/assert_no_future_revenue.sql
-- Custom test: no revenue dated in the future
SELECT COUNT(*) AS failures
FROM {{ ref('fct_revenue') }}
WHERE occurred_at > CURRENT_TIMESTAMP
HAVING failures > 0
# Run all tests
dbt test
# Test specific model
dbt test --select fct_revenue
Running the Pipeline#
# Run all models
dbt run
# Run a specific model and its upstream dependencies
dbt run --select +fct_revenue
# Run in production profile
dbt run --target prod
# Full refresh (rebuild from scratch)
dbt run --full-refresh
# Run models, tests, seeds, and snapshots in one command
dbt build
Documentation#
# Generate docs
dbt docs generate
# Serve locally
dbt docs serve
# Opens at http://localhost:8080
The docs site shows your DAG (model dependency graph), column descriptions, and test results.
DuckDB-Specific Features in dbt#
The dbt-duckdb adapter supports DuckDB-specific features:
-- Use DuckDB's PIVOT syntax in models
{{ config(materialized='table') }}

PIVOT (
    SELECT stage, COUNT(*) AS deals
    FROM {{ ref('stg_deals') }}
    GROUP BY stage
) ON stage
USING SUM(deals)
# dbt Python models (runs in DuckDB via Pandas)
# models/feature_engineering.py
import pandas as pd

def model(dbt, session):
    deals = dbt.ref("stg_deals")
    df = deals.df()
    df['deal_age_days'] = (pd.Timestamp.now() - pd.to_datetime(df['created_at'])).dt.days
    return df
DenchClaw + dbt#
DenchClaw uses DuckDB for all its CRM data. You can run dbt against the DenchClaw workspace database to build analytics layers on top of your CRM:
# profiles.yml
denchclaw_analytics:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: /Users/yourname/.openclaw-dench/workspace/workspace.duckdb
      threads: 2
-- models/staging/stg_deals.sql
SELECT
    id AS deal_id,
    "Company" AS company,
    "Deal Value" AS value,
    "Stage" AS stage,
    "Created At" AS created_at
FROM v_deals
Run dbt run to build analytics models on top of your DuckDB CRM data. The transformed tables live alongside your CRM data in the same database.
Frequently Asked Questions#
Can I use dbt-duckdb in production?#
Yes. The dbt-duckdb adapter is actively maintained and production-ready. Many teams use DuckDB as their production dbt backend.
What's the performance difference between dbt + DuckDB vs dbt + Snowflake?#
For local development on datasets that fit on one machine, DuckDB is commonly 5-20x faster, since there is no network round-trip or warehouse startup overhead. For production with multi-TB datasets, Snowflake wins due to distributed execution.
Can I mix DuckDB and Snowflake in the same dbt project?#
Yes. Use different profiles for dev (DuckDB) and prod (Snowflake). The same SQL models run on both.
Does dbt-duckdb support incremental models?#
Yes. Incremental models work the same as in other adapters.
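Conceptually, an incremental run keeps only source rows newer than the target's high-water mark, then merges them by unique_key. A minimal pure-Python sketch of that mechanic (function and field names are illustrative, not the adapter's internals):

```python
from datetime import datetime

def incremental_upsert(existing, new_batch, unique_key="transaction_id", ts_key="occurred_at"):
    # High-water mark: the MAX(occurred_at) already in the target table
    high_water = max(r[ts_key] for r in existing) if existing else datetime.min
    # The is_incremental() WHERE filter: only rows at or past the mark
    fresh = [r for r in new_batch if r[ts_key] >= high_water]
    # Upsert by unique_key: replayed rows replace, new rows append
    merged = {r[unique_key]: r for r in existing}
    merged.update({r[unique_key]: r for r in fresh})
    return sorted(merged.values(), key=lambda r: r[ts_key])

existing = [{"transaction_id": 1, "occurred_at": datetime(2024, 1, 1)}]
batch = [
    {"transaction_id": 1, "occurred_at": datetime(2024, 1, 1)},  # replayed, deduped
    {"transaction_id": 2, "occurred_at": datetime(2024, 2, 1)},  # new, appended
]
rows = incremental_upsert(existing, batch)
print(len(rows))  # 2
```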
How do I seed data from CSV files in dbt + DuckDB?#
Place CSV files in the seeds/ directory and run dbt seed. dbt loads them as tables in DuckDB automatically.
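A seed such as seeds/deal_stage_weights.csv from the project tree above is just a header row plus data rows. A sketch of the format (the stage/weight columns are assumed, not prescribed by dbt):

```python
import csv, io

# Illustrative content for seeds/deal_stage_weights.csv
seed_csv = """stage,weight
prospecting,0.1
negotiation,0.6
closed_won,1.0
"""

# dbt seed parses the header into column names, roughly like this:
rows = list(csv.DictReader(io.StringIO(seed_csv)))
weights = {r["stage"]: float(r["weight"]) for r in rows}
print(weights["closed_won"])  # 1.0
```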
Ready to try DenchClaw? Install in one command: npx denchclaw. Full setup guide →
