DuckDB with dbt: Local Data Transformation
Use dbt with DuckDB for local data transformation pipelines. No cloud warehouse needed — model, test, and document your data transforms entirely locally.
dbt (data build tool) is the standard tool for SQL-based data transformations, and DuckDB is one of its fastest, cheapest backends: no cloud warehouse, no monthly bill, and transforms that run in seconds on your laptop.
This guide sets up a complete dbt + DuckDB pipeline from scratch.
Why dbt + DuckDB?#
The combination makes sense at multiple levels:
For individual analysts: dbt gives you version-controlled, tested SQL transforms. DuckDB gives you fast local execution. Together, you get a production-quality data pipeline that runs on your laptop for free.
For teams: Develop locally with DuckDB, deploy against Snowflake or BigQuery in production. Same dbt models, different backend.
For startups: Skip the $2,000/month data warehouse. Run everything on DuckDB until you genuinely need the scale.
Setup#
# Install dbt with DuckDB adapter
pip install dbt-duckdb
# Verify
dbt --version
Initialize a dbt Project#
# Create new project
dbt init my_analytics
# Navigate to project
cd my_analytics
Configure the DuckDB Profile#
Edit ~/.dbt/profiles.yml:
my_analytics:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: /Users/yourname/data/analytics.duckdb
      threads: 4

    # Production profile (Snowflake example)
    prod:
      type: snowflake
      account: myaccount
      user: analytics_user
      password: "{{ env_var('SNOWFLAKE_PASSWORD') }}"
      role: analytics
      database: analytics
      warehouse: analytics_wh
      schema: public
      threads: 8
# Test connection
dbt debug
Project Structure#
my_analytics/
├── dbt_project.yml
├── profiles.yml (reference only)
├── models/
│   ├── staging/          # Raw data cleaning
│   │   ├── stg_events.sql
│   │   ├── stg_users.sql
│   │   └── sources.yml
│   ├── intermediate/     # Business logic
│   │   ├── int_user_sessions.sql
│   │   └── int_deal_stages.sql
│   └── marts/            # Final analytical tables
│       ├── fct_revenue.sql
│       ├── dim_customers.sql
│       └── rpt_pipeline.sql
├── seeds/                # Static reference data (CSV)
│   └── deal_stage_weights.csv
└── tests/                # Data quality tests
    └── assert_revenue_positive.sql
Building Models#
Staging Model (Raw → Clean)#
-- models/staging/stg_events.sql
{{ config(materialized='view') }}
SELECT
    event_id,
    user_id,
    event_name,
    -- Parse JSON properties
    json_extract_string(properties, '$.page') AS page,
    json_extract_string(properties, '$.feature') AS feature,
    CAST(json_extract(properties, '$.revenue') AS DECIMAL(10,2)) AS revenue,
    occurred_at::TIMESTAMP AS occurred_at,
    DATE_TRUNC('day', occurred_at::TIMESTAMP) AS event_date
FROM {{ source('raw', 'events') }}
WHERE occurred_at IS NOT NULL
  AND user_id IS NOT NULL
Intermediate Model (Business Logic)#
-- models/intermediate/int_user_sessions.sql
{{ config(materialized='table') }}
WITH session_boundaries AS (
    SELECT
        user_id,
        occurred_at,
        event_name,
        -- New session if >30 min since last event
        CASE
            WHEN DATEDIFF('minute',
                     LAG(occurred_at) OVER (PARTITION BY user_id ORDER BY occurred_at),
                     occurred_at) > 30
                 OR LAG(occurred_at) OVER (PARTITION BY user_id ORDER BY occurred_at) IS NULL
            THEN 1 ELSE 0
        END AS is_new_session
    FROM {{ ref('stg_events') }}
),

sessions AS (
    SELECT
        user_id,
        occurred_at,
        SUM(is_new_session) OVER (PARTITION BY user_id ORDER BY occurred_at) AS session_num
    FROM session_boundaries
)

SELECT
    user_id,
    session_num,
    MIN(occurred_at) AS session_start,
    MAX(occurred_at) AS session_end,
    COUNT(*) AS event_count,
    DATEDIFF('minute', MIN(occurred_at), MAX(occurred_at)) AS duration_minutes
FROM sessions
GROUP BY user_id, session_num
Mart Model (Final Analytics Table)#
-- models/marts/fct_revenue.sql
{{ config(
    materialized='incremental',
    unique_key='transaction_id'
) }}

SELECT
    t.id AS transaction_id,
    t.customer_id,
    c.segment AS customer_segment,
    c.country,
    t.type AS transaction_type,
    t.mrr_delta,
    t.arr_delta,
    t.occurred_at,
    DATE_TRUNC('month', t.occurred_at) AS revenue_month,
    DATE_TRUNC('quarter', t.occurred_at) AS revenue_quarter
FROM {{ ref('stg_transactions') }} t
LEFT JOIN {{ ref('dim_customers') }} c ON t.customer_id = c.customer_id

{% if is_incremental() %}
WHERE t.occurred_at >= (SELECT MAX(occurred_at) FROM {{ this }})
{% endif %}
Sources Configuration#
# models/staging/sources.yml
version: 2

sources:
  - name: raw
    schema: main  # DuckDB schema
    tables:
      - name: events
        description: "Raw user events from the application"
      - name: transactions
        description: "Revenue transactions"
        loaded_at_field: occurred_at  # dbt requires this for freshness checks
        freshness:
          warn_after: {count: 24, period: hour}
          error_after: {count: 48, period: hour}
Testing Models#
dbt has built-in tests and supports custom SQL tests:
# models/marts/fct_revenue.yml
version: 2

models:
  - name: fct_revenue
    description: "Daily revenue transactions"
    columns:
      - name: transaction_id
        tests:
          - unique
          - not_null
      - name: mrr_delta
        tests:
          - not_null
      - name: customer_segment
        tests:
          - accepted_values:
              values: ['free', 'pro', 'enterprise']
-- tests/assert_no_future_revenue.sql
-- Custom test: no revenue dated in the future
SELECT COUNT(*) AS failures
FROM {{ ref('fct_revenue') }}
WHERE occurred_at > CURRENT_TIMESTAMP
HAVING failures > 0
# Run all tests
dbt test
# Test specific model
dbt test --select fct_revenue
Running the Pipeline#
# Run all models
dbt run
# Run a specific model and its upstream dependencies
dbt run --select +fct_revenue
# Run in production profile
dbt run --target prod
# Full refresh (rebuild from scratch)
dbt run --full-refresh
# Run models, tests, seeds, and snapshots in one command
dbt build
Documentation#
# Generate docs
dbt docs generate
# Serve locally
dbt docs serve
# Opens at http://localhost:8080
The docs site shows your DAG (model dependency graph), column descriptions, and test results.
DuckDB-Specific Features in dbt#
The dbt-duckdb adapter supports DuckDB-specific features:
-- Use DuckDB's PIVOT syntax in models
{{ config(materialized='table') }}

PIVOT (
    SELECT stage, COUNT(*) AS deals
    FROM {{ ref('stg_deals') }}
    GROUP BY stage
) ON stage
USING SUM(deals)
# dbt Python models (runs in DuckDB via Pandas)
# models/feature_engineering.py
import pandas as pd

def model(dbt, session):
    deals = dbt.ref("stg_deals")
    df = deals.df()
    df['deal_age_days'] = (pd.Timestamp.now() - pd.to_datetime(df['created_at'])).dt.days
    return df
DenchClaw + dbt#
DenchClaw uses DuckDB for all its CRM data. You can run dbt against the DenchClaw workspace database to build analytics layers on top of your CRM:
# profiles.yml
denchclaw_analytics:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: /Users/yourname/.openclaw-dench/workspace/workspace.duckdb
      threads: 2
-- models/staging/stg_deals.sql
SELECT
    id AS deal_id,
    "Company" AS company,
    "Deal Value" AS value,
    "Stage" AS stage,
    "Created At" AS created_at
FROM v_deals
Run dbt run to build analytics models on top of your DuckDB CRM data. The transformed tables live alongside your CRM data in the same database.
Frequently Asked Questions#
Can I use dbt-duckdb in production?#
Yes. The dbt-duckdb adapter is actively maintained and production-ready. Many teams use DuckDB as their production dbt backend.
What's the performance difference between dbt + DuckDB vs dbt + Snowflake?#
For local development on datasets that fit on one machine, DuckDB is commonly 5-20x faster, since there is no network round-trip or warehouse startup overhead. For production with multi-TB datasets, Snowflake wins due to distributed execution.
Can I mix DuckDB and Snowflake in the same dbt project?#
Yes. Use different profiles for dev (DuckDB) and prod (Snowflake). The same SQL models run on both.
Does dbt-duckdb support incremental models?#
Yes. Incremental models work the same as in other adapters.
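Conceptually, an incremental run keeps only source rows newer than the target's high-water mark, then merges them by unique_key. A minimal pure-Python sketch of that mechanic (function and field names are illustrative, not the adapter's internals):

```python
from datetime import datetime

def incremental_upsert(existing, new_batch, unique_key="transaction_id", ts_key="occurred_at"):
    # High-water mark: the MAX(occurred_at) already in the target table
    high_water = max(r[ts_key] for r in existing) if existing else datetime.min
    # The is_incremental() WHERE filter: only rows at or past the mark
    fresh = [r for r in new_batch if r[ts_key] >= high_water]
    # Upsert by unique_key: replayed rows replace, new rows append
    merged = {r[unique_key]: r for r in existing}
    merged.update({r[unique_key]: r for r in fresh})
    return sorted(merged.values(), key=lambda r: r[ts_key])

existing = [{"transaction_id": 1, "occurred_at": datetime(2024, 1, 1)}]
batch = [
    {"transaction_id": 1, "occurred_at": datetime(2024, 1, 1)},  # replayed, deduped
    {"transaction_id": 2, "occurred_at": datetime(2024, 2, 1)},  # new, appended
]
rows = incremental_upsert(existing, batch)
print(len(rows))  # 2
```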
How do I seed data from CSV files in dbt + DuckDB?#
Place CSV files in the seeds/ directory and run dbt seed. dbt loads them as tables in DuckDB automatically.
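A seed such as seeds/deal_stage_weights.csv from the project tree above is just a header row plus data rows. A sketch of the format (the stage/weight columns are assumed, not prescribed by dbt):

```python
import csv, io

# Illustrative content for seeds/deal_stage_weights.csv
seed_csv = """stage,weight
prospecting,0.1
negotiation,0.6
closed_won,1.0
"""

# dbt seed parses the header into column names, roughly like this:
rows = list(csv.DictReader(io.StringIO(seed_csv)))
weights = {r["stage"]: float(r["weight"]) for r in rows}
print(weights["closed_won"])  # 1.0
```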
Ready to try DenchClaw? Install in one command: npx denchclaw. Full setup guide →
