3 Simple Ways To Build dbt Python Models

Jim Kutz
December 19, 2025


dbt Python models give data teams a way to go beyond SQL without breaking the structure and guarantees that make dbt reliable. They run directly inside your warehouse, stay fully connected to lineage, tests, and documentation, and let you apply logic that would be painful or unreadable in pure SQL.

That flexibility comes with real constraints. Python models are not a replacement for SQL, and they cannot reach outside your warehouse to pull data from APIs or SaaS tools. Used carelessly, they add cost, slow down pipelines, and complicate maintenance. Used intentionally, they unlock powerful patterns that stay clean, testable, and production-ready.

This guide walks through three practical ways to build dbt Python models: simple in-warehouse transformations, custom logic that SQL struggles with, and workflows that combine dbt with external data using PyAirbyte. 

TL;DR: Ways to Build dbt Python Models at a Glance

  • dbt Python models let you apply Python-based transformations inside your warehouse while staying fully integrated with dbt’s DAG, lineage, tests, and docs.

  • They are best used when SQL becomes hard to read or impossible to express, not as a replacement for SQL-first modeling.

  • Python models cannot pull data from external APIs or SaaS tools on their own. They only work with data already in the warehouse.

  • To work with external data, use PyAirbyte to extract and load data first, then transform it with dbt Python models.

  • Keep Python models small, focused, and intentional to control cost, performance, and maintenance overhead. 

What You Need Before You Start

Python models execute in your warehouse runtime, not locally, so you need the right technical setup before adding them to your dbt project. Before diving into implementation, verify these requirements:

  • Supported data warehouse: Snowflake with Snowpark, Databricks with PySpark, or BigQuery (which runs Python models through Dataproc or, more recently, BigQuery DataFrames). These are the platforms with Python model support in the dbt adapters, and your model code always executes remotely on the platform, whether you use dbt Cloud or dbt Core.
  • Compatible dbt adapter: To use dbt with Snowflake, Databricks, or BigQuery, install dbt-core and then the corresponding adapter (dbt-snowflake, dbt-databricks, or dbt-bigquery) with pip. Each adapter provides platform-specific integration for dbt Core.
  • Python version: Python 3.8+ on your development machine, matching what's supported by both dbt Core and your adapter.
  • dbt fundamentals: You should already know dbt.ref(), dbt.source(), materializations, and basic staging → marts patterns before mixing Python into your DAG.
  • Python skills: Basic DataFrame operations and familiarity with your warehouse's API — Snowpark's .filter(), PySpark functions, or BigQuery DataFrames.
  • Materialization limits: Python models only support table or incremental materializations, not view or ephemeral. Plan your storage and refresh strategy accordingly.

1. How to Build a Basic dbt Python Model

A dbt Python model is just a .py file that lives alongside your SQL models. You keep everything in one DAG and one repository, organized by the same layers you already know: 

  • Staging
  • Intermediate
  • Marts

When you open models/ you'll see a familiar structure. Staging handles raw source cleanup, intermediate manages business logic in flight, and marts delivers analytics-ready tables. You create a Python model by placing a properly structured .py file in your models folder; dbt discovers it and runs it during dbt run like any SQL model.

In dbt, a Python model is a function named model(dbt, session) that returns a DataFrame. There is no decorator, and helpers like ref() and source() are not standalone functions; they are exposed on the dbt object that dbt passes in (dbt.ref(), dbt.source(), dbt.config()).

You pull upstream data with dbt.ref("upstream_model"); that call returns a DataFrame native to your platform. After your transformations you must return one DataFrame so dbt can materialize it as a table or incremental model. You can set materialization inline with dbt.config(materialized="table"), using a literal value so dbt's static parser can understand it.

Here's a minimal Snowpark example that filters orders to the last 30 days and selects only the columns you care about. The same pattern works in PySpark or BigQuery DataFrames with API-specific syntax tweaks:

# models/staging/stg_orders_recent.py
from datetime import date, timedelta

def model(dbt, session):
    dbt.config(materialized="table")         # physical table in the warehouse

    orders_df = dbt.ref("stg_orders")        # Snowpark DataFrame
    cutoff = date.today() - timedelta(days=30)

    recent_df = (
        orders_df
        .filter(orders_df["order_date"] >= cutoff)
        .select("order_id", "customer_id", "order_total", "order_date")
    )

    return recent_df

Run it the same way you run any model:

dbt run --select stg_orders_recent

dbt compiles the Python into a stored procedure (or Spark job), executes it in the warehouse, and writes a new table you can query immediately.

How to Reference Upstream Models

The same dbt.ref() and dbt.source() calls you use in SQL work in Python. When you call dbt.ref("stg_orders"), dbt injects a dependency edge in the DAG, guaranteeing that model finishes before yours runs. If the data lives outside dbt and is declared in sources.yml, call dbt.source("sales", "orders_raw") instead, just as you would in SQL.
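
For example, here is a minimal sketch of a Python model that reads a declared source rather than an upstream model. The "sales" and "orders_raw" names match the example above, and the drop_duplicates() call is purely illustrative cleanup:

# models/staging/stg_orders_raw.py
def model(dbt, session):
    dbt.config(materialized="table")

    # Table declared under the "sales" source in sources.yml
    raw_orders = dbt.source("sales", "orders_raw")

    # Light, illustrative cleanup before downstream models take over
    return raw_orders.drop_duplicates()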

Python models cannot point at ephemeral models because those never materialize into physical tables. Stick to refs that resolve to tables, views, or incremental models. Combining multiple refs is straightforward:

def model(dbt, session):
    orders = dbt.ref("stg_orders")
    customers = dbt.ref("stg_customers")

    joined = orders.join(customers, orders["customer_id"] == customers["customer_id"])
    return joined

Whether those upstream models are SQL or Python doesn't matter. dbt still handles the execution order and lineage for you.

2. How to Build a dbt Python Model With Custom Logic

Python models shine when you bump into logic that is painful or outright impossible to express in pure SQL. By writing transformation steps in Python, you keep everything inside the familiar dbt DAG while gaining access to libraries and control flow that a query engine cannot provide.

You'll reach for this pattern when you need complex business rules, statistical calculations, or feature engineering that relies on Python packages. 

Inside the .py file you still define a single model(dbt, session) function. dbt.ref() returns the DataFrame type native to your platform (a Snowpark DataFrame on Snowflake, a PySpark DataFrame on Databricks, and so on), and you can convert it to pandas, for example with .to_pandas() in Snowpark, when a library requires it. From there you chain Python transformations much as you would in pandas.

Typical tasks include filling nulls, creating derived columns, vectorized math, or branching if/else logic that would balloon into nested CASE statements in SQL:

# models/orders__scored.py
import snowflake.snowpark.functions as F

def model(dbt, session):
    # materialize as a table so downstream marts query quickly
    dbt.config(materialized="table")

    # Load upstream orders staging model
    orders_df = dbt.ref("stg_orders")

    # 1. Flag high-value orders using Pythonic conditional logic
    orders_df = orders_df.with_column(
        "is_high_value",
        F.when(orders_df["amount_usd"] > 500, F.lit(True)).otherwise(F.lit(False))
    )

    # 2. Replace missing promo codes with a sentinel value
    orders_df = orders_df.fillna({"promo_code": "UNKNOWN"})

    # 3. Derive an ML-friendly bucketed amount feature
    orders_df = orders_df.with_column(
        "amount_bucket",
        F.floor(orders_df["amount_usd"] / 100)
    )

    return orders_df

Running dbt run --select orders__scored ships the function to Snowflake, where it executes via Snowpark, then materializes the result as a physical table you can test and document like any other model.

Python execution is slower and billed differently than SQL because it relies on general-purpose compute rather than the warehouse's optimized query engine. Mitigate that overhead with a few strategic approaches:

  • Filter early by retrieving only the columns and rows you need before heavy Python logic. 
  • Stay vectorized by preferring DataFrame operations over row-by-row .apply() loops, which are costly on Spark and Snowpark. 
  • Push joins upstream by letting an SQL model handle large joins or window functions; feed Python only the refined subset it must touch. 
  • Use incremental awareness by wrapping filters in dbt.is_incremental() to process new rows only, cutting runtime and cost (see the sketch after this list).
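
As a rough sketch of that last point, here is how an incremental Python model might limit work to new rows on Snowpark. The order_date column, the unique_key, and the model name are assumptions for illustration, and incremental options vary by adapter:

# models/orders__incremental.py
from snowflake.snowpark import functions as F

def model(dbt, session):
    dbt.config(materialized="incremental", unique_key="order_id")

    orders_df = dbt.ref("stg_orders")

    if dbt.is_incremental():
        # dbt.this points at this model's existing table in the warehouse
        existing = session.table(str(dbt.this))
        max_loaded = existing.select(F.max("order_date")).collect()[0][0]
        if max_loaded is not None:
            # Touch only rows newer than what has already been loaded
            orders_df = orders_df.filter(orders_df["order_date"] > max_loaded)

    return orders_df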

Code is platform-specific: a Snowpark with_column becomes withColumn in PySpark, and BigQuery DataFrames have their own syntax. Keep your project consistent by standardizing on one adapter-specific style or abstracting helpers when you must support multiple warehouses.
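
For comparison, here is a rough PySpark version of the earlier scoring model as it might look on dbt-databricks. Treat it as a sketch rather than a drop-in file; the column names simply carry over from the Snowpark example:

# models/orders__scored.py  (PySpark variant)
from pyspark.sql import functions as F

def model(dbt, session):
    dbt.config(materialized="table")

    orders_df = dbt.ref("stg_orders")   # PySpark DataFrame on Databricks

    return (
        orders_df
        .withColumn(
            "is_high_value",
            F.when(F.col("amount_usd") > 500, F.lit(True)).otherwise(F.lit(False)),
        )
        .fillna({"promo_code": "UNKNOWN"})
        .withColumn("amount_bucket", F.floor(F.col("amount_usd") / 100))
    )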

3. How to Build dbt Python Models From External Data

dbt Python models can only work with data that already exists in your warehouse. They cannot pull data directly from SaaS tools, APIs, or third-party systems.

PyAirbyte fills that gap. It handles extraction and loading from external sources, while dbt remains responsible for transformation. PyAirbyte wraps Airbyte’s 600+ connectors and loads external data into standard warehouse tables that dbt can immediately access.

The flow is simple. PyAirbyte syncs data from an external source into a raw schema, such as raw.github_repos, and generates matching sources.yml entries. Your dbt project can then reference those tables using dbt.source() inside a Python model, apply business logic, and return a DataFrame for materialization.
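
As a hedged sketch of that flow, the load step might look roughly like the snippet below using PyAirbyte's get_source() and read() calls, followed by a dbt Python model that picks the data up with dbt.source(). The connector config keys, stream name, cache parameters, and the "github"/"repositories" source names are illustrative assumptions; check the PyAirbyte docs for the exact options your warehouse needs:

# load_github.py (runs outside dbt, before dbt run)
import airbyte as ab
from airbyte.caches import SnowflakeCache  # assumed cache class; swap for your warehouse

source = ab.get_source(
    "source-github",
    config={
        "repositories": ["airbytehq/airbyte"],                 # illustrative repo
        "credentials": {"personal_access_token": "<token>"},   # illustrative auth shape
    },
    install_if_missing=True,
)
source.select_streams(["repositories"])

# Land the records in a raw schema that dbt can see
cache = SnowflakeCache(
    account="<account>", username="<user>", password="<password>",
    database="ANALYTICS", warehouse="TRANSFORMING", role="LOADER", schema_name="raw",
)
source.read(cache=cache)

Once the raw table exists and is declared in sources.yml, the dbt side stays ordinary:

# models/staging/stg_github_repos.py
def model(dbt, session):
    dbt.config(materialized="table")

    # "github" / "repositories" must match the source declared in sources.yml
    repos = dbt.source("github", "repositories")
    return repos.drop_duplicates()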

Because PyAirbyte writes to native warehouse tables, the same pattern works across Snowflake, BigQuery, and Databricks. Ingestion happens upstream, transformation stays inside the dbt DAG, and testing, documentation, and lineage remain unified.

The key principle is separation of concerns. PyAirbyte moves data in. dbt Python models transform it. This avoids brittle custom scripts, keeps lineage intact in dbt Docs, and ensures new records flow cleanly from source to analytics without leaving the warehouse.

How to Keep Models Fast and Maintainable

Move Python into your dbt project only when it adds real value. When you do, these habits keep performance predictable and code easy to reason about:

  1. Keep Python models small and focused by splitting complex logic across several models so each file handles a single responsibility. Smaller units are easier to debug and test.
  2. Let SQL handle heavy aggregation by pushing joins, window functions, and large group-bys to upstream SQL models. Save Python for calculations that SQL can't express cleanly.
  3. Avoid row-by-row processing by replacing .apply() loops with vectorized DataFrame methods to prevent massive slowdowns.
  4. Optimize incremental patterns by wrapping filters in dbt.is_incremental() so each run touches only new or changed rows. This reduces processing time and warehouse costs.
  5. Test Python model outputs by applying unique, not_null, and accepted_values tests in schema.yml. Because generic tests run against the materialized table, they work for Python models just as they do for SQL models.
  6. Document with YAML by keeping column descriptions and configs in schema.yml files; dbt reads documentation from YAML, not from comments inside the Python file.
  7. Be aware of platform costs since inefficient warehouse-side Python code can consume more resources than optimized SQL. Benchmark critical models and size resources accordingly.

What Are Common Mistakes to Avoid?

Even experienced data engineers stumble when first introducing Python models into their dbt projects. Watch out for these common pitfalls that can slow down your pipelines or create maintenance headaches:

  • Using Python where SQL is faster: Your warehouse optimizes SQL queries but runs Python as general compute. Stick with SQL for joins and aggregations—save Python for the logic that actually needs it.
  • Ignoring warehouse execution limits: Python models hit timeout and memory walls fast. Check your platform's limits before that 4-hour feature engineering job crashes your entire DAG.
  • Trying to fetch APIs inside Python models: Don't pull external data directly in dbt models. Use PyAirbyte to land raw data first, then transform it. Mixing extract and transform logic creates brittle dependencies.
  • Using print() for debugging: The warehouse swallows print statements, so you'll be debugging blind. Write diagnostic rows to a temp table (see the sketch after this list) or check your platform's execution logs instead.
  • Copy-pasting logic between models: You can't import functions between Python models yet. Repeated code drifts over time—package shared utilities separately where your runtime can actually import them.
  • Materializing as view or ephemeral: Python models support only table and incremental materializations. Depending on the adapter, configuring view or ephemeral either fails the build or is overridden to a table, so declare the materialization explicitly and plan storage accordingly.
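
For the debugging point above, one minimal Snowpark pattern is to persist a small sample of suspect rows as a scratch table instead of printing. The scratch table name and the filter condition are arbitrary placeholders:

# Inside a Python model, alongside the real transformation logic
def model(dbt, session):
    dbt.config(materialized="table")

    orders_df = dbt.ref("stg_orders")

    # Persist up to 100 suspect rows to a scratch table you can query later
    suspicious = orders_df.filter(orders_df["order_total"] < 0)
    suspicious.limit(100).write.mode("overwrite").save_as_table("analytics.scratch.debug_negative_orders")

    return orders_df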

When Do dbt Python Models Actually Make Sense?

dbt Python models are a powerful extension to SQL-first analytics, but only when used with intention. They work best for transformations that are awkward or unreadable in SQL, while still benefiting from dbt’s DAG, lineage, testing, and documentation. Treating them as a default choice or as a way to bypass SQL usually leads to slower runs, higher costs, and harder-to-maintain projects.

The key is separation of concerns. Let your warehouse and SQL handle heavy joins, aggregations, and filtering. Use Python models for focused logic that truly needs Python. When external data is involved, land it in the warehouse first with PyAirbyte, then transform it inside dbt so everything remains observable, testable, and reliable.

If you apply Python selectively and keep models small, dbt Python models become a practical tool rather than a liability, helping teams solve real problems without compromising the structure that makes dbt effective.

Ready to land external data in your warehouse before transforming it in dbt? Try Airbyte and connect your first data source in minutes with PyAirbyte or 600+ pre-built connectors.

Frequently Asked Questions 

How do I build my first dbt Python model?

Create a .py file in your models/ directory, define a model(dbt, session) function, set a supported materialization, reference upstream models with dbt.ref(), and return a single DataFrame.

What warehouses support dbt Python models?

dbt Python models are supported on Snowflake, Databricks, and BigQuery, using each platform's native DataFrame API and remote compute (Snowpark, PySpark, and Dataproc or BigQuery DataFrames).

Can I mix SQL and Python models in the same dbt project?

Yes. SQL and Python models can reference each other using dbt.ref(), and dbt will manage execution order and lineage automatically.

What’s the best way to work with external data in dbt Python models?

External data must be loaded into your warehouse first. Tools like PyAirbyte handle extraction and loading, after which dbt Python models can transform that data like any other source.
