Your Python pipeline has no compiler. Nothing stops you from passing a raw, unvalidated dict into a function that expects clean data. Nothing prevents mixing up two IDs that happen to be strings. Nothing catches a None that shouldn’t exist until it blows up three steps later.

Compiled languages catch this class of error before the code runs. Python doesn’t — unless you build that safety net yourself.

Pydantic can be that net. But not the way most data engineers use it.

What You’re Probably Already Doing

If you’ve read Stop Using Dicts for Config — Use Pydantic Instead, you know how useful BaseSettings is for environment-based configuration. And if you’re validating API inputs or ingestion payloads, you’re likely using BaseModel there too.

That’s the right call. But it’s Pydantic at the edges of your system — where data enters from the outside world.

The pipeline itself — the transformations, the intermediate states, the business logic — is still full of dicts, bare strings, and implicit assumptions about what a variable contains at any given point.

That’s the gap this article addresses.

The Problem: Your Intermediate States Are Invisible

Here’s a typical pipeline function signature:

def enrich_transaction(data: dict, account_id: str, ref_id: str) -> dict:
    ...

Three questions Python cannot answer:

  • Is data raw from the CSV, or already validated?
  • Which ID is account_id and which is ref_id? They’re both str.
  • What happens if data["amount"] is None?

The answers live in your head, in a comment, or in a README nobody reads. When you’re onboarding someone, or coming back to this code six months later, you’re guessing.

A compiler would refuse to build this. Python just runs it.

The Fix: Model Your Pipeline States as Distinct Types

The core idea is simple: each stage of your pipeline has its own type. A transformation is not just a function that mutates data — it’s a function that converts one type into another.
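In signature form, the pipeline built below reads as a chain of typed conversions — the signature is the contract:

def validate_transaction(raw: RawTransaction) -> ValidatedTransaction | None: ...
def enrich_transaction(validated: ValidatedTransaction) -> EnrichedTransaction: ...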

Let’s build a pipeline that processes bank export CSV files.

RawTransaction — read, don’t validate

from pydantic import BaseModel

class RawTransaction(BaseModel):
    account_id: str
    amount: str | None
    transaction_date: str
    currency: str
    description: str | None

Everything is a str or None. This model represents exactly what’s in the file — including malformed values. The only thing it guarantees is that the CSV row has the expected columns.

It cannot fail on bad data. That’s intentional.
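Loading rows into this stage is deliberately dumb. A minimal sketch, assuming the CSV header matches the field names (the helper name and file path are illustrative):

import csv

def read_raw_transactions(path: str) -> list[RawTransaction]:
    # No coercion, no business rules: every value stays a string.
    # The only failure mode is a missing column, which surfaces as a
    # missing-field ValidationError.
    with open(path, newline="") as f:
        return [RawTransaction(**row) for row in csv.DictReader(f)]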

ValidatedTransaction — types are enforced, rules are applied

from pydantic import BaseModel, field_validator
from decimal import Decimal
from datetime import date
from enum import Enum

class Currency(str, Enum):
    EUR = "EUR"
    USD = "USD"
    GBP = "GBP"

class AccountId(BaseModel):
    value: str

    @field_validator("value")
    @classmethod
    def must_be_valid_format(cls, v: str) -> str:
        if not v.startswith("ACC") or len(v) != 12:
            raise ValueError(f"Invalid AccountId format: {v}")
        return v

    def __str__(self) -> str:
        return self.value

class ValidatedTransaction(BaseModel):
    account_id: AccountId
    amount: Decimal
    transaction_date: date
    currency: Currency
    description: str | None

This is where Pydantic earns the “compiler” label. A ValidatedTransaction cannot exist in an invalid state. A malformed AccountId, a non-parseable amount, an unknown currency — any of these raises a ValidationError at construction time, not three steps later when you try to write to the database.

The transformation function is explicit about what it rejects:

from pydantic import ValidationError
from decimal import Decimal, InvalidOperation
import structlog

logger = structlog.get_logger()  # the keyword-style log call below assumes structlog

def validate_transaction(raw: RawTransaction) -> ValidatedTransaction | None:
    try:
        return ValidatedTransaction(
            account_id=AccountId(value=raw.account_id),
            amount=Decimal(raw.amount),
            transaction_date=date.fromisoformat(raw.transaction_date),
            currency=Currency(raw.currency),
            description=raw.description
        )
    except (ValidationError, InvalidOperation, ValueError, TypeError) as e:
        # TypeError covers Decimal(None) when the amount column is empty
        logger.warning("Rejected transaction", raw=raw.model_dump(), reason=str(e))
        return None

Bad rows don’t crash the pipeline — they get logged and dropped. The rest continues.
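Wiring the two stages together is then just a filter; a sketch, reusing the read_raw_transactions helper from the earlier sketch:

raw_rows = read_raw_transactions("bank_export.csv")

validated = [
    v for raw in raw_rows
    if (v := validate_transaction(raw)) is not None
]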

EnrichedTransaction — business logic is applied

class EnrichedTransaction(BaseModel):
    account_id: AccountId
    amount: Decimal
    transaction_date: date
    currency: Currency
    description: str | None
    is_debit: bool
    month: int
    abs_amount: Decimal

def enrich_transaction(validated: ValidatedTransaction) -> EnrichedTransaction:
    return EnrichedTransaction(
        **validated.model_dump(),
        is_debit=validated.amount < 0,
        month=validated.transaction_date.month,
        abs_amount=abs(validated.amount)
    )

The signature validated: ValidatedTransaction is a hard contract. Pass a RawTransaction here by mistake and a static type checker such as mypy or pyright rejects the call before the code runs; even without a checker, the mismatch surfaces at the first operation (comparing a string amount to 0 raises TypeError) instead of silently corrupting data.
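To make that concrete, here is what the mistake looks like (the error wording is paraphrased; mypy and pyright phrase it differently):

raw = RawTransaction(
    account_id="ACC123456789",
    amount="-42.50",
    transaction_date="2024-01-15",
    currency="EUR",
    description=None,
)

enrich_transaction(raw)
# static checker: incompatible argument type "RawTransaction",
# expected "ValidatedTransaction"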

Typed Identifiers: One Step Further

AccountId in the example above illustrates a broader pattern worth naming explicitly.

In a real pipeline you might have AccountId, TransactionId, BatchId — all backed by strings or integers. Python treats them as identical types. You can pass one where the other is expected and nothing complains.

def get_transactions(account_id: str, batch_id: str) -> list:
    ...

# Six months later, someone calls it like this:
get_transactions(batch_id, account_id)  # swapped — no error

Typed identifiers eliminate this class of bug:

from pydantic import BaseModel, field_validator

class TransactionId(BaseModel):
    value: int

    @field_validator("value")
    @classmethod
    def must_be_positive(cls, v: int) -> int:
        if v <= 0:
            raise ValueError(f"TransactionId must be positive: {v}")
        return v

class BatchId(BaseModel):
    value: str

def get_transactions(account_id: AccountId, batch_id: BatchId) -> list:
    ...

# Now the swap is caught before the code ever runs
get_transactions(batch_id, account_id)  # type checker: incompatible argument types

Small investment, entire category of bugs eliminated.
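A further refinement worth considering: mark the identifier models frozen so they are immutable and hashable, which lets them serve directly as dict keys or set members. A sketch, assuming Pydantic v2 (validator omitted for brevity):

from pydantic import BaseModel, ConfigDict
from decimal import Decimal

class AccountId(BaseModel):
    model_config = ConfigDict(frozen=True)  # immutable, and therefore hashable

    value: str

# Frozen IDs can key lookups directly; no .value unwrapping needed
balances: dict[AccountId, Decimal] = {}
balances[AccountId(value="ACC123456789")] = Decimal("100.00")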

The Spark Boundary

Pydantic doesn’t go inside Spark. DataFrames work with primitive types — strings, integers, floats. You can’t put an AccountId in a column.

This is not a failure of the pattern. It’s a boundary you make explicit.

validated_transactions: list[ValidatedTransaction] = [...]

# Pydantic → DataFrame (enter Spark)
rows = [
    {
        "account_id": str(t.account_id),
        "amount": float(t.amount),
        "transaction_date": t.transaction_date,
        "currency": t.currency.value,
        "description": t.description,
    }
    for t in validated_transactions
]
df = spark.createDataFrame(rows)

# ... Spark transformations ...

# DataFrame → Pydantic (exit Spark)
enriched = [
    EnrichedTransaction(
        account_id=AccountId(value=row["account_id"]),
        amount=Decimal(str(row["amount"])),
        transaction_date=row["transaction_date"],
        currency=Currency(row["currency"]),
        description=row["description"],
        is_debit=row["is_debit"],
        month=row["month"],
        abs_amount=Decimal(str(row["abs_amount"]))
    )
    for row in df.collect()
]

The “compiled zone” looks like this:

[raw CSV]
     naive parsing
[RawTransaction]         str everywhere, nothing can fail
     validate_transaction()
[ValidatedTransaction]   compiled zone: strong types, business rules
     enrich_transaction()
[EnrichedTransaction]    compiled zone
     .model_dump()
[Spark DataFrame]        primitives, outside the compiled zone
     reconstruction
[EnrichedTransaction]    back in the compiled zone

You know exactly where Pydantic protects you and where it doesn’t. That’s a feature, not a limitation.

What “Compiler” Actually Means Here

To be clear: Pydantic is not a compiler. It doesn’t perform static analysis, it doesn’t check types at import time, and it won’t catch every mistake before you run your code.

What it does — when used this way — is move a whole class of errors from runtime to construction time. Instead of discovering that your pipeline processed 3% of records incorrectly at 3am, you get a ValidationError the moment a malformed object is created.

That’s the compiler analogy. Not perfect — but close enough to be useful.

Use Pydantic at the edges? Good. Use it to model your intermediate pipeline states too? Now you have a safety net that actually spans the pipeline.


The ApplicationContext pattern from Tired of Passing 10 Parameters to Your Functions pairs well with this approach — your typed models flow through the pipeline, your context carries the infrastructure.