Your pipeline is green. No errors, no warnings, no angry Slack messages. The daily job runs, the MERGE statement completes, row counts look stable.

Except that right now, somewhere in your dimension table, real data is being silently overwritten by rows that have nothing to do with it.

This is the story of how an integer overflow on a surrogate key turned a routine MERGE into a data destruction machine — and how I found out way too late.


The setup: nothing fancy, nothing suspicious

The pipeline was straightforward. A dimension table in Databricks, fed daily by new incoming transactions. Each new row needed a surrogate key, generated with a classic pattern:

SELECT
    (SELECT COALESCE(MAX(sk_transaction_id), 0) FROM dim_transaction)
        + ROW_NUMBER() OVER (ORDER BY transaction_date) AS sk_transaction_id,
    transaction_ref,
    account_id,
    transaction_date,
    amount,
    currency
FROM staging_transactions
WHERE transaction_ref NOT IN (SELECT transaction_ref FROM dim_transaction)

Then the MERGE:

MERGE INTO dim_transaction AS target
USING new_transactions AS source
ON target.sk_transaction_id = source.sk_transaction_id
WHEN MATCHED THEN UPDATE SET
    target.transaction_ref  = source.transaction_ref,
    target.account_id       = source.account_id,
    target.transaction_date = source.transaction_date,
    target.amount           = source.amount,
    target.currency         = source.currency
WHEN NOT MATCHED THEN INSERT *

Nothing exotic. You’ve probably written something like this. I certainly had — dozens of times.


The day the numbers went negative

The sk_transaction_id column was defined as INT. In Spark SQL, that’s a 32-bit signed integer. Max value: 2,147,483,647.

When MAX(sk_transaction_id) + ROW_NUMBER() crossed that threshold, Spark didn’t raise an error. It didn’t log a warning. It wrapped around to negative values.
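The wraparound is easy to reproduce outside Spark. A minimal Python sketch, using ctypes.c_int32 to emulate the same 32-bit signed arithmetic Spark's INT uses:

```python
from ctypes import c_int32

INT_MAX = 2_147_483_647  # max value of a 32-bit signed integer, like Spark's INT

# Python itself has arbitrary-precision ints, so we truncate to 32 bits
# the way Spark does: one step past the ceiling lands at the minimum.
wrapped = c_int32(INT_MAX + 1).value
print(wrapped)  # -2147483648
```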

New surrogate keys started looking like this:

sk_transaction_id | transaction_ref | amount
------------------+-----------------+---------
      -2147483642 | TXN-9930012     | 1,250.00
      -2147483641 | TXN-9930013     |   870.50
      -2147483640 | TXN-9930014     | 3,100.00

Negative primary keys. No error. Pipeline green.


Why it got worse: the MERGE that overwrites

Here’s the real damage. Those negative IDs weren’t just weird — they already existed in the target table.

The MERGE statement matched on sk_transaction_id. It found existing rows with those IDs. So instead of inserting new records, it updated existing ones.

Legitimate transaction records were silently overwritten with completely unrelated data. The row count stayed the same. The job reported success. Nobody noticed.

The pipeline didn’t lose data. It replaced data. That’s worse — there’s no obvious signal that anything happened.
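The overwrite mechanics can be simulated in a few lines of plain Python: model the target table as a dict keyed by surrogate key and apply the matched/not-matched branches by hand. This is a toy sketch of the MERGE semantics, not Delta's implementation:

```python
# Target dimension keyed by surrogate key; one legitimate old row
target = {-2147483648: {"transaction_ref": "TXN-OLD-001", "amount": 500.00}}

# A "new" row whose generated key wrapped around onto an existing old key
source = [(-2147483648, {"transaction_ref": "TXN-NEW-003", "amount": 3100.00})]

for sk, row in source:
    if sk in target:
        target[sk] = row  # WHEN MATCHED THEN UPDATE: silently overwrites
    else:
        target[sk] = row  # WHEN NOT MATCHED THEN INSERT

# Row count unchanged, but TXN-OLD-001 is gone
print(len(target), target[-2147483648]["transaction_ref"])  # 1 TXN-NEW-003
```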


Reproducing the bug

Don’t take my word for it. Open a Spark SQL notebook and try this.

1. Create a dimension table near the INT limit

CREATE OR REPLACE TEMP VIEW dim_transaction AS
SELECT * FROM VALUES
    (-2147483648, 'TXN-OLD-001', 'ACC-001', '2025-01-15', 500.00,  'EUR'),
    (-2147483647, 'TXN-OLD-002', 'ACC-002', '2025-02-20', 750.00,  'EUR'),
    ( 2147483643, 'TXN-RECENT-1','ACC-099', '2025-11-15', 200.00,  'EUR'),
    ( 2147483644, 'TXN-RECENT-2','ACC-100', '2025-12-01', 300.00,  'EUR'),
    ( 2147483645, 'TXN-LATEST',  'ACC-101', '2025-12-15', 450.00,  'EUR')
AS t(sk_transaction_id, transaction_ref, account_id, transaction_date, amount, currency);

The table has a few rows near the INT ceiling — and some old rows down at -2147483648. That detail matters.

2. Generate new keys with the overflow pattern

SELECT
    CAST(
        (SELECT MAX(sk_transaction_id) FROM dim_transaction)
        + ROW_NUMBER() OVER (ORDER BY transaction_date)
    AS INT) AS sk_transaction_id,
    transaction_ref,
    account_id,
    transaction_date,
    amount,
    currency
FROM VALUES
    ('TXN-NEW-001', 'ACC-200', '2026-01-05', 1250.00, 'EUR'),
    ('TXN-NEW-002', 'ACC-201', '2026-01-06', 870.50,  'EUR'),
    ('TXN-NEW-003', 'ACC-202', '2026-01-07', 3100.00, 'EUR')
AS t(transaction_ref, account_id, transaction_date, amount, currency);

3. Check the result

sk_transaction_id | transaction_ref
------------------+----------------
        2147483646 | TXN-NEW-001
        2147483647 | TXN-NEW-002
       -2147483648 | TXN-NEW-003

Two rows look fine. The third one wrapped to -2147483648 — which is the exact minimum value of a 32-bit signed integer. And it happens to match an existing row in the dimension table.

That’s the insidious part: not every row overflows. Only the rows that cross the threshold do, often a single row in a batch. In a real pipeline processing thousands of records, you’d never spot it by eyeballing the output.
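The same partial-overflow pattern shows up if you generate a small batch of keys near the limit, again emulating 32-bit arithmetic with ctypes.c_int32 (the starting max matches the reproduction above):

```python
from ctypes import c_int32

current_max = 2_147_483_645  # MAX(sk_transaction_id) in the dimension table

# ROW_NUMBER() values 1..3 added to the current max, truncated to 32 bits:
# only the key that crosses the ceiling wraps.
keys = [c_int32(current_max + n).value for n in (1, 2, 3)]
print(keys)  # [2147483646, 2147483647, -2147483648]
```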

Note on ANSI mode: By default, Spark wraps silently on integer overflow. If spark.sql.ansi.enabled = true, it throws an ArithmeticException instead. Check your Databricks cluster settings — most environments I’ve worked with have ANSI mode off.


Why nobody noticed

This is the part that stings the most. There were zero signals:

No exception. Spark treats integer overflow as a valid arithmetic operation. It wraps. It doesn’t care.

Stable row counts. The MERGE was updating instead of inserting, so the table size barely changed. No spike, no drop.

No schema violation. The column is INT, the value is INT. Everything looks correct to Spark’s type system.

No business alerts. Unless someone queried a specific transaction and noticed the data didn’t match, there was no reason to suspect anything.

The bug was discovered when a downstream report showed financial figures that didn’t reconcile. The investigation traced it back to dimension records that had been overwritten — some of them weeks earlier.


The guardrails that would have caught it

Every single one of these checks is simple. None of them existed in the original pipeline.

1. Assert no negative surrogate keys

The simplest possible check. If your keys should always be positive, verify it.

SELECT COUNT(*) AS negative_key_count
FROM new_transactions
WHERE sk_transaction_id < 0

If this returns anything other than zero, stop the pipeline.

2. Check key monotonicity

New keys should always be greater than the current max. If they’re not, something broke.

SELECT
    (SELECT MAX(sk_transaction_id) FROM dim_transaction) AS current_max,
    MIN(sk_transaction_id) AS new_min
FROM new_transactions
HAVING MIN(sk_transaction_id) <=
       (SELECT MAX(sk_transaction_id) FROM dim_transaction)

This query should return no rows. If it does, your key generation is compromised.
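Both checks (no negatives, strictly above the current max) also fit in a few lines of driver-side Python if you'd rather fail the job programmatically. The function name and inputs here are illustrative; you'd feed it the generated keys and the table's current max:

```python
def assert_keys_valid(new_keys, current_max):
    """Fail the pipeline if generated surrogate keys are suspect."""
    if any(k < 0 for k in new_keys):
        raise ValueError("negative surrogate keys detected: possible overflow")
    if min(new_keys) <= current_max:
        raise ValueError("new keys do not exceed the current max")

# A healthy batch passes silently
assert_keys_valid([2147483646, 2147483647], current_max=2147483645)
```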

3. Validate MERGE outcomes

After the MERGE, check what actually happened. In Databricks, Delta Lake tracks operation metrics:

DESCRIBE HISTORY dim_transaction LIMIT 1

Inspect operationMetrics. If numTargetRowsUpdated is greater than zero when you only expected inserts — you have a problem.
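This check can be automated in the job itself. The sketch below validates a metrics dict of the shape Delta returns: numTargetRowsUpdated and numTargetRowsInserted are real Delta operation metrics (reported as strings), but the values and the helper are illustrative:

```python
def check_merge_metrics(operation_metrics):
    """Raise if a MERGE that should be insert-only updated any rows."""
    updated = int(operation_metrics.get("numTargetRowsUpdated", 0))
    if updated > 0:
        raise RuntimeError(
            f"MERGE updated {updated} rows in an insert-only load: "
            "surrogate keys likely collided"
        )

# Example metrics for a healthy insert-only MERGE (values made up)
check_merge_metrics({"numTargetRowsInserted": "3", "numTargetRowsUpdated": "0"})
```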

4. Use BIGINT

BIGINT gives you a 64-bit signed integer: max value 9,223,372,036,854,775,807. That’s 9.2 quintillion.

CREATE TABLE dim_transaction (
    sk_transaction_id BIGINT,
    ...
)

The storage cost difference between INT and BIGINT is 4 bytes per row. On a table with 10 million rows, that’s 40 MB. Your data integrity is worth more than 40 MB.

5. Reconsider MAX(id) + ROW_NUMBER()

This pattern is fragile. It relies on the current max being correct, on the type being large enough, and on no concurrent writes happening. Any of these assumptions can break.

Alternatives worth considering:

Approach                           | Pros                      | Cons
-----------------------------------+---------------------------+-------------------------------
monotonically_increasing_id()      | Guaranteed unique per job | Not sequential, can look messy
UUID / hash-based keys             | No collision risk         | Larger, not sortable
Delta GENERATED ALWAYS AS IDENTITY | Clean, managed by Delta   | Databricks-specific

Each has trade-offs, but all of them are more robust than manually computing the next integer.
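As an illustration of the hash-based option, one common sketch derives a 64-bit key from the natural key by truncating a SHA-256 digest. The helper below is hypothetical, not from the original pipeline; collision risk is astronomically small but not zero, and there is no overflow to wrap because the key can never exceed the BIGINT range:

```python
import hashlib

def surrogate_key(natural_key: str) -> int:
    """Deterministic positive 63-bit key derived from the natural key."""
    digest = hashlib.sha256(natural_key.encode("utf-8")).digest()
    # Take 8 bytes and clear the sign bit so the key fits a signed BIGINT
    return int.from_bytes(digest[:8], "big") & 0x7FFF_FFFF_FFFF_FFFF

key = surrogate_key("TXN-NEW-001")
print(key)  # same input always maps to the same key, no MAX() lookup needed
```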


The lesson Spark won’t teach you

Spark is a compute engine. It processes your data exactly as you tell it to. It won’t second-guess your types, flag suspicious values, or refuse to overwrite good data with bad.

That’s your job.

Every pipeline should answer three questions before writing to a target table:

  1. Are the keys valid? — positive, unique, monotonically increasing
  2. Does the MERGE do what I expect? — inserts when expected, updates only when intended
  3. Would I notice if the data was wrong? — downstream reconciliation, row count deltas, value range checks

If you can’t answer yes to all three, your pipeline isn’t done. It just looks done.


If you’re reading this thinking “I should check my surrogate key columns” — yeah, you probably should. Run a SELECT MIN(sk_id) on your dimension tables. If anything comes back negative, we should talk.