Your pipeline is green. No errors, no warnings, no angry Slack messages. The daily job runs, the MERGE statement completes, row counts look stable.
Except that right now, somewhere in your dimension table, real data is being silently overwritten by rows that have nothing to do with it.
This is the story of how an integer overflow on a surrogate key turned a routine MERGE into a data destruction machine — and how I found out way too late.
The setup: nothing fancy, nothing suspicious
The pipeline was straightforward. A dimension table in Databricks, fed daily by new incoming transactions. Each new row needed a surrogate key, generated with a classic pattern:
SELECT
(SELECT COALESCE(MAX(sk_transaction_id), 0) FROM dim_transaction)
+ ROW_NUMBER() OVER (ORDER BY transaction_date) AS sk_transaction_id,
transaction_ref,
account_id,
transaction_date,
amount,
currency
FROM staging_transactions
WHERE transaction_ref NOT IN (SELECT transaction_ref FROM dim_transaction)
Then the MERGE:
MERGE INTO dim_transaction AS target
USING new_transactions AS source
ON target.sk_transaction_id = source.sk_transaction_id
WHEN MATCHED THEN UPDATE SET
target.transaction_ref = source.transaction_ref,
target.account_id = source.account_id,
target.transaction_date = source.transaction_date,
target.amount = source.amount,
target.currency = source.currency
WHEN NOT MATCHED THEN INSERT *
Nothing exotic. You’ve probably written something like this. I certainly had — dozens of times.
The day the numbers went negative
The sk_transaction_id column was defined as INT. In Spark SQL, that’s a 32-bit signed integer. Max value: 2,147,483,647.
When MAX(sk_transaction_id) + ROW_NUMBER() crossed that threshold, Spark didn’t raise an error. It didn’t log a warning. It wrapped around to negative values.
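Python integers never overflow, but the wraparound is easy to emulate. A minimal sketch of what Spark's non-ANSI 32-bit arithmetic does (the modular formula maps any integer into the signed 32-bit range):

```python
INT32_MAX = 2**31 - 1  # 2,147,483,647

def wrap_int32(x: int) -> int:
    # Map x into the signed 32-bit range, the way non-ANSI Spark does.
    return ((x + 2**31) % 2**32) - 2**31

print(wrap_int32(INT32_MAX))      # 2147483647 (still fine)
print(wrap_int32(INT32_MAX + 1))  # -2147483648 (silent wrap to INT32_MIN)
```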
New surrogate keys started looking like this:
| sk_transaction_id | transaction_ref | amount |
|---|---|---|
| -2147483642 | TXN-9930012 | 1,250.00 |
| -2147483641 | TXN-9930013 | 870.50 |
| -2147483640 | TXN-9930014 | 3,100.00 |
Negative primary keys. No error. Pipeline green.
Why it got worse: the MERGE that overwrites
Here’s the real damage. Those negative IDs weren’t just weird — they already existed in the target table.
The MERGE statement matched on sk_transaction_id. It found existing rows with those IDs. So instead of inserting new records, it updated existing ones.
Legitimate transaction records were silently overwritten with completely unrelated data. The row count stayed the same. The job reported success. Nobody noticed.
The pipeline didn’t lose data. It replaced data. That’s worse — there’s no obvious signal that anything happened.
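In miniature, a MERGE keyed on the surrogate key behaves like a dict upsert. A Python sketch with illustrative values, showing why the row count hides the damage:

```python
# Existing dimension rows, keyed by surrogate key (illustrative data).
dim = {
    -2147483648: {"transaction_ref": "TXN-OLD-001", "amount": 500.00},
    -2147483647: {"transaction_ref": "TXN-OLD-002", "amount": 750.00},
}

# A new batch whose generated key wrapped around to an existing key.
new_rows = {
    -2147483648: {"transaction_ref": "TXN-NEW-003", "amount": 3100.00},
}

# MERGE semantics on a key match: WHEN MATCHED updates in place,
# WHEN NOT MATCHED inserts. A wrapped key always matches.
for sk, row in new_rows.items():
    dim[sk] = row

# Row count unchanged, but TXN-OLD-001 is gone: replaced, not appended.
print(len(dim))                              # 2
print(dim[-2147483648]["transaction_ref"])   # TXN-NEW-003
```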
Reproducing the bug
Don’t take my word for it. Open a Spark SQL notebook and try this.
1. Create a dimension table near the INT limit
CREATE OR REPLACE TEMP VIEW dim_transaction AS
SELECT * FROM VALUES
(-2147483648, 'TXN-OLD-001', 'ACC-001', '2025-01-15', 500.00, 'EUR'),
(-2147483647, 'TXN-OLD-002', 'ACC-002', '2025-02-20', 750.00, 'EUR'),
( 2147483643, 'TXN-RECENT-1','ACC-099', '2025-11-15', 200.00, 'EUR'),
( 2147483644, 'TXN-RECENT-2','ACC-100', '2025-12-01', 300.00, 'EUR'),
( 2147483645, 'TXN-LATEST', 'ACC-101', '2025-12-15', 450.00, 'EUR')
AS t(sk_transaction_id, transaction_ref, account_id, transaction_date, amount, currency);
The table has a few rows near the INT ceiling — and some old rows down at -2147483648. That detail matters.
2. Generate new keys with the overflow pattern
SELECT
CAST(
(SELECT MAX(sk_transaction_id) FROM dim_transaction)
+ ROW_NUMBER() OVER (ORDER BY transaction_date)
AS INT) AS sk_transaction_id,
transaction_ref,
account_id,
transaction_date,
amount,
currency
FROM VALUES
('TXN-NEW-001', 'ACC-200', '2026-01-05', 1250.00, 'EUR'),
('TXN-NEW-002', 'ACC-201', '2026-01-06', 870.50, 'EUR'),
('TXN-NEW-003', 'ACC-202', '2026-01-07', 3100.00, 'EUR')
AS t(transaction_ref, account_id, transaction_date, amount, currency);
3. Check the result
sk_transaction_id | transaction_ref
------------------+----------------
2147483646 | TXN-NEW-001
2147483647 | TXN-NEW-002
-2147483648 | TXN-NEW-003
Two rows look fine. The third one wrapped to -2147483648 — which is the exact minimum value of a 32-bit signed integer. And it happens to match an existing row in the dimension table.
That’s the insidious part: not every row overflows, just the one in the batch that crosses the threshold. In a real pipeline processing thousands of records, you’d never spot it by eyeballing the output.
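Python's standard library can confirm the wrap point for this exact batch. `ctypes.c_int32` truncates to 32-bit two's complement, mirroring Spark's non-ANSI CAST:

```python
from ctypes import c_int32

current_max = 2_147_483_645  # MAX(sk_transaction_id) in the demo table

# c_int32 wraps out-of-range values modulo 2**32, like non-ANSI Spark.
keys = [c_int32(current_max + n).value for n in (1, 2, 3)]
print(keys)  # [2147483646, 2147483647, -2147483648]
```

Only the third key wraps; the first two are indistinguishable from healthy output.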
Note on ANSI mode: by default, Spark wraps silently on integer overflow. If spark.sql.ansi.enabled = true, it throws an ArithmeticException instead. Check your Databricks cluster settings — most environments I’ve worked with have ANSI mode off.
Why nobody noticed
This is the part that stings the most. There were zero signals:
- No exception. Spark treats integer overflow as a valid arithmetic operation. It wraps. It doesn’t care.
- Stable row counts. The MERGE was updating instead of inserting, so the table size barely changed. No spike, no drop.
- No schema violation. The column is INT, the value is INT. Everything looks correct to Spark’s type system.
- No business alerts. Unless someone queried a specific transaction and noticed the data didn’t match, there was no reason to suspect anything.
The bug was discovered when a downstream report showed financial figures that didn’t reconcile. The investigation traced it back to dimension records that had been overwritten — some of them weeks earlier.
The guardrails that would have caught it
Every single one of these checks is simple. None of them existed in the original pipeline.
1. Assert no negative surrogate keys
The simplest possible check. If your keys should always be positive, verify it.
SELECT COUNT(*) AS negative_key_count
FROM new_transactions
WHERE sk_transaction_id < 0
If this returns anything other than zero, stop the pipeline.
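In a notebook-driven pipeline, the same guardrail can be a hard assertion before the MERGE. A sketch, assuming you’ve collected the generated keys into a Python list (in practice you’d run the SQL above and check the count):

```python
def assert_no_negative_keys(keys: list) -> None:
    # Fail fast: a single negative surrogate key aborts the run.
    negatives = [k for k in keys if k < 0]
    if negatives:
        raise ValueError(
            f"{len(negatives)} negative surrogate key(s) found: {negatives[:5]}"
        )

assert_no_negative_keys([2147483646, 2147483647])   # passes silently
# assert_no_negative_keys([-2147483648])            # raises ValueError
```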
2. Check key monotonicity
New keys should always be greater than the current max. If they’re not, something broke.
SELECT
  (SELECT MAX(sk_transaction_id) FROM dim_transaction) AS current_max,
  MIN(sk_transaction_id) AS new_min
FROM new_transactions
HAVING MIN(sk_transaction_id) <= (SELECT MAX(sk_transaction_id) FROM dim_transaction)
This query should return no rows. If it does, your key generation is compromised.
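The same rule as a pre-MERGE assertion in Python. A sketch, assuming the current max and the new keys are already in hand:

```python
def assert_monotonic(current_max: int, new_keys: list) -> None:
    # Every generated key must land strictly above the table's current maximum.
    new_min = min(new_keys)
    if new_min <= current_max:
        raise ValueError(
            f"key generation compromised: new min {new_min} <= current max {current_max}"
        )

assert_monotonic(1000, [1001, 1002, 1003])  # passes: all keys above the max
```

An overflowed batch fails immediately, because the wrapped key sits far below the current max rather than just above it.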
3. Validate MERGE outcomes
After the MERGE, check what actually happened. In Databricks, Delta Lake tracks operation metrics:
DESCRIBE HISTORY dim_transaction LIMIT 1
Inspect operationMetrics. If numTargetRowsUpdated is greater than zero when you only expected inserts — you have a problem.
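In PySpark you can make that check programmatic. A sketch: the DESCRIBE HISTORY call is Delta-specific and shown only as a comment, and note that Delta reports operationMetrics values as strings:

```python
def check_insert_only(operation_metrics: dict) -> None:
    # Delta's operationMetrics map has string values, e.g. {"numTargetRowsUpdated": "0"}.
    updated = int(operation_metrics.get("numTargetRowsUpdated", "0"))
    if updated > 0:
        raise RuntimeError(
            f"MERGE updated {updated} rows, but only inserts were expected"
        )

# On Databricks, feed it the latest history entry, e.g.:
#   metrics = spark.sql("DESCRIBE HISTORY dim_transaction LIMIT 1") \
#       .collect()[0]["operationMetrics"]
check_insert_only({"numTargetRowsUpdated": "0", "numTargetRowsInserted": "3"})  # OK
```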
4. Use BIGINT
BIGINT gives you a 64-bit signed integer: max value 9,223,372,036,854,775,807. That’s 9.2 quintillion.
CREATE TABLE dim_transaction (
sk_transaction_id BIGINT,
...
)
The storage cost difference between INT and BIGINT is 4 bytes per row. On a table with 10 million rows, that’s 40 MB. Your data integrity is worth more than 40 MB.
5. Reconsider MAX(id) + ROW_NUMBER()
This pattern is fragile. It relies on the current max being correct, on the type being large enough, and on no concurrent writes happening. Any of these assumptions can break.
Alternatives worth considering:
| Approach | Pros | Cons |
|---|---|---|
| monotonically_increasing_id() | Guaranteed unique per job | Not sequential, can look messy |
| UUID / hash-based keys | No collision risk | Larger, not sortable |
| Delta GENERATED ALWAYS AS IDENTITY | Clean, managed by Delta | Databricks-specific |
Each has trade-offs, but all of them are more robust than manually computing the next integer.
The lesson Spark won’t teach you
Spark is a compute engine. It processes your data exactly as you tell it to. It won’t second-guess your types, flag suspicious values, or refuse to overwrite good data with bad.
That’s your job.
Every pipeline should answer three questions before writing to a target table:
- Are the keys valid? — positive, unique, monotonically increasing
- Does the MERGE do what I expect? — inserts when expected, updates only when intended
- Would I notice if the data was wrong? — downstream reconciliation, row count deltas, value range checks
If you can’t answer yes to all three, your pipeline isn’t done. It just looks done.
If you’re reading this thinking “I should check my surrogate key columns” — yeah, you probably should. Run a SELECT MIN(sk_id) on your dimension tables. If anything comes back negative, we should talk.