January 28, 2025
What Production Bugs Taught Me About Writing Reliable Code
Nobody tells you this in college: the real curriculum begins the first time your code breaks in production at 2 AM.
I've been building software professionally for five years now — backend systems, APIs, financial data pipelines, the occasional frontend nightmare. In that time, I've written code I'm proud of, and I've written code that quietly destroyed data, silently dropped requests, and once — spectacularly — caused the same payment to be processed twice for over a hundred users. Production is a different world from local development. Your assumptions evaporate. Your clever abstractions betray you. And the bugs you encounter aren't the clean, reproducible kind from textbooks. They're intermittent, context-dependent, and almost always embarrassing in retrospect.
This post isn't a list of tips you can skim in three minutes. It's a reflection on real failures — what went wrong, why, and how those moments fundamentally changed the way I write code.
The Bugs That Changed How I Think
Bug #1: The Race Condition That Double-Charged Customers
The system was a mid-sized e-commerce platform. We had a checkout flow that, on payment confirmation, would mark the order as PAID, generate an invoice, and decrement inventory.
One afternoon, support started getting tickets. Customers were seeing two charges on their bank statements. The pattern was terrifying: always fast double-clicks, always on mobile, always under high traffic. The payment confirmation endpoint wasn't idempotent. When a user clicked "Pay" twice in quick succession, two concurrent requests would both check if order.status != 'PAID' at virtually the same millisecond, both see PENDING, and both proceed to charge the card.
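The fix was a database-level row lock combined with an idempotency key tied to the session. The locking half looked like this:
from django.db import transaction

with transaction.atomic():
    # Lock the order row; a concurrent request blocks here until we commit.
    order = Order.objects.select_for_update().get(id=order_id)
    if order.status == 'PAID':
        # The second request sees the committed state and exits early.
        return Response({"detail": "Already processed"}, status=200)
    process_payment(order)
    order.status = 'PAID'
    order.save()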
The lesson: Never trust application-level checks for state transitions under concurrent load. The database is your last line of defense — use locks, use transactions, and design every mutation endpoint to be idempotent from day one. Not as an afterthought.
Bug #2: The Edge Case That Silently Corrupted a Feed
This happened on a content aggregation service. We had an API endpoint returning a paginated feed using a cursor parameter — a base64-encoded timestamp — to support infinite scroll. Everything worked perfectly in staging. In production, two weeks after launch, users started reporting their feed would loop — showing the same posts repeatedly.
The root cause: we never validated what happened when two posts had the exact same timestamp. A batch job bulk-inserted posts with created_at = datetime.utcnow() — meaning hundreds of posts could share the same second. When the cursor landed on one of those timestamps, the query WHERE created_at < cursor skipped entire clusters of content.
The lesson: Edge cases aren't edge cases in production — they're just lower-probability scenarios waiting for enough traffic. If your system ingests data in bulk or parallel, collisions are inevitable. Design your data model to handle them from the start.
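One common way to make a timestamp cursor collision-proof is to add a unique tie-breaker, so the cursor encodes (created_at, id) instead of the timestamp alone. Here is a minimal in-memory sketch of that idea, assuming every post has a unique integer id. The POSTS data and the names encode_cursor, decode_cursor, and fetch_page are made up for illustration and are not the service's actual code:

```python
import base64
import json
from typing import Optional

# Hypothetical posts: four of them share one timestamp, as after a bulk insert.
POSTS = [
    {"id": 1, "created_at": "2025-01-01T10:00:00"},
    {"id": 2, "created_at": "2025-01-01T10:00:05"},
    {"id": 3, "created_at": "2025-01-01T10:00:05"},
    {"id": 4, "created_at": "2025-01-01T10:00:05"},
    {"id": 5, "created_at": "2025-01-01T10:00:05"},
    {"id": 6, "created_at": "2025-01-01T10:00:09"},
]


def sort_key(post: dict) -> tuple:
    # Equal-length ISO-8601 strings sort chronologically; id breaks ties.
    return (post["created_at"], post["id"])


def encode_cursor(post: dict) -> str:
    raw = json.dumps([post["created_at"], post["id"]]).encode()
    return base64.urlsafe_b64encode(raw).decode()


def decode_cursor(cursor: str) -> tuple:
    created_at, post_id = json.loads(base64.urlsafe_b64decode(cursor))
    return (created_at, post_id)


def fetch_page(cursor: Optional[str], limit: int = 2):
    """Return (posts, next_cursor), newest first.

    Similar in spirit to the SQL row comparison
    WHERE (created_at, id) < (:ts, :id) ORDER BY created_at DESC, id DESC.
    """
    ordered = sorted(POSTS, key=sort_key, reverse=True)
    if cursor is not None:
        boundary = decode_cursor(cursor)
        ordered = [p for p in ordered if sort_key(p) < boundary]
    page = ordered[:limit]
    # A full page may be the last one; the follow-up call then returns [].
    next_cursor = encode_cursor(page[-1]) if len(page) == limit else None
    return page, next_cursor
```

Because the comparison is on the pair, a page boundary that falls in the middle of a cluster of identical timestamps neither repeats nor skips anything.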
Bug #3: The Monitoring Gap That Cost Us Four Hours
We were running a background job that processed webhook events from a third-party payment provider, consuming from a Redis queue. One Friday evening, the queue started backing up. By the time anyone noticed, there were 14,000 unprocessed events — roughly four hours of missed updates.
The cause was mundane: a minor schema change in the third-party payload made our validation raise an exception. The worker caught it, logged "Validation error" at WARNING level, and moved on, silently dropping the event with no retry, no alert, and no dead-letter queue. The log entry existed. We just had no alert watching for it.
The lesson: A log line that nobody reads is the same as no log line. Every error path in a background job needs to either retry with backoff, push to a dead-letter queue, or trigger an alert — preferably all three. And third-party integrations will silently change their contracts; your consumers need to scream when that happens.
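To make "retry, dead-letter, alert" concrete, here is a rough sketch of a worker step that does the first two and logs loudly enough for an alert to hang off it. It is a sketch under assumptions, not the code we ran: a plain deque stands in for a real dead-letter queue, and process_with_retry, handler, and the event shape are hypothetical names.

```python
import logging
import time
from collections import deque
from typing import Callable

logger = logging.getLogger("webhook_worker")


def process_with_retry(
    event: dict,
    handler: Callable[[dict], None],
    dead_letters: deque,
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Run handler(event); retry with exponential backoff, then dead-letter.

    Returns True if the event was handled, False if it was dead-lettered.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            handler(event)
            return True
        except Exception as exc:
            logger.error(
                "Webhook event failed",
                extra={"event_id": event.get("id"), "attempt": attempt,
                       "error": repr(exc)},
            )
            if attempt < max_attempts:
                # With the defaults: wait 0.5s, then 1s, before retrying.
                sleep(base_delay * 2 ** (attempt - 1))
    # Out of retries: keep the payload so it can be inspected and replayed.
    dead_letters.append(event)
    logger.critical("Event dead-lettered", extra={"event_id": event.get("id")})
    return False
```

The sleep parameter is injectable only so tests do not have to wait. The CRITICAL log line is the natural place to attach an alert rule.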
Bug #4: The Performance Cliff That Didn't Exist in Staging
We had an analytics dashboard that queried aggregated user activity. In staging with 5,000 records, the query ran in under 100ms. In production with 2.3 million records, it ran in 18 seconds on peak traffic days — and sometimes timed out entirely.
The query joined three tables, applied a date range filter, and sorted by a derived CASE expression. No index covered that expression, so Postgres did a full sequential scan every time. We found it by finally setting up slow query logging and running EXPLAIN ANALYZE on the offending query. The fix involved pre-computing the derived column into a materialized view, refreshed every 15 minutes.
The lesson: Staging is a lie. It has clean data, low cardinality, and zero concurrent users. If you're not load testing with production-scale data — not just production-scale traffic — you're flying blind.
Patterns I Noticed Across Failures
Assumptions that break silently. Almost every bug above traces back to an assumption that felt reasonable: timestamps are unique, third-party schemas don't change, staging resembles production. Good code makes its assumptions explicit — through assertions, input validation, or at minimum, a comment.
Observability as an afterthought. We consistently under-invested in logging and monitoring during feature development, then scrambled during incidents. Observability has to be part of the definition of done, not a follow-up ticket.
Error handling that handles nothing. Catching errors and logging them without acting on them creates systems that appear healthy while quietly rotting. Errors should propagate, retry, alert, or all of the above. Silence is never the right response to failure.
Ignoring the boundaries of your own abstractions. The feed cursor bug happened because our pagination abstraction assumed uniqueness it couldn't guarantee. Every abstraction has a contract; violating that contract under unexpected input is a bug waiting to be scheduled.
How My Coding Style Changed
Early in my career, my internal benchmark for code was simple: does it work? If the tests passed and the happy path ran cleanly, I shipped it. My benchmark today is closer to: does it fail safely?
Defensive programming stopped feeling paranoid. Data that enters your system from external sources — user input, third-party APIs, message queues — should be validated at the boundary, not trusted throughout. The deeper it gets into your system before you validate it, the more damage it can do.
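As an illustration of validating at the boundary, the sketch below turns an untrusted dict into a typed, immutable object once, so nothing downstream has to re-check it. The field names and the PaymentEvent and InvalidPayload types are assumptions made up for this example, not a real provider's schema:

```python
from dataclasses import dataclass
from decimal import Decimal, InvalidOperation


class InvalidPayload(ValueError):
    """Raised when external data does not match the shape we expect."""


@dataclass(frozen=True)
class PaymentEvent:
    event_id: str
    order_id: int
    amount: Decimal


def parse_payment_event(payload: dict) -> PaymentEvent:
    """Validate an untrusted dict once, at the edge, and return a typed object."""
    if not isinstance(payload, dict):
        raise InvalidPayload("payload must be a JSON object")
    missing = {"event_id", "order_id", "amount"} - payload.keys()
    if missing:
        raise InvalidPayload(f"missing fields: {sorted(missing)}")
    try:
        order_id = int(payload["order_id"])
        amount = Decimal(str(payload["amount"]))
    except (TypeError, ValueError, InvalidOperation) as exc:
        raise InvalidPayload(f"bad field value: {exc}") from exc
    # Reject NaN and Infinity first, then non-positive amounts.
    if not amount.is_finite() or amount <= 0:
        raise InvalidPayload("amount must be a positive number")
    return PaymentEvent(str(payload["event_id"]), order_id, amount)
```

Raising one dedicated exception type makes it easy for the caller to route bad payloads to a dead-letter queue instead of letting them wander deeper into the system.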
Logs became part of the feature, not decoration. Now I write logs the way I write tests: before I ship a feature, I ask myself "if this breaks at 3 AM and I'm half-asleep, will I be able to diagnose it from the logs alone?"
# Before: technically a log
logger.info("Payment processed")

# After: actually useful
logger.info(
    "Payment processed",
    extra={
        "order_id": order.id,
        "user_id": order.user_id,
        "amount": order.total,
        "provider": payment_provider,
        "duration_ms": duration,
    },
)
Tests started covering failure modes, not just success paths. A significant portion of my tests are now adversarial: what happens if the database connection drops mid-transaction? What if the third-party returns a 200 with a malformed body? What if two requests arrive simultaneously?
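Here is roughly what an adversarial test for the "200 with a malformed body" case can look like. The parse_charge_response function and the charge_id field are hypothetical, invented for this sketch rather than taken from any real provider's API:

```python
import json
import unittest


class ProviderError(Exception):
    """The provider answered, but not with something we can act on."""


def parse_charge_response(status_code: int, body: str) -> str:
    """Return the charge id, or raise ProviderError. Never guess."""
    if status_code != 200:
        raise ProviderError(f"unexpected status {status_code}")
    try:
        data = json.loads(body)
    except json.JSONDecodeError as exc:
        raise ProviderError("200 response with a non-JSON body") from exc
    charge_id = data.get("charge_id") if isinstance(data, dict) else None
    if not isinstance(charge_id, str) or not charge_id:
        raise ProviderError("200 response without a usable charge_id")
    return charge_id


class ParseChargeResponseTests(unittest.TestCase):
    def test_happy_path(self):
        body = '{"charge_id": "ch_1"}'
        self.assertEqual(parse_charge_response(200, body), "ch_1")

    def test_200_with_html_error_page(self):
        with self.assertRaises(ProviderError):
            parse_charge_response(200, "<html>Bad Gateway</html>")

    def test_200_with_missing_field(self):
        with self.assertRaises(ProviderError):
            parse_charge_response(200, '{"status": "ok"}')

    def test_200_with_null_charge_id(self):
        with self.assertRaises(ProviderError):
            parse_charge_response(200, '{"charge_id": null}')
```

Saved as a module, this runs with python -m unittest. Three of the four tests are failure modes, which is about the ratio I aim for on integration code.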
Concurrency stopped being an advanced topic. Once you've debugged a race condition in production, you start seeing potential races everywhere. Every time I write a read-modify-write pattern now, I ask: what happens if two instances of this run simultaneously?
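A small way to answer that question is to actually run the read-modify-write from several threads at once. The sketch below is a toy, with an in-process lock standing in for the database row lock, and with Order, try_mark_paid, and simulate_double_click as made-up names. It only protects a single process; across multiple servers the lock has to live in the database.

```python
import threading


class Order:
    def __init__(self) -> None:
        self.status = "PENDING"
        self._lock = threading.Lock()

    def try_mark_paid(self) -> bool:
        """Atomically move PENDING -> PAID. Only one caller ever gets True."""
        with self._lock:
            if self.status == "PAID":
                return False
            self.status = "PAID"
            return True


def simulate_double_click(clicks: int = 8) -> int:
    """Fire `clicks` concurrent confirmations; return how many charged."""
    order = Order()
    charges = []
    barrier = threading.Barrier(clicks)  # release all threads together

    def confirm() -> None:
        barrier.wait()
        if order.try_mark_paid():
            charges.append("charged")  # list.append is thread-safe in CPython

    threads = [threading.Thread(target=confirm) for _ in range(clicks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(charges)
```

The check and the write happen under the same lock, so however many confirmations race, exactly one of them wins and charges.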
Practical Practices I Follow Now
Structured logging from the start. Every log entry includes a correlation ID, the relevant entity IDs, and timing information. Free-text logs are searchable; structured logs are analyzable.
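For the standard library logger, fields passed through extra= only show up in the output if the formatter emits them. One possible sketch of a formatter that does so, writing one JSON object per line, is below. JsonFormatter and build_logger are illustrative names, and the correlation ID here is just a random UUID rather than one propagated from an incoming request:

```python
import json
import logging
import uuid

# Attributes every LogRecord has; anything else arrived via `extra=`.
_STANDARD_ATTRS = set(logging.makeLogRecord({}).__dict__) | {"message", "asctime"}


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Copy over the caller-supplied fields (order_id, duration_ms, ...).
        for key, value in record.__dict__.items():
            if key not in _STANDARD_ATTRS:
                entry[key] = value
        return json.dumps(entry, default=str)


def build_logger(name: str = "payments") -> logging.Logger:
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger()
    log.info(
        "Payment processed",
        extra={"correlation_id": str(uuid.uuid4()), "order_id": 1234,
               "duration_ms": 87},
    )
```

The default=str argument keeps the formatter from crashing on values such as Decimal or datetime, which matters because a logger that raises during an incident is worse than a terse one.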
Dead-letter queues for every consumer. Any background job consuming from a queue needs a dead-letter destination for failed messages. Without it, failures disappear and you discover them through user complaints, not monitoring.
Idempotency keys on all mutation endpoints. Any endpoint that creates or modifies state should accept an idempotency key. Store it, check it, and return the same response if the same key appears twice.
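The "store it, check it, return the same response" loop can be sketched in a few lines. This is an in-memory stand-in, assuming a single process; in a real service the store would more likely be a table with a unique index on the key. IdempotencyStore and run_once are hypothetical names:

```python
import threading
from typing import Callable, Dict, Tuple

Response = Tuple[int, dict]  # (status code, JSON body)


class IdempotencyStore:
    """In-memory stand-in for a table with a unique index on the key."""

    def __init__(self) -> None:
        self._responses: Dict[str, Response] = {}
        self._lock = threading.Lock()

    def run_once(self, key: str, action: Callable[[], Response]) -> Response:
        # Holding the lock across the action serializes all requests, which
        # is fine for a sketch but too coarse for real traffic.
        with self._lock:
            if key in self._responses:
                return self._responses[key]  # replay, no side effects
            response = action()
            self._responses[key] = response
            return response
```

A retry carrying the same key gets the stored response back and the action never runs a second time, which is the property that would have prevented the double charge.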
Explicit timeouts on every external call. Database queries, HTTP requests to third-party services, Redis commands — everything external gets a timeout. Requests that hang indefinitely will block your thread pool and cascade into a full outage.
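import httpx

# Always explicit, never relying on defaults.
# httpx.Timeout needs a default (10s here, covering read, write, and pool) unless all four values are set.
response = httpx.get(url, timeout=httpx.Timeout(10.0, connect=2.0))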
Monitoring alerts on error rates, not just uptime. A service can be "up" while silently failing 5% of requests. I set alerts on 5xx rates, queue depth, and job failure counts — anything that tells me the system is misbehaving before a user does.
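In practice this usually lives in a monitoring system rather than in application code, but the underlying check is simple enough to sketch. The example below assumes a sliding window over the most recent requests and a 5% threshold; ErrorRateMonitor and its defaults are illustrative, not a recommendation for specific numbers:

```python
from collections import deque


class ErrorRateMonitor:
    """Track the last `window` request outcomes and flag a high 5xx rate."""

    def __init__(self, window: int = 100, threshold: float = 0.05) -> None:
        self.threshold = threshold
        self._outcomes = deque(maxlen=window)  # True means the request failed

    def record(self, status_code: int) -> None:
        self._outcomes.append(status_code >= 500)

    def error_rate(self) -> float:
        if not self._outcomes:
            return 0.0
        return sum(self._outcomes) / len(self._outcomes)

    def should_alert(self) -> bool:
        # Wait for a full window so one early failure does not page anyone.
        full = len(self._outcomes) == self._outcomes.maxlen
        return full and self.error_rate() >= self.threshold
```

A service in this state would still pass an uptime check, which is exactly why the rate is worth watching separately.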
Code reviews with a failure mindset. When I review code now, I'm not primarily asking "is this logic correct?" I'm asking: "what are the assumptions here, and which ones can break?" I look for unhandled error paths, missing timeouts, state mutations without locking, and log statements that would be useless during an incident.
Closing Thoughts
Production bugs are inevitable. That's not a cynical statement — it's the only realistic starting point for designing reliable systems. If you build software assuming it will never break, you'll design systems that break catastrophically. If you build it assuming failure is a matter of when, not if, you'll design systems that degrade gracefully, alert loudly, and recover quickly.
The engineers I respect most aren't the ones who write bug-free code — I'm not sure that person exists. They're the ones whose bugs are easy to diagnose, whose systems fail loudly rather than silently, and whose post-mortems focus on systemic fixes rather than blame.
Five years in, the most honest thing I can tell you is this: every production incident I've ever been involved in has made me a better engineer. Not because suffering is virtuous, but because real systems under real conditions will reveal every assumption you didn't know you were making. Build systems that fail gracefully. Then make sure they tell you when they do.