Engineering Debt

Why Every Internal Billing Engine Breaks at $10M ARR (And What the Postmortem Always Says)

Every homegrown billing system hits the same wall at $10M ARR. The postmortem always identifies the same failure modes. Here is what they are and why they are unavoidable.

It's 2:47 AM on a Tuesday. You're not sleeping. A Slack message arrived at 11:34 PM: "The billing job didn't run. 3,247 customers didn't get invoiced. We don't know why."

By the time you got to your desk, the postmortem document was already half-written. By someone else. The billing person. Who, by the way, gave notice yesterday.

You scroll through the Jira backlog. BILLING-4821. "Fix the cron job so it doesn't hang when we have >50K events per minute." Created 18 months ago. Status: BLOCKED. Blocked on what? A spreadsheet. Specifically, a Google Sheet where the Finance team manually tracks which customers should be invoiced this week. Because it's "faster than updating the database."

Three lines above that, BILLING-4189: "Reconcile 47 invoices from March that double-charged because the system ran twice somehow." The comment thread is 63 messages long. Your VP of Finance contributed 11 of them. All the same question: "How did this happen?" None of them answer it.

This is the moment. The moment every CTO reaches when their internal billing system, born from good intentions and shipping pressure, has become a load-bearing wall.

You cannot renovate it without risking the whole structure.

And the worst part? You knew this was coming. You've read the blog posts. Every single one says "billing is hard" and "consider buying" and you thought, not us. You had an engineer. You had a sprint. You had a billing problem. So you solved it. That engineer is now on sabbatical.


The Frankenstein Billing Stack

Let's name what your system actually is, because every CTO I've talked to has built the exact same thing, using different technologies:

A Frankenstein Billing Stack. A creature stitched together from disparate parts — each one functional in isolation, but the moment you try to move the whole body, something breaks.

It started simple. A single invoices table. A job that ran every day and... worked. Mostly. Then you added usage-based pricing, so you needed a usage_events table. Still tractable. Then you acquired a customer with a custom contract, so you added a discounts table with a reason column that accepts free-form text because nobody knew what the reasons would be. Then you added plan changes mid-cycle, so you needed a subscription_changes table. Then you needed to handle pro-rata credits, so you hardcoded the math in a stored procedure that no one fully understands anymore.

Now, a single change—say, adding a new discount type—requires coordinating changes across:

  • The database schema (because the stored procedure needs a new column)
  • The background job that calculates prorations (because it reads from the schema)
  • The API endpoint that creates discounts (because it needs validation logic)
  • Three separate places in the codebase where the discount calculation happens (because they all have slightly different interpretations of the spec)
  • The Google Sheet the Finance team uses to track exceptions (because it's the actual source of truth, not the database)
  • The reconciliation report the accountant runs in Excel (because it has its own formulas, hardcoded customer IDs)

Change one thing. Break three others. That's the Frankenstein Stack.

But here's the thing: it's not a failure of engineering. It's a failure of initial architecture. You didn't choose poorly. You chose what every rational team chooses at $0–$2M ARR: the fastest path to shipping. And for a while, the fastest path is the right path.

Until it's not.


The Three Phases of Internal Billing (A Story You've Lived)

Phase 1 — The "It's Just Math" Phase ($0–$2M ARR)

You're shipping the MVP. The billing problem is simple: customers pay a fixed monthly fee. You build a table. You build a job. It runs every night. Success feels inevitable.

The job is 150 lines of code. It loops through active subscriptions, creates an invoice, sends an email. The Finance team loves you because invoices are now automated. Your CFO can see projected MRR. Your CEO stops asking "did we invoice customer X this month?"

This phase lasts 18 months to 2 years, depending on how fast you grow. And for most of that time, it's the right call. You have 500 customers. The job finishes in 14 seconds. The billing person is whoever's on rotation that week.

The technical debt is invisible. There's no debt yet, because there's no complexity. Complexity is what creates debt.

Phase 2 — The "We Need a Billing Team" Phase ($2M–$10M ARR)

Around $2M ARR, the first crack appears.

A customer wants to be invoiced weekly instead of monthly. You add a billing_frequency column. The job now has an if-statement. Still manageable. But your Finance team now needs to track this in a spreadsheet because the database query that shows "what gets invoiced this week" is too slow—it's doing a full table scan of 50K subscriptions and 2M usage events.

Then a customer wants usage-based pricing. Oh. Now the math is no longer just rate * quantity. Now it's max(min_spend, (usage - included_units) * unit_rate). And there are tiers, so actually it's a sum over tiers: the units that land between tier_i.start_units and tier_i.end_units, multiplied by tier_i.price. And the customer wants the tiers reset daily, not monthly. But a different customer wants them reset quarterly.
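The tier math is small enough to sketch. A minimal hypothetical Python version (tier boundaries, prices, and the minimum-spend floor are invented for illustration, not taken from any real plan):

```python
# Hypothetical tiered usage pricing: each tier charges only the units that
# fall inside it, and a minimum-spend floor is applied at the end.

def tiered_charge(usage, tiers, min_spend=0.0):
    """tiers: list of (start_units, end_units, unit_price); end_units=None means unbounded."""
    total = 0.0
    for start, end, price in tiers:
        if usage <= start:
            break                                   # no units reach this tier
        billable = (usage if end is None else min(usage, end)) - start
        total += billable * price
    return max(min_spend, total)

# First 1,000 units free, next 9,000 at $0.002, everything beyond at $0.001:
tiers = [(0, 1_000, 0.0), (1_000, 10_000, 0.002), (10_000, None, 0.001)]
print(tiered_charge(500, tiers))     # entirely inside the free tier: 0.0
print(tiered_charge(12_000, tiers))  # spans all three tiers: 20.0
```

Ten lines of math. The complexity comes from the reset rules, the per-customer overrides, and the seven branches around it.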

Your job is now 800 lines. It has seven branches. It talks to a usage events table that grows by 5M rows per month.

Someone writes a bug. An if-statement returns true when it should return false. 340 customers get double-billed. It takes 18 hours to notice, another 6 to roll back, another 8 to issue credits. That's one person, 32 hours, and the bug was one character: >= instead of >.

You hire a billing engineer. Day one, they say, "This is unsalvageable. The state is implicit. We don't know what phase of the billing cycle any customer is in just by looking at the database. We have to recompute it from the job history." Day three, they ask to rewrite it. Management says "not now, we're focusing on product." Day 47, they say "okay I'm leaving."

By $10M ARR, you have:

  • A database schema held together by comments explaining what each column means
  • A background job that takes 45 minutes to run, and when it fails, nobody knows how to debug it
  • A reconciliation process that involves three people, two spreadsheets, and one SQL query that has the wrong join condition and has never been fixed because fixing it might break something else
  • A Finance team that doesn't trust the system, so they manually validate invoices in a spreadsheet
  • An engineering team that doesn't want to touch billing, because every change spawns bugs in unexpected places
  • A CTO who goes to sleep thinking about billing and wakes up thinking about billing

This is the phase where the load-bearing wall metaphor stops being cute and becomes terrifying.

Phase 3 — The "Nobody Touch the Billing Code" Phase ($10M+ ARR)

By $10M ARR, your billing system owns the company.

Changes move at glacial speed. A simple feature request—"let customers pause their subscription"—becomes a three-week investigation. Why? Because pausing a subscription requires answering: do we charge for the days already elapsed this month? Do we issue a prorated credit? If they resume in 30 days, do we bill from the resume date or from the original cycle date? If we bill from the resume date, do we adjust the annual contract value? If we do, do we issue a new contract amendment?

These questions don't have answers in the codebase. They have answers in tribal knowledge. They're carried around in the head of the one engineer who built the first version and has written 200 additional patches in the last 4 years.

If that engineer leaves, the company is in trouble. Not exaggerating. If that engineer leaves, you don't have a billing system anymore. You have a black box that you feed money into and hope money comes out.

Worse: your product is now constrained by billing. You want to ship a feature—multi-seat pricing, for example. It's a good feature, customers want it, but it would require changing how you calculate invoices. So you don't ship it. Instead, you ship a workaround. You tell Sales, "we can do per-seat pricing, you just have to manually create multiple subscriptions and send us a spreadsheet and we'll reconcile it at month-end." Sales hates you. Customers accept it because they want the feature badly enough. But they hate the workaround.

Your product roadmap is now being shaped by billing architecture, not by customer needs.


💡 CTO Reality Check: If your billing system fails—completely fails, data corruption, loses invoice records—how many hours would it take to rebuild it from scratch? If the answer is more than 8 hours, or if the answer is "I don't know," your system has hit Phase 3. And if you've hit Phase 3, you're no longer building billing; you're maintaining it. The opportunity cost of maintaining it is probably more expensive than buying a real system.


The 5 Failure Patterns (And Why They're Inevitable, Not Accidental)

These aren't bugs. These are patterns. Architectural patterns that emerge when you try to force a monolithic billing system to scale. They're inevitable because each one is a natural response to the previous failure.

Pattern 1 — Schema Rigidity

Your invoices table has columns:

  • id
  • customer_id
  • amount
  • status (DRAFT, SENT, PAID, OVERDUE, WRITTEN_OFF)
  • due_date
  • issued_at

What it doesn't have:

  • invoice_type (standard, credit note, refund receipt)
  • currency (what if you expand to EU and need EUR?)
  • tax_amount (you're currently calculating tax in code; accountants want it on the invoice)
  • payment_terms (NET30, NET60, etc.)
  • po_number (enterprise customers need to reference their PO on the invoice)

Every one of these fields was a feature request. Every one was added as a column or a side table. Each addition multiplied the complexity of the job that generates invoices because now the job has to handle eight different invoice shapes, and the shapes have conflicting requirements.

Now you want to pivot to a new business model. You're adding subscription stacking—customers can buy multiple subscriptions at once. Your invoices table now needs to track which subscriptions generated which line items. You could add a subscription_id column, but that doesn't work because some invoices have multiple subscriptions. You could create a junction table, but that requires changing the reconciliation logic everywhere.

So you add a lineitem_details JSONB column and stuff all the line item data in there. Now you have two sources of truth: the structure of the invoices table and the structure of the JSONB. And they're out of sync. Frequently.

Why it's inevitable: You're trying to store a tree structure (invoice → line items → usage events) in a relational schema. Relational schemas are rigid. Trees are flexible. You pick the wrong tool and then you pay the price.
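A toy illustration of that drift, with hypothetical field names: once line items live in a JSONB blob, nothing in the schema forces the blob to agree with the flat amount column sitting next to it.

```python
# Two "sources of truth" on one row: a flat amount column that reporting sums
# over, and a JSONB blob of line items that the invoice PDF renders from.

invoice_row = {
    "id": 4821,
    "amount": 999.00,                    # flat column (what reporting uses)
    "lineitem_details": [                # JSONB blob (what the invoice shows)
        {"subscription_id": "sub_A", "amount": 999.00},
        {"subscription_id": "sub_B", "amount": 49.00},  # added later; column never updated
    ],
}

def is_consistent(row):
    return row["amount"] == sum(li["amount"] for li in row["lineitem_details"])

print(is_consistent(invoice_row))  # False: the two truths have drifted
```

No constraint catches this; it surfaces months later as a reconciliation discrepancy.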

Pattern 2 — The Implicit State Machine

Your subscriptions go through phases:

  • Active (user is paying, we're billing them)
  • Paused (user asked to pause, we're not billing them, we'll resume later)
  • Pending Cancellation (user asked to cancel at end-of-cycle, we'll bill this month, then stop next month)
  • Cancelled (user is gone, we're not billing them)
  • Suspended (user is paying, but their invoice is overdue, we're stopping them until they pay)

But these states aren't explicit in the database. There's no state column. Instead, the states are implicit in combinations of other columns:

  • If cancellation_requested_at is not null and cycle_end_date is in the future, the subscription is "Pending Cancellation"
  • If cancellation_requested_at is not null and cycle_end_date is in the past, the subscription is "Cancelled"
  • If suspension_reason is not null, the subscription is "Suspended"
  • If pause_requested_at is not null and pause_until_date is in the future, the subscription is "Paused"
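Here is roughly what that implicit derivation looks like as code (column names are illustrative). Notice that nothing but the order of the if-checks decides what an ambiguous row means:

```python
# A sketch of implicit state derived from columns (hypothetical column names).
# The ordering of the checks, not any design decision, resolves ambiguous rows.

from datetime import datetime

def derive_state(sub, now):
    if sub.get("suspension_reason") is not None:
        return "SUSPENDED"
    if sub.get("cancellation_requested_at") is not None:
        return "PENDING_CANCELLATION" if sub["cycle_end_date"] > now else "CANCELLED"
    if sub.get("pause_requested_at") is not None and sub["pause_until_date"] > now:
        return "PAUSED"
    return "ACTIVE"

now = datetime(2024, 6, 1)
# An "impossible" row: cancellation and pause flags both set by racing services.
sub = {
    "cancellation_requested_at": datetime(2024, 5, 30),
    "cycle_end_date": datetime(2024, 6, 30),
    "pause_requested_at": datetime(2024, 5, 30),
    "pause_until_date": datetime(2024, 7, 1),
}
print(derive_state(sub, now))  # PENDING_CANCELLATION, purely because of check order
```

Swap two if-statements and the same row is "PAUSED." That's the problem.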

Now you want to add a feature: let users downgrade at the end of their cycle instead of canceling. This is a new state: "Pending Downgrade." So you add a downgrade_requested_at column and a new_plan_id column. Now the logic for determining state is even more complex.

Two months later, a customer wants to pause, then downgrade, then resume. That's three different states in sequence. How do you represent that? With more columns? With a state transition log? With a state machine that's documented somewhere no one can find?

Your job has become: "figure out what state this subscription is in, then do the right thing." And "figure out what state" is no longer trivial. It involves reading five different columns, checking dates, checking flags, and hoping you didn't miss a combination.

One day, you'll find a subscription that's in an impossible state. cancellation_requested_at is set, but so is pause_requested_at, and the timestamps indicate they were both set within the same second by different services running in parallel. What state is it in? The code doesn't know. So it breaks.

Why it's inevitable: You're trying to represent a state machine using columns instead of states. Columns are easy to add. States require design. You choose the easy way, and the system becomes harder to understand with every feature.

Pattern 3 — No Rate Plan Versioning

You have a rate plan. Customer A bought it at $99/month. Six months later, you decide to raise the price to $129/month for all new customers, but you grandfather existing customers. Good business logic.

But your rate_plans table doesn't have versions. There's just one row per rate plan. So when you update the price, you're updating the rate for everyone, including customers on the old rate.

So you create a new rate plan: "Original Plan - Legacy" and manually migrate old customers to it. Now you have duplicate rate plans in the system. This breaks reporting. Do they show up as one plan with 60% lower revenue, or as two plans with different revenue profiles? Depends on which query you use.

But wait. Customer A was on the old rate plan, but they upgraded their seat count mid-month. How much did they pay that month? They should pay pro-rata: some days at the old rate, some days at the new rate. But your system doesn't support prorated upgrades across plan versions. So you manually calculate the credit in a spreadsheet and ask Finance to issue it.
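The spreadsheet math Finance ends up doing by hand is just day-weighted averaging. A hypothetical sketch (30-day month, invented rates):

```python
# Back-of-envelope mid-cycle proration: days before the change are billed at
# the old monthly rate's daily price, days after at the new rate's.

def prorated_charge(days_in_month, change_day, old_monthly, new_monthly):
    old_days = change_day - 1                 # days elapsed on the old rate
    new_days = days_in_month - old_days       # remaining days on the new rate
    return round(old_days * old_monthly / days_in_month
                 + new_days * new_monthly / days_in_month, 2)

# Upgrade from $99 to $129 effective on the 16th of a 30-day month:
print(prorated_charge(30, 16, 99.0, 129.0))  # 114.0
```

The math is trivial. What's missing is a system that knows which version's rate applies to which days, so the math can run automatically.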

Now Customer B wants to upgrade, but they're coming from a legacy plan. How much is the upgrade fee? Is it based on the legacy rate or the new rate? The product team doesn't have an answer. So you tell them, "we don't support that yet." They go to a competitor.

Why it's inevitable: Versioning is hard. When you're at $0–$2M ARR, you assume prices are static. By the time you realize they're not, your system is already built around that assumption. Retrofitting versioning is expensive, so you use workarounds instead. Workarounds compound.

Pattern 4 — Concurrency Blind Spots

Your billing job runs every night. It looks at all active subscriptions and creates invoices. This takes 40 minutes.

Meanwhile, a customer is upgrading their subscription mid-cycle. The upgrade request comes into your API. The API service checks the subscription details, calculates a pro-rata credit, and updates the subscription. All in 200ms.

What if the API updates the subscription at 2:13 AM, and the billing job is running at the same time? Does the job see the old subscription details or the new ones? Does the job create an invoice for the old plan or the new plan?

You don't have locking. You don't have transactions that span both services (the subscription service and the invoicing service). You have a hope and a prayer.

So you add a flag: do_not_invoice_this_month. The API sets the flag when it updates the subscription. The job checks the flag before creating an invoice. But the flag is checked in one part of the code and cleared in another, and there's a 300-millisecond window between checking and clearing where a subscription could be in limbo. It's never happened. Yet.

But you know it will. So you add a second flag: invoice_locked. Now the job checks whether the subscription is locked before creating an invoice. If it's locked, the job retries five minutes later. But the retry logic is separate from the main loop, so there's a chance a subscription is locked forever.

You're not building billing anymore. You're playing 4D chess with concurrency, using flags as pieces, and you're losing.

Why it's inevitable: You built the system when you had one job running at one time per day. Now you have N APIs updating subscriptions in real-time. You added concurrency without adding the locks, transactions, or queues that prevent race conditions. The system "works" because race conditions are rare. That's not the same as working correctly.
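What the flags are trying to approximate is ordinary optimistic concurrency control: every write carries the version it read, and stale writes fail loudly instead of silently overwriting. A minimal in-memory sketch of the idea (a real system would use a version column and a conditional UPDATE, not a Python object):

```python
# Optimistic concurrency in miniature: writes must present the version they
# read; if someone else wrote in between, the write fails and must re-read.

class VersionedSubscription:
    def __init__(self, plan):
        self.plan, self.version = plan, 0

    def read(self):
        return self.plan, self.version

    def compare_and_update(self, new_plan, expected_version):
        if self.version != expected_version:
            return False                    # concurrent write happened: retry
        self.plan, self.version = new_plan, self.version + 1
        return True

sub = VersionedSubscription("basic")
_, v = sub.read()                          # the billing job reads version 0
sub.compare_and_update("pro", v)           # an API upgrade lands first: 0 -> 1
print(sub.compare_and_update("basic", v))  # the job's stale write is rejected: False
```

No flags, no 300-millisecond windows: a stale write cannot succeed, so the race becomes a retry instead of a double-billed customer.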

Pattern 5 — The Dunning Deferral

A customer's payment fails. Their invoice is overdue. Now what?

You send them a reminder email. Manually. Someone in Finance reads the overdue report and decides which customers to email based on "how much do we like this customer." If it's a big customer, they call instead. If it's a small customer, they log them in a shared Google Sheet and send a templated email with the subject line "Follow up on overdue."

Two weeks later, the customer pays. You need to update their subscription status from SUSPENDED back to ACTIVE. Is that automatic, or does someone need to approve it? If it's automatic, what if the customer pays twice—once via credit card, once via wire transfer, and the systems don't talk to each other? Then the customer is overpaid, and you need to issue a credit note. But credit notes are... oh right, they're not in the system because that's a "future feature."

So you have a dunning flow that's half-automated and half-manual. Somewhere between the automation and the manual work, there's a gap. A customer slips through the cracks and you don't notice they stopped paying until you do your quarterly financial close and realize you've been counting them as active customers for three months when they're not.

Why it's inevitable: Dunning is hard. It's not just technical; it's business logic. When should you retry a payment? After 1 day? 3 days? 7 days? Different customers have different terms. Different companies have different policies. You can't hardcode dunning logic; you need to configure it. Configuration is complex. So you defer it. You ship "send an email when payment fails" and call that "dunning." Then six months later you realize you're leaving 2% of revenue on the table because half of those retry payments aren't happening.
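"Dunning as configuration" doesn't have to be elaborate to beat "send an email and hope." A sketch of a per-policy retry schedule (the day counts and policy names are invented, not a recommendation):

```python
# Dunning as data, not code: each policy declares its retry days and when to
# suspend, and one function answers "what happens next for this failed payment?"

from datetime import date, timedelta

DUNNING_SCHEDULE = {
    "default":    {"retry_days": [1, 3, 7], "suspend_after_days": 14},
    "enterprise": {"retry_days": [7, 14],   "suspend_after_days": 30},
}

def next_action(policy_name, failed_on, today):
    policy = DUNNING_SCHEDULE[policy_name]
    elapsed = (today - failed_on).days
    for d in policy["retry_days"]:
        if elapsed < d:
            return ("retry_payment", failed_on + timedelta(days=d))
    if elapsed < policy["suspend_after_days"]:
        return ("suspend", failed_on + timedelta(days=policy["suspend_after_days"]))
    return ("suspended", None)

print(next_action("default", date(2024, 6, 1), date(2024, 6, 2)))
```

Changing terms for a customer segment becomes an edit to a dictionary, not a code change, and nobody slips through the gap between automation and the Google Sheet.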


The Postmortem Template (Spoiler: It Always Reads the Same)

Every billing system reaches a moment. A catastrophic failure. Not a little bug. A big one. Here's what the postmortem looks like:

Timeline:

  • 11:32 PM: Alerts fire. Billing job failed to start.
  • 11:47 PM: On-call engineer confirms the job hung at the "update subscription status" step.
  • 12:15 AM: Realized the job is stuck waiting on a database lock held by a transaction that started four hours ago and never committed.
  • 12:58 AM: Hard-killed the job. Realized we don't know which subscriptions were already processed and which weren't.
  • 1:34 AM: Started a manual query to find the "highest subscription ID we processed before the hang." Took 8 minutes because the query was slow and we needed to add a hint.
  • 2:42 AM: Restarted the job. It ran again, creating duplicate invoices for the subscriptions we'd already processed.
  • 3:15 AM: Realized the duplicate invoices and killed the job again.
  • 4:20 AM: Manually deleted the duplicate invoices from the database.
  • 5:03 AM: Restarted the job. This time monitoring it every 30 seconds.
  • 7:47 AM: Job finished. Probably. We're going to double-check this by running a reconciliation query.

Root Cause: "We don't have idempotency keys. The job doesn't know which subscriptions it's already processed. When the job fails, restarting it is a coin flip."

History:

  • 2013 — Job was written assuming it would always complete in under 30 minutes and would never fail.
  • 2018 — We hit 10K subscriptions. Job started taking 35 minutes.
  • 2021 — We hit 100K subscriptions. Job started failing randomly due to database locks.
  • 2024 — We're at 300K subscriptions. Job fails every month. We restart it and hope. Last night, we got unlucky.

Resolution: "We'll add idempotency keys and a processed subscription log. This will require refactoring the job, writing tests, and deploying carefully."

Estimated effort: 6 weeks.

Actual effort: 12 weeks. But they said 6, and you've already committed to it.
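The fix the postmortem proposes, a processed-subscription log, fits in a dozen lines. A sketch with an in-memory set standing in for a database table:

```python
# A processed-run log keyed by (subscription, billing period). On restart the
# job skips anything already recorded, so re-running is safe, not a coin flip.

def run_billing(subscription_ids, processed_log, create_invoice, period="2024-06"):
    for sub_id in subscription_ids:
        run_key = f"{sub_id}:{period}"
        if run_key in processed_log:
            continue                      # invoiced before the crash: skip
        create_invoice(sub_id)
        processed_log.add(run_key)        # record only after the invoice commits

invoices, log = [], set()
run_billing(["a", "b"], log, invoices.append)        # first run dies after "b"
run_billing(["a", "b", "c"], log, invoices.append)   # restart: only "c" is new work
print(invoices)  # ['a', 'b', 'c'], no duplicates
```

The hard part isn't writing this. It's retrofitting it into a job that was designed assuming it would never fail.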

Root Cause: Organic Complexity Growth

Here's the part everyone glosses over in the postmortem: this didn't happen because someone built it wrong. It happened because complexity grew faster than the system could accommodate.

Your system was designed for 10K subscriptions. It works fine at 10K. At 100K, it's slower, but it still works. At 300K, it's fragile. At 1M, it's broken.

Every system has a ceiling. You hit yours.

The ceiling wasn't unknown. It was implicit in the architecture. A monolithic job processing customers sequentially will eventually take longer than the time between job runs. When that happens, jobs start failing. It's math. It's not a bug; it's thermodynamics.

So why didn't you redesign before hitting the ceiling?

Because redesigns are expensive. You spent two years getting to 300K subscriptions. Redesigning the system takes three months, and it's risky, and there's always something more urgent. So you patch it. You add a flag. You add a check. You add a comment that says "TODO REFACTOR THIS."

And every patch makes the next patch harder.

The Real Cost: Opportunity, Not Engineering

Here's what the postmortem doesn't say:

In the time spent restarting the billing job, investigating why it failed, manually fixing the duplicate invoices, and deploying the fix, your engineers could have shipped three new product features.

That's not a sunk cost. That's an ongoing cost. For every month this system is in production, it's costing you:

  • Engineering capacity: One engineer, roughly 20% of their time, babysitting the billing system instead of building product.
  • Risk management: Every other feature ships with billing risk attached. Want to change payment terms? Better test it against the billing system. Want to add a new currency? Better think about how it affects invoicing.
  • Velocity: Features move slowly because the team is afraid of breaking billing.
  • Hiring: You can't hire a junior engineer to own billing because the system requires institutional knowledge. You can't hire an engineer who "doesn't want to deal with billing complexity" because that's now 20% of the role. You lose candidates.
  • Business agility: You can't pivot quickly. You can't experiment with pricing models because experiments require billing changes, and billing changes take three weeks of design and testing.

A production-grade billing system costs money to buy. An internal billing system that hits its ceiling costs you opportunity.

At $10M ARR, that opportunity cost is probably $500K–$2M per year in foregone product development.


The Decision Framework — When Build Becomes Buy

You've reached the moment. The moment where you admit: we need to change course.

The question is not "should we buy billing?" The question is "how much longer can we afford to maintain it?"

5 Signals That Your Internal Engine Has Hit Its Ceiling

Signal 1: The Billing Engineer is Typing in All Caps

When the person responsible for billing starts writing GitHub issues in all caps, that's a sign. They're not angry at you; they're angry at the system. They've realized it's unsalvageable. Run, don't walk, to executive leadership and tell them you need a plan. This engineer is six weeks away from leaving, and if they leave, you're in trouble.

Signal 2: Changes to Billing Take Longer Than Three Weeks

If a simple billing feature—"pause a subscription," "change the billing date," "support a new currency"—takes more than three weeks from specification to deployment, you've hit the ceiling. The fact that it takes three weeks is not because the feature is complex. It's because the system is fragile. You're spending two weeks testing to make sure you don't break something else.

Signal 3: Billing Failures Wake Up the On-Call Engineer at 2 AM, and Nobody is Surprised

If the on-call engineer's Slack status is "bracing for impact during billing job" on invoice day, you've hit the ceiling. This is not acceptable. Your system should not be so fragile that invoicing day is an anxiety event.

Signal 4: The "Billing Person" Exists as a Unique Individual

If there's one person who understands how the billing system works, and replacing them would take six weeks, you've hit the ceiling. A system that depends on tribal knowledge is a system you're renting from that person, even though you're paying their salary. That person deserves to work on interesting problems, not babysit a fragile system.

Signal 5: Your Pricing Model Has Outgrown Your Schema

If every new pricing feature requires a database migration, and migrations require downtime, and downtime is bad, so you're avoiding it, and features are piling up in the backlog—you've hit the ceiling. Your schema was designed for one pricing model. The market has moved on. Your system is playing catch-up.

The Migration Path: Incremental, Not Big-Bang

Here's where most CTOs get it wrong: they assume migrating to a new billing system is a flag flip. You turn off the old system, flip to the new system, hope it works.

That's how you lose data.

The right path is incremental. You run both systems in parallel, reconcile them, gradually shift traffic from the old to the new.

Phase 1 — Pilot (Weeks 1–4):

  • Pick your top 20 customers. Ask them if they'll help you test a new billing system. Offer them a 10% discount for their trouble.
  • Set up your new system (buy it, integrate it, deploy it).
  • Sync their subscriptions from the old system to the new system.
  • Run the new system in shadow mode: it generates invoices, but you don't send them. You compare them to the old system's invoices. Do they match?
  • If they don't match, you've found a gap. Fix it in the new system. Rerun the comparison.
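The shadow-mode comparison in that last step can start as something this simple (data shapes are hypothetical): group invoices by customer in both systems and flag anything that differs.

```python
# A minimal shadow-mode reconciliation: compare old-system and new-system
# invoice totals per customer and surface every discrepancy.

def reconcile(old_invoices, new_invoices, tolerance=0.01):
    """Both inputs: {customer_id: amount}. Returns a list of discrepancy reports."""
    gaps = []
    for cust in sorted(set(old_invoices) | set(new_invoices)):
        old_amt = old_invoices.get(cust)
        new_amt = new_invoices.get(cust)
        if old_amt is None or new_amt is None:
            gaps.append((cust, "missing in one system", old_amt, new_amt))
        elif abs(old_amt - new_amt) > tolerance:
            gaps.append((cust, "amount mismatch", old_amt, new_amt))
    return gaps

old = {"cust_1": 999.00, "cust_2": 49.00}
new = {"cust_1": 999.00, "cust_2": 52.50, "cust_3": 10.00}
for gap in reconcile(old, new):
    print(gap)
```

Every mismatch is a gap in the new system's configuration (or a latent bug in the old one). Either way, you want to find it in shadow mode, not in a customer's inbox.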

Phase 2 — Soft Launch (Weeks 5–12):

  • Pick 100 more customers. New customer signups get the new system automatically.
  • For existing customers, you decide which ones migrate: customers with "simple" subscriptions migrate first, customers with "complex" subscriptions migrate later.
  • For each customer cohort, run the old system and new system in parallel for one billing cycle. Reconcile the invoices.
  • Once you're confident, start sending invoices from the new system instead of the old system.

Phase 3 — Migration (Weeks 13–26):

  • Migrate the remaining customers.
  • Keep the old system running in shadow mode for another quarter, just in case.
  • Decommission the old system.

This takes 26 weeks, not 4 weeks. But at the end, you have a new billing system that works, and you haven't lost sleep wondering if you broke something.


What a Production-Grade Billing Architecture Actually Looks Like

Now let's talk about what you're buying (or should be building if you insist on building).

A production-grade billing system is not one thing. It's three things, each with its own database, its own API, its own deployment lifecycle:

Separation of Concerns

 ┌──────────────────┐    Kafka    ┌──────────────────┐    Kafka    ┌──────────────────┐
 │  CATALOG (8081)  │ ──────────→ │  PRICING (8083)  │ ──────────→ │  BILLING (8090)  │
 │                  │             │                  │             │                  │
 │  Products        │             │  Rate Plans      │             │  10-Stage        │
 │  Metrics         │             │  Offerings      │             │    Pipeline      │
 │  Features        │             │  Subscriptions   │             │  Invoices        │
 │                  │             │  API Keys        │             │  Wallets         │
 ├──────────────────┤             ├──────────────────┤             ├──────────────────┤
 │  PostgreSQL (own)│             │  PostgreSQL (own)│             │  PostgreSQL (own)│
 └──────────────────┘             └──────────────────┘             └──────────────────┘

  Each service: own DB · own API · own deploy lifecycle
Three-Service Architecture — Separation of Concerns

Catalog Service owns the what: the products, the metrics, the features.

  • Product: "API Gateway"
  • Metrics: "API Requests per Day," "Users per Day," "Data Transferred"
  • Features: "Real-time Alerts," "Custom Domains"

Pricing Service owns the how much: the rate plans, the offerings, the subscriptions.

  • Rate Plan: "Professional Plan" = $999/month for up to 1M API requests, $0.001 per additional request, $0.001 per additional user
  • Offering: "Professional Plan for AWS" = same as above but billed in USD, with a 10% discount for annual commitment
  • Subscription: Customer ABC is on the Professional Plan as of March 1, will renew March 31

Billing Service owns the settlement: the invoices, the payments, the reconciliation.

  • Invoice: Customer ABC, period March 1–31, $999 due April 15
  • Payment: ACH transfer of $999 received April 5, applied to invoice
  • Wallet: Customer ABC has a $100 credit balance from a prior refund

Each service owns its own data. Each service has one reason to change. Each service has its own transaction boundaries.

Now, when you want to change something, you change one service:

  • Want to support a new pricing model? Change the Pricing Service. The Catalog Service and Billing Service don't care; the API contract is the same.
  • Want to support a new payment method? Change the Billing Service. Nothing else changes.
  • Want to add a new metric? Change the Catalog Service, update the Pricing Service to handle the new metric in rate plans, that's it.

Each change is isolated. Each change is testable. Each change is deployable independently.

Key Architectural Properties

Property 1: Explicit State Machines

Subscriptions have nine states:

  • CREATED (just created, waiting to start)
  • TRIALING (trial period active)
  • ACTIVE (paying customer, good standing)
  • PAST_DUE (payment failed, but still within grace period)
  • PAUSED (customer asked to pause)
  • EXPIRING_SOON (subscription is about to expire, time to upsell)
  • EXPIRED (trial period ended, customer didn't convert)
  • CANCELLED (customer cancelled)
  • SUSPENDED (customer is overdue and payment is escalated)

These states are explicit. They're defined in code. They're enforced. You cannot jump from PAUSED to PAST_DUE without passing through ACTIVE. You cannot transition from CANCELLED to ACTIVE — CANCELLED is terminal. (This is how Aforo's subscription state machine works: 9 states, defined in a single Map.ofEntries(), with 37 tests covering every valid and invalid transition. Zero ambiguity.)

                    ┌──────────┐
                    │ CREATED  │
                    └────┬─────┘
                         │
                    ┌────▼─────┐
               ┌────│ TRIALING │────┐
               │    └────┬─────┘    │
               │         │          │
          ┌────▼───┐     │     ┌────▼────┐
           │EXPIRED │  ┌──▼───┐ │CANCELLED│ ← terminal
           └────────┘  │ACTIVE│ └─────────┘    (no exit)
                       └──┬───┘
              ┌──────────┼──────────┐
         ┌────▼───┐ ┌────▼────┐ ┌───▼──────┐
         │PAST_DUE│ │ PAUSED  │ │EXPIRING  │
         └───┬────┘ └────┬────┘ │  SOON    │
             │           │      └───┬──────┘
        ┌────▼─────┐     │         │
        │SUSPENDED │     └─────────┘
        └──────────┘       → back to ACTIVE

  37 tests cover every valid + invalid transition
9-State Subscription State Machine

Every transition is valid or invalid. There's no "both paused and cancelled at the same time." There's no "in between states." There are exactly nine states, and a fixed, enumerated set of valid transitions between them.

When you want to add "Pending Downgrade," it's a new state. You define it. You define the transitions to it (from ACTIVE). You define the transitions from it (to ACTIVE, after adjustment). You test it. You deploy it. Done. No new columns. No implicit state in combinations of flags.
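The transition table described above can be sketched in plain Java. This is an illustrative reconstruction, not Aforo's actual code; the state names follow the nine states listed above, and the specific transitions chosen here are an assumption based on the diagram:

```java
import java.util.Map;
import java.util.Set;

// Illustrative sketch of an explicit subscription state machine.
// Every legal transition is enumerated; anything absent is invalid.
enum SubState { CREATED, TRIALING, ACTIVE, PAST_DUE, PAUSED,
                EXPIRING_SOON, EXPIRED, CANCELLED, SUSPENDED }

final class SubscriptionStateMachine {
    private static final Map<SubState, Set<SubState>> VALID = Map.ofEntries(
        Map.entry(SubState.CREATED,       Set.of(SubState.TRIALING, SubState.ACTIVE)),
        Map.entry(SubState.TRIALING,      Set.of(SubState.ACTIVE, SubState.EXPIRED, SubState.CANCELLED)),
        Map.entry(SubState.ACTIVE,        Set.of(SubState.PAST_DUE, SubState.PAUSED,
                                                 SubState.EXPIRING_SOON, SubState.CANCELLED)),
        Map.entry(SubState.PAST_DUE,      Set.of(SubState.ACTIVE, SubState.SUSPENDED)),
        Map.entry(SubState.PAUSED,        Set.of(SubState.ACTIVE, SubState.CANCELLED)),
        Map.entry(SubState.EXPIRING_SOON, Set.of(SubState.ACTIVE, SubState.EXPIRED)),
        Map.entry(SubState.SUSPENDED,     Set.of(SubState.ACTIVE, SubState.CANCELLED)),
        Map.entry(SubState.EXPIRED,       Set.<SubState>of()),   // terminal: no exit
        Map.entry(SubState.CANCELLED,     Set.<SubState>of())    // terminal: no exit
    );

    static boolean canTransition(SubState from, SubState to) {
        return VALID.getOrDefault(from, Set.of()).contains(to);
    }

    public static void main(String[] args) {
        // PAUSED cannot jump straight to PAST_DUE; it must pass through ACTIVE.
        System.out.println(canTransition(SubState.PAUSED, SubState.PAST_DUE));  // false
        System.out.println(canTransition(SubState.PAUSED, SubState.ACTIVE));    // true
        // CANCELLED is terminal.
        System.out.println(canTransition(SubState.CANCELLED, SubState.ACTIVE)); // false
    }
}
```

Adding "Pending Downgrade" is then one new enum constant and two new entries in the table, plus tests.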

Property 2: Rate Plan Versioning

When you change a rate plan's price, you don't update the old one. You create a new version.

RatePlan v1: $99/month
RatePlan v2: $129/month (created June 1)

All customers on v1 stay on v1 unless they explicitly upgrade. When they upgrade, they're pinned to v2. When you report on revenue, you see: "v1 had 500 customers at $99, v2 has 300 customers at $129." No confusion. No double-counting. No need for manually-maintained legacy plan rows.
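A minimal sketch of what version pinning looks like in code. The names (RatePlanVersion, publish) are illustrative, not a real API; the point is that a price change creates a new immutable version rather than mutating the old one:

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

// Illustrative: each price change is a new immutable version, and
// subscriptions pin a specific version instead of pointing at "the plan".
record RatePlanVersion(String planCode, int version, long priceCents, LocalDate effectiveFrom) {}

final class RatePlanCatalog {
    private final List<RatePlanVersion> versions = new ArrayList<>();

    // Publishing a price change never mutates an existing version.
    RatePlanVersion publish(String planCode, long priceCents, LocalDate from) {
        int next = (int) versions.stream()
                .filter(v -> v.planCode().equals(planCode)).count() + 1;
        RatePlanVersion v = new RatePlanVersion(planCode, next, priceCents, from);
        versions.add(v);
        return v;
    }

    public static void main(String[] args) {
        RatePlanCatalog catalog = new RatePlanCatalog();
        RatePlanVersion v1 = catalog.publish("PRO", 9900, LocalDate.of(2024, 1, 1));
        RatePlanVersion v2 = catalog.publish("PRO", 12900, LocalDate.of(2024, 6, 1));
        // An existing subscription stays pinned to v1 until it explicitly upgrades.
        System.out.println(v1.version() + " -> $" + v1.priceCents() / 100); // 1 -> $99
        System.out.println(v2.version() + " -> $" + v2.priceCents() / 100); // 2 -> $129
    }
}
```

Revenue reporting then groups by (planCode, version), which is exactly the "v1 had 500 customers, v2 has 300" breakdown above.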

Property 3: Multi-Tenant from Day One

Every query includes tenant_id. Every table has a tenant_id column and an index on it. When a bug in the code forgets to filter by tenant_id, the database itself refuses to hand back other tenants' rows; a missing filter becomes an empty result or an error, not a subtle data leak.

Your row-level security isn't "trust the application to filter." It's "the database enforces it." A customer cannot see another customer's invoices because the database won't return them.
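In Postgres, this kind of database-enforced isolation is typically implemented with row-level security. A minimal sketch, assuming the application sets a session variable app.tenant_id per connection (the table and policy names here are illustrative):

```sql
-- Enforce tenant isolation in the database, not the application.
ALTER TABLE invoices ENABLE ROW LEVEL SECURITY;
ALTER TABLE invoices FORCE ROW LEVEL SECURITY;

-- Every query against invoices is filtered to the session's tenant.
CREATE POLICY tenant_isolation ON invoices
    USING (tenant_id = current_setting('app.tenant_id')::uuid);
```

The application runs SET app.tenant_id = '<tenant uuid>' when it opens a connection; after that, a query that forgets the tenant filter simply cannot return another tenant's invoices.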

Property 4: Per-Subscription Idempotency

When you create an invoice, you include an idempotency key: a unique identifier for "invoice for subscription X for period Y."

If the invoicing job crashes mid-run and restarts, it can process the same subscription twice. When it tries to create the invoice the second time, the idempotency key is the same. The system doesn't create a duplicate; it returns the existing invoice.

No more manual deletion. No more reconciliation. The system is idempotent by design. Aforo enforces this at the pipeline level — every bill run carries a per-subscription idempotency key, and the distributed lock (Redis, SET NX EX, 30-minute TTL) guarantees exactly one process owns the invoice generation for a given tenant at any time.
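A sketch of the idempotent-create path. Here a ConcurrentHashMap stands in for a database table with a UNIQUE constraint on the idempotency key; all names are illustrative, not Aforo's actual API:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Illustrative per-subscription idempotency. The map plays the role of a
// database table with a UNIQUE constraint on the idempotency key.
final class InvoiceStore {
    record Invoice(String idempotencyKey, long amountCents) {}

    private final ConcurrentMap<String, Invoice> byKey = new ConcurrentHashMap<>();

    static String keyFor(String subscriptionId, String period) {
        return subscriptionId + ":" + period; // e.g. "sub_42:2025-06"
    }

    // Creating the "same" invoice twice returns the original, never a duplicate.
    Invoice createOnce(String subscriptionId, String period, long amountCents) {
        return byKey.computeIfAbsent(keyFor(subscriptionId, period),
                k -> new Invoice(k, amountCents));
    }

    public static void main(String[] args) {
        InvoiceStore store = new InvoiceStore();
        Invoice first  = store.createOnce("sub_42", "2025-06", 9900);
        Invoice second = store.createOnce("sub_42", "2025-06", 9900); // crashed job retries
        System.out.println(first == second); // true: no duplicate created
    }
}
```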

Property 5: Distributed Locking

When the invoicing job runs, it acquires a lock: "I am the only process running invoices right now." The lock is held in Redis (or your datastore of choice). It expires after 30 minutes, in case the job crashes and never releases it.

While the job holds the lock, a subscription update API cannot modify certain fields—it has to queue the modification and apply it when the invoice job releases the lock.

This prevents race conditions. Not by hoping. By enforcing.
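The semantics of that Redis lock (SET key owner NX EX with a 30-minute TTL) can be modeled in a few lines. This in-memory stand-in is only a sketch of the acquire-and-expire behavior, not a production lock; in production the map is Redis itself:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Illustrative in-memory model of "SET key owner NX EX ttl":
// acquisition succeeds only if no unexpired lease exists for the key.
final class TtlLock {
    private record Lease(String owner, Instant expiresAt) {}
    private final ConcurrentMap<String, Lease> leases = new ConcurrentHashMap<>();

    // Returns true iff this call acquired the lock.
    boolean tryAcquire(String key, String owner, Duration ttl, Instant now) {
        Lease fresh = new Lease(owner, now.plus(ttl));
        // Keep a live lease; replace an expired one. merge() returns what's stored.
        Lease current = leases.merge(key, fresh,
                (old, neu) -> old.expiresAt().isAfter(now) ? old : neu);
        return current == fresh;
    }

    public static void main(String[] args) {
        TtlLock lock = new TtlLock();
        Instant t0 = Instant.parse("2025-06-01T00:00:00Z");
        boolean a = lock.tryAcquire("invoices:tenant-7", "job-A", Duration.ofMinutes(30), t0);
        boolean b = lock.tryAcquire("invoices:tenant-7", "job-B", Duration.ofMinutes(30), t0);
        // After the TTL elapses (job-A crashed and never released), re-acquisition works.
        boolean c = lock.tryAcquire("invoices:tenant-7", "job-B",
                Duration.ofMinutes(30), t0.plus(Duration.ofMinutes(31)));
        System.out.println(a + " " + b + " " + c); // true false true
    }
}
```

The TTL is the crash-safety valve: a dead job cannot hold the lock forever, which is exactly why the 30-minute expiry exists.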

Property 6: Scheduled Dunning

When a payment fails, the system doesn't email the customer manually. It triggers an automated dunning scheduler.

The scheduler runs hourly. It finds all subscriptions with past-due invoices. For each one, it checks:

  • How many retry attempts have been made?
  • How much time has passed since the last retry?
  • Is this customer in escalation (should we suspend them)?

If it's time for a retry, it retries. If the retry succeeds, it automatically transitions the subscription back to ACTIVE (from PAST_DUE, or from SUSPENDED if the account had already been escalated).

If max retries are exceeded, it escalates—maybe suspend the account, maybe cancel the subscription. The policy is configured by the business, not hard-coded by engineers.
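The retry-or-escalate decision the scheduler makes for each past-due subscription reduces to a small pure function. A hedged sketch, with the policy values (retry interval, max attempts) passed in from configuration rather than hard-coded:

```java
import java.time.Duration;
import java.time.Instant;

// Illustrative dunning decision: given the retry history and the
// business-configured policy, decide what to do on this scheduler tick.
final class DunningPolicy {
    enum Action { WAIT, RETRY, ESCALATE }

    static Action decide(int attemptsMade, Instant lastAttempt, Instant now,
                         Duration retryEvery, int maxAttempts) {
        if (attemptsMade >= maxAttempts) return Action.ESCALATE; // suspend/cancel per config
        if (lastAttempt == null || !lastAttempt.plus(retryEvery).isAfter(now)) {
            return Action.RETRY; // enough time has passed since the last attempt
        }
        return Action.WAIT; // not yet time for the next retry
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2025-06-10T00:00:00Z");
        Duration threeDays = Duration.ofDays(3);
        // One attempt made yesterday: too early to retry again.
        System.out.println(decide(1, now.minus(Duration.ofDays(1)), now, threeDays, 5)); // WAIT
        // One attempt made four days ago: retry now.
        System.out.println(decide(1, now.minus(Duration.ofDays(4)), now, threeDays, 5)); // RETRY
        // Five attempts exhausted: escalate per the configured policy.
        System.out.println(decide(5, now.minus(Duration.ofDays(4)), now, threeDays, 5)); // ESCALATE
    }
}
```

Because the function is pure, "retry every 3 days, escalate after 5" is a config row and a table-driven test, not an engineering change.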


What This Looks Like in Practice:

When a payment fails:

  1. The payment service publishes a payment.failed event to Kafka
  2. The Billing Service receives the event and marks the invoice as OVERDUE
  3. The Dunning Scheduler sees the overdue invoice on its next hourly run
  4. It reads the dunning config for that customer (retry every 3 days, after 5 retries escalate)
  5. It retries the payment automatically
  6. If the retry succeeds, it emits a payment.succeeded event
  7. The Subscription Service receives the event and transitions the subscription from PAST_DUE back to ACTIVE
  8. The customer is never blocked; they never have to call support

If that system is running, the CTO does not wake up at 2 AM looking at a postmortem.


Audit Yourself: 3 Signs You've Outgrown Your Billing Engine

Before you schedule a meeting with your board about buying a new billing system, do this audit:

Audit 1: The State Question

Open your database. Run:

SELECT COUNT(*)
FROM subscriptions
WHERE cancellation_requested_at IS NOT NULL
  AND suspension_reason IS NOT NULL
  AND pause_requested_at IS NOT NULL;

What's the result? If it's greater than zero, you have subscriptions in impossible states. If you don't know what states they're in, you've hit the ceiling.

Audit 2: The Schema Sprawl Question

Count the columns on your invoices table. If it has more than 20, you've outgrown the schema. If it has more than 30, you've hit the ceiling. And if 5 or more of them are JSONB columns storing structured data, that data should have been proper relational columns (or tables) all along.
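Assuming Postgres, the column count is one query away (add a table_schema filter if invoices lives outside your default search path):

```sql
-- How wide has the invoices table grown?
SELECT COUNT(*) AS column_count
FROM information_schema.columns
WHERE table_name = 'invoices';
```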

Audit 3: The Lock Question

Search your codebase for the pattern SELECT ... FOR UPDATE. How many times does it appear? If the answer is zero, you have no pessimistic locking at all, and concurrency bugs are almost certainly hiding in your invoicing paths. If the answer is more than five, you're using locking as a band-aid over an underlying architectural problem.

If any of these audits reveals a problem, you have two options: rebuild your system from scratch (6–12 months), or buy one (6–12 weeks with migration).


The Path Forward

The CTO at 2:47 AM, staring at the billing postmortem, has a choice. She can:

Option A: Tell the team "nobody touch billing for six months, we're rewriting it." Spend six months building the right way. Ship a new billing system. Move on with life.

Option B: Tell the team "we're buying a billing system built for this exact problem." Platforms like Aforo exist precisely because every internal billing engine follows this same decay curve — the 9-state machine, the rate plan versioning, the distributed locking, the automated dunning are all solved problems. Spend two weeks integrating. Spend two months migrating customers. Move on with life faster.

Option A feels like the right choice because you're an engineer, and engineers build things. But Option A costs you six months of product velocity. Six months is a lifetime in SaaS. Competitors ship three major features in six months. You spent six months moving things around.

Option B feels like admitting defeat. It's not. It's admitting that billing is not your competitive advantage. Your competitive advantage is your product. Building a billing system from scratch is a tax on your competitive advantage.

The best billing system is the one that doesn't require you to think about it.

That system probably isn't in your codebase right now.


Conclusion: The Inevitable Realization

By the time you reach $10M ARR, you've learned something crucial: every internal billing system is built on the assumption that you'll never reach $10M ARR.

You built for a smaller, simpler world. You succeeded. Your world got bigger and more complex. Your system didn't keep up.

This is not a failure. This is growth.

The companies that win at $100M ARR are the ones that made the hard decision at $10M: admit that the internal system is a liability, not an asset, and move on.

The companies that lose are the ones that keep patching, keep adding flags, keep hoping that "one more rewrite will fix it."

There's one more rewrite that will fix it. It's called "buying billing from someone else."

The postmortem on your desk right now? It's not a warning. It's an invitation.

Jay Bodicherla
Founder & CEO, Aforo

Product leader building Aforo, the production-grade enterprise monetization platform for SaaS teams scaling usage-based billing.

Ready to ship outcome-based pricing?

Deploy an Intercom-style billing model in 5 minutes.
No custom middleware required.

Try the sandbox free, or talk to our solutions team for a 1:1 enterprise architecture review. No credit card required.