The Page That Counted 800 Million Rows: When Aggressive Denormalization Is the Right Answer

The error report had been coming in for weeks. Sporadic 504s on the company-overview dashboard. Random timeouts on the customer-success view. Sixteen entries in the error bucket, each looking like its own little failure, none of them obviously related to each other. The team’s response was the usual: rescue the timeout, retry, add a loader spinner, ship a fix for “perceived performance.”

The first move was unglamorous: stop trusting the vendor tool’s error grouping. A short NRQL query against the same data, faceted by SQL template instead of stack-trace fingerprint, collapsed those sixteen “different” timeouts into one row. Every single one was the same SELECT COUNT(*) against a single 800-million-row table, fired from four different controllers, all on every dashboard page load.

The fix was not a query optimization. The fix was aggressive denormalization across half a dozen aggregate columns and a daily-buckets table. After it landed, p99 latency on the affected pages dropped by an order of magnitude, the error rate fell to near zero, and the database stopped being a load-bearing piece of the operational anxiety budget.

This is the story of how a count-on-every-page-load pattern silently consumed a database, and the unfashionable engineering answer that fixed it.

The query that ate the database

The collapsed view, once we knew what to look for, showed a single SQL template:

SELECT COUNT(*) FROM "transactions"
WHERE "transactions"."company_id" = $1
  AND "transactions"."deleted_at" IS NULL
  AND "transactions"."draft" = $2

That is the query. It is a count, with three predicates, against a table that had grown to roughly 800 million rows for our largest customer. The query plan was a sequential scan on the partial index for (company_id) WHERE deleted_at IS NULL. For most customers, “most rows for this company_id” was a few thousand rows and the scan was fast. For our largest customer, “most rows for this company_id” was tens of millions of rows, and the scan was anywhere from 8 to 90 seconds depending on the cache state.

Every dashboard page that showed a transaction count, on every load, was running this. Sometimes twice (once for the total, once for a “this month” subset). The retry-on-timeout fallback the team had added was masking the cost: most of the time the second attempt would hit a warm cache and complete. Some of the time it would not, and we got the timeout pages.

The real bug was not the timeout. The real bug was that we were counting all of our data on every page load.

Why “just optimize the query” was wrong

The first instinct, on seeing the query, was to optimize it. The team tried:

A covering index on (company_id, deleted_at, draft) with INCLUDE on id. Helped a little. Still slow on the largest tenant.
A partial index on (company_id) WHERE deleted_at IS NULL AND draft = false. Helped more. Still slow on the largest tenant.
A materialized view refreshed every five minutes. Helped a lot for the count case, but introduced staleness in a place where the product owners did not want staleness, and the refresh itself was a long-running transaction that blocked vacuum.
An approximate count via EXPLAIN. Fine for some pages, unacceptable for billing-related views.

Each option made the query faster. None of them made the pattern better. The pattern was: every page load triggered a count of every row matching three predicates against an unbounded table. The pattern was a count of all of our data. The pattern was wrong.

The number of rows the query was summarizing was small. There were not 800 million distinct “this company’s transactions this month” answers; there were on the order of a thousand companies times twenty months of operation. The cardinality of the output of the query was less than 30,000 rows. The cardinality of the input was 800 million.

Whenever the input is six orders of magnitude larger than the output, and you are recomputing the output on every page load, you are not doing engineering. You are paying tribute to your database vendor.

The right answer is denormalization. The argument against denormalization is “it can drift.” The argument for denormalization is “drift is detectable and fixable; latency is not.”

What we denormalized

We did this in three passes.

Pass 1: per-company counter cache. The companies table got an inspection_count column, maintained by Rails’ built-in counter_cache. Every page that showed “total transactions for this company” now read a single integer.

class Transaction < ApplicationRecord
  belongs_to :company, counter_cache: true
end

Pass 1 fixed the most-hit page (the company overview). It did not fix the time-bucketed views (“transactions this week,” “checkouts this month”).

Pass 2: per-company, per-day, per-event-type aggregate table. We introduced inspection_stats with the shape:

inspection_stats(
  company_id      bigint,
  location_id     bigint,
  day             date,
  event_type      text,    -- 'checkout', 'checkin', 'update'
  kind            text,    -- transaction kind taxonomy
  count           integer,
  damage_count    integer,
  PRIMARY KEY (company_id, location_id, day, event_type, kind)
)

For our largest customer, this collapses tens of millions of transaction rows into roughly 90 days × 50 locations × 3 event_types × handful of kinds ≈ 60,000 rows. For the median customer, it is a few hundred rows. The whole table for all customers is on the order of 10 million rows.

The page that wants “transactions this week, by event type” now does:

InspectionStat
  .where(company_id: company.id, day: 1.week.ago..)
  .group(:event_type)
  .sum(:count)

That query reads tens or hundreds of rows. It returns in single-digit milliseconds against a properly-indexed inspection_stats.

Pass 3: per-company aggregate snapshot columns on the companies table. Things like last_inspection_at, last_inspection_at_30d, inspection_count_30d, inspection_count_90d. Updated by a background reconciliation job, not by callbacks. These are the columns that get pulled by Salesforce sync, by Customer Success dashboards, by the BI tool’s daily snapshot.

The Salesforce sync, before this change, was running for four hours every night because it was recomputing 30/60/90-day inspection counts for every company by walking the transactions table. After this change, it reads the snapshot columns directly. The sync runs in under ten minutes. Same answer, two orders of magnitude faster.

The reconciliation job

Denormalization without reconciliation is a bug factory. The cache eventually drifts. The counter cache miscounts after a soft-delete that bypasses callbacks. The aggregate table misses rows after a bulk import. The snapshot columns lag.

The reconciliation job runs nightly and is the part nobody on the team wanted to write but everyone is now glad we did.

class InspectionStatReconciler
  def reconcile_company(company)
    truth = Transaction
      .where(company_id: company.id, deleted_at: nil, draft: false)
      .where(created_at: 90.days.ago..)
      .group(:event_type, "DATE(created_at)", :location_id, :kind)
      .count

    # Truth is a hash of [event_type, day, location_id, kind] => count.
    # Compare to InspectionStat; emit drift events for mismatches; upsert.
    InspectionStat.transaction do
      truth.each do |(event_type, day, location_id, kind), count|
        InspectionStat.upsert(
          { company_id: company.id, location_id: location_id, day: day,
            event_type: event_type, kind: kind, count: count },
          unique_by: [:company_id, :location_id, :day, :event_type, :kind]
        )
      end
    end
  end
end

The job runs per-company in a sharded queue. Small companies reconcile in seconds; the largest takes a few minutes. The whole nightly pass completes in under 45 minutes.

The first time we ran it, it caught roughly 0.04% drift between the live table and the aggregate table. The drift was real (a handful of soft-deletes that had not propagated, a couple of admin imports that had bypassed callbacks). We fixed the underlying causes; the reconciliation closed the gap.

The job also emits a structured drift event whenever it patches a row. Aggregated, those events told us which code paths were writing transactions without going through callbacks. We hunted them down one by one over the next quarter. The drift rate is now consistently under 0.001%.

What the team learned

Counting all of your data on every page load is a code smell that compounds. It looks fine on day one when the table has 10,000 rows. It looks fine on day 365 when the table has 10 million rows. It does not look fine on day 1,500 when the table has 800 million rows and the customer is on the phone.

Vendor observability fingerprinting will mask the pattern. Sixteen instances of “one bug” can present as sixteen unrelated bug reports if the fingerprint is too narrow. A small custom tool that re-buckets by your heuristics — three hundred lines of Ruby — pays for itself in a week.

Denormalization is not a sin if you have a reconciliation job. The argument against denormalized aggregates is that they drift. The answer is to detect and patch drift on a schedule. The cost of running the reconciliation job is a tiny fraction of the cost of running the slow queries it eliminates. The cost of not running the reconciliation job is a slow, silent data-integrity bug.

The 30/60/90 columns paid for themselves three different ways. The Salesforce sync got 20x faster. The Customer Success dashboards became instant. Kirk from the analytics team built an n8n connector that pulled the data through Metabase, and the load on the read replica went from “this query will not complete” to “this query is trivial.” Same data, three downstream consumers, one denormalization.

A note on counter_cache

Rails’ built-in counter_cache works well most of the time and fails in specific ways that are worth documenting:

It is updated only by ActiveRecord callbacks. Any code path that bypasses callbacks (update_columns, update_all, raw SQL, delete instead of destroy) leaves the counter wrong.
It drifts after a soft delete unless your soft-delete library (paranoid2, acts_as_paranoid) hooks into the counter explicitly.
It does not handle conditional counts (count where deleted_at IS NULL AND draft = false). For those you write your own.

We use counter_cache for the simple total. We use a reconciler for everything else. The reconciler is the cheap insurance that makes the speed gains durable.

Results

p99 latency on the company overview dashboard: from ~12s with intermittent timeouts to ~110ms steady.
p99 latency on the customer success monthly view: from ~24s with frequent timeouts to ~180ms steady.
Salesforce sync runtime: 4 hours to under 10 minutes.
Error rate from ActiveRecord::StatementInvalid on the affected controllers: down by ~99.5%.
Read-replica IOPS at the daily-batch window: roughly halved (we stopped scanning the transactions table for aggregates).
A new use case (Kirk’s n8n connector) unlocked because the data was finally pullable without melting the replica.

The takeaway

If your application is counting all of your data on every page load, the right answer is not a smarter query. The right answer is to count it once, store the answer, and update the answer as new data arrives. Counter caches handle the simple cases. Aggregate tables handle the time-and-dimension-bucketed cases. Snapshot columns on the parent table handle the “show this customer’s headline number” cases.

Then write a reconciliation job. Run it nightly. Trust it but verify.

The unfashionable answer is the right one here. Aggressive denormalization, with the reconciler as the safety net, is the pattern that turns a table-scan problem into a primary-key-lookup problem. Every page that was slow becomes fast. Every downstream consumer that wanted the data gets to consume it cheaply.

This was one of a body of similar database engagements. If your dashboards are slow because they are counting an unbounded table on every load, this is the conversation worth having.