The Question the AWS Console Will Not Answer: RDS IOPS Pattern Detection With Ruby and a Spreadsheet

A senior engineer asks: “What percent of business hours do we run above 10,000 IOPS on the primary database?”

The AWS Console answer: a chart of the last 14 days at 1-minute resolution that you have to eyeball and guess.

The CloudWatch Metrics API answer: a raw stream of data points that you can pull, but only if you have already written the analysis code that turns it into the question’s answer.

The “right tool for the job” answer: there is none. There is no AWS-native UI for the question. The question is good, the data exists, and the gap is between the data and the human who wants the answer.

The cheapest way to close that gap is to write a small Ruby tool. About 600 lines, two evenings, no external dependencies beyond an HTTP client and Ruby’s CSV library. We did this for a client whose RDS bill had been climbing without anyone being able to explain it, and the tool answered five questions in an afternoon that the AWS console could not answer at all.

This is the story of that tool, what it found, and why it is a pattern I have applied to more than one client now.

The thing that breaks the AWS Console for this

The console is built for live monitoring, not for analysis. If you want a chart of IOPS over the last two hours, the console is fine. If you want to answer a structured question — what fraction of weekday business hours did we exceed threshold X over the last three months — the console will not help.

The chart is a chart. You cannot bin its values into thresholds. You cannot subset by time of day. You cannot export the raw points without leaving the console. You cannot diff the pattern between two weeks. You cannot ask “of the times we spiked above 12,000 IOPS, which were on a weekday vs weekend?”

Each of those questions takes one Ruby method to compute from the raw metric stream. The tool I am about to describe is the assembly of those methods plus a CSV exporter so the answers land in a spreadsheet the financial side of the house can read.

The shape of the tool

The architecture is uninteresting on purpose:

  1. Pull CloudWatch metrics for a configured time window. We do this through NewRelic’s monitoring layer (which already has the data correlated to deploys and traffic), but a direct CloudWatch SDK call works the same way.
  2. Bucket each data point by time-of-day window (morning, afternoon, evening, night) and by weekday/weekend.
  3. For each configured threshold (5K, 7.5K, 10K, 12.5K, 15K IOPS), compute the fraction of bucketed time spent above the threshold.
  4. Emit a CSV per analysis pass. One row per (time-segment × threshold), with the percent-above figure.

We named the tool RELIC. The name is not load-bearing.

class RelicAnalyzer
  THRESHOLDS = [5_000, 7_500, 10_000, 12_500, 15_000].freeze

  SEGMENTS = {
    morning:   (6..12),
    afternoon: (12..17),
    evening:   (17..22),
    night:     (22..30) # wraps through midnight
  }.freeze

  def initialize(metric_stream)
    @points = metric_stream # array of { timestamp:, value: }
  end

  def threshold_breakdown
    grouped = group_by_segment(@points)

    grouped.flat_map do |segment, points|
      THRESHOLDS.map do |threshold|
        above = points.count { |p| p[:value] >= threshold }
        {
          segment: segment,
          threshold: threshold,
          total_minutes: points.length,
          minutes_above: above,
          percent_above: (above.to_f / points.length * 100).round(2)
        }
      end
    end
  end

  private

  def group_by_segment(points)
    points.group_by { |p| segment_for(p[:timestamp]) }
  end

  def segment_for(ts)
    hour = ts.localtime.hour
    SEGMENTS.find { |_name, range| range.cover?(hour) || (range.last > 24 && (hour >= range.first || hour < range.last - 24)) }&.first
  end
end

A second module pulls the points from CloudWatch via NewRelic’s NRQL API:

class CloudwatchIopsStream
  def fetch(start_time:, end_time:, instance:)
    nrql = <<~NRQL
      SELECT max(provider.readIops + provider.writeIops) AS iops
      FROM DatastoreSample
      WHERE provider = 'RdsDbInstance'
        AND providerAccountName = '#{instance}'
      SINCE '#{start_time.iso8601}'
      UNTIL '#{end_time.iso8601}'
      TIMESERIES 1 minute
    NRQL
    NewRelicClient.new.fetch(nrql)
  end
end

A third pipes the results to CSV:

CSV.open("iops_breakdown_#{Date.today}.csv", "w") do |csv|
  csv << ["segment", "threshold", "total_minutes", "minutes_above", "percent_above"]
  analyzer.threshold_breakdown.each do |row|
    csv << row.values_at(:segment, :threshold, :total_minutes, :minutes_above, :percent_above)
  end
end

That is the whole tool. Forty lines of substance, a hundred lines of setup and config, the rest of the 600 lines is robustness (retry on rate-limited NRQL responses, environment loading, parameter validation, a small CLI wrapper).

What it found

For this client, the headline question was the one the senior engineer originally asked: what percent of weekday business hours do we run above 10K IOPS? The CSV answered:

Segment Threshold % Above
morning (6–12) 10,000 14.2%
afternoon (12–17) 10,000 38.7%
evening (17–22) 5,000 22.1%
night (22–6) 5,000 6.4%

The afternoon number was the surprise. The team’s intuition was that morning (when most customers log in) was the peak. The data said afternoon. We dug into the breakdown by deploy correlation — NewRelic’s since filter lets you correlate with deployment events — and discovered that the afternoon spike was the internal batch processing window, not customer traffic. A nightly aggregation job had been moved to a “low traffic” 2pm slot by an engineer who left the company two years ago. Nobody on the current team knew the job existed.

That job was rewritten to run at 4am, and the afternoon spike disappeared. The percent-of-business-hours-above-10K dropped from 38.7% to 8.1% in the next analysis pass.

The 22%-of-evenings-above-5K number led to a separate investigation: a webhook fan-out for a partner integration was running at scale every evening. The fan-out was redesigned to batch its writes. Evening percent-above-5K dropped to under 3%.

The tool’s value was not in the analysis. The analysis is trivial. The value was that it answered the questions the team could phrase and the AWS Console could not.

Why a CSV is the right output

I want to defend the CSV explicitly, because some readers will reach for “this should be a Grafana dashboard.”

A dashboard is the right output when the question is repeated forever. The questions we were asking were specific to an investigation, and would not need a dashboard once the investigation concluded. A dashboard would have taken a day to set up properly, would have needed to be parameterized to handle the threshold and segment options, and would have produced a chart that a non-engineer could not easily reshape.

A CSV opens in Excel. The CFO’s finance director, the database lead, and the engineering manager can each pivot the same CSV in three different ways without anyone writing more code. Two of those people would not have opened a Grafana dashboard.

The general rule: investigative analysis goes to CSV. Operational monitoring goes to dashboards. The mistake is using one for the other.

The questions you should consider asking

Once the tool is built, the marginal cost of asking it a new question is small. The questions that have paid off, across multiple clients:

The CSV-and-spreadsheet approach makes each of these a five-minute conversation between the analyst and the data. The chart-and-vibes approach makes each of these a half-day investigation.

The cost angle

The reason this kind of analysis is worth doing is the AWS pricing model. RDS provisioned IOPS is billed at a flat rate per IOPS-month. If you have provisioned 12,000 IOPS but spend 90% of business hours below 7,500, you are paying for headroom you do not use.

The right answer is not always to provision down. Sometimes the headroom is real, and the spike you see at 4pm on the third Tuesday of every month is the billing job that absolutely cannot fail. But you cannot make that call without the data.

For this client, the analysis enabled two decisions:

  1. Provisioned IOPS dropped from 12,000 to 9,000 on the primary, with a documented spike absorption plan via Aurora’s burst capacity. Monthly RDS bill reduction was in the low five figures.
  2. The afternoon batch job moved to off-hours, which was the real underlying problem. The provisioned reduction would not have been safe without that move.

The tool paid for itself in two evenings of engineering time within one billing cycle. We have run versions of it on three other clients since. The numbers vary; the pattern does not.

The takeaway

The AWS Console is built for live monitoring, not analytical questions. If your team is asking analytical questions about performance, billing, or capacity that the console cannot answer, a six-hundred-line Ruby tool over the CloudWatch metrics stream will answer them, today, with a CSV the finance team can read.

Do not over-build it. Do not turn it into a dashboard until you have asked the same question three times and gotten value each time. Do not write it in a framework. Write it as a script, with a CLI, that emits a CSV, and use it like a question-answering machine.

This is one of those tools that is too small to be a product and too useful to skip. If your RDS bill is climbing and nobody can say why, this is the cheapest way to find out.

This was one of a body of small-tool engagements. Happy to talk about yours.