S3 Access Log Archaeology: Turning a Year of Access Logs Into an Archival Policy

Before you tier a petabyte of media, you have a question to answer: which objects are cold?

The intuitive answers — “anything older than 90 days,” “anything that hasn’t been read in 30 days,” “anything not in the most recent transaction” — are all reasonable defaults. They are also all wrong in interesting ways, because they assume things about your access pattern that may not be true. Your access pattern is whatever the access logs say it is, and most teams have never actually looked.

This article is about a Ruby tool we wrote for a client whose S3 bucket had grown past a petabyte of customer-generated media. Before we tiered a single byte, we wrote a parser that ingested a year of S3 access logs into a local Postgres database, joined those logs to the application’s own database, and emitted a per-tenant access-pattern report. The report informed the per-tenant archival policy that drove the S3 bill from “rising linearly” to “approximately flat.”

The methodology generalizes. Most B2B SaaS companies have an S3 bucket whose access pattern they cannot describe. Most of them are paying for that ignorance.

The data sources

S3 server access logs are a beautifully boring data product. AWS delivers many small log files, on a best-effort basis and in a fixed text format, into a logging bucket you nominate. Enable the feature on the bucket you care about, wait a few hours, and you have a stream of access records like this one (a single line in the log, wrapped here by field):

996fbe45a9aceccdd2f899573a6d0e6ecd09dc64dd48a77fb3a2f7eec11d53a5
r360-media-production
[12/Mar/2020:22:29:42 +0000]
24.84.228.47 -
02B2DD5C24EF3198
REST.GET.OBJECT
transactions/11274857/snapshot_images/152014859-image-8533C974-...-tn
"GET /transactions/11274857/snapshot_images/152014859-image-8533C974-...-tn HTTP/1.1"
200 - 73194 73194 33 32
"https://app.platform.example/admin/transaction"
"Mozilla/5.0 ..."

Each line carries: the bucket, the object key, the requesting IP, the user agent, the referer, the operation (GET, HEAD, PUT, DELETE), the HTTP status, the bytes served, and timing.

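The object key is the part most analyses ignore and the one with the most useful information embedded in it. Our key shape was transactions/<transaction_id>/snapshot_images/<image_id>-image-<uuid>-tn. That structure means the access log line, by itself, tells us which transaction the object belongs to, which media type it is (the second path segment), which image it is, and whether it is a thumbnail or a full asset (the -tn suffix).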

Combined with the application database (which knows when the transaction was created and which company owns it), we can compute the age of the object at the moment of access and the tenant on whose behalf the access occurred.

The pipeline

The parser is unflashy. It walks the logging bucket in S3, downloads each hour’s log files, parses each line into a row, and writes rows into a local Postgres table. The cost of the local Postgres is trivial; the cost of doing this analysis directly against S3 every time would be ruinous.

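require "time" # Time.strptime is defined by the stdlib "time" extension

class AccessLogParser
  # One named group per space-delimited field of the S3 access log format.
  # The pattern is not anchored at the end, so trailing fields are ignored.
  LINE_REGEX = %r{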
    ^(?<bucket_owner>\S+)\s
     (?<bucket>\S+)\s
     \[(?<timestamp>[^\]]+)\]\s
     (?<remote_ip>\S+)\s
     (?<requester>\S+)\s
     (?<request_id>\S+)\s
     (?<operation>\S+)\s
     (?<key>\S+)\s
     "(?<request_uri>[^"]*)"\s
     (?<http_status>\d+)\s
     (?<error_code>\S+)\s
     (?<bytes_sent>\S+)\s
     (?<object_size>\S+)\s
     (?<total_time>\S+)\s
     (?<turn_around>\S+)\s
     "(?<referrer>[^"]*)"\s
     "(?<user_agent>[^"]*)"
  }x

  def parse(line)
    match = LINE_REGEX.match(line)
    return nil unless match
    {
      timestamp: Time.strptime(match[:timestamp], "%d/%b/%Y:%H:%M:%S %z"),
      remote_ip: match[:remote_ip],
      operation: match[:operation],
      key: match[:key],
      bytes_sent: match[:bytes_sent].to_i,   # "-" (nothing sent) becomes 0
      object_size: match[:object_size].to_i, # "-" becomes 0
      transaction_id: extract_transaction_id(match[:key])
    }
  end

  def extract_transaction_id(key)
    key[%r{^transactions/(\d+)/}, 1]&.to_i
  end
end

The first interesting trick is extract_transaction_id. The transaction_id is in the path, so pulling it out takes one regex. The same kind of pattern tells us whether the key is a thumbnail or a full asset (the -tn suffix), and which media type the object belongs to (the second path segment).
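As a minimal sketch of that fuller key decomposition, assuming the key shape given earlier (transactions/<transaction_id>/<media_type>/<image_id>-image-<uuid>, with an optional -tn suffix). The KeyParts struct, KEY_REGEX, and parse_key are illustrative names, not part of the original tool:

```ruby
# Hypothetical helper: splits an object key into the facts embedded in it.
KeyParts = Struct.new(:transaction_id, :media_type, :image_id, :thumbnail,
                      keyword_init: true)

KEY_REGEX = %r{
  \Atransactions/(?<transaction_id>\d+)/   # owning transaction
  (?<media_type>[^/]+)/                    # second path segment
  (?<image_id>\d+)-image-                  # application image id
  (?<rest>.+?)                             # uuid portion, not interpreted here
  (?<tn>-tn)?\z                            # optional thumbnail suffix
}x

# Returns a KeyParts, or nil when the key does not follow the expected shape.
def parse_key(key)
  m = KEY_REGEX.match(key)
  return nil unless m

  KeyParts.new(
    transaction_id: m[:transaction_id].to_i,
    media_type: m[:media_type],
    image_id: m[:image_id].to_i,
    thumbnail: !m[:tn].nil?
  )
end
```

Keys that do not match (other prefixes, other naming schemes) come back as nil, which makes it easy to count how much of the bucket falls outside the analysis.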

Rows land in a local Postgres table with indexes on (transaction_id, timestamp) and (remote_ip, timestamp). The first index supports “what is the access pattern for this transaction’s media.” The second supports the user-attribution join described next.
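The ingest step itself (parse each line, write rows in bulk) can be sketched as a small batching loop. This is an illustration under assumptions: ingest is a hypothetical name, the parser is anything that responds to parse(line) and returns a Hash or nil, and the block is where a bulk INSERT into the local Postgres table would go.

```ruby
# Hypothetical ingest loop: reads lines from any IO, parses them, and yields
# arrays of parsed rows so the caller can insert them in bulk.
def ingest(io, parser:, batch_size: 1_000)
  batch = []
  io.each_line do |line|
    row = parser.parse(line.chomp)
    next unless row # skip lines the parser could not match

    batch << row
    if batch.size >= batch_size
      yield batch
      batch = []
    end
  end
  yield batch unless batch.empty? # flush the final partial batch
end
```

Batching keeps memory flat across a year of log files and lets the database side use multi-row inserts rather than one round trip per log line.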

The user-attribution join (and what it can and cannot tell you)

S3 access logs do not include “user.” They include “remote IP.”

Before describing the join, I want to be explicit about what this technique is and is not suited for, because it is a tempting pattern that breaks if you push it too far.

What it is good for. Aggregate, per-tenant access pattern analysis: “what fraction of accesses on this company’s objects happen on data older than 30 days?” The signal is sound at the company level because companies have many users, many sessions, and many IPs, and the noise averages out. This is exactly what we needed for the archival policy work.

What it is bad for. Forensic per-user attribution: “exactly which user accessed this object at this moment.” Coffee-shop wifi gives multiple users the same IP. Mobile carriers rotate IPs faster than session activity. NAT’d corporate networks share IPs across hundreds of employees. VPNs concentrate traffic to a single egress. Treat per-row attribution as a hint, not a fact.

For each attribution we record a confidence band: high (single recent session match, distinctive IP), medium (multiple-candidate matches within the window, picked by recency), low (no UserActivity in the window, fell back to IP-only lookup against historical sessions). The downstream analyses use the band: tenant-level aggregations include all bands; anything resembling a forensic question filters to “high” only and discloses the residual.

With that scope, the join itself is straightforward:

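class AccessLogJoiner
  # Assumes a Rails environment: ActiveSupport for 5.minutes, and a
  # UserActivity model with ip_address, created_at, and a user association.
  HIGH_CONFIDENCE_WINDOW = 5.minutes

  Attribution = Struct.new(:user, :confidence, :method, keyword_init: true)

  def attribute_user(access_row)
    ts = access_row[:timestamp]
    # Quote the timestamp instead of interpolating it raw into the ORDER BY.
    # created_at is assumed to be stored in UTC (the Rails default).
    quoted_ts = UserActivity.connection.quote(ts.getutc.iso8601)

    # Sessions from the same IP inside the window, nearest in time first.
    # Loaded once so the count and the pick come from the same result set.
    candidates = UserActivity
      .where(ip_address: access_row[:remote_ip])
      .where(created_at: (ts - HIGH_CONFIDENCE_WINDOW)..(ts + HIGH_CONFIDENCE_WINDOW))
      .order(Arel.sql("ABS(EXTRACT(EPOCH FROM (created_at - #{quoted_ts}::timestamp)))"))
      .to_a

    # The IP-only fallback against historical sessions is not shown here.
    case candidates.size
    when 0 then Attribution.new(user: nil, confidence: :low, method: :no_session_match)
    when 1 then Attribution.new(user: candidates.first.user, confidence: :high, method: :unique_session)
    else        Attribution.new(user: candidates.first.user, confidence: :medium, method: :nearest_of_many)
    end
  end
end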

Once we have a user, we have a company (every user belongs to a company). Once we have a company, every access log row is tagged with: object key, transaction_id, transaction age at access, company, attribution confidence. The reports filter on confidence wherever it matters.
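How the reports use the confidence band can be sketched in a few lines. The row shape ({ company_id:, confidence: }) and both method names are assumptions for illustration; the original report code is not shown in this article.

```ruby
# Hypothetical row shape: { company_id:, confidence: }, where confidence is
# :high, :medium, or :low, matching the bands described above.

# Tenant-level aggregation: every band counts.
def accesses_per_company(rows)
  rows.group_by { |r| r[:company_id] }.transform_values(&:size)
end

# Forensic-style question: keep only :high rows and report the share of rows
# that were excluded, so the residual can be disclosed alongside the answer.
def high_confidence_only(rows)
  kept, dropped = rows.partition { |r| r[:confidence] == :high }
  residual = rows.empty? ? 0.0 : dropped.size.to_f / rows.size
  { rows: kept, residual_fraction: residual }
end
```

Returning the residual fraction with the filtered rows keeps the caveat attached to the result instead of leaving it in a footnote.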

What the data said

A year of S3 access logs against the live production database revealed access patterns that fundamentally changed the tiering plan we had drafted on Day 1.

The 15-day cliff. Roughly 90% of all GET requests targeted objects under 15 days old. Not 30 days. Not 60. Fifteen.

This is the kind of number that, once you see it, you cannot unsee. Objects between 16 and 30 days old made up 5% of accesses. Objects between 31 and 90 days old made up 3%. Objects older than 90 days made up under 2%, and almost all of those fell into two specific patterns (audit access by superadmins, and two customers’ month-end compliance pulls).
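The age-at-access breakdown behind those percentages can be sketched as follows. The row shape ({ accessed_at:, created_at: }) and the method names are assumptions; created_at stands for the transaction's creation time from the application database, and an object exactly 15 days old is placed in the first bucket, which is a choice made here rather than something the text specifies.

```ruby
SECONDS_PER_DAY = 86_400

# Age of the object, in whole days, at the moment it was accessed.
def age_in_days(accessed_at, created_at)
  ((accessed_at - created_at) / SECONDS_PER_DAY).floor
end

# Bucket boundaries mirror the ranges quoted in the text.
def age_bucket(days)
  if    days <= 15 then "0-15"
  elsif days <= 30 then "16-30"
  elsif days <= 90 then "31-90"
  else                  "90+"
  end
end

# Share of accesses per bucket, as fractions of all rows.
def bucket_shares(rows)
  counts = rows.map { |r| age_bucket(age_in_days(r[:accessed_at], r[:created_at])) }.tally
  counts.transform_values { |n| n.to_f / rows.size }
end
```

In practice the same grouping would run as a SQL aggregate over the joined table; the Ruby version is only meant to make the bucketing rule explicit.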

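The 15-day cliff meant we could be much more aggressive with the tiering than the conventional wisdom would have allowed. We had been planning “transition to Standard-IA at 30 days.” The data said objects were cold enough to leave the hot tier at 16 days, and very little user-facing latency would be affected. One constraint to plan around: S3 lifecycle rules will not transition an object to Standard-IA until it is at least 30 days old, so a day-16 cutoff informs the choice of target storage class rather than mapping directly onto a Standard-IA rule.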

Per-tenant access heterogeneity was enormous. Some tenants accessed their old data weekly. Most did not. The bottom 60% of tenants, ranked by access intensity, touched their own data older than 30 days less than once per month. The top 5% accessed data older than 30 days roughly 50x more often than the median tenant. A single tiering policy across all tenants was strictly worse than a per-tenant policy.
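A sketch of the per-tenant comparison, assuming rows of the hypothetical shape { company_id:, age_days: } (one per access) and a known observation period in months. Tenants with no old-data accesses at all do not appear in the result and would need to be added as zeros before ranking.

```ruby
# Accesses to objects older than `min_age_days`, per company per month.
def old_access_rate(rows, months:, min_age_days: 30)
  rows.select { |r| r[:age_days] > min_age_days }
      .group_by { |r| r[:company_id] }
      .transform_values { |rs| rs.size.to_f / months }
end

# Plain median; returns nil for an empty list.
def median(values)
  return nil if values.empty?

  sorted = values.sort
  mid = sorted.size / 2
  sorted.size.odd? ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0
end
```

Dividing each tenant's rate by the median gives the kind of multiple quoted above, and sorting the rates gives the percentile cutoffs.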

We did not have a way to express “different tier policy per tenant” in S3 lifecycle rules directly, but we could express it by key prefix. We re-keyed new objects with a tenant identifier prefix, and applied tenant-specific lifecycle policies via prefix-scoped rules. The migration of old objects to the new key shape was a separate (slow, careful) project; new objects were on the new scheme immediately.
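Prefix-scoped rules of this kind can be sketched as plain data. The hash below follows the shape that the AWS SDK for Ruby accepts as lifecycle_configuration in put_bucket_lifecycle_configuration; nothing in the sketch talks to AWS. The tenant prefixes, day counts, and helper names are illustrative assumptions, not the client's actual policy.

```ruby
# Builds one lifecycle rule scoped to a key prefix.
def lifecycle_rule(prefix:, days:, storage_class:)
  {
    id: "tier-#{prefix.delete_suffix('/').tr('/', '-')}-#{storage_class.downcase}",
    status: "Enabled",
    filter: { prefix: prefix },
    transitions: [{ days: days, storage_class: storage_class }]
  }
end

# Builds the full configuration from a list of per-prefix policies.
def lifecycle_configuration(policies)
  { rules: policies.map { |policy| lifecycle_rule(**policy) } }
end

# Hypothetical tenants: one that goes cold quickly, one that keeps reading.
config = lifecycle_configuration([
  { prefix: "tenants/acme/",    days: 30, storage_class: "STANDARD_IA" },
  { prefix: "tenants/busy-co/", days: 90, storage_class: "STANDARD_IA" }
])
```

Keeping the policy as data means the per-tenant report can generate it directly, and the resulting rules can be reviewed in a diff before anything is applied to the bucket.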

Thumbnails were the elephant. Roughly 35% of GET requests by count were against the -tn thumbnail variants. By byte count, thumbnails were under 5% of the bucket. Conclusion: thumbnails had by far the highest ratio of GETs to bytes stored, so tiering them would save little on storage while pushing a large share of reads onto a colder tier. The right archival posture for thumbnails was “never tier them; keep them in Standard, regenerate from source if lost.”
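The arithmetic behind that conclusion, as a small worked example. The method name is hypothetical, and the 5% byte share is treated as an upper bound, so the real ratio is at least what this computes.

```ruby
# How many times more GETs per stored byte a subset receives than the rest
# of the bucket, given the subset's share of GETs and its share of bytes.
def intensity_ratio(get_share:, byte_share:)
  subset = get_share / byte_share
  rest = (1 - get_share) / (1 - byte_share)
  subset / rest
end

# Thumbnails: 35% of GETs against at most 5% of bytes.
thumbnail_ratio = intensity_ratio(get_share: 0.35, byte_share: 0.05)
# thumbnail_ratio is a little over 10
```

A subset that is read about ten times as intensely per byte as everything else is the last thing to move to a tier with retrieval charges.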

This was the opposite of what we had assumed on Day 1, which was “tier everything by the same rule.”

The “compliance pull” pattern. Two tenants had a recurring pattern of pulling thousands of old objects at month-end. These were detected as access spikes against the 90+ day age bucket. We talked to the tenants directly, learned that the pulls were driven by their internal compliance review process, and built them a different access mechanism (an async export that pre-staged the objects out of Glacier into a customer-facing presigned-URL bundle, with one day’s notice). The pulls moved off the hot path entirely.

The per-tenant policy that came out of the data

The final policy, derived directly from the access-pattern report, had four prefix-scoped tiers:

Old objects were migrated to the new key shape by a background re-keying job. Each migrated object was validated against the access log database to ensure there was no surprise access pattern; the roughly 5% of tenants flagged as “weird” got individual review before automated migration.

Three months after the policy went live, the S3 bill on this bucket had dropped roughly 60%, and its growth had flattened from roughly linear to approximately flat (because old data tiers down as it ages).

What I would do differently

The takeaway

S3 access logs are an underused data product. They are a year of evidence about how your customers actually use your storage, sitting in a bucket nobody is looking at. With ~500 lines of Ruby and a local Postgres, you can join that evidence to your application database and produce a per-tenant access-pattern report that drives a real archival policy.

The cost of the analysis is one engineer for a week. The savings on a multi-hundred-terabyte bucket are usually significant. The savings on a petabyte-scale bucket are large enough to embarrass whoever did not look at the logs sooner.

If you are sitting on a bucket whose growth is outpacing your appetite for the bill, this is the cheapest first step. The answer to “which objects are cold” is in the logs. Go ingest them.

This was one of a body of similar storage engagements. Happy to talk about yours.