One-line hook

The integration brief was straightforward: keep our platform’s inventory state in sync with the customer’s ERP. The ERP’s integration story was equally straightforward: it would email us a flat file. Several times a day. Forever. Here is how we made that into a reliable, observable, idempotent pipeline — and what it taught us about working with software that predates HTTPS.

Who this is for

The setup (vague-client framing)

Fine.

Why “just email it” is actually a defensible protocol

Before sneering at this design, give it credit for the things it gets right: - Email is the most universally deliverable protocol on Earth. It crosses firewalls, NATs, VPNs, air-gapped DMZs, and IT departments that haven’t approved an outbound API call since 2014. - It’s asynchronous by default. The sender doesn’t need to know whether the receiver is up. - It has retries built in at the SMTP layer, with reasonably forgiving semantics. - It has authenticatable provenance (SPF, DKIM, DMARC — when configured, which is a real if). - It produces a human-readable audit trail. When something breaks, support can literally show you the email.

The reason it’s bad is the payload: a flat file, schema-by-convention, no machine-readable contract, sometimes with the file inline and sometimes attached, sometimes zipped and sometimes not.

The architecture

            (ERP cron job)
                  |
                  v
            [email server]
                  |
                  v
       ===========================
        AWS SES inbound endpoint
       ===========================
            /            \
       (raw msg)       (notify)
            |              |
            v              v
           S3            SNS -> SQS
                              |
                              v
                     [parser worker pool]
                              |
                              v
                   [per-record upsert queue]
                              |
                              v
                   [upsert worker pool]
                              |
                              v
                     [operational DB]

The interesting parts

1. The parsing layer is the entire game

The ERP vendor’s “CSV” is not a CSV. It is a flat file that is mostly a CSV, with occasional quirks: header rows sometimes present and sometimes not, encoding inconsistencies (Windows-1252 sneaking in alongside UTF-8 BOMs), date formats that vary by customer’s regional ERP install, and one specific customer whose ERP emits a tab-delimited file labeled .csv. We learned to: - Detect the format per-customer, persist that detection, and reject anything that doesn’t match — fail loudly rather than guess - Validate every row against a pydantic schema with strict types; route rejects to a dead-letter store with the rep’s email contact, not just to a log file - Treat the parsing layer as the only boundary where the ERP’s data quirks are allowed to live; downstream code sees clean, typed records

# Sketch of the parser adapter pattern
class ERPFeedAdapter(Protocol):
    customer_id: str
    expected_columns: list[str]
    date_format: str
    encoding: str

    def parse(self, raw_bytes: bytes) -> Iterator[FeedRecord]: ...

ADAPTERS: dict[str, ERPFeedAdapter] = {
    "customer-acme": AcmeERPAdapter(),
    "customer-globex": GlobexERPAdapter(),
    # ...
}

2. Idempotency is non-negotiable

SES retries. SQS retries. The ERP itself occasionally sends the same daily report twice (we suspect a cron job racing with itself on their end). The upsert layer assumes any individual record may arrive 1..N times. Strategy: - A deterministic idempotency key derived from (customer_id, record_natural_key, source_file_id, row_hash) - An idempotency table that records “we have applied this exact change,” with a TTL long enough to outlive any reasonable retry window - Upserts are diff-based: if the incoming record hashes the same as the stored record, we record the touch but don’t bump version numbers or fire change events

This eliminated an entire category of customer-visible bugs (“why did my system show a change at 3am when nothing actually changed?”).

3. The job queue topology matters

We initially had a single queue. This had two problems: - A noisy customer (large feed, frequent updates) starved the queue for everyone else - A poison message (parser fails) blocked all downstream work behind it

We split into: - Per-customer parser queues with bounded concurrency, so one customer’s feed never blocks another’s - A shared upsert queue with per-customer rate limiting at the consumer level - A dead-letter queue per stage with automated re-drive after operator review

4. The compound index that paid for the project

The upsert workers are doing, in essence: SELECT existing record WHERE (tenant_id, natural_key) -> diff -> UPDATE. At low volume, any reasonable index works. At the volume the largest customers run (hundreds of thousands of records per feed, multiple feeds per day), the query planner started preferring an index that was almost but not quite what we wanted, and the upsert phase started taking ~12 minutes for a feed that should take 90 seconds.

The fix was a single compound index (tenant_id, natural_key, status) INCLUDE (updated_at, version) — designed specifically so the query planner could satisfy the lookup and the diff comparison from the index alone, with no heap fetch. Postgres specifically: CREATE INDEX ... INCLUDE (...) is a covering-index feature that’s underused. Upsert phase went back to ~70 seconds.

The article will walk through: - How we noticed (worker queue depth climbing during peak feed windows) - How we diagnosed (EXPLAIN ANALYZE + pg_stat_statements + a small Python tool that replays representative queries) - Why the “obvious” index was wrong (it indexed the columns the WHERE clause used, but forced a heap fetch for the diff comparison) - How the INCLUDE clause changed the planner’s mind - The cost: increased index size, increased write amplification — accepted because the read-side win was overwhelming

-- The index that fixed it
CREATE INDEX CONCURRENTLY idx_records_upsert_lookup
  ON records (tenant_id, natural_key, status)
  INCLUDE (updated_at, version_hash);

5. Observability for a pipeline you don’t fully control

We can’t see inside the ERP. We can only see the side effects of its email-shaped contract. So we instrumented: - Time-since-last-feed per customer, with alerting when it exceeds expected cadence - Rolling 7-day per-customer feed size distributions (sudden drops are the loudest leading indicator of “their cron job is broken”) - Parse success rate per customer - End-to-end latency: SES received → last record upserted - A weekly summary email back to the customer’s integrations contact with these metrics, because the people who fix problems on the ERP side are not engineers and need a human-readable signal

6. The reconciliation job

Email is lossy in edge cases. A spam filter, a misconfigured DMARC policy, a hard bounce after a quota change — and you’ve silently lost a day’s feed. So once a night: - The ERP also dumps a “current state snapshot” file (separately negotiated, separately emailed) - A reconciliation worker compares snapshot to platform state per-customer - Drift produces a structured report; small drift auto-heals via re-upsert, large drift pages the integrations team

What we’d do differently

Results

The takeaway

Integrations with legacy systems are exercises in constraint engineering. The ERP isn’t going to change. The protocol isn’t going to change. What you control is parsing discipline, idempotency, queue topology, indexes, and observability. Build those five things well and you can build a production-grade integration on top of a protocol as primitive as email. Try to skip any of them and the pipeline will fail in ways your customer notices before you do.

The deeper lesson: any time someone says “we just need a quick integration with $LEGACY_SYSTEM,” double the estimate and put the reconciliation job on the critical path.