One-line hook

The integration brief was straightforward: keep our platform’s inventory state in sync with the customer’s ERP. The ERP’s integration story was equally straightforward: it would email us a flat file. Several times a day. Forever. Here is how we made that into a reliable, observable, idempotent pipeline — and what it taught us about working with software that predates HTTPS.

Who this is for

Engineers who have ever been told “the source of truth is whatever the ERP says”
Anyone integrating with software that has the word “Enterprise” in the name unironically
Teams who think “we’ll just use a webhook” and have not yet been disabused
Platform engineers who underestimate how much of real-world data integration is sad-but-reliable batch pipelines

The setup (vague-client framing)

Client: B2B SaaS platform serving operationally complex customers. The platform’s value depends on the customer’s operational data being current and trustworthy.
The customer-side reality: many of these customers run their inventory/finance on an ERP that was deployed before “cloud” was a marketing word. Some on-prem, some on a vendor’s hosted offering that wraps the same software.
The asks from the customer integrations team: keep tens of thousands of records in sync, multiple times a day, with the ERP as the system of record. Two-way sync was off the table — the ERP is read-only-ish for us. The platform writes back to the customer’s ops workflow but not back to the ERP.
What we tried to get from the ERP vendor: a REST API, an SFTP drop, a webhook, a database read replica, a SOAP endpoint, anything that wasn’t email. The answer was, in order: no, no, no, no, no, and “we have a feature that emails a report on a schedule.”

Fine.

Why “just email it” is actually a defensible protocol

Before sneering at this design, give it credit for the things it gets right: - Email is the most universally deliverable protocol on Earth. It crosses firewalls, NATs, VPNs, air-gapped DMZs, and IT departments that haven’t approved an outbound API call since 2014. - It’s asynchronous by default. The sender doesn’t need to know whether the receiver is up. - It has retries built in at the SMTP layer, with reasonably forgiving semantics. - It has authenticatable provenance (SPF, DKIM, DMARC — when configured, which is a real if). - It produces a human-readable audit trail. When something breaks, support can literally show you the email.

The reason it’s bad is the payload: a flat file, schema-by-convention, no machine-readable contract, sometimes with the file inline and sometimes attached, sometimes zipped and sometimes not.

The architecture

AWS SES inbound receives mail to a dedicated address per customer (feed-{customer_id}@feeds.platform.example)
SES → S3 drops the raw message into a per-customer prefix, encrypted, with retention
SES → SNS → SQS notifies a worker queue that a new message landed
A Python job consumer pulls from SQS, fetches the raw message from S3, parses it through a per-customer adapter, enqueues per-record upserts onto a separate job queue
The upsert workers apply changes to the operational database with idempotency keys and structured audit logging
A reconciliation job runs nightly to detect drift between the platform state and the most recent ERP feed

            (ERP cron job)
                  |
                  v
            [email server]
                  |
                  v
       ===========================
        AWS SES inbound endpoint
       ===========================
            /            \
       (raw msg)       (notify)
            |              |
            v              v
           S3            SNS -> SQS
                              |
                              v
                     [parser worker pool]
                              |
                              v
                   [per-record upsert queue]
                              |
                              v
                   [upsert worker pool]
                              |
                              v
                     [operational DB]

The interesting parts

1. The parsing layer is the entire game

The ERP vendor’s “CSV” is not a CSV. It is a flat file that is mostly a CSV, with occasional quirks: header rows sometimes present and sometimes not, encoding inconsistencies (Windows-1252 sneaking in alongside UTF-8 BOMs), date formats that vary by customer’s regional ERP install, and one specific customer whose ERP emits a tab-delimited file labeled .csv. We learned to: - Detect the format per-customer, persist that detection, and reject anything that doesn’t match — fail loudly rather than guess - Validate every row against a pydantic schema with strict types; route rejects to a dead-letter store with the rep’s email contact, not just to a log file - Treat the parsing layer as the only boundary where the ERP’s data quirks are allowed to live; downstream code sees clean, typed records

# Sketch of the parser adapter pattern
class ERPFeedAdapter(Protocol):
    customer_id: str
    expected_columns: list[str]
    date_format: str
    encoding: str

    def parse(self, raw_bytes: bytes) -> Iterator[FeedRecord]: ...

ADAPTERS: dict[str, ERPFeedAdapter] = {
    "customer-acme": AcmeERPAdapter(),
    "customer-globex": GlobexERPAdapter(),
    # ...
}

2. Idempotency is non-negotiable

SES retries. SQS retries. The ERP itself occasionally sends the same daily report twice (we suspect a cron job racing with itself on their end). The upsert layer assumes any individual record may arrive 1..N times. Strategy: - A deterministic idempotency key derived from (customer_id, record_natural_key, source_file_id, row_hash) - An idempotency table that records “we have applied this exact change,” with a TTL long enough to outlive any reasonable retry window - Upserts are diff-based: if the incoming record hashes the same as the stored record, we record the touch but don’t bump version numbers or fire change events

This eliminated an entire category of customer-visible bugs (“why did my system show a change at 3am when nothing actually changed?”).

3. The job queue topology matters

We initially had a single queue. This had two problems: - A noisy customer (large feed, frequent updates) starved the queue for everyone else - A poison message (parser fails) blocked all downstream work behind it

We split into: - Per-customer parser queues with bounded concurrency, so one customer’s feed never blocks another’s - A shared upsert queue with per-customer rate limiting at the consumer level - A dead-letter queue per stage with automated re-drive after operator review

4. The compound index that paid for the project

The upsert workers are doing, in essence: SELECT existing record WHERE (tenant_id, natural_key) -> diff -> UPDATE. At low volume, any reasonable index works. At the volume the largest customers run (hundreds of thousands of records per feed, multiple feeds per day), the query planner started preferring an index that was almost but not quite what we wanted, and the upsert phase started taking ~12 minutes for a feed that should take 90 seconds.

The fix was a single compound index (tenant_id, natural_key, status) INCLUDE (updated_at, version) — designed specifically so the query planner could satisfy the lookup and the diff comparison from the index alone, with no heap fetch. Postgres specifically: CREATE INDEX ... INCLUDE (...) is a covering-index feature that’s underused. Upsert phase went back to ~70 seconds.

The article will walk through: - How we noticed (worker queue depth climbing during peak feed windows) - How we diagnosed (EXPLAIN ANALYZE + pg_stat_statements + a small Python tool that replays representative queries) - Why the “obvious” index was wrong (it indexed the columns the WHERE clause used, but forced a heap fetch for the diff comparison) - How the INCLUDE clause changed the planner’s mind - The cost: increased index size, increased write amplification — accepted because the read-side win was overwhelming

-- The index that fixed it
CREATE INDEX CONCURRENTLY idx_records_upsert_lookup
  ON records (tenant_id, natural_key, status)
  INCLUDE (updated_at, version_hash);

5. Observability for a pipeline you don’t fully control

We can’t see inside the ERP. We can only see the side effects of its email-shaped contract. So we instrumented: - Time-since-last-feed per customer, with alerting when it exceeds expected cadence - Rolling 7-day per-customer feed size distributions (sudden drops are the loudest leading indicator of “their cron job is broken”) - Parse success rate per customer - End-to-end latency: SES received → last record upserted - A weekly summary email back to the customer’s integrations contact with these metrics, because the people who fix problems on the ERP side are not engineers and need a human-readable signal

6. The reconciliation job

Email is lossy in edge cases. A spam filter, a misconfigured DMARC policy, a hard bounce after a quota change — and you’ve silently lost a day’s feed. So once a night: - The ERP also dumps a “current state snapshot” file (separately negotiated, separately emailed) - A reconciliation worker compares snapshot to platform state per-customer - Drift produces a structured report; small drift auto-heals via re-upsert, large drift pages the integrations team

What we’d do differently

Build the reconciliation job first, not last. It’s the only thing that tells you the whole pipeline is actually working.
Per-customer adapters from day one. We started with one “generic” adapter and refactored into per-customer adapters once the third edge case hit. Should have started there.
Treat dead-lettering as a feature, not a failure mode. A row that can’t be parsed is information, not noise. Ours sits in S3 with a structured envelope and is queryable by the support team.
The compound index should have been in the original schema. We had the query pattern documented in the design doc. We didn’t translate the documented pattern into the index. Should be a checklist item.

Results

~99.9% feed delivery success rate over 18+ months (the gap is mostly customer-side cron failures)
End-to-end latency from email arrival to last record applied: median ~3 minutes for typical feeds, ~8 minutes for the largest
Zero data corruption incidents traced to the pipeline
The integrations team stopped getting paged for “feed seems stuck” because the alerting predicts it earlier than humans notice

The takeaway

Integrations with legacy systems are exercises in constraint engineering. The ERP isn’t going to change. The protocol isn’t going to change. What you control is parsing discipline, idempotency, queue topology, indexes, and observability. Build those five things well and you can build a production-grade integration on top of a protocol as primitive as email. Try to skip any of them and the pipeline will fail in ways your customer notices before you do.

The deeper lesson: any time someone says “we just need a quick integration with $LEGACY_SYSTEM,” double the estimate and put the reconciliation job on the critical path.