Shipment exception agents that scale to 10,000 events a day without breaking.

Your exception handling agent catches shipment delays, reroutes packages, and notifies customers before they notice. It works on 50 test events. It breaks at 5,000. Hatch gives it the infrastructure to handle 10,000+ events per day reliably.

The problem.

Shipment exception workflows have a partial-execution problem that makes naive retry dangerous. When a container is flagged as delayed, your agent runs three sequential steps: notify the customer, update the carrier manifest via EDI, update the warehouse receiving schedule. If the EDI call at step 2 fails after step 1 completes, a retry from the beginning sends a second customer notification — 'your shipment is delayed' — for the same event. At 50 exceptions a day, your ops team catches this manually. At 5,000, you have hundreds of duplicate notifications and no visibility into which records are corrupt.

Carrier API performance degrades at the same moments your exception volume spikes: during port disruptions, weather events, and peak season. Carrier tracking APIs, including FedEx and UPS, enforce per-credential rate limits; take 1,000 requests per minute as a working figure and check each carrier's current documentation for the exact quota. When your exception agent scales from 10 to 100 concurrent workflows during a disruption, it exhausts that quota within seconds and starts receiving 429s, and the exception queue fills faster than it can drain. The default Kubernetes HPA scales on CPU and memory utilization, so it cannot tell an overloaded pod from an exhausted carrier rate limit.

Ambiguous exceptions — delayed customs clearance, damaged goods with unclear liability, carrier-lost versus warehouse-lost — require a human decision before any action is taken. An agent that autonomously reroutes a shipment with unclear liability exposes your company to carrier disputes and customer chargebacks. The handoff to a human needs to carry full context: the original exception event, every API call the agent made, every data point it retrieved, and the specific reason it couldn't resolve the exception automatically. Without structured escalation, your ops team receives a Slack message that says 'exception on order 8842' and has to reconstruct the context themselves.

What Hatch handles.

Hatch persists agent state to a write-ahead log (WAL) after each step, recording the step index and a per-event idempotency key derived from the shipment ID and exception timestamp. When a retry fires, the agent checks the WAL: if step 1 (customer notification) is already marked complete for this idempotency key, it skips to step 2 without re-sending the notification. The customer receives exactly one message per exception event, regardless of how many times the workflow retries.
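To make the skip-on-retry behavior concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not Hatch's actual API: the names idempotency_key, StepLog, and run_workflow are hypothetical, and the in-memory set stands in for a durable log.

```python
import hashlib
from typing import Callable


def idempotency_key(shipment_id: str, exception_ts: str) -> str:
    """Derive a stable per-event key from the shipment ID and exception timestamp."""
    return hashlib.sha256(f"{shipment_id}|{exception_ts}".encode()).hexdigest()


class StepLog:
    """In-memory stand-in for a durable step log (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self._done = set()  # holds (idempotency key, step index) pairs

    def is_done(self, key: str, step: int) -> bool:
        return (key, step) in self._done

    def mark_done(self, key: str, step: int) -> None:
        self._done.add((key, step))


def run_workflow(key: str, steps: list[Callable[[], None]], log: StepLog) -> None:
    """Run steps in order, skipping any step already recorded as complete for this key."""
    for index, step in enumerate(steps):
        if log.is_done(key, index):
            continue  # completed on an earlier attempt, so do not repeat the side effect
        step()  # may raise; a failed step is never recorded as done
        log.mark_done(key, index)
```

If the EDI step raises on the first attempt, calling run_workflow again with the same key and log resumes at the EDI step and leaves the notification alone. One caveat of this simplified version: a crash between a step finishing and mark_done being recorded would still repeat that step, so a production design would also need the record to be written atomically with the step or the downstream call to be idempotent itself.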

Carrier API rate limits are managed by Hatch's rate-limit-aware scheduler. Each carrier credential is tracked with a rolling request counter. When the counter approaches the carrier's documented limit, the scheduler throttles new workflow starts for that carrier and queues them with priority ordering — high-severity exceptions first. The exception queue drains at the maximum sustainable rate without triggering 429s. No credential rotation hacks, no manual tuning during incidents.
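The scheduling idea above can be sketched as a sliding-window counter combined with a priority queue. This is a simplified illustration, not Hatch's scheduler: CarrierScheduler and its methods are hypothetical names, it tracks a single credential, and it assumes each workflow start costs exactly one carrier request.

```python
import heapq
import itertools
import time
from collections import deque
from typing import Callable, Optional


class CarrierScheduler:
    """Sliding-window request counter plus a severity-ordered queue for one credential."""

    def __init__(self, limit: int, window_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.limit = limit          # max requests allowed per window (assumed quota)
        self.window_s = window_s
        self.clock = clock          # injectable so the window can be tested
        self._sent = deque()        # timestamps of requests inside the window
        self._queue = []            # heap of (severity, sequence, workflow_id)
        self._seq = itertools.count()

    def submit(self, workflow_id: str, severity: int) -> None:
        """Queue a workflow; a lower severity number means more urgent."""
        heapq.heappush(self._queue, (severity, next(self._seq), workflow_id))

    def _in_window(self) -> int:
        """Drop timestamps older than the window and return the remaining count."""
        cutoff = self.clock() - self.window_s
        while self._sent and self._sent[0] <= cutoff:
            self._sent.popleft()
        return len(self._sent)

    def next_ready(self) -> Optional[str]:
        """Return the most urgent queued workflow, or None if throttled or empty."""
        if not self._queue or self._in_window() >= self.limit:
            return None
        _, _, workflow_id = heapq.heappop(self._queue)
        self._sent.append(self.clock())
        return workflow_id
```

A worker loop would call next_ready() and sleep briefly whenever it returns None. The sequence counter keeps workflows of equal severity in arrival order. Setting limit somewhat below the carrier's real quota leaves headroom for requests made outside the scheduler.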

Human escalation workflows carry the full agent context as a structured payload: the raw exception event, each API call result, the confidence score for each candidate resolution, and the specific condition that triggered escalation. Hatch delivers this payload to your ops system by webhook, so the operator sees a task with everything they need to make a decision. When they act, their decision is posted back to the workflow via a webhook and logged with their user ID, and the agent resumes from the step after escalation. No reconstruction, no context loss.
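One way to picture the escalation payload is as a small set of typed records serialized to JSON. The field names below are assumptions chosen to mirror the description above; they are not a documented Hatch schema.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any


@dataclass
class ApiCall:
    endpoint: str               # which API the agent called
    status: int                 # HTTP status it received
    response: dict[str, Any]    # payload the agent retrieved


@dataclass
class Candidate:
    action: str                 # a resolution the agent considered
    confidence: float           # the agent's confidence in that resolution


@dataclass
class Escalation:
    exception_event: dict[str, Any]   # the raw event that started the workflow
    api_calls: list[ApiCall]          # every call made before escalating
    candidates: list[Candidate]
    trigger: str                      # the condition that forced escalation
    resume_step: int                  # step index to resume from after the decision

    def to_json(self) -> str:
        """Serialize the full context for delivery as a webhook body."""
        return json.dumps(asdict(self), sort_keys=True)
```

The operator's decision can then be posted back with the same workflow identifier and a user ID, and the workflow picks up at resume_step.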

Event-driven autoscaling watches the exception topic lag on your Kafka cluster. When lag exceeds a configured threshold — indicating a disruption event — Hatch scales agent replicas up within 90 seconds. When lag clears, replicas scale down on a configured cooldown. The scaler is aware of carrier API rate limits and caps replica count to the maximum number of concurrent workflows the available credentials can sustain. You do not manually provision during port disruptions.
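The replica calculation can be sketched as a lag-driven target with a rate-limit cap. This is a minimal illustration with hypothetical parameter names; it leaves out the cooldown on scale-down and assumes the sustainable workflow count has already been derived from the available credentials.

```python
import math


def target_replicas(lag: int, lag_per_replica: int, min_replicas: int,
                    workflows_per_replica: int, sustainable_workflows: int) -> int:
    """Replicas needed to clear the lag, capped by what the carrier credentials can sustain."""
    # Scale with consumer lag, but never drop below the configured floor.
    wanted = max(min_replicas, math.ceil(lag / lag_per_replica))
    # Adding replicas beyond this point would only produce 429s, not throughput.
    cap = max(min_replicas, sustainable_workflows // workflows_per_replica)
    return min(wanted, cap)
```

With a floor of 2 replicas, 500 events of lag per replica, 10 workflows per replica, and credentials that sustain 100 concurrent workflows, a lag of 3,000 asks for 6 replicas, while a lag of 50,000 is held at the cap of 10.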

Agents that run on Hatch.

Shipment exception handler

Consumes exception events from a Kafka topic, classifies by type and severity using a rules engine plus an LLM for ambiguous cases, executes the resolution workflow (customer notification via SES, carrier manifest update via EDI API, warehouse schedule update via REST), and escalates to an ops queue via webhook when confidence is below threshold — with the full execution trace attached.

10,000+ exception events/day with sub-5-minute response

Route optimizer

Subscribes to a GPS event stream from driver devices, detects route deviations against a geofenced expected path, calls the routing API (Google Maps Platform or HERE) for recalculation, and pushes updated routes to driver devices via a mobile push gateway — with exponential backoff when the mapping API is slow and automatic escalation if the driver has been off-route beyond a configured threshold.

Continuous optimization across all active routes

Carrier ops agent

Polls carrier tracking APIs on a per-shipment schedule derived from expected delivery window, detects status transitions, updates the internal shipment record, triggers downstream workflows (customer notification, billing events, returns processing), and writes a structured interaction log with carrier response payload, latency, and HTTP status for every API call.

500+ carrier interactions/day across multiple providers
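The structured interaction log written by the carrier ops agent could look something like the sketch below. The record fields follow the description above (carrier response payload, latency, HTTP status); the names InteractionLog and logged_call are hypothetical, and the carrier call is passed in as a plain function so no specific carrier client is assumed.

```python
import json
import time
from dataclasses import asdict, dataclass
from typing import Any, Callable


@dataclass
class InteractionLog:
    carrier: str
    shipment_id: str
    http_status: int
    latency_ms: float
    response: dict[str, Any]


def logged_call(carrier: str, shipment_id: str,
                call: Callable[[], tuple[int, dict[str, Any]]],
                sink: Callable[[str], None] = print) -> InteractionLog:
    """Time one carrier call and emit a JSON log line with status, latency, and payload."""
    start = time.perf_counter()
    status, body = call()  # the caller supplies the actual HTTP request
    latency_ms = (time.perf_counter() - start) * 1000.0
    entry = InteractionLog(carrier, shipment_id, status, latency_ms, body)
    sink(json.dumps(asdict(entry)))  # one JSON object per line
    return entry
```

Writing one JSON object per call keeps the log easy to query later, for example to compare latency across carriers during a disruption.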

The 2-week PoC.

Take your existing delay notification or exception handling workflow. Deploy it as a Hatch agent. In two weeks, it handles real exception volume with idempotent step execution, carrier rate-limit-aware scaling, and structured escalation to your ops team — with zero duplicate customer notifications under retry conditions.

1,000+ shipment exceptions/day handled with zero duplicate customer notifications — idempotency verified under forced-retry test conditions

Carrier API rate limits respected under 10x load — no 429s, queue drains at maximum sustainable throughput

Human escalation delivers full structured execution trace to ops system — no manual context reconstruction

Kafka consumer lag drives autoscaling — replica count responds within 90 seconds of disruption event onset

Why now.

OTIF (On-Time In-Full) penalties from major retailers — Walmart, Target, Amazon — are calculated automatically from carrier scan data. A shipment exception that goes unhandled for four hours because your agent queue is backed up is a direct deduction on your next invoice. The financial exposure from exception handling latency is calculable: your current manual SLA versus an automated sub-5-minute response, multiplied by your OTIF penalty rate. That number is why your ops director is asking for a production deployment, not another demo.

Have an agent stuck in staging?

Tell us what it does and where it's stuck. We'll scope a 2-week PoC and show you what production looks like.

book a call →