Shipment exception agents that scale to 10,000 events a day without breaking.
Your exception-handling agent catches shipment delays, reroutes packages, and notifies customers before they notice. It works on 50 test events. It breaks at 5,000. Hatch gives it the infrastructure to handle 10,000+ events per day reliably.
The problem.
Shipment exception workflows have a partial-execution problem that makes naive retry dangerous. When a container is flagged as delayed, your agent runs three sequential steps: notify the customer, update the carrier manifest via EDI, update the warehouse receiving schedule. If the EDI call at step 2 fails after step 1 completes, a retry from the beginning sends a second customer notification — 'your shipment is delayed' — for the same event. At 50 exceptions a day, your ops team catches this manually. At 5,000, you have hundreds of duplicate notifications and no visibility into which records are corrupt.
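One way to make the retry safe is to record each completed step under a key of event ID plus step name, and skip recorded steps on retry. The following is a minimal sketch under stated assumptions, not Hatch's implementation: StepLedger, run_workflow, and the step names are hypothetical, and the in-memory set stands in for a durable store.

```python
from typing import Callable, List, Set, Tuple

Step = Tuple[str, Callable[[str], None]]


class StepLedger:
    """Hypothetical record of completed (event_id, step_name) pairs.

    In-memory for illustration only; a real deployment would need a
    durable store that survives process restarts.
    """

    def __init__(self) -> None:
        self._done: Set[Tuple[str, str]] = set()

    def is_done(self, event_id: str, step: str) -> bool:
        return (event_id, step) in self._done

    def mark_done(self, event_id: str, step: str) -> None:
        self._done.add((event_id, step))


def run_workflow(event_id: str, steps: List[Step], ledger: StepLedger) -> List[str]:
    """Run steps in order, skipping any already recorded as complete.

    Returns the names of the steps executed in this invocation. If a step
    raises, the exception propagates and that step is not recorded, so a
    later retry resumes from it rather than from the beginning.
    """
    executed: List[str] = []
    for name, action in steps:
        if ledger.is_done(event_id, name):
            continue  # already completed in an earlier attempt
        action(event_id)
        ledger.mark_done(event_id, name)
        executed.append(name)
    return executed
```

With this shape, a retry after an EDI failure at step 2 skips the customer notification and resumes at the manifest update. A crash between a side effect and the ledger write can still cause one repeat, so the downstream call would ideally carry the same key as well.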
Carrier API performance degrades predictably at the same time your exception volume spikes: during port disruptions, weather events, and peak season. FedEx and UPS tracking APIs enforce rate limits per credential, on the order of 1,000 requests per minute. When your exception agent scales from 10 to 100 concurrent workflows during a disruption event, it hits the rate limit within seconds, starts receiving 429s, and your exception queue backs up faster than the backpressure logic can drain it. Standard Kubernetes HPA scales on resource metrics such as CPU by default, so it can't tell CPU load apart from carrier API rate-limit exhaustion.
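A token bucket is one common way to keep outbound calls under a per-credential limit regardless of how many workflows are running. The sketch below is illustrative only: TokenBucket is a hypothetical helper, the rate is whatever your carrier allows, and a single-process bucket like this would need to be replaced by a shared limiter once calls come from many workers.

```python
import time
from typing import Callable


class TokenBucket:
    """Hypothetical client-side limiter for one carrier credential."""

    def __init__(
        self,
        rate_per_minute: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate_per_minute / 60.0  # tokens added per second
        self.capacity = capacity            # maximum burst size
        self.tokens = capacity
        self.clock = clock
        self.last = clock()

    def _refill(self) -> None:
        # Add tokens for the time elapsed since the last check, up to capacity.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def try_acquire(self) -> bool:
        """Take one token if available; return False instead of blocking."""
        self._refill()
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

    def seconds_until_available(self) -> float:
        """How long a caller should wait before the next token exists."""
        self._refill()
        if self.tokens >= 1.0:
            return 0.0
        return (1.0 - self.tokens) / self.rate
```

A worker would call try_acquire() before each carrier request and, on False, requeue the event with a delay of seconds_until_available() rather than sending a request that is likely to come back as a 429.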
Ambiguous exceptions — delayed customs clearance, damaged goods with unclear liability, carrier-lost versus warehouse-lost — require a human decision before any action is taken. An agent that autonomously reroutes a shipment with unclear liability exposes your company to carrier disputes and customer chargebacks. The handoff to a human needs to carry full context: the original exception event, every API call the agent made, every data point it retrieved, and the specific reason it couldn't resolve the exception automatically. Without structured escalation, your ops team receives a Slack message that says 'exception on order 8842' and has to reconstruct the context themselves.
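The context listed above maps naturally onto a structured payload. This is a sketch of what such a payload could look like, not Hatch's actual schema: the Escalation and ApiCall classes and all of their field names are assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List


@dataclass
class ApiCall:
    """One call the agent made while working the exception (hypothetical shape)."""
    target: str                 # e.g. "carrier_tracking"
    request_summary: str
    status: int                 # HTTP status returned
    response: Dict[str, Any]


@dataclass
class Escalation:
    """Everything a human needs to decide without reconstructing context."""
    event: Dict[str, Any]       # the original exception event
    reason: str                 # why the agent could not resolve it automatically
    api_calls: List[ApiCall] = field(default_factory=list)
    retrieved_data: Dict[str, Any] = field(default_factory=dict)

    def to_json(self) -> str:
        # asdict() recurses into the nested ApiCall dataclasses.
        return json.dumps(asdict(self), sort_keys=True)
```

The JSON string is what would be posted to the ops queue, so the reviewer sees the event, the trace, and the stated reason in one message.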
What Hatch handles.
Agents that run on Hatch.
Shipment exception handler
Consumes exception events from a Kafka topic, classifies by type and severity using a rules engine plus an LLM for ambiguous cases, executes the resolution workflow (customer notification via SES, carrier manifest update via EDI API, warehouse schedule update via REST), and escalates to an ops queue via webhook when confidence is below threshold — with the full execution trace attached.
10,000+ exception events/day with sub-5-minute response
Route optimizer
Subscribes to a GPS event stream from driver devices, detects route deviations against a geofenced expected path, calls the routing API (Google Maps Platform or HERE) for recalculation, and pushes updated routes to driver devices via a mobile push gateway — with exponential backoff when the mapping API is slow and automatic escalation if the driver has been off-route beyond a configured threshold.
Continuous optimization across all active routes
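The exponential backoff mentioned for the route optimizer can be sketched as follows. This is an illustrative helper, not Hatch's retry logic: call_with_backoff and its parameters are hypothetical, it uses the "full jitter" variant (a random delay up to the exponential cap), and real code would catch only transient errors rather than every exception.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> T:
    """Call fn, retrying on failure with exponentially growing, jittered delays."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            # Out of attempts: surface the last error to the caller,
            # which is the point where an escalation could be raised.
            if attempt == max_attempts - 1:
                raise
            # The cap doubles each attempt: base, 2*base, 4*base, ... up to max_delay.
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(rng() * cap)  # full jitter: uniform delay in [0, cap)
    raise ValueError("max_attempts must be at least 1")
```

The sleep and rng parameters exist so the behaviour can be tested without waiting; in production the defaults apply.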
Carrier ops agent
Polls carrier tracking APIs on a per-shipment schedule derived from expected delivery window, detects status transitions, updates the internal shipment record, triggers downstream workflows (customer notification, billing events, returns processing), and writes a structured interaction log with carrier response payload, latency, and HTTP status for every API call.
500+ carrier interactions/day across multiple providers
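A structured interaction log of the kind described for the carrier ops agent could be produced by wrapping each carrier call. The sketch below is an assumption about shape only: logged_call and the record's field names are hypothetical, and the wrapped call is assumed to return an HTTP status and a decoded payload.

```python
import json
import time
from typing import Any, Callable, Dict, Tuple


def logged_call(
    carrier: str,
    shipment_id: str,
    call: Callable[[], Tuple[int, Dict[str, Any]]],
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Run one carrier API call and return a JSON log line describing it."""
    start = clock()
    status, payload = call()
    latency_ms = round((clock() - start) * 1000.0, 1)
    record = {
        "carrier": carrier,
        "shipment_id": shipment_id,
        "http_status": status,
        "latency_ms": latency_ms,
        "response": payload,  # carrier response payload, stored as received
    }
    return json.dumps(record, sort_keys=True)
```

One line per call keeps the log easy to filter by carrier, status, or latency when a provider starts misbehaving.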
The 2-week PoC.
Take your existing delay notification or exception handling workflow. Deploy it as a Hatch agent. In two weeks, it handles real exception volume with idempotent step execution, carrier rate-limit-aware scaling, and structured escalation to your ops team — with zero duplicate customer notifications under retry conditions.
Why now.
OTIF (On-Time In-Full) penalties from major retailers such as Walmart, Target, and Amazon are calculated automatically from carrier scan data. A shipment exception that goes unhandled for four hours because your agent queue is backed up is a direct deduction on your next invoice. The financial exposure from exception-handling latency is calculable: take the shipments that miss their window under your current manual response time but would make it with an automated sub-5-minute response, and multiply by your OTIF penalty rate. That number is why your ops director is asking for a production deployment, not another demo.
Have an agent stuck in staging?
Tell us what it does and where it's stuck. We'll scope a 2-week PoC and show you what production looks like.
book a call →