A standard runtime for every internal agent your engineering team ships.

Your engineering team has built six internal agents. Support triage. Code review. Incident response. Each one was built by a different team, deployed differently, and breaks in its own unique way. Hatch gives them all a standard runtime.

The problem.

The incident responder your SRE team built runs as a cron job that polls PagerDuty every 60 seconds, calls the Slack API to create a channel, and posts a message with runbook links. When PagerDuty's API is slow and the cron fires twice before the first run finishes, two incident channels are created for the same alert. The SRE team knows about this; they added a distributed lock in Redis to prevent it. The support triage agent your customer success team built has the same race condition — they don't know about it yet because ticket volume hasn't been high enough to trigger it in production.

Observability across internal agents is either nonexistent or siloed. The incident responder emits metrics to a team-specific Datadog dashboard. The support triage agent logs to stdout captured by Fluentd into a log bucket that nobody monitors. The code review agent fails silently when the GitHub API rate-limits it — pull requests just don't get reviewed, and no alert fires. Your platform team has no single view of which agents are running, which are failing, and what the error rates are. Every incident is diagnosed by the team that built the specific agent.

When an agent breaks and the team that built it has moved on, the operational cost surfaces suddenly. The support triage agent was built six months ago by two engineers who are now on a different product. It's been running on a Kubernetes Deployment with no health checks, no circuit breakers, and a hard-coded Slack webhook token that's about to expire. When it breaks — and it will, because the Linear API changed its pagination format in a recent update — the ops team is reverse-engineering undocumented application code under production pressure.

What Hatch handles.

Hatch enforces exactly-once execution per workflow key using a WAL-backed idempotency layer. PagerDuty alert IDs are used as workflow keys. If the cron fires twice for the same alert before the first run completes, the second invocation finds an in-progress record for that alert ID and exits without creating a duplicate channel. The distributed lock is built into the runtime, not re-implemented per agent.
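To make the duplicate-trigger behavior concrete, here is a minimal sketch of keyed deduplication. It is an illustration under stated assumptions, not Hatch's implementation: IdempotencyStore, try_claim, and handle_alert are hypothetical names, and an in-memory dict stands in for the durable, WAL-backed record store.

```python
import threading


class IdempotencyStore:
    """In-memory stand-in (hypothetical) for a durable, WAL-backed record store."""

    def __init__(self):
        self._lock = threading.Lock()
        self._records = {}  # workflow key -> "in_progress" | "completed"

    def try_claim(self, key):
        """Atomically claim a workflow key; return False if a record already exists."""
        with self._lock:
            if key in self._records:
                return False
            self._records[key] = "in_progress"
            return True

    def complete(self, key):
        """Mark the workflow for this key as finished."""
        with self._lock:
            self._records[key] = "completed"


def handle_alert(store, alert_id, create_channel):
    """Create a channel only if no workflow has claimed this alert ID yet."""
    if not store.try_claim(alert_id):
        return "skipped"  # an in-progress or completed record exists
    create_channel(alert_id)
    store.complete(alert_id)
    return "created"


# Simulate the cron firing twice for the same alert.
store = IdempotencyStore()
channels = []
first = handle_alert(store, "ALERT-123", channels.append)
second = handle_alert(store, "ALERT-123", channels.append)
```

In this sketch the second call returns "skipped" and only one channel is recorded. It deliberately leaves out what a real runtime needs: persistence across restarts and recovery of a claim left in progress by a crashed run.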

Every agent running on Hatch exposes a standard set of metrics to your existing Prometheus instance: workflow start rate, completion rate, error rate by step, p50/p95/p99 step latency, and queue depth. These are the same metric names across every agent, so your existing Grafana dashboards work without per-agent configuration. One dashboard covers all six internal agents. Alerting rules are defined once at the platform level.
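The value of shared metric names is easier to see in a small example. The sketch below summarizes step runs for two agents under identical keys, so one dashboard query could cover both. The metric names and the summarize_steps helper are assumptions made for illustration; the actual names are not listed here. In a Prometheus setup, percentiles would normally be derived from histogram buckets at query time rather than computed in-process like this.

```python
from statistics import quantiles


def summarize_steps(agent, latencies_s, errors):
    """Summarize one agent's step runs under shared, hypothetical metric names."""
    # quantiles(..., n=100) returns 99 cut points; index 49 is p50, 94 is p95, 98 is p99.
    cuts = quantiles(latencies_s, n=100, method="inclusive")
    return {
        "agent": agent,
        "step_error_rate": errors / len(latencies_s),
        "step_latency_p50_seconds": cuts[49],
        "step_latency_p95_seconds": cuts[94],
        "step_latency_p99_seconds": cuts[98],
    }


triage = summarize_steps("support-triage", [0.2, 0.4, 0.3, 0.9, 0.5], errors=1)
incident = summarize_steps("incident-responder", [1.0, 1.2, 0.8, 2.5], errors=0)

# Both agents report exactly the same metric names; only the label value differs.
same_names = triage.keys() == incident.keys()
```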

Teams deploy agents by writing a hatch.yaml that declares steps, retry policy, approval gates, and autoscaling config. The hatch.yaml is reviewed by your DevOps team like any other infrastructure change. When the Linear API changes its pagination format, the fix is a one-line change to the code and a standard deployment, not a debugging session into a bespoke deployment that only the original author understands. Your DevOps team can operate any agent on the platform without tribal knowledge.

Human escalation is a first-class primitive available to every agent on the platform. The code review agent can pause before posting a critical security finding and wait for a senior engineer's approval via a Slack message with approve/reject buttons. The approval is handled by Hatch's signal API, logged with the approver's identity and timestamp, and the workflow resumes. The escalation pattern is identical across all agents — once your DevOps team understands it for one agent, they understand it for all of them.
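The pause, approve, resume cycle can be sketched as a small state machine. This is a simplified illustration, not Hatch's signal API: ApprovalGate, signal, and post_finding are hypothetical names, and the Slack button click is reduced to a direct method call.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ApprovalGate:
    """Holds a finding until a human approves or rejects it (hypothetical type)."""
    finding: str
    state: str = "waiting"
    audit_log: list = field(default_factory=list)

    def signal(self, approver, approved):
        """Resolve the gate once, recording who decided and when."""
        if self.state != "waiting":
            raise RuntimeError("gate already resolved")
        self.state = "approved" if approved else "rejected"
        self.audit_log.append({
            "approver": approver,
            "decision": self.state,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })
        return self.state


def post_finding(gate, post):
    """Post the finding only after approval; otherwise leave the workflow paused."""
    if gate.state != "approved":
        return False
    post(gate.finding)
    return True


posted = []
gate = ApprovalGate("critical: hard-coded credential in config.py")
before = post_finding(gate, posted.append)   # still waiting, nothing is posted
gate.signal("senior-engineer@example.com", approved=True)
after = post_finding(gate, posted.append)    # approved, the finding is posted
```

Because the gate carries its own audit log, the same structure would serve any agent that needs a human decision before an irreversible step.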

Agents that run on Hatch.

Support triage agent

Pulls new tickets from the Linear or Zendesk API, runs a classification model to assign severity and category, routes to the correct team queue, posts an initial acknowledgment via the Slack API, and escalates to a human when the model confidence is below threshold — with the full ticket content and classification reasoning attached to the escalation task.

All incoming support tickets, continuous processing
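The confidence-based escalation in the support triage agent comes down to one branch. The sketch below shows it under assumptions: the Classification fields, the route function, and the 0.8 threshold are all hypothetical, and the classification model itself is out of scope.

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.8  # hypothetical value; tune per team


@dataclass
class Classification:
    """Output of the (unspecified) classification model."""
    severity: str
    category: str
    confidence: float
    reasoning: str


def route(ticket_text, result, threshold=CONFIDENCE_THRESHOLD):
    """Route confident classifications; escalate the rest with full context."""
    if result.confidence < threshold:
        # Attach the ticket content and the model's reasoning to the escalation task.
        return {
            "action": "escalate",
            "ticket": ticket_text,
            "reasoning": result.reasoning,
            "confidence": result.confidence,
        }
    return {
        "action": "route",
        "queue": result.category,
        "severity": result.severity,
    }


confident = route(
    "Checkout page returns 500",
    Classification("high", "payments", 0.93, "mentions checkout and a 5xx error"),
)
unsure = route(
    "Something looks off",
    Classification("low", "general", 0.41, "no product area identified"),
)
```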

Incident responder

Receives PagerDuty webhooks, deduplicates against in-flight workflows by alert ID, creates a Slack incident channel, posts relevant runbook links and recent deployment history, pages the on-call engineer through the PagerDuty API, and writes an incident record to your internal ops database. Channel creation is idempotent: one channel per alert, regardless of webhook delivery count.

Real-time processing across all production alerts

Code review agent

Subscribes to GitHub PR webhooks, calls the diff API, runs security and style analysis, posts structured review comments via the GitHub Reviews API, and pauses before posting critical findings to wait for a senior engineer's approval via a Slack interactive message. The approval is logged with the approver's GitHub identity. GitHub rate limits are handled with per-installation credential rotation.

All pull requests across the engineering org

The 2-week PoC.

Take one internal agent — your support triage bot, your incident responder, whatever is closest to production. Deploy it on Hatch. In two weeks, it runs with idempotent execution, standard Prometheus metrics in your existing dashboards, and a deployment your DevOps team can operate without contacting the team that built it.

Agent running in production under DevOps ownership — deployed via hatch.yaml, operated with kubectl, no tribal knowledge required

Standard Prometheus metrics emitting to existing Grafana dashboards — workflow rate, error rate, p95 step latency

Idempotency verified under duplicate-trigger test conditions — zero duplicate Slack channels, tickets, or PR comments

Human escalation path tested end-to-end — approval logged with identity and timestamp, workflow resumed from correct step

Why now.

Every month, another internal agent gets built with another bespoke retry mechanism and another team-specific observability setup. The compounding cost is not the agents themselves — it's the first production incident for each one, where an engineer who didn't build it has to diagnose it under pressure. If you standardize the runtime after building six agents, you have six migration projects. If you standardize now, the next twenty agents deploy the same way the first one did.

Have an agent stuck in staging?

Tell us what it does and where it's stuck. We'll scope a 2-week PoC and show you what production looks like.

book a call →