Built a production incident response agent with LangGraph: the interrupt() checkpoint pattern was the key
I want to share a pattern we used in production that I hadn't seen well-documented: fully durable human-in-the-loop approval using LangGraph's interrupt() + AsyncPostgresSaver.
The problem: We built IRAS, an autonomous incident response agent. One of the nodes generates a remediation plan and needs a human to approve it before anything touches production. The naive approach is polling: keep checking a database flag until the human clicks approve. But polling breaks if the server restarts mid-incident. You lose state, lose context, and the on-call engineer is staring at a dead Slack message.
What interrupt() actually does: When the approval node calls interrupt(), LangGraph doesn't just pause in memory. It halts the run, and the graph state for that thread is persisted by the checkpointer (in our case, AsyncPostgresSaver writing to PostgreSQL). The process can die. The server can redeploy. The incident state is safe in Postgres.
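When the engineer hits POST /incidents/{id}/approve, the API loads the checkpoint for the same thread_id, sends a Command(resume={"approved": True}), and the graph picks up at the interrupted node with the same state, without re-running prior stages. One caveat: on resume the approval node itself re-executes from its first line, and interrupt() returns the resume value instead of pausing, so anything you put before the interrupt() call should be idempotent.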
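```python
from langgraph.types import interrupt

# In the approval node
def approval_node(state: dict) -> dict:
    # Surface the plan to the reviewer; the payload shows up on the interrupted thread
    human_decision = interrupt({"message": "Approve remediation plan?", "plan": state["plan"]})
    # Execution stops here until Command(resume=...) is sent;
    # human_decision is then whatever was passed as resume
    if human_decision["approved"]:
        return {"next": "apply_remediation"}
    else:
        return {"next": "escalation"}
```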
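```python
from fastapi import FastAPI
from langgraph.types import Command

app = FastAPI()

# In the FastAPI route
@app.post("/incidents/{incident_id}/approve")
async def approve_incident(incident_id: str):
    # `graph` is the compiled graph with the AsyncPostgresSaver checkpointer, built elsewhere
    await graph.ainvoke(
        Command(resume={"approved": True}),
        config={"configurable": {"thread_id": incident_id}},  # thread_id == incident_id
    )
```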
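If it helps to see why keying everything on thread_id makes a restart a non-event, here is a toy model of the idea using only the standard library. To be clear about assumptions: this is not how LangGraph or AsyncPostgresSaver is implemented, sqlite3 stands in for PostgreSQL, and `ToyCheckpointer`, `request_approval`, `resume`, and `demo` are all made-up names for illustration.

```python
import json
import os
import sqlite3
import tempfile
from typing import Optional


class ToyCheckpointer:
    """Hypothetical stand-in for a durable checkpointer (sqlite3, not Postgres)."""

    def __init__(self, path: str) -> None:
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS checkpoints ("
            "thread_id TEXT PRIMARY KEY, state TEXT NOT NULL)"
        )
        self.conn.commit()

    def save(self, thread_id: str, state: dict) -> None:
        self.conn.execute(
            "INSERT OR REPLACE INTO checkpoints (thread_id, state) VALUES (?, ?)",
            (thread_id, json.dumps(state)),
        )
        self.conn.commit()

    def load(self, thread_id: str) -> Optional[dict]:
        row = self.conn.execute(
            "SELECT state FROM checkpoints WHERE thread_id = ?", (thread_id,)
        ).fetchone()
        return json.loads(row[0]) if row else None

    def close(self) -> None:
        self.conn.close()


def request_approval(cp: ToyCheckpointer, incident_id: str, plan: str) -> None:
    # Persist everything needed to continue, then stop. Nothing lives in memory.
    cp.save(incident_id, {"plan": plan, "status": "awaiting_approval"})


def resume(cp: ToyCheckpointer, incident_id: str, approved: bool) -> dict:
    # Rebuild from the stored checkpoint and route on the human decision.
    state = cp.load(incident_id)
    if state is None or state["status"] != "awaiting_approval":
        raise LookupError(f"no pending approval for {incident_id}")
    state["status"] = "resolved"
    state["next"] = "apply_remediation" if approved else "escalation"
    cp.save(incident_id, state)
    return state


def demo() -> dict:
    db_path = os.path.join(tempfile.mkdtemp(), "checkpoints.db")

    before = ToyCheckpointer(db_path)
    request_approval(before, "INC-1", "restart payment-worker")
    before.close()  # simulate the server dying or being redeployed

    after = ToyCheckpointer(db_path)  # a "new process" with no shared memory
    result = resume(after, "INC-1", approved=True)
    after.close()
    return result
```

The second `ToyCheckpointer` shares nothing with the first except the database file and the incident ID, which is the same property the real setup relies on.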
Why this matters for production: The graph survives restarts, deployments, and crashes. Approval SLA timeouts (we do 15min for P0, 2hr for P1–P3) are handled by a background monitor that queries PostgreSQL for interrupted threads past their deadline; no in-memory state is required.
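The deadline check itself is simple once the pending rows are in hand. Here is a rough sketch of that part only. The SLA values come from above; the row shape (`incident_id`, `severity`, `interrupted_at`) and the function name are assumptions for illustration, and the actual query against the checkpoint tables isn't shown.

```python
from datetime import datetime, timedelta, timezone

# SLA windows as described in the post: 15 min for P0, 2 hr for P1-P3.
APPROVAL_SLA = {
    "P0": timedelta(minutes=15),
    "P1": timedelta(hours=2),
    "P2": timedelta(hours=2),
    "P3": timedelta(hours=2),
}


def expired_approvals(pending: list, now: datetime) -> list:
    """Return incident IDs whose approval window has passed.

    `pending` is a list of dicts with hypothetical keys `incident_id`,
    `severity`, and `interrupted_at` (timezone-aware datetime), standing in
    for rows the monitor would fetch from Postgres.
    """
    overdue = []
    for row in pending:
        deadline = row["interrupted_at"] + APPROVAL_SLA[row["severity"]]
        if now >= deadline:
            overdue.append(row["incident_id"])
    return overdue
```

Passing `now` in as an argument (for example `datetime.now(timezone.utc)`) instead of reading the clock inside the function keeps the check stateless and easy to test.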
We also use a confidence-gated RCA retry loop: if Claude Sonnet's confidence is below 0.7, the graph loops back to context-gathering with a broader evidence window before retrying RCA. It gets up to 3 attempts before auto-escalating to PagerDuty.
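Stripped of the graph wiring, the retry logic looks roughly like the sketch below. In the real agent this is a loop between graph nodes, not a plain Python loop. The 0.7 threshold and the 3-attempt cap are from above; the starting window, the doubling rule, the `analyze` callback, and the route names are placeholders I made up for the example.

```python
from typing import Callable, Tuple

CONFIDENCE_THRESHOLD = 0.7  # from the post
MAX_ATTEMPTS = 3            # from the post


def run_rca_with_retries(
    analyze: Callable[[int], Tuple[str, float]],
    initial_window_minutes: int = 30,  # hypothetical starting evidence window
    widen_factor: int = 2,             # hypothetical widening rule
) -> dict:
    """Call `analyze(window)` until it is confident enough or attempts run out.

    `analyze` is a stand-in for context-gathering plus the RCA model call and
    returns (root_cause, confidence).
    """
    window = initial_window_minutes
    for attempt in range(1, MAX_ATTEMPTS + 1):
        root_cause, confidence = analyze(window)
        if confidence >= CONFIDENCE_THRESHOLD:
            return {"next": "remediation_plan", "root_cause": root_cause, "attempts": attempt}
        window *= widen_factor  # broaden the evidence window before retrying
    return {"next": "escalate_pagerduty", "root_cause": None, "attempts": MAX_ATTEMPTS}
```

Capping the attempts means a persistently low-confidence diagnosis ends in a human page instead of an unbounded loop.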
Full repo if you want to see the implementation: https://github.com/krishnashakula/IRAS
Happy to go deeper on the checkpointer setup, the thread_id / incident_id design, or the timeout monitor pattern.