u/LoquatAccording5061

I want to share a pattern we used in production that I hadn't seen well-documented: fully durable human-in-the-loop approval using LangGraph's interrupt() + AsyncPostgresSaver.

The problem: We built IRAS, an autonomous incident response agent. One of the nodes generates a remediation plan and needs a human to approve it before anything touches production. The naive approach is polling: keep checking a database flag until the human clicks approve. But polling breaks if the server restarts mid-incident. You lose state, lose context, and the on-call engineer is staring at a dead Slack message.

What interrupt() actually does: When the approval node calls interrupt(), LangGraph doesn't just pause execution: it serializes the entire graph state to the checkpointer (in our case, AsyncPostgresSaver writing to PostgreSQL) and stops the run at that node. The process can die. The server can redeploy. The incident state is safe in Postgres.

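When the engineer hits POST /incidents/{id}/approve, the API reconstructs the graph from the checkpoint using the same thread_id, injects a Command(resume={"approved": True}), and the graph picks up where it left off: same state, same node, no re-running prior stages. One caveat: on resume, LangGraph re-executes the interrupted node from its first line, so anything placed before interrupt() in that node should be idempotent.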

python

# In the approval node (function name is illustrative)
from langgraph.types import interrupt

def approval_node(state):
    # The run stops here; interrupt() returns whatever is passed to Command(resume=...)
    human_decision = interrupt({"message": "Approve remediation plan?", "plan": state["plan"]})

    if human_decision["approved"]:
        return {"next": "apply_remediation"}
    return {"next": "escalation"}

python

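# In the FastAPI route (app is the FastAPI instance, graph is the graph compiled with the checkpointer)
from langgraph.types import Command

@app.post("/incidents/{incident_id}/approve")
async def approve_incident(incident_id: str):
    # Same thread_id as the original run, so the checkpointer loads the paused state
    await graph.ainvoke(
        Command(resume={"approved": True}),
        config={"configurable": {"thread_id": incident_id}},
    )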

Why this matters for production: The graph survives restarts, deployments, and crashes. Approval SLA timeouts (we do 15min for P0, 2hr for P1–P3) are handled by a background monitor that queries PostgreSQL for interrupted threads past their deadline, so no in-memory state is required.
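To make the timeout monitor idea concrete, here is a minimal sketch of just the deadline check, assuming the monitor has already fetched the paused incidents from PostgreSQL. The PendingApproval row shape and the overdue_approvals helper are hypothetical names for illustration, not code from the repo; only the SLA windows come from the numbers above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# SLA windows from the post: 15 minutes for P0, 2 hours for P1-P3.
APPROVAL_SLA = {
    "P0": timedelta(minutes=15),
    "P1": timedelta(hours=2),
    "P2": timedelta(hours=2),
    "P3": timedelta(hours=2),
}


@dataclass(frozen=True)
class PendingApproval:
    """Hypothetical row shape: one incident paused at the approval interrupt."""
    incident_id: str          # doubles as the LangGraph thread_id
    severity: str             # "P0" .. "P3"
    interrupted_at: datetime  # when the graph stopped for approval


def overdue_approvals(pending, now):
    """Return incident_ids whose approval deadline has passed."""
    return [
        p.incident_id
        for p in pending
        if now - p.interrupted_at > APPROVAL_SLA[p.severity]
    ]


now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
pending = [
    PendingApproval("inc-1", "P0", now - timedelta(minutes=20)),  # past 15 min
    PendingApproval("inc-2", "P1", now - timedelta(minutes=20)),  # within 2 hr
    PendingApproval("inc-3", "P3", now - timedelta(hours=3)),     # past 2 hr
]
print(overdue_approvals(pending, now))  # ['inc-1', 'inc-3']
```

Because the check is a pure function of rows and a timestamp, the monitor itself can be restarted at any time without losing track of deadlines.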

We also use a confidence-gated RCA retry loop: if Claude Sonnet's confidence is below 0.7, the graph loops back to context-gathering with a broader evidence window before retrying RCA. It makes up to 3 attempts before auto-escalating to PagerDuty.
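Stripped of the graph wiring, the retry logic looks roughly like the sketch below. This is a plain-function approximation, not how the repo wires it as graph edges; the function names, the 30-minute starting window, and the choice to double the window on each retry are all assumptions. Only the 0.7 threshold and the 3-attempt cap come from the description above.

```python
from typing import Callable, Optional, Tuple

CONFIDENCE_THRESHOLD = 0.7  # from the post
MAX_ATTEMPTS = 3            # from the post


def run_rca_with_retries(
    gather_context: Callable[[int], dict],
    run_rca: Callable[[dict], Tuple[str, float]],
    base_window_minutes: int = 30,  # assumed starting lookback
) -> Optional[str]:
    """Return a root cause, or None to tell the caller to escalate."""
    window = base_window_minutes
    for _ in range(MAX_ATTEMPTS):
        evidence = gather_context(window)
        root_cause, confidence = run_rca(evidence)
        if confidence >= CONFIDENCE_THRESHOLD:
            return root_cause
        window *= 2  # assumption: "broader evidence window" = double the lookback
    return None


# Stubs standing in for the real context-gathering and model call.
def fake_gather(window_minutes: int) -> dict:
    return {"window": window_minutes}


def fake_rca(evidence: dict) -> Tuple[str, float]:
    confidence = 0.9 if evidence["window"] >= 120 else 0.5
    return ("bad deploy", confidence)


print(run_rca_with_retries(fake_gather, fake_rca))  # bad deploy
```

With these stubs the first two attempts (30 and 60 minutes) score 0.5 and the third (120 minutes) clears the threshold. A None result is where the escalation to PagerDuty would happen.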

Full repo if you want to see the implementation: https://github.com/krishnashakula/IRAS

Happy to go deeper on the checkpointer setup, the thread_id / incident_id design, or the timeout monitor pattern.


Tired of being the human glue between a firing alert and a Slack thread, we built IRAS, an autonomous incident response agent that ingests alerts, gathers logs/metrics/deployments, runs root-cause analysis, generates a remediation plan with rollback commands, and pauses for a human to approve before touching anything.

The most interesting engineering problem: how do you pause an AI agent mid-execution, survive a server restart, and resume exactly where you left off? We used LangGraph's interrupt() primitive with PostgreSQL checkpointing. State is fully durable: crash the server or redeploy, and the incident is still there waiting for approval.

We also don't trust the model. Safety invariants are enforced in code: if the model returns a high-risk step or a missing rollback command, human approval is forced regardless of what the model said. The suite has 292 tests, including adversarial scenarios where the model deliberately lies about risk levels.
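A minimal sketch of what such a code-level gate can look like, covering only the two invariants named above. The RemediationStep fields, the risk labels, and the model_requests_approval flag are hypothetical; the repo's actual types may differ.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RemediationStep:
    """Hypothetical shape of one step in the model's typed output."""
    command: str
    risk: str                               # assumed labels: "low", "medium", "high"
    rollback_command: Optional[str] = None


def requires_human_approval(
    steps: List[RemediationStep],
    model_requests_approval: bool,
) -> bool:
    """Force approval when an invariant is violated, whatever the model says."""
    for step in steps:
        if step.risk == "high":
            return True
        if not (step.rollback_command or "").strip():
            return True  # missing or blank rollback command
    # No invariant tripped: fall back to the model's own recommendation.
    return model_requests_approval


plan = [
    RemediationStep("kubectl rollout restart deploy/api", "low", "kubectl rollout undo deploy/api"),
    RemediationStep("DROP INDEX idx_orders", "low"),  # no rollback supplied
]
print(requires_human_approval(plan, model_requests_approval=False))  # True
```

The point of the design is that the model's opinion can only add an approval requirement, never remove one.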

Repo: https://github.com/krishnashakula/IRAS

Happy to answer questions about the architecture, the LangGraph interrupt pattern, or the Pydantic AI typed output approach.

Built a production incident response agent with LangGraph: the interrupt() checkpoint pattern was the key