· 2 min read

LangGraph in production: what the docs don't tell you

I have now deployed LangGraph in production for four clients. Here are the three things I wish I had known before the first one.

1. Checkpoint storage is not optional.

The documentation presents checkpointing as a feature for resuming interrupted runs. In production, it is the feature. Without checkpoints, a transient network error or a model timeout means losing the entire agent state. With checkpoints, you can resume from the last successful node.

Use PostgreSQL for checkpoint storage. The in-memory checkpointer is for development only. The SQLite checkpointer is fine for single-instance deployments but breaks under any concurrency. PostgreSQL handles concurrent agents cleanly and gives you a queryable audit log of agent state transitions for free.

2. Tool call errors need explicit handling in the graph.

LangGraph’s default error handling for tool calls is to propagate the exception up the call stack. In a long-running agent, this means a single failed tool call can abort a run that has been executing for minutes.

The pattern I use: wrap every tool call in a node that catches exceptions and returns a structured error result. The agent then has the option to retry, use a fallback, or escalate to the user. This is more code, but it is the difference between an agent that fails gracefully and one that fails loudly at 3am.

3. Token budget management is a first-class concern.

Agentic systems accumulate context. Without explicit token budget management, a long-running agent will eventually hit the model’s context window limit and fail. The failure mode is not graceful.

Build token counting into your graph. Before each LLM call, count the tokens in the current context. If you are approaching the limit, summarise or truncate earlier context. This is not a nice-to-have. It is a requirement for any agent that runs for more than a few turns.

The LangGraph documentation covers the happy path well. These three things are what happen when you leave the happy path, which in production is most of the time.