MCP as a Backend for Frontend

When MCP landed in November 2024 I read the announcement a few times, blinked, and filed it under this month’s AI flavour.

Scepticism about new AI tooling was a reasonable default by then. The previous few years had produced a graveyard of frameworks and abstractions that promised to make building AI systems easier and mostly made them harder to understand. We had been building agents ourselves and had made a conscious decision to move away from the heavyweight frameworks — LangChain, LlamaIndex — in favour of writing closer to the primitives. Yet another protocol? No thanks.

My specific objection to MCP was this. We had been adopting FastAPI across our services. We came for the framework itself (well-designed ASGI, Pydantic validation at the interface, a clean developer experience) but stayed for the side effects. FastAPI produced rich API documentation almost automatically: interactive, explorable, and available as a machine-readable OpenAPI specification. Our frontend developers loved it. Onboarding onto a new service went from a back-and-forth about what parameters existed to just reading the docs.

So when MCP arrived and described itself as a standard way for agents to discover and use tools, my first thought was: we already have that. It’s called an OpenAPI spec. The examples in the announcement felt abstract. The use cases felt contrived. And anyway, if a model is capable enough to do useful work, shouldn’t it be able to discover the API endpoints itself, read the documentation, and figure out how to call them? The intelligence is in the model. The interface is already there. What exactly is the problem being solved here?

That position held for a while, and it wasn’t entirely wrong. But now in May 2026 I’m beginning to see what all the fuss was about.

There’s a pattern in web architecture called the Backend for Frontend, or BFF, that’s worth understanding here.

The problem it solves is straightforward. You have an API. It’s well-designed, stable, and serves your data faithfully. Then you get a mobile client. The mobile client doesn’t want the same payload as the desktop — it has less screen space, tighter bandwidth constraints, different interaction patterns. You could add query parameters to let clients specify what they want. You could version the API. Or you could introduce a thin layer that sits between the client and the underlying services, shaped specifically for that consumer’s needs. That’s the BFF. It doesn’t replace the API. It translates it.

The pattern exists because different consumers have genuinely different needs, and a good API shouldn’t have to care about all of them simultaneously. The BFF takes on that responsibility instead — shaping responses, aggregating calls, stripping fields the client will never use. The underlying services stay clean. The consumer gets exactly what it needs.
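
A minimal sketch of that shaping step, in plain Python. The two service functions, their payloads, and the field names are all hypothetical, invented only to show one BFF endpoint aggregating calls and stripping fields:

```python
from typing import Any


def fetch_article(article_id: str) -> dict[str, Any]:
    """Stand-in for an underlying service call (hypothetical payload)."""
    return {
        "id": article_id,
        "title": "Critical minerals outlook",
        "publisher": "Example Institute",
        "body": "Long body text. " * 50,
        "internal_rank": 0.87,
    }


def fetch_related(article_id: str) -> list[dict[str, Any]]:
    """Stand-in for a second underlying service call (hypothetical payload)."""
    return [
        {"id": "a2", "title": "Supply chains", "body": "More text."},
        {"id": "a3", "title": "Refining capacity", "body": "More text."},
    ]


def mobile_article_view(article_id: str) -> dict[str, Any]:
    """BFF endpoint: aggregate two calls, return only what this client uses."""
    article = fetch_article(article_id)
    related = fetch_related(article_id)
    return {
        "title": article["title"],
        "publisher": article["publisher"],
        "summary": article["body"][:120],  # trimmed for a small screen
        "related_titles": [item["title"] for item in related],
    }
```

The underlying services are untouched; only the layer in front of them knows what the mobile client wants.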

I’ll be honest: we never had much appetite for it. Our frontend developers were enthusiastic — a BFF would have given them exactly the interface shape they wanted. The problem was that building and maintaining it would have fallen on the backend team, which at various points also covered infrastructure, AI pipelines, and whatever else needed doing. Good APIs with good documentation got us most of the way there. The BFF always felt like it would double the maintenance surface for the people least asking for it.

What I hadn’t accounted for was that agents would constitute an entirely new class of consumer — one the BFF pattern describes almost exactly.

The thing that eventually shifted the frame wasn’t an argument or a conference talk. I was looking at a search endpoint.

We had a low-level internal endpoint that accepted a query language directly — expressive, powerful, and deliberately not something you’d expose publicly. It’s not as catastrophic as exposing raw SQL, you’re not going to drop a database, but it’s the same category of mistake: your internal implementation becoming someone else’s interface problem. The query language was complex enough that even frontier models with access to the full index specification would occasionally get it wrong — malformed aggregations, incorrect field references, queries that ran but returned nonsense. And if a capable model struggled with it, a cheaper one had no chance. The expressiveness that made it useful internally was exactly what made it unsuitable as an agent interface.

So I built a tool over it: a keyword search tool that accepted plain parameters — phrases, keywords, date ranges, sort order — and handled the query construction internally. Then I added a report generation tool, built on the same engine behind markdown2pdf.ai, and watched an agent work through a research task and produce a PDF at the end of it, unprompted.
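
To make the idea concrete, here is a rough sketch of the query construction such a tool might do. The internal query language isn't named here, so the sketch assumes an Elasticsearch-style bool query as a stand-in; `build_query`, the `body` and `published_at` field names, and the sort values are illustrative assumptions, not our actual implementation:

```python
from typing import Any, Optional


def build_query(
    phrases: Optional[list[str]] = None,
    keywords: Optional[list[str]] = None,
    date_from: Optional[str] = None,
    date_to: Optional[str] = None,
    sort: str = "relevance",
) -> dict[str, Any]:
    """Translate plain tool parameters into a (hypothetical) internal query."""
    if sort not in ("relevance", "newest", "oldest"):
        raise ValueError(f"sort must be relevance, newest or oldest, got {sort!r}")

    # Every phrase and keyword becomes a required clause.
    must: list[dict[str, Any]] = []
    for phrase in phrases or []:
        must.append({"match_phrase": {"body": phrase}})
    for keyword in keywords or []:
        must.append({"match": {"body": keyword}})

    # Date bounds become a single range filter, if either is given.
    filters: list[dict[str, Any]] = []
    date_range: dict[str, str] = {}
    if date_from:
        date_range["gte"] = date_from
    if date_to:
        date_range["lte"] = date_to
    if date_range:
        filters.append({"range": {"published_at": date_range}})

    query: dict[str, Any] = {"query": {"bool": {"must": must, "filter": filters}}}
    if sort != "relevance":
        order = "desc" if sort == "newest" else "asc"
        query["sort"] = [{"published_at": {"order": order}}]
    return query
```

The agent only ever sees the five plain parameters. The nesting, the clause types, and the field names stay on the server side, where mistakes can be tested for once rather than rediscovered by every model.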

What struck me wasn’t the output. It was how fluently the agent had moved through the tools — no fumbled parameters, no wasted calls probing the interface to understand it. The design decisions baked into the tool were guiding the agent’s reasoning before it made a single call: which parameters to expose, which to absorb internally, how to describe the difference between a phrase and a keyword in terms the model could reason over rather than just execute against.

That’s what a BFF does. It translates between what the underlying system offers and what a specific consumer actually needs. I’d just built one for an agent without meaning to.

There’s a temptation, once you’ve accepted that agents need their own interface layer, to treat it as a purely mechanical problem. Define your inputs, document your outputs, make sure the types are right. That’s how you’d approach an API for a human developer and it mostly works.

It doesn’t quite work for agents, because the interface isn’t just a contract. It’s a prompt.

When an agent encounters a tool, it reads the description and uses it to reason about whether and how to use that tool. The description shapes what queries it constructs, which tools it reaches for first, how it interprets the results. A poorly written description doesn’t just cause confusion — it causes confident wrongness. The agent doesn’t know it’s misreading the interface.

We found this out the hard way with something as simple as the word “articles.” Our corpus contains institutional publications, policy briefings, think tank research, academic outputs — Bank of England working papers, Nature, that sort of thing. We’d been describing it as articles because that’s roughly what they are. The agent accepted this, formed a mental model around news media, and would occasionally betray a kind of surprise in its chain of thought when it surfaced something that didn’t fit. It wasn’t broken, but it was operating with a miscalibrated expectation we had handed it. The fix was a few sentences of honest description of what the corpus actually contained. The agent’s behaviour changed noticeably.

The same dynamic runs through individual parameters. Our keyword search tool has a slop parameter that controls how strictly phrases are matched — whether “critical minerals” would also catch “critical earth minerals” where a word has crept in between. Describing this in technical terms is accurate and useless. Describing it as “0 = exact phrase order required, 2 = up to 2 intervening words allowed” gives the agent something it can reason over at the point of use.
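
A toy matcher makes the semantics of that description explicit. It is a deliberate simplification: it only allows extra words between phrase terms that stay in order, whereas real search engines may define slop somewhat differently. The function name and the whitespace tokenisation are assumptions for illustration:

```python
def phrase_matches(text: str, phrase: str, slop: int = 0) -> bool:
    """True if the phrase words occur in order in `text`, with at most
    `slop` extra words in total between them. Punctuation is not handled."""
    words = text.lower().split()
    target = phrase.lower().split()
    if not target:
        return False

    def search(word_idx: int, target_idx: int, budget: int) -> bool:
        if target_idx == len(target):
            return True  # every phrase word has been placed
        # The next phrase word may sit up to `budget` positions further on.
        for gap in range(budget + 1):
            pos = word_idx + gap
            if pos < len(words) and words[pos] == target[target_idx]:
                if search(pos + 1, target_idx + 1, budget - gap):
                    return True
        return False

    return any(
        words[start] == target[0] and search(start + 1, 1, slop)
        for start in range(len(words))
    )
```

With `slop=0`, "critical earth minerals" does not match the phrase "critical minerals"; with `slop=1` it does.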

The same principle applies throughout. The distinction between must_keywords and any_keywords is described in terms of AND and OR logic. The publishers parameter includes an explicit instruction never to put publisher names in the keyword fields — because without that instruction, an agent will, and the results will be quietly wrong in a way that’s hard to diagnose.
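
Put together, a tool definition along these lines might look like the sketch below. The dict follows the general shape of an MCP tool definition (a name, a description, and a JSON Schema for the inputs), but the tool name and the exact wording of most descriptions are illustrative rather than copied from our server. The small helper is likewise hypothetical: a cheap check that no parameter ships without a description.

```python
from typing import Any

KEYWORD_SEARCH_TOOL: dict[str, Any] = {
    "name": "keyword_search",  # hypothetical tool name
    "description": (
        "Search a corpus of institutional publications, policy briefings, "
        "think tank research and academic outputs. This is not news media."
    ),
    "inputSchema": {
        "type": "object",
        "properties": {
            "phrases": {
                "type": "array",
                "items": {"type": "string"},
                "description": "Multi-word phrases matched as a unit, e.g. 'critical minerals'.",
            },
            "slop": {
                "type": "integer",
                "minimum": 0,
                "default": 0,
                "description": "0 = exact phrase order required, 2 = up to 2 intervening words allowed.",
            },
            "must_keywords": {
                "type": "array",
                "items": {"type": "string"},
                "description": "ALL of these words must appear (AND logic).",
            },
            "any_keywords": {
                "type": "array",
                "items": {"type": "string"},
                "description": "AT LEAST ONE of these words must appear (OR logic).",
            },
            "publishers": {
                "type": "array",
                "items": {"type": "string"},
                "description": (
                    "Restrict results to these publishers. Never put publisher "
                    "names in the keyword fields; use this parameter instead."
                ),
            },
        },
    },
}


def undocumented_parameters(tool: dict[str, Any]) -> list[str]:
    """Return the names of parameters that have no description."""
    properties = tool["inputSchema"]["properties"]
    return [name for name, spec in properties.items() if not spec.get("description")]
```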

None of this is documentation in the traditional sense. It’s interface design for a reader with one shot to understand it and no way to ask for clarification.

There’s a cost to getting this wrong that doesn’t exist in traditional API design. When a frontend application calls an endpoint and gets back more data than it needs, it renders what’s relevant and discards the rest. The overhead is negligible. When an agent does the same thing, every token in that response is sitting in the context window, and the context window is a shared, finite, and increasingly expensive resource.

We have a concrete version of this problem. Our “article” endpoints return full text: title, publisher, publication date, body, everything. For an API client that’s fine. It’s great, even. You fetch the payload, take what you need, move on. For an agent working through a research task, fetching full articles repeatedly is quietly ruinous. Each response bloats the context. Later steps cost more because the model is now reasoning over a larger window. A seventeen-step research run we instrumented came out at over a million input tokens and around two dollars. A meaningful fraction of that was the agent carrying article text it had already processed and no longer needed.
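
The arithmetic behind that is worth spelling out, because the cost grows faster than it first appears. The sketch below assumes the simplest possible model: every step re-reads the whole context, and each tool response stays in the window afterwards. It ignores prompt caching and output tokens, and the numbers in the usage note are made up for illustration, not taken from the instrumented run:

```python
def cumulative_input_tokens(base: int, added_per_step: int, steps: int) -> int:
    """Total input tokens across a run where every step re-reads the full
    context and each step appends `added_per_step` tokens to it."""
    total = 0
    context = base
    for _ in range(steps):
        total += context           # the model reads the whole window this step
        context += added_per_step  # the tool response stays in the window
    return total
```

With a 2,000-token starting context and seventeen steps, appending 7,000 tokens of article text per step gives `cumulative_input_tokens(2000, 7000, 17) == 986000`. Appending a 500-token lean response instead gives `102000`. The per-step payload shrinks by a factor of fourteen, and because each payload is re-read on every later step, almost all of the run's input cost goes with it.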

The usual response to context bloat is to push work into a subagent: spin off a cheaper, smaller model to handle retrieval and summarisation, and preserve the main context for reasoning. This genuinely helps: a subagent that reads a ten-thousand-token article to extract a single link absorbs that cost within its own short context rather than pushing the deadweight through the main agent loop. But it shifts complexity onto the consumer. Implementing subagent workflows to compensate for poorly shaped tool responses is a reasonable expectation for a sophisticated internal team. It’s a much stronger assumption to make about any agent that might connect to your MCP server. And subagents typically run on smaller models (Haiku rather than Sonnet), which means your tool interface needs to be legible not just to your most capable model but to the cheapest one in your stack. Designing for the smaller model turns out to be a useful forcing function. If your tool descriptions are clear enough for a lightweight model to use correctly, they’re probably well designed.

Inadvertently exhausting an agent’s token budget through generous API responses probably deserves a place on someone’s LLM security list. Unbounded consumption via poorly designed tooling.

The fix we’re working toward borrows an idea from GraphQL. If an agent only needs the title and a link from an “article”, the tool should be able to return just that. The underlying API doesn’t need to change. The tool layer absorbs the responsibility, the agent gets a leaner response, and the context window stays manageable.
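
A sketch of what that projection could look like in the tool layer. The field names, the defaults, and the `project` helper are assumptions for illustration; the point is only that the selection happens after the underlying API has returned its full payload:

```python
from typing import Any, Iterable

ALLOWED_FIELDS = ("title", "link", "publisher", "published", "body")
DEFAULT_FIELDS = ("title", "link")  # lean by default; the body is opt-in


def project(
    article: dict[str, Any], fields: Iterable[str] = DEFAULT_FIELDS
) -> dict[str, Any]:
    """Return only the requested fields from a full article payload."""
    requested = list(fields)
    unknown = [name for name in requested if name not in ALLOWED_FIELDS]
    if unknown:
        # A clear error tells the agent what it can ask for instead.
        raise ValueError(
            f"Unknown fields {unknown}; choose from {list(ALLOWED_FIELDS)}"
        )
    return {name: article[name] for name in requested if name in article}
```

Making the lean shape the default matters as much as offering the option: an agent that never thinks about the `fields` parameter still gets the cheap response.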

This is the BFF pattern doing its most concrete work.

One of the more useful things we built during this process was an introspection tool. It connects to the MCP server, reads the tool schemas cold, and asks a model to describe what each tool is for, when to use it, and what its pitfalls are. No additional context. Just the descriptions we’d written.

The output is a legibility test. If the model can correctly articulate the routing logic — use semantic search for broad conceptual queries, keyword search when you have specific entities, discovery before retrieval — then the descriptions are doing their job. If it can’t, or gets something subtly wrong, you have a design problem that will show up later in production in ways that are much harder to diagnose.
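
The core of such a tool is small. The sketch below is a simplified stand-in rather than our implementation: it takes tool definitions that have already been fetched from the server as plain dicts, and it takes the model as a callable (`ask_model`, a hypothetical parameter) so that no particular client library is assumed:

```python
import json
from typing import Any, Callable

INSTRUCTIONS = (
    "You are shown tool definitions with no other context. For each tool, "
    "describe what it is for, when to use it, and what its pitfalls are."
)


def build_introspection_prompt(tools: list[dict[str, Any]]) -> str:
    """Render the raw tool schemas into a single cold-read prompt."""
    blocks = [json.dumps(tool, indent=2, sort_keys=True) for tool in tools]
    return INSTRUCTIONS + "\n\n" + "\n\n".join(blocks)


def introspect(tools: list[dict[str, Any]], ask_model: Callable[[str], str]) -> str:
    """Ask a model to explain the tools using nothing but their schemas."""
    return ask_model(build_introspection_prompt(tools))
```

The discipline is in what the prompt leaves out: no system prompt describing the product, no examples, no hints about routing. Anything the model needs to know has to come from the descriptions themselves.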

It doesn’t catch everything. A well-designed interface can still be let down by a poorly designed one sitting adjacent to it. We use the Slack MCP for some internal tooling, and at one point asked an agent using it to summarise messages from the last twenty-five days. It came back with a confident, well-formatted summary. Wrong year. The interesting question is where the failure actually lived. If Slack’s MCP navigates by timestamp rather than something more natural like a days_back parameter, then the interface itself is part of the problem: the agent has to reason about timestamps correctly before it can even form a valid query. Which is exactly the kind of complexity a well-designed tool should absorb internally rather than expose upward.
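
Absorbing that complexity takes only a few lines. The sketch below shows a hypothetical `days_back` parameter being converted to a timestamp window on the server side, where the current date is known for certain. It makes no claim about how any real Slack integration works; `time_window` and its signature are invented for illustration:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def time_window(
    days_back: int, now: Optional[datetime] = None
) -> tuple[float, float]:
    """Convert a relative window into (oldest, latest) Unix timestamps.

    `now` is injectable for testing; in production the server's clock is used,
    so the agent never has to know or guess the current date.
    """
    if days_back < 1:
        raise ValueError("days_back must be at least 1")
    latest = now or datetime.now(timezone.utc)
    oldest = latest - timedelta(days=days_back)
    return oldest.timestamp(), latest.timestamp()
```

The agent says "25 days" and the server does the date arithmetic. A model that is unsure what year it is can no longer turn that uncertainty into a valid-looking but wrong query.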

This connects to something worth naming directly. A lot of teams shipping MCP servers are, understandably, generating them automatically from their OpenAPI specs. Export the Swagger JSON, wrap it in a server, done. It’s also, perhaps, motivated by something other than a considered view on agent experience: shipping MCP has become a signal in its own right, and auto-generating from Swagger is the path of least resistance to being able to say you’ve done it. The quality of the experience that produces is largely left for the consumer to deal with.

The trouble is that an OpenAPI spec is designed for a human developer who can try things, ask questions, have their AI assistant write the unit tests, and iterate their way to a working integration. Even against a poorly documented API, there’s a feedback loop.

And that’s before you get to ontological fuzziness, where the schema is accurate but the underlying model clashes with what the interface implies. Take AWS S3 as an example. It’s a flat key-value store with no native concept of directories, just buckets and keys. The console presents the illusion of a folder hierarchy. The API quietly doesn’t have one. A developer encounters this mismatch, works through it, tries a few things, and gets there eventually through trial, error, and shared confusion. An agent encountering this for the first time (which is every session, for every new interface) reads the schema, forms a model of reality from it, and proceeds on wrong assumptions with no mechanism to correct them.

Feeding a spec to an agent gives it a second-class experience: complex interfaces, no guidance on sequencing, no absorption of internal complexity. The BFF exists precisely because “just use the underlying API” was never the right answer for demanding consumers. Auto-generated MCP is the same mistake, one layer up.

These failures point at something that doesn’t have a clean name yet in how we talk about MCP design. It’s not quite usability, which implies a human user. It’s not just prompt engineering, which implies a one-off interaction. It’s something more like the accumulated quality of the experience an agent has when it encounters your tools — the friction or fluency at the point of reasoning, not just at the point of execution. We’ve started calling it Agent Experience, for want of a better term.

Tools are the interface. Agent Experience is the measure of how well an agent can actually use them.

There’s a broader shift happening alongside all of this that’s worth naming.

For most of the past two years, building a useful agent meant a non-trivial engineering project: choose a framework, design the tool interfaces, wire up the orchestration, handle retries and context management, test the whole thing end to end. The knowledge of how to do that well was locked up in the implementation. MCP starts to change this by moving tool maintenance to the server layer — the tools live on the server, any capable agent can discover and use them. What’s emerging alongside it is a class of agent harnesses that handle the orchestration side: give them a prompt and a set of MCP servers and they’ll run a multi-step research task, reflect on their own tool usage, and produce a structured output.

The practical consequence is that the interface contract is now with the agent class, not with a specific client. We built our MCP server for internal use — our own agents, our own harnesses, engineers on the team experimenting with research workflows. But there’s very little separating that from a public-facing server. The same tool descriptions that guide our internal agents will guide any agent pointed at the server. The same design decisions that make the tools legible internally make them legible externally.

This is something we didn’t get for free with our APIs, however good the documentation was. An OpenAPI spec is written for a developer who will read it once, build something durable, and move on. An agent reads the interface fresh every session, reasons from it immediately, and carries no memory of previous mistakes. The interface has to work first time, every time. That constraint turns out to be a useful forcing function for writing better descriptions than you’d otherwise bother with.

MCP will evolve, or be succeeded by something solving the same problem. The protocol is young, the tooling is still maturing, and the current landscape of clients and servers will look different in a year. That’s fine. Protocols come and go.

The pattern underneath it is more durable. Every time you introduce a new class of consumer — mobile clients, third-party integrators, internal services with different latency requirements — you eventually discover that your existing interfaces weren’t designed for them. The response shapes are wrong. The abstractions leak. The documentation that worked for one audience confuses another. The BFF pattern exists because this is a recurring problem with a recurring solution: a layer that translates between what your system offers and what a specific consumer needs.

Agents are a new class of consumer. They read interfaces literally, reason from descriptions immediately, carry no memory between sessions, and operate under hard token constraints that make response bloat genuinely expensive. They need interfaces designed for them, not adapted from interfaces designed for humans. MCP is currently the most coherent attempt to standardise what that looks like.

What we don’t yet have is a rigorous way to measure how well we’re doing it. We can run an introspection agent and check whether the descriptions are legible. We can instrument a research run and watch where the token cost goes. We can ask an agent to reflect on its own tool usage and surface the gaps. We can write evals that test whether a tool produces the right output given a representative set of inputs — and that class of solution is maturing fast. But evals measure output quality after the fact. Agent Experience is a design-time concern: the quality of the interface before the agent ever runs. These are related problems with different solutions, and the second one is less well understood.

Agent Experience sits alongside Developer Experience as a design concern in its own right: not the schema, not the response format, but the cognitive fluency the interface produces at the point of reasoning. Getting it right is, at the moment, more craft than science. That will probably change.

In the meantime, the most useful reframe we found was the simplest one. Stop thinking about MCP as an API replacement. Start thinking about it as a Backend for Frontend where the client is an agent. The rest follows from there.