2025-10-14

Building an AI Feature Gateway: Centralising Prompt Management at Scale

If you've ever tried to find out which version of a prompt is running in production, you know the pain I'm about to describe. Early in our AI journey at PageUp, prompts lived where developers put them—scattered across repositories, embedded in application code, hardcoded in configuration files, and sometimes existing only in someone's head. Changing a prompt meant a code deployment. Testing a prompt variation meant a feature branch. Rolling back a bad prompt meant an emergency release.

This was unsustainable. As we scaled from one AI feature to multiple products—our recruiter co-pilot, skill matching, resume intelligence, and interview guide generation—the prompt management problem became an engineering bottleneck. We needed a centralised platform that could manage prompts as first-class artifacts, enable non-technical stakeholders to make changes safely, and deploy across multiple regions with confidence.

That's how the AI Feature Gateway was born. It's now the backbone of our AI infrastructure at PageUp, and building it was one of the most impactful architectural decisions we've made. Here's what we learned.

The Scattered Prompt Problem

Before the gateway, our prompt management looked like most organisations' prompt management: chaotic. Every AI feature had its own approach. Some prompts were in code. Some were in environment variables. Some were in database tables. There was no version history, no rollback capability, and no way to test a change without deploying code.

This creates four critical failure modes that most teams don't recognise until they're in production at scale:

1. Untracked changes with no audit trail. When a prompt changes behaviour in production and no one knows what changed or when, debugging becomes archaeology. In a regulated industry like HR technology, the inability to explain what your AI was doing at a specific point in time is a compliance risk.

2. No rollback path for bad changes. A developer tweaks a prompt to improve one use case and inadvertently degrades another. Without versioning, there's no easy way to revert to the previous version while investigating the issue.

3. No experimentation framework. You can't A/B test prompt variations, gradually roll out improvements, or compare performance across versions without infrastructure specifically designed for it.

4. Team coordination failures. When multiple teams are modifying prompts that feed into shared AI features, conflicts and regressions are inevitable without centralised management.

Industry experience suggests that teams typically hit these pain points when they reach 10,000+ queries per day, have two or more engineers working on AI features, or manage five or more distinct prompts. We hit all three thresholds simultaneously.

What Is an AI Feature Gateway?

An AI Feature Gateway is a middleware layer that sits between your application code and your AI model providers. It's similar in concept to an API gateway but purpose-built for AI workloads. Instead of your applications calling AI models directly with hardcoded prompts, they call the gateway with a feature identifier and context, and the gateway handles prompt assembly, model routing, and response processing.

Think of it as the control plane for your AI features. Applications don't need to know which model they're talking to, which version of the prompt is active, or how the request is being routed. They just say "I need a resume summary for this candidate" and the gateway handles everything else.

Core responsibilities of our gateway:

  • Prompt management: Store, version, and serve prompts as managed artifacts
  • Model routing: Direct requests to the appropriate model based on feature, region, cost, and availability
  • Request enrichment: Assemble complete prompts from templates, context, and parameters
  • Response processing: Validate, transform, and cache AI responses before returning them to applications
  • Observability: Track every request with cost, latency, quality metrics, and full audit trails
  • Configuration: Enable self-service management of AI features without code deployments
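The calling contract above can be sketched in a few lines of Python. Everything here — the feature registry, the field names, and the model identifier — is illustrative rather than our actual schema; the point is that the application names a feature and supplies context, and the gateway resolves everything else:

```python
from dataclasses import dataclass

# Hypothetical feature registry; a real gateway would load this from managed storage.
FEATURES = {
    "resume-summary": {
        "template": "Summarise the following resume for a recruiter:\n{resume_text}",
        "model": "anthropic.claude-3-haiku",
    },
}

@dataclass
class GatewayRequest:
    feature_id: str   # which AI capability the caller wants
    context: dict     # template variables supplied by the application
    region: str = "ap-southeast-2"

def assemble(request: GatewayRequest) -> dict:
    """Resolve the feature, fill the prompt template, and pick a model.

    The caller never sees the prompt text or the model name; it only names a feature.
    """
    feature = FEATURES[request.feature_id]
    prompt = feature["template"].format(**request.context)
    return {"model": feature["model"], "prompt": prompt, "region": request.region}

req = GatewayRequest("resume-summary", {"resume_text": "10 years in data engineering."})
call = assemble(req)
```

Because the prompt and model live behind the feature identifier, either can change in the registry without any consumer redeploying.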

Architecture Decisions

Building the gateway required several architectural decisions that shaped the final system. The most consequential was choosing between an SDK-based approach (embedding the gateway logic in a client library) and an API-based approach (running the gateway as a standalone service).

We chose the API-based approach for several reasons. First, it creates a single point of control—prompt changes take effect immediately for all consumers without requiring library updates. Second, it enables centralised monitoring and cost tracking. Third, it supports non-technical users who need to manage prompts through a UI rather than code.

Key architectural patterns we adopted:

  • Feature-based routing: Every AI capability is registered as a "feature" with its own prompt template, model configuration, and evaluation criteria. Applications reference features by identifier, not by model or prompt details.
  • Environment promotion: Prompts move through development, staging, and production environments with approval gates, mirroring how we deploy application code.
  • Layered caching: We implemented both exact-match caching (for identical requests) and semantic caching (for meaningfully similar requests). Semantic caching alone reduced our inference costs significantly by recognising when different wordings of the same question could use a cached response.
  • Dynamic model routing: Requests route to models based on configurable rules—cost optimisation, latency requirements, or availability. If our primary model on AWS Bedrock is experiencing elevated latency, the gateway can automatically route to an alternative.

The performance overhead of the gateway was a concern early on. Industry benchmarks show that well-designed LLM gateways can operate with sub-15 microsecond overhead at thousands of requests per second—negligible compared to the seconds-long inference time of the models themselves. Our gateway adds minimal latency while providing enormous operational benefits.

Self-Service Configuration

One of the most transformative aspects of the gateway is enabling self-service configuration for non-engineering stakeholders. Before the gateway, any prompt change required an engineer to modify code, create a pull request, get it reviewed, and deploy it. This cycle took days at best.

With the gateway, product managers and AI specialists can modify prompts through a management interface, preview changes against sample inputs, and promote them through environments—all without writing code. This reduced our prompt change cycle from days to minutes and dramatically increased the pace of AI feature iteration.

Self-service capabilities we built:

  • Prompt editor with preview: Edit prompts and immediately see how they perform against a set of representative inputs
  • Variable management: Define and manage template variables that get filled at request time (candidate name, job title, company context)
  • Role-based access control: Different permissions for viewing, editing, testing, and promoting prompts to production
  • Change history and audit log: Every modification is tracked with who made it, when, and why
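The variable-management piece can be sketched simply: extract the placeholders a template expects and refuse to render if any are missing, so a prompt edit in the UI cannot silently break at request time. The helper names here are hypothetical:

```python
import string

def required_variables(template: str) -> set[str]:
    """Extract the placeholder names a prompt template expects."""
    return {name for _, name, _, _ in string.Formatter().parse(template) if name}

def render(template: str, variables: dict) -> str:
    """Fill the template, failing fast if the caller omitted a required variable."""
    missing = required_variables(template) - variables.keys()
    if missing:
        raise ValueError(f"missing template variables: {sorted(missing)}")
    return template.format(**variables)

template = "Write an interview guide for a {job_title} role at {company}."
guide_prompt = render(template, {"job_title": "Data Analyst", "company": "Acme"})
```

The same extraction powers the preview feature: the editor knows exactly which sample inputs a template needs before it can be tested.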

The key insight is that prompt engineering isn't purely an engineering discipline. Domain experts—in our case, recruitment specialists—often have the best intuition about how to improve AI interactions with users. Giving them safe, controlled access to prompt management unlocked improvements that our engineers wouldn't have discovered on their own.

Prompt Versioning and Evaluation

We treat prompts as versioned, tested artifacts deployed with the same rigour as application code. Every prompt has a version history, and every version must pass through our evaluation pipeline before reaching production.

Our evaluation pipeline scores prompts on multiple dimensions:

  • Correctness: Does the output accurately reflect the input data?
  • Faithfulness: Does the output stay grounded in the provided context without hallucinating?
  • Relevance: Is the output useful for the intended purpose?
  • Safety: Does the output avoid harmful, biased, or inappropriate content?
  • Consistency: Does the prompt produce stable outputs across repeated runs?

This is what the industry calls eval-gated CI/CD—evaluation suites that function like test suites, blocking promotion to production when quality scores drop below thresholds. It's fundamentally different from traditional testing because you can't assert exact outputs. Instead, you score on dimensions and set acceptable thresholds.
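A minimal eval gate might look like the following sketch. The threshold values are made up for illustration; the shape of the check — every dimension must clear its floor and must not regress against the current production version — is the part that matters:

```python
THRESHOLDS = {  # hypothetical minimum scores, on a 0.0-1.0 scale
    "correctness": 0.90, "faithfulness": 0.95, "relevance": 0.85,
    "safety": 0.99, "consistency": 0.90,
}

def gate(scores: dict[str, float], baseline: dict[str, float]) -> tuple[bool, list[str]]:
    """A prompt is eligible for promotion only if every dimension clears its
    threshold and none regresses against the production version's scores."""
    failures = []
    for dim, minimum in THRESHOLDS.items():
        if scores.get(dim, 0.0) < minimum:
            failures.append(f"{dim} below threshold ({scores.get(dim, 0.0):.2f} < {minimum})")
        elif scores[dim] < baseline.get(dim, 0.0):
            failures.append(f"{dim} regressed vs production")
    return (not failures, failures)

candidate = {"correctness": 0.93, "faithfulness": 0.97, "relevance": 0.90,
             "safety": 1.00, "consistency": 0.92}
production = {"correctness": 0.91, "faithfulness": 0.96, "relevance": 0.88,
              "safety": 1.00, "consistency": 0.90}
ok, reasons = gate(candidate, production)
```

Unlike a unit test, nothing here asserts an exact output; the gate operates entirely on scored dimensions, which is what makes it workable for non-deterministic model responses.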

Our versioning workflow:

  1. Author creates or modifies a prompt in the development environment
  2. Automated evaluation runs against a curated dataset of representative inputs
  3. Results are compared against the current production version
  4. If quality scores meet or exceed thresholds, the prompt is eligible for promotion
  5. A/B testing can optionally split traffic between the new and current version in staging
  6. After validation, the prompt promotes to production with automatic rollback if live metrics degrade
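The promotion and rollback steps above can be sketched as a tiny versioned store with a degradation check. The class, the metric names, and the 0.05 tolerance are illustrative, not our actual implementation:

```python
class PromptVersions:
    """Minimal versioned prompt store with promote/rollback (illustrative only)."""
    def __init__(self):
        self.history: list[str] = []    # all published versions, oldest first
        self.active: int | None = None  # index of the production version

    def publish(self, prompt: str) -> int:
        self.history.append(prompt)
        return len(self.history) - 1

    def promote(self, version: int) -> None:
        self.active = version

    def rollback(self) -> None:
        if self.active and self.active > 0:
            self.active -= 1

def should_rollback(live: dict[str, float], baseline: dict[str, float],
                    tolerance: float = 0.05) -> bool:
    """Revert automatically if any live metric drops more than `tolerance`
    below the pre-promotion baseline."""
    return any(live.get(m, 0.0) < v - tolerance for m, v in baseline.items())

store = PromptVersions()
v0 = store.publish("Summarise this resume.")
v1 = store.publish("Summarise this resume in three bullet points.")
store.promote(v1)
# Live faithfulness has fallen well below the baseline, so revert to v0.
if should_rollback({"faithfulness": 0.82}, {"faithfulness": 0.95}):
    store.rollback()
```

The tolerance exists because live metrics are noisy: rolling back on any dip would thrash, so the trigger fires only on a drop larger than normal variance.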

Multi-Region Deployment

PageUp serves customers across multiple regions, and our AI features need to respect data residency requirements. This adds significant complexity to the gateway architecture. A customer's data in one region can't be sent to AI models in another region without careful consideration of privacy regulations and contractual obligations.

The regulatory landscape is increasingly complex. GDPR imposes strict requirements on where EU data can be processed. The EU AI Act introduces additional data governance requirements for AI systems. Australia has its own privacy framework. And these requirements continue to evolve—the sovereign cloud market is projected to grow from $154 billion in 2025 to $823 billion by 2032, reflecting how seriously organisations are taking data residency.

How our gateway handles multi-region:

  • Region-aware routing: The gateway automatically routes requests to models deployed in the appropriate region based on the customer's data residency configuration
  • Prompt synchronisation: Prompt versions are synchronised across regions with eventual consistency, so all regions converge on the same version shortly after a change
  • Regional fallback: If a model is unavailable in the preferred region, the gateway can fall back to another region only if the customer's data residency policy permits it
  • Latency optimisation: AWS cross-region inference adds only single-digit milliseconds of latency—a small fraction of the total inference time—so region-appropriate routing doesn't meaningfully impact user experience
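Region-aware routing with residency-gated fallback reduces to a short candidate-ordering loop. The deployment map, health states, and region names below are hypothetical:

```python
# Hypothetical view of which models are deployed and healthy in which region.
DEPLOYMENTS = {
    "eu-west-1": {"claude-3-sonnet": "degraded"},
    "eu-central-1": {"claude-3-sonnet": "healthy"},
    "ap-southeast-2": {"claude-3-sonnet": "healthy"},
}

def route(model: str, home_region: str, residency_allows: set[str]) -> tuple[str, str]:
    """Prefer the customer's home region; fall back to another region only
    if their data residency policy lists it as permitted."""
    candidates = [home_region] + sorted(residency_allows - {home_region})
    for region in candidates:
        if DEPLOYMENTS.get(region, {}).get(model) == "healthy":
            return region, model
    raise RuntimeError(f"no healthy deployment of {model} in permitted regions")

# EU customer whose policy permits EU regions only: degraded home region
# falls back to another permitted EU region, never outside the policy set.
region, model = route("claude-3-sonnet", "eu-west-1",
                      residency_allows={"eu-west-1", "eu-central-1"})
```

The crucial property is that the fallback set is driven by the customer's residency configuration, not by availability alone: if no permitted region is healthy, the request fails rather than leaking data across a boundary.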

Observability and Cost Tracking

The gateway's position as a central chokepoint for all AI traffic makes it the natural place for observability and cost management. Every request that flows through the gateway is instrumented with metrics that feed into our monitoring and alerting systems.

What we track per request:

  • Token usage: Input and output tokens, mapped to cost by model and region
  • Latency breakdown: Network time, queue time, inference time, and post-processing time
  • Quality signals: Automated quality scores for a sample of responses, user feedback signals where available
  • Feature attribution: Which product feature generated the request, enabling per-feature cost allocation and usage analytics
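Per-request cost attribution is mostly arithmetic over token counts. The per-1K-token prices below are placeholder figures, not real rates, and real pricing varies by model and region:

```python
# Placeholder per-1K-token prices in USD (illustrative only).
PRICING = {
    "claude-3-haiku": {"input": 0.00025, "output": 0.00125},
    "claude-3-sonnet": {"input": 0.003, "output": 0.015},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Map a request's token usage to dollars using the model's rate card."""
    p = PRICING[model]
    return (input_tokens / 1000) * p["input"] + (output_tokens / 1000) * p["output"]

def attribute(records: list[dict]) -> dict[str, float]:
    """Aggregate per-request costs by feature for cost-allocation reporting."""
    totals: dict[str, float] = {}
    for r in records:
        cost = request_cost(r["model"], r["input_tokens"], r["output_tokens"])
        totals[r["feature"]] = totals.get(r["feature"], 0.0) + cost
    return totals

totals = attribute([
    {"feature": "resume-summary", "model": "claude-3-haiku",
     "input_tokens": 2000, "output_tokens": 400},
    {"feature": "skill-matching", "model": "claude-3-sonnet",
     "input_tokens": 1500, "output_tokens": 300},
])
```

Because every request carries a feature identifier through the gateway, this kind of roll-up falls out for free; the same records can be re-grouped by team, customer tier, or region.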

What we track in aggregate:

  • Cost trends by feature, team, customer tier, and region
  • Quality score distributions with drift detection
  • Cache hit rates and the cost savings they represent
  • Model availability and error rates by provider and region

This observability has paid for itself many times over. We've caught cost anomalies within hours, identified quality regressions before they impacted customers, and used usage data to inform product decisions about which AI features to invest in.

Lessons Learned

Building the AI Feature Gateway taught us several lessons that I'd share with any engineering team considering a similar approach:

Start earlier than you think you need to. We built the gateway after hitting pain points in production. In hindsight, we should have started when we had two AI features, not five. The cost of scattered prompt management compounds quickly as you scale.

Design for non-technical users from day one. The self-service capability was initially an afterthought—we built it for engineers first and added the management UI later. The management UI ended up being the most valuable feature of the entire platform.

Invest in evaluation infrastructure. The evaluation pipeline was the hardest part to build and the most valuable part to have. Without it, every prompt change is a leap of faith. With it, prompt iteration becomes a data-driven process.

Don't over-engineer routing early. We built sophisticated dynamic routing capabilities before we needed them. For the first year, simple region-based routing with manual failover would have been sufficient. Build what you need now and architect for what you'll need later.

Centralisation has trade-offs. The gateway is a single point of failure and a single point of latency. We invested heavily in redundancy, caching, and performance optimisation to mitigate these risks. Any team building a centralised gateway needs to treat it as critical infrastructure from the start.

Conclusion: Prompts Are the New Code

The most important mindset shift in building the gateway was recognising that prompts are production artifacts that deserve the same engineering discipline as application code. They need version control, testing, staged rollouts, monitoring, and rollback capabilities.

If your AI prompts still live in code repositories and require deployments to change, you're creating unnecessary friction and risk. A centralised gateway approach—whether you build it yourself or adopt an existing platform—transforms prompt management from a bottleneck into a competitive advantage.

The AI Feature Gateway has become one of the most strategically important pieces of infrastructure at PageUp. It's what allows us to iterate quickly on AI features, maintain quality at scale, and operate confidently across multiple regions. And it all started because someone asked, "Which version of this prompt is running in production?" and nobody could answer.