2026-02-03

Managing AI Infrastructure Costs: Balancing Self-Hosted vs API-Based Models

Nobody warns you about the AI cost curve. When you're building your first AI feature, the costs seem negligible—a few dollars a day for API calls during development, maybe a few hundred during testing. Then you ship to production, users love it, adoption climbs, and suddenly you're looking at an invoice that makes your CFO ask very pointed questions.

I've been on the receiving end of those questions. Managing AI infrastructure spend across AWS Bedrock deployments at PageUp has been one of the most educational experiences of my career—not because the technology is complex (it is), but because the economics are counterintuitive. The model that gives you the best results isn't always the most expensive. The cheapest option isn't always the most cost-effective. And the costs you plan for are rarely the costs you actually pay.

The industry data is sobering: 85% of organisations miss their AI cost projections by more than 10%, and nearly 25% miss by 50% or more. After navigating this landscape across multiple AI products, here's what I've learned about managing AI infrastructure costs without sacrificing quality.

The AI Cost Surprise

Most engineering leaders budget for AI costs based on model inference pricing—the per-token cost of calling an AI model. This is like budgeting for a car based on the sticker price without considering fuel, insurance, maintenance, and parking. The inference cost is real, but it's often not the dominant cost once you're running at scale.

When we first projected costs for our AI features at PageUp, we built a straightforward model: estimated requests per day multiplied by average tokens per request multiplied by the per-token price. The projection looked reasonable. The reality was significantly higher, because we hadn't accounted for the full cost stack.

Where the money actually goes:

  • Model inference: The per-token cost of running prompts through AI models. This is what most people budget for.
  • Supporting infrastructure: Knowledge bases, vector databases, orchestration services, caching layers, and monitoring systems. On AWS Bedrock, a Knowledge Base backed by OpenSearch can add $350+ per month before you process a single query.
  • Agent overhead: If you're using agent architectures where models call tools and chain multiple steps, a single user request can trigger 5-10x the tokens you'd expect from the user-facing prompt alone.
  • Logging and observability: Comprehensive logging of AI interactions for debugging, compliance, and evaluation generates significant storage and compute costs.
  • Engineering time: The ongoing cost of maintaining, evaluating, and improving AI systems. This is the cost most organisations forget entirely.
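To make the gap concrete, here's a minimal sketch of a naive projection versus a fuller cost model. All dollar figures and multipliers are illustrative placeholders, not our actual numbers:

```python
def naive_monthly_cost(requests_per_day, tokens_per_request, price_per_million):
    """The projection most teams start with: inference cost only."""
    monthly_tokens = requests_per_day * tokens_per_request * 30
    return monthly_tokens / 1_000_000 * price_per_million

def full_monthly_cost(requests_per_day, tokens_per_request, price_per_million,
                      agent_multiplier=1.0, fixed_infra=0.0, logging_and_obs=0.0):
    """Layer the hidden cost stack on top of raw inference."""
    inference = naive_monthly_cost(requests_per_day, tokens_per_request,
                                   price_per_million) * agent_multiplier
    return inference + fixed_infra + logging_and_obs

naive = naive_monthly_cost(10_000, 3_000, 3.00)
full = full_monthly_cost(10_000, 3_000, 3.00,
                         agent_multiplier=5,   # agent tool-calling overhead
                         fixed_infra=350,      # e.g. OpenSearch-backed KB
                         logging_and_obs=200)  # log storage and analysis
print(f"naive: ${naive:,.0f}/month, full: ${full:,.0f}/month")
```

Even with placeholder inputs, the shape of the result matches our experience: the full stack lands several multiples above the inference-only projection.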

Understanding Token Economics

Token pricing is the foundation of AI cost management, but it's more nuanced than it appears. The critical insight is that output tokens cost 3-5x more than input tokens across most providers. On AWS Bedrock, Claude Sonnet charges $3.00 per million input tokens but $15.00 per million output tokens. This asymmetry has significant implications for how you design your prompts and system architecture.

Practical implications of token economics:

  • Prompt design matters enormously. A well-structured prompt that guides the model to produce concise, focused outputs can cost a fraction of a verbose prompt that produces long responses. We've seen 40-60% cost reductions just from prompt optimisation without any quality degradation.
  • System prompts are expensive at scale. If you're including a long system prompt with every request, those input tokens add up quickly. At 10,000 requests per day with a 2,000-token system prompt, you're consuming 20 million input tokens daily on system prompts alone.
  • Caching changes the equation. Caching identical or semantically similar requests eliminates re-processing costs. The savings compound rapidly at scale.
  • Model selection by task complexity. Not every request needs your most powerful (and expensive) model. Routing simple tasks to smaller, cheaper models while reserving premium models for complex tasks can dramatically reduce costs.
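The asymmetry is easy to verify with the Claude Sonnet rates quoted above. The per-request token counts here are illustrative, but the arithmetic shows why trimming output tokens pays off disproportionately, and why system prompts dominate at volume:

```python
INPUT_PRICE = 3.00    # $ per million input tokens (Claude Sonnet on Bedrock)
OUTPUT_PRICE = 15.00  # $ per million output tokens

def request_cost(input_tokens, output_tokens):
    return (input_tokens / 1e6) * INPUT_PRICE + (output_tokens / 1e6) * OUTPUT_PRICE

# A verbose prompt that produces a long answer vs a tuned prompt that
# produces a focused one (token counts are illustrative):
verbose = request_cost(input_tokens=2_500, output_tokens=1_200)
tuned = request_cost(input_tokens=1_500, output_tokens=400)
print(f"verbose: ${verbose:.4f}/request, tuned: ${tuned:.4f}/request")

# System prompts at scale: 10,000 requests/day x 2,000-token system prompt
daily_system_tokens = 10_000 * 2_000
print(f"{daily_system_tokens / 1e6:.0f}M system-prompt input tokens/day, "
      f"${daily_system_tokens / 1e6 * INPUT_PRICE:,.0f}/day before any output")
```

In this toy example the tuned prompt costs less than half the verbose one, consistent with the 40-60% reductions we've seen from prompt optimisation alone.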

The Hidden Cost Traps

Beyond token economics, several hidden cost traps catch engineering teams by surprise.

The agent token multiplier. Agent architectures where the model plans and executes multi-step workflows can consume 5-10x more tokens than a simple prompt-response interaction. Each planning step, tool call, and result interpretation consumes tokens. A user request that looks like a single interaction from the outside might involve multiple model calls internally.
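A rough estimator makes the multiplier visible. The per-step token counts below are illustrative assumptions, not measurements, and real agent loops also re-send growing conversation history, which pushes the multiple higher still:

```python
def agent_tokens(user_prompt_tokens, steps,
                 planning=300, tool_result=400, interpretation=200):
    """Estimate tokens for an agent loop: each step re-sends the prompt
    plus planning, tool-result, and interpretation tokens (illustrative)."""
    per_step = user_prompt_tokens + planning + tool_result + interpretation
    return steps * per_step

single_call = 500  # what the request looks like from the outside
multi_step = agent_tokens(user_prompt_tokens=500, steps=3)
print(f"single call: {single_call} tokens, "
      f"3-step agent: {multi_step} tokens (~{multi_step / single_call:.0f}x)")
```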

The knowledge base tax. If you're using retrieval-augmented generation with vector databases, you're paying for both the retrieval infrastructure and the additional input tokens from retrieved context. The retrieval infrastructure runs continuously whether or not it's processing queries, creating a fixed cost floor.

The evaluation overhead. If you've built proper evaluation pipelines (and you should), those pipelines consume tokens too. Running evaluation suites against every prompt change or model update adds a cost that scales with the frequency of your iteration cycle.

The logging cost. Compliance and debugging require logging full prompts and responses. For high-volume AI features, this logging can generate terabytes of data monthly, with associated storage and analysis costs.

The Self-Hosting Calculus

The question of self-hosting versus API-based models has become increasingly relevant as open-source models have closed most of the quality gap with commercial APIs. Models like Llama 4 Maverick and DeepSeek V3 achieve 90-95% of the performance of GPT-4 and Claude on standard benchmarks, at a fraction of the cost.

When API access wins:

  • Below 1 billion tokens per month, API access is almost always more cost-effective
  • When you need access to the latest frontier models immediately
  • When you don't have dedicated ML infrastructure engineering capacity
  • When your usage is bursty or unpredictable

When self-hosting wins:

  • At 1-2 billion+ tokens per month consistently, the economics shift in favour of self-hosting
  • When data residency requirements prevent sending data to external APIs
  • When you need complete control over model behaviour and availability
  • When latency requirements demand dedicated infrastructure

The real cost of self-hosting:

Self-hosting isn't just GPU costs. It typically adds $300,000-$600,000 per year in engineering overhead for infrastructure management, model serving, monitoring, and maintenance. GPU costs on top of that start at roughly $2,200 per month for an H100 instance, though prices have dropped 40-60% since 2024. AWS cut H100 pricing by 44% in June 2025, and cloud GPU providers now offer H100 access at competitive rates.

In practice, the break-even point lines up with the volume thresholds above: sustained usage in the billions of tokens per month. At the largest scales, self-hosting can save millions annually depending on the models used. But most organisations aren't at those volumes, and the engineering complexity of running reliable model serving infrastructure is substantial.
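A back-of-envelope calculator shows how the fixed costs dominate at low volume. The GPU and engineering figures reuse this section's numbers; the blended per-token price, cluster size, and utilisation are my assumptions, and where the crossover lands is extremely sensitive to them, which is part of why cost projections miss:

```python
def api_monthly_cost(tokens_per_month, blended_price_per_million=5.00):
    """API cost at an assumed blended input/output price; tune to your mix."""
    return tokens_per_month / 1e6 * blended_price_per_million

def self_host_monthly_cost(gpu_instances=2, gpu_monthly=2_200,
                           engineering_yearly=360_000):
    """Fixed monthly cost: GPU instances plus amortised engineering overhead."""
    return gpu_instances * gpu_monthly + engineering_yearly / 12

for tokens in (1e9, 5e9, 10e9):
    api = api_monthly_cost(tokens)
    hosted = self_host_monthly_cost()
    winner = "self-host" if hosted < api else "API"
    print(f"{tokens / 1e9:>4.0f}B tokens/month: "
          f"API ${api:>7,.0f} vs self-host ${hosted:,.0f} -> {winner}")
```

The point isn't the exact crossover; it's that the fixed-cost floor makes self-hosting a volume game.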

AWS Bedrock Optimisation Strategies

For teams running on AWS Bedrock—as we do at PageUp—several optimisation strategies can dramatically reduce costs without sacrificing quality.

Prompt caching is the single most impactful optimisation. When your requests include repetitive content—system prompts, context documents, or standard instructions—prompt caching avoids reprocessing that content on every request. AWS reports cost reductions of up to 90% and latency improvements of up to 85% with prompt caching. In a real-world case, Care Access achieved 86% cost reduction by caching static medical record content while varying only the analysis questions across 300-500+ daily records. We've seen similar results with our recruitment context—job descriptions, company profiles, and evaluation criteria that rarely change but appear in every request.
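As a sketch of how this looks in practice, the Bedrock Converse API takes a cachePoint marker after the large, static content, telling Bedrock to cache everything before it. The model ID, context, and question here are illustrative, and the actual call requires AWS credentials:

```python
# Static content that appears in every request, e.g. a job description
# plus evaluation criteria (illustrative placeholder):
STATIC_CONTEXT = "Role: Senior Engineer. Criteria: ..."

def build_cached_request(question: str) -> dict:
    """Build Converse API arguments with a cache boundary after the
    static system content, so only the question varies per request."""
    return {
        "modelId": "anthropic.claude-3-5-sonnet-20240620-v1:0",  # illustrative
        "system": [
            {"text": STATIC_CONTEXT},
            {"cachePoint": {"type": "default"}},  # cache everything above
        ],
        "messages": [
            {"role": "user", "content": [{"text": question}]},
        ],
    }

request = build_cached_request("Does this candidate meet the criteria?")
# With credentials configured:
# import boto3
# client = boto3.client("bedrock-runtime")
# response = client.converse(**request)
```

The design choice worth noting: keep the variable portion of the request after the cache point, because anything before it must be byte-identical across requests to hit the cache.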

Intelligent Prompt Routing dynamically routes requests between models within the same family based on predicted response quality and cost. Simple requests go to cheaper models; complex ones go to more capable models. AWS reports up to 30% cost reduction without compromising accuracy, with minimal routing overhead of approximately 85 milliseconds.

Model Distillation creates smaller, task-specific models trained on the outputs of larger models. The distilled models can be up to 500% faster and 75% less expensive, with less than 2% accuracy loss for focused use cases like retrieval-augmented generation. This is particularly effective when you have high-volume, well-defined tasks that don't require the full generality of a frontier model.

Batch processing offers discounted rates for workloads that don't require real-time responses. We use batch processing for bulk resume analysis, periodic skill matching refreshes, and evaluation pipeline runs—any workload where a few minutes of additional latency is acceptable.

The Open-Source Revolution

The open-source model landscape has shifted dramatically, creating genuine alternatives to commercial APIs. The cost differences are striking.

To put real numbers on it: for a typical enterprise workload processing 10 million input tokens and 1 million output tokens daily, the monthly cost varies wildly by model choice. A premium proprietary model might cost $3,900 per month. A mid-tier option might cost $1,350. And an efficient open-source model accessed via API can cost as little as $42. That's not a typo—it's a 90x cost difference.
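The mid-tier figure is easy to reproduce from the Claude Sonnet rates quoted earlier ($3/$15 per million input/output tokens); the premium and open-source rates below are illustrative values consistent with the figures above, not quotes from any provider:

```python
DAILY_INPUT, DAILY_OUTPUT = 10e6, 1e6  # tokens/day, per the workload above

def monthly_cost(in_price, out_price, days=30):
    """Monthly cost at given $/million-token rates for this workload."""
    return days * (DAILY_INPUT / 1e6 * in_price + DAILY_OUTPUT / 1e6 * out_price)

premium = monthly_cost(10.00, 30.00)    # illustrative premium rates
mid_tier = monthly_cost(3.00, 15.00)    # Claude Sonnet rates from above
open_source = monthly_cost(0.11, 0.34)  # illustrative hosted open-source rates
print(f"premium: ${premium:,.0f}, mid-tier: ${mid_tier:,.0f}, "
      f"open-source: ${open_source:,.2f} per month "
      f"(~{premium / open_source:.0f}x spread)")
```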

Open-source models accessed through hosted providers like Together AI, Groq, or Fireworks offer the best value for most teams. You get near-frontier-quality outputs at a fraction of the cost, without the infrastructure burden of self-hosting. Llama 4 Maverick runs at $0.20-0.27 per million input tokens compared to $3.00 for Claude Sonnet or $1.75 for GPT-5.

The quality gap has narrowed enough that for many production use cases, the difference is negligible. The key is evaluating which tasks genuinely require frontier model capabilities and which can be served equally well by more affordable alternatives.

Building a Cost Decision Framework

Rather than making blanket decisions about models and infrastructure, we built a decision framework that evaluates each AI feature individually.

The framework considers five factors:

  1. Volume: Expected request volume determines whether API pricing or self-hosting is more economical
  2. Quality requirements: The minimum acceptable quality threshold determines which models are viable candidates
  3. Latency requirements: Real-time features need low-latency serving; background processing can tolerate batch pricing
  4. Data sensitivity: Highly sensitive data may require self-hosted models or specific regional deployment
  5. Iteration velocity: Features under active development benefit from API flexibility; stable features benefit from optimised infrastructure

Our tiered approach:

  • Tier 1 (simple tasks): Route to the most cost-effective model that meets quality thresholds—typically a smaller open-source model via API
  • Tier 2 (standard tasks): Use mid-tier models with prompt caching and intelligent routing to optimise cost
  • Tier 3 (complex tasks): Use frontier models for tasks that genuinely require them, with aggressive caching and response validation

This hybrid multi-model strategy can reduce total costs by 60-75% compared to routing all requests through a single frontier model. The key is that most requests don't need the most powerful model—they need a model that's good enough for the specific task.
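The tiered approach above can be sketched as a simple router. The classification heuristics and model names are illustrative stand-ins for a real classifier and model catalogue:

```python
TIERS = {
    1: "small-open-source-via-api",  # simple tasks: cheapest acceptable model
    2: "mid-tier-with-caching",      # standard tasks: caching + routing
    3: "frontier-model",             # complex tasks: aggressive caching
}

def classify(task: dict) -> int:
    """Heuristic tier assignment; a real system might use a learned router."""
    if task.get("needs_multi_step_reasoning"):
        return 3
    if task.get("output_tokens_expected", 0) > 1_000 or task.get("domain_specific"):
        return 2
    return 1

def route(task: dict) -> str:
    return TIERS[classify(task)]

print(route({"kind": "extract_job_title"}))
print(route({"kind": "summarise_resume", "domain_specific": True}))
print(route({"kind": "plan_interview", "needs_multi_step_reasoning": True}))
```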

Monitoring and Governance

Cost management isn't a one-time exercise—it requires continuous monitoring and governance.

What to monitor:

  • Per-feature cost attribution: Know exactly which AI features cost what, so you can make informed investment decisions
  • Cost per interaction: Track the average cost of serving each user request, broken down by model, caching, and infrastructure components
  • Cost efficiency trends: Monitor whether costs are growing linearly with usage (expected) or super-linearly (a problem)
  • Cache hit rates: Low cache hit rates indicate optimisation opportunities; declining rates may indicate changing usage patterns
  • Model quality vs cost: Track the relationship between model quality scores and costs to ensure you're not over-spending for marginal quality gains
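Per-feature attribution doesn't need heavy tooling to start. Here's a minimal sketch that rolls token-usage logs up into cost per feature; the log schema and prices are assumptions about what such a pipeline might ingest:

```python
from collections import defaultdict

PRICES = {  # $/million tokens (input, output); illustrative
    "sonnet": (3.00, 15.00),
    "small": (0.20, 0.80),
}

def attribute_costs(usage_log):
    """Sum per-request token costs into a per-feature total."""
    totals = defaultdict(float)
    for rec in usage_log:
        in_p, out_p = PRICES[rec["model"]]
        totals[rec["feature"]] += (rec["in_tokens"] / 1e6 * in_p
                                   + rec["out_tokens"] / 1e6 * out_p)
    return dict(totals)

log = [
    {"feature": "skill_match", "model": "small", "in_tokens": 4e6, "out_tokens": 1e6},
    {"feature": "jd_writer", "model": "sonnet", "in_tokens": 2e6, "out_tokens": 1e6},
]
print(attribute_costs(log))
```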

Governance practices:

  • Cost budgets per feature: Set monthly cost budgets for each AI feature with alerting when approaching thresholds
  • Model selection reviews: Regularly review whether each feature is using the most cost-effective model for its quality requirements
  • Quarterly cost optimisation sprints: Dedicate engineering time to reviewing and optimising AI costs, just as you would for any other infrastructure spend
  • Stakeholder reporting: Provide leadership with clear, regular reports on AI costs, value delivered, and optimisation opportunities
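The budget-alerting practice reduces to a simple check, whatever tooling sits around it. The 80% warning threshold and the status names here are placeholders:

```python
def budget_status(spend_to_date, monthly_budget, warn_at=0.8):
    """Return a status for a feature's spend against its monthly budget."""
    ratio = spend_to_date / monthly_budget
    if ratio >= 1.0:
        return "over_budget"
    if ratio >= warn_at:
        return "warning"
    return "ok"

print(budget_status(750, 1_000))    # well under the alert threshold
print(budget_status(850, 1_000))    # past the 80% warning line
print(budget_status(1_100, 1_000))  # over budget
```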

Conclusion: Cost as a Design Constraint

The most important mindset shift in managing AI infrastructure costs is treating cost as a design constraint, not an afterthought. Just as you design for performance, security, and reliability, you should design for cost efficiency from the beginning.

This means choosing the right model for each task, not defaulting to the most powerful option. It means investing in caching, prompt optimisation, and intelligent routing as core infrastructure. It means building monitoring and governance processes that keep costs visible and accountable.

The organisations that succeed with AI at scale aren't necessarily the ones spending the most—they're the ones spending most intelligently. In a landscape where the same workload can cost anywhere from $42 to $3,900 per month depending on your choices, the engineering decisions around cost optimisation have enormous business impact.

AI infrastructure costs will continue to evolve as models become more efficient, competition drives prices down, and new optimisation techniques emerge. The teams that build strong cost management foundations now—with monitoring, governance, and flexible multi-model architectures—will be best positioned to capture those improvements as they arrive.