NOVEMBER 13, 2025 • 7 min read
The quiet cost of AI: shadow compute budgets and the new DevOps blind spot

AI projects rarely fail because the model “isn’t smart enough.” They fail because the money meter spins where few teams are watching: GPU hours, token bills, data egress, and serving inefficiencies that quietly pile up after launch.
The real challenge for CTOs in 2025 and the foreseeable future is not choosing the right model, but owning the full cost surface—from inference and networking to evaluation pipelines and architectural trade-offs that balance latency, throughput, and cost.
Inference is the real bill, not training
Across the industry, inference dominates lifetime spend. Training is episodic; inference happens every time a user interacts with the system. According to research presented at OSDI 2024, the main performance and cost trade-offs now revolve around inference system design—throughput versus latency. Improving one usually worsens the other unless the serving pipeline is rebuilt around adaptive batching and memory management.
Two practical consequences follow: inference spend scales with usage, so it has to be budgeted and monitored like any other production workload, and the serving architecture, rather than the model choice, becomes the main lever for controlling it.
The four hidden lines on your AI cost sheet
1) Accelerator hours (and how you buy them)
On-demand GPU pricing differs drastically across clouds and regions. For example, AWS p5/H100, Azure NC H100, and GCP A3 instances vary roughly 40–60% depending on region and availability. Google Cloud’s TPU families (v5e, v5p, Trillium) also post transparent per-chip-hour rates, which shift economics for certain workloads.
Spot or preemptible capacity cuts costs by 60–90 % but introduces interruption risk—two-minute notice on AWS, roughly thirty seconds on GCP. For stateless or resilient jobs, that trade is worth it if your orchestration and checkpointing can recover quickly.
Action for CTOs: separate training, batch/offline inference, and interactive inference pools. Use spot where safe, and pin business-critical tiers to on-demand or reserved capacity. Most pipelines aren’t interruption-tolerant until you explicitly test them.
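To see whether spot capacity actually saves money once interruptions and rework are priced in, a quick model helps. The sketch below is illustrative only: the hourly rate, discount, interruption rate, and rework fraction are placeholder assumptions you would replace with your own pricing and measured behaviour.

```python
# Illustrative sketch: compare on-demand vs. spot cost for a batch job,
# accounting for work lost to interruptions. All rates and probabilities
# are placeholders, not real cloud prices.

def effective_spot_cost(on_demand_rate, spot_discount, job_hours,
                        interruption_rate_per_hour, rework_fraction):
    """Estimate total cost of running a job on spot capacity.

    interruption_rate_per_hour: expected interruptions per instance-hour.
    rework_fraction: fraction of an hour's work lost per interruption
                     (depends on how often you checkpoint).
    """
    spot_rate = on_demand_rate * (1 - spot_discount)
    expected_interruptions = interruption_rate_per_hour * job_hours
    wasted_hours = expected_interruptions * rework_fraction
    return spot_rate * (job_hours + wasted_hours)


on_demand_rate = 12.0      # $/GPU-hour, hypothetical on-demand price
job_hours = 200            # total GPU-hours for the batch job

baseline = on_demand_rate * job_hours
spot = effective_spot_cost(on_demand_rate, spot_discount=0.7, job_hours=job_hours,
                           interruption_rate_per_hour=0.05, rework_fraction=0.5)

print(f"on-demand: ${baseline:,.0f}  spot (with rework): ${spot:,.0f}")
```

Run it with your own checkpointing interval and observed interruption rate; the comparison often flips once rework is counted.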
2) Networking and data egress
Ingress is usually free; egress almost never is. Every cloud provider bills for data leaving the region, and cross-region flows multiply the cost. Even though AWS and Google introduced limited “no-cost transfer” policies in 2024 after EU regulatory pressure, those programs tend to be narrow, migration-focused, and conditional, and they do not apply to everyday cross-region or cross-cloud traffic.
Action: co-locate model servers, vector stores, and data pipelines in the same region. Avoid RAG architectures that pull embeddings across regions on every query—it’s a silent, repeating egress charge that remains billable even under these regulatory programs.
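A rough estimate of what cross-region retrieval costs per month is easy to produce and worth putting in front of the team. The sketch below uses hypothetical traffic and a placeholder per-GB rate; substitute your provider's current inter-region transfer pricing.

```python
# Back-of-the-envelope sketch: monthly egress cost of a RAG service that
# pulls embeddings/documents from another region on every query.
# Traffic volumes and the per-GB rate are assumed values.

def monthly_egress_cost(queries_per_day, kb_per_query, egress_rate_per_gb):
    gb_per_month = queries_per_day * 30 * kb_per_query / 1_000_000
    return gb_per_month * egress_rate_per_gb, gb_per_month


cost, volume = monthly_egress_cost(
    queries_per_day=200_000,   # hypothetical traffic
    kb_per_query=2_000,        # documents + embeddings pulled per query
    egress_rate_per_gb=0.02,   # assumed inter-region $/GB
)
print(f"~{volume:,.0f} GB/month leaving the region -> ~${cost:,.2f}/month")
```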
3) Token-metered API spend
When you use hosted model APIs, your unit economics depend on input/output tokens, context length, and caching. OpenAI, Anthropic, and others publish pricing by token and model tier.
Small design choices—prompt shaping, output limits, and cache reuse—change cost curves by multiples.
Action: cap output tokens per feature, reuse cached prompt prefixes where your provider supports it, and route low-stakes requests to cheaper model tiers. A rough per-request cost model, like the sketch below, makes these trade-offs visible before they hit the invoice.
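The sketch is a rough estimate only; token prices, cache discounts, and volumes are assumed values, not any vendor's actual rates.

```python
# Illustrative sketch: estimate cost per request and per million calls for
# a token-metered API. Read prices from your vendor's pricing page rather
# than hard-coding them; these numbers are placeholders.

def request_cost(input_tokens, output_tokens, price_in_per_mtok, price_out_per_mtok,
                 cached_fraction=0.0, cache_discount=0.5):
    """Cost of one call, optionally with part of the prompt served from cache.

    cached_fraction: share of input tokens served from the prompt cache.
    cache_discount: assumed discount applied to cached input tokens.
    """
    cached = input_tokens * cached_fraction
    fresh = input_tokens - cached
    cost_in = (fresh + cached * (1 - cache_discount)) / 1e6 * price_in_per_mtok
    cost_out = output_tokens / 1e6 * price_out_per_mtok
    return cost_in + cost_out


per_call = request_cost(input_tokens=6_000, output_tokens=800,
                        price_in_per_mtok=3.0, price_out_per_mtok=15.0,
                        cached_fraction=0.7)
print(f"~${per_call:.4f}/call -> ~${per_call * 1_000_000:,.0f} per million calls")
```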
4) Evaluation, guardrails, and “non-feature” workloads
Safety scanning, red-team tests, and automated evaluation loops have become continuous workloads. UK and EU institutions increasingly expect this kind of evaluation-centric assurance, so these checks now run as background jobs that quietly consume compute while also satisfying formal requirements such as the EU AI Act’s post-market monitoring and logging obligations.
Action: treat evaluations as a first-class workload with a defined cadence and budget. Schedule them on cheaper, preemptible capacity rather than letting them overlap with latency-sensitive inference windows.
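If evaluations are going to run on preemptible capacity, they need to survive being killed mid-suite. A minimal sketch, assuming a generic `run_eval_case` callable that stands in for whatever evaluation or guardrail check you actually run:

```python
# Minimal sketch of an evaluation runner that tolerates preemption:
# progress is checkpointed to disk so a restarted spot/preemptible
# instance resumes where it left off.

import json
from pathlib import Path

CHECKPOINT = Path("eval_progress.json")

def load_done():
    return set(json.loads(CHECKPOINT.read_text())) if CHECKPOINT.exists() else set()

def save_done(done):
    CHECKPOINT.write_text(json.dumps(sorted(done)))

def run_suite(cases, run_eval_case):
    """cases: dict of case_id -> evaluation input."""
    done = load_done()
    for case_id, case in cases.items():
        if case_id in done:
            continue                 # already evaluated before a restart
        run_eval_case(case)          # your eval / red-team / safety check
        done.add(case_id)
        save_done(done)              # checkpoint after every case
```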
The serving blind spot: where DevOps playbooks fall short
Traditional DevOps optimizes for availability and latency. AI serving introduces a third variable—GPU utilization. Under-utilized accelerators waste budget; over-batched systems break latency SLOs.
Modern serving frameworks such as vLLM, NVIDIA TensorRT-LLM, and Hugging Face Text Generation Inference have matured around these challenges, exposing levers like continuous batching, paged KV caching, and quantization.
Take action: choose a serving framework early, benchmark with your real traffic patterns, and track throughput at 95th percentile latency as your core cost-efficiency metric.
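One way to compute that metric from your own request logs, assuming each log entry carries a latency and an output-token count (the toy data below is illustrative):

```python
# Sketch: sustained token throughput alongside 95th-percentile latency,
# derived from (latency_seconds, output_tokens) tuples exported from the
# serving layer over a measurement window.

import statistics

def throughput_at_p95(requests, window_seconds):
    latencies = [lat for lat, _ in requests]
    p95 = statistics.quantiles(latencies, n=20, method="inclusive")[18]  # 95th percentile
    total_tokens = sum(tokens for _, tokens in requests)
    return p95, total_tokens / window_seconds


p95_latency, tokens_per_sec = throughput_at_p95(
    requests=[(0.8, 350), (1.2, 420), (2.9, 600), (0.7, 300)],  # toy data
    window_seconds=60,
)
print(f"p95 latency: {p95_latency:.2f}s  throughput: {tokens_per_sec:.0f} tok/s")
```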
Design choices that quietly burn money
Long contexts that never pay back. Large context windows sound attractive, but they increase memory pressure and slow prefill. Unless you genuinely use them, shorter contexts plus RAG summaries are cheaper and faster.
Cross-region RAG. Pulling documents or embeddings across regions on every call is a textbook egress trap. Replicate data locally instead; these flows are not covered by the 2024 “no-cost transfer” programs and remain fully billable.
Unmanaged spot usage in interactive paths. Spot is fine for batch or evaluation jobs, not for live traffic without failover logic.
Token-blind product features. Features like “show reasoning traces” inflate output tokens per user. Make token cost visible in dashboards so PMs see cost per interaction.
A cost control blueprint that actually works
1) Make cost an SLO alongside latency and quality
Publish three SLOs per route: latency, quality, and cost per successful call. If cost is invisible, it will drift.
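A minimal sketch of what such a per-route check might look like; the route names, thresholds, and inputs are hypothetical:

```python
# Sketch: treat cost per successful call as an SLO next to latency and quality.

ROUTE_SLOS = {
    "chat":   {"p95_latency_s": 2.0, "min_quality": 0.85, "max_cost_usd": 0.030},
    "search": {"p95_latency_s": 0.8, "min_quality": 0.80, "max_cost_usd": 0.004},
}

def check_route(route, p95_latency_s, quality_score, spend_usd, successful_calls):
    slo = ROUTE_SLOS[route]
    cost_per_success = spend_usd / max(successful_calls, 1)
    return {
        "latency_ok": p95_latency_s <= slo["p95_latency_s"],
        "quality_ok": quality_score >= slo["min_quality"],
        "cost_ok": cost_per_success <= slo["max_cost_usd"],
        "cost_per_success": round(cost_per_success, 4),
    }

print(check_route("chat", p95_latency_s=1.7, quality_score=0.88,
                  spend_usd=412.50, successful_calls=15_000))
```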
2) Choose a serving stack that exposes the right levers
Use frameworks that offer continuous (in-flight) batching, KV caching, and quantization. Benchmark before committing.
3) Separate pools by interruption tolerance
Interactive → on-demand or reserved.
Batch → spot with checkpointing.
Training/evaluation → preemptible with retries.
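One lightweight way to make that separation explicit in code, with illustrative pool names that are not tied to any particular cloud or orchestrator:

```python
# Sketch: make interruption tolerance an explicit attribute of each
# workload class so schedulers cannot silently mix pools.

CAPACITY_POOLS = {
    "interactive": {"purchase": "on-demand/reserved", "preemptible": False},
    "batch":       {"purchase": "spot",               "preemptible": True, "needs_checkpointing": True},
    "training":    {"purchase": "preemptible",        "preemptible": True, "max_retries": 3},
    "evaluation":  {"purchase": "preemptible",        "preemptible": True, "max_retries": 3},
}

def pool_for(workload_class: str) -> dict:
    # Fail loudly rather than defaulting an unknown workload onto spot.
    return CAPACITY_POOLS[workload_class]
```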
4) Keep the bytes local
Run the model, vector store, and document cache in the same region or availability zone. Measure egress precisely.
5) Introduce budget-aware decoding and routing
Implement token caps, prompt caching, and fallback models. Measure real cost per feature.
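A sketch of what budget-aware routing can look like, assuming a generic `call_model` function that returns the generated text and an output-token count; the model names, prices, and budgets are placeholders:

```python
# Sketch: cap output tokens and fall back to a cheaper model once a
# feature's daily budget is spent. Plug in your provider SDK via call_model.

DAILY_BUDGET_USD = {"summarize": 50.0, "chat": 200.0}
spend_today = {"summarize": 0.0, "chat": 0.0}

PRIMARY  = {"model": "large-model", "out_price_per_mtok": 15.0, "max_tokens": 800}
FALLBACK = {"model": "small-model", "out_price_per_mtok": 1.5,  "max_tokens": 400}

def route(feature, prompt, call_model):
    cfg = PRIMARY if spend_today[feature] < DAILY_BUDGET_USD[feature] else FALLBACK
    text, output_tokens = call_model(cfg["model"], prompt, max_tokens=cfg["max_tokens"])
    spend_today[feature] += output_tokens / 1e6 * cfg["out_price_per_mtok"]
    return text
```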
6) Adopt FinOps for AI with a real data model
Normalize billing using the FOCUS standard so finance and engineering see the same metrics. Review FinOps Foundation’s AI workload guides to forecast training versus inference spend.
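A sketch of that normalization step, mapping two raw billing exports onto shared FOCUS-style column names; the column mappings are illustrative and should be checked against the current FOCUS specification and your own export schemas:

```python
# Sketch: normalize provider billing exports to a FOCUS-style schema so
# finance and engineering aggregate the same columns.

import pandas as pd

AWS_MAP = {"lineItem/UnblendedCost": "BilledCost",
           "product/servicecode": "ServiceCategory",
           "lineItem/UsageStartDate": "ChargePeriodStart"}
GCP_MAP = {"cost": "BilledCost",
           "service.description": "ServiceCategory",
           "usage_start_time": "ChargePeriodStart"}

def normalize(df: pd.DataFrame, column_map: dict, provider: str) -> pd.DataFrame:
    out = df.rename(columns=column_map)[list(column_map.values())].copy()
    out["ProviderName"] = provider
    return out

# unified = pd.concat([normalize(aws_df, AWS_MAP, "aws"),
#                      normalize(gcp_df, GCP_MAP, "gcp")])
# unified.groupby(["ProviderName", "ServiceCategory"])["BilledCost"].sum()
```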
7) Measure by model shape
Different architectures—Mixture-of-Experts, RAG, long-context—shift cost drivers differently. Instrument them separately.
A 30-60-90 plan for taking back control
💡 Days 0–30
💡 Days 31–60
💡 Days 61–90
Watchlist: facts that move your budget in 2025 and 2026
👉 Accelerator pricing dispersion: H100 hourly rates differ by region and cloud. Benchmark quarterly.
👉 Token pricing revisions: Vendors adjust rates and caching discounts often—monitor official pages.
👉 Egress policy shifts: Regulatory updates under the EU Data Act and UK competition rulings may change regional transfer costs, but so far they mainly affect migration scenarios rather than day-to-day multi-region traffic.
👉 Serving breakthroughs: New batching and attention techniques can improve efficiency without new hardware.
👉 FinOps standardization: The FOCUS schema is spreading across clouds, enabling unified cost visibility.
Shadow compute budgets aren’t a finance issue but an architecture issue. Treat cost as a first-class SLO, deploy serving frameworks with explicit utilization levers, keep data local, and instrument every token and gigabyte that leaves your stack.
The goal isn’t merely to shrink your bill; it’s to make cost predictable enough that you can scale AI usage confidently.
Want an outside pair of hands?
Blocshop designs and delivers custom software and AI integrations with measurable performance, quality, and cost targets.
If you’d like a second opinion on your serving stack, GPU plan, or token economics, we can help. 👉 Schedule a free consultation with Blocshop
Learn more from our insights

NOVEMBER 3, 2025 • 7 min read
CE marking software under the EU AI Act – who needs it and how to prepare a conformity assessment
From 2026, AI systems classified as high-risk under the EU Artificial Intelligence Act (Regulation (EU) 2024/1689) will have to undergo a conformity assessment and obtain a CE marking before being placed on the EU market or put into service.

October 19, 2025 • 7 min read
EU and UK AI regulation compared: implications for software, data, and AI projects
Both the European Union and the United Kingdom are shaping distinct—but increasingly convergent—approaches to AI regulation.
For companies developing or deploying AI solutions across both regions, understanding these differences is not an academic exercise. It directly affects how software and data projects are planned, documented, and maintained.

October 9, 2025 • 5 min read
When AI and GDPR meet: navigating the tension between AI and data protection
When AI-powered systems process or generate personal data, they enter a regulatory minefield — especially under the EU’s General Data Protection Regulation (GDPR) and the emerging EU AI Act regime.

September 17, 2025 • 4 min read
6 AI integration use cases enterprises can adopt for automation and decision support
The question for most companies is no longer if they should use AI, but where it will bring a measurable impact.