Dev Espresso #9 - AI Costs, Security Risks, and the Shift to Local Models

Dariusz Luber
One thing is already clear to me: AI is no longer just an "assistant" layer. It is becoming an infrastructure layer. This is no longer about who can generate a prettier chatbot response. It is about who can build agent systems that are effective, secure, and economically measurable.

I see a clear transition here: from the prompt era to the operations era. Companies are simultaneously fighting for model sovereignty, organizing agent memory, seeking an edge in cybersecurity, and colliding with physical scaling limits such as energy.

TL;DR

  • The market is shifting from monolithic chatbots to agent architectures with memory, orchestration, and cost control.
  • NVIDIA and Microsoft are investing in their own models and their own stack because, at this stage of AI, technological independence equals business advantage.
  • Local models are back in play: browsers, operating systems, and new hardware are turning AI into a built-in device capability, not only a cloud service.
  • Memory ops is no longer just a UX add-on. Memory consolidation is a foundation for quality in long-running workflows.
  • The most advanced model capabilities in cybersecurity are both a valuable defensive tool and a serious abuse risk.
  • Regulation and energy cost are starting to directly shape AI product roadmaps.

1. The Battle of Giants: NVIDIA and Microsoft Declare Technological Independence

Over the last two years, most of the market operated in a simple model: you buy access to someone else's model and build your product on top of an API. To me, that model is now breaking down, because the biggest players no longer want to be customers of someone else's APIs. They are building their own models, their own runtime environments, and their own cost control.

NVIDIA Nemotron 3 Ultra: A Model for Long Agent Runs

NVIDIA introduced Nemotron 3 Ultra as a model designed not for "nice conversation," but for long-form reasoning and execution in agent loops. Key parameters (550B total parameters, 55B active) and a focus on a very large context window signal a strategic direction: the model is meant to be an engine for multi-step execution, not just a conversational interface.
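To put the two parameter counts in perspective, here is a rough arithmetic sketch. It assumes the "active" figure means sparse activation (only part of the network runs per token) and uses the common rule of thumb of about 2 FLOPs per active parameter per token for a forward pass. Both are my assumptions, not numbers published by NVIDIA.

```python
# Rough arithmetic for a sparsely activated model.
# Assumption: memory scales with TOTAL parameters, per-token compute with ACTIVE ones.

TOTAL_PARAMS = 550e9    # total parameters quoted in the article
ACTIVE_PARAMS = 55e9    # active parameters quoted in the article


def active_fraction(total: float, active: float) -> float:
    """Share of the model that participates in processing a single token."""
    return active / total


def forward_flops_per_token(active_params: float) -> float:
    """Rule-of-thumb estimate: ~2 FLOPs per active parameter per token."""
    return 2 * active_params


print(f"active share: {active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS):.0%}")
print(f"approx. forward FLOPs per token: {forward_flops_per_token(ACTIVE_PARAMS):.2e}")
```

Under these assumptions, each token pays the compute bill of a 55B model while the full 550B still has to be held in memory, which is the trade-off that makes long agent loops affordable.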

Source: NVIDIA Developer Blog.

Microsoft MAI: Control of the Model Means Control of Margin

At Build 2026, Microsoft clearly signaled its intent to reduce dependence on external model providers. The MAI family and its focus on token efficiency are not only a technology move. They are also a financial and product move: owning the model gives tighter control over inference cost, response time, and feature iteration speed.

Source: Microsoft Build 2026 - keynote transcript.

AI Is Becoming More Local: Chrome and Microsoft Show That Not Everything Must Go Through the Cloud

This is one of those threads that is easy to miss if you only look at major model launches. I have a strong sense that the most interesting developments are happening closer to the user.

Chrome is expanding built-in AI powered by Gemini Nano, a model managed directly by the browser. The model is downloaded to the device and can run locally, without sending content to an external API for every use. That is an important signal: the browser is no longer only a window into the cloud, but is becoming a lightweight runtime environment for local AI.

Microsoft took a similar path from the operating-system side. Fluid dictation in Voice Access on Copilot+ PCs runs on-device and uses small language models to fix punctuation and grammar and to remove filler words while you dictate. It looks like a "small" feature, but it clearly shows the direction: a local model does not have to be good at everything. It only needs to do one thing well, quickly, and with privacy in mind.

Source: Fluid dictation | Microsoft Support.

And in my opinion, this part of the market is still heavily underrated. The most interesting local-model use cases do not need to look like yet another chatbot. They can simply become part of the system utility layer: dictation, text correction, content filtering, local summarization, simple classification, or private support for an agent running next to us.

NVIDIA RTX Spark: No Longer Just a GPU, but a Full Computer Platform for Local AI

The second missing piece in this puzzle is hardware. NVIDIA is no longer limiting itself to shipping another generation of graphics cards for PCs built around someone else's CPU. RTX Spark marks the company's entry into a new category: a full superchip for Windows PCs, co-developed with MediaTek.

Technically, this looks like a very explicit attempt to bring "AI-first hardware" logic from servers and Apple Silicon into personal computers:

  • up to a 20-core CPU,
  • Blackwell RTX GPU with 6,144 CUDA cores,
  • up to 1 petaflop of AI performance,
  • up to 128 GB of unified memory.

This is not just benchmark trivia. It is a concrete hardware thesis: the personal computer should be ready to run large models and agents locally, without bouncing every action back to the cloud. According to NVIDIA, this class of hardware should be enough to run models around 120B with context in the range of 1 million tokens, and Microsoft adds a security layer plus OpenShell (running models/agents in an isolated environment) as a runtime for local agents.
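A quick back-of-the-envelope check shows what "around 120B in 128 GB" implies. This sketch counts model weights only and assumes typical storage sizes per parameter. KV cache, activations, and operating-system overhead are deliberately ignored, so real headroom would be smaller.

```python
# Weights-only footprint estimate. Assumption: 1 GB = 1e9 bytes, and nothing
# except the model weights is counted (no KV cache, activations, or OS overhead).

UNIFIED_MEMORY_GB = 128   # upper bound quoted for RTX Spark
PARAMS_BILLION = 120      # model size mentioned in the article

# Assumed storage cost per parameter for common precisions, in bytes.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_footprint_gb(params_billion: float, bytes_per_param: float) -> float:
    """Size of the weights in GB for a model with the given parameter count."""
    return params_billion * bytes_per_param


for precision, nbytes in BYTES_PER_PARAM.items():
    size = weight_footprint_gb(PARAMS_BILLION, nbytes)
    verdict = "fits" if size <= UNIFIED_MEMORY_GB else "does not fit"
    print(f"{precision}: {size:.0f} GB -> {verdict} in {UNIFIED_MEMORY_GB} GB")
```

Under these assumptions, 16-bit weights (240 GB) exceed 128 GB, while 8-bit (120 GB) and 4-bit (60 GB) weights fit. A 120B model on this class of hardware therefore presumably relies on quantized weights, and a very long context would need memory on top of that.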

2. When AI Starts "Dreaming": Why Memory Becomes a Critical Layer

The biggest hidden problem with agents today is not how intelligent they are at any single point in time, but how their long-term memory breaks down as work accumulates. The longer the workflow, the higher the risk of:

  • conflicting notes,
  • outdated facts,
  • loss of business context between sessions.

That is why interest is growing in approaches like Anthropic's Auto Dream: cyclical memory consolidation and knowledge organization outside the active session. The analogy is REM sleep: the system is not creating new work, but reorganizing how knowledge is represented.

[Agent work] -> session logs and artifacts
  |
  v
[Consolidation phase] -> contradiction cleanup
  |
  +-> time and fact normalization
  +-> working memory compression
  +-> durable knowledge update

Source: Auto Dream mechanics.

In practice, this means a new engineering discipline: memory ops. In my experience, without it, even a good model starts making progressively weaker decisions over time.
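To show what a consolidation pass could look like in the simplest case, here is a minimal sketch. It is not Anthropic's implementation: `MemoryNote` and `consolidate` are hypothetical names, and "contradiction cleanup" is reduced to keeping the newest note per key after normalizing timestamps to UTC.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class MemoryNote:
    key: str                # what the fact is about, e.g. "deploy_target"
    value: str              # the remembered fact
    recorded_at: datetime   # timezone-aware timestamp


def consolidate(notes: list[MemoryNote]) -> dict[str, MemoryNote]:
    """Offline pass: normalize timestamps to UTC, then keep only the newest
    note per key so that older, contradictory notes are dropped."""
    durable: dict[str, MemoryNote] = {}
    for note in notes:
        normalized = MemoryNote(
            note.key, note.value, note.recorded_at.astimezone(timezone.utc)
        )
        current = durable.get(normalized.key)
        if current is None or normalized.recorded_at > current.recorded_at:
            durable[normalized.key] = normalized
    return durable


session_log = [
    MemoryNote("deploy_target", "staging",
               datetime(2026, 6, 1, 9, 0, tzinfo=timezone.utc)),
    MemoryNote("deploy_target", "production",
               datetime(2026, 6, 2, 8, 0, tzinfo=timezone(timedelta(hours=2)))),
    MemoryNote("owner", "platform team",
               datetime(2026, 6, 1, 12, 0, tzinfo=timezone.utc)),
]
memory = consolidate(session_log)
print(memory["deploy_target"].value)
```

A real system would also need compression and conflict resolution that goes beyond "newest wins," but even this small pass removes the stale "staging" note that would otherwise compete with the current fact.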

3. Claude Mythos and Project Glasswing: The Thin Line Between Shield and Sword

Around Claude Mythos, a narrative has emerged that captures the current state of the industry well: the most advanced model capabilities in cybersecurity are both the most desirable and the most dangerous.

If a model can detect critical vulnerabilities faster, it can also support offensive activity faster. That is why controlled-access programs such as Project Glasswing are used instead of fully open public availability.

In reports and media coverage (including the Financial Times), one narrative appeared repeatedly: that Mythos was being used in offensive-style scenarios, and that full access was offered not to the broader market but only to a limited group under Project Glasswing. These were media reports, not publicly released technical documentation.

There were also reports of such capabilities being used by state actors, including threads related to cooperation with U.S. security institutions. To me, this signals that around frontier models, a real cyber arms race is beginning: the same capabilities can strengthen defense, but they can also shorten the path to attack.

This imposes a new architecture principle: access to models with very strong cybersecurity capabilities should be managed like privileged access, not like just another integration quickly dropped into the backlog.
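As a sketch of what "managed like privileged access" could mean in code, the gate below requires both a role and an approval record before a request reaches a high-risk model. The tier names, the role name, and the ticket field are hypothetical and only illustrate the pattern.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AccessRequest:
    user: str
    roles: frozenset
    model_tier: str                         # hypothetical: "standard" or "cyber-capable"
    approved_ticket: Optional[str] = None   # reference to an approval record


PRIVILEGED_TIERS = {"cyber-capable"}    # tiers that need extra checks (assumed)
REQUIRED_ROLE = "security-engineer"     # role name is illustrative


def authorize(request: AccessRequest) -> tuple:
    """Return (allowed, reason). Privileged tiers need a role AND an approval."""
    if request.model_tier not in PRIVILEGED_TIERS:
        return True, "standard tier, no extra checks"
    if REQUIRED_ROLE not in request.roles:
        return False, "missing required role"
    if not request.approved_ticket:
        return False, "no approval record attached"
    return True, "privileged access granted"


request = AccessRequest(
    user="alice",
    roles=frozenset({"security-engineer"}),
    model_tier="cyber-capable",
)
print(authorize(request))
```

The request above is denied because no approval is attached, even though the role matches. Returning a reason alongside the decision makes every denial auditable, which is the point of treating the model like a privileged system.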

4. White Papers That Actually Change Practice

In recent weeks, several papers have appeared that have direct consequences for engineers building AI systems.

| Publication | What It Adds | Why It Matters |
| --- | --- | --- |
| Asymmetric Virtual Memory Paging for Hybrid Mamba-Transformer Inference | Separates KV and SSM memory at the paging layer | Less OOM, higher throughput on real hardware |
| Task Structure Reverses Layerwise State Encoding in Sequence Models | Shows task structure can reverse state-encoding profiles | Benchmarks without task context can mislead |
| Measuring Progress Toward AGI: A Cognitive Taxonomy | Cognitive framework for measuring progress | Better language to compare capabilities and risks |
| AI Infrastructure in the Age of Sovereignty... | Infrastructure-sovereignty framework | Compute strategy is now part of geopolitics |

AVMP is especially important operationally: hybrid Mamba+Transformer architectures have different memory profiles, so treating them the same way in runtime systems wastes resources. In short, Mamba is an alternative to classical attention, designed to handle long sequences more efficiently in memory terms. Classical attention, used in Transformers, compares elements across the whole context at every step to decide what the model should focus on.

In practice, the difference is straightforward: classical attention needs more and more memory as context length grows, while Mamba state remains much more predictable and does not grow in the same way.

So the longer the document, conversation, or agent work history, the more the Transformer memory footprint balloons, while Mamba stays more stable. That is exactly why throwing both mechanisms into one bucket leads to memory waste and performance loss.

Without asymmetric memory management, we pay for unnecessary padding and lose throughput.
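The growth difference described above can be made concrete with a little arithmetic. The sketch below uses the standard KV-cache size formula for attention and a fixed-size recurrent state for a Mamba-style layer. All dimensions are illustrative assumptions, not figures from the AVMP paper.

```python
# Memory per sequence: attention KV cache vs. a fixed-size SSM state.
# Assumption: all layer counts and dimensions below are made up for illustration.

def kv_cache_bytes(seq_len: int, n_layers: int = 32, n_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_elem: int = 2) -> int:
    """Keys and values (factor 2) are stored for every token in every layer."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


def ssm_state_bytes(n_layers: int = 32, d_inner: int = 8192,
                    d_state: int = 16, bytes_per_elem: int = 2) -> int:
    """One fixed-size recurrent state per layer, independent of sequence length."""
    return n_layers * d_inner * d_state * bytes_per_elem


MIB = 2 ** 20
for seq_len in (1_000, 100_000, 1_000_000):
    kv = kv_cache_bytes(seq_len) / MIB
    ssm = ssm_state_bytes() / MIB
    print(f"{seq_len:>9} tokens: KV cache {kv:,.0f} MiB, SSM state {ssm:,.0f} MiB")
```

With these assumed dimensions, the KV cache costs 128 KiB per token, so it matches the constant 8 MiB SSM state after only 64 tokens and keeps growing linearly from there. A runtime that allocates both kinds of memory with the same page size and growth policy is therefore planning for a pattern that only one of them follows.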

5. Regulation and Energy: Two "Invisible" AI Bottlenecks

The regulation debate is no longer abstract legal theory. Government decisions are beginning to affect how and when models reach production.

On June 2, 2026, the U.S. President signed an executive order requiring cyber-risk assessment for covered frontier models. In practice, this was widely interpreted as a requirement, or at least an expectation, that the government should receive access to new models about 30 days before public release. OpenAI publicly declared its readiness for such an early-access process for the U.S. administration. At the same time, the industry is pushing hard for federal alignment, including legislative proposals such as the "Great American AI Act," which would temporarily reduce divergence between strict state-level regulations (for example, California and Colorado) and federal rules.

At the same time, something even more down-to-earth is happening: the fight for energy. As data center load grows, power availability and electricity pricing are becoming a hard limit on development.

The Arizona example is particularly telling: the local utility, APS, filed for a significant increase in charges for the most energy-intensive AI data center workloads (public discussion often cited a scale around 45%). If AI workload energy consumption keeps growing at double-digit year-over-year rates (some analyses point to around 15%), part of new data center investment may shift to other states.
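To see how the two figures above compound, here is a small worked example. It takes the approximate numbers cited in public discussion at face value and assumes the higher tariff applies once and then stays flat. That is a simplification, not a forecast.

```python
# Relative energy bill under a one-time tariff increase and compounding usage.
# Assumption: both percentages are the approximate figures cited in the article.

RATE_INCREASE = 0.45    # approximate tariff increase for heavy AI workloads
ANNUAL_GROWTH = 0.15    # approximate yearly growth in AI workload energy use


def relative_energy_bill(years: int) -> float:
    """Bill relative to today's (1.0): consumption compounds annually,
    the higher tariff is applied once and then held constant."""
    consumption = (1 + ANNUAL_GROWTH) ** years
    return consumption * (1 + RATE_INCREASE)


for years in range(4):
    print(f"year {years}: {relative_energy_bill(years):.2f}x today's bill")
```

Under these assumptions, the bill is already 1.45 times higher on day one and roughly 2.2 times higher after three years, which is the kind of gap that makes relocation part of the investment discussion.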

Context source: AI news analysis (June 5, 2026).

This is where three worlds meet: software, policy, and physics. The best model, without power and without regulatory clearance, remains a slide in a deck.

6. What This Means for AI Product Architecture

The next few quarters will reward not the "loudest model," but the best execution architecture. My view has been the same from the start: different models perform different tasks differently, so they must be selected by role, cost, and context.

A hybrid setup looks more realistic: a frontier model handles planning, harder decisions, and orchestration, while smaller models, often local as well, handle execution where cost, privacy, or response speed matters most.

In practice, I organize this across three axes:

  1. Model ops and orchestration.

Separate model roles: one handles planning and quality control, others handle specialized execution. That lowers cost and improves stability. If the task is well decomposed, a local or smaller model often does not need to "be exceptional" - it just needs to execute a specific step sensibly and cheaply.
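A minimal routing sketch shows how role-based selection could work. The model names, prices, and difficulty scale are made up for illustration. The only idea taken from the text is "pick the cheapest model that is good enough and respects the privacy constraint."

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    name: str
    cost_per_1k_tokens: float   # illustrative prices, not real quotes
    runs_locally: bool
    max_difficulty: int         # hardest task level (1-5) this model is trusted with


# Hypothetical model catalog.
MODELS = [
    ModelProfile("local-small", 0.0, True, 2),
    ModelProfile("hosted-mid", 0.002, False, 3),
    ModelProfile("frontier", 0.03, False, 5),
]


def pick_model(difficulty: int, needs_privacy: bool) -> ModelProfile:
    """Choose the cheapest model that can handle the task; private tasks
    are restricted to models that run locally."""
    candidates = [
        m for m in MODELS
        if m.max_difficulty >= difficulty and (m.runs_locally or not needs_privacy)
    ]
    if not candidates:
        raise ValueError("no model satisfies the constraints")
    return min(candidates, key=lambda m: m.cost_per_1k_tokens)


print(pick_model(difficulty=1, needs_privacy=True).name)    # simple private step
print(pick_model(difficulty=5, needs_privacy=False).name)   # planning / hard decision
```

Raising an error when nothing qualifies is a deliberate choice in this sketch: a hard, private task should surface as a design problem rather than silently leak to a hosted model.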

  2. Memory ops.

Introduce memory policies: what is durable, what is working memory, what gets consolidated, and when. Without this, agents degrade over time.
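Memory policies can start as a small, explicit table rather than implicit behavior. In this sketch, the categories, retention times, and consolidation intervals are hypothetical. The point is only that "durable vs. working" and "when to consolidate" become declared settings.

```python
from dataclasses import dataclass
from datetime import timedelta
from enum import Enum
from typing import Optional


class Tier(Enum):
    WORKING = "working"   # short-lived context for the current task
    DURABLE = "durable"   # knowledge that should survive across sessions


@dataclass(frozen=True)
class MemoryPolicy:
    tier: Tier
    ttl: Optional[timedelta]       # None means keep until explicitly replaced
    consolidate_every: timedelta   # how often the consolidation pass runs


# Hypothetical policy table; categories and durations are illustrative.
POLICIES = {
    "scratch_notes": MemoryPolicy(Tier.WORKING, timedelta(hours=4), timedelta(hours=1)),
    "customer_facts": MemoryPolicy(Tier.DURABLE, None, timedelta(days=1)),
}


def is_expired(kind: str, age: timedelta) -> bool:
    """True when an entry of this kind has outlived its time-to-live."""
    policy = POLICIES[kind]
    return policy.ttl is not None and age > policy.ttl


print(is_expired("scratch_notes", timedelta(hours=5)))
print(is_expired("customer_facts", timedelta(days=365)))
```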

  3. Infra + FinOps + Energy awareness.

Measure not only response quality, but also token cost, inference cost, and energy cost as one decision system.
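One way to treat these costs as a single decision system is to record them together for every run. The field names and the "quality per dollar" metric below are my assumptions for illustration. The prices are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunMetrics:
    quality: float               # 0..1 score from an evaluation suite (assumed)
    tokens: int                  # tokens consumed by the run
    price_per_1k_tokens: float   # API price, in dollars
    compute_cost: float          # amortized inference hardware/hosting, in dollars
    energy_kwh: float            # measured or estimated energy use
    price_per_kwh: float         # electricity price, in dollars

    @property
    def total_cost(self) -> float:
        """Token, inference, and energy cost added into one number."""
        token_cost = self.tokens / 1000 * self.price_per_1k_tokens
        energy_cost = self.energy_kwh * self.price_per_kwh
        return token_cost + self.compute_cost + energy_cost

    @property
    def quality_per_dollar(self) -> float:
        """Simple comparison metric; infinite when the run cost nothing."""
        cost = self.total_cost
        return float("inf") if cost == 0 else self.quality / cost


run = RunMetrics(quality=0.9, tokens=2000, price_per_1k_tokens=0.01,
                 compute_cost=0.03, energy_kwh=0.5, price_per_kwh=0.1)
print(f"total cost: ${run.total_cost:.2f}")
print(f"quality per dollar: {run.quality_per_dollar:.1f}")
```

Comparing two model setups on this one metric, rather than on quality alone, is what turns the hybrid architecture from a preference into a measurable decision.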

This is exactly why local models are no longer just a hobby for home-lab enthusiasts. Rising frontier-model access prices, growth in NPU-enabled devices, and new platforms like RTX Spark are jointly pushing the market toward practical hybrid architectures. This is not about "the end of the cloud." It is about not sending everything there simply because, until now, there was no sensible alternative.

Summary

To me, this is the beginning of an era in which AI is simultaneously application technology, critical infrastructure, and a national-security topic. That is why real advantage no longer emerges at the level of a single prompt.

Model families, memory layers, regulation, and energy are starting to form one connected system. Whoever can design and keep that system in balance will win the next phase of AI.
