Running AI models is turning into a memory game

When we talk about the cost of AI infrastructure, the focus is usually on Nvidia and GPUs, but memory is an increasingly important part of the picture. As hyperscalers prepare to build billions of dollars' worth of new data centers, the price of DRAM chips has jumped roughly seven times in the last year. At the same time, there is a growing discipline in orchestrating all that memory to make sure the right data gets to the right agent at the right time. The companies that master it will be able to run the same queries with fewer tokens, which can be the difference between folding and staying in business.

Semiconductor analyst Doug O’Laughlin has an interesting look at the importance of memory chips on his Substack, where he talks with Val Bercovici, chief AI officer at Weka. They are both semiconductor experts, so the focus is more on the chips themselves than the broader architecture, but the implications for AI software are significant too.

I was particularly struck by a passage in which Bercovici looks at the growing complexity of Anthropic’s prompt-caching documentation. He notes that the pricing page started as a simple recommendation six or seven months ago but has since grown into an encyclopedia of advice on exactly how many cache writes to pre-buy. There is a five-minute tier, which is very common across the industry, and a one-hour tier, with nothing above that, a detail he flags as really important. Then there are arbitrage opportunities in the pricing of cache reads, depending on how many cache writes you have pre-purchased.
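The arbitrage point is easier to see with numbers. Here is a back-of-the-envelope sketch; the multipliers are assumptions based on Anthropic’s published pricing at the time of writing (a five-minute cache write at roughly 1.25 times the base input rate, an hour-long write at roughly 2 times, and a cache read at roughly a tenth), and the $3-per-million-token base rate is just an example.

```python
# Back-of-the-envelope cache arbitrage. The multipliers and base rate below
# are illustrative assumptions, not official prices, and the math assumes
# every call lands inside the cache's TTL window.

BASE = 3.00 / 1_000_000          # dollars per input token (example rate)
WRITE_5M, WRITE_1H, READ = 1.25, 2.00, 0.10   # assumed pricing multipliers

def cost(prefix_tokens: int, calls: int, write_mult: float) -> float:
    """Cost of reusing a cached prompt prefix across `calls` requests."""
    first = prefix_tokens * BASE * write_mult          # one cache write
    rest = prefix_tokens * BASE * READ * (calls - 1)   # cache reads afterward
    return first + rest

prefix, calls = 10_000, 20
print(f"uncached  : ${prefix * BASE * calls:.4f}")
print(f"5-min TTL : ${cost(prefix, calls, WRITE_5M):.4f}")
print(f"1-hour TTL: ${cost(prefix, calls, WRITE_1H):.4f}")
```

Under those assumptions, re-sending a 10,000-token prefix 20 times costs around $0.60 uncached versus roughly $0.09 to $0.12 with caching, which is the kind of gap that makes the tier choice worth agonizing over.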

The question is how long Claude holds your prompt in cached memory. You can pay for a five-minute window, or pay more for an hour-long window. It is much cheaper to draw on data that is still in the cache, so if you manage it right, you can save an awful lot. There is a catch, though: every new bit of data you add to the query may bump something else out of the cache window.
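For readers who want to see what this looks like in practice, here is a minimal sketch using the Anthropic Python SDK’s documented prompt-caching interface. The model name and the placeholder document are stand-ins, and per the docs the hour-long tier is selected by adding a ttl field to cache_control.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# A long, stable prompt prefix you expect to reuse (hypothetical placeholder).
LONG_REFERENCE_DOCUMENT = "... several thousand tokens of reference text ..."

response = client.messages.create(
    model="claude-sonnet-4-5",  # assumed model name
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_REFERENCE_DOCUMENT,
            # Marks the end of the cacheable prefix. The default tier is the
            # five-minute one; per the docs, the hour-long tier adds "ttl": "1h".
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Summarize section 3."}],
)

# The usage block reports whether this call wrote to the cache or read from it,
# which is what you would watch when tuning a caching strategy.
print(response.usage.cache_creation_input_tokens,
      response.usage.cache_read_input_tokens)
```

If the prefix is managed well, every call after the first within the window shows up as cache reads rather than fresh writes, which is where the savings come from.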

This is complex stuff, but the upshot is simple enough: managing memory is going to be a huge part of running AI models going forward, and the companies that do it well are going to rise to the top.

There is plenty of progress still to be made in this new field. Back in October, I covered a startup called TensorMesh that is working on one layer of the stack, known as cache optimization.

Opportunities exist in other parts of the stack, too. Lower down, there is the question of how data centers use the different types of memory they have; the interview includes a discussion of when DRAM chips are used instead of HBM, though it gets pretty deep into the hardware weeds. Higher up, end users are figuring out how to structure their model swarms to take advantage of a shared cache.
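As a rough illustration of that last point, the sketch below shows one way a swarm could be structured so every agent shares the same cached prefix: the shared system prompt is identical across agents and marked as the cache breakpoint, while agent-specific instructions come after it. The agent roles, prompt text, and helper function are hypothetical.

```python
# One way to structure a swarm so agents share a prompt cache: the shared
# system prompt comes first and carries the cache breakpoint, so every agent's
# request within the TTL window reuses the same cached prefix. Each dict would
# be passed as keyword arguments to client.messages.create(...).

SHARED_SYSTEM = "You are part of a research swarm. ... (long shared context)"

def build_request(agent_instructions: str, user_task: str) -> dict:
    return {
        "model": "claude-sonnet-4-5",  # assumed model name
        "max_tokens": 1024,
        "system": [
            # Identical across agents, so after the first call this prefix is
            # a cheap cache read rather than a fresh cache write.
            {"type": "text", "text": SHARED_SYSTEM,
             "cache_control": {"type": "ephemeral"}},
            # Agent-specific text sits after the cached breakpoint.
            {"type": "text", "text": agent_instructions},
        ],
        "messages": [{"role": "user", "content": user_task}],
    }

requests = [
    build_request("You handle literature search.", "Find recent work on KV caches."),
    build_request("You handle summarization.", "Summarize the findings so far."),
]
```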

As companies get better at memory orchestration, they will use fewer tokens and inference will get cheaper. Meanwhile, models are getting more efficient at processing each token, pushing the cost down still further. As server costs drop, a lot of applications that do not seem viable now will start to edge into profitability.