I’ve read some of Ed Zitron’s long posts on why the AI industry is a bubble that will never be profitable (and will bring down a lot of companies and investors), and one of the recurring themes is that the AI companies are trying to capture growing market share in an industry where their marginal profits are still negative, and that any increase in revenue necessarily increases their costs of providing their services.
But some of the comments in various Hacker News threads are dismissive, saying that each new generation of models makes the cost of inference lower, so that with sufficient customer volume, the companies running the models can make enough profit on inference to make up for the staggering up-front capital expenditures it took to build out the data centers, train their models, etc.
It’s all pretty confusing to me. So for those of you who are familiar with the industry, I have several questions:
- Is the cost of running any given pretrained model going down, for specific models? Are there hardware and software improvements that make it cheaper to run those models, despite the model itself not changing?
- Is the cost of performing a particular task at a particular quality level going down, through releases of newer models of similar performance (i.e., a smaller model of the current generation performing similarly to a bigger model of the previous generation, such that the cost is now cheaper)?
- Is the cost of running the largest flagship frontier models going down for any given task? Or does running the cutting edge show-off tasks keep increasing in cost, but where the companies argue that the improvement in performance is worth the cost increase?
I suspect the discussion around this is so muddled online because the answers differ depending on which of the 3 questions is meant by “is running an AI model getting cheaper over time?” And the data isn’t easy to synthesize because each model has different token prices and uses a different number of tokens per query.
But I wanted to hear from people who are knowledgeable about these topics.
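To make the point about token prices versus tokens per query concrete, here is a rough sketch of the arithmetic. All the prices and token counts are made-up placeholders, not real figures for any model, and `cost_per_task` is just a helper name I picked for illustration.

```python
def cost_per_task(input_tokens, output_tokens, in_price_per_mtok, out_price_per_mtok):
    """Dollar cost of one query, given token counts and prices per million tokens."""
    total = input_tokens * in_price_per_mtok + output_tokens * out_price_per_mtok
    return total / 1_000_000


# Hypothetical "older" model: pricier per token, but gives a short answer.
cost_a = cost_per_task(
    input_tokens=2_000, output_tokens=500,
    in_price_per_mtok=3.0, out_price_per_mtok=15.0,
)

# Hypothetical "newer" model: a third of the per-token price, but it emits
# many more output tokens (e.g. long reasoning) to do the same task.
cost_b = cost_per_task(
    input_tokens=2_000, output_tokens=8_000,
    in_price_per_mtok=1.0, out_price_per_mtok=5.0,
)

print(f"model A: ${cost_a:.4f} per task")
print(f"model B: ${cost_b:.4f} per task")
```

With these invented numbers, the model that is cheaper per token ends up costing more per task, which is why "token prices are falling" and "tasks are getting cheaper" are not the same claim.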


That all makes sense to me, and lines up with what I’ve been reading too. I saw the model download and I was like “guhhhh” to it because I was also excited to try it on my 3090. I’ll be waiting for the quants.
Yeah I like the end there too, that OpenAI / Anthropic have been desperately trying to figure out how to do this, and a few guys with limited hardware did it. When you have unlimited resources, you end up needing unlimited resources. When you only have 300 GPUs, you make it work. It’s why tech is littered with people starting in garages: they found a way to make it work.
And to be clear, you need a 3090 + at least 96GB of fast CPU RAM (really 128GB) to run DeepSeek Flash coherently. It is a big model; there’s no way around it.
If you have less RAM, try Qwen 27B now (which also uses an exotic attention mechanism). It’ll fit on your 3090 just fine.
For DeepSeek Pro, you’d need a Xeon or EPYC homelab.
I view it differently.
In the US, there are either megacorps, or “people in garages” who honestly don’t have the resources and stuff like legal support to do huge innovations. They publish cool papers, which never get implemented because they don’t have $200k+ for a bigger test, and can’t work on it themselves for a living. Any “garage devs” who get too big get smited or amalgamated into Big Tech gray goo, and whatever was interesting gets lost in oblivion.
There’s no cooperation, no sharing, either.
And OpenAI/Anthropic are way more conservative than you’d think. Same with Meta; they want results next quarter. Zuckerberg literally fired the whole Llama team, which put Meta on the AI map and basically founded the open weights space, when they had one failed experiment. In other words, I’d argue clueless billionaires and the Tech Bro acolytes surrounding them are poisoning LLM development, and it’s starting to catch up.
In China, things are different. The GPU sanctions forced these gigantic companies like Alibaba or Tencent to be compute-thrifty, but they all seem to have access to suspiciously good training data… I would bet the Chinese govt is helping them under the table. Chinese devs also have an interesting attitude; I would characterize them as “cooperative,” with lots of private forum sharing going on, most models being open-weights, and clearly not a lot of desire to censor their models for the government. But they have their own forms of dysfunction too, sometimes by copying other firms a little too closely, or corporate/personal drama like anywhere.