I’ve read some of Ed Zitron’s long posts on why the AI industry is a bubble that will never be profitable (and will bring down a lot of companies and investors), and one of the recurring themes is that the AI companies are trying to capture growing market share in an industry where their marginal profits are still negative, and that any increase in revenue necessarily increases their costs of providing their services.

But some of the comments in various HackerNews threads are dismissive, saying that each new generation of models makes the cost of inference lower, so that with sufficient customer volume, the companies running the models can make enough profit on inference to make up for the staggering up-front capital expenditures it took to build out the data centers, train their models, etc.

It’s all pretty confusing to me. So for those of you who are familiar with the industry, I have several questions:

  1. Is the cost of running any given pretrained model going down, for specific models? Are there hardware and software improvements that make it cheaper to run those models, despite the model itself not changing?
  2. Is the cost of performing a particular task at a particular quality level going down, through releases of newer models of similar performance (i.e., a smaller model of the current generation performing similarly to a bigger model of the previous generation, such that the cost is now cheaper)?
  3. Is the cost of running the largest flagship frontier models going down for any given task? Or does the cost of running the cutting-edge show-off tasks keep increasing, with the companies arguing that the improvement in performance is worth the added cost?

I suspect the reason the discussion around this is so muddled online is that the answers differ depending on which of the 3 questions is meant by “is running an AI model getting cheaper over time?” And the data isn’t easy to synthesize, because each model has different token prices and a different number of tokens per query.

But I wanted to hear from people who are knowledgeable about these topics.

  • brucethemoose@lemmy.world · 4 points · 15 hours ago

    Yes.

    It’s dropping, dramatically.

    Look at the history of open and closed releases, on benchmarks that aren’t totally gamed, and it’s easy to see. LLM capabilities are plateauing, and bigger models are getting more and more niche.

    But inference efficiency is increasing exponentially. Tiny models are getting closer and closer to frontier ones. See: Qwen 27B, and how it can do most of what mega models did just months ago.

    And there’s tons of unpicked efficiency fruit in papers. Bitnet is the big one, but I’ve seen dozens of proofs of concept, yet to be tried in a production model, that promise dramatic efficiency boosts.
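
    For a feel of what that looks like in practice, here’s a toy sketch of the absmean weight rounding from the BitNet b1.58 paper. Real Bitnet models are trained with this step in the loop rather than converted after the fact, and the matrix below is just random placeholder data:

      import numpy as np

      # Toy BitNet-b1.58-style "absmean" quantization: every weight becomes
      # -1, 0, or +1 plus a single per-tensor scale (~1.58 bits of info each).
      def ternary_quantize(w):
          scale = np.abs(w).mean() + 1e-8
          w_q = np.clip(np.round(w / scale), -1, 1).astype(np.int8)
          return w_q, scale

      def dequantize(w_q, scale):
          return w_q.astype(np.float32) * scale

      w = np.random.randn(4, 8).astype(np.float32)   # stand-in weight matrix
      w_q, s = ternary_quantize(w)
      print(w_q)
      print("mean reconstruction error:", np.abs(w - dequantize(w_q, s)).mean())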

  • General_Effort@lemmy.world · 3 points · 15 hours ago

    1. The cost of hardware has always gone down. If it seems different in the last few years, that’s because demand has gone up and people are outbidding each other for the same scarce resource. But the fundamentals have not changed. Once production catches up, prices will return to the previous trajectory.

    2. Going down. At the same size, models are more capable. This is both because training data is being incrementally refined, and because better methods are thought up.

    3. Going up. Bigger models are better, but more expensive to run. They also require more data to train. The trend towards bigger models will continue for the foreseeable future. How quickly models can be scaled up is limited by logistics (data centers and training data).

    • GamingChairModel@lemmy.worldOP · 2 points · 11 hours ago

      The cost of hardware has always gone down

      It seems like performance per unit cost necessarily splits into two parts: performance per dollar of what the actual computer hardware costs, and performance per watt-hour of ongoing energy consumption. Obviously each generation of chips shows exponential growth in performance, but how much is that advancement offset by increased power consumption and increased price of the chip itself?

      That’s what I mean. I’m interested in seeing actual numbers, like seeing how costs differ in specific model/hardware combinations, with certain assumptions on the price of electricity and maybe interest rates or amortization schedules.
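
      For what it’s worth, this is the shape of the arithmetic I have in mind. Every number below is a placeholder assumption, not a real figure from any vendor:

        # Rough cost-per-million-output-tokens estimate for one GPU server.
        # All figures are placeholder assumptions, not measured numbers.
        hardware_cost_usd = 30_000        # assumed server purchase price
        amortization_years = 4            # assumed useful life
        interest_rate = 0.08              # assumed annual cost of capital
        power_draw_kw = 1.2               # assumed average draw under load
        electricity_usd_per_kwh = 0.10    # assumed electricity price
        utilization = 0.6                 # fraction of hours serving traffic
        tokens_per_second = 2_000         # assumed batched throughput

        hours_per_year = 24 * 365
        annual_hardware_cost = (hardware_cost_usd / amortization_years
                                + hardware_cost_usd * interest_rate / 2)
        annual_power_cost = power_draw_kw * hours_per_year * electricity_usd_per_kwh
        annual_cost = annual_hardware_cost + annual_power_cost

        tokens_per_year = tokens_per_second * 3600 * hours_per_year * utilization
        print(f"~${annual_cost / (tokens_per_year / 1e6):.2f} per million tokens")

      Plug in different model/hardware throughput numbers and electricity prices and the answer moves a lot, which is exactly why I want real data points.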

      • General_Effort@lemmy.world · 1 point · 6 hours ago

        how much is that advancement offset by increased power consumption and increased price of the chip itself?

        It’s not. Performance per watt or per dollar increases exponentially. It’s not subtle.

        That’s what I mean. I’m interested in seeing actual numbers, like seeing how costs differ in specific model/hardware combinations, with certain assumptions on the price of electricity and maybe interest rates or amortization schedules.

        What is it that you want to work out? There are various price comparisons out there.

  • humanspiral@lemmy.ca · 2 points · 14 hours ago

    new generation of models makes the cost of inference lower, so that with sufficient customer volume, the companies running the models can make enough profit on inference to make up for the staggering up-front capital expenditures

    Cost per quality is definitely going down at a fast rate. LLM providers are in an extremely competitive field, where open-weight models have a huge competitive advantage at any quality level (privacy, customizability). The competition runs on 2-month release cycles that essentially throw away the old version/code/weights each time. When Claude pretends its newest model is too powerful for non-oligarchs to use, it limits its token reach, and therefore raises the contribution margin it needs per token.

    The business model flaw is “one day, a winner becomes a monopoly, and AGI self-improves the model at low cost (except for ultra-expensive compute).” Monopoly pricing power is very hard/impossible to achieve, because, if necessary, foreign governments will subsidize competition rather than let a hostile US-empire AGI monopolist take hold. Due to a corrupt energy oligarchy, it is categorically impossible for US-hosted services to ever provide competitive value compared to services hosted under rational energy policies outside the US. Distillation (teacher/student RL) means that using another AGI (or leading LLM) will improve models that are behind. There will always be competition on the price/quality curve that prevents even the best/most expensive model from capturing all market share. There’s always free-tier LLM competition available as well.

    Finally, there are layers above LLMs. Agentic, swarm, and “deterministic program access”/validation front ends to LLMs can add various levels of token burn, but they can also divert most tokens away from the expensive LLMs and iteratively improve output. There isn’t just a cost/quality curve; there is a cost/speed/quality/privacy curve, and non-AI coordination tools can improve points on that larger curve independently of leading/expensive LLM/AGI quality.
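
    As a toy illustration of that diversion layer (the “models” below are stand-in functions, not any real API):

      # Toy tiered routing: try a cheap model first, escalate to the expensive
      # one only when a deterministic validation step fails.
      def cheap_model(prompt):
          return "def add(a, b): return a - b"    # pretend the small model slips up

      def frontier_model(prompt):
          return "def add(a, b): return a + b"    # pretend the big model gets it right

      def validate(code):
          scope = {}
          exec(code, scope)                       # deterministic check: run the code
          return scope["add"](2, 3) == 5          # and test it against a known case

      def answer(prompt):
          draft = cheap_model(prompt)             # most tokens go to the cheap tier
          return draft if validate(draft) else frontier_model(prompt)

      print(answer("Write an add(a, b) function."))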

    • GamingChairModel@lemmy.worldOP · 4 points · 17 hours ago

      Thanks, this is great.

      I’ve found that very few people are asking this question. And when I ask people what they think is happening to these costs over time, their opinions vary wildly.

      I’m similarly baffled. How is it that this industry generates so much discussion, yet almost nobody is asking this fundamental question or building an ecosystem of comments, critiques, and corrections around such analyses?

    • MagicShel@lemmy.zip · 1 point · 18 hours ago

      It means that capability growth is going to slow and require more creative ways to improve than just more tokens and more compute. I’ve seen some research suggesting we can create chips that are 10x as efficient, but they can only run a single model and aren’t upgradable. If a model stays viable for a number of years because genuine improvement is so slow, the math starts to make sense there.

      It’s going to be a good thing when we are forced to start looking at more creative ways of improvement.

  • General_Effort@lemmy.world · 1 point · 15 hours ago

    FYI: Ed Zitron is a PR expert. He has no background in engineering or finance.

    He has the skills to make people listen to him and give him money. He does not have the skills to determine whether any of his assertions are true or not. If you’re wondering if I’m calling him a liar, then I can only say that I can’t read minds. If you’re not wondering, then you weren’t paying attention.

    • GamingChairModel@lemmy.worldOP · 1 point · 14 hours ago

      FYI: Ed Zitron is a PR expert. He has no background in engineering or finance.

      I’m not super interested in people’s credentials (good or bad). I need the actual substance of the words on the page to be well supported and well reasoned.

      Zitron does the work of actually gathering the public statements (across SEC filings, public disclosures, public or leaked documents) and crafting a narrative around those statements. He links to original source documents a lot. Other people should be doing the same, but for whatever reason not a lot of other people are.

      He needs an editor. His articles could be better organized, more tightly argued, and more focused in scope.

      I have some skepticism about many of the extrapolations that he makes from the facts, but on my read, his factual claims are mostly well supported. When he calls other people liars by laying their contradictions out in the open, I think those arguments stand on their own regardless of his background, credentials, or even motivations. So I draw a line between his factual claims about the past and present and his predictions about the future.

      And that’s the reason for this thread. He makes factual claims about the exponential rise in costs for these companies and infers/extrapolates from them into the future, but I want to check whether those extrapolations actually fit the data we can already see. That’s what I’m trying to learn by asking here.

      Of course, if you have specific examples of him making false statements about the past or present (no need to attribute intentionality to the speaker), I’d love to see those, too.

        • GamingChairModel@lemmy.worldOP · 1 point · 11 hours ago

          On some issues, absolutely.

          He flagged the issue with flat-rate subscriptions not making any sense given the underlying token pricing and user usage patterns, and predicted that a lot of the AI startups that act as some kind of subscription middleman would feel the squeeze and eventually impose rate limits/quotas, degrade the quality of their offerings (i.e., push users towards cheaper models), or fail. I think that’s a pretty good summary of what has been happening at the user/pricing level with Perplexity, Lovable, and Cursor. Microsoft’s Copilot plans are also seeing a lot of changes to pricing, rate limits, and model choice, and user complaints about them have gotten louder in the past month or two.
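
          The back-of-the-envelope version of that flat-rate problem looks something like this, with every number invented purely for illustration:

            # Contribution margin for a flat-rate subscription reselling metered
            # API tokens. All numbers are made-up assumptions.
            subscription_usd_per_month = 20.0
            api_cost_usd_per_million_tokens = 10.0   # assumed blended upstream price

            def monthly_margin(tokens_used_millions):
                return (subscription_usd_per_month
                        - tokens_used_millions * api_cost_usd_per_million_tokens)

            for usage in (0.5, 2.0, 5.0):            # light, medium, heavy users
                print(f"{usage} M tokens/month -> margin ${monthly_margin(usage):.2f}")
            # Heavy users get served at a loss, hence rate limits, quotas, and
            # quiet downgrades to cheaper models.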

          He was a skeptic on Stargate right out of the gate, and I think the external visibility into how that loose collection of projects has been going over the past year shows that something inside is fundamentally wrong. That isn’t necessarily an indictment of the broader AI ecosystem as a whole, but Zitron’s most pointed financial criticism has been directed at OpenAI and Oracle, and at the costs of data center construction. Those criticisms have looked especially prescient this calendar year (and generally fit my preconceived notions that building physical stuff is slow and expensive, and that we Americans aren’t very good at keeping megaprojects on schedule and under budget).

          I’m a money guy. I don’t have any special expertise in industry trends or in how money will be spent in the future in industries where I’m not an insider (i.e., AI), but I find Zitron’s accounting of how money is being spent in the present to be largely accurate. So that’s why I’m in this thread asking people how they see the present and future of spending/pricing/volume, to check whether the revenue projections these companies need to hit are actually feasible.

  • Scrubbles@poptalk.scrubbles.tech · 17 points · 1 day ago

    I think we’re seeing a lot of optimization right now. The most exciting one I’ve seen is TurboQuant. Short version: every message you send to a model carries context, i.e. the entire conversation you’ve had, instructions, skills, everything. That context eats a huge and fast-growing amount of RAM, and it’s a big part of what’s driving the VRAM/RAM shortage. TurboQuant (and other copycats now) claims it can cut the context’s VRAM usage by 20x. That’s absolutely huge; that’s potentially 1M-context models running on consumer hardware.
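
    For a rough sense of scale, the context (“KV cache”) memory of a plain transformer can be estimated like this. The architecture numbers are placeholders, not any particular model’s config:

      # Rough KV-cache size estimate for a dense transformer. Architecture
      # numbers below are placeholders, not a real model's config.
      def kv_cache_bytes(layers, kv_heads, head_dim, context_len, bytes_per_elem):
          # 2 tensors (keys and values) per layer, each [context_len, kv_heads, head_dim]
          return 2 * layers * kv_heads * head_dim * context_len * bytes_per_elem

      gib = 1024 ** 3
      for ctx in (8_192, 131_072, 1_000_000):
          fp16 = kv_cache_bytes(60, 8, 128, ctx, 2)
          print(f"{ctx:>9} tokens: {fp16 / gib:6.2f} GiB fp16, {fp16 / 4 / gib:6.2f} GiB at ~4-bit")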

    Deepseek v4 also makes some large claims, saying they have a model that does better than Anthropic’s or OpenAI’s while being 1/10th the size. That would also be a huge reduction in compute and VRAM, but I’ll be looking for the proof.

    We’ve seen other improvements too, in how models are run and how quickly results are streamed, but to me TurboQuant is the most exciting.

    I think it’s good that they’re finally looking at optimization. Yes, their cost has been power and compute. NVidia is more than happy to keep things inefficient because they sell GPUs that way. Software companies are doing the opposite now, reducing the compute overhead to start saving them money, which they desperately need to do if this is going to continue. New technology has always been horribly inefficient; it’s only once more people see it that it starts to get optimized.

    I think this is what is going to be required to finally push past the horribleness of AI companies, and they need to do this quickly.

    • brucethemoose@lemmy.world · 3 points · 15 hours ago

      Also, on Deepseek V4… you can run it yourself, for free. There’s no mystery. And there’s tons of benchmarks out there already.

      It’s indeed very efficient if you’re into long context. But at shorter context lengths, it’s not too different from Deepseek’s previous releases (and the flood of MoE models that have come since then).

        • brucethemoose@lemmy.world · 2 points · 15 hours ago

          Okay, I fudged the part about “for free.” The problem is DeepSeekv4 is literally in preview, and its architecture is so new that engine support for its weights is poor.

          Right this second, you can either pay a few cents to try it from some API (there are many providers, since it’s open weights), or rent a GPU (or maybe a CPU) instance if you don’t trust the public tests and actually want to test resource usage yourself.

          Or you can quantize it and self host it. I plan to do so on my 128GB RAM/RTX 3090 desktop, which is an affordable config to rent if you don’t have a desktop like that.

          But llama.cpp support is a work-in-progress. Same with other backends like Ktransformers. Realistically your options are:

          • Wait a week, maybe a few weeks, for the llama.cpp/ik_llama.cpp developers to implement the DSV4 architecture.

          • Try one of the janky GPU/Apple forks available right now.

          • Try one of the slightly-less-janky, but slow, CPU-only Chinese forks.

          But once it’s implemented, I’m going to make my own personal IQ3_KS mixed quantization for 128GB desktops, and see how it compares to older architectures myself.


          Another confounding factor: if you’re researching “AI farm inference costs,” that’s a very different question.

          Frugal providers like Deepseek use complicated schemes to batch requests over many GPUs, with each GPU taking requests in parallel. In other words, the more GPUs they have, the more speed per GPU they can squeeze out. For DeepseekV3, last I heard, around 300 GPUs or so was an ideal deployment size…

          And they aren’t even going to be using Nvidia GPUs anyway. I believe Deepseek is switching to Huawei for inference.

          But however you slice it, they’re using orders of magnitude fewer resources than Tech Bro providers like OpenAI or Grok. They have been, for over a year.

          • Scrubbles@poptalk.scrubbles.tech · 2 points · 14 hours ago

            That all makes sense to me, and lines up with what I’ve been reading too. I saw the model download and I was like “guhhhh” to it because I was also excited to try it on my 3090. I’ll be waiting for the quants.

            Yeah I like the end there too, that OpenAI / Anthropic have been desperately trying to figure out how to do this, and a few guys with limited hardware did it. When you have unlimited resources, you end up needing unlimited resources. When you only have 300 GPUs, you make it work. It’s why tech is littered with people starting in garages, they found a way to make it work.

            • brucethemoose@lemmy.world · 2 points · 13 hours ago

              And to be clear, you need a 3090 + at least 96GB of fast CPU RAM (really 128GB) to run Deepseek Flash coherently. It is a big model; there’s no way around it.

              If you have less RAM, try Qwen 27B now (which also uses an exotic attention mechanism). It’ll fit on your 3090 just fine.

              For DeepSeek Pro, you’d need a Xeon or EPYC homelab.

            • brucethemoose@lemmy.world · 1 point · 13 hours ago

              I view it differently.

              In the US, there are either megacorps or “people in garages” who honestly don’t have the resources (and things like legal support) to pull off huge innovations. They publish cool papers that never get implemented, because they don’t have $200k+ for a bigger test and can’t work on it for a living. Any “garage devs” who get too big get smited or amalgamated into Big Tech gray goo, and whatever was interesting gets lost in oblivion.

              There’s no cooperation, no sharing, either.

              And OpenAI/Anthropic are way more conservative than you’d think. Same with Meta; they want results next quarter. Zuckerberg literally fired the whole Llama team, which put Meta on the AI map and basically founded the open-weights space, after one failed experiment. In other words, I’d argue clueless billionaires and the Tech Bro acolytes surrounding them are poisoning LLM development, and it’s starting to catch up.


              In China, things are different. The GPU sanctions forced gigantic companies like Alibaba and Tencent to be compute-thrifty, but they all seem to have access to suspiciously good training data… I would bet the Chinese govt is helping them under the table. Chinese devs also have an interesting attitude; I would characterize them as “cooperative,” with lots of private forum sharing going on, most models being open-weights, and clearly not a lot of desire to censor their models for the government. But they have their own forms of dysfunction too, sometimes copying other firms a little too closely, or with corporate/personal drama like anywhere else.

    • brucethemoose@lemmy.world · 2 points · 15 hours ago

      TurboQuant is total baloney.

      It’s just KV cache quantization, and we’ve had all sorts of that for ages. Backends, not just papers, have had 4-bit cache with Hadamard rotation (a major component of TurboQuant), and very low loss, since like 2023.
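
      Minus the Hadamard rotation and the per-backend plumbing, the core step is roughly this (group size and tensor shape are arbitrary here):

        import numpy as np

        # Toy 4-bit group quantization of a cache tensor: scale each group so it
        # fits the int4 range [-8, 7], round, and keep the scales.
        def quantize_int4(x, group=64):
            x = x.reshape(-1, group)
            scale = np.abs(x).max(axis=1, keepdims=True) / 7 + 1e-8
            q = np.clip(np.round(x / scale), -8, 7).astype(np.int8)
            return q, scale

        def dequantize(q, scale):
            return q.astype(np.float32) * scale

        kv = np.random.randn(4096, 64).astype(np.float32)   # stand-in cache block
        q, s = quantize_int4(kv)
        print("mean abs error:", np.abs(kv - dequantize(q, s)).mean())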

      We’ve had proof that Bitnet works for over a year.

      And no one cares. No one uses that kind of quantization because it reduces batched throughput, just like TurboQuant.

      Besides, new architectures (like DeepSeek V4) render it obsolete, as they don’t use traditional KV cache anymore. I honestly have no idea how TurboQuant became such a meme, other than major astroturfing.


      TL;DR All AI news is total bull. It’s chum for investors.

      You need to look at what the engines, papers and actual LLM weight architectures are doing.

    • Zikeji@programming.dev · 5 points · 1 day ago

      There’s also speculative decoding and adjacent techniques getting traction, increasing performance of the models on the same hardware.
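
      Roughly, the trick looks like this toy version. Both “models” are stand-in functions; a real implementation verifies all drafted tokens in one batched forward pass of the big model and uses rejection sampling rather than this greedy match:

        # Toy greedy speculative decoding: a cheap draft model proposes k tokens,
        # the expensive target model checks them, keeps the agreeing prefix, and
        # always contributes one token of its own.
        def draft_next(ctx):
            return (sum(ctx) + len(ctx)) % 7     # cheap stand-in "model"

        def target_next(ctx):
            return (sum(ctx) + len(ctx)) % 5     # expensive stand-in "model"

        def speculative_step(ctx, k=4):
            proposed, c = [], list(ctx)
            for _ in range(k):                   # draft k tokens cheaply
                t = draft_next(c)
                proposed.append(t)
                c.append(t)
            accepted, c = [], list(ctx)
            for t in proposed:                   # keep the longest agreeing prefix
                if target_next(c) != t:
                    break
                accepted.append(t)
                c.append(t)
            accepted.append(target_next(c))      # target always adds one token
            return list(ctx) + accepted

        print(speculative_step([1, 2, 3]))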

    • GamingChairModel@lemmy.worldOP · 2 points · 1 day ago

      New technology has always been horribly inefficient; it’s only once more people see it that it starts to get optimized.

      Well, I wonder if the frontier ends up looking like supersonic commercial flight (prohibitively expensive so that there wasn’t enough of a market for consumers at the actual cost of providing the service): technology that continues to exist but never really gets used, because the alternatives that aren’t as good are still much, much cheaper.

      • MagicShel@lemmy.zip · 2 points · 18 hours ago

        Not everyone needs a Lamborghini or Concorde to get where they are going.

        Work is pushing us to use cloud models and I haven’t had time to experiment beyond a few limited tests. Qwen 3.6 ~30B Q4 runs pretty well on 36GB of RAM. It’s a very capable model. It did choke when I tried to connect Cline to it for Java dev. But when I just conversationally ask it to write Python scripts it works pretty well.
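
        The napkin math behind that checks out roughly; the bits-per-weight figure below is an assumed average for a Q4-style quant, scale overhead included:

          # Rough weight-memory footprint for a quantized model.
          def weights_gib(params_billion, bits_per_weight):
              return params_billion * 1e9 * bits_per_weight / 8 / 1024**3

          print(f"30B @ ~4.8 bpw: {weights_gib(30, 4.8):.1f} GiB")   # ~16.8 GiB
          print(f"30B @ 16 bpw:   {weights_gib(30, 16):.1f} GiB")    # ~55.9 GiB fp16
          # Leaves headroom on 36 GB for the KV cache, activations, and the OS.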

        I can see a future where a goodly amount of RAM and an AI chip can produce the results we are currently getting only from cloud models.

        • GamingChairModel@lemmy.worldOP · 2 points · 14 hours ago

          Not everyone needs a Lamborghini or Concorde to get where they are going.

          I agree with that. Even so, Lamborghinis are still being built, operated, and maintained, while Concordes are not.

          I’m wondering whether the future of AI looks like the last 50 years of aviation: not many generational advances, because developing new stuff becomes prohibitively expensive, while the commoditization of what has already been invented means the experience for the average person really isn’t that different between 1976 and 2026. In that world, the sweet spot for cost effectiveness isn’t at the bleeding edge at all.

          And for my own curiosity on this line of thinking, I wanted to know whether the day-to-day cost of running these models is going down, and in which contexts.

  • Zarxrax@lemmy.world · 4 points · 1 day ago

    It’s easy to think of it as similar to computer hardware or game consoles. There is always newer and better hardware coming out. And the newer stuff is always more efficient (performance/watt) than the old stuff. But users’ expectations increase as well, so new hardware doesn’t just aim to be more efficient, it aims to be more powerful. Then that sets a new baseline for expectations.

    So a lot of these LLM and other types of models are very much like that. The newer models definitely bring improvements in efficiency and performance. But no one wants to sit still, they have to keep pushing the envelope to make them better and more powerful.

  • BlameThePeacock@lemmy.ca · 4 points · 1 day ago

    For an equivalent prompt and similar quality answer, yes. Inference prices are dropping.

    However, higher quality answers (or more complex prompt handling) are currently going up in inference price.

    The fun part will be once quality hits a point where the average user (or even business) doesn’t care about the incremental quality change any more. Then it’s going to be a race to the bottom for performance per dollar.

    Who cares if not all companies or investors make money? They can make their bets, some will win and some will lose. I just want better tech for cheaper prices.

    • GamingChairModel@lemmy.worldOP · 7 points · 1 day ago

      Who cares if not all companies or investors make money?

      I care about the downstream effects on everyone else, of who else gets hurt in a crash.

      • BlameThePeacock@lemmy.ca · 1 point · 1 day ago

        That has nothing to do with the technology. The last crash was caused by a global virus, and the one before that was the banking system…

        • GamingChairModel@lemmy.worldOP · 2 points · 14 hours ago

          That has nothing to do with the technology.

          Good thing my core question isn’t asking about the technology, then, and is asking about the financials of running that technology.