I’ve read some of Ed Zitron’s long posts on why the AI industry is a bubble that will never be profitable (and will bring down a lot of companies and investors), and one of the recurring themes is that the AI companies are trying to capture growing market share in an industry where their marginal profits are still negative, and that any increase in revenue necessarily increases their costs of providing their services.

But some of the comments in various HackerNews threads are dismissive, saying that each new generation of models makes the cost of inference lower, so that with sufficient customer volume, the companies running the models can make enough profit on inference to make up for the staggering up-front capital expenditures it took to build out the data centers, train their models, etc.

It’s all pretty confusing to me. So for those of you who are familiar with the industry, I have several questions:

  1. Is the cost of running any given pretrained model going down, for specific models? Are there hardware and software improvements that make it cheaper to run those models, despite the model itself not changing?
  2. Is the cost of performing a particular task at a particular quality level going down through releases of newer models with similar performance (i.e., a smaller model of the current generation performing about as well as a bigger model of the previous generation, so that the same task is now cheaper to run)?
  3. Is the cost of running the largest flagship frontier models going down for any given task? Or does the cost of the cutting-edge, show-off tasks keep increasing, with the companies arguing that the improvement in performance is worth the higher price?

I suspect the discussion around this is so muddled online because the answer differs depending on which of the three questions is meant by “is running an AI model getting cheaper over time?” And the data isn’t easy to synthesize, because each model has different token prices and uses a different number of tokens per query.
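
To make that comparison problem concrete, here is the kind of napkin math I mean; every price and token count below is a made-up placeholder, not a real figure for any model:

```python
# Toy cost-per-task comparison. All prices and token counts are made-up
# placeholders, NOT real figures for any specific model.

def cost_per_task(input_tokens, output_tokens, price_in_per_m, price_out_per_m):
    """Dollar cost of one query, given per-million-token prices."""
    return (input_tokens / 1e6) * price_in_per_m + (output_tokens / 1e6) * price_out_per_m

# Hypothetical model A: cheap per token, but needs long chains of reasoning tokens.
model_a = cost_per_task(2_000, 6_000, price_in_per_m=0.50, price_out_per_m=1.50)

# Hypothetical model B: pricier per token, but answers the same task tersely.
model_b = cost_per_task(2_000, 400, price_in_per_m=2.00, price_out_per_m=8.00)

print(f"model A: ${model_a:.4f} per task")  # ~$0.0100 -- cheaper tokens, dearer task
print(f"model B: ${model_b:.4f} per task")  # ~$0.0072
```

So the per-token price sheet alone can’t answer any of the three questions; you also need tokens-per-task at a fixed quality level.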

But I wanted to hear from people who are knowledgeable about these topics.

  • brucethemoose@lemmy.world

    Also, on Deepseek V4… you can run it yourself, for free. There’s no mystery. And there’s tons of benchmarks out there already.

    It’s indeed very efficient if you’re into long context. But at shorter context lengths, it’s not too different from DeepSeek’s previous releases (and the flood of MoE models that have come out since).

      • brucethemoose@lemmy.world

        Okay, I fudged the part about “for free.” The problem is that DeepSeek V4 is still in preview, and its architecture is so new that engine support for its weights is poor.

        Right this second, you can either pay a few cents to try it from some API (there are many providers, since it’s open weights), or rent a GPU (or maybe CPU) instance if you don’t trust the public tests and actually want to measure resource usage yourself.
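
        If you go the API route, it’s a few lines against any OpenAI-compatible endpoint. A minimal sketch; the provider URL, model id, and key variable are placeholders you’d swap for your provider’s real values:

        ```python
        # Minimal sketch of "pay a few cents to try it from some API."
        # The base_url, model id, and env var below are placeholders -- check
        # your provider's docs for the real values.
        import os
        from openai import OpenAI  # pip install openai; many providers expose an OpenAI-compatible API

        client = OpenAI(
            base_url="https://api.example-provider.com/v1",  # placeholder endpoint
            api_key=os.environ["PROVIDER_API_KEY"],          # placeholder key name
        )

        resp = client.chat.completions.create(
            model="deepseek-v4-preview",  # placeholder model id; varies by provider
            messages=[{"role": "user", "content": "Summarize the tradeoffs of MoE inference."}],
            max_tokens=200,
        )

        print(resp.choices[0].message.content)
        print(resp.usage)  # token counts, so you can work out what the call actually cost
        ```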

        Or you can quantize it and self-host it. I plan to do so on my 128GB RAM/RTX 3090 desktop, which is an affordable config to rent if you don’t have a desktop like that.

        But llama.cpp support is a work-in-progress. Same with other backends like Ktransformers. Realistically your options are:

        • Wait a week, maybe a few weeks, for the llama.cpp/ik_llama.cpp developers to implement the DSV4 architecture.

        • Try one of the janky GPU/Apple forks available right now.

        • Try one of the slightly-less-janky, but slow, CPU-only Chinese forks.

        But once it’s implemented, I’m going to make my own personal IQ3_KS mixed quantization for 128GB desktops, and see how it compares to older architectures myself.
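
        For a sense of why a ~3-bit mixed quant targets a 128GB desktop, here’s the napkin math. The parameter count, bits per weight, and cache budget are assumptions for illustration, not DSV4’s actual specs:

        ```python
        # Napkin math for a mixed ~3-bit quant on a 128GB-RAM + 24GB-VRAM desktop.
        # Parameter count, bits/weight, and cache budget are ASSUMPTIONS, not
        # DeepSeek V4's real specs.

        total_params = 300e9   # assumed total parameters of the MoE
        avg_bits     = 3.5     # IQ3-class mixed quants average a bit over 3 bits/weight
        cache_gb     = 8       # assumed KV cache + compute buffers at modest context

        weights_gb = total_params * avg_bits / 8 / 1e9
        print(f"quantized weights: ~{weights_gb:.0f} GB")
        print(f"total footprint  : ~{weights_gb + cache_gb:.0f} GB "
              f"vs a 128 GB RAM + 24 GB VRAM = 152 GB budget")
        ```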


        Another confounding factor: if you’re researching “AI farm inference costs,” that’s a very different question.

        Frugal providers like DeepSeek use complicated schemes to batch requests across many GPUs, with each GPU serving many requests in parallel. In other words, the more GPUs they have, the more speed per GPU they can squeeze out. For DeepSeek V3, last I heard, around 300 GPUs or so was the ideal deployment size…
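
        A toy model of why that batching matters for cost; every number here is invented, purely to show the shape of the curve:

        ```python
        # Toy model: per-token cost falls as concurrent requests are batched,
        # until the GPU saturates. All constants are invented for illustration.

        GPU_COST_PER_HOUR = 2.50       # assumed all-in cost of one GPU-hour
        TOKENS_PER_SEC_SINGLE = 30     # assumed decode speed serving one request
        MAX_BATCH = 64                 # assumed batch size where compute saturates

        def cost_per_million_tokens(concurrent_requests: int) -> float:
            batch = min(concurrent_requests, MAX_BATCH)
            tokens_per_sec = TOKENS_PER_SEC_SINGLE * batch * 0.8  # 0.8 = assumed batching overhead
            gpu_seconds = 1e6 / tokens_per_sec
            return gpu_seconds * GPU_COST_PER_HOUR / 3600

        for n in (1, 8, 64, 256):
            print(f"{n:>3} concurrent requests -> ~${cost_per_million_tokens(n):.2f} per 1M output tokens")
        ```

        The point is just that utilization drives the per-token cost; a half-idle deployment is far more expensive per token than a saturated one.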

        And they aren’t even going to be using Nvidia GPUs anyway. I believe Deepseek is switching to Huawei for inference.

        But however you slice it, they’re using orders of magnitude fewer resources than Tech Bro providers like OpenAI or Grok. They have been for over a year.

        • Scrubbles@poptalk.scrubbles.tech

          That all makes sense to me, and lines up with what I’ve been reading too. I saw the model download and went “guhhhh,” because I was also excited to try it on my 3090. I’ll be waiting for the quants.

          Yeah, I like that last point too: OpenAI and Anthropic have been desperately trying to figure out how to do this, and a few guys with limited hardware did it. When you have unlimited resources, you end up needing unlimited resources. When you only have 300 GPUs, you make it work. It’s why tech is littered with people who started in garages: they found a way to make it work.

          • brucethemoose@lemmy.world

            And to be clear, you need a 3090 plus at least 96GB of fast CPU RAM (really 128GB) to run DeepSeek Flash coherently. It is a big model; there’s no way around it.

            If you have less RAM, try Qwen 27B now (which also uses an exotic attention mechanism). It’ll fit on your 3090 just fine.

            For DeepSeek Pro, you’d need a Xeon or EPYC homelab.
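
            The reason the GPU + system RAM split works at all is the MoE structure: only a few experts fire per token, but every expert has to be resident somewhere. A rough sketch with placeholder numbers, not any model’s real specs:

            ```python
            # Why a 24 GB GPU plus lots of system RAM is enough: only the active
            # experts run per token, but all experts must be stored. Placeholder
            # numbers only, not any model's real specs.

            total_params  = 300e9  # assumed total parameters (all experts)
            active_params = 15e9   # assumed parameters active per token
            avg_bits      = 3.5    # assumed quantization

            total_gb  = total_params  * avg_bits / 8 / 1e9  # must fit in RAM (plus VRAM)
            active_gb = active_params * avg_bits / 8 / 1e9  # per-token working set

            print(f"weights to store : ~{total_gb:.0f} GB -> hence the ~128 GB system RAM")
            print(f"active per token : ~{active_gb:.1f} GB -> well within a 24 GB 3090")
            ```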

          • brucethemoose@lemmy.world

            I view it differently.

            In the US, there are either megacorps or “people in garages” who honestly don’t have the resources, or things like legal support, to pull off huge innovations. They publish cool papers that never get implemented, because they don’t have $200k+ for a bigger test and can’t work on it for a living. Any “garage devs” who get too big get smited or amalgamated into Big Tech gray goo, and whatever was interesting gets lost in oblivion.

            There’s no cooperation, no sharing, either.

            And OpenAI/Anthropic are way more conservative than you’d think. Same with Meta; they want results next quarter. Zuckerberg literally fired the whole Llama team, which had put Meta on the AI map and basically founded the open-weights space, after one failed experiment. In other words, I’d argue clueless billionaires and the Tech Bro acolytes surrounding them are poisoning LLM development, and it’s starting to catch up with them.


            In China, things are different. The GPU sanctions forced gigantic companies like Alibaba and Tencent to be compute-thrifty, but they all seem to have access to suspiciously good training data… I would bet the Chinese government is helping them under the table. Chinese devs also have an interesting attitude; I would characterize them as “cooperative,” with lots of private forum sharing going on, most models being open weights, and clearly not a lot of desire to censor their models for the government. But they have their own forms of dysfunction too: sometimes copying other firms a little too closely, or corporate/personal drama like anywhere else.