I’ve read some of Ed Zitron’s long posts on why the AI industry is a bubble that will never be profitable (and will bring down a lot of companies and investors), and one of the recurring themes is that the AI companies are trying to capture growing market share in an industry where their marginal profits are still negative, and that any increase in revenue necessarily increases their costs of providing their services.

But some of the comments in various HackerNews threads are dismissive, saying that each new generation of models makes the cost of inference lower, so that with sufficient customer volume, the companies running the models can make enough profit on inference to make up for the staggering up-front capital expenditures it took to build out the data centers, train their models, etc.

It’s all pretty confusing to me. So for those of you who are familiar with the industry, I have several questions:

  1. Is the cost of running any given pretrained model going down, for specific models? Are there hardware and software improvements that make it cheaper to run those models, despite the model itself not changing?
  2. Is the cost of performing a particular task at a particular quality level going down, through releases of newer models of similar performance (i.e., a smaller model of the current generation performing similarly to a bigger model of the previous generation, such that the cost is now cheaper)?
  3. Is the cost of running the largest flagship frontier models going down for any given task? Or does running the cutting edge show-off tasks keep increasing in cost, but where the companies argue that the improvement in performance is worth the cost increase?

I suspect the discussion around this is so muddled online because the answers differ depending on which of the 3 questions is meant by “is running an AI model getting cheaper over time?” And the data isn’t easy to synthesize because each model has different token prices and a different number of tokens per query.
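To make the "token prices versus tokens per query" point concrete, here is a minimal sketch of the arithmetic. The two models and all prices and token counts are hypothetical, chosen only to show that a model with cheaper tokens can still cost more per query if it emits more of them.

```python
def cost_per_query(input_tokens, output_tokens, usd_per_m_input, usd_per_m_output):
    """Dollar cost of one query, given token counts and per-million-token prices."""
    total = input_tokens * usd_per_m_input + output_tokens * usd_per_m_output
    return total / 1_000_000


# Hypothetical model A: pricier tokens, short answers.
cost_a = cost_per_query(1_000, 500, usd_per_m_input=2.0, usd_per_m_output=10.0)

# Hypothetical model B: cheaper tokens, but a long "reasoning" output.
cost_b = cost_per_query(1_000, 5_000, usd_per_m_input=1.0, usd_per_m_output=4.0)

print(f"A: ${cost_a:.4f} per query, B: ${cost_b:.4f} per query")
```

Under these made-up numbers, B is cheaper per token but more expensive per query, which is why per-token price lists alone don't answer any of the three questions.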

But I wanted to hear from people who are knowledgeable about these topics.

  • Scrubbles@poptalk.scrubbles.tech
    1 day ago

    I think we’re seeing a lot of optimization right now. The most exciting one I’ve seen is TurboQuant. Short version: every message you send to a model carries context, meaning the entire conversation you’ve had, instructions, skills, everything. That takes up a lot of RAM that keeps growing as the context gets longer, and this is what is causing the VRAM/RAM shortage. TurboQuant (and other copycats now) claims that it can reduce the VRAM usage of the context by 20x. That’s absolutely huge; that’s potentially 1M-context models running on consumer hardware.
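As a rough back-of-the-envelope sketch of why long context eats memory: for a standard attention KV cache, each token stores one key and one value vector per layer, so the cache grows in direct proportion to context length. The model shape below (32 layers, 8 KV heads, head dimension 128) is an illustrative assumption, not any specific model discussed here, and the 4-bit figure ignores the small overhead of quantization scales.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Approximate KV cache size for one sequence (standard attention)."""
    # Factor of 2: one key vector and one value vector per token, per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


GIB = 1024 ** 3
config = dict(n_layers=32, n_kv_heads=8, head_dim=128)  # hypothetical model shape

for seq_len in (8_192, 131_072, 1_048_576):
    fp16 = kv_cache_bytes(seq_len, **config) / GIB
    four_bit = kv_cache_bytes(seq_len, bytes_per_value=0.5, **config) / GIB
    print(f"{seq_len:>9} tokens: {fp16:7.2f} GiB at 16-bit, {four_bit:6.2f} GiB at 4-bit")
```

Architectures that don't keep a traditional per-token KV cache would need a different formula.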

    DeepSeek V4 also makes some large claims, saying they have a model that does better than Anthropic’s or OpenAI’s while being 1/10th the size. That would also be a huge reduction in compute and VRAM, but I’ll be looking for the proof.

    We’ve seen other improvements too, in how models are run and how quickly results are streamed, but to me TurboQuant is the most exciting.

    I think it’s good that they’re finally looking at optimization. Yes, their cost has been power and compute. Nvidia is more than happy to keep things inefficient because they sell GPUs that way. Software companies are doing the opposite now, reducing the compute overhead to start saving them money, which they desperately need to do if this is going to continue. New technology has always been horribly inefficient, it’s only once more people see it does it start to get optimized.

    I think this is what is going to be required to finally push past the horribleness of AI companies, and they need to do this quickly.

    • brucethemoose@lemmy.world
      13 hours ago

      Also, on DeepSeek V4… you can run it yourself, for free. There’s no mystery. And there are tons of benchmarks out there already.

      It’s indeed very efficient if you’re into long context. But at shorter context lengths, it’s not too different from DeepSeek’s previous releases (and the flood of MoE models that have come since then).

        • brucethemoose@lemmy.world
          13 hours ago

          Okay, I fudged the part about “for free.” The problem is DeepSeek V4 is literally in preview, and its architecture is so new that engine support for its weights is poor.

          Right this second, you can either pay a few cents to try it from some API (there are many providers since it’s open weights), or rent a GPU (or maybe a CPU) instance if you don’t trust the public tests and actually want to test resource usage yourself.

          Or you can quantize it and self-host it. I plan to do so on my 128GB RAM/RTX 3090 desktop, which is an affordable config to rent if you don’t have a desktop like that.

          But llama.cpp support is a work-in-progress. Same with other backends like Ktransformers. Realistically your options are:

          • Wait a week, maybe a few weeks, for the llama.cpp/ik_llama.cpp developers to implement the DSV4 architecture.

          • Try one of the janky GPU/Apple forks available right now.

          • Try one of the slightly-less-janky, but slow, CPU-only Chinese forks.

          But once it’s implemented, I’m going to make my own personal IQ3_KS mixed quantization for 128GB desktops, and see how it compares to older architectures myself.


          Another confounding factor is, if you’re researching “AI farm inference costs,” that’s very different.

          Frugal providers like DeepSeek use complicated schemes to batch requests over many GPUs, with each taking requests in parallel. In other words, the more GPUs they have, the more speed per GPU they can squeeze out. For DeepSeek V3, last I heard, around 300 GPUs or so was an ideal deployment number…
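A minimal sketch of why per-GPU throughput matters so much for serving cost: if a GPU costs a fixed amount per hour, the hardware cost per token falls in proportion to how many tokens it pushes out. The hourly price and throughput figures are hypothetical placeholders, not measurements from DeepSeek or any other provider, and the model ignores power, networking, and idle time.

```python
def usd_per_million_tokens(gpu_usd_per_hour, tokens_per_second_per_gpu):
    """Hardware cost per million generated tokens for a fully utilized GPU."""
    tokens_per_hour = tokens_per_second_per_gpu * 3600
    return gpu_usd_per_hour / tokens_per_hour * 1_000_000


# Hypothetical: the same $2/hour GPU at two batching efficiencies.
small_batches = usd_per_million_tokens(2.0, tokens_per_second_per_gpu=1_000)
large_batches = usd_per_million_tokens(2.0, tokens_per_second_per_gpu=4_000)

print(f"small batches: ${small_batches:.3f}/M tokens, large batches: ${large_batches:.3f}/M tokens")
```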

          And they aren’t even going to be using Nvidia GPUs anyway. I believe Deepseek is switching to Huawei for inference.

          But however you slice it, they’re using orders of magnitude fewer resources than Tech Bro providers like OpenAI or Grok. They have been, for over a year.

          • Scrubbles@poptalk.scrubbles.tech
            12 hours ago

            That all makes sense to me, and lines up with what I’ve been reading too. I saw the model download and I was like “guhhhh” to it because I was also excited to try it on my 3090. I’ll be waiting for the quants.

            Yeah I like the end there too, that OpenAI / Anthropic have been desperately trying to figure out how to do this, and a few guys with limited hardware did it. When you have unlimited resources, you end up needing unlimited resources. When you only have 300 GPUs, you make it work. It’s why tech is littered with people starting in garages, they found a way to make it work.

            • brucethemoose@lemmy.world
              11 hours ago

              And to be clear, you need a 3090 + at least 96GB of fast CPU RAM (really 128GB) to run DeepSeek Flash coherently. It is a big model; there’s no way around it.

              If you have less RAM, try Qwen 27B now (which also uses an exotic attention mechanism). It’ll fit on your 3090 just fine.

              For DeepSeek Pro, you’d need a Xeon or EPYC homelab.

            • brucethemoose@lemmy.world
              11 hours ago

              I view it differently.

              In the US, there are either megacorps, or “people in garages” who honestly don’t have the resources and stuff like legal support to do huge innovations. They publish cool papers, which never get implemented because they don’t have $200k+ for a bigger test, and can’t work on it themselves for a living. Any “garage devs” who get too big get smited or amalgamated into Big Tech gray goo, and whatever was interesting gets lost in oblivion.

              There’s no cooperation, no sharing, either.

              And OpenAI/Anthropic are way more conservative than you’d think. Same with Meta; they want results next quarter. Zuckerberg literally fired the whole Llama team, which put Meta on the AI map and basically founded the open weights space, when they had one failed experiment. In other words, I’d argue clueless billionaires and the Tech Bro acolytes surrounding them are poisoning LLM development, and it’s starting to catch up.


              In China, things are different. The GPU sanctions forced these gigantic companies like Alibaba or Tencent to be compute-thrifty, but they all seem to have access to suspiciously good training data… I would bet the Chinese govt is helping them under the table. Chinese devs also have an interesting attitude; I would characterize them as “cooperative,” with lots of private forum sharing going on, most models being open-weights, and clearly not a lot of desire to censor their models for the government. But they have their own forms of dysfunction too, sometimes copying other firms a little too closely, or corporate/personal drama like anywhere.

    • brucethemoose@lemmy.world
      13 hours ago

      TurboQuant is total baloney.

      It’s just KV cache quantization, and we’ve had all sorts of that for ages. Backends, not just papers, have had 4-bit cache with Hadamard rotation (a major component of TurboQuant), and very low loss, since like 2023.
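For readers unfamiliar with the idea, here is a toy sketch of "rotate, then quantize to 4 bits." It is a simplified illustration in pure Python, not TurboQuant and not any backend's actual implementation; the per-vector scale, the rounding scheme, and the example vector are all assumptions. The point is only that an orthonormal Hadamard rotation spreads a single outlier across all coordinates, so a coarse 4-bit grid wastes less of its range.

```python
import math


def fwht(values):
    """Orthonormal fast Walsh-Hadamard transform (it is its own inverse)."""
    out = list(values)
    n = len(out)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    norm = math.sqrt(n)
    return [v / norm for v in out]


def quantize_4bit(values):
    """Symmetric 4-bit quantization with one scale per vector; returns dequantized values."""
    peak = max(abs(v) for v in values)
    if peak == 0.0:
        return [0.0] * len(values)
    scale = peak / 7
    return [max(-8, min(7, round(v / scale))) * scale for v in values]


def l2_error(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy "cache" vector with one large outlier (made-up values).
vec = [0.5, -0.3, 0.2, 0.4, -0.1, 0.3, -0.2, 10.0]

plain = quantize_4bit(vec)                 # quantize directly
rotated = fwht(quantize_4bit(fwht(vec)))   # rotate, quantize, rotate back

err_plain = l2_error(vec, plain)
err_rotated = l2_error(vec, rotated)
print(f"error without rotation: {err_plain:.3f}, with rotation: {err_rotated:.3f}")
```

Without rotation, the outlier sets the scale and every small value rounds to zero; with rotation, the error on this vector is noticeably lower.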

      We’ve had proof that Bitnet works for over a year.

      And no one cares. No one uses that kind of quantization because it reduces batched throughput, just like TurboQuant.

      Besides, new architectures (like DeepSeek V4) render it obsolete, as they don’t use traditional KV cache anymore. I honestly have no idea how TurboQuant became such a meme, other than major astroturfing.


      TL;DR All AI news is total bull. It’s chum for investors.

      You need to look at what the engines, papers and actual LLM weight architectures are doing.

    • Zikeji@programming.dev
      1 day ago

      There’s also speculative decoding and adjacent techniques getting traction, increasing performance of the models on the same hardware.
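A minimal sketch of the greedy variant of speculative decoding, using toy stand-in "models" (plain functions that map a token sequence to the next token). A cheap draft model proposes a few tokens, the expensive target model checks them, and the longest agreeing prefix is kept. The toy models, the `k` parameter, and the pass counter are illustrative assumptions; real engines verify all proposed positions in one batched forward pass and use sampling-aware acceptance rules.

```python
def greedy_decode(model, prompt, n_new):
    """Baseline: one target-model call per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(model(seq))
    return seq


def speculative_decode(target, draft, prompt, n_new, k=4):
    """Greedy speculative decoding; returns (sequence, number of verification passes)."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    passes = 0
    while len(seq) < goal:
        # 1. The cheap draft model proposes up to k tokens.
        proposal = []
        for _ in range(min(k, goal - len(seq))):
            proposal.append(draft(seq + proposal))
        # 2. The target checks every proposed position (simulated here as one pass).
        passes += 1
        accepted = []
        for token in proposal:
            expected = target(seq + accepted)
            if token == expected:
                accepted.append(token)
            else:
                accepted.append(expected)  # keep the target's token, drop the rest
                break
        seq.extend(accepted)
    return seq, passes


def toy_target(seq):
    """Hypothetical expensive model: counts upward modulo 10."""
    return (seq[-1] + 1) % 10


def toy_draft(seq):
    """Hypothetical cheap model: agrees with the target except when the answer is 5."""
    guess = (seq[-1] + 1) % 10
    return 0 if guess == 5 else guess


baseline = greedy_decode(toy_target, [0], 12)
speculated, passes = speculative_decode(toy_target, toy_draft, [0], 12)
print(speculated == baseline, passes)
```

The output matches plain greedy decoding with the target model, but with fewer target passes than tokens generated, which is where the speedup comes from.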

    • GamingChairModel@lemmy.worldOP
      1 day ago

      “New technology has always been horribly inefficient, it’s only once more people see it does it start to get optimized.”

      Well, I wonder if the frontier ends up looking like supersonic commercial flight (prohibitively expensive so that there wasn’t enough of a market for consumers at the actual cost of providing the service): technology that continues to exist but never really gets used, because the alternatives that aren’t as good are still much, much cheaper.

      • MagicShel@lemmy.zip
        16 hours ago

        Not everyone needs a Lamborghini or Concorde to get where they are going.

        Work is pushing us to use cloud models and I haven’t had time to experiment beyond a few limited tests. Qwen 3.6 ~30B Q4 runs pretty well on 36GB of RAM. It’s a very capable model. It did choke when I tried to connect Cline to it for Java dev. But when I just conversationally ask it to write Python scripts, it works pretty well.

        I can see a future where a goodly amount of RAM and an AI chip can produce the results we are currently getting only from cloud models.
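As a rough sanity check on why a ~30B model at 4-bit fits in that much memory, here is the weights-only arithmetic. The 4.5 bits per weight figure is an assumption (4-bit quantization formats carry some overhead for scales), and the estimate excludes the context cache and runtime buffers.

```python
def weights_gib(n_params_billion, bits_per_weight):
    """Approximate memory for model weights alone, in GiB."""
    total_bytes = n_params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1024 ** 3


for bits in (16, 8, 4.5):
    print(f"30B parameters at {bits} bits/weight: {weights_gib(30, bits):.1f} GiB")
```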

        • GamingChairModel@lemmy.worldOP
          12 hours ago

          “Not everyone needs a Lamborghini or Concorde to get where they are going.”

          I agree with that. Still, Lamborghinis are being built, operated, and maintained, while Concordes are not.

          I’m wondering whether the future of AI looks like the last 50 years of aviation: not many generational advances, because developing new stuff becomes prohibitively expensive, but commoditization of what has already been invented, so that the experience for the average person really isn’t that different between 2026 and 1976, and the sweet spot for cost effectiveness isn’t at the bleeding edge at all.

          And for my own curiosity on this line of thinking, I wanted to know whether the day-to-day cost of running these models is going down, and in which contexts.