[Meta] How do y'all post clips/animations on Lemmy? Only GIF seems to work.

brucethemoose@lemmy.world · edit-2 26 minutes ago

A lot, but less than you’d think! Basically a RTX 3090/threadripper system with a lot of RAM (192GB?)

With this framework, specifically: https://github.com/ikawrakow/ik_llama.cpp?tab=readme-ov-file

The “dense” part of the model can stay on the GPU while the experts can be offloaded to the CPU, and the whole thing can be quantized to ~3 bits average, instead of 8 bits like the full model.

That’s just a hack for personal use, though. The intended way to run it is on a couple of H100 boxes, and to serve it to many, many, many users at once. LLMs run more efficiently when they serve in parallel. Eg generating tokens for 4 users isn’t much slower than generating them for 2, and Deepseek explicitly architected it to be really fast at scale. It is “lightweight” in a sense.

…But if you have a “sane” system, it’s indeed a bit large. The best I can run on my 24GB vram system are 32B - 49B dense models (like Qwen 3 or nemotron), or 70B mixture of experts (like the new Hunyuan 70B).

brucethemoose@lemmy.world · edit-2 14 hours ago

DeepSeek, now that is a filtered LLM.

The web version has a strict filter that cuts it off. Not sure about API access, but raw Deepseek 671B is actually pretty open. Especially with the right prompting.

There are also finetunes that specifically remove China-specific refusals. Note that Microsoft actually added saftey training to “improve its risk profile”:

https://huggingface.co/microsoft/MAI-DS-R1

https://huggingface.co/perplexity-ai/r1-1776

That’s the virtue of being an open weights LLM. Over filtering is not a problem, one can tweak it to do whatever you want.

Grok losing the guardrails means it will be distilled internet speech deprived of decency and empathy.

Instruct LLMs aren’t trained on raw data.

It wouldn’t be talking like this if it was just trained on randomized, augmented conversations, or even mostly Twitter data. They cherry picked “anti woke” data to placate Musk real quick, and the result effectively drove the model crazy. It has all the signatures of a bad finetune: specific overused phrases, common obsessions, going off-topic, and so on.

…Not that I don’t agree with you in principle. Twitter is a terrible source for data, heh.

brucethemoose@lemmy.world · edit-2 15 hours ago

Nitpick: it was never ‘filtered’

LLMs can be trained to refuse excessively (which is kinda stupid and is objectively proven to make them dumber), but the correct term is ‘biased’. If it was filtered, it would literally give empty responses for anything deemed harmful, or at least noticably take some time to retry.

They trained it to praise hitler, intentionally. They didn’t remove any guardrails. Not that Musk acolytes would know any different.

brucethemoose@lemmy.world · edit-2 5 days ago

…iOS forces uses Apple services including getting apps through Apple…

Can’t speak to the rest of the claims, but Android practically does too. If one has to sideload an app, you’ve lost 99% of users, if not more.

It makes me suspect they’re not talking about the stock systems OEMs ship.

Relevant XKCD: https://xkcd.com/2501/

brucethemoose@lemmy.world · edit-2 5 days ago

deleted by creator

brucethemoose@lemmy.world · edit-2 6 days ago

Yep.

It’s not the best upscale TBH.

Hence I brought up redoing it with some of the same techniques (oldschool vapoursynth processing + manual pixel peeping) mixed with more modern deinterlacing and better models than Waifu2X. Maybe even a finetune? Ban.

brucethemoose@lemmy.world · edit-2 6 days ago

Pro is 120hz.

But they are expensive as heck. I only got the 16 Plus because its a carrier loss leader, heh.

And wouldn’t fix some of my other quibbles with iOS’s inflexibility. My ancient jailbroken iPhone 5 was more customizable than now, and Apple is still slowly, poorly implementing features I had a decade ago. It’s mind boggling, and jailbreaking isn’t a good option anymore.

brucethemoose@lemmy.world · 6 days ago

Nah I meant the opposite. Journalistic integrity was learned through long, hard history.

Now that traditional journalism is dying, its like the streamer generation has to learn it from scratch, heh.

brucethemoose@lemmy.world · edit-2 6 days ago

I got banned from a fandom subreddit for pointing out that a certain fan remaster was (partially, with tons of manual work) made with ML models. Specifically with oldschool GANs, and some smaller, older models as part of a deinterlacing pipeline, from before ‘generative AI’ was even a term.

brucethemoose@lemmy.world · edit-2 6 days ago

My last Android phone was a Razer Phone 2, SD845 circa 2018. Basically stock Android 9.

And it was smooth as butter. It had a 120hz screen while my iPhone 16 is stuck at 60, and I can feel it. And it flew through some heavy web apps I use while the iPhone chugs and jumps around, even though the new SoC should objectively blow away even modern Android devices.

It wasn’t always this way; iOS used to be (subjectively) so much faster that it’s not even funny, at least back when I had an iPhone 6S(?). Maybe there was an inflection point? Or maybe it’s only the case with “close to stock” Android stuff that isn’t loaded with bloat.

brucethemoose@lemmy.world · edit-2 6 days ago

Random aside, I switched from Android to iOS a year ago. I miss Android already.

The UI is more convoluted an clunky than iOS from years ago, just as uncustomizable, and performs shockly bad on heavy webpages on a brand new 16+. It’s got no freaking RAM, no sd card slot. Some free FOSS apps are nonexistant or paid only.

Security and OOTB privacy is better and app support is generally better, but that’s about it? I’d probably keep an iPhone around to bank on when I eventually switch…

brucethemoose@lemmy.world · 7 days ago

Its kinda like influencers (and their younger viewers) are relearning the history of journalism from scratch, heh.

brucethemoose@lemmy.world · edit-2 7 days ago

Surpressing sponsors is a perverse incentive too; all the more reason to not disclose who’s paying the creator.

And yeah, any ‘moral’ justification for web ads is dead like 100 times over. I hate how hard it makes life for ‘old web’ style sites with like one innocent banner ad, but still.

brucethemoose@lemmy.world · edit-2 8 days ago

One thing about Anthropic/OpenAI models is they go off the rails with lots of conversation turns or long contexts. Like when they need to remember a lot of vending machine conversation I guess.

A more objective look: https://arxiv.org/abs/2505.06120v1

https://github.com/NVIDIA/RULER

Gemini is much better. TBH the only models I’ve seen that are half decent at this are:

“Alternate attention” models like Gemini, Jamba Large or Falcon H1, depending on the iteration. Some recent versions of Gemini kinda lose this, then get it back.
Models finetuned specifically for this, like roleplay models or the Samantha model trained on therapy-style chat.

But most models are overtuned for oneshots like fix this table or write me a function, and don’t invest much in long context performance because it’s not very flashy.

brucethemoose@lemmy.world · edit-2 8 days ago

Not at all. Not even close.

Image generation is usually batched and takes seconds, so 700W (a single H100 SXM) for a few seconds for a batch of a few images to multiple users. Maybe more for the absolute biggest (but SFW, no porn) models.

LLM generation takes more VRAM, but is MUCH more compute-light. Typically one has banks of 8 GPUs in multiple servers serving many, many users at once. Even my lowly RTX 3090 can serve 8+ users in parallel with TabbyAPI (and modestly sized model) before becoming more compute bound.

So in a nutshell, imagegen (on an 80GB H100) is probably more like 1/4-1/8 of a video game at once (not 8 at once), and only for a few seconds.

Text generation is similarly efficient, if not more. Responses take longer (many seconds, except on special hardware like Cerebras CS-2s), but it parallelized over dozens of users per GPU.

This is excluding more specialized hardware like Google’s TPUs, Huawei NPUs, Cerebras CS-2s and so on. These are clocked far more efficiently than Nvidia/AMD GPUs.

…The worst are probably video generation models. These are extremely compute intense and take a long time (at the moment), so you are burning like a few minutes of gaming time per output.

ollama/sd-web-ui are terrible analogs for all this because they are single user, and relatively unoptimized.

brucethemoose@lemmy.world · edit-2 8 days ago

TBH most people still use old SDXL finetunes for porn, even with the availability of newer ones.

brucethemoose@lemmy.world · 8 days ago

Also, one other thing is that Nvidia clocks their GPUs (aka the world’s AI accelerators) very inefficiently, because they have a pseudo monopoly, and they can.

It doesn’t have to be this way, and likely wont in the future.

brucethemoose@lemmy.world · edit-2 8 days ago

The UC paper above touches on that. I will link a better one if I find it.

But specifically:

streaming services

Almost all the power from this is from internet infrastructure and the end device. Encoding videos (for them to be played thousands/millions of times) is basically free since its only done once, with the exception being YouTube (which is still very efficient). Storage servers can handle tons of clients (hence they’re dirt cheap), and (last I heard) Netflix even uses local cache boxes to shorten the distance.

TBH it must be less per capita than CRTs. Old TVs burned power like crazy.

brucethemoose@lemmy.world · edit-2 8 days ago

Bingo.

Altman et al want to kill open source AI for a monopoly.

This is what the entire AI research space already knew even before deepseek hit, and why they (largely) think so little of Sam Altman.

The real battle in the space is not AI vs no AI, but exclusive use by AI Bros vs. open models that bankrupt them. Which is what I keep trying to tell /c/fuck_ai, as the “no AI” stance plays right into the AI Bro’s hands.

brucethemoose@lemmy.world · 8 days ago

I think that’s going a bit far. ML models are tools to augment people, mostly.

brucethemoose@lemmy.world · edit-2 3 months ago

[Meta] How do y'all post clips/animations on Lemmy? Only GIF seems to work.

brucethemoose@lemmy.world · edit-2 7 months ago

[Rumor] Shipping Listing Suggests 24GB+ Intel Arc B580

brucethemoose@lemmy.world · edit-2 9 months ago

Guide to Self Hosting LLMs Faster/Better than Ollama

brucethemoose@lemmy.world · 1 year ago

Alleged AMD Strix Halo APU Appears in Benchmark