• 1 Post
  • 111 Comments
Joined 1 year ago
Cake day: January 25th, 2024


  • Its whole deal is “declarative” system configuration. Essentially, it means that if your config file is identical to someone else’s, your systems will have identical software and dependencies, and everything should, in theory, run the same.

    So for instance, instead of running sudo apt install nameofpackage to install a package, I would just add pkgs.nameofpackage to my NixOS config file, run a command to “rebuild” my system using sudo nixos-rebuild switch, and it would automatically be installed.
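    In practice that looks something like this inside configuration.nix (a minimal sketch; firefox and git are just example packages from nixpkgs):

        environment.systemPackages = with pkgs; [
          firefox   # any package from nixpkgs goes here
          git
        ];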

    That’s not the whole of it of course, but that’s a general overview. It’s really good if you’re running multiple systems that need the same software, because all you need to do is copy the config file over, run sudo nixos-rebuild switch, and the systems now have identical software.

    Oh yeah, and you can also easily roll back. If you break anything, you can just select the previous configuration from the boot menu when starting NixOS, start your system, and any changes you’d made to software/settings will be undone. It’s great for troubleshooting.
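    (As far as I know, the same rollback can also be done from a running system without rebooting:

        sudo nixos-rebuild switch --rollback   # revert to the previous generation
        sudo nix-env --profile /nix/var/nix/profiles/system --list-generations   # see which generations exist

    which is handy when the system still boots fine.)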

    AFAIK NixOS also has the largest number of supported packages out of any distro.


  • True, but I’m of the belief that we’ll probably see a continuation of the existing trend of building and improving upon existing models, rather than always starting entirely from scratch. For instance, nearly every newly released model also reports the performance of its Llama-based variant, because combining a new technique with Llama’s existing quality just produces better results.

    I think we’ll see a similar trend now, just with R1 variants instead of Llama variants being the primary new type used. It’s just fundamentally inefficient to start over from scratch every time, so it makes sense that newer iterations would be built directly on previous ones.


  • So are these techniques so novel and such a breakthrough?

    The general concept, no. (it’s reinforcement learning, something that’s existed for ages)

    The actual implementation, yes. (Training the model to think inside a separate XML section, then using reinforcement learning to reinforce the highest-quality results from previous iterations, which naturally pushes responses toward the highest-rewarded outputs.) Most other companies simply assumed this wouldn’t work as well as throwing more data at the problem.

    This is actually how people believe some of OpenAI’s newest models were developed, with two differences: OpenAI was under the impression that more data would be necessary for the improvements, and thus had to keep training the entire model on additional new information, and they also assumed that directly training in thinking times was the best route, rather than arriving at them via reinforcement learning. DeepSeek decided to scrap that part altogether and go solely for reinforcement learning.
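    To give a feel for the reinforcement learning side: the reward signal can be as simple as a couple of rule-based checks. A rough Python sketch (the weights and tag handling here are my own placeholders, not DeepSeek’s actual code):

        import re

        def reward(completion: str, reference_answer: str) -> float:
            # Format check: reasoning must sit inside a <think> section,
            # with the final answer given after the closing tag.
            format_ok = bool(re.search(r"<think>.*?</think>", completion, re.DOTALL))
            # Accuracy check: compare the extracted answer to a known-good one.
            final = completion.split("</think>")[-1].strip()
            correct = final == reference_answer.strip()
            # Placeholder weights: reward the format a little, correctness a lot.
            return (0.5 if format_ok else 0.0) + (1.0 if correct else 0.0)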

    Will we now have a burst of DeepSeek-like models everywhere?

    Probably, yes. Companies and researchers are already beginning to use this same methodology. Here’s a writeup about S1, a model that performs up to 27% better than OpenAI’s best model. S1 used supervised fine-tuning, and did something so basic that people hadn’t previously thought to try it: just making the model think longer by modifying the terminating XML tags.

    This was released days after R1, built on R1’s initial premise, and it produces better-quality responses. Oh, and of course, it cost $6 to train.
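    From my reading of the writeup, the trick amounts to intercepting the end-of-thinking tag during decoding. A toy Python sketch (generate_step is a hypothetical token generator; a real implementation would feed the substituted token back into the model):

        def budget_force(generate_step, min_thinking_tokens: int) -> list[str]:
            tokens = []
            for tok in generate_step():
                # If the model tries to stop thinking too early, swap the
                # terminator for "Wait" so it keeps reasoning.
                if tok == "</think>" and len(tokens) < min_thinking_tokens:
                    tokens.append("Wait")
                    continue
                tokens.append(tok)
                if tok == "</think>":
                    break
            return tokens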

    So yes, I think it’s highly probable that we see a burst of new models, or at least improvements to existing ones. (Nobody has a very good reason to make a whole new model of a different name/type when they can simply improve the one they’re already using and have implemented)




  • I don’t personally think it’s because of that. Sure, federation as a concept outside of email has a bit of a messaging problem when you explain it to newbies, but… everyone uses email, and knows how that works. This is identical, just with posts instead of emails. Users aren’t averse to federation, in concept or in practice.

    Bluesky was directly created as a very close clone of Twitter’s UI, co-governed and subsequently pushed by the founder of Twitter himself, who obviously has more reach than randoms promoting something like Mastodon, and, in my opinion, it kind of just had better branding.

    “Bluesky” feels like a breath of fresh air, while “Mastodon” just sounds like… well, a Mastodon, whatever that makes the average person think of at first.

    So when you compare Bluesky, with a familiar UI, nice name, and consistent branding, not to mention algorithms, which Mastodon lacks, all funded by large sums of money, to Mastodon, with unfamiliar branding, minimal funding, and substantially less reach from promoters, which one will win out, regardless of the technology involved?


  • To anyone bemoaning BlueSky’s lack of federation, check out Free Our Feeds.

    It’s a campaign to create a public interest foundation independent from the Bluesky team (although the Bluesky team has said they support it). The foundation would build independent infrastructure, like a secondary “relay” serving as an alternative to Bluesky’s that can still communicate over the same protocol (the “AT Protocol”), while also giving developer grants for further social applications built on open protocols like the AT Protocol or ActivityPub.

    They have the support of an existing 501(c)(3), and their open letter has been signed by people you might find interesting, such as Jimmy Wales (founder of Wikipedia).
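    To illustrate why an independent relay matters: anything that speaks the AT Protocol can consume a relay’s event firehose, so a Free Our Feeds relay would be a drop-in alternative. A minimal Python sketch against Bluesky’s current relay endpoint (events arrive as DAG-CBOR frames; this just shows them arriving):

        import asyncio
        import websockets  # pip install websockets

        async def main():
            # Bluesky's relay; an independent relay would expose the same XRPC endpoint.
            uri = "wss://bsky.network/xrpc/com.atproto.sync.subscribeRepos"
            async with websockets.connect(uri) as ws:
                for _ in range(5):
                    frame = await ws.recv()  # one raw repo-commit event
                    print(f"got a {len(frame)}-byte event")

        asyncio.run(main())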






  • I doubt that will be the case, and I’ll explain why.

    As mentioned in this article,

    SFT (supervised fine-tuning), a standard step in AI development, involves training models on curated datasets to teach step-by-step reasoning, often referred to as chain-of-thought (CoT). It is considered essential for improving reasoning capabilities. DeepSeek challenged this assumption by skipping SFT entirely, opting instead to rely on reinforcement learning (RL) to train the model. This bold move forced DeepSeek-R1 to develop independent reasoning abilities, avoiding the brittleness often introduced by prescriptive datasets.

    This totally changes the way we think about AI training. It’s why, while OpenAI spent $100m training GPT-4 on an estimated 500,000 GPUs, DeepSeek used about 50,000, and likely spent roughly that same 10% of the cost.

    So not only is it now cheaper to operate and even train models, it’s also substantially less compute-intensive to train them.

    And not only is there less clean data than ever to train models on (data that won’t make them worse by regurgitating lower-quality AI-generated content), but even if additional datasets were scrapped entirely in favor of this new RL method, there’s a point at which an LLM is simply good enough.

    If you need to auto generate a corpo-speak email, you can already do that without many issues. Reformat notes or user input? Already possible. Classify tickets by type? Done. Write a silly poem? That’s been possible since pre-ChatGPT. Summarize a webpage? The newest version of ChatGPT will probably do just as well as the last at that.

    At a certain point, spending millions of dollars for a 1% performance improvement doesn’t make sense when the existing model already does what you need it to do.

    I’m sure we’ll see development, but I doubt we’ll see a massive increase in training just because the cost to run and train the model has gone down.


  • Those tokens/s figures are the performance, or response speed if you’d like to call it that. OpenAI’s o1 tends to get anywhere from 33-60 tokens/s, whereas in the example I showed previously, a Raspberry Pi can do 200 on a distilled model.

    Now, granted, a distilled model will produce worse-quality output than the full one, as seen in a benchmark comparison done by DeepSeek here. (I’ve outlined the most distilled version of the newest DeepSeek model, which is likely the kind being run on the Raspberry Pi, albeit likely with some changes made by the author of that post, as well as OpenAI’s two most high-end models of a comparable distillation.)
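    If you want to try a distilled variant yourself, Ollama hosts them; assuming it’s installed, something like this pulls and runs the 1.5B distillation locally:

        ollama run deepseek-r1:1.5b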

    The gap in quality is relatively small for a model that is likely distilled far past what OpenAI’s “mini” model is. And when you consider that even regular laptop/PC hardware is orders of magnitude more powerful than a Raspberry Pi, or that an external AI accelerator can be bought for as little as $60, the quality in practice could be very comparable with even slightly less distillation, especially with fine-tuning for a given use case. (e.g. a local version of DeepSeek in a code development platform would be fine-tuned specifically to produce code-related results)

    If you get into the realm of cloud-hosted instances of DeepSeek running at scale on GPUs, like OpenAI’s models are, the performance is only 1-2 percentage points off from OpenAI’s model, at about 3-6% of the cost, which effectively means paying for only 3-6% of the GPU power OpenAI is paying for.
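    As a rough sanity check on that cost figure (the list prices here are from my memory of early-2025 API pricing, so treat them as assumptions):

        o1_price = 60.00  # USD per million output tokens (assumed o1 list price)
        r1_price = 2.19   # USD per million output tokens (assumed R1 list price)
        print(f"R1 is {r1_price / o1_price:.1%} of o1's per-token cost")  # ~3.7%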




  • I hate community notes, it’s a cost-free way of fact-checking with no accountability.

    I don’t think it’s necessarily bad, but it can be harmful on a platform with a significant skew in its political leanings: it can lead to the assumption that posts must be true because they were “fact checked,” even when the fact check was just written by one of the 9:1 majority of users who already believes that one thing.

    However, on platforms that have more general, less biased overall userbases, such as YouTube, a community notes system can be helpful, because it directly changes the platform incentives and design.

    I like to come at this from the understanding that the way a platform is designed influences how it is used and perceived by users. When you add a like button but not a dislike button, for instance, you incentivize only fleeting positive interactions with posts, while relegating stronger negative opinions to the comments. (see: Twitter)

    If a platform integrates community notes, that not only elevates content where at least some fact-checking effort was made (as opposed to none at all), it also means that to get a community note, somebody must at least attempt to verify the truth. And if someone does that, then statistically speaking, the truth is at least slightly more likely to be made apparent in that community note than if no system existed to incentivize fact-checking in the first place.

    Again, this doesn’t work in all scenarios, nor is it always a good addition given a platform’s current design and general demographic political leanings, but I do think it can be valuable in some cases. (This also heavily depends on who is allowed to create the community notes, of course.)



  • All requests are proxied through DuckDuckGo, and all personalized user metadata is removed. (e.g. IPs, any sort of user/session ID, etc)

    They have direct agreements not to train on or store user data (the training part is specifically relevant to OpenAI & Anthropic), with a requirement that all information be deleted within 30 days of no longer being necessary for providing responses.

    For the Llama & Mixtral models, they host them on together.ai (an LLM-focused cloud platform), which is bound by the same data privacy requirements as OpenAI and Anthropic.

    Recent chats that are saved for later are stored locally (instead of on their servers), and after 30 conversations, the oldest chat is automatically purged from your device.

    Obviously there are fewer technical privacy guarantees than with a local model, but for when that’s not practical or possible, I’ve found it’s a good option.