Nvidia delivers first Vera Rubin AI GPU samples to customers — 88-core Vera CPU paired with Rubin GPUs with 288 GB of HBM4 memory apiece

@RegularJoe@lemmy.world · 4 months ago

@AliasAKA@lemmy.world · edit-2 4 months ago

Current models are speculated at 700 billion parameters plus. At 32-bit precision (single-precision float, 4 bytes per parameter), that’s 2.8 TB of RAM per model, or about 10 of these GPUs at 288 GB each. There are ways to lower it, but if you’re trying to run full precision (say for training), which I’d reckon means 32 bit, you’d use over 2x this, something like maybe 4x depending on how you store gradients and updates. It’s possible, I suppose, that they train at 32 bit, but I’d be kind of surprised.
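For anyone who wants to check the arithmetic above, here is a rough sketch. It assumes decimal units (1 GB = 1e9 bytes), counts only the weights for the inference figures, and for the training figure assumes an Adam-style optimizer that keeps weights, gradients, and two moment buffers at the same precision. The function names are made up for illustration.

```python
# Back-of-the-envelope memory estimate; all helper names are hypothetical.
import math

GPU_MEMORY_BYTES = 288e9  # 288 GB of HBM4 per Rubin GPU, per the headline


def weight_bytes(params: float, bits: int) -> float:
    """Bytes needed to hold the weights alone at the given precision."""
    return params * bits / 8


def gpus_needed(total_bytes: float) -> int:
    """Minimum number of GPUs whose combined memory holds total_bytes."""
    return math.ceil(total_bytes / GPU_MEMORY_BYTES)


params = 700e9  # speculated parameter count from the comment

fp32 = weight_bytes(params, 32)  # single precision, 4 bytes per parameter
fp16 = weight_bytes(params, 16)  # half precision, 2 bytes per parameter

# Assumption: weights + gradients + two Adam moment buffers, all at 32 bit,
# which gives the "maybe 4x" end of the range.
train_fp32 = 4 * fp32

for label, size in [("fp32 weights", fp32),
                    ("fp16 weights", fp16),
                    ("fp32 training (4x)", train_fp32)]:
    print(f"{label}: {size / 1e12:.1f} TB -> {gpus_needed(size)} GPUs")
```

Under those assumptions the weights alone need about 10 GPUs at 32 bit and 5 at 16 bit, and the 4x training case lands around 39. Activations, KV cache, and parallelism overhead are ignored, so real deployments would need more.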

Edit: Also, they don’t release parameter counts anymore, but some folks think newer models are around 1.5 trillion parameters, so figure around 2-3x the number above for newer models. The only real strategy for these guys is bigger. I think it’s dumb, and the returns are diminishing rapidly, but you’ve got to sell the investors. If reciting nearly whole works verbatim is easy now, it’s going to be exact if they keep going. They’ll approach parameter counts big enough to just straight up save things into their parameter space.

in_my_honest_opinion · 4 months ago

Sure, but giant context models are still more prone to hallucination and reinforcing confidence loops where they keep spitting out the same wrong result a different way.

@AliasAKA@lemmy.world · 4 months ago

Sorry, I’m not saying that’s a good thing. It’s not just the context that’s expanding, but the parameter count of the base model. I’m saying at some point you have just saved a compressed version of the majority of the content (we’re already kind of there), and you’d be able to decompress it with even less loss. This doesn’t make it more useful for anything other than recreating copyrighted works.

in_my_honest_opinion · 4 months ago

Ah I see, however you do bring up another point. I really think we need a true collection of experts able to communicate without the need for natural language and then a “translation” layer to output natural language or images to the user. The larger parameters would allow the injection of experts into the pipeline.

Thanks for the clarification, and also for the idea. I think one thing we can all agree on is that the field is expanding faster than any billionaire or company understands.

@AliasAKA@lemmy.world · 4 months ago

This already happens intrinsically in the models. The tokens are abstracted in the internal layers and only translated in the output layer back to next token prediction. Training visual models is slightly different because you’re not outputting tokens but pixel values (or possibly bounding boxes or edges, but not usually; conversely if not generative you may be predicting labels which could theoretically be in token space).
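A toy sketch of that point, with heavy caveats: this is not a transformer (no attention, no residual connections), the weights are random, and the vocabulary and layer sizes are made up. It only shows the shape of the pipeline: a token is mapped to a vector once, every internal layer works on vectors, and only the last step maps back to a score per token.

```python
import math
import random

random.seed(0)

VOCAB = ["the", "cat", "sat", "mat"]  # hypothetical toy vocabulary
D_MODEL = 4                           # width of the internal vectors


def rand_matrix(rows, cols):
    """Random weights standing in for trained parameters."""
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


embedding = rand_matrix(len(VOCAB), D_MODEL)        # token id -> vector
hidden_layers = [rand_matrix(D_MODEL, D_MODEL) for _ in range(3)]
unembedding = rand_matrix(len(VOCAB), D_MODEL)      # vector -> score per token


def next_token_distribution(token):
    h = embedding[VOCAB.index(token)]               # leave token space once
    for layer in hidden_layers:
        h = [math.tanh(x) for x in matvec(layer, h)]  # stays in vector space
    logits = matvec(unembedding, h)                 # back to token space here
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]      # numerically stable softmax
    total = sum(exps)
    return [e / total for e in exps]


probs = next_token_distribution("cat")
print(dict(zip(VOCAB, probs)))
```

Nothing between the embedding lookup and the final projection is a token, which is the sense in which the internal layers already work on an abstracted representation.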

The field itself is actually fairly stagnant in architecture. It’s still just attention layers all the way down. It’s just adding more context length and more layers and wider layers while training on more data. I personally think this approach will never achieve AGI or anything like it. It will get better at perfectly reciting its training data, but I don’t expect truly emergent phenomena to occur with these architectures just because they’re very big. They’ll be decent chatbots, but we already have that, and they’ll just consume ever more resources for vanishingly small improvements. They won’t functionally improve any true logical capability beyond regurgitating, in a very brittle way, logical paths already trodden in their training data, because they do not actually understand the logic or why it is valid: they have no true state model of the objects described in the token space they’re traversing probabilistically.

in_my_honest_opinion · edit-2 4 months ago

“will never achieve AGI or anything like it”

On this we absolutely agree. I’m targeting a more efficient interactive wiki essentially. Something you could package and have it run on local consumer hardware. Similar to this https://codeberg.org/BobbyLLM/llama-conductor but it would be fully transform native and there would only need to be one LLM for interaction with the end user. Everything else would be done in machine code behind the scenes.

I was unclear, I guess. I was talking about injecting other models, running their prediction pipeline for the specific topic, and then dropping them out of the window to be replaced by another expert. This functionality would be handled by a larger model that is running the context window. Not nested models, but interchangeable ones dependent on the vector of the tokens. So a qwq RAG trained on python talking to a qwen3 quant4 RAG trained on bash, wrapped in deepseekR1 as the natural language output, to answer the prompt “How do I best package a python app with uv on a linux server to run a backend for a …”

Currently this type of workflow is often handled with MCP servers from some sort of harness, and as I understand it those still use natural language because they are all separate models. But my proposal takes the stagnation in the field and leverages it as interoperability.
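To make the expert-swapping idea a bit more concrete, here is a very rough sketch of just the routing step. Everything in it is an assumption for illustration: the expert names, the keyword lists, and the use of bag-of-words cosine similarity as a stand-in for comparing real token vectors. It does not load or run any model and is not how llama-conductor or MCP work.

```python
import math
from collections import Counter

# Hypothetical registry: expert name -> words describing its specialty.
EXPERTS = {
    "python_expert": "python uv pip package venv module import",
    "bash_expert": "bash shell linux server systemd cron script",
}


def vectorize(text):
    """Bag-of-words counts; a crude stand-in for real token embeddings."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def route(segment):
    """Pick the expert whose specialty vector is closest to the segment."""
    seg = vectorize(segment)
    return max(EXPERTS, key=lambda name: cosine(seg, vectorize(EXPERTS[name])))


def plan(segments):
    """One expert per segment; each is swapped out before the next runs."""
    return [(route(s), s) for s in segments]


steps = plan([
    "package a python app with uv",
    "run it on a linux server",
])
for expert, segment in steps:
    print(f"{expert}: {segment}")
```

With these made-up keyword lists, the first segment goes to the python expert and the second to the bash one. In the actual proposal the hand-off between experts would stay in vector form rather than text, which this sketch does not attempt.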