u/Asgard_Heima — reddlx

Kimi K2 on Cerebras ~1000 tokens per second

This is a massive validation that we are going to see frontier models of any size run significantly faster on Cerebras.

https://www.cerebras.ai/blog/cerebras-kimi-k2-Enterprise

You have Cerebras Wrong

I have been a long-time Prof G and then Prof G Markets fan, and I normally just laugh it off when the show gets out of its depth on something. This time Cerebras is my single largest holding, and I feel like I just watched two people bash a revolutionary technology because they don’t like its partners and friends, with zero understanding of the value of the company itself.

Where you are right is that historically they have had G42, and the UAE more generally, as their largest customer. In reported revenue to date, that is still the largest portion. They needed a big-money backer to cover the massive research and build-out costs while they didn’t yet have the best product. They needed to move down the nm scale until the wafer could hold enough SRAM and compute to be competitive with everything else. The UAE needed AI compute and had limited options. Now Cerebras is moving into the largest names in US AI.

And this is one of the points you got wrong: their overall customer mix is incredible now. They have the world’s best medical and pharmaceutical researchers as clients. They have some of the hottest AI startups, like Cognition, Notion, Perplexity, AlphaSense, and Verve, as customers. They have major DOE labs as clients. And then Meta, Mistral, IBM, and now OpenAI and AWS as their largest customers.

You mentioned the current revenue of half a billion in 2025, but didn’t mention the backlog of over 5 billion that is not from OpenAI. And then yes, OpenAI has another contractual 20B over 3 years. Cerebras is faster, more energy efficient, and cheaper than GPUs for inference and training. OpenAI will do everything they can to make this work so they can be profitable on all the tokens these systems serve. The main thing you knock OpenAI for is exactly the problem Cerebras solves: it widens the margins on the inference delivered so that it can be profitable. Also, the existing backlog is the RPO from the S-1, so these are verified contractual amounts. The AWS agreement is a revenue share on every token they produce with their systems in AWS data centers, and it is not part of that total. So you have the largest cloud in the world baking Cerebras into their AI delivery platform Bedrock and also using them directly for their top model Nova, and none of that revenue is priced in yet.
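To put those numbers side by side, here is a quick back-of-envelope sketch. It only uses the figures quoted above; the helper names are mine, "over 5 billion" is treated as a lower bound, and the even per-year split of the OpenAI contract is an assumption, not something stated in the S-1.

```python
# Figures quoted in the post, in billions of USD.
revenue_2025 = 0.5        # "half a billion in 2025"
backlog_ex_openai = 5.0   # "backlog of over 5 billion not from OpenAI" (lower bound)
openai_contract = 20.0    # "another contractual 20B over 3 years"
contract_years = 3


def contracted_total(backlog, contract):
    """Sum of non-OpenAI backlog and the OpenAI contract."""
    return backlog + contract


def coverage_multiple(total, annual_revenue):
    """How many years of current revenue the contracted total represents."""
    return total / annual_revenue


total = contracted_total(backlog_ex_openai, openai_contract)
print(f"contracted total: at least ${total:.0f}B")
print(f"multiple of 2025 revenue: {coverage_multiple(total, revenue_2025):.0f}x")
# Assumption: the OpenAI contract is recognized evenly across its term.
print(f"OpenAI per year if spread evenly: ${openai_contract / contract_years:.1f}B")
```

The AWS revenue share is left out on purpose, since the post says it is not part of the backlog total.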

You need someone with real depth of knowledge that you trust to give you the rundown on why Cerebras is going to change the AI hardware space in the long run. The distance of memory from compute and overall memory bandwidth is half the issue, but the fact that GPUs have to distribute and replicate memory in a fundamentally less efficient architecture is the other half. Size of chip matters a lot for the performance of training and inference on AI hardware.

Cerebras is in a breakout moment and their revenue growth is accelerating. Just seeing AWS run top SOTA models on Bedrock 5x faster could make their valuation jump and have every hyperscaler working to acquire units as fast as they can.


u/Asgard_Heima — 4 days ago

▲ 11 r/CerebrasSystems

What/Why Cerebras?

I posted this in a couple of threads and I see this question asked in various forms a lot right now, so here is my view…

At the core is the technology, which comes from top-level management executing since 2015. They have built something others have tried for decades and been unable to accomplish. And now they have extensive patents to secure that moat.

If we just look at the physics of what they have built, it’s the maximum compute, and the maximum memory bandwidth to feed that compute, possible in a single wafer. The two fundamental constraints for AI are compute and memory, in combination. If you are short on compute, the data in memory can’t be consumed fast enough, and if the data isn’t ready when the cores need it, the cores sit idle. If you have both on the same wafer and use that whole wafer, you can’t get them any closer, faster, or larger. So at the most basic level they should have the very best physically possible solution.

If you look at any other architecture for large AI models, you will find the main bottleneck is memory bandwidth to feed compute. This is a direct result of moving the data that needs to be computed further away. Every atom further the data is from the compute cores adds latency and energy use. SRAM is closest, next is HBM, then DRAM, then SSD.
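A rough way to see why bandwidth dominates: if every active weight has to be read from memory once per generated token, bandwidth sets a hard ceiling on single-stream token rate. The sketch below is a simplified bound under that assumption. The function name, the bandwidth tiers, and the model size are illustrative placeholders, not vendor specs.

```python
def max_decode_tokens_per_second(bandwidth_bytes_per_s, active_params, bytes_per_param):
    """Upper bound on single-stream decode speed when every active weight
    is read from memory once per generated token (ignores compute, caching,
    batching, and KV-cache traffic)."""
    bytes_per_token = active_params * bytes_per_param
    return bandwidth_bytes_per_s / bytes_per_token


# Hypothetical memory tiers. The bandwidths are placeholders chosen only to
# show the ordering SRAM > HBM > DRAM, not measured or published numbers.
tiers = {
    "on-wafer SRAM (assumed 1000 TB/s)": 1000e12,
    "HBM (assumed 3 TB/s)": 3e12,
    "DRAM (assumed 0.3 TB/s)": 0.3e12,
}
active_params = 30e9  # assumed active parameters per token
bytes_per_param = 2   # assumed 16-bit weights

for name, bandwidth in tiers.items():
    rate = max_decode_tokens_per_second(bandwidth, active_params, bytes_per_param)
    print(f"{name}: at most {rate:,.0f} tokens/s per stream")
```

Real systems batch requests and overlap transfers, so treat this as a ceiling for one stream, not a benchmark.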

Next is off-wafer data, which comes down to wafer size. Every time you split a wafer, you have more data to send, not just from memory to compute but from one entire wafer to another. This is the interconnect tax. It’s currently an even larger problem than memory bandwidth. Every time you have to share data between wafers, it is bottlenecked by network bandwidth.

This is the most important issue for GPU inference and training, and it is why Groq’s small inference chips aren’t a winning solution. In training, all chips need to share all results across each layer, updating the model in each GPU’s memory every time. For inference it’s much the same, especially as models scale to a massive size.

Because a SOTA model won’t fit into the HBM of a single conventional GPU, it has to be split across multiple chips. This means that every single time a token is generated, the data has to constantly jump between chips over network cables, crushing your latency and massively increasing your power consumption.
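Here is a minimal sketch of that split, assuming a simple pipeline-style layout where activations cross one link between each pair of neighbouring devices per token. The function names and all the numbers are hypothetical, and tensor-parallel layouts with per-layer all-reduce would cost more than this.

```python
import math


def devices_needed(model_gb, memory_per_device_gb):
    """Smallest number of devices whose combined memory holds the weights."""
    return math.ceil(model_gb / memory_per_device_gb)


def per_token_latency_s(compute_s, devices, hop_latency_s):
    """Per-token latency for a pipeline-style split: compute time plus one
    interconnect hop between each pair of neighbouring devices."""
    hops = max(devices - 1, 0)
    return compute_s + hops * hop_latency_s


# Hypothetical example: 2000 GB of weights on devices with 80 GB each.
n = devices_needed(2000, 80)
latency = per_token_latency_s(compute_s=0.010, devices=n, hop_latency_s=0.0005)
print(f"devices needed: {n}")
print(f"per-token latency with hops: {latency * 1000:.1f} ms")
print(f"per-token latency on one device: {per_token_latency_s(0.010, 1, 0.0005) * 1000:.1f} ms")
```

The point is only that the hop count grows with the number of devices, so the interconnect term grows with model size even when per-device compute stays flat.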

I also want to highlight that we are hitting the maximum power and cooling possible in a single rack with GPUs. Each generation only increases the power needed per rack, and liquid-to-chip cooling is becoming required. Cerebras can fit two WSE units in a single rack under 80kW with backside air or liquid cooling, and can put one unit in any data center with a new whip. Because power use scales with the energy needed to send data over longer distances, this is a strategic advantage.

These all reinforce that Cerebras has the winning solution, and its lead will only grow as Cerebras moves down the nm scale, until it is orders of magnitude better for most things, like it already is for memory bandwidth.

Even with the most ideal solution, Cerebras has two main bottlenecks today: total SRAM on a wafer, and wafer-to-wafer networking speed. If either of these is solved, it will no longer matter what size or quantized model or edge case we are talking about. Cerebras will be an order of magnitude better than the competition in every real-world performance metric. And they are solving for both.

The partnership with Ranovus will add fiber co-packaged on the wafer, giving somewhere between 50-100Tbps of networking with light-speed latencies at the wafer edge. This is not the fiber networking of today, which requires SerDes and so still compounds latency. It will be fiber directly onto the wafer, with imperceptible latency in use.
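For a sense of scale on the 50-100Tbps range, here is a small sketch of raw transfer time at those rates. The 1 TB payload is a hypothetical example, and the calculation ignores latency and protocol overhead, so it is a best case.

```python
def transfer_seconds(payload_bytes, link_bits_per_s):
    """Time to move a payload over a link at its raw line rate
    (ignores latency, encoding, and protocol overhead)."""
    return payload_bytes * 8 / link_bits_per_s


payload_bytes = 1e12  # hypothetical 1 TB of weights or activations

for tbps in (50, 100):
    seconds = transfer_seconds(payload_bytes, tbps * 1e12)
    print(f"{tbps} Tbps: {seconds * 1000:.0f} ms to move 1 TB")
```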

The second is SRAM, where TSMC is helping them bond two wafers together, so they can connect an entire wafer of SRAM vertically to a wafer of compute cores. Look for these two details in any WSE-4 announcements this year, as this will be a major pivot moment.

Cerebras has to execute on this and find ways to ramp production, but if they ship something like this, which is expected, every hyperscaler is going to be on their side trying to get units shipped, since it will increase their token and training margins by 10x. Any WSE-4 like this will be one to multiple orders of magnitude more energy efficient per token delivered, train today’s SOTA models in weeks instead of months, and allow for 10M context windows on 10T+ parameter models with near 100% efficiency.

They can accomplish this since they can scale vertically into massive clusters. This will also unlock something GPUs have reached a limit on, and that’s model depth. As a distributed architecture, GPUs have maxed out at 80-120 layers. So we have wide models trained on extremely large data sets, but the number of layers to refine results is shallow, with 120 steps at most before you get the result. Going further just kills GPUs, so they have to decrease layer count as models get wider, with SOTA being under 100 layers.

Cerebras can already go deeper in layers with WSE-3, but with a WSE-4 we could see 1000-layer models and a whole new area of research for intelligence gains. There is a current gradient descent problem at that depth, but until now the hardware hasn’t existed in any form to research past it. There are already lots of ideas, like static weights for stretches of layers, which could make Cerebras even more efficient by skipping them, along with the zero weights it already skips, while GPUs can’t skip either.
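Assuming the gradient descent problem here means vanishing gradients, a toy model shows why depth hurts: if each layer scales the backpropagated gradient by a constant factor below 1, the signal shrinks exponentially with depth. The per-layer factor of 0.9 is an arbitrary illustration, and real networks use techniques like residual connections and normalization that this ignores.

```python
def gradient_scale(per_layer_factor, depth):
    """Gradient magnitude reaching the first layer, relative to the last,
    if every layer multiplies the gradient by the same factor."""
    return per_layer_factor ** depth


per_layer_factor = 0.9  # assumed, purely illustrative

for depth in (100, 120, 1000):
    scale = gradient_scale(per_layer_factor, depth)
    print(f"depth {depth}: gradient scaled by {scale:.3e}")
```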

This is much more natural, like how biological brains have depth in thought, and it should unlock much more cognitive reasoning capability. Cerebras accomplishes this with fine-grained dataflow as an architecture, which scales seamlessly. It was purpose-built to train and run AI models from the start. Cores only compute on data as it is received, and all zero weights are skipped, making them drastically faster at sparse training and inference.
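The zero-skipping idea is easy to sketch in plain Python. This is not Cerebras code and says nothing about how the hardware implements it. It just counts multiply-accumulate operations for a matrix-vector product with and without skipping zero weights, using a made-up weight matrix.

```python
def dense_matvec(weights, x):
    """Matrix-vector product that touches every weight, zero or not."""
    macs = 0
    out = []
    for row in weights:
        acc = 0.0
        for w, xi in zip(row, x):
            acc += w * xi
            macs += 1
        out.append(acc)
    return out, macs


def zero_skipping_matvec(weights, x):
    """Same result, but a zero weight triggers no multiply-accumulate."""
    macs = 0
    out = []
    for row in weights:
        acc = 0.0
        for w, xi in zip(row, x):
            if w == 0.0:
                continue  # nothing to compute for a zero weight
            acc += w * xi
            macs += 1
        out.append(acc)
    return out, macs


# Hypothetical sparse weight matrix: 4 of 6 entries are zero.
weights = [[0.0, 2.0, 0.0],
           [1.0, 0.0, 0.0]]
x = [1.0, 2.0, 3.0]

dense_out, dense_macs = dense_matvec(weights, x)
sparse_out, sparse_macs = zero_skipping_matvec(weights, x)
print(dense_out, dense_macs)
print(sparse_out, sparse_macs)
```

Both versions return the same output, and the work done by the second scales with the number of nonzero weights instead of the full matrix size.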

GPUs use single instruction, multiple threads (SIMT). This requires GPUs to split the compute and finish it in synchronized steps. So there is no skipping of weights, whether zero or static across layers. GPUs wait for each step’s computation to synchronize across all the GPUs used in training. Cerebras dynamically handles compute as the data arrives at each core, without waiting. In training, each layer is fed from MemoryX in a deterministic fashion, so it can supply the weights as a stream over all the wafers.
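A toy timing model makes the cost of synchronized steps concrete. With a barrier after every step, each step takes as long as its slowest worker. Without barriers, and assuming the workers have no dependencies on each other (a best case that real workloads won’t fully meet), the total is just the busiest worker’s own sum. The function names and timings are hypothetical.

```python
def lockstep_time(step_times):
    """Total time when all workers wait at a barrier after every step.
    step_times[s][w] is the time worker w needs for step s."""
    return sum(max(step) for step in step_times)


def free_running_time(step_times):
    """Total time when each worker proceeds as soon as its own work is done
    (assumes no cross-worker dependencies, so this is a lower bound)."""
    workers = len(step_times[0])
    return max(sum(step[w] for step in step_times) for w in range(workers))


# Hypothetical uneven workload: each worker is slow on a different step.
step_times = [
    [1, 3],  # step 0: worker 1 is slow
    [3, 1],  # step 1: worker 0 is slow
]
print("with barriers:", lockstep_time(step_times))
print("without barriers:", free_running_time(step_times))
```

When the workload is perfectly even, the two numbers match, so the gap only appears with uneven work such as skipped zero weights.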

I could dig deeper in a lot of places, like hardware failures in training (GPUs have to halt and go back to the last completed step, WSE just reroutes data and keeps going), software complexity for inference and training (CUDA was built to solve a problem Cerebras doesn’t have), expected lifetime value per system vs GPUs, and on and on, as each of these areas helps give me conviction in Cerebras, but this is already way too long.

Scaling production with TSMC, which is significantly over-allocated, is my biggest risk factor, but that’s really about the timing and scale of the success they will have.

References:

Co Packaged Optics (fiber):

https://ranovus.com/cerebras-ranovus-revolutionize-ai-compute-platform/

Wafer on Wafer (SRAM 3x):

https://3dfabric.tsmc.com/english/dedicatedFoundry/technology/SoIC.htm#SoIC_WoW

https://arxiv.org/html/2603.05266v2

https://fact-lab.hkust.edu.hk/publications/conference-paper/2025/bai-2025-accelstack/c20-paper.pdf

Updated: By popular request, broke it into paragraphs for ease of reading.

u/Asgard_Heima — 6 days ago

▲ 6 r/CerebrasSystems

I think Taalas has a bright future and could end up owning edge AI chips, potentially putting small models into laptops or workstations and being the hard-coded brain of cars and robots. But I also have to imagine that if Cerebras purchased Taalas or licensed their tech in a partnership, they could produce hard-coded wafers with Cerebras’ redundant pathways and wafer cooling. You could see frontier models air cooled with a crazy small energy footprint and absolutely unimaginable speed.


u/Asgard_Heima — 19 days ago

▲ 13 r/CerebrasSystems

I keep hearing the CUDA moat called out by financial analysts with no real-world experience and no data scientists to talk to.

Data scientists and AI engineers use PyTorch and TensorFlow for 99%+ of accelerator hardware work. The major users of CUDA are the model builders, such as OpenAI, which just signed up with Cerebras. They only use it because Nvidia’s overly complex distributed compute architecture requires it. Cerebras completely removes the need to deal with kernel optimizations and network latency handling, with a single massive WSE system in a box. CUDA is actually a competitive disadvantage when you realize its complexity delivers less performant results vs Cerebras.

Last, even if a transition were required for some edge-case uses, you can use AI to port your code with near-zero effort. This will be realized fast as the number of users running on Cerebras hardware increases. The places getting deep into the weeds of CSoft, like OpenAI and AWS, are going to be building the best models going forward.


u/Asgard_Heima — 1 month ago