r/computerarchitecture

ISCA: Worth it?

Hello! I am deliberating whether to attend ISCA this year and would appreciate some advice.

I just graduated from undergrad at a T10 school in the US. I am joining a big chip company full-time in the fall to do top-level CPU DV. I have done tapeouts and CPU design in the past. I like HW, but I am unsure whether I want to work toward an architect track at my organization or in general.

I got accepted to one of the ISCA workshops and am wondering if I should stick around for the entire conference. Could anyone who has attended in the last couple of years share their input and thoughts?

TLDR: Trying to gauge whether ISCA is a worthwhile experience for figuring out the direction of my industry career as an NCG.

reddit.com
u/JoyousRaccoon — 1 day ago
▲ 5 r/computerarchitecture+2 crossposts

Perf verification vs Perf modeling

Which role will lead to becoming an IP/unit level architect? I understand that perf modeling works closely with architects but wouldn't perf verification lead to better low level understanding of the IP/sub-system? Does it even make a difference?

u/sub_micron — 1 day ago
▲ 0 r/computerarchitecture+1 crossposts

Breaking the Binary Bottleneck: Native Base-8 Logic Architecture (NDR-Octabit-Core) with O(1) Performance. Looking for Hardware/Quantum Partners

Hello everyone,

For decades, the computing industry has been locked into the binary paradigm. While silicon scaling is hitting its physical limits, most optimization efforts remain at the software level, leaving the underlying foundational logic untouched. I have developed and officially registered the NDR-Octabit-Core, a computational logic system designed to run on a native Base-8 architecture instead of traditional Base-2.

⚙️ The Core Innovation

The NDR-Octabit-Core bypasses the standard binary tree-structures for data processing. By implementing a native 8-state logical mapping, the system achieves a predictable O(1) time complexity in execution benchmarks, eliminating the latency fluctuations (O(log n)) typical of traditional binary address and allocation mechanisms.

Scientific Timestamp & Registry: The architecture, formal benchmarks, and Core implementation in C++ have been published and indexed via Zenodo with a public Digital Object Identifier (DOI): https://doi.org/10.5281/zenodo.20128879

🚀 The Next Frontier: Scaling into Quantum & Hardware

The mathematical framework of the NDR-Octabit-Core naturally aligns with the next generation of computing:

Hardware (FPGA/ASIC): Moving from software emulation to native multi-level logic gates (similar to advanced MLC/QLC concepts but at a logic-gate level).

Quantum Computing (Qudits): Traditional quantum computing focuses on 2-level qubits. The NDR-Octabit logic is structurally ready to map natively into 8-level Qudits (Octits), potentially offering a more efficient control layer and real-time state tracking without classical binary translation overhead.

💼 What I am looking for:

The foundational logic is proven and benchmarked. I am now looking to transition this project from a validated scientific model into a physical/emulated reality. I am seeking:

Deep Tech Investors / Venture Capital: Interested in pre-seed infrastructure, semiconductor licensing, or paradigm-shifting hardware patents.

Hardware & FPGA Engineers: To collaborate on building a hardware description layer (VHDL/Verilog) for physical prototyping.

Quantum Computing Labs/Researchers: To co-develop the driver layer, mapping the Base-8 NDR logic into multi-level quantum simulators (like Qiskit) or physical qudit platforms.

If you are tired of incremental software patches and want to discuss a foundational architecture shift, let's connect. Contact: jarav2001 [at] gmail.com

u/Wrong_Vacation3262 — 3 days ago

Good material on how CPUs fetch RAM values?

Hi,

Any good reading or videos on how exactly the CPU retrieves data, from the stack or the heap, and why buffer overflows *can* occur?

u/Yha_Boiii — 2 days ago
▲ 10 r/computerarchitecture+1 crossposts

Testing whether machine memory can be built from deterministic primitives instead of only LLM context, vector search, or databases.

I’m building Crystal: a local deterministic memory substrate for machines, inspired by biological memory primitives.

Instead of starting with language generation, I’m starting with memory primitives:

consolidation, temporal association, simplicity selection, bounded curiosity, and embodied feedback.

I’m releasing the work layer by layer so each claim can be tested.

u/Salt_Diamond5703 — 4 days ago

Force a CPU to run userspace code in ring 0 / EL3?

Hi,

I'm writing an app, but the kernel restricts which system calls it can make, and I don't want the stall time of flushing on every change of privilege level. Is there a way to force a userspace app, at the VLSI level, to run without switching? The usual 50-300 cycle penalty per switch is expensive when polling the network and manipulating the data in userspace, rinse and repeat.
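
For scale, here is a rough back-of-envelope sketch of what that penalty adds up to. Only the 50-300 cycle range comes from the post above; the 3 GHz clock and the one million switches per second are made-up numbers for illustration, and the function name is hypothetical.

```python
def switch_overhead_fraction(cycles_per_switch, switches_per_second, clock_hz):
    """Fraction of all CPU cycles spent on privilege-level switches."""
    return cycles_per_switch * switches_per_second / clock_hz


CLOCK_HZ = 3_000_000_000         # assumed 3 GHz core (hypothetical)
SWITCHES_PER_SECOND = 1_000_000  # assumed switch rate (hypothetical)

# Evaluate both ends of the 50-300 cycle range quoted in the post.
for cycles in (50, 300):
    frac = switch_overhead_fraction(cycles, SWITCHES_PER_SECOND, CLOCK_HZ)
    print(f"{cycles} cycles/switch -> {frac:.2%} of CPU time")
```

This counts only the direct switch cost. Indirect effects such as cache and TLB disturbance are not modeled here.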

u/Yha_Boiii — 4 days ago

Is this decomposition-based area modeling approach reasonable for microarchitecture DSE?

I am exploring a lightweight area modeling flow for microarchitecture DSE (design-space exploration). The goal is not signoff-accurate area estimation, but fast and structurally meaningful area prediction across many gem5 / HDL parameter configurations.

The core idea is to avoid using a single black-box model. Instead, I decompose the design into several structure classes and model them separately:

  1. SRAM-like storage structures (e.g., caches, BTBs, large regular arrays)
  2. Register/state-array structures (e.g., register files, rename tables, scoreboards)
  3. Queue/buffer-like structures (e.g., ROB, LSQ, FIFO, write buffers)
  4. CAM / associative selection logic (e.g., wakeup-select, associative lookup, priority/age selection)
  5. Remaining control and arithmetic datapath (modeled as residual area after subtracting the first four categories)

For SRAM-like structures, I plan to use OpenRAM / SRAM compiler results as ground truth. For logic-like structures, I plan to synthesize representative RTL with Yosys and train separate ML models. The final chip area would be the sum of all category predictions.

The motivation is that different microarchitectural structures scale very differently with parameters like ports, entries, width, associativity, and issue width, so a single global predictor may not capture these scaling behaviors well.
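
To make the flow concrete, here is a minimal sketch of the "sum of per-category predictors" structure. Everything in it is a placeholder: the `Config` fields, the function names, and especially the coefficients are invented for illustration, and the real per-category models would be fit to OpenRAM and Yosys results instead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    """Hypothetical microarchitecture parameters for one design point."""
    cache_kib: int
    regfile_entries: int
    rob_entries: int
    iq_entries: int
    issue_width: int


# One placeholder model per structure class. The coefficients and functional
# forms are invented; in the real flow each would be a fitted model.
def sram_area(c):
    return 0.010 * c.cache_kib


def state_array_area(c):
    return 0.0004 * c.regfile_entries * c.issue_width


def queue_area(c):
    return 0.0003 * c.rob_entries


def cam_area(c):
    return 0.0005 * c.iq_entries * c.issue_width


def residual_area(c):
    return 0.05 * c.issue_width


CATEGORY_MODELS = {
    "sram": sram_area,
    "state_array": state_array_area,
    "queue": queue_area,
    "cam": cam_area,
    "residual": residual_area,
}


def predict_area(c):
    """Return a per-category area breakdown plus the summed total."""
    breakdown = {name: model(c) for name, model in CATEGORY_MODELS.items()}
    breakdown["total"] = sum(breakdown.values())
    return breakdown


base = Config(cache_kib=32, regfile_entries=128, rob_entries=192,
              iq_entries=64, issue_width=4)
print(predict_area(base))
```

Keeping the breakdown per category, rather than only the total, makes it possible to see which structure class dominates when a parameter such as issue width is swept.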

My questions are:

  1. Does this decomposition make sense for early-stage microarchitecture DSE?
  2. Are these categories architecturally meaningful from an area-modeling perspective?
  3. Would you classify structures like ROB, issue queue, LSQ, rename table, and physical register file differently?
  4. Is combining SRAM compiler/OpenRAM results with synthesized logic models a reasonable flow?
  5. What are the biggest pitfalls of this approach?
  6. Are there prior works or open-source projects that use a similar methodology?

I am mainly trying to understand whether this “decompose-by-structure-type” modeling strategy is fundamentally sound, even if absolute area accuracy is limited.

u/Low_Car_7590 — 6 days ago

Power modeling

How is power modeling done in industry and/or research? Performance modeling seems easy to understand, since you need to model cycle behavior, but power seems much more difficult to estimate from abstract representations.

u/Visplay — 11 days ago

How big is the execution time penalty for CPU mode switching?

Hi,

If a CPU runs a program in userspace rather than kernel space, how much execution time is lost to context switching and CPU mode changes? There are two factors: the CPU mode bits themselves being flipped (e.g. EL0 to EL3), and then the kernel's own switching work.

Nothing specific, just a wet-finger-in-the-air estimate.

u/Yha_Boiii — 11 days ago
▲ 11 r/computerarchitecture+1 crossposts

CIM as a compute macro

I genuinely think CIM has more promise and a brighter future than the SIMT architecture that dominates the market right now. Yet the CIM narrative has gotten stuck on the same story: eliminate data movement, co-locate compute with memory, show a power efficiency chart. Unfortunately, a lot of these claims do not scale as the required performance rises to enterprise grade. That is not sufficient for a product when two of the three (memory size, bandwidth, or TFLOPs) cap out.
I’ve spent significant time working through what it actually takes to make CIM a first-class compute macro in an enterprise datapath: something that can handle mixed precision and scale with data bandwidth, tensor size, and so on. It sits alongside a CPU or GPU tile, exposes a clean interface to a compiler stack, and meets the reliability bar that production workloads demand.

Here are some problems that are living rent-free in my head and are actually worth debating:
The macro interface is still an open problem. Memory-mapped, tensor-core-like, or something purpose-built for dataflow — each choice has deep implications for how a workload scheduler sees the device and how much you’re asking a compiler team to build from scratch. Unfortunately, most CIM architectures punt on this and call it a software problem.
How do you architect a CPU or GPU that actually harnesses CIM at scale? The interesting question is how you redesign the memory hierarchy, execution units, and dataflow control so CIM becomes a native citizen of the compute fabric rather than an accelerator bolted on the side. What does the ISA surface look like? How does the scheduler reason about CIM availability without destroying pipeline efficiency?
Datacenter-level deployment is a network fabric problem as much as a silicon problem. A CIM macro that wins on a single chip means little if the inference serving architecture can’t distribute workloads across a rack or pod efficiently. How do you design the interconnect and topology so that CIM’s power efficiency advantage isn’t eaten by communication overhead? What does a CIM-native inference cluster actually look like?
These are the conversations I find most scarce — people who’ve thought past the device level into the full system stack.

Particularly interested in hearing from anyone who’s seriously engaged with the architecture above the macro.

u/AdmirableProject1575 — 11 days ago

Is "execution model" a property of each abstraction level independently, or a top-level design principle?

Hi everyone, I'm an Italian student following the awesome Onur Mutlu's Digital Design and Computer Architecture course (ETH/CMU 447, publicly available on YouTube). I'm trying to understand what "execution model" actually means — apologies in advance, English is not my first language and I used AI assistance to help me formulate this question clearly, but the confusion is genuinely mine.

The way it's introduced in the course, "execution model" sounds like a top-level design principle — you choose an execution model (Von Neumann, dataflow) and then derive an ISA and a microarchitecture from it. But in practice, both the ISA and the microarchitecture seem to have their own execution model independently — and they can differ. OOO processors are the obvious example: sequential at the ISA level, dataflow-like at the microarchitecture level.
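
A toy sketch of that OOO point (my own illustration, not from the course): the same four-instruction program is run once in program order and once by firing whatever has its operands ready. The instruction format and function names are hypothetical, and each register is written only once, as it would be after renaming.

```python
import operator

# (dest, op, src1, src2); each register is written once, as after renaming.
PROGRAM = [
    ("r3", operator.mul, "r1", "r2"),
    ("r4", operator.add, "r3", "r1"),  # depends on r3
    ("r5", operator.sub, "r2", "r1"),  # independent of r3 and r4
    ("r6", operator.add, "r4", "r5"),  # depends on r4 and r5
]
INITIAL = {"r1": 3, "r2": 5}


def run_sequential(program, regs):
    """ISA-level view: one instruction at a time, in program order."""
    regs = dict(regs)
    for dest, op, a, b in program:
        regs[dest] = op(regs[a], regs[b])
    return regs


def run_dataflow(program, regs):
    """Dataflow-like view: fire every instruction whose operands are ready."""
    regs = dict(regs)
    pending = list(program)
    waves = []
    while pending:
        ready = [i for i in pending if i[2] in regs and i[3] in regs]
        if not ready:
            raise ValueError("an instruction's operands never become ready")
        # Compute the whole wave before writing back, like one cycle.
        results = [(dest, op(regs[a], regs[b])) for dest, op, a, b in ready]
        regs.update(results)
        waves.append([dest for dest, _ in results])
        pending = [i for i in pending if i not in ready]
    return regs, waves


seq_regs = run_sequential(PROGRAM, INITIAL)
df_regs, waves = run_dataflow(PROGRAM, INITIAL)
print(seq_regs == df_regs)  # same architectural state either way
print(waves)                # order in which destinations were produced
```

The final register state is identical in both runs; only the firing order differs, which is the sense in which the two levels can have different execution models at the same time.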

This makes me wonder: is "execution model" just a per-level descriptor — a way to characterize how instructions fire at each layer of the hierarchy — rather than a single overarching principle?

The reason I'm confused is that Von Neumann is presented as an execution model, but it's much more than a firing mechanism — it also includes stored program and a specific hardware organization. Dataflow, by contrast, is described almost purely as a firing mechanism. So either "execution model" means different things in the two cases, or Von Neumann is being used as a shorthand for something more specific.

Is there a clean definition of "execution model" in the literature, or is it consistently informal?

IMHO the "problem" is that, pedagogically speaking, the von Neumann model is presented as an indivisible package, but since its introduction a lot of abstractions have been added, complicating the picture.

Thanks in advance.

u/LoganHX — 12 days ago

C bounds checking

Hi,

How does bounds checking and the like work at a lower level?

Why is snprintf needed when, say, normal signed ints don't need bounds checking?

Today I got a reality check that the stack is not bounds checked either. How does it actually work, on the heap or the stack?

Any books, videos, or other material specifically on the asm level of it all, and the compiler's side of the story?

u/Yha_Boiii — 13 days ago

Microarchitecture Assessment and Critique

Hey guys,

I’ve recently begun to wrap up development of my most recent CPU core, Anvil-Pro. While this has mostly been an educational endeavor, I’ve ended up with what may be a fairly solid FPGA softcore. Through development, I’ve attempted not just to “implement” but also, within reason, to “perform”. As such, the microarchitecture and the decisions underlying it have been made with the explicit goal of high IPC/LUT.

My rationale was, worst case scenario, I learn strong fundamentals and end up with a resume item. Best case scenario, I may create something that could carve out a legitimate use case (however marginal) within the spectrum of demand.

Since, ultimately, the project is educational, I’ve chosen to make every decision top down from principle rather than from textbook or convention. This comes with the caveat that, ultimately, I will make suboptimal and poor decisions. I’ve implemented or reinvented many standard internal CPU structures, but have combined them in a way that is less commonly done. Importantly, the overall architecture was from what I reasoned to be effective, rather than from a model CPU.

This design philosophy has pros and cons. As to the cons, I am ignoring the previously discovered wisdom of everyone prior to me. I am also putting into practice something untested rather than proven optimal. As to the pros, I am creating something slightly interesting and perhaps less trodden. I also get to learn stronger architectural principles from the additional accountability.

Given all this, I hope to have established that Anvil-Pro is somewhat different from other softcores. While this difference is marginal, it is still worth noting. A question in my mind now remains: “Is this actually any good?”

This is the reason for this post. If anyone is interested and has sufficient time to waste, could you evaluate my microarchitecture and tell me how it compares to convention inside its own performance class? Is my performance good, poor, decent? Are my decisions justified, is my architecture sane? I’ve yet to have someone actually look at this other than myself. To be completely honest, I really don’t know what I would change if I could do it all over again.

If interested, there’s an architecture.md document detailing design philosophy. There are also several diagrams I’ve put together to illustrate my thoughts. You may also look through the Verilog, but I would recommend against it. My coding style is rather convoluted in all honesty, especially given that this was built solo rather than as part of a team.

Please let me know your thoughts and critiques; I’m happy to hear them.

https://github.com/JohnH2448/Anvil-Pro

Note: I have not run timing analysis or FPGA resource usage estimates. I certainly plan to, but at this point I have not yet gotten to it. This is a first functional prototype.

u/No_Experience_2282 — 12 days ago