u/AdmirableProject1575

Founder here, stealth AI hardware startup in the Bay Area. Looking for a CPU/GPU Lead Architect to own the microarchitecture roadmap for low-power datacenter AI inference silicon.
Quick context on us: ex-Google and Broadcom team, working across the full stack — I personally lead scale-up/out interconnect architecture. We’re past the whiteboard stage.
What we need from you:
• 10+ years in CPU or GPU microarchitecture
• Hands-on experience shipping silicon (tape-out, not just RTL)
• Opinions on where inference hardware is broken and how to fix it
Before you DM: this is a founding-team role, not a staff eng position. If you’re exploring options casually or want to pick my brain about the space, this isn’t the right thread. If you’re ready to go deep on a specific architecture problem, I want to talk.
DM with your background and one sentence on what you think the biggest architectural mistake in current AI inference chips is.

I genuinely think CIM (compute-in-memory) has more promise and a longer future than the SIMT architecture that dominates the market right now. Yet the CIM conversation has gotten stuck on a single pitch: eliminate data movement, co-locate compute with memory, show a power efficiency chart. Unfortunately, a lot of these claims do not hold up as the required performance scales to enterprise grade. That is not sufficient for a product when two of the three key resources (memory size, bandwidth, or TFLOPs) cap out.
I’ve spent significant time working through what it actually takes to make CIM a first-class compute macro in an enterprise datapath. That means something that can handle mixed precision and scale with data bandwidth and tensor size, that sits alongside a CPU or GPU tile, exposes a clean interface to a compiler stack, and meets the reliability bar that production workloads demand.
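
To make the "capping out" point concrete, here is a rough roofline-style back-of-envelope check. It is a sketch under stated assumptions, not a model of any real chip: the `Device` fields, the function names, and every number in the example are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass
class Device:
    peak_tflops: float      # peak compute, TFLOP/s (hypothetical)
    mem_bw_tbps: float      # memory bandwidth, TB/s (hypothetical)
    mem_capacity_gb: float  # on-device memory, GB (hypothetical)


def attainable_tflops(dev: Device, flops_per_byte: float) -> float:
    """Roofline bound: min(peak compute, bandwidth * arithmetic intensity)."""
    # TB/s * FLOP/byte = TFLOP/s, so the units line up directly.
    return min(dev.peak_tflops, dev.mem_bw_tbps * flops_per_byte)


def limiting_resources(dev: Device, flops_per_byte: float, model_gb: float) -> list:
    """List which of the three resources cap out for a given workload."""
    limits = []
    if model_gb > dev.mem_capacity_gb:
        limits.append("memory size")
    if dev.mem_bw_tbps * flops_per_byte < dev.peak_tflops:
        limits.append("bandwidth")
    else:
        limits.append("TFLOPs")
    return limits


# Made-up device and workload, purely for illustration.
dev = Device(peak_tflops=100.0, mem_bw_tbps=1.0, mem_capacity_gb=16.0)
print(attainable_tflops(dev, flops_per_byte=2.0))
print(limiting_resources(dev, flops_per_byte=2.0, model_gb=70.0))
```

With these placeholder numbers the device is limited by both memory size and bandwidth, so the headline TFLOPs figure never comes into play. Real analysis would need measured intensity and capacity figures for the actual workload.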

Here are some problems that are living rent-free in my head and are actually worth debating:
The macro interface is still an open problem. Memory-mapped, tensor-core-like, or something purpose-built for dataflow — each choice has deep implications for how a workload scheduler sees the device and how much you’re asking a compiler team to build from scratch. Unfortunately, most CIM architectures punt on this and call it a software problem.
How do you architect a CPU or GPU that actually harnesses CIM at scale? The interesting question is how you redesign the memory hierarchy, execution units, and dataflow control so CIM becomes a native citizen of the compute fabric rather than an accelerator bolted on the side. What does the ISA surface look like? How does the scheduler reason about CIM availability without destroying pipeline efficiency?
Datacenter-level deployment is a network fabric problem as much as a silicon problem. A CIM macro that wins on a single chip means little if the inference serving architecture can’t distribute workloads across a rack or pod efficiently. How do you design the interconnect and topology so that CIM’s power efficiency advantage isn’t eaten by communication overhead? What does a CIM-native inference cluster actually look like?
These are the conversations I find most scarce: the ones with people who’ve thought past the device level into the full system stack.
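
On the communication-overhead question, a toy energy model shows how quickly an on-chip advantage can shrink once data has to cross a fabric. This is only a sketch: the function names are mine, the model assumes both designs move the same number of bytes, and all energy figures are hypothetical placeholders rather than measurements.

```python
def energy_per_token_joules(compute_j: float, bytes_moved: float, pj_per_byte: float) -> float:
    """Total energy = on-chip compute energy + interconnect energy."""
    # Convert picojoules to joules for the communication term.
    return compute_j + bytes_moved * pj_per_byte * 1e-12


def efficiency_gain(baseline_compute_j: float, cim_compute_j: float,
                    bytes_moved: float, pj_per_byte: float) -> float:
    """System-level gain when both designs pay the same communication cost."""
    base = energy_per_token_joules(baseline_compute_j, bytes_moved, pj_per_byte)
    cim = energy_per_token_joules(cim_compute_j, bytes_moved, pj_per_byte)
    return base / cim


# Hypothetical: CIM is 10x better on compute energy alone.
single_chip = efficiency_gain(1.0, 0.1, bytes_moved=0.0, pj_per_byte=100.0)
# Same designs, but 1 GB per token crosses the fabric at 100 pJ/byte.
across_rack = efficiency_gain(1.0, 0.1, bytes_moved=1e9, pj_per_byte=100.0)
print(round(single_chip, 2), round(across_rack, 2))
```

In this made-up case the 10x single-chip advantage drops to roughly 5.5x once the fabric term is included, which is the effect the interconnect and topology design has to fight.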

Particularly interested in hearing from anyone who’s seriously engaged with the architecture above the macro.

Looking for CPU Lead as Cofounder / Chief Architect for AI startup

CIM as a compute macro