Ansible for a large compute cluster
So I have mostly worked on Ansible-based node bring-up for smaller environments (100-200 servers), and I am comfortable with Ansible playbooks, roles, Molecule testing, ansible-lint/rules, CI pipelines, etc.
Now I have been thrown into a problem at a very different scale:
We’re building an on-prem bare-metal CPU compute cluster starting at ~10,000 nodes (mostly AMD EPYC nodes) with plans to scale toward 20k–30k nodes.
Think large HPC-style infrastructure / compute farm setup.
Current thinking is:
- Initial provisioning via Kickstart/PXE/iPXE
- Then handoff to Ansible for configuration and lifecycle management
- Mostly bare metal
- Need fast, repeatable node bring-up and recovery
- Scale matters more than “traditional enterprise Ansible”
I’d really like opinions from people who’ve actually operated infrastructure at this scale.
Some areas I’m trying to think through:
- Would you still use “push-based” Ansible at this scale?
- Would you move toward Ansible Pull?
- Multiple/decentralized control nodes?
- Event-driven orchestration?
- How do you avoid SSH/control-node bottlenecks?
- Golden images vs fully dynamic provisioning?
- How much should happen in Kickstart vs post-provision Ansible?
- How do you handle inventory at this scale?
- Any lessons around idempotency/performance becoming painful?
Is Ansible Automation Platform/Tower worth it at this scale, or does it become more of an operational overhead?
What would you absolutely avoid after learning the hard way?
Would especially love responses from people running:
- HPC clusters
- AI/ML farms
- Large on-prem compute fleets
- Bare-metal Kubernetes worker farms
Interested in architectural patterns, lessons, scaling bottlenecks, recovery strategies, and “things you only learn after production pain.”
We are also exploring tools like Tinkerbell, Canonical MAAS, etc. instead of Kickstart, and would love opinions on those as well.