u/Infamous-Tea-4169

DevOps engineer wanting to move deeper into HPC/systems infrastructure — what path makes sense?

Hi all,

This might be a bit of a vague question, but I’m hoping to get some advice from people who have been in the HPC / systems / infrastructure space for a while.

I was recently looking at courses and certifications to upskill and add something useful to my resume. For a long time, I thought I wanted to do something like AWS Solutions Architect, but then I stopped and thought about it properly and realised that is not really what I want to become.

What I actually want is to become the kind of engineer who can understand and own infrastructure end to end — especially HPC / systems engineering type environments.

In my previous role, I worked at a biomedical cancer research organisation and got the chance to work with some really smart people. We had our own data centre, HPC cluster, storage systems, VMs, Kubernetes, hybrid cloud capability, pipelines, monitoring, etc. So I got exposure to a lot of things from hardware all the way up to software and workflows.

The thing is, I only have around 4 years of experience and I’m 27, so I know I’m still early in my career. But working there made me realise what kind of engineer I want to become.

There was an infrastructure engineer there who was honestly brilliant. He could look after the hardware side of the HPC, understand storage and networking, maintain pipelines, deal with Kubernetes, think about hybrid cloud, and generally just connect all the dots. I know you can’t master everything, but he had that broad systems-level understanding that I really want to build towards.

I’ve now moved into a government role as an HPC Specialist, but a lot of the actual implementation work is being contracted out. So at the moment it feels more like systems management / coordination than deep hands-on engineering. I don’t necessarily want to leave this role, but I also don’t want to drift away from the technical path.

To be honest, the technical side can feel pretty daunting sometimes, especially when it comes to ownership and accountability. I do want to step up, but I also get scared because I don’t yet feel confident enough to fully own an HPC solution from end to end.

I guess what I’m trying to understand is: how do I start building towards that level?

How do I get better at understanding things like:

  • what servers fit what use cases
  • racks, cabling, power, cooling, and data centre basics
  • storage and network design for HPC
  • when HCI makes sense vs composable infrastructure vs normal rack-and-stack
  • how to think about redundancy, performance, scalability, and supportability
  • how all of this connects back to Linux, Slurm, Kubernetes, automation, pipelines, and users

I’ve worked with some of these things, but I still struggle with the bigger picture sometimes — like why one design is good, another is risky, and another is just overkill. I assume a lot of that comes with experience, but I’m trying to be more deliberate about how I build that experience.

For context, my background is DevOps / systems engineering across hybrid on-prem HPC environments, Kubernetes, VMs, Linux, software-defined storage, and research infrastructure.

For those of you who are more experienced in this space, what would you recommend?

Are there any courses, books, certifications, home lab projects, vendor training, or specific areas I should focus on?

And how do you actually build the judgement to eventually become “that person” who can own and guide infrastructure properly?

reddit.com
u/Infamous-Tea-4169 — 15 days ago
r/HPC
