r/HPC

▲ 2 r/HPC

How to delete slurm output and error files from within the slurm script?

I often have to resubmit the same job many times. Each time, I need to delete the previous run's output files, as shown below. If I put that command in my slurm script, it also deletes the current job's output/error files, which I don't want.

[me]$ rm *.out *.err

[me]$ sbatch slurm.sh 
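One possible approach, as a minimal sketch under assumptions: if the output and error file names embed the job ID (here via the `%x-%j` filename pattern, which is an assumed naming choice and not something stated in the post), the script can compare each file name against `SLURM_JOB_ID` and remove only the files left by earlier runs. The function name `clean_old_logs` and the job name `myjob` are hypothetical.

```shell
#!/bin/bash
#SBATCH --job-name=myjob
#SBATCH --output=%x-%j.out
#SBATCH --error=%x-%j.err

# Remove .out/.err files left by earlier runs, skipping the files that
# belong to the job whose ID is passed as $1 (hypothetical helper).
clean_old_logs() {
    for f in *.out *.err; do
        [ -e "$f" ] || continue                    # glob matched nothing
        case "$f" in
            *-"$1".out|*-"$1".err) ;;              # current job: keep
            *) rm -f -- "$f" ;;                    # older run: delete
        esac
    done
}

# Only clean up inside a Slurm job, where SLURM_JOB_ID is set.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    clean_old_logs "$SLURM_JOB_ID"
fi

# ... the rest of the job goes here ...
```

With this in `slurm.sh`, a plain `sbatch slurm.sh` would clear the old logs at the start of each run. Note that the glob still matches every `.out` and `.err` file in the working directory, so the sketch only suits directories where those extensions are used for Slurm logs alone.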

reddit.com
▲ 35 r/HPC

Newly hired in HPC user support in academia - seeking guidance.

Hi all,

I recently made a lateral career move coming from a physics PhD research background to an HPC user support role in academia. I managed to get interviews with national labs (remote) and two major R1 universities (remote and on-site) and one of them gave me a chance. Unfortunately the job I got is on-site in a place I really don't want to live in, but after a year unemployed I couldn't afford to be picky.

I'm hoping to make the most of my time at this role and learn enough to position myself for a similar or better role that is either remote or in a more favorable location for my family in hopefully a year's time. I will be the only trained scientist in a small group and from what I've gathered, I presumably will be having to wear many hats and learn a lot of new things outside my wheelhouse, while also teaching faculty/students how to best use batch schedulers, parallelize tasks and debug performance issues - which I did a lot of in my research career.

For those of you employed in this area, what are the absolute musts that a physicist like myself should learn to broaden their resume and be more marketable? The school will pay for certifications, which helps, and I will have some ability to conduct my independent research and help with grant-writing (for whatever that's worth now...). I am currently clueless about emerging technologies in HPC. I'm old-school: in my academic career I mostly worked with massively parallel Fortran fluid codes on plain compute nodes with MPI, with very little GPU work, so that's low-hanging fruit. What else?

u/leisuresuitlerdo — 3 days ago
▲ 85 r/HPC+1 crossposts

SoftMig – software GPU slicing for SLURM (no hardware MIG needed, works on any CUDA 12+ GPU)

We built this at the University of Alberta because we had a pile of L40S, A40, and other GPUs that SLURM couldn't meaningfully slice. Hardware MIG only covers a handful of models, requires draining nodes to reconfigure, and locks you into rigid layouts. Result: full 48GB cards going out for jobs that needed 12GB. Classic HPC waste.

SoftMig is a SLURM-native software slicing layer — a fork of HAMi-core adapted for cluster environments. It enforces per-job memory ceilings and compute throttling via LD_PRELOAD, with prolog/epilog hooks handling the job lifecycle. Works on any CUDA 12+ GPU.

A 48GB L40S becomes:

  • 1 full GPU
  • 2 × 24GB half-slices
  • 4 × 12GB quarter-slices
  • ...or whatever layout your site defines

Change layouts through SLURM policy. No node drain, no reboot.

A few things it does that hardware MIG can't:

  • Mix slice sizes on the same GPU (e.g. a half + two quarters on one card)
  • No lost capacity — hardware MIG burns memory to its own infrastructure; SoftMig slices the full pool
  • Compute is sliced too, not just memory — SM access is throttled proportionally per job

Heads up on build/install: The docs are written for Digital Research Alliance of Canada / Compute Canada cluster environments, so if you're deploying elsewhere you may need to adapt things. Claude Code or Cursor work well for navigating the compilation and integration steps if you're not in that ecosystem.

MIT licensed. GitHub: https://github.com/ualberta-rcg/softmig

Happy to answer questions — we've been running v1 in production on Vulcan and v2 is now in testing.

u/VanRahim — 10 days ago
▲ 26 r/HPC

HPC/AI infra: career advice

Hi all

I’m looking for some honest career advice from people working in HPC/AI infrastructure.

Background:

  • ~10 years working with Linux infrastructure, HPC and cloud environments
  • Experience with HPC clusters, schedulers, OpenStack, Kubernetes, Terraform, automation, hybrid cloud, cloudbursting, NVIDIA GPUs (not at scale), etc.
  • Mostly in research/scientific environments
  • Last ~5 years working in consulting, which meant pivoting frequently between projects and technologies depending on customer needs

Because of that, my profile evolved into a mix of:

  • HPC systems
  • cloud/platform engineering
  • Kubernetes/OpenStack infrastructure
  • automation and distributed systems

Rather than being deeply specialized in a single area like GPU, networking or schedulers.

Recently I’ve been trying to move toward AI infrastructure/platform engineering roles at product-focused companies, and over the last few months I have interviewed with companies such as NVIDIA, Mistral AI, and NSCALE.

However, I’ve consistently failed either at the HR stage or in the technical rounds (mostly the latter).

One thing I’m struggling with is understanding whether:

  • my profile is actually relevant for the current AI infrastructure market,
  • or if my background is too “consulting-oriented” (lacking deep knowledge) compared to what these companies expect.

My recent work has been more Kubernetes/OpenStack/platform-oriented rather than pure bare-metal HPC, although the workloads and environments are still performance-sensitive and research-focused.

I’d appreciate honest feedback from people in similar domains:

  • What gaps do you usually see in profiles like mine?
  • What would you study or build next? (Of course, access to GPUs at scale is not always easy to get.)
  • Is HPC still a strong niche in the AI era, or should I reposition more aggressively toward cloud/platform engineering?
  • Is breadth from consulting perceived negatively compared to deeper specialization?

I’m especially interested in advice from people working in:

  • AI infrastructure
  • GPU clusters
  • platform engineering
  • large-scale Kubernetes/HPC environments

Thanks!

u/9d0cd7d2 — 11 days ago
▲ 10 r/HPC

I took a postgraduate applied HPC course in my Physics department. It included running code on my university's system, and I've done parallelisation (OpenMP, MPI) in C and machine learning (PyTorch etc.). How do I market this properly for the job market? So far only 2 job opportunities have shown interest, so I'm guessing I should do a project involving distributed data analysis or something similar?

u/EconomistAdmirable26 — 13 days ago
▲ 0 r/HPC

Hi all,

I am very new to the world of HPC. I just want a resource that will let me run some Jupyter notebooks that I'm using for my research faster. I've requested and gotten access to my university's free system, but when I try to open a Jupyter Notebook server (with just the basic settings) I get the following error message:

sbatch: error: Batch job submission failed: Unexpected message received

I can't find this error on any forums and I'm not sure why I'm getting it. I think the connection might be timing out (it takes about a minute before giving me the error), but I've tried it on a couple of different wifi networks and that hasn't helped. Has anyone else had this issue?

u/Aware_Inflation7136 — 14 days ago
▲ 30 r/HPC

Hi, this was reported to me today:

https://github.com/V4bel/dirtyfrag

Currently, vulnerable systems are advised to blacklist the following kernel modules:

esp4, esp6, and rxrpc (obviously only if it makes sense to do so in your environment)

After unloading the modules, you would also have to drop the page cache.
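As a rough sketch of what those steps could look like on a typical Linux node: the module names come from the post, while the file name `dirtyfrag-mitigation.conf`, the `DRY_RUN` switch, and the helper `make_conf` are hypothetical. This is an illustration under those assumptions, not the advisory's official procedure, so check it against your distribution and the upstream guidance before using it.

```shell
#!/bin/sh
# Sketch only. Module names are from the post; CONF, DRY_RUN and
# make_conf are hypothetical names chosen for this example.
MODULES="esp4 esp6 rxrpc"
CONF="${CONF:-/etc/modprobe.d/dirtyfrag-mitigation.conf}"
DRY_RUN="${DRY_RUN:-1}"

# Build the modprobe.d content: "blacklist" stops alias-based autoloading,
# and "install <module> /bin/false" makes explicit modprobe requests fail.
make_conf() {
    for m in $MODULES; do
        printf 'blacklist %s\ninstall %s /bin/false\n' "$m" "$m"
    done
}

if [ "$DRY_RUN" = 1 ]; then
    make_conf                                  # only show what would be written
else
    make_conf > "$CONF"                        # needs root
    for m in $MODULES; do
        modprobe -r "$m" 2>/dev/null || true   # unload if loaded and not in use
    done
    sync                                       # flush dirty pages first
    echo 1 > /proc/sys/vm/drop_caches          # 1 = drop the page cache only
fi
```

By default the script only prints the configuration it would write. Running it as root with `DRY_RUN=0` applies the changes. Nodes that actually rely on IPsec ESP or RxRPC (for example AFS clients) would lose that functionality, which is the caveat the post already raises.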

u/walee1 — 14 days ago