u/ImGallo — reddlx

Hey everyone,

I'm working on projecting cancer incidence and prevalence for a health insurance company. The overall approach is PIAMOD-like: I need to estimate future cancer cases by combining incidence rates with population at risk and survival data.
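
To make the incidence side concrete, here is a minimal sketch of the basic arithmetic (expected new cases = incidence rate × population at risk, summed over age groups). The age bands, rates, and enrollee counts are made-up placeholders, not my real data, and `expected_cases` is just an illustrative helper; the full PIAMOD-style approach also brings in survival to get prevalence, which this does not attempt.

```python
# Hypothetical age-specific incidence rates (cases per 100,000 person-years)
# and projected enrollees. All numbers are placeholders, not real data.
incidence_per_100k = {"0-39": 50.0, "40-64": 400.0, "65+": 1500.0}
projected_enrollees = {"0-39": 60_000, "40-64": 30_000, "65+": 10_000}


def expected_cases(rates_per_100k, population):
    """Expected new cases per age group: rate x population at risk."""
    return {
        age: rates_per_100k[age] * population[age] / 100_000
        for age in rates_per_100k
    }


cases_by_age = expected_cases(incidence_per_100k, projected_enrollees)
total_cases = sum(cases_by_age.values())
print(cases_by_age, total_cases)
```

This is why the enrollment projection matters so much: any error in the projected population at risk passes straight through to the projected case counts.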

The thing is, before I can even get to the cancer modeling part, I need to project how many people will be enrolled in my institution over the next 10 years. That's my actual bottleneck right now.

I have annual enrollment data from 2016 to 2024, so 9 data points total. The plan is to use national population projections for my country as a reference and then model how my institutional population relates to that. But here's where it gets tricky:

- 2020-2021 show a noticeable COVID dip (fewer enrollees, disrupted trends).
- 2023-2024 show a slight flattening in the growth trend compared to previous years. It's not huge, but it's there.
- If I hold out those last 2 years for validation, I'm left training on 7 points (or 5 if I also exclude the COVID years), which feels like too few.
- I need to project all the way to 2034, a 10-year horizon that is longer than the 9-year series itself.
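
For reference, this is roughly the kind of setup I have in mind, as a sketch rather than a settled method: model the share of the national population enrolled with me on the logit scale, with a linear trend plus a 0/1 indicator for the COVID years, then multiply the projected share by the national projection. All numbers below are invented placeholders, and the linear-logit form and the dummy variable are assumptions I'm making for illustration. It also does nothing about the 2023-2024 flattening.

```python
import numpy as np

years = np.arange(2016, 2025)

# Placeholder data only (NOT my real enrollment or national figures).
enrolled = np.array(
    [100_000, 104_000, 108_500, 113_000, 109_000,
     110_500, 119_000, 121_000, 122_500], dtype=float)
national = np.array(
    [50.0e6, 50.5e6, 51.0e6, 51.5e6, 52.0e6,
     52.5e6, 53.0e6, 53.5e6, 54.0e6])

# Placeholder stand-in for the official national projection, 2025-2034.
future_years = np.arange(2025, 2035)
national_future = national[-1] * 1.008 ** np.arange(1, 11)

# Share of the national population enrolled, on the logit scale so that
# the projected share stays between 0 and 1.
share = enrolled / national
logit_share = np.log(share / (1.0 - share))

# Design matrix: intercept, linear trend, COVID indicator for 2020-2021.
t = (years - years[0]).astype(float)
covid = np.isin(years, [2020, 2021]).astype(float)
X = np.column_stack([np.ones_like(t), t, covid])
beta, *_ = np.linalg.lstsq(X, logit_share, rcond=None)

# Project with the COVID indicator switched off.
t_future = (future_years - years[0]).astype(float)
X_future = np.column_stack(
    [np.ones_like(t_future), t_future, np.zeros_like(t_future)])
share_future = 1.0 / (1.0 + np.exp(-(X_future @ beta)))
enrolled_future = share_future * national_future

for year, value in zip(future_years, enrolled_future):
    print(year, round(value))
```

With only 9 points and 3 parameters this is obviously fragile, and extrapolating a straight line on the logit scale for 10 years is a strong assumption, which is part of why I'm asking.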

So my question is basically: how would you approach modeling a short time series like this, where you can't really afford a proper train/test split, there's a known disruption in the middle, and the projection horizon is longer than the observed data?
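
One alternative to a single holdout that I've been looking at is rolling-origin (expanding-window) evaluation: refit on the first k points, forecast point k+1, and repeat, so every late observation gets used for both testing and training. Below is a rough sketch comparing two deliberately simple candidates. The series is placeholder data, and `forecast_linear`, `forecast_drift`, and `rolling_origin_mae` are illustrative helpers I made up, not functions from any library.

```python
import numpy as np

# Placeholder annual series, 2016-2024 (not my real data).
y = np.array(
    [100_000, 104_000, 108_500, 113_000, 109_000,
     110_500, 119_000, 121_000, 122_500], dtype=float)


def forecast_linear(history, steps=1):
    """Fit a straight line to the history and extrapolate `steps` ahead."""
    t = np.arange(len(history))
    slope, intercept = np.polyfit(t, history, 1)
    return intercept + slope * (len(history) - 1 + steps)


def forecast_drift(history, steps=1):
    """Last value plus the average historical change per step."""
    drift = (history[-1] - history[0]) / (len(history) - 1)
    return history[-1] + steps * drift


def rolling_origin_mae(series, forecaster, min_train=4):
    """Mean absolute one-step-ahead error over expanding training windows."""
    errors = []
    for end in range(min_train, len(series)):
        prediction = forecaster(series[:end])
        errors.append(abs(series[end] - prediction))
    return float(np.mean(errors))


mae_linear = rolling_origin_mae(y, forecast_linear)
mae_drift = rolling_origin_mae(y, forecast_drift)
print("linear trend MAE:", round(mae_linear))
print("drift MAE:", round(mae_drift))
```

With 9 points and `min_train=4` this gives only 5 one-step-ahead errors, so it says little about 10-year-ahead accuracy, but it at least avoids throwing away the last 2 years entirely.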

Any suggestions on model selection strategies, how to handle the COVID effect, or just general advice on whether this is even reasonable would be really appreciated. Thanks!