u/DrSpacemnn

Hi all,

I have 3 questions: (i) which method is most suitable, (ii) how to report the optimism, discrimination and validation of the approach, and (iii) how to report odds ratios (OR), confidence intervals (CI) and p-values that meaningfully reflect my selection strategy. I am working in R. I am OK with this being an exploratory / early look that needs further validation.

I'm working on a prediction project. My original plan was to use a penalised regression method, ideally LASSO, so as to end up with a small set of variables to report as the most "unambiguously" predictive of outcome x. However, I've now received the data and there are very few events (9 out of n = 90) and 65 variables of interest.
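To put numbers on that, here is the quick arithmetic in R. The events-per-variable (EPV) framing and the EPV >= 10 threshold are my own assumptions here (a commonly quoted heuristic, not a hard rule), and the variable names are just placeholders.

```r
# Events-per-variable (EPV) for the numbers quoted above
n_events <- 9
n_total  <- 90
n_vars   <- 65

event_rate <- n_events / n_total  # proportion of rows with the outcome
epv        <- n_events / n_vars   # events available per candidate variable

# Number of candidate predictors that an EPV >= 10 heuristic would allow
# with this many events (heuristic only, assumed for illustration)
max_vars_at_epv10 <- floor(n_events / 10)

print(c(event_rate = event_rate, epv = epv, max_vars = max_vars_at_epv10))
```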

I appreciate that (i) with so few events there is a risk of any real signal being lost to noise, and (ii) there is a significant risk of collinearity among the variables, which compounds the problem.

(i) Is LASSO (or an alternative penalised regression) still usable with these numbers? 9 events seems very small and 65 variables is a lot. I am working with the team to reduce the number of variables in a sensible fashion.

If it changes anything, the DV can technically be coded either as binary (1/0) for the outcome of interest, or as a multi-category variable 0, 1, 2, 3 (0 = undiagnosed, 1 = diagnosis x, 2 = diagnosis y, etc.). The other groups have more events, but no more than 20 each.

(ii) If a penalised regression method is still viable, would bootstrapping to assess the stability of the selected variables (treating variables selected in >90% of resamples as stable), coupled with n/2 subsampling for internal validation of the final model (>50% stable), be appropriate (or even doable, given the small number of events)?
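For concreteness, here is a rough sketch of the stability check I have in mind for (ii), assuming the glmnet package. The data are simulated placeholders with the same dimensions as mine. `make_foldid`, `select_once`, B = 100, the 3-fold CV and the choice of `lambda.1se` are all illustrative assumptions rather than settled decisions.

```r
library(glmnet)

set.seed(1)

# Simulated stand-in for the real data: 90 rows, 65 predictors, 9 events
n <- 90; p <- 65
x <- matrix(rnorm(n * p), n, p,
            dimnames = list(NULL, paste0("v", seq_len(p))))
y <- rep(0, n); y[sample(n, 9)] <- 1

# Assign CV folds within each outcome class so every fold contains events
make_foldid <- function(y, k) {
  foldid <- integer(length(y))
  for (cls in unique(y)) {
    i <- which(y == cls)
    foldid[i] <- sample(rep_len(seq_len(k), length(i)))
  }
  foldid
}

# Fit a cross-validated LASSO on one resample and return the indices of
# variables with non-zero coefficients (NULL if the resample is unusable)
select_once <- function(idx) {
  if (sum(y[idx]) < 3) return(NULL)  # too few events for 3-fold CV
  tryCatch({
    fit <- suppressWarnings(
      cv.glmnet(x[idx, ], y[idx], family = "binomial", alpha = 1,
                foldid = make_foldid(y[idx], 3))
    )
    b <- as.matrix(coef(fit, s = "lambda.1se"))[-1, 1]  # drop intercept
    which(b != 0)
  }, error = function(e) NULL)  # skip resamples where the fit fails
}

B <- 100
counts <- numeric(p)
used <- 0
for (b in seq_len(B)) {
  idx <- sample(n, replace = TRUE)  # bootstrap resample
  # For the n/2 subsampling variant, use instead: idx <- sample(n, n / 2)
  sel <- select_once(idx)
  if (is.null(sel)) next
  counts[sel] <- counts[sel] + 1
  used <- used + 1
}

# Proportion of usable resamples in which each variable was selected
freq <- setNames(counts / used, colnames(x))
stable <- names(freq)[freq > 0.9]
print(head(sort(freq, decreasing = TRUE)))
```

With only 9 events, a fair number of n/2 subsamples would contain fewer than 3 events and be skipped by this sketch, which is part of why I am unsure the approach is doable.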

(iii) Finally, would a package like hdi be suitable for obtaining OR, CI and p-values that account for the original selection method / number of variables?
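Roughly what I am picturing for (iii), as a sketch only: the de-sparsified LASSO via `lasso.proj` from the hdi package, on simulated placeholder data with the same dimensions as mine. I am assuming that exponentiating the projected estimates and their interval bounds is a reasonable way to get ORs, and I have not checked how well this behaves with only 9 events.

```r
library(hdi)

set.seed(1)

# Simulated stand-in for the real data: 90 rows, 65 predictors, 9 events
n <- 90; p <- 65
x <- matrix(rnorm(n * p), n, p,
            dimnames = list(NULL, paste0("v", seq_len(p))))
y <- rep(0, n); y[sample(n, 9)] <- 1

# De-sparsified (projected) LASSO for a binary outcome
fit <- lasso.proj(x, y, family = "binomial")

# 95% confidence intervals on the log-odds scale
ci <- confint(fit, level = 0.95)

# Exponentiate to the odds-ratio scale (assumed presentation choice)
res <- data.frame(
  variable = colnames(x),
  OR       = exp(fit$bhat),
  lower    = exp(ci[, 1]),
  upper    = exp(ci[, 2]),
  p_raw    = fit$pval,
  p_adj    = fit$pval.corr  # multiplicity-corrected (Holm by default)
)
print(head(res[order(res$p_raw), ]))
```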

Many thanks!

[Q][E][D] Penalised regression vs other

Penalised regression vs alt for rare events in a small dataset