r/statistics

[discussion] golf and dispersion

hello,

I am an avid golfer who wants to learn my dispersion pattern with each of my clubs.

I have access to a launch monitor (LM) at my club, a TrackMan, which is the top-of-the-line LM on the market.

My question is how many shots would create a statistically relevant sample size to get a good idea of my average dispersion?
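One rough way to frame this, assuming the shot offsets for a single club are roughly normally distributed (an assumption, not something the launch monitor guarantees): the standard error of a sample standard deviation is approximately σ/√(2(n−1)), so the relative precision of the dispersion estimate depends only on the number of shots. The function names below are illustrative.

```python
import math


def relative_se_of_sd(n_shots: int) -> float:
    """Approximate relative standard error of a sample SD for normal data."""
    return 1.0 / math.sqrt(2 * (n_shots - 1))


def shots_needed(target_relative_se: float) -> int:
    """Smallest shot count whose approximate relative SE meets the target."""
    n = 2
    while relative_se_of_sd(n) > target_relative_se:
        n += 1
    return n


# Shots per club needed to pin down the dispersion SD to within
# roughly 20%, 10% and 5% (one standard error).
for target in (0.20, 0.10, 0.05):
    print(target, shots_needed(target))
```

Under that normality assumption this works out to about 14, 51 and 201 shots per club, so halving the uncertainty costs roughly four times as many shots.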

reddit.com
u/Buckeye_47 — 1 day ago

[R][Q] Non-parametric finite-sample credible intervals with one-dimensional priors: a middle ground between Bayesian and frequentist intervals

See the manuscript here: https://arxiv.org/abs/2601.17621

Hi all! After obtaining my PhD I have been out of academia for a few years, but I have kept an interest in the fundamentals of statistical methodology and have recently written up this manuscript.

In it I propose a middle ground between frequentist and Bayesian statistics. I believe that, besides being of theoretical interest, this may provide useful practical statistical tools down the line.

Naturally I would like to spread my work, and receive feedback on whether others agree with its usefulness and novelty.

I had submitted the manuscript to Communications in Statistics – Theory and Methods. The editor handling my paper has sent out 11 reviewer invitations, but none agreed to review - presumably because the subject doesn't quite fit in common fields of work.

I was requested to suggest 4-6 potential reviewers myself instead. However, since I am no longer in academia, I find it difficult to think of potential reviewers. Do you guys have ideas on who I could suggest, or an approach to finding potential reviewers?

Any ideas on steps forward and general feedback on the paper and the ideas in it are very welcome as well!

Kind regards,

Tim

u/freemath — 1 day ago

[QUESTION] Multiple regression assumptions not met - implications and solutions?

Hi everyone,

I’m currently working on my undergraduate thesis using multiple linear regression. After running the assumption tests, I found that 3 assumptions are not met:

1. Linearity

2. Normality

3. Homoscedasticity

Now I’m a bit confused about how serious this is for the validity of my analysis, and what it implies for my research and its overall quality.

Also, I’m still unsure:

1. How severe are these violations in practice for multiple regression?

2. Can the regression results still be interpreted if several assumptions fail simultaneously?

3. What are the best solutions or alternatives usually recommended in academic research?

Some of the possible solutions I’ve read about haven’t been taught in my courses and seem really complex.

Has anyone dealt with a similar situation in their thesis/research? What did your supervisor or examiner usually recommend?

Thanks a lot!


[C] Am I crazy for wanting to quit my teaching job to do a full time Stats masters?

I just really can’t do both, at least not well. We can live off my husband’s income, although it will be tight. I just know teaching isn’t the right long-term fit for me, and I want to make the transition before we start a family.

u/Ikosnyg55 — 2 days ago

should i take statistics at uni (undergrad)? [Q]

i like doing geometry, especially linear algebra, and also polar geometry, calculus, differential eqs, dynamical systems. im interested in learning more geometry, such as differential geometry, topology in uni. ive always been interested in observing abstract maths manifesting in real life.

i am accepted in doing theoretical physics in a few months, but i fear physics will be too observation first, and labs pmo because of its imperfection and the limits of measurement, hence why im considering switching course.

i also dont like logic for the sake of logic, especially number theory, whenever a question asks abt fibonacci numbers, prime numbers, roots of polynomials it pmo too, similarly w newtonian mechanics, its just the way they phrase hinges and levers and rods and how discontinuous the math in newtonian mechanics is, and how much idc abt applications of physics and engineering.

do yall think stats would fit my interests in geometry more? (pls ask me wtv tho i feel like ive not mentioned a lot of background but i wanna keep the post concise)

also idk how to tag, im guessing i write the tag in title?

u/Ceramidee — 2 days ago

[Education] Good US PhD programs for geospatial data analysis and public policy applications?

Hi everyone,

I'm about halfway done with a Master's in Applied Mathematics (where most of my coursework involves statistics and numerical methods), and I'm interested in applying to PhD programs once I finish. The two research topics I'm currently interested in are geospatial data analysis, since a lot of my work experience has involved geospatial data and GIS, and public policy, particularly quantitative policy analysis, survey design, and causal inference. Which universities in the United States have faculty who work on these topics or connections to government agencies and policy institutes? I have considered a PhD in public policy, but I think that pursuing one may make my math & CS background go to waste.

u/kyaputenorima — 2 days ago

[Question] Estimate 1-year survival based on 4-year survival assuming equal survival across time

I have a survival estimate over a 4-year time period, 35% survival. However, what I need is an estimate of survival over 1-year. I do not have any more specific data, just total population at time 0 (162) and total population at time 4 (57). If I assume that survival is equal across years, is it possible to estimate survival for 1 year? I am working in R, if that's needed.
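As a minimal sketch in Python (the same arithmetic is a one-liner in R): if survival is assumed constant across years, the 4-year survival is the annual survival raised to the fourth power, so the annual value is the fourth root. The variable names are illustrative, and the constant-rate assumption is the one stated in the post.

```python
n_start = 162   # population at time 0
n_end = 57      # population at time 4
years = 4

total_survival = n_end / n_start                  # about 0.35 over 4 years
annual_survival = total_survival ** (1 / years)   # constant yearly survival

print(round(total_survival, 3), round(annual_survival, 3))
```

This gives an annual survival of about 0.77. The uncertainty in that figure comes entirely from the 57-out-of-162 count, so a confidence interval for the 4-year proportion can be transformed the same way by taking the fourth root of each endpoint.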

u/v838monoceros — 2 days ago

Career Question: Degree or Certificate in Data Science or Statistics [C] [Q]

I am a counseling professor (so I have an MA and a PhD in Counseling) and have obviously taken a few stats classes along the way, but by no means have I "mastered" the content. I want to get sharper on these things and, being a professor, it would make the most sense to add another degree to the mix to "better qualify" to teach our own research stats class (it would just look better for accreditation purposes). I have been looking through options and there seem to be some strong (and justified) opinions about the different options.

I am primarily interested in improving skills in research design and quantitative/qualitative analysis of data so that I can better supervise Counseling PhD dissertations and be able to point out poorly interpreted data. We are generally seeing a lot of multiple regression analysis oriented dissertations but I want to be able to push students deeper into the data and even design more complex research projects to explore implications of different variables, interventions, and comorbidities in mental illnesses.

This being the case, what would you great-minds-of-reddit suggest? Should I look more towards MS in Statistics or Data Science? Is a certification a better place to just jump in? There is a strong chance my school pays for the training/degree as well if I ask nicely enough.

I am not really interested in the machine learning or AI modeling stuff and I see that this is advertised in a lot of programs but I understand if it is unavoidable at this point.

u/Cheezit_504 — 2 days ago

What to do while in Masters? [Career]

Hey peeps!

I'm in an Applied Stats Masters at the moment (part time) while also working full time in a Supply Chain role. After I graduate, I want to work somewhere in healthcare, public policy, or economics as a statistician or data scientist in the US. I'm loving the program, but I'm wondering whether I should take the opportunity to get a bridge role like data analyst in the meantime, and whether that would make the post-master's job search any easier.

I'm good with R, SAS, Excel, and PowerBI already but don't really use them much in my current role which is Supply Chain mixed in with a bit of analytics.

Anyone been in a similar place? Thanks!

u/thelongestbird — 3 days ago

How much does ranking of PhD Statistics programs matter for academia? [E]

Does going to a top 20 vs top 50 vs top 75 program make a tangible difference to your likelihood of publishing better papers and, ultimately, getting a tenure-track faculty position?

I know in Economics they are super elitist about this kind of thing, but I heard it's less so in more math-y fields.

u/GayTwink-69 — 4 days ago

Dice testing [Discussion]

Hello,

I bought a relatively expensive D20 because it's engraved in crystal. I rolled it 907 times and recorded the results.

For numbers 1 to 20: (44, 39, 56, 36, 50, 54, 36, 41, 42, 34, 46, 57, 37, 49, 61, 57, 35, 43, 53, 37).

Applying a chi-square test, the statistic gives me 31.46. This means I reject the hypothesis that the die is balanced at the 5% significance level (the critical value is 30.14).
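For anyone who wants to reproduce the number, here is a short sketch of the goodness-of-fit calculation using only the counts above. The critical value is the one quoted in the post for 19 degrees of freedom, not something the code derives.

```python
# Observed counts for faces 1 to 20, copied from the post.
counts = [44, 39, 56, 36, 50, 54, 36, 41, 42, 34,
          46, 57, 37, 49, 61, 57, 35, 43, 53, 37]

n_rolls = sum(counts)
expected = n_rolls / len(counts)   # equal expected count per face if fair

# Pearson chi-square statistic: sum of (observed - expected)^2 / expected.
chi_square = sum((c - expected) ** 2 / expected for c in counts)

critical_value = 30.14   # 5% level, 19 degrees of freedom (as quoted above)
print(n_rolls, round(chi_square, 2), chi_square > critical_value)
```

The statistic comes out at about 31.46, only slightly above the critical value, so the rejection is marginal.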

Not happy with this result, I start to think: if the die is unbalanced, we should observe an imbalance in opposite face pairs. In other words, we theorize that we uniformly and randomly draw a random variable from ten Bernoulli variables with probabilities pᵢ. If the die is unbalanced, then one of the pᵢ is large.

It seems that under this analysis, the die appears a bit more fair.
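One way to write down the opposite-face idea, as a sketch rather than the exact calculation used above: assume opposite faces sum to 21 (true for most standard D20s, but worth checking on the actual die), and for each of the ten pairs compare the two counts against a 50/50 split. Conditional on the pair totals, each term is approximately chi-square with 1 degree of freedom under fairness, so the sum is compared to a chi-square with 10 degrees of freedom.

```python
# Observed counts for faces 1 to 20, copied from the post.
counts = [44, 39, 56, 36, 50, 54, 36, 41, 42, 34,
          46, 57, 37, 49, 61, 57, 35, 43, 53, 37]

# Assumption: face k sits opposite face 21 - k.
pairs = [(face, 21 - face) for face in range(1, 11)]

pair_statistic = 0.0
for face, opposite in pairs:
    a = counts[face - 1]
    b = counts[opposite - 1]
    # Test of p_i = 0.5 within the pair: (a - b)^2 / (a + b).
    pair_statistic += (a - b) ** 2 / (a + b)

critical_value_10df = 18.31   # 5% level, 10 degrees of freedom (table value)
print(round(pair_statistic, 2), pair_statistic > critical_value_10df)
```

With these counts the within-pair statistic is about 11.61, below the 5% critical value. Note that this only looks at imbalance inside each pair and ignores whether some pairs come up more often than others.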

Now more generally, we can also assume the die is unbalanced in a physical sense: that one side of the die is heavier than the other. This justifies the following reasoning:

Let's create a mapping that sends each face to itself and adds the adjacent rolls. This should highlight more subtle and discrete imbalances in the die.

What do you think of this type of reasoning?

u/n_petit_1 — 4 days ago

Why can't I discard natural outliers? [Q]

Say that I have a height dataset given some other variables and there's a guy who's like 8 feet tall. Sure there are people who are 8 feet tall, why should I worsen my prediction of everyone else if the prediction of the guy will be off by a lot as well? It's literally lose lose and if my error metric is quadratic it's gonna skew significantly with outliers.
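To make the "quadratic error skews with outliers" point concrete, here is a tiny sketch with made-up heights in inches (the numbers are purely illustrative, with 96 in standing in for the 8-foot person). The mean is the value that minimizes squared error and the median is the value that minimizes absolute error, so comparing them shows how much each loss is pulled by the one extreme point.

```python
import statistics

# Hypothetical heights in inches; not real data.
heights = [64, 66, 67, 68, 69, 70, 71, 72, 73]
heights_with_outlier = heights + [96]   # add one 8-foot person

# Best constant prediction under squared error (mean) and absolute error (median).
mean_shift = statistics.fmean(heights_with_outlier) - statistics.fmean(heights)
median_shift = statistics.median(heights_with_outlier) - statistics.median(heights)

print(round(mean_shift, 2), round(median_shift, 2))
```

In this toy example the squared-error fit moves by almost 3 inches while the absolute-error fit moves by half an inch, which is why a robust loss is often suggested as an alternative to deleting a genuine observation.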

u/I-AM-LEAVING-2024 — 4 days ago

[E] Best practices for teaching intro statistics

I’m taking over a medium sized (~50 students) statistics class. The current practice is to offer weekly homeworks via an online platform that also gives students access to the ebook. In addition there are weekly Excel exercises.

I am torn. Part of me wants to keep the current structure. Knowing how students do homework, not having to grade it is a huge advantage. On the other hand, I want my students to get as much out of this course as possible, which means (given all I know about teaching) requiring paper textbooks and assigning homework on paper.

What would you do? What would you recommend? Thanks!

u/il__dottore — 4 days ago

[Education] Resources for self study?

Hi y'all! A bit wild to say I want to learn statistics "for fun", but I have never had the opportunity to study it, and it's good to have statistical literacy regardless. I unfortunately do not have the time nor money to apply for a course in uni or a college, but I want to try my hand at studying alone. Obviously I'm not going for any data science job.

Can you recommend any resources that also include practice? Preferably free, but reasonably affordable would also be great.

Thanks a lot!

u/joeisajellyfish — 4 days ago

How to generate a set of random covariance matrices with specific covariances? [Q]

For a Monte Carlo study I'm trying to generate a series of covariance matrices that have a specific range of covariances. I'm sampling the individual covariances and marginals from a set of theoretically likely covariances, but I'm running into the problem that the combination of those does not result in a positive (semi-)definite covariance matrix. When that happens, the R script I've set up goes back, draws a new set of covariances and constructs a new covariance matrix, but even after 10000 attempts it does not seem to find a proper covariance matrix. This tells me I must be doing something wrong. I read that I might need to do a Cholesky decomposition, which would require me to rewrite and restructure my script. What's the best way to move forward?

Edit: I see now that a Cholesky decomposition itself requires a positive definite matrix.
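One common workaround, sketched here in plain Python rather than R, is to build the matrix so that it is positive definite by construction: draw a random lower-triangular factor L with a positive diagonal and form Σ = L Lᵀ. The function names, the entry ranges and the seed are all illustrative assumptions, and this does not by itself force each covariance into a chosen range; the draws can be filtered afterwards or the factor's entries scaled.

```python
import math
import random


def random_covariance(dim, rng, scale=1.0):
    """Return L times L-transpose for a random lower-triangular L."""
    factor = [[0.0] * dim for _ in range(dim)]
    for i in range(dim):
        for j in range(i):
            factor[i][j] = rng.uniform(-scale, scale)
        factor[i][i] = rng.uniform(0.1, scale)   # strictly positive diagonal
    return [[sum(factor[i][k] * factor[j][k] for k in range(dim))
             for j in range(dim)] for i in range(dim)]


def cholesky(matrix):
    """Plain Cholesky factorisation; raises ValueError if not positive definite."""
    n = len(matrix)
    chol = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(chol[i][k] * chol[j][k] for k in range(j))
            if i == j:
                pivot = matrix[i][i] - partial
                if pivot <= 0:
                    raise ValueError("matrix is not positive definite")
                chol[i][j] = math.sqrt(pivot)
            else:
                chol[i][j] = (matrix[i][j] - partial) / chol[j][j]
    return chol


rng = random.Random(42)
sigma = random_covariance(4, rng)
cholesky(sigma)   # succeeds because sigma was built as L times L-transpose
for row in sigma:
    print([round(value, 3) for value in row])
```

The reason independent draws of each covariance almost never work is that positive definiteness constrains all entries jointly, and the valid region shrinks quickly as the dimension grows. Sampling the factor (or sampling a correlation matrix from something like an LKJ distribution and then scaling by the marginal standard deviations) keeps every draw valid.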

u/DeliberateDendrite — 4 days ago

Is a bonferroni-adjusted p-value (or some other adjusted version) needed anytime you do more than 1 hypothesis test? [Q]

To be theoretically valid, basically. Cause the size of the test increases with more than 1 hypothesis test if I understood correctly.
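A quick numeric sketch of the inflation being described, under the simplifying assumptions that the tests are independent and every null hypothesis is true (the function names are illustrative):

```python
def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Chance of at least one false rejection across independent tests."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_alpha(alpha: float, n_tests: int) -> float:
    """Per-test level that keeps the family-wise rate at or below alpha."""
    return alpha / n_tests


for m in (1, 2, 5, 10):
    unadjusted = familywise_error_rate(0.05, m)
    adjusted = familywise_error_rate(bonferroni_alpha(0.05, m), m)
    print(m, round(unadjusted, 3), round(adjusted, 3))
```

With 10 unadjusted tests at 0.05 the family-wise rate is about 0.40, while the Bonferroni-adjusted version stays just under 0.05. Whether an adjustment is needed in a given study depends on whether the tests are treated as one family.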

u/GayTwink-69 — 5 days ago

[Research] Construct validity of MLB's breaking ball taxonomy: is the curveball/slider/sweeper distinction statistically justified?

Applied a three-stage construct validity framework to evaluate whether MLB's breaking ball taxonomy (curveball, slider, sweeper) reflects discrete pitch types or a continuous movement spectrum

Full writeup: https://rpubs.com/dsmi313/breakingball

**Background:** Statcast assigns discrete pitch-type labels via a proprietary classifier that uses movement variables as inputs. This creates a circularity problem — any analysis regressing movement features against labels is partly circular. The goal here is to characterize the geometry of the movement space underlying the taxonomy rather than independently validate the labels.

**Data:** ~800k pitches (2020–2025), five year-residualised features: horizontal break (handedness-adjusted), vertical break, velocity, spin rate, and spin axis (handedness-adjusted).

**Stage 1 — PCA** (vegan::rda): PC1 explains 50.8% of variance and captures a horizontal/vertical break gradient. The three label distributions show substantial core overlap rather than clean separation.

**Formal continuum test — LDA** (MASS::lda, LOO-CV): Used as a formal test of whether the five movement features recover the three-category taxonomy. Poor accuracy and systematic SL↔ST confusion support the continuum interpretation.

**Stage 2 — GMM** (mclust, BIC model selection on subsample, full-data fit at G=6): BIC elbow at G=5–6, not G=3. ARI = 0.27 against Statcast labels. Sliders fragment across three components; curveballs partially recovered; sweepers contaminated with sliders.

**Stage 3 — Bayesian hierarchical logistic models** (JAGS, pitcher random intercepts, ST reference, stratified sample ~50k pitches): Two outcomes — whiff rate and chase rate. After adjusting for all five movement features:

- β_CU vs SL: −0.030 [−0.172, 0.111] whiff, 0.029 [−0.095, 0.153] chase — both include zero

- β_CU vs ST and β_SL vs ST both exclude zero but are likely confounded by pitcher archetypes and usage context

**Main finding:** Curveballs and sliders are statistically indistinguishable on both outcomes once movement is controlled. The sweeper occupies one extreme of a continuous horizontal break gradient. The emergence of the sweeper label may reflect refinement of this continuum rather than a genuinely novel pitch type.

Interested in feedback on: the GMM elbow justification, the LDA as a continuum test, and whether the circularity caveat is handled adequately.

u/Spiritual_Pen_7723 — 4 days ago

[Q] Is this a random sample?

I have to do an experiment for an AP Stats project, does it count as a random sample if I pick 30 random people from my contacts lists using a random number generator?

u/ARedditorOnHisOwn — 5 days ago

[Question] Confused about interpretability under model misspecification

Hi.

I’ve been told ever since intro stats that all models are wrong but some are useful, but never what happens to interpretability when the model is wrong. (I trust the mathematical statisticians 100% with the mathematical details of what I’m about to ask; I’m more concerned about the practicalities. Forgive any errors in understanding, for I am a noob.)

Specifically, with likelihood based methods, suppose the distributional assumptions are wrong (I presume they always are because the world is too damn complicated for me to be able to specify them correctly), then (correct me if I’m wrong), the parameters in the model still converge to “something” under certain assumptions about the likelihood. This pseudo true parameter is the parameter that minimizes the KL-divergence between the true distribution and our assumed distribution. Also, under certain assumptions, it will be asymptotically normally distributed and it’s recommended to use the sandwich estimator of its variance.
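A small simulated illustration of the pseudo-true parameter, under assumptions chosen purely for the example: the data are really Gamma(shape 2, scale 1.5), but an exponential model is fitted by maximum likelihood. For the exponential family the KL-minimizing rate is 1/E[X], so the misspecified MLE still converges to something with a clear meaning (the inverse of the true mean), even though the exponential shape is wrong.

```python
import random
import statistics

rng = random.Random(0)

# True data-generating process (an assumption for this demo): Gamma(2, 1.5).
shape, scale = 2.0, 1.5
data = [rng.gammavariate(shape, scale) for _ in range(50_000)]

# Exponential-model MLE of the rate is 1 / sample mean.
rate_hat = 1.0 / statistics.fmean(data)

# Pseudo-true rate: minimizes KL(true || exponential), equal to 1 / E[X].
pseudo_true_rate = 1.0 / (shape * scale)

print(round(rate_hat, 4), round(pseudo_true_rate, 4))
```

Here the estimate lands close to 1/3, the pseudo-true value. What fails under misspecification is anything that leans on the exponential shape, such as model-based standard errors or tail probabilities, which is where the sandwich variance comes in.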

For the sake of not fooling myself every time I use a model, I will presume it is always the case that I am estimating a pseudo-true parameter (diagnostics only go so far). How am I supposed to interpret this pseudo-true parameter? My estimators? Regression betas and odds ratios? What do they mean now?

I understand that to deal with these problems there are other techniques like estimating equations and the like (I don’t understand that part of the theory yet). How do they help with this issue?

What are some practical alternatives?

Thanks.

u/Able-Fennel-1228 — 5 days ago

[Question] Is this residuals graph random?

Image is linked here. I always have trouble deciding whether a residual graph is random or not. I can sort of see a downward funnel, but also maybe not?

Any help is appreciated. Thank you very much.

u/nectarxx — 5 days ago