r/statistics

▲ 3 r/statistics

[career] [discussion] Bachelor of statistics and clueless about what to do

Hey guys, I'm doing a double major in math and stats at the University of Toronto, and will most likely finish the degree by next April. I'll be honest, when I picked the degree I wasn't really thinking beyond university. I entered initially for UofT computer science, didn't make post in my first year, and then pivoted to math and stats for ego reasons, i.e. "at least it's a hard major, shouldn't feel like too much of a bum". Now, as time has passed, that ego has pretty much disappeared, and the worry of homelessness is seeping into my thoughts.

For context, I'm based in Toronto, and ever since second year I've been trying and failing to get jobs in software engineering, data analysis, banking, etc., basically wasting away 4 years in school instead of gaining job experience.

Which is why I come here. What careers can I as a bachelor of science in math and stats even dream of breaking into? Should I consider going the masters route? If so, which masters should I pick that will allow me to break into a career easily? I was looking into biostats/bioinformatics and that subreddit's doom and gloom shocked me.

Also for those who studied at UofT, I have the option to switch into stats specialist and math minor with no changes being made to my final year schedule. My courses are already super stats heavy, so I was wondering if this switch is worth it or not?

u/HeightFluffy1767 — 6 hours ago

▲ 4 r/statistics

[Question] Alternatives for one-way ANOVA with failed independence (multiple group membership)

Participants	Football	Baseball	Tennis	Result
1	Yes	No	No	0
2	No	Yes	No	1
3	No	No	Yes	-1
4	Yes	No	Yes	3
5	No	Yes	Yes	-2

Here I have a list of participants (1-5) who did a survey and produced "results". Group membership is my independent variable, and the results column is my dependent. If there was no group overlap I would simply use an ANOVA and be done with it, but because I have participants in multiple groups (4 and 5) I fail the independence assumption.

I could create new "combo" categories for the cases in which there is multiple group membership and only count those participants in those new categories, but I was wondering if something else could be used instead.

What is the right stat to use here? Running in Jasp, but can use SPSS too.
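One option that is sometimes suggested for overlapping groups is a multiple regression with one yes/no indicator per sport, so a participant in two groups simply has two indicators switched on. The sketch below fits that model by ordinary least squares to the five rows in the table, using plain Python so it runs anywhere. The helper `solve` and the variable names are made up for illustration, and with only five observations this shows the mechanics only, not a usable inference.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Partial pivoting: move the largest entry in this column to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

# (football, baseball, tennis, result) for participants 1-5, from the table.
rows = [
    (1, 0, 0, 0),
    (0, 1, 0, 1),
    (0, 0, 1, -1),
    (1, 0, 1, 3),
    (0, 1, 1, -2),
]

# Design matrix: intercept plus one 0/1 indicator per sport.
X = [[1.0, f, b, t] for f, b, t, _ in rows]
y = [float(r[3]) for r in rows]
k = len(X[0])

# Normal equations (X'X) beta = X'y.
xtx = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(k)]
coef = solve(xtx, xty)

for name, value in zip(["intercept", "football", "baseball", "tennis"], coef):
    print(f"{name:>9}: {value:+.2f}")
```

Each sport coefficient is the estimated shift in the result associated with membership in that sport, holding the other memberships fixed. JASP and SPSS both offer linear regression, so the same model can be fitted there by entering the three Yes/No columns as predictors.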

u/adankishmeme — 3 days ago

▲ 8 r/statistics

[Question] How important are assumptions in hypothesis tests?

Certain statistical tests, such as the Z-test for a mean, the chi-squared test for contingency tables, and the significance test for the correlation coefficient, rest on assumptions such as normally distributed data. However, I often see no visual description of the data being tested (for example, histograms) and no distributional tests (like the Kolmogorov-Smirnov test). I understand that the assumptions might be satisfied, or violated only negligibly, when the data follow a distribution similar to the normal, such as Student's t distribution. But why are these tests so often performed on data that have not been shown to be normally distributed? Are the assumptions forgiving enough that, even when non-normal data lead us to reject or fail to reject the null hypothesis, we can be satisfied with the result and accept it as probably true? The same question applies to other statistical tests that are performed without checking whether their assumptions hold.

u/vickyy01123581321 — 4 days ago

▲ 19 r/statistics

[E] [D] Transitioning from CS/AI to an MSc in Statistics

I'm a bit of a mess right now and just need someone to point me in the right direction.

I recently graduated with an undergraduate degree in Computer Science and Artificial Intelligence. I liked some parts of it and got a good grasp of programming and basic AI algorithms (especially the linear algebra behind ML optimization and NLP). Halfway through, I realised that software engineering and coding do not interest me whatsoever. I have always had a very sharp mind for numbers and logic. My true passion is the crisp, absolute certainty of mathematics and rigorous proofs. I achieved the highest grade in math at school, and it was the only subject I actually enjoyed. In high school I foolishly fell into the trap of thinking that a math degree meant I could "only become a school math teacher", so I chose CS 😭. I definitely regret that now.

So eventually I accepted an offer for an MSc in Statistics starting this September. My ultimate goal after the Master's is a fully funded PhD path to become a theoretical statistician or mathematician working on foundational problems, or on whatever project requires advanced statistical theory.

I have built a self-study roadmap for this summer to make sure my foundations are solid before starting the MSc in Statistics. My current list covers:

Formal proof writing and logic

Calculus

Linear Algebra

Foundations, mainly core probability theory and mathematical statistical inference

Learning R and RStudio.

Does my summer roadmap sound realistic, or am I missing any major blind spots? Let me know.

I feel I want to explore the wider world of mathematics beyond just pure statistics. I am deeply fascinated by topics like real analysis, measure theory, convex optimization, and many others.

Tbh, writing this out makes me think that maybe it's just not the time to focus on those abstract pure math fields quite yet. I think I'm going to keep my immediate focus strictly on advanced statistics and the directly related prerequisites, to make sure I hit the ground running and stay on the right path.

At the end of the day, I just want to learn math and figure out what my true area of specialization should be. I love the subject, I've always been highly analytical, and I am completely driven by logical curiosity. I'm hoping this master's degree will give me the exposure I need to uncover which specific branch of advanced mathematics I'm meant to dedicate my research career to.

u/BasicallyImDeaf — 3 days ago

▲ 12 r/statistics

Which tools should I learn to advance my statistical career [Discussion]

So far, after finishing my freshman year in University, I've learned Excel and Python mainly, but I wish to advance more and have a stronger knowledge/foundation on other statistical applications. I'm wondering if I should start learning the R programming language or SQL first? Thank you very much!

u/Dense-Dirt3102 — 4 days ago

▲ 0 r/statistics

[Question] Standard deviation for fixed effects and random effects? (zero-inflated GLMM)

ChatGPT (don't come at me for AI use- I'm not good at stats) is telling me to calc SE for fixed effects and SD for random effects....is this correct? It's stating it's not appropriate to calc SD for fixed effects. Thanks! [Question]

u/heyhihello88888 — 4 days ago

▲ 8 r/statistics

[E] [Q] Summer before MSc in Statistics: help me decide in which order I should self-study these topics

Hi! While completing my thesis, I would like to spend July and August self-studying some topics before starting an MSc in Statistics, since I come from an economics BSc (with basic analysis and linear algebra courses, statistics, econometrics, and discrete structures). I would love to hear your advice about my plan.

I know that measure theory and probability theory are very important backbones of statistics. Since I will take both during my MSc, perhaps I will read some lecture notes in advance. I already followed a measure theory course for the sake of it, but felt like I could not grasp all of it. For this reason, I thought that this summer I should self-study the foundational tools and prerequisite knowledge needed to understand the advanced courses of my MSc more deeply. I would love to bridge, in a smart way, some of the gap between me and someone with a Maths BSc.

First of all, I have never taken a real analysis course. I read that it is useful for understanding measure theory, so I guess it will be an important gap to bridge before the Master's. I don't know, however, how difficult and time-consuming it will be.

Linear algebra: already taken during my BSc, but in a very non-rigorous way. I would love to study it more formally (my professor suggested Strang), but I wouldn't spend too many weeks on it because of time constraints.

My statistics professor also suggested getting a grasp of functional analysis, convex optimization, and stochastic calculus. I guess this will be the longest part to self-study. It would help to know whether they need additional prerequisites, and so whether I should back up and study other foundational topics before delving into them.

There are plenty of other topics I haven't touched, e.g. topology. On the applied side, it would also be beneficial to get a grasp of algorithms and data structures on my own. But I have time constraints and, most importantly, I would like to learn things in the right order, so that I have the right foundations to better understand more advanced topics during my MSc. I would really love your advice on what is most important to learn this summer, and in which order you would suggest going. Thanks!

u/ericuzza — 6 days ago

▲ 3 r/statistics+1 crossposts

Why you’re losing money on Jacks or Better (The 3 math traps most players fall for)

Most people treat video poker like a slot machine with a skin. They sit down, push buttons based on "gut feeling," and wonder why their bankroll vanishes.

The reality? 9/6 Jacks or Better is a solved mathematical puzzle. If you play with absolute, perfect strategy, the Expected Return (RTP) is 99.54%. The house edge is a microscopic 0.46%. But almost every casual player drops that down to 95% or worse because they make the same three emotional mistakes.

If you want to actually grind out a mathematical edge, stop making these three strategy blunders:

Breaking a Low Pair to Chase a Flush/Straight Draw

This is the most common mistake on the floor. You hold a pair of 4s, but you also have three cards to a Flush.

The Trap: Chasing the bigger payout because a pair of 4s pays nothing on its own (only Jacks or better get your money back).

The Math: A low pair has an expected value (EV) of about 0.82 credits per coin played, because it has a realistic path to Trips, a Full House, or Four of a Kind. Breaking it to chase a standard 4-card Flush draw drops your EV to roughly 0.74 credits.

The Rule: Never break a low pair for a 4-card Straight or Flush draw. The only time you break a low pair is if you are four cards to a Royal Flush.
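A back-of-envelope check on the draw figures above, assuming the standard 9/6 paytable per coin (full house 9, flush 6, straight 4; the post does not spell these out). The sketch counts only the cards that complete the draw, so each number is a lower bound that ignores extras such as pairing a high card.

```python
CARDS_LEFT = 47  # 52 cards minus the five dealt

# Assumed 9/6 paytable, in credits per coin bet.
FLUSH_PAYS = 6
STRAIGHT_PAYS = 4

flush_outs = 9     # 13 cards of the suit minus the 4 already held
straight_outs = 8  # open-ended draw: 4 cards complete it on each end

# Lower bounds on EV per coin: chance of completing the draw times its payout.
ev_flush_floor = flush_outs / CARDS_LEFT * FLUSH_PAYS
ev_straight_floor = straight_outs / CARDS_LEFT * STRAIGHT_PAYS

LOW_PAIR_EV = 0.82  # figure quoted in the post

print(f"4-card flush draw, at least:    {ev_flush_floor:.3f}")
print(f"4-card straight draw, at least: {ev_straight_floor:.3f}")
print(f"low pair (as quoted):           {LOW_PAIR_EV:.3f}")
```

Under these assumptions the flush draw's floor (about 1.15) already sits above the quoted 0.82 for a low pair, while the open-ended straight draw's floor (about 0.68) sits below it. The 0.74 figure therefore looks closer to a straight draw than a four-card flush draw, so the flush half of the rule may be worth checking against a full strategy table.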

Keeping the "Kicker"

You get dealt a Pair of Jacks and an Ace.

The Trap: Holding the Jack pair and holding the Ace (the kicker) hoping to hit a high two-pair or three-of-a-kind with a nice backup.

The Math: Holding that extra Ace actively destroys your chances of drawing into a Full House or Four Jacks. It reduces the number of winning cards left in the virtual deck.

The Rule: Never keep a kicker. If you have a pair, hold the pair and dump the other three cards. Period.
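The kicker claim can be checked with simple counting. The sketch below compares holding just the pair of Jacks (drawing 3 cards) with holding the pair plus the Ace (drawing 2), and looks only at how often more Jacks arrive. It deliberately ignores hands made by pairing the Ace, so it is a partial picture rather than a full EV calculation; the function names are illustrative.

```python
from math import comb

CARDS_LEFT = 47  # 52 cards minus the five dealt
JACKS_LEFT = 2   # two Jacks are already in the hand
OTHERS = CARDS_LEFT - JACKS_LEFT

def p_at_least_one_jack(draws):
    """Chance of drawing one or both remaining Jacks (trips or better)."""
    return 1 - comb(OTHERS, draws) / comb(CARDS_LEFT, draws)

def p_both_jacks(draws):
    """Chance of drawing both remaining Jacks (four of a kind)."""
    return comb(OTHERS, draws - 2) / comb(CARDS_LEFT, draws)

pair_only = (p_at_least_one_jack(3), p_both_jacks(3))
with_kicker = (p_at_least_one_jack(2), p_both_jacks(2))

print(f"hold JJ:   another Jack {pair_only[0]:.4f}, four Jacks {pair_only[1]:.5f}")
print(f"hold JJ+A: another Jack {with_kicker[0]:.4f}, four Jacks {with_kicker[1]:.5f}")
```

Dropping the kicker raises the chance of catching another Jack from about 8.4% to about 12.5%, and triples the chance of four Jacks.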

Playing Less Than Max Coins (5)

You’re trying to stretch your $100 bankroll, so you play 1 or 2 coins per hand instead of hitting "Max Bet."

The Trap: Thinking you're playing conservatively to survive longer.

The Math: The entire 99.54% return rate relies heavily on the Royal Flush payout. If you play 1 to 4 coins, a Royal Flush pays 250-to-1. But the moment you play 5 coins, the Royal Flush payout jumps to 800-to-1 (4,000 coins).

The Rule: If 5 coins is too expensive for your budget at the $1 level, drop down to a quarter machine or a nickel machine, but always play max coins. Otherwise, you are actively giving the casino an extra 1.5% edge.
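A rough way to see where the extra edge comes from is to multiply the Royal Flush frequency by each payout. The frequency used below (about one Royal per 40,391 hands under optimal 9/6 play) is a commonly quoted figure and is an assumption here, not something stated in the post. The sketch also assumes the player uses the same strategy at every coin level.

```python
# Assumed: roughly one Royal Flush per 40,391 hands with optimal 9/6 strategy.
ROYAL_FREQ = 1 / 40_391

PAYOUT_MAX_COINS = 800    # per coin, when betting 5 coins
PAYOUT_SHORT_COINS = 250  # per coin, when betting 1 to 4 coins

# Share of the total return that comes from Royal Flushes at each payout.
share_max = ROYAL_FREQ * PAYOUT_MAX_COINS
share_short = ROYAL_FREQ * PAYOUT_SHORT_COINS
lost_return = share_max - share_short

print(f"Royal share at max coins:   {share_max:.2%}")
print(f"Royal share at short coins: {share_short:.2%}")
print(f"Return given up:            {lost_return:.2%}")
```

With these inputs the Royal contributes about 2% of the return at max coins, and short-coin play gives up roughly 1.4 percentage points, in the same range as the figure quoted in the rule above.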

Quick Reference Strategy Hierarchy (The Top 5)

If you get confused on a hand, memorize just this top order of operations. Always hold the highest option on this list:

Four of a kind, Straight Flush, Royal Flush
4 cards to a Royal Flush
Three of a kind, Straight, Flush, Full House
4 cards to a Straight Flush
Two Pair or a Pair of Jacks or Better

Stop playing on vibes. Treat the machine like a calculator.

u/AllenOneDC — 6 days ago

▲ 13 r/statistics

Best Intermediate Statistics Playlists for Applied ML? [D]

I’m currently working as an AI Engineer, mostly on LLM-related work (fine-tuning, LangChain workflows, evaluation, FastAPI, and some cloud). Although I graduated with an ML background, I haven’t actively worked on classical ML or statistics for about a year.
I want to revisit ML and strengthen my statistics, especially the practical side. I’m not looking for beginner playlists or derivations. I’m looking for intermediate-level resources that focus on applying statistics to real datasets—hypothesis testing (t-tests, ANOVA/F-tests, etc.), assumptions, inference, forecasting, and choosing the right statistical methods in practice.

Any recommendations for YouTube playlists, courses, or books that are practical and application-oriented?

u/aspiring_aiengineer — 6 days ago

▲ 2 r/statistics

[Q] Variable selection for zero-inflated negative binomial model

Hi all. I am using a zero-inflated negative binomial model to evaluate the change in the number of prescriptions for drug A following a treatment. The treatment is modeled as a time-varying covariate and patients initiate treatment at different times during follow-up. All patients have received this treatment so each patient contributes both unexposed and exposed person-time.

My main confusion is about the zero-inflation component of the model. I understand that the count component should include the exposure and confounders of interest. I couldn't find accurate literature about variable selection for the zero-inflation part.

My model is like:

fit <- zeroinfl(n_prescriptions ~ treatment + age + sex + poverty + education + offset(log(follow_up_time)) | treatment + age + sex + poverty + education, data = df, dist = "negbin")

Is there any general principle for selecting variables for the zero-inflation component? Should it contain the same covariates as the count component, or only exposure variables? Thank you.

u/Walkill996 — 6 days ago

▲ 11 r/statistics

[Question] How screwed am I applying to European PhDs without an MS thesis?

Hi all,

I'm in a bit of a tricky spot and could use some advice.

I recently finished an MS in Statistics from a T25 program. I applied to PhD programs last cycle and unfortunately didn't get any offers, so I'm now looking at PhD positions in Northern Europe. The catch is that most of them require an MS thesis, which I don't have.

Here's why. My program offered two paths: a qualifying exam covering the first-year MS/PhD sequence, or a thesis option. Since I was planning to continue into the PhD at the same school (where I'd be writing a full dissertation anyway), I was advised to take the quals instead of doing a thesis. I passed the quals, but I wasn't ultimately able to continue into the PhD program there.

So now I'm stuck without a thesis, which a lot of European programs seem to expect. I'd strongly prefer doing my PhD in Europe. Between the funding cuts and the visa uncertainty, the US feels like a difficult place to commit to a PhD right now.

Given all this, how would you suggest I approach landing a PhD position in Europe? Any advice is appreciated. Thanks!

u/vv-97 — 8 days ago

▲ 3 r/statistics

[Q] looking for a specific term about bias in a study

I remember learning about this bias in school, but for the life of me I can't remember or find what it's called.

Here is how it was explained to me.

Say I run a study to find out how much of the population drinks regularly. If I collect my sample by standing in one spot on the street and asking people about their drinking habits, the results could be biased by the location.

The obvious example would be standing in front of a bar. People who go to bars are more likely to drink alcohol, which makes for bad data.

A less obvious example of the same bias: if I'm in front of a seafood store, I might not be aware of a correlation between seafood and alcoholism (a fictional example; I don't know if one exists), and this would taint my data.

Another less obvious example: if I'm at the foot of a trekking mountain, people who go trekking might drink less.

Every search I make brings me to participation bias, but I know it's not quite the same.

Context: why am I looking into this?
I have a theory that most data about pit bulls being aggressive is skewed by the owners. Any owner who would raise an aggressive dog will look into breeds like pit bulls, GSDs, or other scary-looking dogs. So looking at the pit bull population as a whole is like running my study in front of a dog-fighting club, which makes the sample useless.

u/chibugamo — 8 days ago

▲ 120 r/statistics+4 crossposts

My machine learning notes

In this age where people are learning from AI, I still believe there’s something powerful about 17 years of relentlessly writing and refining my own machine learning notes:
https://github.com/roboticcam/machine-learning-notes

u/Delicious_Screen_789 — 11 days ago

▲ 9 r/statistics

[Education] Trying to get my head around the basics (late in life) - brought on by a simple discussion about solstices. Explain like I’m 5 year old not 65

I was talking with a group of friends about the winter solstice and someone commented that the days will thankfully start getting longer.
One of us then added “and they’ll start getting warmer”
To which a third said, yes, “but we will still get very cold days along the way”.

This has had me thinking ever since. My schooling only covered how to work out some pretty basic averages.

I expect that the days getting longer is an exact amount every day, with no ups and downs along the way. A straight line from shortest day to longest day.

However, "the days getting warmer" definitely isn't. It will have some major highs and lows, but there will still generally be an upward trend.
* Is there a name for that trend?
* Is there a specific term or description for how much over that line or how much under that line a specific day is?
* Can an average be adjusted for particularly large abnormal swings? Changing the example might be better here: take "average income", where there are some insanely wealthy people and some insanely poor people, so an average income can look nothing like what the true average person earns. Is there such a thing as an "average average", one that accounts for those big figures skewing the results?
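For what it's worth, the usual names for these three ideas are a trend line (the general upward drift), a residual (how far a given day sits above or below that line), and the median (the middle value, which big outliers barely move). The small sketch below illustrates each one; all numbers are made up for the example.

```python
import statistics

# Made-up incomes: five ordinary earners and one extremely wealthy person.
incomes = [30_000, 35_000, 40_000, 45_000, 50_000, 5_000_000]
mean_income = statistics.fmean(incomes)     # dragged up by the outlier
median_income = statistics.median(incomes)  # the middle of the sorted list

# Made-up daily temperatures: warming overall, with ups and downs.
days = [0, 1, 2, 3, 4, 5]
temps = [2.0, 5.0, 1.0, 6.0, 4.0, 9.0]

# Least-squares trend line: temps is roughly intercept + slope * day.
day_mean = statistics.fmean(days)
temp_mean = statistics.fmean(temps)
slope = sum((d - day_mean) * (t - temp_mean) for d, t in zip(days, temps)) / sum(
    (d - day_mean) ** 2 for d in days
)
intercept = temp_mean - slope * day_mean

# Residual: how far each day is above (+) or below (-) the trend line.
residuals = [t - (intercept + slope * d) for d, t in zip(days, temps)]

print(f"mean income:   {mean_income:,.0f}")
print(f"median income: {median_income:,.0f}")
print(f"trend: about {slope:.2f} degrees warmer per day")
print("residuals:", [round(r, 2) for r in residuals])
```

The mean income lands far above what anyone but the richest person earns, while the median stays in the middle of the ordinary earners.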

I have no idea why I’ve suddenly decided to start learning about this all because of some chat about the weather, but hopefully it’s never too late to learn something new. Just go easy on this “old dog” learning his “new tricks”
Like how to add flair when there’s no option for flair like I normally get.

u/The_first_Ezookiel — 9 days ago

▲ 9 r/statistics

[Q] Is my interpretation of zero-inflation correct?

Hello,

I'm reaching out because I'd like to make sure that I'm interpreting my results correctly.

In brief, I'm studying the effect of seasonal changes in a waterbird colony on the density of soil mites. Each observation represents the number of individuals of a given species found in a single soil core sample. Since some species are relatively rare, many of my samples contain zero counts (i.e., the species was not detected in that particular soil sample).

A statistician suggested fitting a zero-inflated model with:

ziformula = ~ Exposure

where Exposure represents the bird breeding season versus the non-breeding season.

Am I correct in understanding that if the zero-inflation part of the model is statistically significant (example below), this means that Exposure significantly affects the probability that a sample is a structural zero (i.e., a sample in which the species is absent for reasons beyond the count process)?

If so, would it be correct to conclude that, for the season with the higher probability of structural zeros, the species is less likely to occur in soil samples and therefore has a lower density during that period? Or is that an incorrect interpretation of the zero-inflation component?

Example:

Zero-inflation model:
                 Estimate Std. Error z value Pr(>|z|)
(Intercept)       -1.0647     0.2593  -4.106 4.03e-05 ***
ExposureBreeding  -0.8812     0.4261  -2.068   0.0386 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
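To read these estimates as probabilities, it can help to back-transform them. The sketch below assumes the zero-inflation part uses a logit link (the usual default for this kind of model, but worth confirming in the fitted object) and that the non-breeding season is the reference level, as the coefficient name ExposureBreeding suggests.

```python
from math import exp

def inv_logit(x):
    """Convert a log-odds value to a probability."""
    return 1 / (1 + exp(-x))

# Estimates copied from the zero-inflation output above.
INTERCEPT = -1.0647
EXPOSURE_BREEDING = -0.8812

# Probability that a sample is a structural zero in each season.
p_zero_nonbreeding = inv_logit(INTERCEPT)
p_zero_breeding = inv_logit(INTERCEPT + EXPOSURE_BREEDING)

# Odds of a structural zero in the breeding season relative to non-breeding.
odds_ratio = exp(EXPOSURE_BREEDING)

print(f"P(structural zero), non-breeding: {p_zero_nonbreeding:.3f}")
print(f"P(structural zero), breeding:     {p_zero_breeding:.3f}")
print(f"odds ratio (breeding vs non):     {odds_ratio:.3f}")
```

Under those assumptions, the negative ExposureBreeding coefficient means structural zeros are less likely in the breeding season (about 12.5% versus about 25.6%). Overall density also depends on the count part of the model, so the two components are best read together.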

u/Big_Share_6599 — 10 days ago

▲ 0 r/statistics

[D] Challenging the use of T-statistic over Z-statistic

Most people reason that the t-statistic should be used over the z-statistic, since the z-statistic requires the knowledge of the population's variance. I want to challenge this notion:

Let's call the arithmetic average of your random variable, X_bar. If you have determined your sample size to be small, then X_bar is not normally distributed. This is the Central Limit Theorem. If your random variable is not normally distributed, then you can't use the t-statistic.

It naturally follows that if you're assuming X_bar is normally distributed, then you are also assuming that your sample size is large. If your sample size is large, then the sample variance of your sample, with the correction, should reasonably equal the population variance.
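One way to probe this argument is a small simulation. The sketch below draws samples of size 5 from a normal population, so X_bar is exactly normal without any appeal to the Central Limit Theorem, and then checks how often a 95% interval built from the sample standard deviation covers the true mean when the critical value comes from z versus t. The sample size, trial count, and seed are arbitrary choices for illustration.

```python
import random
import statistics

random.seed(0)

N = 5            # small sample size
TRIALS = 20_000
Z_CRIT = 1.960   # 97.5th percentile of the standard normal
T_CRIT = 2.776   # 97.5th percentile of Student's t with N - 1 = 4 df

def covers_true_mean(sample, crit):
    """True if mean +/- crit * s / sqrt(n) contains the true mean, 0."""
    mean = statistics.fmean(sample)
    half_width = crit * statistics.stdev(sample) / len(sample) ** 0.5
    return abs(mean) <= half_width

hits_z = hits_t = 0
for _ in range(TRIALS):
    sample = [random.gauss(0.0, 1.0) for _ in range(N)]
    hits_z += covers_true_mean(sample, Z_CRIT)
    hits_t += covers_true_mean(sample, T_CRIT)

coverage_z = hits_z / TRIALS
coverage_t = hits_t / TRIALS

print(f"coverage with z critical value: {coverage_z:.3f}")
print(f"coverage with t critical value: {coverage_t:.3f}")
```

In this setup the t-based interval covers the true mean close to the nominal 95% of the time, while the z-based interval falls noticeably short. That suggests a normally distributed X_bar does not by itself imply that the sample variance is close to the population variance.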

u/Anonymous_299912 — 10 days ago

▲ 1 r/statistics

[Q] Is my supervisor doing ANOVA testing in the correct way?

Every 6 months we have to run instrument comparisons on our 4 LC/MS instruments.

We do this by running 2 blanks, 5 low controls, 5 high controls, and 10 randomized samples on all 4 instruments.

My supervisor then takes the data from these runs, puts them side by side in excel and does ANOVA: Single Factor to get a p-value.

My concern is that I thought ANOVA was meant to be done when the sample types in the data sets are the same. But here, there are 4 sample types all getting bunched together, so the variance is wild. My supervisor has a PhD and he's not exactly great about certain "prying" type questions, so I have been a little nervous to ask.
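The concern about pooled sample types can be illustrated with a toy calculation. The sketch below uses made-up readings (three sample types on four instruments; every number is hypothetical) and compares a single-factor ANOVA on instrument with a version that treats sample type as a blocking factor, which is one possible alternative and not necessarily the right design for this lab.

```python
# Hypothetical readings: each row is a sample type, each column an instrument.
data = {
    "low control":  [10.0, 10.2, 10.1, 10.5],
    "sample A":     [50.0, 50.3, 50.1, 50.6],
    "high control": [100.0, 100.1, 100.2, 100.7],
}
rows = list(data.values())
n_rows, n_cols = len(rows), len(rows[0])

grand = sum(map(sum, rows)) / (n_rows * n_cols)
row_means = [sum(r) / n_cols for r in rows]
col_means = [sum(r[j] for r in rows) / n_rows for j in range(n_cols)]

# Sums of squares: total, between instruments, between sample types.
ss_total = sum((x - grand) ** 2 for r in rows for x in r)
ss_instr = n_rows * sum((m - grand) ** 2 for m in col_means)
ss_block = n_cols * sum((m - grand) ** 2 for m in row_means)
ss_error = ss_total - ss_instr - ss_block

ms_instr = ss_instr / (n_cols - 1)

# Single factor: sample-type variation is lumped into the error term.
f_oneway = ms_instr / ((ss_total - ss_instr) / (n_rows * n_cols - n_cols))

# Blocked: sample-type variation is removed before testing instruments.
f_blocked = ms_instr / (ss_error / ((n_rows - 1) * (n_cols - 1)))

print(f"F for instrument, single factor:          {f_oneway:.4f}")
print(f"F for instrument, blocked by sample type: {f_blocked:.1f}")
```

With these toy numbers the single-factor F is tiny because the spread between sample types swamps everything, while the blocked F is large. Real data may behave differently, but it shows why mixing sample types in one error term can hide instrument differences.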

Am I overthinking this? I am certainly no stats pro, but I am always looking for ways to improve the integrity of our data.

Thanks!

u/Slickity — 9 days ago

▲ 5 r/statistics

[Career] Skills required to conduct Survival Analysis in professional projects

Hi everyone, for context, I work in HR analytics, and with the help of Gemini I got to know the concept of Survival Analysis and its application in employee turnover analysis. I find it quite fascinating and really want to apply it at work. About myself: I know Python, SQL, and basic statistics, but don't have an advanced statistics background. Although Gemini offers to generate the code and interpret the output for me (very kind of him lol) and I can pull and process the required data, I don't feel confident at all running the project in a formal work setting.

With that, my question is: is it realistic for someone like me, who doesn't have a formal statistics education, to build the skills to run this analysis one day? If so, how do I gain the capability to run such an analysis? Are there any books or online courses you would recommend for this?

Also if you are running Survival Analysis at professional setting, I would love to know how much time it took you to become competent in this area and your business title in your company. Thank you so so much in advance!

u/Professional-Sea7103 — 12 days ago

▲ 5 r/statistics

[Discussion] how best to test a running improvement?

I am a run director at a local parkrun, which is a free, weekly, timed 5 km run around a local park, where all are welcome.

We are soon to add kilometre markers along the route, and I believe that this will make people’s runs faster by a small amount.

I’m wondering how I could test or prove my hypothesis using data which is freely available. For context, every single runner has their position and time logged each week, so I was wondering if I could track some runners before and after?

I would love some input, thoughts and suggestions regarding this challenge.
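One simple starting point for tracking the same runners before and after is a paired comparison: take each runner's typical time before the markers and after, and test whether the average difference is distinguishable from zero. The sketch below uses invented times for six runners purely to show the calculation.

```python
import statistics

# Invented average 5 km times in minutes for the same six runners.
before = [27.5, 31.0, 24.2, 35.8, 29.1, 26.4]  # before the markers
after = [27.1, 30.6, 24.3, 35.0, 28.7, 26.0]   # after the markers

# Positive difference means the runner got faster.
diffs = [b - a for b, a in zip(before, after)]

mean_diff = statistics.fmean(diffs)
std_err = statistics.stdev(diffs) / len(diffs) ** 0.5
t_stat = mean_diff / std_err  # compare with a t distribution, len(diffs) - 1 df

print(f"mean improvement: {mean_diff:.2f} min")
print(f"paired t statistic: {t_stat:.2f} on {len(diffs) - 1} df")
```

A before/after comparison cannot separate the markers from weather, season, or runners simply getting fitter, so it may be worth comparing against the same weeks at nearby parkruns that did not add markers.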

u/Popular_Sell_8980 — 12 days ago