r/rstats

What is better count regression or t-tests for cell proliferation data: I had to know
▲ 19 r/rstats+2 crossposts

In biology you often count things: cells of type A out of total cells of type B, mutant flies out of total flies, etc. The most common move in papers is to compute a ratio per animal and run a t-test on the ratios. This throws away how many cells you actually counted ("5/100" and "50/1000" become the same) and feeds data strictly bounded to [0,1] into a t-test. The principled alternative is count regression with offset(log(N)): model the raw count directly, bring the total in as an exposure (offset) term instead of dividing by it, and respect the non-Gaussian nature of count data. This week I decided to test this assumption in practice:

Setup. Four methods across two pipelines:

  • Animal-level: Welch's t-test on ratios vs CMP GLM (glmmTMB(..., family = compois()))
  • Field-level: LMM with (1 | EmbryoID) vs CMP GLMM with the same RE

Three metrics: Type-I error, size-adjusted power (Lloyd correction), median 95% CI width.

The interesting bit. Instead of running ~10k sims at one design, I sampled 300 designs over a 6-dim space with Latin hypercube (log-uniform on multiplicative knobs, linear on CV, discrete on n_animals), ran 200-500 sims per design × method, then fit GP emulators (hetGP, Matérn 5/2 + ARD) on the point estimates. (I try to run and hide but come back to GAMs one way or another :)). LOOCV verified they generalize. Sobol decomposition tells me which design knobs drive each method's response; Monte Carlo marginalization over nuisance knobs gives clean 2D heatmaps of power and CI width on (n_animals, CV).

Findings.

  • Both methods hit 80% power at essentially the same (n_animals, CV) spot. Below that threshold, in the underpowered regime where most real experiments live, count regression beats the ratio approach.
  • CMP GLMM produces narrower CIs than LMM at essentially 100% of designs (median ~12% narrower). CMP GLM beats Welch at ~97% (~7% narrower).
  • Adding random effects shifts the 80% power contour to the left: fewer animals for the same power.
  • Sobol shows all four methods have nearly identical sensitivity profiles. The precision advantage isn't about one method responding to a knob the others ignore; it's about how efficiently each one extracts information from the same drivers.

Practical takeaway. Default to glmmTMB(Y ~ Group + offset(log(N)) + (1 | EmbryoID), family = compois()). The CMP advantage is real and lives in the small-n regime. If you have huge n, all four agree.
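To make the offset idea concrete, here is a minimal base-R sketch. It is not the code from the post: it swaps in a plain Poisson GLM as a dependency-free stand-in for the CMP family, drops the random effect, and uses simulated data with made-up rates. The helper name se_log_rate is hypothetical.

```r
# "5/100" and "50/1000" are the same ratio, but the count model sees
# different amounts of information: the SE of the log rate shrinks with the count.
se_log_rate <- function(y, n) {
  fit <- glm(y ~ 1 + offset(log(n)), family = poisson())
  unname(sqrt(diag(vcov(fit))))
}
se_small <- se_log_rate(5, 100)    # roughly 1 / sqrt(5)
se_large <- se_log_rate(50, 1000)  # roughly 1 / sqrt(50)

# Simulated two-group experiment (all numbers are illustrative assumptions)
set.seed(1)
dat <- data.frame(
  Group = factor(rep(c("ctrl", "treat"), each = 6)),
  N     = round(runif(12, min = 100, max = 1000))
)
dat$Y <- rpois(12, lambda = ifelse(dat$Group == "treat", 0.08, 0.05) * dat$N)

# Ratio approach: Welch's t-test on per-animal ratios
dat$ratio <- dat$Y / dat$N
welch <- t.test(ratio ~ Group, data = dat)

# Count approach: model the raw count, with the total as an offset
fit <- glm(Y ~ Group + offset(log(N)), family = poisson(), data = dat)
rate_ratio <- exp(coef(fit)[["Grouptreat"]])
```

The glmmTMB call in the takeaway has the same formula structure, with family = compois() and a (1 | EmbryoID) random intercept added on top.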

Full reproducible post with code:

u/rrytas — 1 day ago
▲ 5 r/rstats

Architecture advice for a lab website (quarto and shiny server)

Hello everyone,

I’m looking for some advice/validation regarding the web infrastructure for our research lab. We are building two things using R-based frameworks:

  1. A static lab website built with Quarto.
  2. Several dynamic web apps built with Shiny.

Like many academic labs, we are on a tight budget. Paid solutions like Posit Cloud/shinyapps.io ($20+/month) are too expensive for our use, and we only want to pay for a custom domain (~$10/year).

Here is the architecture we are planning:

  • Host the Quarto static site on GitHub Pages (free) and link it to our root domain (e.g., lab.com).
  • We have a dedicated PC in the university running open-source Shiny Server. The apps are currently running fine, but they are only accessible via the university intranet.
  • University IT is usually unresponsive and won't open ports or configure firewalls for us.
  • We plan to use Cloudflare Tunnels on the local PC. This would expose the Shiny Server to the internet securely without opening inbound ports or setting up a reverse proxy (Nginx) ourselves. We would route this to a subdomain (e.g., tools.lab.com/app1).

My questions:

  1. Is this a sound approach, or am I overcomplicating things?
  2. Is the subdomain approach (tools.lab.com) the best way to integrate this, or is there a simple way to have everything under the root domain (lab.com/tools) without causing routing conflicts with GitHub Pages?
  3. Has anyone deployed a similar stack in an academic/strict IT environment? Any caveats regarding Cloudflare Tunnels and university firewalls I should be aware of?

Thanks in advance for your insights!

u/xTrew — 2 days ago
▲ 1 r/rstats

Tried to create a histogram with sentiment scores and it came up empty, how can I fix this?

Good afternoon. I'm a complete RStudio newbie, and I'm doing this for a social data science assignment due in two days. I tried to create a sentiment score histogram by following a tutorial document from my module teacher, but my histogram came up empty. I realise it's because the data frame "doc_sentiment_filtered" came up empty, but I don't know why that happened. (I suspect it might be because his tutorial was meant to have two dfm_subsets for something else, but I didn't need that for my own investigation, so I just used the original doc.dfm.final object instead.)

#convert sentiment scores to data frame and add docvars

doc_sentiment_df1 <- cbind(convert(sentiment_scores1, to = "data.frame"), docvars(doc.dfm.final))

#calculate document lengths

doc_length <- ntoken(doc.dfm.final)

#Harmonise document names across both frames/files

names(doc_length) <- basename(names(doc_length))

#Add document lengths, aligning names

doc_sentiment_df1$doc_length <- doc_length[match(rownames(doc_sentiment_df1), names(doc_length))]

#Filter documents with positive length

doc_sentiment_filtered <- doc_sentiment_df1 %>% filter(doc_length > 0)

#calculate raw and normalised sentiment scores

doc_sentiment_filtered$raw_sentiment_score <- doc_sentiment_filtered$positive - doc_sentiment_filtered$negative

doc_sentiment_filtered$normalized_sentiment_score <- (doc_sentiment_filtered$raw_sentiment_score / doc_sentiment_filtered$doc_length)*100

#Now summary should be meaningful

summary(doc_sentiment_filtered$normalized_sentiment_score)

#Histogram of normalised sentiment scores

ggplot(doc_sentiment_filtered, aes(x = normalized_sentiment_score)) + geom_histogram(binwidth = 1, fill = "skyblue", colour = "black") + labs(title = "Distribution of Normalised Sentiment Scores per 100 Words",

x = "Normalised Sentiment Score", y = "Number of Documents")
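
One possible cause, sketched with a toy data frame in base R. It assumes that convert(..., to = "data.frame") puts the document names in a doc_id column and leaves the row names as "1", "2", "3"; the toy data, column values, and variable names below are made up for illustration. If that assumption holds, matching on rownames() finds nothing, every doc_length is NA, and a "> 0" filter then drops every row.

```r
# Toy stand-in for the converted sentiment scores (assumed layout)
scores <- data.frame(
  doc_id   = c("a.txt", "b.txt", "c.txt"),
  positive = c(4, 2, 7),
  negative = c(1, 3, 2)
)
doc_length <- c("a.txt" = 120, "b.txt" = 80, "c.txt" = 200)

# Matching on row names ("1", "2", "3") finds no document names, so all NA
scores$len_by_rownames <- unname(doc_length[match(rownames(scores), names(doc_length))])

# Matching on the doc_id column lines the lengths up correctly
scores$len_by_doc_id <- unname(doc_length[match(scores$doc_id, names(doc_length))])

# Rows with NA in the condition are dropped, like dplyr::filter() does
empty <- subset(scores, len_by_rownames > 0)
kept  <- subset(scores, len_by_doc_id > 0)

nrow(empty)  # 0
nrow(kept)   # 3
```

Checking head(doc_sentiment_df1) and summary(doc_sentiment_df1$doc_length) before the filter step would show whether this is what is happening.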

u/Moosiebwerry — 3 days ago
▲ 111 r/rstats

What do you want to know about AI + R and data science?

I'm bringing my substack back to life to talk about AI and data science. I have conflicted feelings about both AI and writing about AI but I want to try and work through them in the open. I'd love to know what y'all would like to hear about in future posts! 😀

u/hadley — 4 days ago
▲ 5 r/rstats

What package would you suggest for isotopic mixing of individual samples?

I have a collection of samples (n ~ 20) for which I have measured 2 isotopic values, and I want to calculate the likely % contribution of 4 source endmembers for each sample (e.g. sample 1 is 25% source 1, 12% source 2, 40% source 3, 23% source 4 +/- whatever; sample 2 is X% 1, Y% 2, Z% 3, A% 4, and so on). What package would you recommend using? I am aware of MixSIAR, but I am not interested in the source decomposition of populations of samples; I want to know the breakdown on a sample-by-sample basis (within uncertainty, of course).

Thank you

u/Pohatu5 — 3 days ago
▲ 7 r/rstats

what's the null hypothesis

this is kinda a dumb question but if the statement is: "the average salary is less than 500. test this claim", what's the null hypothesis and the alternative hypothesis?
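
Under the usual convention, the claim being tested ("less than 500") goes into the alternative hypothesis, and the null is its complement, evaluated at the boundary value: H0: mean >= 500 versus H1: mean < 500. A minimal sketch in R, using a made-up sample (the data and the choice of a one-sample t-test are assumptions, not part of the question):

```r
# Hypothetical salary sample, only for illustration
set.seed(7)
salary <- rnorm(30, mean = 480, sd = 40)

# One-sided test: H0: mean >= 500 (tested at 500) vs H1: mean < 500
res <- t.test(salary, mu = 500, alternative = "less")
res$alternative  # "less"
res$p.value      # a small p-value supports the claim "mean < 500"
```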

u/Far-Bad-1441 — 4 days ago
▲ 20 r/rstats+2 crossposts

ACTUNEO – Open Source African Actuarial Python Library | Looking for Contributors

Hi everyone,

I am currently building ACTUNEO, an open-source actuarial Python library focused on African and emerging market actuarial applications.

The goal is to create localized actuarial infrastructure that bridges traditional actuarial science with modern data science tools while addressing African market realities such as multi-currency environments, localized mortality assumptions, pension analytics, and insurance modeling.

ACTUNEO is built in Python and integrates with:

  • Pandas
  • NumPy
  • SciPy
  • Plotly
  • Scikit-learn

Why this project exists

Many actuarial libraries are built primarily around European or North American assumptions and datasets. African actuaries often adapt foreign assumptions due to limited localized tooling and open actuarial datasets.

ACTUNEO aims to help address this gap through open-source collaboration.

Current Progress

The project is currently building foundational modules including:

  • Mortality tables and survival models
  • Interest theory and financial mathematics
  • Life contingencies
  • Pension calculations
  • African macroeconomic data integration

Looking for Contributors

I am looking for:

  • Actuaries and actuarial students to validate formulas and assumptions
  • Python developers to improve architecture and testing
  • Data scientists interested in actuarial modeling
  • Technical writers to improve documentation and tutorials

Repository:
ACTUNEO GitHub Repository

Several issues will be tagged with:

  • good first issue
  • help wanted
  • documentation
u/Aggravating_Bat_2009 — 4 days ago
▲ 0 r/rstats

geom_col() messing up the age variable

Hi! I'm new to R and I'm trying to plot mutation subtypes against the age variable for a melanoma dataset. The code runs perfectly fine, but I don't understand why the geom_col() function keeps plotting weird numbers for age, especially since I plotted this for a specific subset. I tried using the geom_bar() function and it worked, but I think it plotted the number of observations I had rather than the actual age as a variable.

Can anyone help with this? Thank you!

https://preview.redd.it/ciiqjh56rt1h1.png?width=1758&format=png&auto=webp&s=ec0a8410cbfbdaa503d216646df3761c9f5d0062
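
A likely explanation, sketched with made-up data since the real dataset is not shown: geom_col() draws one bar segment per row and stacks them by default, so with one row per patient the bar height is the sum of ages in each subtype, while geom_bar() counts rows. Summarising to one value per subtype first usually gives the intended plot. The column names and values below are hypothetical.

```r
# Hypothetical data: one row per patient
df <- data.frame(
  subtype = c("BRAF", "BRAF", "BRAF", "NRAS", "NRAS"),
  age     = c(50, 60, 70, 40, 80)
)

# What stacked geom_col(aes(subtype, age)) bars add up to: sum of ages
stacked_height <- tapply(df$age, df$subtype, sum)

# What geom_bar(aes(subtype)) shows: number of rows per subtype
row_counts <- table(df$subtype)

# One value per subtype (here the mean age), which is what geom_col() expects
mean_age <- aggregate(age ~ subtype, data = df, FUN = mean)

# Plot the summary if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(mean_age, ggplot2::aes(x = subtype, y = age)) +
    ggplot2::geom_col()
}
```

If the goal is to show the age distribution per subtype and not a single summary, a boxplot of the raw rows may be a better fit than bars.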

u/Pretend-Gap8764 — 4 days ago
▲ 14 r/rstats+1 crossposts

qol 1.3.1 & printify 1.0.1 - Update with detailed refinements

qol is a package which can be used as its own ecosystem for descriptive evaluations, data wrangling, tabulation and much more. It offers over a hundred high-level functions which make coding life easier. While the last updates implemented many entirely new functions, this update focuses more on refining the existing ones.

printify is the base R zero dependency message system which is directly implemented in qol, but can also be used as a stand alone lightweight package.

A detailed overview for both packages can be seen here:

qol: https://github.com/s3rdia/qol

printify: https://github.com/s3rdia/printify

So what is in the update?

Renamed functions

compute() and recode() have been renamed and now have a "." at the end (compute.() and recode.()) to prevent masking errors in combination with dplyr. This means existing code will break if these functions were used.

Message system

* set_no_color(): Suppresses the color codes so that messages can be printed clean. The option is auto controlled on load via the system variable `NO_COLOR` but can also be set individually by this function. Console output in e.g. RStudio vs. output to a logging system should be handled automatically right now.

* set_up_custom_message(): Waiting symbols as well as the color of the time stamps can now be customized.

* print_step(): Now has a new `in_place` parameter, which prints the message on the same line as before, instead of in the next line. This can e.g. be used inside loops as follows.

new_in_place_steps <- function(){
    print_start_message()
    
    print_step("MAJOR", "Let's get started...")
    
    for (i in seq_len(10)){
        print_step("Minor", "This is in place step [i] of 10", i = i, in_place = TRUE)
        Sys.sleep(0.25)
    }
    
    print_step("MAJOR", "Loop has ended")
    
    print_closing()
}

new_in_place_steps()

Tabulation workflow

any_table() and export_with_style(): If the whole result list from these functions is passed for the `workbook` parameter, the functions now are able to extract the actual workbook from the list and run without error. Additionally if a list is passed, which is not a result list containing the workbook, the functions error and abort execution.

any_table(), frequencies(), crosstabs(): If 'csv' is specified as extension in the `file name` set in the global options or the style parameter the result table will then be exported as 'csv'. Otherwise the actual workbook will be exported as `xlsx` as normal.

New way to transpose data

transpose_plus() can now in a wide to long transposition not only put results below each other, but also side by side.

# Example formats
age. <- discrete_format(
    "Total"          = 0:100,
    "under 18"       = 0:17,
    "18 to under 25" = 18:24,
    "25 to under 55" = 25:54,
    "55 to under 65" = 55:64,
    "65 and older"   = 65:100)

sex. <- discrete_format(
    "Total"  = 1:2,
    "Male"   = 1,
    "Female" = 2)

# Example data frame
my_data <- dummy_data(1000)

# Transpose from long to wide and use a multilabel to generate additional categories
long_to_wide <- my_data |>
    transpose_plus(preserve = c(year, age),
                   pivot    = "sex",
                   values   = c(income, weight),
                   formats  = list(sex = sex., age = age.),
                   weight   = weight,
                   na.rm    = TRUE) |>
    rename_multi("income_Total"  = "Total",
                 "income_Male"   = "Male",
                 "income_Female" = "Female")

# Transpose back from wide to long but this time put results side by side.
# To do that every list entry has to have the same name. The values parameter
# is then used to give the new value variables a name. For the expressions of
# the new categorical variable the variable names from the first pivot list
# entry are used.
wide_to_long <- long_to_wide |>
    transpose_plus(preserve = c(year, age),
                   values   = c(income, weight),
                   pivot    = list(sex = c("Total", "Male", "Female"),
                                   sex = c("weight_Total", "weight_Male", "weight_Female")))

if.() can now explicitly delete

If the new `delete` keyword is passed instead of a variable assignment, the provided condition deletes observations instead of keeping them.

subset_df <- my_data |> if.(sex == 1, delete)

# Is the same as
subset_df <- my_data |> if.(sex != 1)
u/qol_package — 5 days ago
▲ 4 r/rstats

Most common stats used in trading applications, for modeling confidence?

Hi, what would you say are the most common or best ways to model confidence levels, estimates for things like theories or scenarios, for market analysis?

u/mrsockpicks — 5 days ago
▲ 64 r/rstats

How do I make sure I'm not off-loading valuable skills to AI while learning R? What experiences should and definitely shouldn't be automated?

Hi! So for context, I've been learning R for a few months now and getting the hang of it, but since I'm doing a lot of work in computational biology, I frequently use a lot of niche packages and handle large amounts of complex data with steep learning curves.

I've been trying to learn R the "natural" way as much as I can (reading documentation, Stack Overflow, debugging, etc.), but when that stops working (or is very time consuming) I sometimes fold and ask AI to explain a package or program to me, or why my script isn't working. It has made my learning process much faster, but since I'm not an experienced data analyst, I fear that I'm not gaining the valuable skills that come from struggling through these things.

That being said, are there any concepts/workflows/tedious things that are valuable learning experiences that I shouldn't off-load to AI? And conversely, what are the things that you think getting AI to automate isn't a bad thing for learning R? Any input is appreciated!

u/JerryChen06 — 8 days ago
▲ 0 r/rstats

I built a free website with 90+ statistical calculators (and non-parametric tests, probability distributions, etc.) - completely free, no account needed.

Hey r/stats,

I wanted to share a personal project I’ve been working on: statistical-calculators.site.

It’s a web-based platform with over 90 free statistical calculators designed to make life easier for students, researchers, and data analysts.

What’s inside?

  • Hypothesis Testing: One/Two sample t-tests, One-Way & Two-Way ANOVA, Chi-Square (Independence & Goodness of fit).
  • Probability Distributions: Normal, Binomial, Poisson, Geometric, Exponential, and Uniform distributions.
  • Non-Parametric Tests: Wilcoxon Signed Rank, Mann-Whitney, Kruskal-Wallis, Fisher's exact test, and McNemar.
  • Regression & Correlation: Simple & Multivariate Linear Regression, Pearson, Spearman's rho, Kendall's tau, and Cramer's V.
  • Descriptive Stats & Intervals: Confidence intervals (proportions, means, variances), frequency tables, histograms, and basic plots.
  • Plus some extra daily tools like BMI, finance, and geometry calculators.
  • And much much more; in statistics, but not just statistics.

Why I made it: I wanted to create something straightforward that runs directly in the browser with zero barriers:

  • 100% Free
  • No registration / No account required
  • No software installation needed

Multi-language support: Most calculators are fully accessible in English and French, and I'm currently working on expanding the Spanish version as well (you can toggle languages right on the site).

I would honestly love to get your feedback! Are there any specific features, UI tweaks, or additional calculators you think are missing?

Would love to hear what you think.

Thank you so much!

statistical-calculators.site

u/ArmPuzzleheaded9469 — 6 days ago
▲ 5 r/rstats+5 crossposts

PLS-SEM on seminr

I built a PLS-SEM GUI for the most famous R package, "seminr".
The aim is to make PLS-SEM more user friendly and accessible, without the 100-case cap and subscription of SmartPLS.
Try it out at metis.emend.it.com.
Test it and let me know your feedback.
The feedback button is in the app.
We are also working on CB-SEM using lavaan and on inferential statistics, so that academics no longer need an SPSS subscription.

Support is welcome.

metis.emend.it.com
u/Outrageous-Giraffe58 — 7 days ago
▲ 9 r/rstats+1 crossposts

How do I get a sensible output for a regression in R with many categorical variables

Hello everyone!

I hope this is the right thread; if not, I'm very sorry.

I am running a regression in R using lm that contains quite a few categorical variables. I'm using factor() on all categorical variables. The problem is that when using summary() I get estimates for each combination of categorical variables, meaning that the output has over 300 lines. I've been using drop1 (F-test) to solve this problem, but I've been wondering whether ANOVA would be a better choice? Another issue with using drop1 is that I can't use robust errors, because drop1 doesn't work with lm_robust or lm2.

My supervisor can't help me (only knows Stata), which is why I'm asking here.

Any help is much appreciated!
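
For reference, the difference between the per-level and per-term outputs described above can be sketched in base R. The data, variable names, and effect sizes are simulated assumptions; only lm(), summary(), drop1() and anova() come from the post.

```r
# Simulated data with two categorical predictors (illustrative only)
set.seed(42)
n <- 200
d <- data.frame(
  region = factor(sample(letters[1:8], n, replace = TRUE)),
  sector = factor(sample(LETTERS[1:6], n, replace = TRUE)),
  x      = rnorm(n)
)
d$y <- 1 + 0.5 * d$x + 0.2 * as.integer(d$region) + rnorm(n)

fit <- lm(y ~ x + region + sector, data = d)

# summary(): one row per dummy coefficient (grows with the number of levels)
coef_table <- coef(summary(fit))

# drop1(): one F-test per term, each adjusted for all other terms
term_tests <- drop1(fit, test = "F")

# anova(): also one row per term, but tests are sequential (order matters)
seq_tests <- anova(fit)
```

For robust errors, car::Anova() has a white.adjust argument that may be worth a look; it is not used here and I have not checked how it behaves on this kind of model.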

u/Sad_Treat_5285 — 8 days ago
▲ 5 r/rstats

Examples for significance asterisk and P-value for line chart

Hello guys, I'm fairly new to the research game and need your advice. For my medical research I have line charts. You can see the example in the picture. On the x-axis are different time stamps, and I'm comparing the time stamps to one another. My supervisor wants me to add significance asterisks and p-values to the line charts. What is the most common depiction for that? Do you have examples of how it should look?
(My supervisor is sadly not very helpful and expects me to figure it out by myself.)
PS: English is not my first language
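
One common convention (an assumption here, since journals differ) is * for p < 0.05, ** for p < 0.01 and *** for p < 0.001, with "ns" otherwise. A minimal base-R sketch of that mapping follows; the helper name p_to_stars and the p-values are hypothetical.

```r
# Map p-values to significance labels using left-closed intervals:
# [0, 0.001) -> "***", [0.001, 0.01) -> "**", [0.01, 0.05) -> "*", else "ns"
p_to_stars <- function(p) {
  as.character(cut(
    p,
    breaks = c(-Inf, 0.001, 0.01, 0.05, Inf),
    labels = c("***", "**", "*", "ns"),
    right  = FALSE
  ))
}

# Made-up p-values for three time-point comparisons
p_values <- c(0.0004, 0.03, 0.2)
labels <- p_to_stars(p_values)
labels  # "***" "*" "ns"
```

The resulting labels can then be placed above the relevant time points, for example with ggplot2's annotate("text", ...), together with a figure legend that states the thresholds and the test used.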

u/ssaaeecc — 8 days ago