r/bioinformatics

How to Utilize AI Tools In Clinical Settings?

Hi everyone,
I work as a bioinformatician in a hospital setting where data privacy is a major concern and the rules are very strict.

Because of that, my use of AI and agentic tools like Claude Code or Biomni is very limited.

I was wondering if other people who work in similar clinical or hospital settings have the same issue.

Do most people just use a browser version of Claude or ChatGPT for code generation?

Does anyone know of any solutions or tools that let you integrate AI with your data, think through research questions, and generally work in a more streamlined fashion than with browser-based AI tools?

Thanks!

reddit.com
u/LastKnee9324 — 9 hours ago

If you could magically become a physician in an instant but you are no longer able to do bioinformatics, would you become a physician or stay a bioinformatician?

As the title says, suppose a magic wand can instantly make you either a physician (any specialty, but no longer able to do bioinformatics) or leave you a bioinformatician (with any skillset you want). Which one would you pick, and why?

u/mbtithroaway — 13 hours ago

Stress test: ~1,000,000 DNA reads, 60 genomes, 2 minutes. On a laptop. But only 86% mapping rate.

A question about mapping rate

A few days ago I posted asking for help with evo_* strain disambiguation. Got great feedback, learned a lot, and kept going.

Latest stress test: ~1,000,000 reads, 60 genomes, 136 seconds on a laptop (i5, no GPU).

Results:

- 86.2% mapping rate

- 86.48% accuracy

=== Per-Genome Breakdown ===
Genome Total Correct Accuracy
---------------------------------------------------------------------------
1030752 67182 67119 99.91%
1030755 5545 5494 99.08%
1030836 10369 10331 99.63%
1030878 1848 1815 98.21%
1035900 79803 79794 99.99%
1035930 3861 458 11.86%
1036539 6333 5674 89.59%
1036554 149149 149141 99.99%
1036608 2007 1993 99.30%
1036641 3392 3391 99.97%
1036707 1381 1374 99.49%
1036728 635 633 99.69%
1036743 1370 1369 99.93%
1036755 23623 23616 99.97%
1048783 1940 1940 100.00%
1048993 812 812 100.00%
1049005 22075 21982 99.58%
1049056 28905 15495 53.61%
1049089 2424 2331 96.16%
1052944 4171 942 22.58%
1052947 12087 9242 76.46%
1053058 16611 9590 57.73%
1139_AG 97325 96644 99.30%
1220_AD 91094 91038 99.94%
1220_AJ 288 280 97.22%
1285_BH 9250 9203 99.49%
1286_AP 2173 122 5.61%
1365_A 1508 1200 79.58%
Sample15_97 6 6 100.00%
Sample16_19 50 50 100.00%
Sample18_57 370 370 100.00%
Sample18_8 233 233 100.00%
Sample19_20 1516 1516 100.00%
Sample19_52 94 94 100.00%
Sample19_56 14 14 100.00%
Sample22_283 12 12 100.00%
Sample22_57 189 189 100.00%
Sample22_89 392 392 100.00%
Sample23_271 4618 4618 100.00%
Sample23_273 7 7 100.00%
Sample23_288 89 89 100.00%
Sample6_289 12 12 100.00%
Sample6_476 1 1 100.00%
Sample6_49 82 82 100.00%
Sample6_527 227 227 100.00%
Sample6_722 12 12 100.00%
Sample9_2 48 48 100.00%
Sample9_65 4 4 100.00%
evo_1035930.011 2026 486 23.99%
evo_1035930.029 35012 33754 96.41%
evo_1035930.032 11645 563 4.83%
evo_1049056.011 55646 54197 97.40%
evo_1049056.013 11804 532 4.51%
evo_1049056.015 28553 2993 10.48%
evo_1049056.031 2666 187 7.01%
evo_1049056.039 413 15 3.63%
evo_1286_AP.008 7409 1552 20.95%
evo_1286_AP.026 26519 24620 92.84%
evo_1286_AP.033 12313 3416 27.74%
evo_1286_AP.037 9012 996 11.05%

=== Top Wrong Predictions ===
evo_1049056.013 -> evo_1049056.011(10290), evo_1049056.015(723), 1049056(174)
evo_1049056.015 -> evo_1049056.011(24862), 1049056(416), evo_1049056.013(142)
evo_1286_AP.008 -> evo_1286_AP.026(5331), evo_1286_AP.033(372), evo_1286_AP.037(136)
1052947 -> 1053058(1766), 1052944(841), 1049005(199)
evo_1286_AP.037 -> evo_1286_AP.026(5460), evo_1286_AP.033(2252), 1286_AP(213)
1049056 -> evo_1049056.011(8698), evo_1049056.015(3687), evo_1049056.039(501)
evo_1286_AP.026 -> evo_1286_AP.033(806), evo_1286_AP.037(527), evo_1286_AP.008(310)
1053058 -> 1052944(3504), 1052947(3244), 1049005(214)
evo_1035930.032 -> evo_1035930.029(10802), evo_1035930.011(156), 1035930(123)
1035930 -> evo_1035930.029(3201), evo_1035930.032(155), evo_1035930.011(47)

Video attached — real benchmark, no edits.

Now my question: 13.8% of reads don't map at all. Analysis shows it's systematic: larger, more complex genomes have a ~19% unmapped rate vs. ~9% for smaller genomes. My hypothesis: repetitive regions produce common k-mers with low uniqueness scores, which fall below my min-score threshold.

Has anyone dealt with this? Is there a standard approach for handling repetitive regions in FM-index based classifiers?
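
One way to probe the hypothesis above is to score reads by how specific their k-mers are and compare the score distribution of mapped and unmapped reads. The sketch below is only an illustration: the classifier's real scoring is not shown in this post, so the "uniqueness" definition here (one over the number of genomes containing a k-mer) and the names `kmer_genome_counts` and `read_uniqueness` are assumptions, and a plain dictionary stands in for the FM-index.

```python
from collections import defaultdict

def kmers(seq, k):
    """Yield every k-mer of seq (no N handling, no reverse complement)."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def kmer_genome_counts(genomes, k):
    """Map each k-mer to the number of distinct genomes that contain it."""
    seen_in = defaultdict(set)
    for name, seq in genomes.items():
        for km in kmers(seq, k):
            seen_in[km].add(name)
    return {km: len(names) for km, names in seen_in.items()}

def read_uniqueness(read, counts, k):
    """Mean of 1 / (genomes containing the k-mer) over the read's indexed k-mers.

    1.0 means every k-mer is specific to a single genome; lower values mean
    the read is built from k-mers shared between genomes. Returns None when
    none of the read's k-mers are in the index.
    """
    scores = [1.0 / counts[km] for km in kmers(read, k) if km in counts]
    return sum(scores) / len(scores) if scores else None

# Two toy "genomes" that share a prefix and differ afterwards.
genomes = {
    "g1": "ACGTACGTTTGACCA",
    "g2": "ACGTACGTGGCATTA",
}
counts = kmer_genome_counts(genomes, k=4)
shared = read_uniqueness("ACGTACGT", counts, k=4)   # read from the shared prefix
specific = read_uniqueness("TTGACCA", counts, k=4)  # read from the g1-only tail
print(shared, specific)
```

If the unmapped reads cluster at low scores under a measure like this, that would support the repeat explanation. If they do not, the threshold is probably not the main cause.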

For context: I'm a CNC programmer who built this as a side project. Still learning the field — appreciate any pointers.

u/Individual_One_1793 — 1 day ago
▲ 2 r/bioinformatics+4 crossposts

Online certificate course for bioinformatics along with programming and AI

Please suggest an online certificate course in bioinformatics, along with programming and AI, for this summer.

I am a 2nd-year biotech student and I want to gain some knowledge of dry-lab work.

u/Low_Health_8499 — 18 hours ago

Is Machine Learning just fancy correlation = causation??

In science, all through our education, we are told that correlation doesn't equal causation. Then, when it comes to machine learning, we are taught to choose models by how well they fit the data and predict outcomes.

Is this not just a really fancy way of finding correlations?

It's obvious, but I don't feel like this is reckoned with appropriately.

To be clear, I am not anti-ML or anti-AI, just a bit confused about how we are using these tools.

If anyone has some thoughts about this I would be very interested!

Or an example of how you have balanced using models and more mechanistic approaches.
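
To make the worry concrete, here is a toy simulation (an illustrative sketch, not taken from any real dataset). A hidden factor Z drives both X and Y, and X has no effect on Y at all. The variable names, noise levels, and the `simulate` helper are all assumptions chosen for the example.

```python
import random

def simulate(n, seed, set_x=None):
    """Toy model: a hidden factor Z drives both X and Y; X never affects Y.

    If set_x is given, X is forced to that value for every sample
    (an intervention), which breaks its link to Z.
    """
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        z = rng.gauss(0, 1)
        x = set_x if set_x is not None else z + rng.gauss(0, 0.3)
        y = 2 * z + rng.gauss(0, 0.3)
        xs.append(x)
        ys.append(y)
    return xs, ys

def mean(values):
    return sum(values) / len(values)

def fit_slope(xs, ys):
    """Least-squares slope of y on x."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

# Observational data: X predicts Y well because both follow Z.
xs, ys = simulate(5000, seed=1)
slope = fit_slope(xs, ys)

# Interventional data: force X low or high and look at Y.
_, ys_low = simulate(5000, seed=2, set_x=-3.0)
_, ys_high = simulate(5000, seed=3, set_x=3.0)
effect_of_forcing_x = mean(ys_high) - mean(ys_low)

print(f"observational slope of Y on X: {slope:.2f}")
print(f"change in mean Y when X is forced from -3 to 3: {effect_of_forcing_x:.2f}")
```

In this setup the fitted slope is about 1.8 in expectation (2 / 1.09), so a model trained on the observational data predicts Y well, yet forcing X changes nothing. A good predictive fit answers "what will Y be, given that I see X?", not "what happens to Y if I change X?".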

Thank you 😄

u/good_hugs — 19 hours ago

Independent Research Project Ideas

I'm a data engineer, and I want to do independent research in computational biology. What are some projects that I could do by myself with public data and open-source software that could have enough impact for an arXiv paper?

u/SimpleDumbIdiot — 1 day ago

Distance matrix with HKY model

Hi!

I am working with a relatively large COI dataset (~3200 sequences). I just ran ModelTest on my alignment file, and the best model according to BIC is HKY+G4 (gamma shape = 0.3274). My goal is strictly to get a distance matrix for downstream analysis; I'm not interested in building a phylogenetic tree. For this I'm using the ape R package. However, the dist.dna() function has no HKY model, but it does have an F84 model that is apparently equivalent (though still not the same). Is it advisable to just run the calculations using the F84 model (and setting the gamma value), or is there a significant risk in doing this? Should I instead use another model that is available in ape and had a slightly worse score?
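
For intuition about what the F84 option computes, here is a small sketch of the closed-form F84 distance as I understand it (Felsenstein and Churchill, 1996), written in Python rather than R so it can be read on its own. It leaves out the gamma correction, the function name `f84_distance` is hypothetical, and it is meant for checking intuition on a pair of sequences, not as a replacement for dist.dna(). F84 and HKY both allow unequal base frequencies and a transition/transversion bias; they differ in how that bias is parameterized.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def f84_distance(seq1, seq2, freqs):
    """F84 distance between two aligned sequences (no gamma correction).

    freqs maps 'A', 'C', 'G', 'T' to base frequencies that sum to 1.
    Sites where either sequence is not A/C/G/T (gaps, N) are skipped.
    """
    sites = transitions = transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        sites += 1
        if a == b:
            continue
        pair = {a, b}
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if sites == 0:
        raise ValueError("no comparable sites")

    p = transitions / sites      # proportion of transitional differences
    q = transversions / sites    # proportion of transversional differences
    pi_r = freqs["A"] + freqs["G"]
    pi_y = freqs["C"] + freqs["T"]
    a_term = freqs["A"] * freqs["G"] / pi_r + freqs["C"] * freqs["T"] / pi_y
    b_term = freqs["A"] * freqs["G"] + freqs["C"] * freqs["T"]
    c_term = pi_r * pi_y

    t1 = math.log(1 - p / (2 * a_term) - (a_term - b_term) * q / (2 * a_term * c_term))
    t2 = math.log(1 - q / (2 * c_term))
    return -2 * a_term * t1 + 2 * (a_term - b_term - c_term) * t2

# Toy pair: 20 sites, 2 transitions (A>G, C>T) and 1 transversion (C>A).
seq_a = "A" * 10 + "C" * 10
seq_b = "AGAAAAAAAA" + "CTCCCCCCCA"
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
d_uniform = f84_distance(seq_a, seq_b, uniform)
print(round(d_uniform, 5))
```

A useful sanity check on the formula: with equal base frequencies it reduces to the Kimura two-parameter (K80) distance, which is how the test case above was chosen.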

Thanks in advance for your insights.

▲ 0 r/bioinformatics+1 crossposts

Help! Guys, I'm terrified

I'm thinking of doing an MS in bioinformatics. I did my bachelor's in biotechnology with many tech electives like machine learning, and I have also learned a lot on my own. I'm terrified after reading what people on Reddit say about the bioinformatics job market in the USA.

I want to ask:

  1. Will I really not get any job even after my MS?

  2. If the job market is really as bad as people say, and not even 50 percent of students from my MS college get placed, won't that be a problem for the college?

To be clear, I'm talking about industry, not research or academia.

I hope you guys can help me! Thank you.

u/reddituser_fake — 1 day ago

Do you justify QC decisions in the supplement or just mention them in the text?

Up until now I've always worked with very clean data; I haven't had to make many hard decisions since the data looks as expected. However, I'm now working on a bit of a messy single-cell analysis that requires tough decisions. Stuff like removing a couple clusters due to high mt read % (easy to justify) but also one with inexplicably low mt read %. We also have very different library sizes, so there's some nuance to our analysis in what we can/cannot compare.

I'm usually in favour of adding too much to the supplement rather than too little. Is it typical to plot out these QC metrics in the supplement to explain why we made these decisions? Like a before and after removing poor quality clusters, or showing count distributions, etc. I see a lot of papers that just mention something like "after removing low quality cells, we..."

Identifying enhancers for a transcription factor in different cell types

Hello everyone,

I have multiome data and used scenicplus to identify TF enrichment in my cell types. I was wondering if it is possible to check which enhancers a given TF binds to in each cell type.

u/BiggusDikkusMorocos — 1 day ago

How to learn FBA for metabolic models

Hello, all. I'm a PhD student, and my work involves designing metabolic cassettes for genomic integration in yeast to enhance metabolite production.

I want to use FBA to evaluate the effect of gene deletion, incorporation, or overexpression. Could you point me to resources for learning FBA? I also have no prior exposure to coding, so is there a less complex way to learn just enough for FBA?

u/bokugo1 — 1 day ago

Stress-test my research thesis: feasibility from a bioinformatics POV?

Hi r/bioinformatics,

I am exploring a research thesis and would value sharp critique before committing to original data collection. Here is a quick recap of the idea.

The thesis

Oral mycobiome composition, combined with the chemical signals fungi produce, may carry individually distinct information that correlates with interpersonal recognition, affection, attraction, and bonding patterns. This is currently unstudied at the fungal layer.

What the literature supports:

  • Beghini, Pullman, Christakis et al. (Nature, 2024) - microbiome strain-sharing in 1,787 adults predicts close social relationships better than wealth, religion, or education. Fungi were not measured.
  • Cornejo Ulloa, Krom et al. (Frontiers in Endocrinology, 2024) - oral tissue expresses SSH receptors; the authors explicitly name the SSH–oral microbiome interaction as an open research gap.
  • Bennett et al. (MDPI Toxins, 2015) - fungi produce species-specific volatile organic compounds.
  • Hadrich et al. (Frontiers in Cellular Neuroscience, 2025) - oral mycobiome dysbiosis linked to serotonin/dopamine pathway disruption.

Where I would love this sub's input:

ITS1 vs ITS2 for oral mycobiome specifically - current state of the art? Resolution trade-offs for typical oral genera (Candida, Cladosporium, Aureobasidium)?

Existing public datasets - HMP fungal subset, oral cohorts - are there any where a within-vs-between-individual variance question on fungal composition could be tested before committing to original collection?

Multi-omic angle - if metabolomics (VOCs) gets layered in later, what's a credible integration strategy with ITS abundance at the individual level?

Honest tear-down - what would invalidate this thesis at the data layer before we even talk about behavioral correlates?

I am ready to hear (and cry later) what you consider unworkable in this thesis, or what the cleanest first feasibility test (fail-fast) could be.

Happy to discuss further in DMs.

u/KathBoonBliss — 2 days ago
▲ 2 r/bioinformatics+2 crossposts

Career advice

I have a bachelor's in computer science, and I am interested in doing a master's in bioinformatics and later a PhD in the US. Would you advise me to do the master's first, or to take some time and go directly for a PhD? I also need to build my foundation in bioinformatics. Please recommend any books or playlists that can help.

u/ProfessionalRush5204 — 2 days ago

Day 1 of posting unknown human DNA BLAST

GRCh38 has 603 documented assembly gaps. I pulled the real flanking DNA on both sides of each gap from UCSC, trained a 5th-order Markov model on those borders, and grew sequences outward until the k-mer profiles converged with the downstream border.

Results: 459 gaps filled, 46.3 million bp. 70/70 BLAST queries against the full nt database: zero matches. Not human, not primate, not anything.
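
For readers unfamiliar with the generation step, here is a toy version of growing sequence from a Markov model trained on flanks. It is only a sketch: it uses order 3 on short made-up flanks instead of order 5 on real UCSC sequence, it has no convergence check against the downstream border, and the names `train_markov` and `generate` are hypothetical.

```python
import random
from collections import Counter, defaultdict

def train_markov(seqs, order):
    """Count which base follows each length-`order` context in the training seqs."""
    counts = defaultdict(Counter)
    for seq in seqs:
        for i in range(len(seq) - order):
            counts[seq[i:i + order]][seq[i + order]] += 1
    return counts

def generate(counts, start, order, length, rng):
    """Extend `start` one base at a time by sampling from the trained counts.

    Stops early if the current context was never followed by anything
    in the training data.
    """
    out = list(start)
    while len(out) < length:
        context = "".join(out[-order:])
        followers = counts.get(context)
        if not followers:
            break
        bases, weights = zip(*followers.items())
        out.append(rng.choices(bases, weights=weights)[0])
    return "".join(out)

ORDER = 3  # the post uses order 5; 3 keeps the toy flanks short
left_flank = "ACGTACGGTCAGTCAGGCTAACGT"   # made-up flank sequences
right_flank = "TTGACGTACGATCGGCTAAC"

model = train_markov([left_flank, right_flank], ORDER)
generated = generate(model, left_flank[-ORDER:], ORDER, 60, random.Random(0))
print(generated)
```

By construction, every (order + 1)-mer in the output already occurs in the training flanks, so the output is a random sample from the model that reuses the flanks' local patterns. Different random seeds give different sequences for the same gap.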

Here's one — a 30-million-base heterochromatin gap on chrY that GRCh38 leaves as Ns:

```
>chrY:26,673,214-56,673,214 | BLAST NOVEL
AGGCCTAGTGCTGGCTGTGTGTGTGCCTGTGCTCCAGGCTGGTCTCGAGCTCA
AGCAATCCTCCCACCTCAGCCTCCCAAGTAGCTGGGACCACAGGCGCGTGCCA
CCACGCCTGGCTAATTTTTGTATTTTTAGTAGAGACAGGGTTTCACCATGTTG
GCCAGGCTGGTCTCGAACTCCTGACCTCAAGTGATCTGCCCGCCTTGGCCTCC
CAAAGTGCTGGGATTACAGGCATGAGCCACCGCGCCCGGCC
```

T2T-CHM13 comparison at the same 50 coordinates: 19.3% mean k-mer overlap, 45/50 above 10%. Not identical, but sharing structural vocabulary at the same genomic positions. Random sequence against random T2T regions gives <2%.
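
The post does not define "k-mer overlap", so the sketch below assumes one common reading: the fraction of distinct k-mers in the query that also occur in the reference region (containment). The function names and the toy sequences are illustrative only.

```python
def kmer_set(seq, k):
    """Set of distinct k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def kmer_overlap(query, reference, k):
    """Fraction of the query's distinct k-mers that also occur in reference."""
    query_kmers = kmer_set(query, k)
    if not query_kmers:
        return 0.0
    return len(query_kmers & kmer_set(reference, k)) / len(query_kmers)

overlap = kmer_overlap("ACGTACGT", "TTACGTAA", k=4)
print(overlap)  # 3 of the 4 distinct query 4-mers occur in the reference: 0.75
```

Under this definition the overlap expected by chance depends on k and on the lengths and base composition of both sequences, so a baseline is only comparable if it uses the same k and similar lengths and composition.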

Is this valid? Am I recovering real genomic architecture, or is there a methodological flaw I'm missing? Roast me.

I'm making a program that uses my own method to find DNA.

I'll respond to all. Skeptical? I'll run it and get back to you. If there is a mistake, let's find it!

u/Spare-Association714 — 2 days ago

[Q] Resources for teaching myself to read bioinformatics files such as FASTA and FASTQ

Hi all,

I work as a statistician and am trying to expand my knowledge and skills to cover bioinformatics, but I am totally new to the field. I have come to understand that bioinformatics tasks require reading data files not only in .xlsx or .csv, but also in formats like FASTA and FASTQ. Are there books or other resources I could use to teach myself about these? Any recommendations and suggestions will be greatly appreciated.
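
As a concrete picture of the two formats, here is a minimal sketch of reading them in plain Python. It assumes well-formed files, four lines per FASTQ record, and Phred+33 quality encoding; the function names `read_fasta` and `read_fastq` are made up for this example.

```python
import io

def read_fasta(handle):
    """Yield (header, sequence) pairs; a sequence may span several lines."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def read_fastq(handle):
    """Yield (name, sequence, qualities), assuming 4 lines per record.

    Qualities are decoded as Phred+33: score = ASCII code - 33.
    """
    while True:
        name = handle.readline().strip()
        if not name:
            break
        seq = handle.readline().strip()
        handle.readline()  # the '+' separator line
        qual = handle.readline().strip()
        yield name[1:], seq, [ord(c) - 33 for c in qual]

# In-memory examples; with real files use open("reads.fastq") instead.
fasta_text = ">seq1 example\nACGT\nGGCC\n>seq2\nTTAA\n"
fastq_text = "@read1\nACGT\n+\nII5!\n"

fasta_records = list(read_fasta(io.StringIO(fasta_text)))
fastq_records = list(read_fastq(io.StringIO(fastq_text)))
print(fasta_records)
print(fastq_records)
```

FASTA is a header line starting with `>` followed by sequence lines. FASTQ adds one quality character per base. For real work, an established parser such as Biopython's `Bio.SeqIO` handles the edge cases this sketch ignores.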

u/dgjang — 2 days ago

Best tool for spatial proteomics cell type annotation

Hey, so my supervisor suggested trying CellTypist, but it was originally built for transcriptomics data and gives terrible results here. I searched around, and AnnoSpat seems suitable. What other tools would you suggest that work best for proteomics data? Thank you in advance.

u/igcse_sufferer — 2 days ago
▲ 1 r/bioinformatics+1 crossposts

Biotechnology vs Computational Biology

I am in a really tricky situation. I already paid for biotechnology, but my university hasn't started yet, and both courses have the same eligibility requirements, so I feel like I can request a switch. What should I do? Should I switch to computational biology and risk the AI bubble growing and taking over the field in the near future, or should I take biotechnology and risk slow economic growth?

u/Acrobatic_Treat7430 — 3 days ago

Can KEGG pathway names be translated to other languages?

I have a painfully stupid question. I have absolutely no knowledge of bioinformatics, but I'm writing my bachelor's thesis about microbiota. It will be in Polish, and I was wondering whether KEGG pathway names are used universally in English or whether they can be translated into other languages. I'm very sorry for how stupid this question is, but I'm losing my mind over it and can't find an answer anywhere.

u/mushypotatogang — 3 days ago

How do I visualize BGC, AMP and AMR contigs from my multi sample data?

I have 5 shotgun samples of fermented food. I am unsure how to visualize this and which tools to use.

u/MrKiling — 3 days ago

Older academic packages on modern Linux systems

I am trying to install a GitHub repo on my Linux 25 system, and it failed. From what I can tell, the issue is older package source code being built with a modern compiler. Have you ever faced this, and how do you tackle it?

u/No_Food_2205 — 4 days ago