r/bioinformatics

▲ 9 r/bioinformatics

Publication reputation

My supervisor always emphasizes doing good science and writing good documentation over minding which journal we submit to, and I wholeheartedly agree with him.

But I am still a bit disappointed that he decided to send the paper to Bioinformatics instead of at least Nature Communications, because he said the wait time for Nature Communications is long.

While Bioinformatics is the top journal for the field, it is not as competitive as a Nature publication. Would this impact my chances of finding a good postdoc or even an industry job that requires a PhD with publications?

u/Weird_Asparagus9695 — 6 hours ago

▲ 3 r/bioinformatics

ATAC-seq -- data quality issue?

Hi everyone, I am running an ATAC-seq analysis, largely following the ENCODE pipeline. My input data has great quality with FastQC ≥95% >Q35. However, I realised that I was not able to generate a satisfying peak set, i.e. FRiP 6%, TSE 1.4, ca. 300 peaks after IDR.

Tracing back the error, I realised that after alignment with bowtie2 my read length distribution does not show the nucleosome bumps. Starting to doubt this step, I downloaded a sample from ENCODE for reference (ENCSR019XCN) and ran the exact pipeline on it, leading to the result you see here.

Now I am starting to wonder if my input data is somehow corrupt? Did the experiment fail? What could be going on here? Is there a way to salvage this?

u/bauchibaer — 8 hours ago

▲ 79 r/bioinformatics

If you use custom chromosome names, I hate you right now.

Whoever decided that the reference names weren't suitable for your variant set, I hope you stub your toe today. That is all.

u/BractNotCalyx — 12 hours ago

▲ 0 r/bioinformatics

Help a novice

Context
I’m a bioengineering student who happens to like bioinformatics and is just entering this world. A professor of mine offered me the chance to help him with an antibody project. The antibodies were sequenced on an Illumina MiSeq from RT-PCR products (this was in 2010 or so): millions of reads of only 100 bases each. The antibodies passed panning and ELISA assays, so I have a “selection” of antibodies.

The struggle
I can’t do a de novo assembly because I don’t have that kind of computing power. I know that DADA2 and QIIME2 are used for metagenomics/metabarcoding and such (remember I am very, very new to this world), but I’m very interested in using ASVs to infer the CDR3 regions of the antibodies and to find the abundance and diversity of each one (that is my main goal). I know my workflow is very crooked and I may sound like I have no idea, because I don’t have any.

Any tips? I’m not looking for a complete answer but maybe for some guidance. Thank you!!

u/MangoSantos26 — 1 day ago

▲ 0 r/bioinformatics

CLC genomics workbench help please!

Hi, I am currently writing my manuscript and need to use CLC Genomics Workbench just once. Is there someone who can help me create a phylogenetic tree with metadata?
Please help, it is urgent.

Thank you

u/Kuframous — 1 day ago

▲ 0 r/bioinformatics

My PC is not installing AutoDock Vina despite countless tutorials

Context: I am a high school student-researcher, and before my concept paper can be approved to move on to chapter one, my research advisor told me to first simulate the AMR (Antimicrobial Resistance) of local dogs to common veterinary antibiotics in silico. This involves using AutoDock and other docking programs to predict how these molecules might interact with bacterial proteins, and for some reason I cannot install it on my laptop. I really need to do this simulation before the deadline in 2 days. Nothing opens at all when I click on it.

u/b33ya — 2 days ago

▲ 7 r/bioinformatics+1 crossposts

Log2 fold change vs Fold Change

I am not a biostatistician and would love to understand. My project compares samples from 2 different groups (say one with hot dogs and one without hot dogs). My biostatistician sent me the volcano plot and I am able to see which proteins are downregulated and which are upregulated. He attached a table with the fold change. However, when I look at the volcano plot, the x axis is log2 fold change, with the y axis as p-value. From my understanding, log2 fold change is how differential expression is usually represented. However, when I calculate log2 fold change myself, some of the proteins change to negative values. What does this mean? This does not make sense to me, as in my volcano plots these proteins are definitely placed on the appropriate side (downregulated vs upregulated).

For example, Protein A is listed as upregulated with fold change 0.9, but its log2 fold change is -0.11. Does that mean Protein A is actually downregulated? I also have the reverse, where Protein B is listed as downregulated with fold change of, say, 1 and log2 fold change of -0.06. Does that mean Protein B is actually upregulated?
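
For reference, the relationship between the two scales can be sketched in a few lines of Python. This assumes the fold change in the table is the plain ratio of the two group means (treated / control), which is an assumption and not something the table itself confirms; the function names are made up for illustration. Under that assumption, a fold change below 1 always gives a negative log2 value, exactly 1 gives 0, and above 1 gives a positive value.

```python
import math


def log2_fold_change(fold_change: float) -> float:
    """Convert a ratio-scale fold change (assumed treated / control) to log2 scale."""
    if fold_change <= 0:
        raise ValueError("a ratio-scale fold change must be positive")
    return math.log2(fold_change)


def direction(fold_change: float) -> str:
    """Label the direction of change from the sign of the log2 fold change."""
    lfc = log2_fold_change(fold_change)
    if lfc > 0:
        return "up"
    if lfc < 0:
        return "down"
    return "unchanged"


# Illustrative ratios only; these are not values from the actual table.
for fc in (0.5, 0.9, 1.0, 2.0):
    print(f"fold change {fc} -> log2 fold change {log2_fold_change(fc):+.3f} ({direction(fc)})")
```

If the numbers in a table do not follow this pattern, it may be worth asking how the fold change column was defined (for example, which group is the numerator) before drawing conclusions.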

Thank you for your time!

u/jaltj — 2 days ago

▲ 12 r/bioinformatics

Undergrad Learning Single Nuclei/Bioinformatics Part 3: Log Normalization Confusion

Hi guys, me again. I think I have a decent understanding of the tissue-to-sequence process, so now I'm working to learn the analysis portion. I am mostly doing my learning through the sc-best-practices book and a lot of Gemini.

My core question is: How necessary is it to know the different types of log normalizations like shifted normalization, scran normalization and Pearson residuals? How important is it to know the math behind it?

From my understanding, log normalization is used to account for differences in expression between housekeeping genes and lowly transcribed genes, i.e. a housekeeping gene has 10k counts while gene z has only 1-5 counts. It does this by dividing the counts of gene x in cell z by the total counts in cell z, then multiplying by a scale factor. Repeat this across cells and you get a list of normalized expression values. Another question: wouldn't this be computationally intensive if you are doing it across 20k genes and 10k cells?
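
The arithmetic described above can be sketched with NumPy as follows. This is a minimal illustration, not any particular package's implementation: the scale factor of 10,000 and the final `log1p` step are assumptions (the description above only covers the divide-and-scale part), and the toy matrix is made up.

```python
import numpy as np


def log_normalize(counts: np.ndarray, scale_factor: float = 10_000.0) -> np.ndarray:
    """Depth-normalize a cells x genes count matrix, then apply log(1 + x).

    scale_factor is an assumed value chosen for illustration.
    """
    # Total counts per cell, kept as a column so it broadcasts across genes.
    totals = counts.sum(axis=1, keepdims=True)
    # Guard against empty cells to avoid division by zero.
    totals = np.where(totals == 0, 1.0, totals)
    scaled = counts / totals * scale_factor
    return np.log1p(scaled)


# Toy example: 2 cells x 3 genes (a highly expressed gene and two low ones).
counts = np.array(
    [
        [10_000, 5, 0],
        [5_000, 1, 4],
    ],
    dtype=float,
)
normalized = log_normalize(counts)
print(normalized.round(3))
```

On the cost question: 20k genes by 10k cells is 200 million entries, and the whole operation is one division, one multiplication and one logarithm per entry, all vectorized. A dense float64 matrix of that size would take about 1.6 GB, which is one reason count matrices are usually stored in sparse form, where only the non-zero entries need to be touched.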

Also cool news, my PI announced that I could help lead the project and potentially get a first-author paper!!! This would be next year after their paper gets published, so I still have time. I think we will get to practice nuclei isolation in a month or two (a bit nervous but excited).

Anyways, any help or advice would be appreciated!

- Undergrad P_T67

u/Pristine_Temporary67 — 2 days ago

▲ 0 r/bioinformatics

single cell data

I have the following question: I want to do a meta-analysis of single-cell data from different studies. Some studies used the human reference genome hg19 in the alignment step of the raw data, while others used hg38. Will this be a problem when I merge the studies together? If so, how can I overcome it?

u/Current-Shopping-793 — 2 days ago

▲ 60 r/bioinformatics

“Public stress-related or organ-related RNA-seq data sets were added into this analysis and treated as replicates to make our results more robust” in a DE analysis. That’s insane, right?

Treating public datasets as additional replicates of your own experiment is not a good idea, right? Is there any right way to do it? I saw it in an article published in a journal with an IF of ~6 while I was searching for public plant datasets with a good number of replicates, and I could not believe it… or am I missing something??

u/ytmk — 3 days ago

▲ 65 r/bioinformatics+7 crossposts

A preprint titled “Evolutionary shifts in spike glycan-binding specificity suggest a possible association with host adaptation during SARS-CoV-2 Omicron evolution”

We have published a preprint titled “Evolutionary shifts in spike glycan-binding specificity suggest a possible association with host adaptation during SARS-CoV-2 Omicron evolution”.

u/Proof_Strawberry5086 — 3 days ago

▲ 83 r/bioinformatics

Could Claude Science replace bioinformaticians?

I recognize this may be a controversial subject, but I want to hear all sides to the argument. Could Claude Science replace bioinformaticians? https://www.anthropic.com/news/claude-science-ai-workbench

I haven't tried it out yet, but the demo was impressive. Food for thought ¯\_(ツ)_/¯

EDIT: i don't work for anthropic, openAI or any AI company. simply just curious what people think. thanks!

u/PepperCareless724 — 4 days ago

▲ 0 r/bioinformatics

Recommendations for dealing with DESeq2 (DEGs) in a non-model organism.

Hi All,

Hope you can help please:

I'm working on RNA-seq data from a non-model organism. I assembled transcripts with StringTie, performed differential expression with DESeq2, and now have a list of significant DEGs. My transcript IDs in this look like MSTRG.xxx|LOCxxxxxx and MSTRG.xxx.

I have the stringtie_merged.gtf, the reference genome FASTA and DESeq2 results.
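
As a possible first step before annotation, the two ID shapes above can be split so that entries already carrying a reference `LOC` ID are handled separately from purely novel `MSTRG` entries. This is only a sketch: the helper names and the example IDs are hypothetical, and it assumes the `|` character is the only separator between the StringTie ID and the reference ID.

```python
from typing import List, NamedTuple, Optional


class ParsedId(NamedTuple):
    stringtie_id: str            # e.g. "MSTRG.12"
    reference_id: Optional[str]  # e.g. "LOC100001", or None if absent


def parse_deseq_id(row_id: str) -> ParsedId:
    """Split 'MSTRG.x|LOCy' into its parts; a plain 'MSTRG.x' has no reference ID."""
    stringtie_id, _, reference_id = row_id.partition("|")
    return ParsedId(stringtie_id, reference_id or None)


def split_by_annotation(row_ids: List[str]):
    """Return (entries with a reference ID, entries without one)."""
    parsed = [parse_deseq_id(r) for r in row_ids]
    annotated = [p for p in parsed if p.reference_id is not None]
    novel = [p for p in parsed if p.reference_id is None]
    return annotated, novel


# Hypothetical IDs in the two formats described above.
example_ids = ["MSTRG.12|LOC100001", "MSTRG.34", "MSTRG.56|LOC100002"]
annotated, novel = split_by_annotation(example_ids)
print(len(annotated), "with a LOC ID;", len(novel), "novel")
```

With a split like this, only the entries without a reference ID would need sequence-based annotation, while the rest could be looked up through the existing gene IDs.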

I'm now at the annotation and functional analysis stage. What annotation tools/R packages have people found work best for this type of dataset? I am considering BLASTing everything, but would be interested to hear what others use.

Also, for downstream functional analysis, what do people recommend for GO and pathway enrichment? Are there particular R packages or workflows that work well with StringTie/DESeq2 output from a non-model organism?

Thanks in advance for all your help.

u/Realistic_Dig_3714 — 2 days ago

▲ 16 r/bioinformatics

Concern about a possible paper mill for genome-wide identification and characterization studies

Hi,

My main research area is in plant genetics (I'm a bit newer to the field) and I'm becoming pretty confused about the number of gene identification and characterization studies in plants.

For context, if you search for "gene identification and characterization" in PubMed or Google Scholar, you'll see tens of thousands of results for the same type of article, each doing pretty much some combination of

gene identification via BLAST --> chromosomal localization --> multiple sequence alignment and phylogenetic trees --> cis-regulatory elements + protein-protein interaction graphs --> GO term analysis (which is already frequently done by the genome sequencing paper or some auto-annotating software) --> then gene expression profiling of X conditions (either they do it themselves or they retrieve some public screening data)

Maybe I'm misunderstanding this, but isn't everything here in silico (except the expression profiling/stress condition test, and even that seems to be a "we need to do an easy, small wet-lab assay to satisfy the reviewers" step), and couldn't it all be automated? I've heard of some tools like PlantTribes2, Spdev3.0 (or even random preprint pipelines like reactr and bat), but it's also possible for people to find or make their own Snakemake/Nextflow pipeline that automates large segments of this. I think the tools I mentioned are relatively new, but seeing the vast volume of papers that have been published for decades, and that bioinformatics pipelines have existed for just as long, this almost feels like an intentional (or maybe not, I don't know) paper mill operation.

I'm mostly seeing these papers come from "X agriculture/forestry university" type institutions in China, yet they are still getting through peer review at journals with decent impact factors (and they pretty much all cite each other, as they're "building on" the methods framework).

Although this is technically novel information (one could simply mine out millions of papers for thousands and thousands of gene families in millions of cultivars and species), it feels to me like a violation of academia, since it doesn't really feel creative, novel, or like "research."

Thoughts on this?

EDIT: typos, examples, links

u/MaybeTasty5082 — 3 days ago

▲ 0 r/bioinformatics

AutoDock Vina - problems

Hello, when I try to run AutoDock Vina in cmd, this error appears: "This app can't run on your PC"

any solutions?

Thanks

u/Ok-Back-13 — 3 days ago

▲ 0 r/bioinformatics

Our WGS pipeline silently mislabeled 40 samples for three weeks. Nobody noticed until a PI asked why two "unrelated" patients had identical variant calls.

Someone changed a sample sheet template in a shared drive. They just reordered two columns and didn't mention it to anyone. Our pipeline read sample IDs by column position instead of header name, because that's how it was written five years ago and nobody had touched that part since.

Every run after that point silently swapped IDs between adjacent samples. No errors. No warnings. Pipeline finished green, QC metrics looked fine, because the actual sequencing data was perfectly good, just labeled as the wrong person.

We only caught it because two "unrelated" cohort patients came back with identical rare variants and someone actually looked instead of assuming coincidence. Three weeks of production runs, forty samples, every one needed a re-check against the raw sample sheets to figure out who was actually who.

Fix took ten minutes: read by column name, not position, and add a checksum step that flags renamed or reordered sheets before the pipeline touches them.
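
A minimal sketch of that kind of fix, using only the Python standard library. The column names (`sample_id`, `barcode`, `lane`), the helper names and the example sheets are all hypothetical, and the "checksum" here is simply a SHA-256 hash of the header row compared against the expected one; the real pipeline's check may look quite different.

```python
import csv
import hashlib
import io

# Hypothetical expected header for the sample sheet template.
EXPECTED_COLUMNS = ["sample_id", "barcode", "lane"]


def header_fingerprint(columns):
    """Hash the ordered column names so renames and reorders change the digest."""
    return hashlib.sha256("\t".join(columns).encode("utf-8")).hexdigest()


EXPECTED_FINGERPRINT = header_fingerprint(EXPECTED_COLUMNS)


def header_matches(handle) -> bool:
    """Return True only if the header row is exactly the expected template."""
    header = next(csv.reader(handle), [])
    return header_fingerprint(header) == EXPECTED_FINGERPRINT


def read_by_name(handle):
    """Read (sample_id, barcode) pairs by header name, never by position."""
    reader = csv.DictReader(handle)
    missing = {"sample_id", "barcode"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"sample sheet is missing columns: {sorted(missing)}")
    return [(row["sample_id"], row["barcode"]) for row in reader]


# Two toy sheets: the original template and one with two columns swapped.
good = "sample_id,barcode,lane\nS1,ACGT,1\nS2,TTGA,1\n"
reordered = "barcode,sample_id,lane\nACGT,S1,1\nTTGA,S2,1\n"

print(header_matches(io.StringIO(good)))       # template unchanged
print(header_matches(io.StringIO(reordered)))  # template changed, should be flagged
print(read_by_name(io.StringIO(reordered)))    # IDs still correct despite the reorder
```

The two checks are deliberately independent: reading by name keeps the labels correct even if a reorder slips through, while the header hash makes the change visible to a human before any run starts.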

The part that gets me: this bug existed for five years. It only needed someone to reorder two columns once to go off. Every "it's worked fine for years" pipeline has one of these sitting in it somewhere.

Did you have one of those "silent errors"? Curious to hear other stories and learn from them.

u/Shoddy_Card_237 — 4 days ago

▲ 2 r/bioinformatics

How to lower the number of clusters and find the best parameters (Spatial Transcriptomics)

Hi bioinformatics experts!

I am doing a project but I am struggling to find parameters that lower the number of clusters. I am trying different parameters, such as lowering resolution, lambda, and k_geom.

I have tried three combinations so far and they all look similar, so I decided to make this post to get some ideas on how I can lower the number of clusters and make the result clearer.

Trial 1: k_geom = 30 and resolution = 0.5
Trial 2: k_geom = 30 and resolution = 0.1
Trial 3: k_geom = 10 and resolution = 0.5

Parameters I was told to adjust:

Before running BANKSY, there are two important model parameters that users should consider:

k_geom : Local neighborhood size. Larger values will yield larger domains
lambda : Influence of the neighborhood. Larger values yield more spatially coherent domains

I ended up with 44 clusters and I would love to get some insights!

Thank you!

u/Long_Store9792 — 4 days ago

▲ 65 r/bioinformatics

AutoDock Vina results with HO-2-IN-1 with COX-2

Hi everyone,

I'm an 18-year-old student from India who's recently become fascinated by computational biology. My background is stronger in mathematics than biology, and I'm very new to molecular docking, protein structures, and computational drug discovery.

I've started experimenting with AutoDock Vina using publicly available protein structures as a way to learn. I know enough to realize that I don't know enough, so I'm here to understand how to interpret my results correctly rather than make claims.

As one example, I docked Heme Oxygenase-2-IN-1 against COX-2 (PDB: 5IKR) and got a best docking score of -8.385 kcal/mol. Since this surprised me, I'd like to understand whether this could be due to blind docking, the scoring function, protein structure choice, or something else.

I'd really appreciate any corrections, reading recommendations, or advice on what I should learn next.

u/ArkaneelRoy — 4 days ago

▲ 22 r/bioinformatics

Tips for staying organized

Has anyone here been the sole bioinformatician in an academic lab after finishing their PhD?
I’m about to start such a role, and I’d love to hear about your experience.
How do you organize your projects when you’re supporting multiple people at once?
How do you keep track of requests, analyses, deadlines, and ongoing collaborations? Are there any tools that make your life much easier?
I’d appreciate any advice or lessons you wish you’d known when you started. Thanks!

u/wonder3756 — 4 days ago

▲ 4 r/bioinformatics

Professional Indemnity Insurance as an independent contractor bioinformatics researcher?

I will shortly have to start working as an independent contractor, taking the company I did my curricular internship with as my client. Do I need PI insurance?

u/AdOk3759 — 3 days ago