u/Spare-Association714

Day 1 of posting unknown human DNA BLAST

GRCh38 has 603 documented assembly gaps. I pulled the real flanking DNA on both sides of each gap from UCSC, trained a 5th-order Markov model on those borders, and grew sequences outward until the k-mer profiles converged with the downstream border.
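The post does not include its implementation, so the following is only a minimal sketch of the general idea as described: fit an order-5 Markov model to flanking sequence, then sample new bases one at a time. The function names (`train_markov`, `grow`), the toy flank strings, the uniform fallback for unseen contexts, and the fixed target length are all assumptions for illustration. The convergence check against the downstream border is left out.

```python
import random
from collections import Counter, defaultdict

ORDER = 5  # 5th-order model: each base depends on the previous 5 bases


def train_markov(flanks, order=ORDER):
    """Count which base follows each `order`-length context in the flanks."""
    counts = defaultdict(Counter)
    for seq in flanks:
        seq = seq.upper()
        for i in range(len(seq) - order):
            context = seq[i:i + order]
            nxt = seq[i + order]
            # Skip windows containing N or other non-ACGT symbols.
            if nxt in "ACGT" and set(context) <= set("ACGT"):
                counts[context][nxt] += 1
    return counts


def grow(model, seed, n_bases, rng, order=ORDER):
    """Extend `seed` by `n_bases`, sampling each base from the model."""
    out = list(seed.upper())
    for _ in range(n_bases):
        context = "".join(out[-order:])
        options = model.get(context)
        if options:
            bases, weights = zip(*sorted(options.items()))
            out.append(rng.choices(bases, weights=weights)[0])
        else:
            # Assumption: an unseen context falls back to a uniform draw.
            out.append(rng.choice("ACGT"))
    return "".join(out)


# Made-up toy flanks standing in for the real upstream/downstream borders.
left_flank = "ACGTTAGCTAGGATCCGATCGATTACAGGCTAGCTAACGT"
right_flank = "TTGACCGATAGCTAGGATCCTTAGCGATCGGATTACAGG"
model = train_markov([left_flank, right_flank])
filled = grow(model, left_flank[-ORDER:], 60, random.Random(0))
print(filled)
```

Because each base is sampled, a different random seed produces a different fill for the same flanks.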

Results: 459 gaps filled, 46.3 million bp. 70/70 BLAST queries against the full nt database: zero matches. Not human, not primate, not anything.

Here's one — a 30-million-base heterochromatin gap on chrY that GRCh38 leaves as Ns:

```
>chrY:26,673,214-56,673,214 | BLAST NOVEL
AGGCCTAGTGCTGGCTGTGTGTGTGCCTGTGCTCCAGGCTGGTCTCGAGCTCA
AGCAATCCTCCCACCTCAGCCTCCCAAGTAGCTGGGACCACAGGCGCGTGCCA
CCACGCCTGGCTAATTTTTGTATTTTTAGTAGAGACAGGGTTTCACCATGTTG
GCCAGGCTGGTCTCGAACTCCTGACCTCAAGTGATCTGCCCGCCTTGGCCTCC
CAAAGTGCTGGGATTACAGGCATGAGCCACCGCGCCCGGCC
```
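The "30-million-base" figure can be checked directly from the coordinates in the header. This is a small sketch that assumes the range is half-open (0-based start, BED-style); under a 1-based inclusive reading the length would be one base larger. `region_length` is a hypothetical helper name.

```python
def region_length(region):
    """Return (chrom, length) for a 'chrom:start-end' string, half-open."""
    chrom, span = region.split(":")
    # Strip thousands separators before converting to integers.
    start, end = (int(x.replace(",", "")) for x in span.split("-"))
    return chrom, end - start


chrom, length = region_length("chrY:26,673,214-56,673,214")
print(chrom, length)  # chrY 30000000
```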

Comparison with T2T-CHM13 at the same coordinates for 50 of the fills: 19.3% mean k-mer overlap, with 45/50 above 10%. The sequences are not identical, but they share structural vocabulary at the same genomic positions. Random sequence against random T2T regions gives <2%.
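The post does not define "k-mer overlap", so here is one plausible reading as a sketch: the fraction of distinct k-mers in the generated sequence that also occur in the T2T-CHM13 sequence being compared. The choice of k = 6 and the names `kmers` and `kmer_overlap` are assumptions.

```python
def kmers(seq, k=6):
    """Set of distinct k-mers in `seq`, skipping windows with non-ACGT symbols."""
    seq = seq.upper()
    return {
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= set("ACGT")
    }


def kmer_overlap(query, reference, k=6):
    """Fraction of the query's distinct k-mers that also occur in the reference."""
    q = kmers(query, k)
    if not q:
        return 0.0
    return len(q & kmers(reference, k)) / len(q)


print(kmer_overlap("ACGTACGTAC", "ACGTACGTAC"))  # 1.0
print(kmer_overlap("AAAAAAAA", "CCCCCCCC"))      # 0.0
```

The value depends on k and on sequence length, so a baseline is only comparable if it uses the same k and similar lengths.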

Is this valid? Am I recovering real genomic architecture, or is there a methodological flaw I'm missing? Roast me.

I'm making a program that uses my own method to find DNA.

I'll respond to everyone. Skeptical? I'll run it and get back to you. If there is a mistake, let's find it!

reddit.com
u/Spare-Association714 — 3 days ago

Day 1 of posting unknown human genome BLAST

I took a documented gap in the human reference genome (chrY heterochromatin, GRCh38). Pulled the real DNA on both sides from UCSC. Used a Markov model to bridge the gap. Submitted the result to NCBI BLAST.

Zero matches. Not human, not primate, not anything in nt.

```
>GAIA_chrY_gap_fill | BLAST NOVEL
AGGCCTAGTGCTGGCTGTGTGTGTGCCTGTGCTCCAGGCTGGTCTCGAGCTCA
AGCAATCCTCCCACCTCAGCCTCCCAAGTAGCTGGGACCACAGGCGCGTGCCA
CCACGCCTGGCTAATTTTTGTATTTTTAGTAGAGACAGGGTTTCACCATGTTG
GCCAGGCTGGTCTCGAACTCCTGACCTCAAGTGATCTGCCCGCCTTGGCCTCC
CAAAGTGCTGGGATTACAGGCATGAGCCACCGCGCCCGGCC
```

44/44 gap fills tested. All NOVEL. 45.6 million bp total across 374 gaps.

Method: one I developed myself.

What am I missing? Roast me.

reddit.com
u/Spare-Association714 — 3 days ago