Day 1 of posting unknown human DNA BLAST
GRCh38 has 603 documented assembly gaps. I pulled the real flanking DNA on both sides of each gap from UCSC, trained a 5th-order Markov model on those borders, and grew sequences outward until the k-mer profiles converged with the downstream border.
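For anyone who wants to poke at the approach, here is a minimal sketch of the loop described above: count next-base frequencies for every 5-base context in the flanks, then sample outward from the upstream border. It is not the actual program. The names `train_markov`, `grow`, and `profile_similarity` are made up for illustration, and the stopping rule (cosine similarity of 6-mer counts over a 200 bp window, threshold 0.5) is an assumption standing in for whatever "converged" means in the real pipeline.

```python
import math
import random
from collections import Counter, defaultdict

ORDER = 5  # 5th-order: the next base depends on the previous 5 bases


def train_markov(seqs, order=ORDER):
    """Count next-base frequencies for every `order`-length context."""
    model = defaultdict(Counter)
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - order):
            context, nxt = seq[i:i + order], seq[i + order]
            if set(context + nxt) <= set("ACGT"):  # skip N and ambiguity codes
                model[context][nxt] += 1
    return model


def kmer_profile(seq, k=ORDER + 1):
    """Counts of every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def profile_similarity(a, b):
    """Cosine similarity between two k-mer count profiles (assumed metric)."""
    dot = sum(a[kmer] * b[kmer] for kmer in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def grow(model, seed, target, max_len, order=ORDER,
         window=200, threshold=0.5, rng=None):
    """Extend `seed` one base at a time by sampling from the model.

    Stops at max_len, at an unseen context, or when the last `window`
    bases look similar enough to `target` (hypothetical convergence test).
    `seed` must be at least `order` bases long.
    """
    rng = rng or random.Random(0)
    out = seed.upper()
    target_profile = kmer_profile(target.upper())
    while len(out) < max_len:
        counts = model.get(out[-order:])
        if not counts:
            break  # context never seen in training data
        bases, weights = zip(*counts.items())
        out += rng.choices(bases, weights=weights)[0]
        if len(out) % window == 0:
            recent = kmer_profile(out[-window:])
            if profile_similarity(recent, target_profile) >= threshold:
                break
    return out


# Toy flanks, NOT real genomic sequence
left_flank = "ACGTTGCA" * 20
right_flank = "TTGACCAG" * 20
model = train_markov([left_flank, right_flank])
filled = grow(model, left_flank[-ORDER:], right_flank, max_len=300)
```

Worth noting about this design: every base it emits is sampled from statistics of the flanks, so the output can only ever reflect the flanks' local composition.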
Results: 459 gaps filled, 46.3 million bp. 70/70 BLAST queries against the full nt database: zero matches. Not human, not primate, not anything.
Here's one — a 30-million-base heterochromatin gap on chrY that GRCh38 leaves as Ns:
```
>chrY:26,673,214-56,673,214 | BLAST NOVEL
AGGCCTAGTGCTGGCTGTGTGTGTGCCTGTGCTCCAGGCTGGTCTCGAGCTCA
AGCAATCCTCCCACCTCAGCCTCCCAAGTAGCTGGGACCACAGGCGCGTGCCA
CCACGCCTGGCTAATTTTTGTATTTTTAGTAGAGACAGGGTTTCACCATGTTG
GCCAGGCTGGTCTCGAACTCCTGACCTCAAGTGATCTGCCCGCCTTGGCCTCC
CAAAGTGCTGGGATTACAGGCATGAGCCACCGCGCCCGGCC
```
T2T-CHM13 comparison at the same 50 coordinates: 19.3% mean k-mer overlap, 45/50 above 10%. Not identical, but sharing structural vocabulary at the same genomic positions. Random sequence against random T2T regions gives <2%.
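To make the overlap number reproducible, here is one way it could be computed. This is a sketch under assumptions: "k-mer overlap" is taken to mean the fraction of distinct k-mers in the generated sequence that also occur in the T2T region, and `k=12` is a placeholder since the value of k isn't stated above. The function names are hypothetical. The `shuffled_baseline` helper is a composition-matched variant of the random control: it shuffles the query itself, which keeps base composition and destroys k-mer structure.

```python
import random


def kmer_set(seq, k):
    """Distinct overlapping k-mers in seq (empty set if seq is shorter than k)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_overlap(query, reference, k=12):
    """Fraction of distinct query k-mers that also occur in reference."""
    q = kmer_set(query, k)
    if not q:
        return 0.0
    return len(q & kmer_set(reference, k)) / len(q)


def shuffled_baseline(query, reference, k=12, trials=20, seed=0):
    """Mean overlap after shuffling the query's bases `trials` times."""
    rng = random.Random(seed)
    bases = list(query.upper())
    total = 0.0
    for _ in range(trials):
        rng.shuffle(bases)
        total += kmer_overlap("".join(bases), reference, k)
    return total / trials
```

Comparing `kmer_overlap(generated, t2t_region)` with `shuffled_baseline(generated, t2t_region)` at the same k shows how much of the overlap survives once composition alone is controlled for. The choice of k matters a lot here: at small k, almost any two sequences share most k-mers.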
Is this valid? Am I recovering real genomic architecture, or is there a methodological flaw I'm missing? Roast me.
I'm making a program that uses my own method to find DNA.
I'll respond to all. Skeptic? I'll run it and get back to you. If there is a mistake, let's find it!