u/AffectionateWar5927

r/django (+2 crossposts)

Scraping is still unsolved. Not because fetching HTML is hard, but because pages are chaotic and LLM calls aren't free.
Throwing a full page at an LLM works. It's also expensive and lazy.
I wanted something smarter. So I asked: what do humans actually pay attention to on a page?
Not just metadata. Not just content. The relationship between the two.
That question became a small tool — DOM Distillation. 🔬
It takes a raw page and returns high-quality, distilled candidates: cleaner input for LLMs, better chunks for vector DBs, more meaningful nodes for graphs.
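
To make "raw page in, distilled candidates out" concrete, here's a rough sketch of the general idea using only Python's standard library. This is not the repo's actual code or API: `CandidateExtractor`, `distill`, and the tag sets are hypothetical names picked for illustration, and it assumes reasonably well-formed markup. It drops obvious noise (scripts, nav, footer) and pairs each text block with the nearest preceding heading.

```python
from html.parser import HTMLParser

# Hypothetical tag sets, chosen for illustration only.
SKIP = {"script", "style", "nav", "footer", "aside"}
HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
BLOCKS = {"p", "li", "td", "blockquote"}


class CandidateExtractor(HTMLParser):
    """Collect text blocks, each tagged with the nearest preceding heading."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0    # > 0 while inside a noise subtree
        self.current = None    # tag of the block being captured, if any
        self.buf = []          # text pieces of the block being captured
        self.heading = ""      # most recent heading text seen so far
        self.candidates = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.skip_depth += 1
        elif (self.skip_depth == 0 and self.current is None
              and tag in HEADINGS | BLOCKS):
            self.current = tag
            self.buf = []

    def handle_endtag(self, tag):
        if tag in SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag == self.current:
            # Collapse runs of whitespace into single spaces.
            text = " ".join("".join(self.buf).split())
            if tag in HEADINGS:
                self.heading = text
            elif text:
                self.candidates.append(
                    {"heading": self.heading, "tag": tag, "text": text}
                )
            self.current = None

    def handle_data(self, data):
        if self.skip_depth == 0 and self.current is not None:
            self.buf.append(data)


def distill(html):
    """Return a list of candidate dicts extracted from an HTML string."""
    parser = CandidateExtractor()
    parser.feed(html)
    parser.close()
    return parser.candidates
```

Each candidate keeps both the content and a bit of structural context (its tag and heading), so a later step can weigh the two together instead of treating the page as flat text.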
The relevance model is loosely inspired by intent-driven chunking, but I built my own spin on how structure and semantics interact.
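
For anyone wondering what "structure and semantics interacting" could look like, here's a toy scoring sketch. It is an assumption on my part, not the relevance model in the repo: the tag weights, the `heading_boost` parameter, and the plain token overlap standing in for real semantic similarity are all made up for illustration.

```python
import re

# Hypothetical structural priors: how much to trust each kind of block.
TAG_WEIGHT = {"p": 1.0, "li": 0.8, "td": 0.6}


def tokens(text):
    """Lowercased alphanumeric tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap(query_tokens, text):
    """Fraction of query tokens that appear in text (crude semantic proxy)."""
    if not query_tokens:
        return 0.0
    return len(query_tokens & tokens(text)) / len(query_tokens)


def score(candidate, query, heading_boost=0.5):
    """Combine body match, heading match, and a structural weight."""
    q = tokens(query)
    body = overlap(q, candidate["text"])
    head = overlap(q, candidate.get("heading", ""))
    structural = TAG_WEIGHT.get(candidate.get("tag", "p"), 0.5)
    return structural * (body + heading_boost * head)


def rank(candidates, query, top_k=3):
    """Return up to top_k candidates with a positive score, best first."""
    scored = [(score(c, query), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for s, c in scored[:top_k] if s > 0]
```

The point of the sketch is only that a block under a matching heading can outrank a block with no match at all, even when its own text says little about the query. In practice you would likely swap the token overlap for embeddings.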

Building the concurrency model was the weird part. It was quirkier than I expected and ended up as a dynamic programming (DP) algorithm. Those are the problems I live for.
It's not fast. It won't replace your existing pipeline everywhere. But in the cases it fits, it fits well.

Still thinking about where this goes. The tool is one thing. The right use case is another.

Finding that use case might be the more interesting problem. 🤔

Repo -> https://github.com/ArnabChatterjee20k/domdistill

u/AffectionateWar5927 — 21 days ago