u/Healthy-Owl1683

I do a lot of RAG ingestion and kept hitting the same annoyances with existing crawlers: token-based pricing that's hard to predict, and output I had to clean up before chunking. So I built a small tool that does just the part I needed.

You give it a start URL. It uses the sitemap if there is one, otherwise follows same-domain links, and returns one clean markdown record per page. Each record includes an estimated token count, so you can see your context budget before ingesting anything. It respects robots.txt and only reads public pages. Pricing is flat per page instead of token credits, which made my costs predictable.
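
For anyone curious what that discovery flow looks like in code, here is a minimal Python sketch using only the standard library. It is not the tool's actual implementation: the function names (`urls_from_sitemap`, `same_domain_links`, `allowed_by_robots`, `estimate_tokens`) are hypothetical, and the 4-characters-per-token ratio is a rough rule of thumb I'm assuming, not the estimator the tool uses. The functions take already-fetched text so the example runs offline.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser

# XML namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def urls_from_sitemap(sitemap_xml: str) -> list[str]:
    """Return every <loc> URL listed in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]


class _LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def same_domain_links(page_url: str, html: str) -> list[str]:
    """Fallback when there is no sitemap: keep links on the page's own host."""
    collector = _LinkCollector()
    collector.feed(html)
    host = urlparse(page_url).netloc
    seen, links = set(), []
    for href in collector.hrefs:
        # Resolve relative links and drop #fragments so pages dedupe cleanly.
        absolute = urljoin(page_url, href).split("#", 1)[0]
        if urlparse(absolute).netloc == host and absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links


def allowed_by_robots(robots_txt: str, url: str, agent: str = "*") -> bool:
    """Check a URL against the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


def estimate_tokens(markdown: str) -> int:
    """Assumed heuristic: roughly 4 characters per token for English text."""
    return max(1, len(markdown) // 4)
```

A crawler built on these would try `urls_from_sitemap` first, fall back to `same_domain_links` from the start URL, skip anything `allowed_by_robots` rejects, and attach `estimate_tokens(markdown)` to each record. A real tokenizer would give tighter counts than the character heuristic.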

Honest limitation: it only fetches the HTML the server returns and doesn't run JavaScript, so pages that render client-side come back mostly empty. Docs sites, blogs, and most content sites work well. A browser-rendering mode is next on my list.

It's my own tool, so feel free to be critical. I'd genuinely like to know what's missing for your pipeline. https://apify.com/adambounhar/site-to-knowledge-base

Built a tool that turns a docs site into LLM-ready markdown, one record per page with token counts