r/thewebscrapingclub

▲ 7 r/thewebscrapingclub+1 crossposts

If you've ever cried at 2am because Cloudflare ate your scraper, this post is for you

Hey r/thewebscrapingclub,

I'm a solutions engineer at Intuned. We build a platform for running browser automations and scrapers in production — Playwright-based, with the infra stuff (proxies, captcha handling, retries, scheduling, storage) handled for you so you can focus on the actual scraping logic.

We're opening up free access and I'd genuinely like feedback from people who do this work day-to-day. Specifically curious what you think about:

- The dev experience vs. rolling your own Playwright + proxy stack

- How it compares to Apify / Browserless / Browse AI for your use cases

- What's missing that would make you actually switch

Not looking for fake praise — if it sucks for your workflow, I want to know why. I spend my days helping customers scrape stuff like government procurement portals, so I've seen what breaks in the real world.

Link in comments to avoid the spam filter. Happy to answer questions about the internals (anti-bot stuff, captcha pipelines, fingerprinting) — that's the part I find most interesting anyway.

Happy to chat in DMs too.

u/Chance-Drink9651 — 4 days ago
▲ 41 r/thewebscrapingclub+1 crossposts

Just open-sourced my personal scraping engine: tiny self-contained binary with Lua scripting

I originally built it for myself because I wanted something extremely lightweight that runs in the background like it never existed. It's called SpyWeb.

It's designed to be "set and forget." I've had it running for months on my PC tracking job boards without a single crash or memory leak.

Specific features:

  • Zero Runtime: Self-contained ~7MB binary. No Python, Node, or Docker needed.
  • Low Footprint: Uses <5MB RAM at idle.
  • Lua Scripting: Use Lua to handle complex logic like custom headers, JS rendering, advanced monitoring, etc.
  • Hot Reloading: Change a config or Lua script and the job respawns instantly, no restarts.
  • Web Dashboard: Simple local UI to monitor scrape data in real-time.
  • Desktop Alerts: Built-in support for system notifications and webhooks.
  • Embedded DB: Built-in KV store so you don't need a separate database.
  • CDP Support: Controls any Chromium or CDP-compatible browser via Lua for JS-heavy sites.
  • Dual Mode: CLI for servers and a System Tray version for silent background runs.
  • Deduplication: Internal database ensures you never see the same result twice.
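For anyone wondering how the deduplication bullet can work in principle, here is a minimal TypeScript sketch. It is not SpyWeb's implementation (SpyWeb is scripted in Lua and the post doesn't describe its internals). It assumes each scraped item can be reduced to a stable key, and it uses an in-memory Map as a stand-in for an embedded KV store. `ScrapedItem`, `SeenStore`, `itemKey`, and `filterNew` are hypothetical names.

```typescript
// Hypothetical sketch of "never see the same result twice".
// An in-memory Map stands in for an embedded KV store.
interface ScrapedItem {
  url: string;
  title: string;
}

class SeenStore {
  // key -> timestamp (ms) of the first time the key was seen
  private seen = new Map<string, number>();

  // Returns true the first time a key is recorded, false on repeats.
  markIfNew(key: string, now: number = Date.now()): boolean {
    if (this.seen.has(key)) return false;
    this.seen.set(key, now);
    return true;
  }
}

// Normalize the fields that define identity so trivial differences
// (surrounding whitespace, title casing) don't produce duplicates.
function itemKey(item: ScrapedItem): string {
  return `${item.url.trim()}|${item.title.trim().toLowerCase()}`;
}

// Keep only items not seen in this batch or any earlier one.
function filterNew(store: SeenStore, items: ScrapedItem[]): ScrapedItem[] {
  return items.filter((item) => store.markIfNew(itemKey(item)));
}

const store = new SeenStore();
const firstRun = filterNew(store, [
  { url: "https://example.com/jobs/1", title: "Backend Engineer" },
  { url: "https://example.com/jobs/2", title: "Data Engineer" },
]);
const secondRun = filterNew(store, [
  { url: "https://example.com/jobs/2", title: "Data Engineer " }, // repeat
  { url: "https://example.com/jobs/3", title: "SRE" },
]);
// firstRun keeps both items; secondRun keeps only the jobs/3 entry.
```

A persistent version would swap the Map for an on-disk store so the "seen" set survives restarts. The choice of key fields decides what counts as "the same result".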

I just released the beta with CDP integration. If you need something that just sits in the background and sips resources while actually being maintainable, check it out.

Setup is straightforward: for server-side rendered pages, it's just a few lines of config (URL, selectors, fields). For JS-heavy sites, you can write a little Lua to launch a browser and drive the workflow.

You can check it out here: https://github.com/spyweb-app/spyweb

u/Additional-Elk-3712 — 6 days ago
▲ 131 r/thewebscrapingclub+1 crossposts

A stealth Playwright (Firefox) version that passes all anti-bot and CAPTCHA checks

Hey guys,
I’ve been working on browser automation that can actually survive modern anti-bot systems (especially for AI agents).
So I created a fork of Playwright for Firefox patched directly at the C++ level. It generates a different but internally consistent fingerprint per session:
• CreepJS → 0% fake
• reCAPTCHA v3 → Score 0.90
• hCaptcha → Pass
• Fingerprint Pro → bot=false, tampering=false

Repo: https://github.com/feder-cr/invisible_playwright

If you’re fighting heavy anti-bot protection or building resilient agents, I’d love to hear your thoughts or test results. Feedback, issues, and contributions are very welcome!

Thanks in advance 🚀

u/Elieroos — 8 days ago
▲ 9 r/thewebscrapingclub+1 crossposts

How do you tell if failures are caused by bad proxies or bad automation?

I'm dealing with a recurring problem where automated jobs fail inconsistently when proxies are involved.

Sometimes the browser test passes locally but fails in CI. Sometimes the request works without a proxy but times out with one. Sometimes one proxy provider works fine for one domain but performs terribly on another.

For me, the hard part right now is diagnosis. I don't want to waste hours debugging selectors, waits, or test code if the real issue is proxy quality.

For those using proxies with Playwright, Selenium, scraping tests, or geo-based QA checks, what's your process for proving whether the proxy is the problem?

Do you benchmark providers before adding them to your automation stack? What metrics are actually useful?

I'm thinking:

  • success rate
  • median and p95 response time
  • timeout frequency
  • CAPTCHA/block rate
  • repeatability over time
  • results per target site, not just generic speed

Is there a standard way to test this properly?
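The metrics listed above can be sketched as a small TypeScript summary, assuming every attempt is logged with a target, a latency, and flags for success, timeout, and block. This is one possible shape and not a standard tool. The `RequestAttempt` fields, `percentile`, `summarize`, and the sample numbers are all made up for illustration. The direct (no-proxy) attempts act as a control, so the same automation code is measured on both paths.

```typescript
// One record per request attempt. All field names here are hypothetical.
interface RequestAttempt {
  target: string;    // site under test
  viaProxy: boolean; // false = direct control request from the same machine
  ok: boolean;       // response contained the expected content
  timedOut: boolean;
  blocked: boolean;  // CAPTCHA or block page detected
  latencyMs: number;
}

interface ProxySummary {
  attempts: number;
  successRate: number;
  timeoutRate: number;
  blockRate: number;
  medianMs: number; // over completed (non-timeout) attempts only
  p95Ms: number;
}

// Nearest-rank percentile over an ascending-sorted array.
function percentile(sorted: number[], p: number): number {
  if (sorted.length === 0) return NaN;
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1] ?? NaN;
}

function summarize(attempts: RequestAttempt[]): ProxySummary {
  const n = attempts.length;
  const rate = (pred: (a: RequestAttempt) => boolean): number =>
    n === 0 ? NaN : attempts.filter(pred).length / n;
  const completed = attempts
    .filter((a) => !a.timedOut)
    .map((a) => a.latencyMs)
    .sort((x, y) => x - y);
  return {
    attempts: n,
    successRate: rate((a) => a.ok),
    timeoutRate: rate((a) => a.timedOut),
    blockRate: rate((a) => a.blocked),
    medianMs: percentile(completed, 50),
    p95Ms: percentile(completed, 95),
  };
}

// Made-up sample data for one target: four direct and four proxied attempts.
const attempt = (
  viaProxy: boolean, ok: boolean, timedOut: boolean, blocked: boolean, latencyMs: number,
): RequestAttempt => ({ target: "shop.example", viaProxy, ok, timedOut, blocked, latencyMs });

const sample: RequestAttempt[] = [
  attempt(false, true, false, false, 100),
  attempt(false, true, false, false, 120),
  attempt(false, true, false, false, 110),
  attempt(false, true, false, false, 130),
  attempt(true, true, false, false, 300),
  attempt(true, true, false, false, 400),
  attempt(true, false, true, false, 10000), // timeout
  attempt(true, false, false, true, 350),   // block page
];

const direct = summarize(sample.filter((a) => !a.viaProxy));
const proxied = summarize(sample.filter((a) => a.viaProxy));
// direct:  successRate 1,   medianMs 110, p95Ms 130
// proxied: successRate 0.5, timeoutRate 0.25, blockRate 0.25, medianMs 350, p95Ms 400
```

Read this way, a target where the direct path succeeds and the proxied path times out or gets blocked points at the proxy. If both paths fail the same way, the selectors, waits, or test code are the more likely suspects. Grouping by `target` before summarizing gives the per-site view.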

u/Beardybear93 — 9 days ago

Are mobile proxies best for social media scraping?

Been looking into mobile proxies for scraping social platforms and the price jump over residential is pretty significant. Wondering if it's actually necessary or if good residential proxies do the same job. Do platforms like Instagram or TikTok detect residential IPs differently than mobile? What are you using for this?

u/SorinxD — 9 days ago

I built a Web-Scraper API that is 6-7x more efficient than current ones

Runo is a web-scraping API that returns typed, structured JSON. You define a schema (field name, type, example value), and Runo fetches the page and returns the data. No HTML, no parsers, no post-processing.

Over the past few weeks, I have been building this nonstop. Currently, every scraper API out there solves the site-fetching problem but leaves the extraction of the actual data entirely to users. Runo makes that step disappear.

For Runo, I went ahead and added JS rendering, stealth mode, and full LLM extraction to make it fully functional and capable of scraping most, if not all, sites.

Also, another major problem with current web scrapers is that they charge per feature or bundle them into expensive credit tiers. A single large or JS rendered request can cost 5-75 credits, which means you essentially get nothing out of their plans. Runo is flat per request, no matter the site. At the Scale tier, Runo works out to $0.90 per 1,000 effective requests vs. around $6 for the nearest Firecrawl equivalent. My jaw dropped when I was testing Runo and came across these numbers.
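To make the credit arithmetic concrete, here is a small TypeScript sketch. The plan in it ($100 for 100,000 credits) is hypothetical and not any provider's real pricing. Only the 5-75 credit range and the $0.90 and $6 per-1,000 figures come from the paragraph above. `costPerThousand` is an illustrative helper name.

```typescript
// Effective USD cost per 1,000 requests on a credit-based plan.
function costPerThousand(
  planPriceUsd: number,
  planCredits: number,
  creditsPerRequest: number,
): number {
  return (planPriceUsd * 1000 * creditsPerRequest) / planCredits;
}

// Hypothetical plan: $100 for 100,000 credits.
const simplePage = costPerThousand(100, 100000, 1);  // 1 credit per request  -> $1 per 1,000
const jsRendered = costPerThousand(100, 100000, 5);  // low end of the range  -> $5 per 1,000
const worstCase = costPerThousand(100, 100000, 75);  // high end of the range -> $75 per 1,000

// Ratio of the two per-1,000 figures quoted above ($6 vs. $0.90).
const quotedRatio = 6 / 0.9; // roughly 6.7
```

The ratio of the quoted figures lands between 6 and 7, which lines up with the "6-7x" in the title if "efficient" is read as cost per effective request.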

You can check it out here. I created a free tier that is 500 requests/month, no credit card required. Take it for a spin and let me know what can be improved. I would love feedback.

u/kimotheapple — 7 days ago
▲ 5 r/thewebscrapingclub+1 crossposts

What is your opinion on AI agents for web scraping?

AI agents can help get the ball rolling, but I don’t think they work as the final approach.

I’ve seen people treat them like they can just hand over a finished scraper on the first go. The first draft might look decent, but once you test it you still have to clean up the logic and figure out what it misunderstood.

Sometimes the back-and-forth takes just as long as writing it yourself. At the end of the day, it's still just a tool to help fill some gaps, and it shouldn't be blindly trusted.

u/BlueLagoon226 — 8 days ago
▲ 77 r/thewebscrapingclub+2 crossposts

What happens when you build a browser that is identical to Chrome but whose purpose is scraping

I built a real C++ browser and gave you a TypeScript library to control it — here's why it changes scraping

Most tools like Puppeteer and Playwright bolt automation onto Chrome from the outside. They're always playing catch-up with anti-bot systems.

I took a different approach. I built the actual browser — Qt6 + Chromium engine, written in C++. Then I wrote a TypeScript library (Piggy) that controls it over a local socket. That's why Cloudflare bypasses are almost trivial and the code stays dead simple.

Two repos, one ecosystem:

🖥️ Nothing Browser (the C++ browser) https://github.com/BunElysiaReact/nothing-browser

📦 Piggy (the TS library) — https://github.com/ernest-tech-house-co-operation/nothing-browser

What you get out of the box:

🪪 Persistent TLS fingerprint identical to real Chrome — sites can't profile you

🧠 Human Mode — randomized delays, natural scrolling, no robotic timing

⚡ Socket-based IPC — millisecond latency between your script and the browser

🌐 Remote deployment — binary runs on a VPS, you scrape from local

💾 Session persistence — save/restore cookies and storage, stay logged in

🏊 Tab pooling — concurrent requests inside one browser instance

🚀 Built-in API server — one line turns your scraper into a REST endpoint with OpenAPI docs

🔄 Proxy rotation — built-in fetch, test, switch, rotate

The code looks like this:

```ts
import piggy from "nothing-browser";

await piggy.launch();
await piggy.register("books", "https://books.toscrape.com");
await piggy.books.navigate();

const books = await piggy.books.evaluate(() =>
  Array.from(document.querySelectorAll(".product_pod")).map(el => ({
    title: el.querySelector("h3 a")?.getAttribute("title") ?? "",
    price: el.querySelector(".price_color")?.textContent?.trim() ?? "",
  }))
);

console.log(books);
await piggy.close();
```

That's a real browser. Not a wrapper around someone else's.

Bun-first but Node compatible. Headless and headful ship as separate binaries so you're not carrying GPU overhead when you don't need it.

📚 Docs: https://nothing-browser-docs.pages.dev

Would love issues, feedback, and ⭐ stars — built in Kenya 🇰🇪

u/PeaseErnest — 12 days ago

built a browser MCP because every other one stunk, especially for scraping work

i scrape a lot. fifty plus sources, anti-bot stacks, login walls, geo gates. spent months copy-pasting HTML and headers into Claude/Cursor because they couldn't see the page themselves. they'd guess from my secondhand summary and get it wrong. just bringing them up to speed on a new source took forever.

tried every browser MCP out there. all stunk for the same reason.

  • Anthropic's Chrome extension. sandbox, macOS only, screen has to be awake. only works inside Claude.
  • Playwright MCP. empty Chromium, not your Chrome. re-auth from scratch. local only.
  • Browserbase / Stagehand. decent, but cloud Chromium from a datacenter IP. for scraping that's suicide. you lose your fingerprint, your residential IP, the whole moat.
  • BrowserMCP (open source). real browser via extension, gets that right. local stdio only. one tab. half-built.

so i built Reins: https://reins.vulcanos.pro

the thing nobody else does: hosted, but drives your real Chrome. Browserbase is hosted but cloud. BrowserMCP is your browser but local. Reins is both. extension in your actual Chrome with your real cookies, fingerprint, residential IP. MCP server is hosted so it works from Claude Code, Cursor, Zed, web Claude, anywhere over OAuth.

what that gets you:

  • your own session does the work. anti-bot sees your real fingerprint, real IP, warm cookies, normal mouse. nothing looks like a bot because nothing is a bot.
  • gated sources stop being special. SSO, geo-locked, login walled. you log in once like a human, agent runs on top.
  • multi-profile, one account. split work across profiles for ip diversity or regional accounts, pick from your MCP client. nobody else does this.
  • dumps can live remote. HARs, full DOMs, network logs stored off your laptop, LLM pulls on demand from any client.
  • runs anywhere MCP runs. every other "real browser" tool is local stdio that dies when you close your terminal.

install: https://chromewebstore.google.com/detail/reins/ifnmhlnmioieckkknedkikfbpkhkfpdi

my brother also uses it. takes his school quizzes, hunts apartments, does his online shopping. totally different use case, works because it's his browser, already logged into everything.

free tier covers normal use. only hit metered if you scrape at scale and want dumps off your local disk.

DM me if you have any questions

u/NoTicket660 — 11 days ago
▲ 12 r/thewebscrapingclub+2 crossposts

If you have been looking for a no-browser alternative, feel free to give this a go!

Fast and lightweight.

Would love feedback or bug reports if you run it against anything weird.

u/jinef_john — 10 days ago