r/WebScrapingInsider

Anyone else think the "best Pornhub proxy 2026" lists are mostly garbage now?

Trying to understand what people are actually using now that more states/countries are blocking sites or adding age verification stuff.

Most "top proxy" articles feel fake or SEO spam. Half the free proxies either die instantly or redirect somewhere sketchy.

I tested a couple browser proxies and one VPN and honestly the experience was all over the place. Some worked for 10 mins then got blocked again.

Curious what people are actually doing in 2026. VPN? Residential proxies? Tor? Mirror domains? Something else?

Mostly asking because I'm learning more about networking/privacy stuff and this rabbit hole got weird fast lol.

reddit.com
u/doubledweeb — 19 hours ago

Amazon scraping was working perfectly fine, until suddenly everything broke.

Everything was going well for a while. The product data seemed stable, the requests were working fine, and the scraper had finally become reliable enough that it didn't need to be checked all the time. Then, out of the blue, the problems started to reappear.

Some product details disappeared, some requests stopped working properly, and parts of the workflow that had been stable for weeks became unreliable overnight. The frustrating thing is, everything can break again very quickly, without any warning.

And to be honest, keeping the scraper running right now seems to be harder than it was when it was first built. Has anyone else encountered this problem recently?

u/Whole-Flatworm-7609 — 24 hours ago
▲ 48 r/WebScrapingInsider+1 crossposts

I built klura, a toolkit for an AI agent to reverse-engineer websites

Hi r/webscraping,

I've been working on klura, a free toolkit that gives a coding agent (Claude Code, Cursor, Claude Desktop, any MCP host) the ability to reverse-engineer a website. The agent drives a browser to complete a task once. Klura captures everything the page does underneath, then runs what I call LIFT (Learn Interface From Traffic, the analysis pass): it extracts the real underlying requests and saves them as a readable, LLM-annotated JSON config that can be run later without driving the UI. The JSON config can be analyzed to understand how the site works, and can also be copied and run on other devices.

There's prior art in this space — capture-and-replay isn't a new idea. The main places klura tries to push it forward:

  • page-script tier: here the saved strategy is a snippet of JavaScript that the AI codes for you. klura injects it into the live, already-authenticated page and runs it there. The pure-HTTP approach (the "convert to a requests / curl script" pattern) hits a wall on sites that bind the request to in-page state: rotating per-session CSRF tokens (GitHub's X-Fetch-Nonce is one public example), request bodies built by a JS function in the bundle, binary WebSocket transports. Reproducing those from outside the browser means porting the site's own signing/encoding logic, which is fragile and often impossible. Page-script sidesteps it: the saved JS calls the page's own functions (the request signer, the encoder, the socket the page already has open), so the site does the hard part and klura just collects the result. And it isn't slow the way browser automation usually is: klura keeps a warm pool of already-open, already-authenticated pages (configurable, of course), so you don't pay browser startup per call; a page-script run is dominated by the actual request and lands close to plain-curl latency.

  • A real RE + debugging toolkit the agent drives: most tools capture traffic and pattern-match. Klura hands the agent an actual JS debugger: breakpoints, stepping, live-stack inspection, WebSocket frame tooling, etc. It reverse-engineers a site the way a human would: break on the function that builds the request, read the encoder, verify it reproduces. You never touch any of it; you ask in plain language. Call it vibe-RE. (Detail in How the discovery works below.)

  • Self-healing: when a site changes shape and a saved strategy breaks, klura re-discovers the delta and patches the strategy automatically, instead of silently returning garbage or making you re-record from scratch.

  • Runs anywhere, hands off to a human when it must: klura is an MCP server, so the same saved skill is callable from Claude Code, Claude Desktop, Cursor, Windsurf, the CLI, or a programmatic import. Skills live globally at ~/.klura/skills/<platform>/, not per-project. When a site genuinely needs a human (login, 2FA, captcha), it escalates to a remote viewer: you do that one step, the agent continues, and the saved strategy runs in the resulting authenticated session.

  • Pluggability: the browser driver is swappable. The default is Playwright, klura ships a stealth driver that fixes the standard automation-fingerprint leaks (navigator.webdriver, canvas/WebGL consistency), and you can drop in your own. The interruption layer is pluggable too: when a site throws something up mid-flow (login, captcha), the handler that decides what happens is yours to define.

The artifact

Every skill is a single JSON file on disk. A fetch-tier example for a public search API:

{
  "strategy": "fetch",
  "method": "GET",
  "baseUrl": "https://hn.algolia.com",
  "endpoint": "/api/v1/search",
  "params": {
    "query":       "{{query}}",
    "tags":        "story",
    "hitsPerPage": "{{count}}"
  },
  "response": { "format": "json", "extract": "hits[]" }
}

The LLM that produced it leaves notes fields explaining the why of each parameter, the prereq chain, and what shape the response has. Read the file → you've reverse-engineered the site.

Execution path

Once saved, run it via klura (recommended):

klura execute hackernews search_stories --args '{"query":"show hn","count":3}'

Or you could ask your LLM again; it will prefer executing already-created strategies over creating new ones.

For page-script strategies the saved JS has to run inside the live authenticated page, so that path needs the klura runtime regardless of how you wire it up. For fetch strategies the saved JSON file gives you method + URL + body/header templates — slot in your params/cookies and curl away.
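A minimal sketch of the "slot in your params" step for a fetch-tier skill, assuming the JSON shape shown in the example above. The names FetchSkill, fillTemplate, and renderUrl are illustrative and are not part of klura; the real runtime may handle templates, headers, and prereq chains differently.

```typescript
// Shape of a fetch-tier skill, inferred from the example JSON (not klura's published schema).
interface FetchSkill {
  strategy: "fetch";
  method: string;
  baseUrl: string;
  endpoint: string;
  params: Record<string, string>;
}

type SkillArgs = Record<string, string | number>;

// Replace every {{name}} placeholder with the matching argument.
function fillTemplate(template: string, args: SkillArgs): string {
  return template.replace(/\{\{(\w+)\}\}/g, (_match: string, name: string) => {
    if (!(name in args)) {
      throw new Error(`missing argument: ${name}`);
    }
    return String(args[name]);
  });
}

// Build the final request URL from the base URL, endpoint, and filled-in params.
function renderUrl(skill: FetchSkill, args: SkillArgs): string {
  const query = Object.keys(skill.params)
    .map((key) => {
      const value = fillTemplate(skill.params[key], args);
      return `${encodeURIComponent(key)}=${encodeURIComponent(value)}`;
    })
    .join("&");
  return `${skill.baseUrl}${skill.endpoint}?${query}`;
}

const searchStories: FetchSkill = {
  strategy: "fetch",
  method: "GET",
  baseUrl: "https://hn.algolia.com",
  endpoint: "/api/v1/search",
  params: { query: "{{query}}", tags: "story", hitsPerPage: "{{count}}" },
};

const renderedUrl = renderUrl(searchStories, { query: "show hn", count: 3 });
// Prints a curl command for:
// https://hn.algolia.com/api/v1/search?query=show%20hn&tags=story&hitsPerPage=3
console.log(`curl '${renderedUrl}'`);
```

Failing loudly on a missing argument is a deliberate choice here: a half-filled template would otherwise send a literal {{count}} to the API.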

Three execution tiers

| Tier | What runs | When it lands | Curl-able? |
|---|---|---|---|
| fetch | Static HTTP from Node, templated body/headers + prereq chain | Plain APIs (HN search, Amazon /s?k=, GitHub GETs) | yes |
| page-script | JS dispatched inside the live authenticated browser tab | Sites that bind requests to in-page state | no (page state needed) |
| recorded-path | UI action replay through the driver | Fallback when neither above is reconstructible | no |

How the discovery works

There's no recorder UI implemented or planned at the moment. The agent itself drives the browser. You tell your MCP host in plain language — "search HN for the top posts this week using klura" — and that's it. No script to write, no breakpoints to set, no JS to read. You could call it vibe-RE: you describe what you want, klura performs it and figures out how the site does it. You can also ask klura to map out a site, if you just want it to observe the API without performing a specific action.

For readers who want the how — on hard sites, the agent has access to a full RE toolkit (none of which you ever touch as a user):

  • JS debugger: set_breakpoint, step, resume_execution, wait_for_pause — break on the function the page calls before dispatch, read the live stack, see the encoded payload the page is about to send.
  • JS source navigation: search_js_source, read_js_function, list_loaded_scripts — find the function in the bundle that builds the body or signs the request.
  • WebSocket frame tooling: inspect_ws_frame, find_in_ws_frame, explain_ws_frame_structure, get_send_encoder, try_generator_in_page — for binary protocols, capture frames live and verify the captured encoder reproduces the wire format before saving.
  • Network log (get_network_log) for HTTP and WS in one feed, filterable.
  • Live page inspection: a11y tree, screenshots, page text, arbitrary js_eval.

This is how klura one-shots genuinely hard flows — for instance a mainstream chat app whose message-send rides a binary MQTT-style WebSocket, with an in-page codec, snowflake IDs beyond Number.MAX_SAFE_INTEGER, and a per-session token that rotates. The agent debugs the way a human would: set a breakpoint at the call site, read the actual encoder, verify it reproduces the wire output, save it as a page-script. The same toolkit generalizes to anything that builds a signed or encoded payload in-page — rotating CSRF on graphql endpoints, signed URL params, custom binary framings.
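To illustrate why IDs beyond Number.MAX_SAFE_INTEGER are awkward to handle outside the page, here is a small sketch. The ID values are made up, and the helper names are illustrative; the point is only that such IDs have to stay as strings or BigInt values, because a plain JavaScript number cannot represent them exactly.

```typescript
// True when a decimal ID string survives a round trip through a JS number.
function survivesNumberRoundTrip(id: string): boolean {
  return String(Number(id)) === id;
}

// Compare two decimal ID strings numerically without losing precision.
function compareIds(a: string, b: string): number {
  const x = BigInt(a);
  const y = BigInt(b);
  return x < y ? -1 : x > y ? 1 : 0;
}

const safeId = "9007199254740991";         // equals Number.MAX_SAFE_INTEGER
const snowflakeId = "1234567890123456789"; // made-up 19-digit ID
const nextSnowflakeId = "1234567890123456790";

console.log(survivesNumberRoundTrip(safeId));      // true
console.log(survivesNumberRoundTrip(snowflakeId)); // false

// As plain numbers the two neighbouring IDs collapse into the same value...
console.log(Number(snowflakeId) === Number(nextSnowflakeId)); // true
// ...while BigInt keeps them distinct and ordered.
console.log(compareIds(snowflakeId, nextSnowflakeId)); // -1
```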

Once on disk, the next call hits the saved skill directly — or you cat the JSON and write your own scraper.

Cost: LLM once, then never

Klura runs the LLM during discovery, then never again. After LIFT the saved skill is a plain request (or a page-script dispatch): no model, no agent loop, no tokens. The only time the LLM needs to be brought back in is when self-healing is required.

A couple of concrete runs, same task, measured against a plain browser agent on the same model: a Hacker News search that costs the browser agent ~20s and ~$0.09 every time runs from the saved skill in ~270ms at zero token cost. A mainstream chat app's message-send took ~27 minutes on the first run — that's the agent reverse-engineering the binary WebSocket — but every send after is a sub-millisecond dispatch into the authenticated page. The one-time discovery cost scales with site difficulty; the warm cost is always ~zero.

Install

# As an MCP server (Claude Code)
claude mcp add klura -- npx -y @klura/mcp

As an MCP server, it also works with many other harnesses; check the documentation for yours to see how to install it.

Repo: klura · npm: @klura/runtime, @klura/mcp

Currently at 0.2.2. Real-world feedback welcome — especially the shape of the failures.

u/rundfunk — 2 days ago

Best podcasts for web scraping / data engineering practitioners?

Hey everyone,

Looking for podcast recommendations from people actually working in the field.

Specifically interested in:

  • Web scraping and anti-bot
  • Data engineering and pipelines
  • Proxy infrastructure and web data collection

What are you listening to?

u/shasedoge — 3 days ago
▲ 7 r/WebScrapingInsider+1 crossposts

If you've ever cried at 2am because Cloudflare ate your scraper, this post is for you

Hey r/thewebscrapingclub,

I'm a solutions engineer at Intuned. We build a platform for running browser automations and scrapers in production — Playwright-based, with the infra stuff (proxies, captcha handling, retries, scheduling, storage) handled for you so you can focus on the actual scraping logic.

We're opening up free access and I'd genuinely like feedback from people who do this work day-to-day. Specifically curious what you think about:

- The dev experience vs. rolling your own Playwright + proxy stack

- How it compares to Apify / Browserless / Browse AI for your use cases

- What's missing that would make you actually switch

Not looking for fake praise — if it sucks for your workflow, I want to know why. I spend my days helping customers scrape stuff like government procurement portals, so I've seen what breaks in the real world.

Link in comments to avoid the spam filter. Happy to answer questions about the internals (anti-bot stuff, captcha pipelines, fingerprinting) — that's the part I find most interesting anyway.

Happy to chat in DMs too.

u/Chance-Drink9651 — 4 days ago
▲ 5 r/WebScrapingInsider+1 crossposts

[DISCUSSION] Built bots, automations and websites — struggling more with clients than the technical side

Hi,

I’ve been building:
- Telegram bots
- scraping/monitoring systems
- browser automations
- workflow automations
- websites for businesses

Recently I started trying to contact local businesses directly, especially ones with outdated websites or very manual workflows.

The idea is simple:
help them modernize their website and automate repetitive tasks.

The technical side honestly isn’t the hard part for me.

Getting actual clients is.

I’m trying:
- direct outreach
- LinkedIn
- Reddit
- contacting local businesses manually

For people already doing this:
what actually helped you get your first clients consistently?

Would appreciate real advice from people who’ve been through it.

u/Impossible-Fox9834 — 4 days ago

ran the same scraper profile through 8 fingerprinting surfaces and found my residential proxy was leaking on 3 of them

I have been running a Puppeteer pipeline against a few ecommerce sites for price monitoring, and last month I started getting soft blocks on pages that used to return clean HTML. Nothing changed in my code, so I figured something shifted on the fingerprinting side after a Chromium dependency upgrade.

Instead of guessing, I wanted to systematically check every surface that antibot vendors typically fingerprint on. I found an open source scanner that runs eight detection modules in one pass: WebRTC (STUN probe for local/public IPs and mDNS candidates), Canvas and WebGL rendering deltas, AudioContext signatures, font enumeration via Canvas text width measurement, DNS health (DoH reachability, DNSSEC, resolver location), network egress (real IP, geo, ASN, TLS fingerprint), a standard browser fingerprint dimension check (UA, screen, timezone, plugins, CPU cores, memory), and an automation detection module that looks for the same signals antibot systems use (navigator.webdriver, headless indicators, Puppeteer artifacts).

I ran the scan in three configurations: my headed Chrome with no proxy, the same Chrome routed through my residential proxy provider, and my actual Puppeteer headless setup with the same proxy. Here is what I found.

The headed Chrome with no proxy scored lowest as expected. WebRTC exposed my real local IP, DNS resolved through my ISP, and the egress probe showed my home ASN. No surprises there.

The headed Chrome with the residential proxy was where things got interesting. The egress IP and ASN looked clean, showing the proxy provider's residential range. But WebRTC still leaked an mDNS candidate that mapped back to my local network, and the DNS check showed my queries were resolving through a different geographic region than the egress IP claimed to be in. Two surfaces that a sophisticated antibot system could use to flag the session as inconsistent.

The Puppeteer headless setup was the real eye opener. On top of the WebRTC and DNS issues from the proxy config, the automation detection module flagged navigator.webdriver as present (I thought my stealth plugin was patching that), and the Canvas/WebGL fingerprint was returning a rendering signature that was identical across every single profile I tested. Meaning my "unique" browser profiles were all producing the same Canvas hash. That alone is a strong correlation signal for anyone running server side fingerprint clustering.
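The shared-Canvas-hash problem described above can be checked mechanically. This is a rough sketch that assumes you have exported one Canvas hash per profile from whatever scanner you use; the ProfileScan shape, the profile names, and the hash values are hypothetical, and real server-side clustering will combine many more signals than this.

```typescript
interface ProfileScan {
  profile: string;    // hypothetical profile label
  canvasHash: string; // Canvas hash reported for that profile
}

// Group profiles by Canvas hash and keep only hashes shared by more than one profile.
function sharedCanvasHashes(scans: ProfileScan[]): Map<string, string[]> {
  const byHash = new Map<string, string[]>();
  for (const scan of scans) {
    const members = byHash.get(scan.canvasHash) ?? [];
    members.push(scan.profile);
    byHash.set(scan.canvasHash, members);
  }
  const shared = new Map<string, string[]>();
  byHash.forEach((members, hash) => {
    if (members.length > 1) shared.set(hash, members);
  });
  return shared;
}

// Made-up scan results: three "unique" profiles that all render identically.
const scans: ProfileScan[] = [
  { profile: "profile-1", canvasHash: "a1b2c3" },
  { profile: "profile-2", canvasHash: "a1b2c3" },
  { profile: "profile-3", canvasHash: "a1b2c3" },
  { profile: "headed-chrome", canvasHash: "d4e5f6" },
];

const sharedHashes = sharedCanvasHashes(scans);
sharedHashes.forEach((members, hash) => {
  console.log(`${hash} is shared by: ${members.join(", ")}`);
});
```

Any hash that shows up here is one a clustering system could use to tie those profiles together, regardless of how different their IPs or user agents look.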

The fingerprint checks all ran locally in the browser, which I verified by reading through the source code. The only server call is the network egress probe, and it is not tied to any account unless you sign in and save. No signup required to run the detection; the free tier lets you save up to 3 scans if you want to compare configurations side by side later.

After seeing the results I made three changes: forced mDNS to be disabled in my Chromium flags, switched my DNS to route through the proxy tunnel instead of leaking to my local resolver, and updated my stealth plugin config to actually patch the webdriver property (turns out an upstream dependency bump had broken the patch silently). Reran the scan and the WebRTC, DNS, and automation verdicts all flipped from Critical to Safe. The Canvas issue is a deeper problem that I am still working through since it requires injecting per profile noise into the rendering pipeline.

The part that was most useful for my workflow is that the scanner covers the same eight surfaces in one click rather than me having to visit separate WebRTC leak test sites, DNS leak test sites, and Canvas fingerprint demo pages individually. The 0 to 100 score is not an absolute grade but it is useful as a relative comparison between configs. My headed Chrome with proxy went from 34 to 71 after the fixes, and my Puppeteer setup went from 22 to 58.

The tool is called Leakish. The entire codebase is open source (TypeScript, Next.js, Prisma, MySQL) with Docker and Kubernetes manifests in the repo for self hosting. I will drop the repo link and the hosted URL in a reply below.

u/Tall-Peak2618 — 6 days ago
▲ 9 r/WebScrapingInsider+1 crossposts

How do you tell if failures are caused by bad proxies or bad automation?

I'm dealing with a recurring problem where automated jobs fail inconsistently when proxies are involved.

Sometimes the browser test passes locally but fails in CI. Sometimes the request works without a proxy but times out with one. Sometimes one proxy provider works fine for one domain but performs terribly on another.

For me, the hard part right now is diagnosis. I don't want to waste hours debugging selectors, waits, or test code if the real issue is proxy quality.

For those using proxies with Playwright, Selenium, scraping tests, or geo-based QA checks, what's your process for proving whether the proxy is the problem?

Do you benchmark providers before adding them to your automation stack? What metrics are actually useful?

I'm thinking:

  • success rate
  • median and p95 response time
  • timeout frequency
  • CAPTCHA/block rate
  • repeatability over time
  • results per target site, not just generic speed

Is there a standard way to test this properly?
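One way to turn the metrics listed above into numbers, as a rough sketch rather than a standard method. It assumes you log one record per request through the proxy; the Sample fields and the example data are made up, and the percentile uses the simple nearest-rank definition.

```typescript
// One recorded request through a proxy. Field names are illustrative.
interface Sample {
  target: string;    // site the request was sent to
  ok: boolean;       // expected content came back
  blocked: boolean;  // CAPTCHA or block page came back
  timedOut: boolean; // no response before the deadline
  ms: number;        // response time in milliseconds (ignored for timeouts)
}

interface Summary {
  requests: number;
  successRate: number;
  blockRate: number;
  timeoutRate: number;
  medianMs: number;
  p95Ms: number;
}

// Nearest-rank percentile over an ascending-sorted array.
function percentile(sorted: number[], p: number): number {
  if (sorted.length === 0) return NaN;
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

function summarize(samples: Sample[]): Summary {
  const total = samples.length;
  const count = (pick: (s: Sample) => boolean) => samples.filter(pick).length;
  // Latency is measured only over requests that actually returned.
  const latencies = samples.filter((s) => !s.timedOut).map((s) => s.ms).sort((a, b) => a - b);
  return {
    requests: total,
    successRate: count((s) => s.ok) / total,
    blockRate: count((s) => s.blocked) / total,
    timeoutRate: count((s) => s.timedOut) / total,
    medianMs: percentile(latencies, 50),
    p95Ms: percentile(latencies, 95),
  };
}

// Summarize per target site, not across the whole run.
function summarizeByTarget(samples: Sample[]): Map<string, Summary> {
  const groups = new Map<string, Sample[]>();
  for (const s of samples) {
    const group = groups.get(s.target) ?? [];
    group.push(s);
    groups.set(s.target, group);
  }
  const summaries = new Map<string, Summary>();
  groups.forEach((group, target) => summaries.set(target, summarize(group)));
  return summaries;
}

const proxyRun: Sample[] = [
  { target: "shop.example", ok: true, blocked: false, timedOut: false, ms: 100 },
  { target: "shop.example", ok: true, blocked: false, timedOut: false, ms: 200 },
  { target: "shop.example", ok: false, blocked: true, timedOut: false, ms: 300 },
  { target: "shop.example", ok: false, blocked: false, timedOut: true, ms: 0 },
  { target: "docs.example", ok: true, blocked: false, timedOut: false, ms: 50 },
  { target: "docs.example", ok: true, blocked: false, timedOut: false, ms: 70 },
];

const proxySummaries = summarizeByTarget(proxyRun);
proxySummaries.forEach((summary, target) => console.log(target, summary));
```

Running the same requests with and without the proxy and comparing the two summaries per target is a straightforward way to separate proxy problems from automation problems: if the no-proxy run shows the same failures, the proxy is probably not the cause.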

u/Beardybear93 — 9 days ago
▲ 23 r/WebScrapingInsider+2 crossposts

Stop hardcoding your scraper logic: use the browser's Copy as cURL first

My two cents. Most people spend a bunch of time reverse engineering request headers when the browser will just hand them to you. Next time you find an API call in the Network tab, right-click it and hit Copy as cURL. Paste it into your terminal and it works instantly, cookies, headers and all. From there you can import it directly into Postman or use a tool like curlconverter to turn it into clean Python requests code in seconds. The browser already did the hard work of figuring out what the server needs. There's no reason to reconstruct that by hand.
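For a sense of what a converter does with that copied command, here is a deliberately simplified sketch. It is not how curlconverter is implemented; it only handles quoted arguments, -H, -b, -X, and the data flags, and it will misread escaped quotes and other flags that take values. The example command, URL, and cookie are made up.

```typescript
interface ParsedCurl {
  url: string;
  method: string;
  headers: Record<string, string>;
  body?: string;
}

// Split a command line into tokens, honoring single and double quotes.
function tokenize(command: string): string[] {
  const tokens: string[] = [];
  const pattern = /'([^']*)'|"([^"]*)"|(\S+)/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(command)) !== null) {
    tokens.push(match[1] ?? match[2] ?? match[3]);
  }
  return tokens;
}

function parseCurl(command: string): ParsedCurl {
  // Join backslash-continued lines before tokenizing.
  const tokens = tokenize(command.replace(/\\\r?\n/g, " "));
  if (tokens[0] !== "curl") throw new Error("not a curl command");
  const parsed: ParsedCurl = { url: "", method: "GET", headers: {} };
  for (let i = 1; i < tokens.length; i++) {
    const token = tokens[i];
    if (token === "-H" || token === "--header") {
      const header = tokens[++i] ?? "";
      const colon = header.indexOf(":");
      if (colon > 0) {
        parsed.headers[header.slice(0, colon).trim().toLowerCase()] = header.slice(colon + 1).trim();
      }
    } else if (token === "-b" || token === "--cookie") {
      parsed.headers["cookie"] = tokens[++i] ?? "";
    } else if (token === "-X" || token === "--request") {
      parsed.method = tokens[++i] ?? parsed.method;
    } else if (token === "-d" || token === "--data" || token === "--data-raw") {
      parsed.body = tokens[++i];
      if (parsed.method === "GET") parsed.method = "POST"; // curl switches to POST when data is sent
    } else if (!token.startsWith("-")) {
      parsed.url = token;
    }
    // Remaining flags such as --compressed are ignored.
  }
  return parsed;
}

// A made-up command in the shape browsers produce for "Copy as cURL".
const copiedCommand = [
  "curl 'https://example.com/api/items?page=2' \\",
  "  -H 'Accept: application/json' \\",
  "  -b 'session=abc123' \\",
  "  --compressed",
].join("\n");

const replay = parseCurl(copiedCommand);
console.log(replay);
```

The parsed object maps directly onto a fetch call (url, method, headers, body), which is the useful part: once the request is data instead of a shell string, you can parameterize the page number or swap the cookie without touching the rest.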

u/Bharath0224 — 11 days ago

Is this useful? A scanner for third-party scripts and EU alternatives

Hey everyone,

I’m a solo founder building a small tool called StackPatrol.eu.

The idea is simple: you enter your website URL, and it scans the site for third-party scripts, cookies, trackers and external services. Then it shows which vendors your site appears to depend on, whether they are US-owned, Europe-based or unknown, and suggests European alternatives where relevant.

It’s not meant to be a legal GDPR compliance tool. More like a quick visibility tool for founders, small businesses and agencies who want to understand what their website is actually loading.

The MVP currently focuses on:

- detecting third-party domains and scripts
- identifying known vendors like Google Analytics, Meta Pixel, HubSpot, Intercom, Hotjar, Stripe, etc.
- showing a simple US / EU / Unknown breakdown
- suggesting European alternatives
- making the result easy to understand without needing a technical or legal background
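The detection step in that list can be sketched roughly as follows. This is an illustration under stated assumptions, not how StackPatrol works: the vendor table is a tiny hand-written sample, the "last two labels" domain rule is naive (it breaks on suffixes like .co.uk), and the page and resource URLs are made up.

```typescript
type Region = "US" | "EU" | "Unknown";

interface Vendor {
  name: string;
  region: Region;
}

// Tiny illustrative lookup; a real tool would need a maintained vendor database.
const KNOWN_VENDORS: Record<string, Vendor> = {
  "google-analytics.com": { name: "Google Analytics", region: "US" },
  "googletagmanager.com": { name: "Google Tag Manager", region: "US" },
  "facebook.net": { name: "Meta Pixel", region: "US" },
};

// Naive base domain: the last two labels of the hostname.
function baseDomain(hostname: string): string {
  return hostname.split(".").slice(-2).join(".");
}

// Return every third-party base domain loaded by the page, mapped to a vendor entry.
function findThirdParties(pageUrl: string, resourceUrls: string[]): Map<string, Vendor> {
  const firstParty = baseDomain(new URL(pageUrl).hostname);
  const found = new Map<string, Vendor>();
  for (const resource of resourceUrls) {
    const domain = baseDomain(new URL(resource).hostname);
    if (domain === firstParty || found.has(domain)) continue;
    const fallback: Vendor = { name: domain, region: "Unknown" };
    found.set(domain, KNOWN_VENDORS[domain] ?? fallback);
  }
  return found;
}

const scanReport = findThirdParties("https://shop.example/", [
  "https://shop.example/app.js",
  "https://cdn.shop.example/vendor.js",
  "https://www.google-analytics.com/analytics.js",
  "https://connect.facebook.net/en_US/fbevents.js",
  "https://widgets.tracker.example/t.js",
]);

scanReport.forEach((vendor, domain) => {
  console.log(`${domain}: ${vendor.name} (${vendor.region})`);
});
```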

I’m curious:

Would this be useful for other solo founders, especially those building for European users?

And if you ran a scan on your own site, what would you want the report to show?

u/PausePulse — 9 days ago
▲ 77 r/WebScrapingInsider+2 crossposts

What happens when you make a browser that is identical to Chrome but its use is scraping

I built a real C++ browser and gave you a TypeScript library to control it — here's why it changes scraping

Most tools like Puppeteer and Playwright bolt automation onto Chrome from the outside. They're always playing catch-up with anti-bot systems.

I took a different approach. I built the actual browser — Qt6 + Chromium engine, written in C++. Then I wrote a TypeScript library (Piggy) that controls it over a local socket. That's why Cloudflare bypasses are almost trivial and the code stays dead simple.

Two repos, one ecosystem:

🖥️ Nothing Browser (the C++ browser) https://github.com/BunElysiaReact/nothing-browser

📦 Piggy (the TS library) — https://github.com/ernest-tech-house-co-operation/nothing-browser

What you get out of the box:

🪪 Persistent TLS fingerprint identical to real Chrome — sites can't profile you

🧠 Human Mode — randomized delays, natural scrolling, no robotic timing

⚡ Socket-based IPC — millisecond latency between your script and the browser

🌐 Remote deployment — binary runs on a VPS, you scrape from local

💾 Session persistence — save/restore cookies and storage, stay logged in

🏊 Tab pooling — concurrent requests inside one browser instance

🚀 Built-in API server — one line turns your scraper into a REST endpoint with OpenAPI docs

🔄 Proxy rotation — built-in fetch, test, switch, rotate

The code looks like this:

import piggy from "nothing-browser";

await piggy.launch();
await piggy.register("books", "https://books.toscrape.com");
await piggy.books.navigate();

const books = await piggy.books.evaluate(() =>
  Array.from(document.querySelectorAll(".product_pod")).map(el => ({
    title: el.querySelector("h3 a")?.getAttribute("title") ?? "",
    price: el.querySelector(".price_color")?.textContent?.trim() ?? "",
  }))
);

console.log(books);
await piggy.close();

That's a real browser. Not a wrapper around someone else's.

Bun-first but Node compatible. Headless and headful ship as separate binaries so you're not carrying GPU overhead when you don't need it.

📚 Docs: https://nothing-browser-docs.pages.dev

Would love issues, feedback, and ⭐ stars — built in Kenya 🇰🇪

u/PeaseErnest — 12 days ago
▲ 12 r/WebScrapingInsider+2 crossposts

If you have been looking for a no-browser alternative, feel free to give this a go!

Fast and lightweight.

Would love feedback or bug reports if you run it against anything weird.

u/jinef_john — 10 days ago

Any better and cheaper option to changedetection.io?

Hi, I'm looking for an alternative to changedetection.io that is better and cheaper. I know it is difficult to find something both better and cheaper, so anything cheaper would work too. I am thinking of self-hosting; if anyone has experience doing that, let me know if it's worth it. I need to crawl more than 15,000 links every day, so it is proving to be very expensive.

u/abdullahabsar — 10 days ago
▲ 12 r/WebScrapingInsider+1 crossposts

We built a Claude Code plugin that generates crawler + scraper projects from a URL

We just posted a quick demo from the ScrapeOps YouTube channel showing how our Claude Code plugin generates a working web scraping project from a prompt.

The example in the video builds a crawler + product scraper pipeline for a Walmart search page. It generates the project files, schemas, parsers, README, run commands, and JSON/JSONL output. The demo uses Python + BeautifulSoup, but the plugin also supports other languages and scraping libraries like Scrapy, Playwright, Puppeteer, etc.

The part I'm most interested in feedback on is the workflow: instead of using AI to just write a parser snippet, the goal is to generate the full scraping pipeline and then let devs inspect, run, modify, or fix it from there.

Video covers:

  • installing the Claude Code plugin
  • adding the ScrapeOps to it
  • using /generate-scraper, /fix-scraper, and /generate-crawler-scraper
  • choosing language + library
  • generating crawler and product parser files
  • running the scraper and checking the structured output

This is still aimed at developers, not "magic no-code scraping."

https://www.youtube.com/watch?v=qcE5sK0DDus

The generated code still should be reviewed, especially for ToS/robots considerations, and production monitoring. But it's been useful for cutting down the boring scaffold/debug loop.

Would be interested to hear what people here think: useful direction, or does AI-generated scraper code create more maintenance debt than it saves?

u/ian_k93 — 13 days ago
▲ 10 r/WebScrapingInsider+1 crossposts

1,081 GitHub stars and 124 forks later. Update on my Rust web scraper for AI agents: anti-bot, JS pages, and cleaner failures

I posted here before about webclaw, the Rust web extraction tool I’m building for AI agents and LLM workflows.

Since then I’ve been working mostly on the part that is least fun but matters the most in production: making extraction survive real websites.

The original version was mostly about turning pages into clean markdown/JSON quickly. That worked well on normal pages, docs, blogs, changelogs, etc.

But once people started trying it on harder targets, the pattern was obvious:

  • simple fetch works until it suddenly doesn’t
  • datacenter-looking traffic gets blocked fast
  • some pages return a 200 with a challenge page
  • some pages need JS rendering
  • some sites work once and fail at scale
  • some failures look like empty content instead of actual errors

So the newer architecture separates a few concerns:

  1. First try the lightweight path. Most pages do not need a full browser. A lot of useful content is already in SSR HTML, JSON-LD, hydration payloads, or embedded data islands.
  2. Detect bad responses before extraction. A 200 response is not always success. If the page is a bot challenge, login wall, empty shell, or blocked response, treating it as “content” is worse than failing loudly.
  3. Escalate only when needed. Instead of using a browser for everything, webclaw tries to keep the common path cheap and fast, then escalates for pages that actually need rendering or stronger handling.
  4. Return useful output, not just HTML. The goal is markdown, text, JSON, structured extraction, metadata, links, screenshots when needed, and clear warnings/errors.
  5. Make it usable from agents. It ships as a CLI, REST API, SDKs, and MCP server, because Claude/Cursor/custom agents need tools they can call directly.
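The "detect bad responses before extraction" and "escalate only when needed" steps can be sketched like this. It is a generic illustration, not webclaw's implementation: the marker phrases, the 200-character threshold, and all names are guesses chosen for the example, and the post deliberately does not share the real detection rules.

```typescript
type Verdict = "ok" | "blocked" | "challenge" | "login_wall" | "empty_shell";

// Illustrative phrases only; real detection would be much broader.
const CHALLENGE_MARKERS = ["verify you are human", "checking your browser", "captcha"];
const LOGIN_MARKERS = ["sign in to continue", "log in to continue"];

// Strip scripts, styles, and tags to estimate how much readable text a page has.
function visibleText(html: string): string {
  return html
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1>/gi, " ")
    .replace(/<[^>]+>/g, " ")
    .replace(/\s+/g, " ")
    .trim();
}

// Classify a response before any extraction runs on it.
function classifyResponse(status: number, html: string, minTextLength = 200): Verdict {
  if (status === 403 || status === 429) return "blocked";
  const text = visibleText(html).toLowerCase();
  if (CHALLENGE_MARKERS.some((marker) => text.includes(marker))) return "challenge";
  if (LOGIN_MARKERS.some((marker) => text.includes(marker))) return "login_wall";
  if (text.length < minTextLength) return "empty_shell";
  return "ok";
}

// Escalate to a heavier path only when the cheap fetch did not yield usable content.
// A login wall is reported as a failure instead, since rendering alone will not get past it.
function shouldEscalate(verdict: Verdict): boolean {
  return verdict === "blocked" || verdict === "challenge" || verdict === "empty_shell";
}

const challengePage =
  "<html><head><title>Just a moment...</title></head>" +
  "<body><p>Checking your browser before accessing the site.</p></body></html>";
const shellPage =
  '<html><body><div id="root"></div><script>window.__APP__ = {}</script></body></html>';
const articlePage = "<article>" + "Real content sentence. ".repeat(20) + "</article>";

console.log(classifyResponse(200, challengePage)); // "challenge"
console.log(classifyResponse(200, shellPage));     // "empty_shell"
console.log(classifyResponse(200, articlePage));   // "ok"
console.log(classifyResponse(429, ""));            // "blocked"
```

Returning a named verdict instead of an empty string is what keeps a challenge page from being passed downstream as if it were content.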

I’m deliberately not going to share the exact anti-bot mechanics. That is a weird arms race and posting the details publicly just burns everyone’s work faster.

But at the product level, the goal is simple:

normal pages should be fast and cheap
hard pages should not silently poison the output
agents should get clean context instead of raw page chaos

Current state:

  • Rust
  • AGPL-3.0
  • 1,081 GitHub stars
  • 124 forks
  • latest release v0.5.8
  • CLI, MCP server, REST API
  • scrape, crawl, map, batch, extract, summarize, diff, brand, search, research
  • hosted API for the parts that are painful to self-host

Curious from people here who run scrapers in production:

What do you care about more when a target blocks you?

  • automatic fallback
  • clear failure reason
  • ability to bring your own proxy/browser
  • lower cost per successful page
  • raw HTML access for debugging
  • screenshots
  • session/sticky behavior
  • something else?

Repo: https://github.com/0xMassi/webclaw
Site/docs: https://webclaw.io

u/0xMassii — 13 days ago

finds early B2B buyer intent from Reddit, GitHub, Hacker News, news APIs, and reviews before prospects hit a CRM

I was tired of the usual "top-level" scrapers, so I built the Dark Funnel Scraper on Apify.

What it does:

Identifies under-the-radar startups.

Finds the team members behind the projects.

Perfect for finding "hidden" opportunities before they go mainstream.

It's live now: [apify](https://apify.com/lissome_dancer/dark-funnel-scraper)

Would love for a few of you to run it and tell me what you think of the results!

u/Greedy_Extent4028 — 14 days ago