r/webscraping

I need help

Hello, I have almost no knowledge of web scraping, and I don't really know if this is the right place for this question, but I need to extract data from opendata.camara.cl.

I want to extract how every congressman voted in every vote since 2005, so I can't do it manually.

Every bill has a number (número de boletín), and every bill can have zero or more votes (ID Votación). The ID Votación (which you can look up by número de boletín under "Votaciones por proyecto de ley") has to be entered one at a time in "Votación detalle - Cámara de Diputados".

So what I want is a database of how every congressman voted in every vote from 2005 to now.

The website mentions SOAP, HTTP GET, and so on. I know how to use R and a little Python, if that helps.
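A rough Python sketch of the two-step lookup I have in mind, with the actual HTTP calls left as placeholders because I don't know the service URLs or response formats. The names `collect_votes`, `votaciones_for`, and `detalle_for` are made up for illustration; the last two stand in for "Votaciones por proyecto de ley" and "Votación detalle - Cámara de Diputados":

```python
from typing import Callable, Iterable


def collect_votes(
    boletines: Iterable[str],
    votaciones_for: Callable[[str], Iterable[int]],
    detalle_for: Callable[[int], Iterable[tuple[str, str]]],
) -> list[dict]:
    """Flatten bill -> vote IDs -> individual votes into one row per vote cast.

    votaciones_for and detalle_for are hypothetical wrappers around the two
    services; each would have to fetch and parse the real responses.
    """
    rows = []
    for boletin in boletines:
        # A bill may have zero vote IDs, in which case it adds no rows.
        for votacion_id in votaciones_for(boletin):
            for diputado, voto in detalle_for(votacion_id):
                rows.append(
                    {
                        "boletin": boletin,
                        "votacion_id": votacion_id,
                        "diputado": diputado,
                        "voto": voto,
                    }
                )
    return rows
```

The resulting list of rows could be written to a CSV file or a database table in one step.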

Thanks in advance to anyone who can point me in the right direction!

reddit.com
u/paintwithletters — 1 day ago
▲ 48 r/webscraping+1 crossposts

I built klura, a toolkit for an AI agent to reverse-engineer websites

Hi r/webscraping,

I've been working on klura — a free toolkit that gives a coding agent (Claude Code, Cursor, Claude Desktop, any MCP host) the ability to reverse-engineer a website. The agent drives a browser to complete a task once. Klura captures everything the page does underneath, then does what I call LIFT (Learn Interface From Traffic — the analysis pass) and extracts the real underlying requests and saves them as a readable, LLM-annotated JSON config, so that it can be run later - without driving the UI. The JSON config can be analyzed to understand how the site works, and can also be copied and run on other devices.

There's prior art in this space — capture-and-replay isn't a new idea. The main places klura tries to push it forward:

  • page-script tier: here the saved strategy is a snippet of JavaScript that the AI writes for you. klura injects it into the live, already-authenticated page and runs it there. The pure-HTTP approach (the "convert to a requests / curl script" pattern) hits a wall on sites that bind the request to in-page state: rotating per-session CSRF tokens (GitHub's X-Fetch-Nonce is one public example), request bodies built by a JS function in the bundle, binary WebSocket transports. Reproducing those from outside the browser means porting the site's own signing/encoding logic, which is fragile and often impossible. Page-script sidesteps it: the saved JS calls the page's own functions (the request signer, the encoder, the socket the page already has open), so the site does the hard part and klura just collects the result. And it isn't slow the way browser automation usually is: klura keeps a warm pool of already-open, already-authenticated pages (configurable, of course), so you don't pay browser startup per call; a page-script run is dominated by the actual request and lands close to plain-curl latency.

  • A real RE + debugging toolkit the agent drives: most tools capture traffic and pattern-match. Klura hands the agent an actual JS debugger: breakpoints, stepping, live-stack inspection, WebSocket frame tooling, etc. It reverse-engineers a site the way a human would: break on the function that builds the request, read the encoder, verify it reproduces. You never touch any of it; you ask in plain language. Call it vibe-RE. (Detail in How the discovery works below.)

  • Self-healing: when a site changes shape and a saved strategy breaks, klura re-discovers the delta and patches the strategy automatically, instead of silently returning garbage or making you re-record from scratch.

  • Runs anywhere, hands off to a human when it must: klura is an MCP server, so the same saved skill is callable from Claude Code, Claude Desktop, Cursor, Windsurf, the CLI, or a programmatic import. Skills live globally at ~/.klura/skills/<platform>/, not per-project. When a site genuinely needs a human (login, 2FA, captcha), it escalates to a remote viewer: you do that one step, the agent continues, and the saved strategy runs in the resulting authenticated session.

  • Pluggability: the browser driver is swappable. The default is Playwright, klura ships a stealth driver that fixes the standard automation-fingerprint leaks (navigator.webdriver, canvas/WebGL consistency), and you can drop in your own. The interruption layer is pluggable too: when a site throws something up mid-flow (login, captcha), the handler that decides what happens is yours to define.

The artifact

Every skill is a single JSON file on disk. A fetch-tier example for a public search API:

{
  "strategy": "fetch",
  "method": "GET",
  "baseUrl": "https://hn.algolia.com",
  "endpoint": "/api/v1/search",
  "params": {
    "query":       "{{query}}",
    "tags":        "story",
    "hitsPerPage": "{{count}}"
  },
  "response": { "format": "json", "extract": "hits[]" }
}

The LLM that produced it leaves notes fields explaining the why of each parameter, the prereq chain, and what shape the response has. Read the file → you've reverse-engineered the site.
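As a rough illustration (not klura's actual runtime code), a fetch-tier skill like the one above could be turned into a request URL along these lines. The `{{name}}` substitution rule and the `render_url` helper are assumptions made for this sketch:

```python
import re
from urllib.parse import urlencode

# The fetch-tier skill from above, reproduced as a Python dict.
SKILL = {
    "strategy": "fetch",
    "method": "GET",
    "baseUrl": "https://hn.algolia.com",
    "endpoint": "/api/v1/search",
    "params": {
        "query": "{{query}}",
        "tags": "story",
        "hitsPerPage": "{{count}}",
    },
    "response": {"format": "json", "extract": "hits[]"},
}

# Assumed template syntax: {{name}} is replaced by the argument called name.
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def render_url(skill: dict, args: dict) -> str:
    """Fill the placeholders from args and build the full request URL."""

    def fill(value: str) -> str:
        # Raises KeyError if the skill needs an argument that was not given.
        return PLACEHOLDER.sub(lambda m: str(args[m.group(1)]), value)

    params = {key: fill(value) for key, value in skill["params"].items()}
    return f'{skill["baseUrl"]}{skill["endpoint"]}?{urlencode(params)}'


url = render_url(SKILL, {"query": "show hn", "count": 3})
print(url)
# https://hn.algolia.com/api/v1/search?query=show+hn&tags=story&hitsPerPage=3
```

The printed URL can be passed to curl or any HTTP client; the `response.extract` hint then says which part of the JSON body to keep.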

Execution path

Once saved, run it via klura (recommended):

klura execute hackernews search_stories --args '{"query":"show hn","count":3}'

Or ask your LLM again; it will prefer executing existing strategies over creating new ones.

For page-script strategies the saved JS has to run inside the live authenticated page, so that path needs the klura runtime regardless of how you wire it up. For fetch strategies the saved JSON file gives you method + URL + body/header templates — slot in your params/cookies and curl away.

Three execution tiers

| Tier | What runs | When it lands | Curl-able? |
| --- | --- | --- | --- |
| fetch | Static HTTP from Node, templated body/headers + prereq chain | Plain APIs (HN search, Amazon /s?k=, GitHub GETs) | yes |
| page-script | JS dispatched inside the live authenticated browser tab | Sites that bind requests to in-page state | no (page state needed) |
| recorded-path | UI action replay through the driver | Fallback when neither above is reconstructible | no |

How the discovery works

There's no recorder UI implemented or planned at the moment. The agent itself drives the browser. You tell your MCP host in plain language — "search HN for the top posts this week using klura" — and that's it. No script to write, no breakpoints to set, no JS to read. You could call it vibe-RE: you describe what you want, klura performs it and figures out how the site does it. You can also ask klura to map out a site, if you just want it to observe the API without performing a specific action.

For readers who want the how — on hard sites, the agent has access to a full RE toolkit (none of which you ever touch as a user):

  • JS debugger: set_breakpoint, step, resume_execution, wait_for_pause — break on the function the page calls before dispatch, read the live stack, see the encoded payload the page is about to send.
  • JS source navigation: search_js_source, read_js_function, list_loaded_scripts — find the function in the bundle that builds the body or signs the request.
  • WebSocket frame tooling: inspect_ws_frame, find_in_ws_frame, explain_ws_frame_structure, get_send_encoder, try_generator_in_page — for binary protocols, capture frames live and verify the captured encoder reproduces the wire format before saving.
  • Network log (get_network_log) for HTTP and WS in one feed, filterable.
  • Live page inspection: a11y tree, screenshots, page text, arbitrary js_eval.

This is how klura one-shots genuinely hard flows — for instance a mainstream chat app whose message-send rides a binary MQTT-style WebSocket, with an in-page codec, snowflake IDs beyond Number.MAX_SAFE_INTEGER, and a per-session token that rotates. The agent debugs the way a human would: set a breakpoint at the call site, read the actual encoder, verify it reproduces the wire output, save it as a page-script. The same toolkit generalizes to anything that builds a signed or encoded payload in-page — rotating CSRF on graphql endpoints, signed URL params, custom binary framings.
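A small aside on why IDs beyond Number.MAX_SAFE_INTEGER are a hazard: a JS Number is an IEEE 754 double, and so is a Python float, so the precision loss can be reproduced in a few lines. The ID below is made up and chosen only because it is larger than 2**53:

```python
MAX_SAFE_INTEGER = 2**53 - 1  # value of JavaScript's Number.MAX_SAFE_INTEGER

# Hypothetical 64-bit ID, not taken from any real site.
snowflake = 1234567890123456789

# Round-trip through a double, which is what happens when a JS client
# parses the ID as a Number instead of a string or BigInt.
as_double = int(float(snowflake))

print(snowflake > MAX_SAFE_INTEGER)  # True: outside the exactly representable range
print(as_double == snowflake)        # False: the low digits were rounded away
```

This is one reason such IDs are usually carried as strings (or BigInt inside the page) rather than as plain numbers.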

Once on disk, the next call hits the saved skill directly — or you cat the JSON and write your own scraper.

Cost: LLM once, then never

Klura runs the LLM during discovery, then never again. After LIFT the saved skill is a plain request (or a page-script dispatch): no model, no agent loop, no tokens. The only time the LLM needs to be brought back in is when self-healing is required.

A couple of concrete runs, same task, measured against a plain browser agent on the same model: a Hacker News search that costs the browser agent ~20s and ~$0.09 every time runs from the saved skill in ~270ms at zero token cost. A mainstream chat app's message-send took ~27 minutes on the first run — that's the agent reverse-engineering the binary WebSocket — but every send after is a sub-millisecond dispatch into the authenticated page. The one-time discovery cost scales with site difficulty; the warm cost is always ~zero.

Install

# As an MCP server (Claude Code)
claude mcp add klura -- npx -y @klura/mcp

As an MCP server, it works with many other harnesses; check the documentation for yours to see how to install it.

Repo: klura · npm: @klura/runtime, @klura/mcp

Currently at 0.2.2. Real-world feedback welcome — especially the shape of the failures.

u/rundfunk — 2 days ago
▲ 23 r/webscraping+1 crossposts

Curated list of AI-powered web scraping tools

Was researching AI scraping tools for a project and noticed the existing awesome lists either cover traditional scraping (Scrapy, BeautifulSoup) or web agents broadly. Couldn't find one focused specifically on LLM-powered scraping, so I put one together.

Covers frameworks (Crawl4AI, Scrapling, ScrapeGraphAI, llm-scraper), hosted APIs (Firecrawl, Jina Reader, Diffbot), browser infrastructure for AI agents, MCP servers, and search APIs built for LLMs.

Open to additions: what am I missing?

github.com
u/Total_Nectarine_3623 — 3 days ago
▲ 74 r/webscraping+1 crossposts

I made a fully fledged Open-Source Google Maps Company Crawler

Hey guys,

I wanted to share a project I've been working on: SherlockMaps, an open-source Google Maps webcrawler built with Python and Playwright. You can check it out here.

What is it?

SherlockMaps extracts detailed company information from Google Maps searches. You give it a search term (like "restaurants berlin"), and it returns structured data including:

  • Company name, category, address, phone, website
  • Rating and number of reviews
  • Opening hours
  • Attributes (wheelchair accessibility, etc.)
  • Plus Code

Key Features

  • Clean OOP architecture - Well-structured with classes, dataclasses, and design patterns
  • Multiple usage modes:
    • CLI tool for quick data extraction
    • Python library for integration into your own scripts
    • REST API server for headless/production use
  • Multiple output formats - JSON, CSV, pretty-print
  • Deduplication based on company name + website
  • URL validation to filter out invalid websites
  • Docker support for easy deployment
  • Chrome profile persistence - Session data persists between runs
  • MIT License - Fully open source
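To give an idea of what deduplication on company name + website can look like, here is a minimal sketch. It is not the project's actual code: the `Company` dataclass, the normalization rules, and the function names are assumptions made for illustration.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class Company:
    name: str
    website: str


def dedup_key(company: Company) -> tuple[str, str]:
    """Normalize name and website host so trivial variants compare equal."""
    # Lowercase and collapse repeated whitespace in the name.
    name = " ".join(company.name.lower().split())
    # Compare hosts only, ignoring path and a leading "www.".
    # Assumes the website includes a scheme such as https://.
    host = urlparse(company.website).netloc.lower().removeprefix("www.")
    return name, host


def deduplicate(companies: list[Company]) -> list[Company]:
    """Keep the first occurrence of each key, preserving input order."""
    seen: set[tuple[str, str]] = set()
    unique = []
    for company in companies:
        key = dedup_key(company)
        if key not in seen:
            seen.add(key)
            unique.append(company)
    return unique
```

Keying on both fields keeps two branches with the same name but different websites as separate entries.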

Hope you like it, I am always open to making it better 😄

u/Ayyouboss — 5 days ago

Best Free Generative AI for Data Hoarding?

What is the best free Gen AI tool for large-scale data hoarding/mining? I plan to create a data directory of video games from 2020 to 2026 and paste it into Excel, but when I tried various free AIs, like Gemini, I found that they can't gather several types of data at once and can only handle around 50 games per table.

What's more, talking to Gemini is like talking to a demented grandma: when I ask it to continue the list, it creates a new table that doesn't follow the format I originally specified. ChatGPT has limited chunks, while DeepSeek sometimes misses instructions.

u/Still-Goal-9314 — 4 days ago
▲ 107 r/webscraping+5 crossposts

Free Google search MCP that actually works.

(Demo runs Chrome visibly for clarity. Actual usage runs headless by default.)

✅ Actually works (tested 6 free MCPs, all failed)

✅ Search + URL extract in one MCP (replaces the usual search MCP + fetch MCP combo)

✅ 4 tools: `search` / `search_parallel` / `extract` / `search_extract`

✅ No API key, no proxies, no solver

✅ Auto CAPTCHA recovery (Chrome opens, human solves once, retries)

When CAPTCHA fires on any tool, a visible Chrome window opens for a human to solve. Each solve preserves the profile's reputation with Google. Built for sustainable, ethical use.

Speed (1Gbps):

- sequential: ~1.5s/q (warm)

- 4 parallel: ~2s wall

- 10 parallel: ~5s wall

Tools: 'search' / 'search_parallel' / 'extract(url)' / 'search_extract(query)'. Last one bundles search + parallel article extraction (Readability + Turndown).

Stack: TS, Playwright + stealth, Readability, Turndown. ~600 LOC.

💻 https://github.com/HarimxChoi/google-surf-mcp

📦 https://www.npmjs.com/package/google-surf-mcp

⭐ Star helps a solo dev keep maintaining.

Ask me anything about architecture, reliability, or scaling.

u/GarrixMrtin — 6 days ago

Trying to find ways to scrape news...

Hello, hope all is well! I'm currently working on a sentiment classifier system for a greater utility of attenuation for market prediction.

For such a sentiment classifier, I need a lot of news on a given topic. For example, if I'm trying to predict the market for gold, I need a lot of news about gold to train the classifier.

I've tried a few approaches, but it has been quite difficult. GDELT has not worked out well for me, though I still support it for its amazing work.
Can anyone help me find ways to obtain either the URLs of news articles on a given topic over a long time span or, even better, the data itself?

I've also been looking into web scraping, and if someone has perfected a recipe for doing so, given a URL, I would be happy if you could guide me on that!

Thanks!

u/RichardKing1206 — 6 days ago

Open-source static + runtime analyzers for bot-detection JS

TL;DR Two open-source packages that tell you exactly which browser APIs a fingerprinting script touches and what it ships home. One reads the source statically, the other instruments a real browser. Same output shape, so you can diff them.

Why

If you're working anywhere near bot detection, scraping, or building a stealth browser, you eventually need to read the JS on the site. The problem is that those scripts are 400KB of minified, obfuscated, often-rotating code, and nobody has time to step through them by hand.

I've spent too much time going back and forth combing through minified JS, and I wanted one tool that would tell me, in a single pass: which APIs does this script probe, which network sinks does it fire, and which fingerprint surfaces actually leave the browser.

What

script2builtins is the static analyzer.

  • Parses with acorn (module, then script fallback).
  • Walks the AST, resolves aliases through string concat and variable reassignment.
  • Matches every property access against a curated catalog of fingerprinting APIs across navigator, screen, canvas, WebGL, audio, WebRTC, timing, headless tells, sensors, media permissions, intl.
  • Scans network sinks (fetch, XHR, sendBeacon, WebSocket, image src, script src, EventSource, Worker, navigation) and traces each body to figure out which cataloged values flow into it.
  • Flags dynamic hazards (eval, Function constructor, with, document.write, computed properties) where static reach ends.

script2builtins-runtime is the dynamic companion.

  • Drives a real browser session (Puppeteer or Playwright).
  • Traps every catalog API, sink, and dynamic-execution point as it actually fires.
  • Emits findings in the same Report shape as the static analyzer, so you can lay them side by side.

Static tells you what could be probed. Runtime confirms what was. The gap between them is where most of the interesting behavior lives. Lazy-loaded modules, environment-gated branches, you know the kind.
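A minimal sketch of that side-by-side comparison, assuming each report can be reduced to a set of API names. The real Report shape isn't shown here, so `diff_reports` and the sample API names are purely illustrative:

```python
def diff_reports(static_apis: set[str], runtime_apis: set[str]) -> dict[str, set[str]]:
    """Split findings into static-only, runtime-only, and confirmed by both."""
    return {
        # Seen in the source but never fired: dead code or gated branches.
        "static_only": static_apis - runtime_apis,
        # Fired at runtime but invisible statically: eval, lazy-loaded modules.
        "runtime_only": runtime_apis - static_apis,
        # Present in both reports.
        "confirmed": static_apis & runtime_apis,
    }


# Made-up findings for illustration only.
static_found = {"navigator.webdriver", "screen.width", "HTMLCanvasElement.toDataURL"}
runtime_found = {"navigator.webdriver", "screen.width", "navigator.hardwareConcurrency"}

gap = diff_reports(static_found, runtime_found)
print(sorted(gap["static_only"]))
print(sorted(gap["runtime_only"]))
```

The two "only" buckets are the gap mentioned above, and they are usually the first place worth looking.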

Live demo

Both run daily against real production loaders at https://richards.foo/tools/bot-detectors. Fresh report every 24 hours, previous hash retained so you can see what each vendor pushed overnight.

Links

Asking

Open to feedback, especially from anyone who's reverse-engineered niche detectors. What's missing from the catalog? Which vendor do you wish was covered first?

u/0day2day — 6 days ago

Product data scraping advice

I’m working on an early e-commerce/product data project and trying to better understand how to extract clean structured data from public product/category pages.

I want to be upfront, I don’t currently have a budget to hire someone, so I’m not trying to mislead anyone into unpaid contract work. I’m mainly looking for advice, learning direction, or someone who genuinely enjoys product data scraping and would be open to discussing a small proof of concept.

I’m interested in how people structure the data cleanly, keep extractors maintainable when pages change, and avoid messy output.

If anyone has advice on tools, architecture, common mistakes, or would be open to chatting, I’d appreciate it.

u/JustAStranger1156 — 8 days ago

Scraper needed for price fetching.

I need an eBay bot to fetch prices for 15k products on a weekly basis.

The product names are in a CSV file, and the output can go into the same CSV or a new one, whichever suits.

Do hit me up if someone can do this for me.

We can discuss pay in DM.

u/TheImmortalHooman — 7 days ago

An app for forwarding mobile OTP for verification when scraping

Long story short, I tried scraping a website, but it had OTP logins. So I vibecoded through the night to make a Flutter app that listens for messages, picks out the OTP code using a regex, and forwards it to a custom backend server. It basically solves the OTP login problem, as long as we keep it in a continuous session 🤷🏾‍♂️.
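The regex step is simple enough to sketch. The app itself is written in Flutter, so the Python below only illustrates the idea; the pattern (a standalone run of 4 to 8 digits) and the `extract_otp` name are assumptions, not what the app necessarily uses:

```python
import re
from typing import Optional

# A run of 4 to 8 digits that is not part of a longer number.
OTP_PATTERN = re.compile(r"(?<!\d)\d{4,8}(?!\d)")


def extract_otp(message: str) -> Optional[str]:
    """Return the first OTP-looking code in an SMS body, or None."""
    match = OTP_PATTERN.search(message)
    return match.group(0) if match else None


print(extract_otp("Your verification code is 482913. Do not share it."))  # 482913
print(extract_otp("Your parcel has shipped."))  # None
```

The lookarounds keep phone numbers and other long digit runs from being mistaken for a code.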

Hope it will be useful for someone

GitHub: https://github.com/jidukrishna/otp_listener

u/jiduk — 9 days ago

Can I get in legal trouble if I monetize scraped data?

I am planning to create a personal app/dashboard that I will use to get trends within the Apple App Store. If I theoretically find a way to get the data for all public apps (rating, tags, name, ...), can I get in legal trouble for monetizing this as an advanced dashboard / charts?

u/syvdv — 9 days ago
▲ 458 r/webscraping+9 crossposts

Today I declare scraping free again

reCAPTCHA v3 at 0.3, FP Pro flagging bot:true, Cloudflare banning my ASN on sight. Sick of it.

Built this: a Firefox patched at the C++ level:

reCAPTCHA v3 0.90, FP Pro bot=false, tampering=false · CreepJS 0 lies · sannysoft all green · WebRTC no leak.

Self-hosted, MIT, no cloud, no subscription.

Repo: https://github.com/P0st3rw-max/stealthfox

u/bolaretyr — 12 days ago

Data pipeline and storage after scraping

The web scraping part I have covered. I'm scraping multiple sources using Crawlee. Total data size is 200 GB. Every day I'm fetching new records.

I fetch raw HTML, which I store in S3 object storage. I then turn it into Parquet and clean up the data using DuckDB. Previously I also used PostgreSQL, but RAM usage was high, mostly due to entity resolution. For example, if I find two addresses that are the same, I want to link them, but sometimes there are typos; the same goes for people.
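To make the entity resolution problem concrete, here is a minimal sketch of typo-tolerant address linking using only the standard library. It is not what I run in production: the 0.9 threshold, the normalization, and the function names are arbitrary choices for illustration, and the pairwise loop is quadratic, so at real volume it would need blocking (for example, only comparing records that share a postal code).

```python
from difflib import SequenceMatcher


def normalize(address: str) -> str:
    """Lowercase, drop commas, and collapse whitespace."""
    return " ".join(address.lower().replace(",", " ").split())


def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two normalized addresses."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def link_records(addresses: list[str], threshold: float = 0.9) -> list[tuple[int, int]]:
    """Return index pairs whose addresses are similar enough to link."""
    pairs = []
    for i in range(len(addresses)):
        for j in range(i + 1, len(addresses)):
            if similarity(addresses[i], addresses[j]) >= threshold:
                pairs.append((i, j))
    return pairs


addresses = [
    "12 Main Street, Springfield",
    "12 Main Stret, Springfield",  # typo in "Street"
    "98 Oak Avenue, Shelbyville",
]
print(link_records(addresses))  # [(0, 1)]
```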

The goal is to combine these sources into a data warehouse where I can build data marts on top to serve APIs.

What does your process look like after the data is scraped? How do you store it? Where do you store it? How do you combine sources? How do you monitor your scrapers and pipelines?

u/vroemboem — 9 days ago

🚨 Scrapling v0.4.8 is out with a new insane update🕷️

Hello everyone,

It has been some time since I wrote here :)

A lot has happened since my last post, and we are now almost at 1.5M downloads.

I wanted to share this update with you here, as many people have asked for these features since the first release.

Introducing the new spider templates feature, which makes generic spiders easier to write. You now have a CrawlSpider and a SitemapSpider, both with many options and controls. We also added helpers such as LinkExtractor and CrawlRule to help you build your generic spider. Stop rewriting "follow links matching X" boilerplate in every parse() :D

Check the new page for them on the website here: https://scrapling.readthedocs.io/en/latest/spiders/generic-templates.html
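For context, this is roughly the kind of hand-written "follow links matching X" boilerplate the templates are meant to replace. It deliberately uses only the Python standard library and is not Scrapling's API; `LinkCollector` and `matching_links` are names invented for this sketch:

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a document."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def matching_links(html: str, base_url: str, pattern: str) -> list[str]:
    """Return absolute link URLs from html whose URL matches pattern."""
    collector = LinkCollector()
    collector.feed(html)
    regex = re.compile(pattern)
    # Resolve relative hrefs against the page URL before filtering.
    absolute = (urljoin(base_url, href) for href in collector.hrefs)
    return [url for url in absolute if regex.search(url)]


page = '<a href="/blog/post-1">one</a><a href="/about">about</a>'
print(matching_links(page, "https://example.com/", r"/blog/"))
# ['https://example.com/blog/post-1']
```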

Also, we have improved the adaptive feature so it's now much better, fixed some bugs, improved the docs, and updated browsers/fingerprints.

The full release notes are here: https://github.com/D4Vinci/Scrapling/releases/tag/v0.4.8

Let me know your thoughts and what I should add next :)

u/0xReaper — 11 days ago

cannot bypass cloudflare in baseball-reference

Neither requests nor Playwright gets past Cloudflare. I have tried user agents, stealth, and a persistent context, but none of it works as expected. Web scraping is not my area of expertise, so I don't know if I'm missing something.

Even if I run curl baseball-reference.com in my terminal, I get:

https://preview.redd.it/e2kamcb41l0h1.png?width=2457&format=png&auto=webp&s=e51ca3a7f4e7a4d734e581e3a36913ef7a67c3af

I know bypassing Cloudflare is tricky, but if someone has tips to share, it would be helpful.

u/No-Elk6835 — 10 days ago