r/scrapingtheweb

Scraping Walmart search results returns different products each run

I'm running a scraper on Walmart search results for the same keyword every few hours. The product list shifts a lot even when filters are fixed. I'm not talking about ranking changes: completely different items appear or disappear. Is this just personalization, or do they rotate inventory/search results aggressively?
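
One way to put a number on the churn before guessing at the cause is to compare the set of product IDs between runs. This is a minimal sketch, assuming each run yields a list of stable item identifiers; the IDs and the `compare_runs` name are hypothetical.

```python
def compare_runs(previous_ids, current_ids):
    """Summarize how the set of products changed between two scrape runs."""
    prev, curr = set(previous_ids), set(current_ids)
    union = prev | curr
    return {
        "added": sorted(curr - prev),
        "removed": sorted(prev - curr),
        # Jaccard similarity: 1.0 means identical membership, regardless of rank order.
        "jaccard": len(prev & curr) / len(union) if union else 1.0,
    }


# Hypothetical item IDs from two runs of the same keyword search.
run_a = ["111", "222", "333", "444"]
run_b = ["222", "111", "555", "444"]
report = compare_runs(run_a, run_b)
print(report)
```

Because the comparison ignores order, a pure re-ranking scores 1.0, while a falling score means items are genuinely entering and leaving the result set.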

u/eduhpmelo — 3 days ago
▲ 34 r/scrapingtheweb+2 crossposts

FireCrawl just hit 121k GitHub stars and I have a LOT of questions: the hype, the pricing trap, and what's actually going on

Okay, so I've been in the web scraping game for quite some time now. I was browsing the GitHub top-100 stars list yesterday and saw FireCrawl sitting at #73 globally with over 120k stars. That's ahead of Node.js. That's in the same breath as projects that have been around for a decade.

For context, at the end of 2024 they celebrated 20k stars. They raised their Series A in August 2025 at 43k stars. Now it's 120k+. That's roughly 3x growth in under a year, for what is essentially a web scraping API aimed at AI developers. What in the world happened? How did a scraping API beat Node.js in stars?

The repo describes itself as "search, scrape, and clean the web for AI agents." Useful, I'd say. But 120k-star useful?? There are open-source alternatives like Crawl4AI with 65k stars doing very similar things for free. Is it just incredible timing with the AI/RAG pipeline wave, or is there a genuine technical moat here that the community is rewarding?

My main concern: is the star count organic? I'm not accusing anyone of anything, but a jump from ~20k to 120k in roughly 16 months is one of the most aggressive trajectories I've seen outside of projects with massive corporate backing (I'm thinking of Microsoft's markitdown). FireCrawl got a $14.5M Series A from Nexus and YC. Is any of that marketing spend showing up in developer mindshare as stars? I'm genuinely curious how you break into the GitHub top-100 that fast.

Additionally, can someone explain the pricing to me without making my head hurt? On the surface it looks simple: 1 credit = 1 page scraped. But the moment you turn on anything useful (AI extraction, JSON output, Enhanced Mode), you're burning 5–9 credits per page. The Hobby plan at $16/month gives you 3,000 credits, which sounds great until you realize that's only ~333 pages with JSON + Enhanced Mode enabled. A 500-page website on the Hobby plan exceeds your entire monthly allowance in a single scrape.
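
The credit arithmetic above is easy to check. The sketch below uses only the figures quoted in this post (3,000 credits on Hobby, 1 to 9 credits per page); they are not verified against current pricing, and the function names are made up for illustration.

```python
def pages_per_month(monthly_credits, credits_per_page):
    """Whole pages that fit into a monthly credit allowance."""
    return monthly_credits // credits_per_page


def credits_needed(pages, credits_per_page):
    """Credits consumed by scraping a site of the given size."""
    return pages * credits_per_page


HOBBY_CREDITS = 3000  # figure quoted in the post, not checked against the pricing page

for cost in (1, 5, 9):
    print(f"{cost} credit(s)/page -> {pages_per_month(HOBBY_CREDITS, cost)} pages/month")

# A 500-page site at the worst-case rate quoted above (9 credits/page).
site_cost = credits_needed(500, 9)
print(site_cost, "credits needed; over budget:", site_cost > HOBBY_CREDITS)
```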
Now before someone says "just self-host it": that's an option, yes, it's AGPL-3.0 open source. But the self-hosted version is deliberately crippled: no Fire-Engine (their proprietary anti-bot system), no proxy rotation, no Actions endpoint, no browser sandbox. The stuff that actually makes it worth paying for is cloud-only. AGPL also means commercial self-hosting has licensing implications your legal team needs to look at, and that's if you're within a company. If you're an individual developer, well, that can get quite expensive.

To be fair, the product genuinely seems excellent. Zapier, Shopify, Replit, and Apple are customers. The clean markdown output uses 67% fewer tokens than raw HTML. The MCP server integration means you can pipe live web data straight into Cursor or Claude. That's real value, and the community clearly feels it. But I keep coming back to the same question: is this one of the best-marketed developer tools of the AI era, or is it genuinely the best technical solution?

Someone kindly explain what is going on with FireCrawl.

u/Gwapong_Klapish — 4 days ago

Open sourced CLI to reverse engineer APIs from any website

I’m open-sourcing Opensteer -- a lightweight CLI runtime that lets AI agents control your real browser, inspect network requests, and turn browser workflows and reverse engineered APIs into reusable agent tools.

The idea is that an agent can use Opensteer to drive your local Chrome through DOM automation, observe how a site loads data, and then write repeatable scripts or functions for the parts that should be deterministic. Those functions live in your project directory, so future agents running from that directory can call them like tools.

It runs locally, connects to your actual browser session, and is meant for workflows where plain HTTP scraping is not enough: logged-in pages, JS-heavy apps, visual debugging, API discovery, and repeatable browser automation.

Feel free to try it out. I'd appreciate any feedback.

u/dinotimm — 3 days ago
▲ 161 r/scrapingtheweb+14 crossposts

Starting today, I declare scraping free again.

I got tired of anti-bot systems constantly breaking my Playwright AI agent, so I built Invisible_Playwright: an open-source, MIT-licensed Playwright and Firefox fork patched at the C++ level.

Instead of reusing the same noisy automation fingerprint, Invisible_Playwright generates a different but internally consistent browser fingerprint for each session. The goal is to remove the Playwright automation signals while keeping the browser environment coherent and reproducible.

| Category | Invisible_Playwright result |
|---|---|
| Fingerprint generation | ✅ Different, coherent per-session fingerprint |
| WebRTC | ✅ Pass (no public IP leak) |
| PixelScan | ✅ Pass (no inconsistencies) |
| CreepJS | ✅ Pass (0 lies) |
| SannySoft | ✅ Pass (all green) |
| BrowserLeaks WebRTC | ✅ Pass (no public IP leak) |
| reCAPTCHA v3 | ✅ Pass (0.90) |
| Fingerprint Pro | ✅ Pass (bot=false, tampering=false) |
| Cloudflare / Turnstile | ✅ Pass |
| hCaptcha | ✅ Pass |
| DataDome-style checks | ✅ Pass |
| Kasada-style checks | ✅ Pass |
| Akamai-style checks | ✅ Pass |
| Imperva-style checks | ✅ Pass |
| HUMAN / PerimeterX-style checks | ✅ Pass |
| Arkose-style checks | ✅ Pass |

Repo: https://github.com/feder-cr/invisible_playwright

u/bolaretyr — 8 days ago

Trying to find a Cloudflare scraping solution

I am scraping a few TB worth of data.
In my experience, 5G proxies perform like 4G in real-world use. Most providers throttle you, from what I have seen.
Residential would be too much, as the cost would add up fast.

From what I understand, Cloudflare curb-stomps datacenter IPs now.

The overall project is about 10TB.
I've got 3TB left to get. I was able to get the majority of it with my personal IP before captchas started hitting. From what I understand, captcha solvers don't work unless you have a proxy.

u/grio43 — 6 days ago
▲ 25 r/scrapingtheweb+4 crossposts

ChatGPT lawsuit opinions

I've been following the OpenAI lawsuits, and there's one detail I can't stop thinking about: a 19-year-old asked ChatGPT about mixing sedatives, it acknowledged the combo "could be risky", then gave him dosages anyway, added Benadryl to the recommendation, and told him to go lie in a dark room instead of seeking help. He died. Source.

The Canadian case is somehow worse. OpenAI's own safety team flagged the shooter's account for "gun violence activity and planning" months before the attack and pushed to notify authorities. Management said no. Source.

At some point "we're just a general-purpose tool" stops being a defense. Where that point is, that's what these trials are actually going to decide.

Guardrails are coming whether the industry wants them or not. Every lawsuit forces a paper trail. And when harmful outputs become a liability, the instinct is aggressive filtering, mandatory escalation triggers, and activity logging with retention policies. That's fine for consumer chat; for more technical users, however, it's going to be brutal.

Now, the real risk for scraping and agentic workflows is over-correction. If "how do I access this data at scale" gets flagged the same way "how do I build a weapon" does, open-weights models win by default. It would make me want to just run it locally and skip the compliance layer entirely. The smarter play would be tiered access (stricter defaults for consumer products, more permissive behavior for verified API users with actual business context), but that requires product nuance, and right now OpenAI is in legal defense mode.

My bet is that we should expect more API friction over the next 12-18 months. Local models are about to get a lot more interesting.

u/ahiqshb — 9 days ago

Datacenter proxies fine for large scraping or not anymore?

Running a scraper that pulls a lot of product pages daily. Nothing super advanced. Started with datacenter proxies because of cost. Speed is great, but getting blocked more often now, especially on a few bigger sites. Trying to decide if I should keep tweaking this or just move to residential proxies.

u/PomegranateOk9017 — 9 days ago
▲ 458 r/scrapingtheweb+9 crossposts

Today I declare scraping free again

reCAPTCHA v3 at 0.3, FP Pro flagging bot:true, Cloudflare banning my ASN on sight. Sick of it.

Built this: a Firefox build patched at the C++ level:

reCAPTCHA v3 0.90, FP Pro bot=false, tampering=false · CreepJS 0 lies · sannysoft all green · WebRTC no leak.

Self-hosted, MIT, no cloud, no subscription.

Repo: https://github.com/P0st3rw-max/stealthfox

u/bolaretyr — 12 days ago
▲ 44 r/scrapingtheweb+2 crossposts

Stop throwing residential proxies at everything, your fingerprint is the actual problem

Aight, listen up, Imma keep it real with you. I know this is going to rub some people the wrong way, but I've been doing this long enough to feel confident saying it: most of you don't have a proxy problem, you have a fingerprint problem, and you're spending $200+/month on residential bandwidth to brute-force your way around it.

I get it. Residential proxies feel like the safe default. The IP looks clean (for those who check fraud scores), it passes basic geo checks, and every provider markets them like they're the golden ticket. But if your TLS fingerprint screams "Python requests library" or your browser automation is leaking navigator properties that no real Chrome session would ever have, it genuinely does not matter how pristine your IP is. Cloudflare, Akamai, DataDome: they all fingerprint the client now, not just the address. A burnt datacenter IP with a properly spoofed JA3 hash and realistic header order will outperform a fresh residential IP attached to a naked requests.get() call nine times out of ten.

I ran a test a few weeks ago across about 15 mid-sized e-commerce sites protected by Cloudflare. Datacenter proxies with curl-impersonate had a ~91% success rate. Residential proxies with default Python requests headers? Around 60%. The residential IPs were objectively "better" IPs; they just didn't matter because the request itself was the red flag.

I think the proxy industry benefits from people not understanding this. The less you know about TLS fingerprinting and HTTP/2 header frames, the more bandwidth you burn through rotating IPs trying to find one that "works." That churn is literally their revenue model.

Before you upgrade your proxy plan, spend an afternoon with curl-impersonate or look into how got-scraping handles fingerprint randomization. Learn what JA3 and JA4 fingerprints actually are and how to check yours against real browser signatures. You might find that the $30/month datacenter plan you dismissed does the job just fine once your client stops identifying itself as a bot on the first handshake.

Now, I'm not saying residential proxies are useless. For account management, social media automation, and anything session-heavy where the IP itself gets scored over time, they're still the right call. But for scraping? Fix your fingerprint first. Then decide if you actually need the expensive IPs.
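
For anyone wondering what a JA3 fingerprint actually is: per the public JA3 description, it is an MD5 hash over five fields of the TLS ClientHello (version, cipher suites, extensions, elliptic curves, point formats), written as decimal values in the order the client sent them, with GREASE placeholder values skipped. The sketch below only illustrates that derivation. The ClientHello values are made up, not taken from a real browser, and the function names are hypothetical.

```python
import hashlib

# GREASE values (RFC 8701) are random placeholders; JA3 skips them.
GREASE = {0x0A0A + 0x1010 * i for i in range(16)}


def ja3_string(version, ciphers, extensions, curves, point_formats):
    """Build the JA3 input string from ClientHello fields (decimal, order preserved)."""
    def join(values):
        return "-".join(str(v) for v in values if v not in GREASE)

    return ",".join(
        [str(version), join(ciphers), join(extensions), join(curves), join(point_formats)]
    )


def ja3_hash(version, ciphers, extensions, curves, point_formats):
    """MD5 hex digest of the JA3 string."""
    text = ja3_string(version, ciphers, extensions, curves, point_formats)
    return hashlib.md5(text.encode("ascii")).hexdigest()


# Illustrative values only: two clients offering the same ciphers in a different order.
client_a = (771, [0x1A1A, 4865, 4866, 4867], [0, 23, 65281, 10, 11], [29, 23, 24], [0])
client_b = (771, [4866, 4865, 4867], [0, 23, 65281, 10, 11], [29, 23, 24], [0])

print(ja3_string(*client_a))
print(ja3_hash(*client_a))
print(ja3_hash(*client_b))
```

The two hashes differ even though both clients support the same ciphers, which is why an HTTP library with its own TLS defaults is distinguishable from a browser no matter which IP the request comes from.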

u/MemeLord-Jenkins — 11 days ago
▲ 23 r/scrapingtheweb+2 crossposts

Stop hardcoding your scraper logic: use the browser's Copy as cURL first

My two cents. Most people spend a bunch of time reverse engineering request headers when the browser will just hand them to you. Next time you find an API call in the Network tab, right-click it and hit Copy as cURL. Paste it into your terminal and it usually works as-is, cookies, headers and all, at least until the session cookies expire. From there you can import it directly into Postman or use a tool like curlconverter to turn it into clean Python requests code in seconds. The browser already did the hard work of figuring out what the server needs. There's no reason to reconstruct that by hand.
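
To show roughly what that conversion involves, here is a minimal standard-library sketch that parses a small subset of a copied cURL command into request parts. It is not how curlconverter works internally; the `parse_curl` helper, the example.com URL, and the header values are placeholders, and only a handful of flags are handled.

```python
import shlex
import urllib.request


def parse_curl(command):
    """Parse a small subset of a 'Copy as cURL' command into request parts."""
    # Join shell line continuations before tokenizing.
    tokens = shlex.split(command.replace("\\\n", " "))
    if not tokens or tokens[0] != "curl":
        raise ValueError("expected a command starting with 'curl'")
    request = {"method": None, "url": None, "headers": {}, "data": None}
    i = 1
    while i < len(tokens):
        tok = tokens[i]
        if tok in ("-H", "--header"):
            name, _, value = tokens[i + 1].partition(":")
            request["headers"][name.strip()] = value.strip()
            i += 2
        elif tok in ("-X", "--request"):
            request["method"] = tokens[i + 1]
            i += 2
        elif tok in ("-d", "--data", "--data-raw", "--data-binary"):
            request["data"] = tokens[i + 1]
            i += 2
        elif tok in ("-b", "--cookie"):
            request["headers"]["Cookie"] = tokens[i + 1]
            i += 2
        elif tok.startswith("-"):
            i += 1  # assumes an argument-less flag such as --compressed
        else:
            request["url"] = tok
            i += 1
    if request["method"] is None:
        request["method"] = "POST" if request["data"] is not None else "GET"
    return request


# Placeholder command in the shape a browser's "Copy as cURL" produces.
sample = (
    "curl 'https://example.com/api/items?page=2' \\\n"
    "  -H 'accept: application/json' \\\n"
    "  -H 'user-agent: Mozilla/5.0' \\\n"
    "  -b 'session=abc123' \\\n"
    "  --compressed"
)
parsed = parse_curl(sample)

# Build (but do not send) an equivalent request object.
req = urllib.request.Request(
    parsed["url"],
    data=parsed["data"].encode() if parsed["data"] else None,
    headers=parsed["headers"],
    method=parsed["method"],
)
print(req.get_method(), req.full_url, parsed["headers"])
```

For real use, a dedicated converter is the better choice, since copied commands can include flags this sketch does not understand.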

u/Bharath0224 — 11 days ago
▲ 77 r/scrapingtheweb+2 crossposts

What happens when you make a browser that is identical to Chrome but whose purpose is scraping

I built a real C++ browser and gave you a TypeScript library to control it — here's why it changes scraping

Most tools like Puppeteer and Playwright bolt automation onto Chrome from the outside. They're always playing catch-up with anti-bot systems.

I took a different approach. I built the actual browser — Qt6 + Chromium engine, written in C++. Then I wrote a TypeScript library (Piggy) that controls it over a local socket. That's why Cloudflare bypasses are almost trivial and the code stays dead simple.

Two repos, one ecosystem:

🖥️ Nothing Browser (the C++ browser) https://github.com/BunElysiaReact/nothing-browser

📦 Piggy (the TS library) — https://github.com/ernest-tech-house-co-operation/nothing-browser

What you get out of the box:

🪪 Persistent TLS fingerprint identical to real Chrome — sites can't profile you

🧠 Human Mode — randomized delays, natural scrolling, no robotic timing

⚡ Socket-based IPC — millisecond latency between your script and the browser

🌐 Remote deployment — binary runs on a VPS, you scrape from local

💾 Session persistence — save/restore cookies and storage, stay logged in

🏊 Tab pooling — concurrent requests inside one browser instance

🚀 Built-in API server — one line turns your scraper into a REST endpoint with OpenAPI docs

🔄 Proxy rotation — built-in fetch, test, switch, rotate

The code looks like this:

    import piggy from "nothing-browser";

    await piggy.launch();
    await piggy.register("books", "https://books.toscrape.com");
    await piggy.books.navigate();

    const books = await piggy.books.evaluate(() =>
      Array.from(document.querySelectorAll(".product_pod")).map(el => ({
        title: el.querySelector("h3 a")?.getAttribute("title") ?? "",
        price: el.querySelector(".price_color")?.textContent?.trim() ?? "",
      }))
    );

    console.log(books);
    await piggy.close();

That's a real browser. Not a wrapper around someone else's.

Bun-first but Node compatible. Headless and headful ship as separate binaries so you're not carrying GPU overhead when you don't need it.

📚 Docs: https://nothing-browser-docs.pages.dev

Would love issues, feedback, and ⭐ stars — built in Kenya 🇰🇪

u/PeaseErnest — 12 days ago
▲ 4 r/scrapingtheweb+1 crossposts

What tools are currently in your web scraping stack?

I’ve been seeing a lot more Playwright lately, but still plenty of people sticking with Requests/BS4 or Scrapy when the site doesn’t need a browser.

I’m mostly using Python with Requests and BS4 for simple stuff, then Playwright when a site forces it.

Always interesting to see what people actually use once the scraper has to run more than once.

u/BlueLagoon226 — 9 days ago
▲ 5 r/scrapingtheweb+1 crossposts

What was the first web scraping problem that made you realize scraping is harder than it looks?

For me, it was when a scraper worked perfectly on one page, then failed on the next page of the same site because the HTML was slightly different.

At first I thought scraping was just “fetch page, select elements, save data.” Then you run into missing fields, weird pagination, lazy loading, blocked requests, random layout changes, duplicate data, and suddenly your simple script needs error handling, retries, logging, and a way to know when it silently breaks.
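
As a rough illustration of that extra machinery, here is a minimal sketch of two of those pieces: retries with exponential backoff, and a missing-field check that flags silent breakage. The required fields, the 20% threshold, and the simulated fetch function are all assumptions made for the example.

```python
import logging
import random
import time

log = logging.getLogger("scraper")


def with_retries(fetch, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds, backing off exponentially with jitter."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except Exception as exc:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.5)
            log.warning("attempt %d failed (%s); retrying in %.1fs", attempt, exc, delay)
            sleep(delay)


REQUIRED_FIELDS = ("title", "price")  # hypothetical schema


def missing_field_rate(records, required=REQUIRED_FIELDS):
    """Fraction of records missing at least one required field."""
    if not records:
        return 1.0  # an empty result is itself a sign that something broke
    bad = sum(1 for r in records if any(not r.get(f) for f in required))
    return bad / len(records)


# Simulated fetch: fails twice, then returns one good and one incomplete record.
calls = {"n": 0}


def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated block")
    return [{"title": "Book A", "price": "10.00"}, {"title": "Book B", "price": ""}]


records = with_retries(flaky_fetch, sleep=lambda seconds: None)  # skip real waiting
rate = missing_field_rate(records)
if rate > 0.2:  # arbitrary alert threshold
    log.error("%.0f%% of records are missing required fields", rate * 100)
```

Tracking the missing-field rate per run is what catches the case where the scraper keeps returning rows but a layout change has quietly emptied one of the columns.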

Curious what moment made it click for you.

u/TheReverent — 10 days ago
▲ 12 r/scrapingtheweb+1 crossposts

We built a Claude Code plugin that generates crawler + scraper projects from a URL

We just posted a quick demo from the ScrapeOps YouTube channel showing how our Claude Code plugin generates a working web scraping project from a prompt.

The example in the video builds a crawler + product scraper pipeline for a Walmart search page. It generates the project files, schemas, parsers, README, run commands, and JSON/JSONL output. The demo uses Python + BeautifulSoup, but the plugin also supports other languages and scraping libraries like Scrapy, Playwright, Puppeteer, etc.

The part I'm most interested in feedback on is the workflow: instead of using AI to just write a parser snippet, the goal is to generate the full scraping pipeline and then let devs inspect, run, modify, or fix it from there.

Video covers:

  • installing the Claude Code plugin
  • adding the ScrapeOps to it
  • using /generate-scraper, /fix-scraper, and /generate-crawler-scraper
  • choosing language + library
  • generating crawler and product parser files
  • running the scraper and checking the structured output

This is still aimed at developers, not "magic no-code scraping."

https://www.youtube.com/watch?v=qcE5sK0DDus

The generated code should still be reviewed, especially for ToS/robots.txt considerations and production monitoring. But it's been useful for cutting down the boring scaffold/debug loop.

Would be interested to hear what people here think: useful direction, or does AI-generated scraper code create more maintenance debt than it saves?

u/ian_k93 — 13 days ago
▲ 16 r/scrapingtheweb+2 crossposts

HTTP/3 residential proxies

Has anyone had any luck with them, particularly for scraping? How is the success rate compared to regular HTTP/HTTPS or SOCKS5?

u/WarAndPeace06 — 14 days ago

Anyone found a reliable proxy for scraping google shopping without constant blocks?

Been trying to get consistent data from Google Shopping for a price tracking project and it's honestly driving me insane. Started with some cheap datacenter proxies I had lying around and got captcha'd within like 20 requests. Switched to a residential provider that looked decent on paper, but the rotation was too aggressive and I kept losing session state.

The thing is, I don't need massive volume. Maybe a few thousand product pages per day. But I DO need the sessions to stay stable enough to track pricing changes without reauthenticating every 2 minutes. I also tried rotating manually with sticky sessions, but half the IPs were apparently already burned by other scrapers.

Has anyone actually found a proxy setup that works smoothly for Google Shopping specifically? I'm starting to think the problem isn't just the proxy type but how the IPs are sourced and whether they're already flagged by Google. Would love to hear what's actually working in production right now, not just what providers claim on their landing pages.

Also curious if anyone has had luck with city-level targeting for this. Seems like it might help with consistency, but I'm not sure if it's worth the extra cost.

u/isohaibilyas — 12 days ago

Have you ever tried scraping yelp without coding?

I’ve been going back and forth on this. I need Yelp business data for about 50 cities across different categories.

I’ve made it this far without Python and I’d like to keep it that way.

I tested a couple no-code scrapers, but the results are super inconsistent. Sometimes it works, sometimes it returns nothing, and there’s no explanation for why.

Is Yelp just a nightmare to scrape, or are no-code tools just not built for this at scale?

If anyone’s found something that actually works reliably, I’d love to know what you used.

u/Sharp_Promotion_5155 — 14 days ago