r/scrapingtheweb

▲ 4 r/scrapingtheweb+1 crossposts

dynamic websites scraping

Hello everyone, I'm an AI engineer. I want to build a real-time project that can scrape all the real estate websites, which would be useful for my research. One more thing: I want free tools because I'm a student. Can anyone help me with this? Please, this would be helpful for my career.

u/Ok-Will3506 — 15 hours ago

▲ 4 r/scrapingtheweb+1 crossposts

How are you guys running Playwright/Puppeteer in production?

I’ve been going down the rabbit hole of browser-based scraping lately and I’m curious how people here are handling it in production.
Are you running Playwright/Puppeteer on your own VPS/Kubernetes, or using something like Browserless, Browserbase, ZenRows, Bright Data, etc.?
I’m mostly wondering:

What’s been the biggest pain point?
Is it browser crashes?
Scaling?
Proxy management?
Cloud costs?
CAPTCHAs?
Anti-bot detection?
Something else entirely?

If you’re self-hosting, what made you decide not to use a managed service?
And if you’re already paying for one, what’s the main reason? Reliability? Less maintenance? Better success rate?

I’m asking because I’m trying to understand what problems are actually worth solving instead of making assumptions. Every blog post says something different, but I’d rather hear from people who are running this stuff every day.
Would love to hear your setup and what’s been working (or not working) for you.

u/Fun_Implement_3887 — 1 day ago

▲ 35 r/scrapingtheweb+3 crossposts

First BrightData and now NetNut? What's happening?

Ok so this is gonna be a bit of a ramble but I need y'all to hear this. I work adjacent to the scraping/data space (not naming employer, y'all know how it is) and I always kind of just you know, accepted that "residential proxies" were this slightly grey but mostly fine thing. Like yeah obviously somebody's home IP is being used, somebody agreed to something somewhere, moving on. Then yesterday I see netnut is just gone. Not down, not maintenance page, straight up FBI DOJ IRS-CI seizure banner, Google and Lumen and Shadowserver all stamped on it too like it's a whole coalition operation. Like bruh, I have never in my life seen this on any website. So imagine the shocker lol. And I'm sitting there like wait since when did the IRS care about proxy IPs??? Turns out IRS-CI does financial crime investigations and it could be related to that, but still, seeing that logo on a proxy provider's main website??? Diabolical mate.

So I go searching for what's going on and apparently it's tied to this thing called the Popa botnet. Basically it's been running for like four years hijacking Android TV boxes (remember that Bright Data SDK thingy??), and Krebs traced a chunk of it back to actual NetNut infrastructure through some ex-employee's domain. Not some random, an actual former VP of R&D there. Google's own blog post says they think the network was over 2 million devices at one point and that a lot of those "different" proxy brands you see around are just NetNut wearing a different logo through resellers. I swear I've seen a post talking about how there are one or two proxy providers sharing those IPs/reselling whatever. Makes sense now, people?? Am I the only one concerned here, along with a few other nerds?

And to be fair, Alarum (the parent company) came out and said that calling it a botnet is inaccurate, and also said that there are real KYC and consent flows and misuse detection on their end. So it's not like this is some minor thing everyone agrees on, it's actively disputed and I think that matters. I'm not trying to just slam a company because Krebs wrote a scary headline, no. Make what you will of it, but I think things are rising to the surface more and more often. Again, it does make me think about how much of the "residential proxy pool" any of us are touching day to day is actually consented in a way a normal person would recognize as consent, versus consented in the "buried in paragraph 14 of an SDK terms screen" way. Like where's the line between a legit residential network and just a nicer branded botnet, and how would any of us even know the difference from the outside?

Not trying to start a witch hunt on the whole industry, genuinely just spent my evening reading legal filings and threat intel blogs instead of sleeping like a normal person, but after the Bright Data SDK shenanigans and now NetNut, boy, there's gonna be more stuff coming out. So I figured I'd share since I know a bunch of you here actually work with this infra daily and probably have opinions on these matters.

Sources if anyone wants to go check themselves.

Krebs on Security, the original Popa botnet reporting: https://krebsonsecurity.com/2026/06/popa-botnet-linked-to-publicly-traded-israeli-firm/

Google's threat intel writeup on the disruption: https://cloud.google.com/blog/topics/threat-intelligence/google-continued-disruption-residential-proxy-networks

Alarum's official response: https://alarum.io/alarum-technologies-responds-to-inquiry-into-residential-proxy-networks/

And divinetworks.com itself, which is also showing the seizure page now: https://divinetworks.com/

P.S. Some time later I found that they mixed up the domains: the .com is down, but the .io, which is NetNut's main page, is not.

u/ahiqshb — 3 days ago

▲ 0 r/scrapingtheweb

Need Zillow scraper

Hello everyone, I'm trying to scrape the Zillow website, but because of its anti-bot system I'm unable to. Can anyone help me with that?

u/Ok-Will3506 — 2 days ago

▲ 4 r/scrapingtheweb

Best mobile proxy for Cloudflare / Akamai bypass?

Hey guys, currently running a scraping project that extracts data from high anti-bot protected sites. Tried resi and ISP IPs, mobile rotating as well, but my script is still getting detected and failing to retrieve info.

I don't need thousands of IPs, I need a few good quality mobile IPs with per req rotation. Currently looking at dedicated proxies from IPRoyal, Voidmob and Proxidize. Any other recommendations?

Key requirement is that it has to be a genuine mobile carrier IP from a real device, preferably with switchable p0f as well.

Budget is not a problem, I just want to solve this.

Thanks

u/Forklift_Penguin — 3 days ago

▲ 3 r/scrapingtheweb

Scraping Social Media!

Looking for advice on scraping social media platforms (Beginner)

Hi everyone,

I'm still pretty new to web scraping and currently trying to learn by scraping different social media platforms.

So far, I've managed to get YouTube and TikTok working, but I'm completely stuck with Instagram. The furthest I get is the Instagram landing page, where the Instagram logo just keeps loading, and I can't seem to get past it.

I'm not looking for someone to do it for me—I'd really like to understand what's happening and learn how to approach these kinds of problems.

If anyone has experience with this or is working on similar projects, I'd love to exchange ideas and learn from each other. I'd also be happy to share my code for the YouTube and TikTok scrapers if it could be useful in return.

Since I'm still a beginner, any advice, tips, or explanations would be greatly appreciated.

Thanks!

u/justanorianna — 4 days ago

▲ 3 r/scrapingtheweb

Looking for Free Business Email Finder Tools

Hi everyone,

I'm looking for a free and reliable way to collect publicly available business email addresses for B2B outreach. I only want to use publicly listed contact information from company websites or business directories, not personal emails.

Can anyone recommend free tools or workflows for finding verified business emails? If you've used any open-source tools or browser extensions, I'd love to hear your experience.

Thanks!

u/Patroreddit — 4 days ago

▲ 1 r/scrapingtheweb

Rental property scraping

Hi there! I am currently searching for a rental for my first apartment and I’ve been searching manually for months. I’ve heard that scraping is a great way to filter through listings (especially as right now I’m working full-time and property hunting is so time consuming).

I’m just wondering if this is something that someone would be able to help me build? And if this is the wrong subreddit I’d appreciate being pointed in the right direction!

Thanks so much in advance :)

u/Ok-Bus7775 — 4 days ago

▲ 1 r/scrapingtheweb

What's your favorite source for local business data besides Google Maps?

I'm looking for things like restaurants, plumbers, dentists, agencies...

google maps is great but I'm curious if there are other directories people scrape that are easier to work with

u/Kenyatta_Sauve — 4 days ago

▲ 1 r/scrapingtheweb

Has anyone actually measured the difference between headless and headful lately?

I keep hearing 'always use headful', but I've never seen real numbers

curious if people are seeing a noticeable difference on protected sites

u/Unfair_Commission_29 — 4 days ago

▲ 112 r/scrapingtheweb+6 crossposts

TRAWL: Self-hosted scraping engine — bypasses any JS challenge & captcha: Cloudflare, Turnstile, reCAPTCHA, hCaptcha, GeeTest. FlareSolverr & Byparr alternative and drop-in replacement for your *arr stack.

u/Germond_ — 6 days ago

▲ 2 r/scrapingtheweb

How to scrape an ATS?

Hear me out, I don't know if this is even possible.

BUT...

Is there a way to get a list of companies that use Ashby?

I'm a complete newb so this question is likely super stupid for many of you. I'll take the heat.

u/Majestic-Tax8301 — 5 days ago

▲ 21 r/scrapingtheweb+2 crossposts

Another day, another project

ok so this is gonna be long sorry in advance, but I spent my precious weekend comparing n8n scraping workflows against just writing the damn scraper in Python and I have some thoughts to share with yall.

Started because my unemployed friend sent me one of those "I automated my job search with n8n" posts and I was like, not this again, there's like a million automations already created, why did you even bother? But he somehow convinced me to try replicating something similar on my side, so basically, I had to try it. Mainly just scraping product listings off a marketplace site: turned on my n8n, dragged in an HTTP node, a Cheerio node for parsing, a loop, a Google Sheets node at the end. All in, it took maybe 40 minutes and it worked first try, which felt great.

Like the majority of the projects I've worked on, it then started throwing Cloudflare challenges after around 600 requests and that's where it stopped feeling great. I tried putting in some cheap datacenter proxies I had lying around from an old project, didn't help much, IP reputation on datacenter ranges is just garbage on anything halfway protected these days. Switched to a residential proxy pool instead and got further but still kept tripping something, which is when I remembered the IP is only half the story, the actual fingerprint matters just as much if not more. (take notes folks).

So I go to fix it in n8n and immediately hit the wall everyone who's done this before already knows about, which is that the visual nodes are amazing for the happy path and genuinely miserable the second you need anything custom. wanting to rotate user agents with actual entropy, not just a static list cycling in order. wanting real TLS fingerprint control so your handshake doesn't scream "I am a script" before you've even sent the request. wanting a headless browser session that actually behaves like a person scrolling and pausing instead of firing requests like a machine gun. none of that is a drag and drop node, you end up writing it in a Code node anyway, which is just JavaScript wearing a costume, so you've reinvented half a script but now it lives inside someone else's execution engine and you can't easily version control it or run it locally without spinning up the whole n8n instance.
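For anyone wondering what "rotate user agents with actual entropy" might look like outside a Code node, here's a minimal Python sketch. The user agent strings, the weights, and the function names are all made up for illustration; none of them come from the post.

```python
import random

# Hypothetical pool: (user agent string, relative weight). The strings and
# weights are illustrative assumptions, not real traffic-share data.
USER_AGENTS = [
    ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
     "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36", 6),
    ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
     "(KHTML, like Gecko) Version/17.4 Safari/605.1.15", 2),
    ("Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0", 1),
]


def pick_user_agent(rng: random.Random) -> str:
    """Draw one user agent at random (weighted) instead of cycling in order."""
    agents = [ua for ua, _ in USER_AGENTS]
    weights = [w for _, w in USER_AGENTS]
    return rng.choices(agents, weights=weights, k=1)[0]


def build_headers(rng: random.Random) -> dict:
    """Headers for one request; only User-Agent varies in this sketch."""
    return {
        "User-Agent": pick_user_agent(rng),
        "Accept-Language": "en-US,en;q=0.9",
    }


if __name__ == "__main__":
    rng = random.SystemRandom()  # draws from OS entropy, not a seeded PRNG
    for _ in range(3):
        print(build_headers(rng)["User-Agent"])
```

This only covers the header side. It does nothing about the TLS handshake, which is the other half of the complaint above.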

Compare that to just opening a .py file. requests or httpx if you want async, curl_cffi if the site's fingerprinting you (and these days almost everything past a certain traffic volume is), playwright if you actually need a full headless browser for JS rendered pages. yeah you're typing more in the first 20 minutes, but every single thing is yours, testable, debuggable with an actual stack trace instead of n8n's execution log that sometimes just says "error" and leaves you to guess. and when the scraper needs to scale, you're not paying per execution or fighting workflow timeout limits, you just run more processes or throw it on a queue.

At some point I just gave up taking care of proxy rotation and fingerprint config by hand and pointed the whole thing at a web scraper api instead. it felt a little like cheating at first, but I also have a day job and the marginal value of me personally maintaining a TLS impersonation layer is zero.

Also tried doing a rough version of the same job in Go for comparison because why not, and that one was interesting for a totally different reason. the speed difference on concurrent requests was kind of stupid honestly, spinning up a thousand goroutines was noticeably faster than even async Python, but the dev time to get there was longer, and if you're not already comfortable with the language you'll burn an evening just on syntax instead of solving the actual scraping problem. so it's not really "Go is better", it's "Go is better if you already know Go and need raw throughput".

One thing I didn't expect going in: mobile proxies actually outperformed residential on a couple of the trickier targets, something about carrier grade NAT making the IP reputation look cleaner since thousands of real phones share the same address anyway. didn't bother testing ISP proxies for this particular target since the site wasn't doing heavy ASN level scrutiny, but I've used them before on stuff where you want the static IP of a datacenter with the trust level of residential, good middle ground when rotation isn't what you need.
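For reference, the "a thousand concurrent requests in async Python" side of that Go comparison looks roughly like the sketch below. It's a minimal example under assumptions: `fake_fetch` is a stand-in for a real HTTP call so the code runs offline, and the concurrency cap of 50 is an arbitrary number, not something from the post.

```python
import asyncio


async def fake_fetch(url: str) -> str:
    """Stand-in for a real HTTP call (an assumption, so the sketch runs offline)."""
    await asyncio.sleep(0.01)
    return f"body of {url}"


async def crawl(urls: list[str], max_concurrency: int = 50) -> list[str]:
    """Fetch all URLs with at most max_concurrency requests in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def worker(url: str) -> str:
        async with semaphore:
            return await fake_fetch(url)

    # gather returns results in the same order as the input URLs
    return await asyncio.gather(*(worker(u) for u in urls))


if __name__ == "__main__":
    urls = [f"https://example.com/item/{i}" for i in range(1000)]
    pages = asyncio.run(crawl(urls))
    print(len(pages))
```

The semaphore is the part that matters in practice: sending faster is rarely the problem, so the cap is there to keep the request rate polite rather than to squeeze out throughput.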

Then I poked at a search api for a side piece of this project, pulling SERP results instead of crawling category pages directly, and that ended up being way less of a headache than the rest of the whole ordeal combined, search engines have their own blocking logic obviously but it's a more solved problem than random ecommerce sites running custom bot detection.

What I keep landing on is it's basically a compromise between time to first result and ceiling. n8n wins time to first result by a mile. Python wins ceiling, not even close. Go wins ceiling even further out but only if speed at scale is actually your bottleneck and not, like, getting blocked every 400 requests or so regardless of how fast you can send them, which honestly is the actual bottleneck like 90% of the time, not raw speed.

Anyway I ended up keeping the n8n workflow for the parts that are basically just data movement, sheets, notifications, scheduling, and ripped the actual fetching logic out into a standalone Python script that n8n just calls and waits on. feels like the right choice.
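A minimal sketch of what that standalone script could look like, assuming the workflow passes URLs as command-line arguments and reads one JSON document back from stdout. The post doesn't say how n8n actually invokes the script, and `scrape` here is a hypothetical placeholder that returns stub rows instead of fetching anything.

```python
import json
import sys


def scrape(urls: list[str]) -> list[dict]:
    """Placeholder for the real fetching logic (hypothetical; returns stub rows)."""
    return [{"url": u, "status": "not_fetched"} for u in urls]


def run(urls: list[str]) -> str:
    """Return one JSON document for the caller to parse from stdout."""
    return json.dumps({"items": scrape(urls)})


if __name__ == "__main__":
    # URLs arrive as command-line arguments; the result goes to stdout.
    print(run(sys.argv[1:]))
```

Keeping the contract to "arguments in, one JSON document out" means the script can also be run and tested locally without the n8n instance.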

Gotta hand it to my friend, while his solution was probably just one of those million, this gave me a chance to try out different languages for scraping and whatnot.

Perhaps someone else found a way to make the no code route hold up against fingerprint based blocking, because every workaround I tried inside n8n itself felt like duct tape on duct tape and rinse and repeat.

TLDR: Spent a weekend comparing n8n scraping workflows against Python and Go for the same job, and n8n wins on speed to a working prototype but falls apart fast once a site starts fingerprinting you past basic IP checks. Tried datacenter proxies first (useless), then residential and even mobile proxies (mobile actually did better on a couple targets), but eventually just routed everything through a web scraper api since maintaining fingerprint evasion by hand wasn't worth my time. Python gave way more control for the actual scraping logic while Go only made sense if raw concurrent throughput was the bottleneck, which it usually wasn't compared to just getting blocked. Ended up keeping n8n for the boring data movement parts (sheets, notifications, scheduling) and pulled the real scraping into a standalone script it just calls.

u/WarAndPeace06 — 6 days ago

▲ 0 r/scrapingtheweb

What's the best web scraping tool you've used?

Hey there, I'm looking for some good web scraping tools that I can use to scrape data from LinkedIn and other social media platforms. Have you used anything like that? If not, do you know of anything I could use? And if you have, what's the best one you've used and how much does it cost?

u/Vivian_3913 — 6 days ago

▲ 18 r/scrapingtheweb+3 crossposts

Web Scraping Insider #8 | "ethical" residential proxy reckoning, free residential proxy tester, browser rewrite wave (CloakBrowser / Obscura / Camoufox)

Posted the latest Web Scraping Insider #8 if anyone here wants the full breakdown:

👉 https://thewebscrapinginsider.beehiiv.com/p/the-web-scraping-insider-8

Quick summary of what's inside:

⚖️ When "Ethical" Proxies Aren't Ethical

"Ethically sourced" has become the proxy industry's favourite marketing word. Almost no provider will show you which apps their residential IPs actually come from - no public partner list, no audit trail, no independent verification.

The last couple of weeks made that gap impossible to ignore:

Spur Intelligence scanned 6,038 LG webOS + Samsung Tizen apps - proxy SDKs in 2,058 of them (42.5% on LG, 26.9% on Samsung)
Bright Data's SDK enrolling always-on smart TVs as exit nodes, with consent buried in TV remote arrow-key navigation
SuperBox streaming boxes (sold at major US retailers) shipping with dormant Popanet proxy software - routing third-party traffic through home connections with no meaningful consent
FBI/IC3 now warning consumers that everyday devices are being silently turned into proxy nodes

None of those device owners meaningfully opted in. Yet those same residential IPs feed pools sold as "ethical."

Our take: "ethical" should be a claim you have to prove - published partner list, audit trail, who consented / in which app / when - not a landing-page adjective. My bet is the market moves there within the next year or two.

---

🔮 Proxy Tester: now benchmarks residential proxies too (free for you)

We expanded the ScrapeOps Proxy Tester beyond proxy APIs. It already benchmarks ~15 proxy-API-style providers against your exact target URL. Now it does the same for residential pools, so you can compare both side-by-side.

How it works: submit your URL → real requests through each provider → every config they expose gets tested → ranked by success rate + cost per successful request.

Residential is where marketing fluff runs deepest ("30M+ IPs", "99% success rates"). From what we've seen across billions of requests, CPM rarely correlates with performance on your actual target.
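To make "cost per successful request" concrete, here is a minimal sketch of that kind of ranking. It is not how the tester itself computes anything (that isn't described here): it assumes cost per success is total spend divided by successful requests, and the provider names and numbers are made up.

```python
from dataclasses import dataclass


@dataclass
class ProviderResult:
    name: str            # hypothetical provider label
    requests: int        # requests sent to the target URL
    successes: int       # responses that passed validation
    cost_per_1k: float   # list price per 1,000 requests, in USD

    @property
    def success_rate(self) -> float:
        return self.successes / self.requests if self.requests else 0.0

    @property
    def cost_per_success(self) -> float:
        """Total spend divided by successful requests; infinite if none succeeded."""
        if self.successes == 0:
            return float("inf")
        total_cost = self.requests * self.cost_per_1k / 1000
        return total_cost / self.successes


def rank(results: list[ProviderResult]) -> list[ProviderResult]:
    """Cheapest cost per success first; higher success rate breaks ties."""
    return sorted(results, key=lambda r: (r.cost_per_success, -r.success_rate))


if __name__ == "__main__":
    results = [
        ProviderResult("cheap", requests=1000, successes=200, cost_per_1k=1.0),
        ProviderResult("pricey", requests=1000, successes=950, cost_per_1k=3.0),
    ]
    for r in rank(results):
        print(r.name, round(r.success_rate, 3), round(r.cost_per_success, 5))
```

In this made-up case the provider with three times the list price comes out cheaper per successful request (about $0.0032 versus $0.005), which is the point about CPM not predicting performance on a given target.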

Try it: https://scrapeops.io/proxy-providers/tester/

---

🥊 The browser wars are back: people are rewriting Chromium itself

For a decade, scraping browser innovation meant automation libraries on top of Chrome (Selenium → Puppeteer → Playwright). The browser underneath was treated as a commodity.

That may be shifting. Two forces:

Anti-bot reads deeper now - TLS, network stack, process behaviour - so runtime patches (playwright-stealth, undetected-chromedriver) break more often than they hold.
Chrome is heavy at scale. Thousands of concurrent browser instances (or long-running AI agents) make a purpose-built engine attractive on cost + startup time.

Projects worth watching:

CloakBrowser - Chromium fingerprints patched at the C++ source level, not JS injection. Drop-in Playwright/Puppeteer replacement. Claims 30/30 on public bot-detection suites.
Obscura - Rust headless engine from scratch, CDP-compatible so Playwright still talks to it. Claims ~70 MB binary, ~30 MB RAM, near-instant startup vs Chrome's 200 MB+ / ~2s. (Self-reported, v0.1.0 - treat as experimental.)
Camoufox - modified Firefox with C++-level fingerprint spoofing. Strongest headless evasion in independent tests we've seen. Proves this isn't only a Chromium story.

Stealth is moving below the automation layer. Most of these are young and several lean on self-reported numbers - don't rip out your production stack overnight - but the direction is worth tracking.

Bottom line: the residential proxy supply chain is getting scrutinised from every angle (smart TVs, factory hardware, federal warnings), the browser layer is getting rebuilt from scratch, and the boring work still wins - benchmark on your targets, measure cost-per-validated-payload, not vendor adjectives.

Happy to discuss specifics here - especially if you've benchmarked

— Ian (ScrapeOps)

u/ian_k93 — 6 days ago

▲ 12 r/scrapingtheweb

Web Scraping freelance

Hi everyone! Basically, I’d like to give freelancing a go. I learnt Python a while back and was thinking of brushing up on my skills and heading in that direction.
What are your thoughts on this in general? Is it still a viable option, is there plenty of work, and which programming language should I use?

u/TradeIndividual7891 — 6 days ago

▲ 3 r/scrapingtheweb

scraping - google ai overview

has anyone used Bright Data's residential proxies for scraping google ai overview without using the serp api, browser api or unlocker api?

i am currently building an autonomous web crawler that can scrape the response and sources from the google ai overview.

i am using Camoufox, and i have some credits in my Bright Data wallet, but when i navigate to google AIO i constantly get that 443 error and a "full access required" error, which requires KYC bla bla...

i already did the kyc last year and i am using it with full access too.

still, i keep getting the same error on google domains.

does anyone know how i can overcome this issue?

i would be grateful if you can help me

u/ayushdontfw — 6 days ago

▲ 1 r/scrapingtheweb+1 crossposts

Looking for people who scrape data from websites

Hello everyone, I'm looking for people who scrape data from websites. If anyone is interested, let me know.

Thank you

u/Varunkumar_10 — 7 days ago

▲ 7 r/scrapingtheweb

best way to scrape instagram - hiker api

scraping instagram is a big problem, and scraping it for certain projects can actually lead to making very, very valuable things (provided you have the consent of the specific accounts you scrape, which i do). i play indian classical music and it's a very niche field with a relatively disorganized online presence and no consolidated place to find events happening around you (in the USA at least), so i created ragaradar.com where you can find all the events. initially it was supposed to be community driven - anyone can submit an event. but no one was submitting :(
so i created an instagram scraper using hiker api to directly scrape accounts i have consent from. it's really useful and has allowed me to make an arguably useful product.
it has a pretty reasonable cost, and if you implement user id caching (thank you claude) you can cut your api call costs by 25%.
this ain't a promo.
it's a genuine suggestion for those who need to scrape insta - i know there are a lot of use cases where u need to, e.g. making a calendar of club events at a university.
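a rough sketch of the user id caching idea, assuming the saving comes from resolving each username to its id once and reusing it on later runs. `lookup` is a hypothetical stand-in for whatever billed call does that resolution (it is not a real hiker api function), and the 25% figure above is from my own usage, not something this code shows.

```python
import json
from pathlib import Path
from typing import Callable


class UserIdCache:
    """Persist username -> user id so each account is resolved only once."""

    def __init__(self, path: Path, lookup: Callable[[str], str]):
        self.path = path
        self.lookup = lookup  # hypothetical billed API call, injected by the caller
        self.ids = json.loads(path.read_text()) if path.exists() else {}

    def get(self, username: str) -> str:
        if username not in self.ids:
            self.ids[username] = self.lookup(username)  # billed only on a cache miss
            self.path.write_text(json.dumps(self.ids))  # saved for the next run
        return self.ids[username]
```

the file on disk is what makes it pay off: a scheduled scraper that starts fresh every run would otherwise redo the same lookups each time.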

u/Relevant_Mine7529 — 6 days ago

▲ 1 r/scrapingtheweb

website scrapers

Hello, I’m new here so I’m sorry if what I’m asking is stupid. Is there any scraper for websites like Vestiaire Collective, Depop, Poshmark, etc?

u/Subject-Divide512 — 6 days ago