r/webscraping

▲ 3 r/webscraping

Equivalent to undetected-chromedriver with Selenium Node.js?

I already wrote my Node.js script to interact with the site using the normal chromedriver for Selenium, planning to switch over to the undetected one, but I found out that it only works with Python. Is there a way to use an undetected Chrome browser without rewriting my whole script?

u/BlockofAmethyst — 19 hours ago

▲ 1 r/webscraping

URLs loaded in pass 1 failed to connect in pass 2.

I vibe-coded a scraper for my business that scrapes URLs in bulk. Two scripts, two stages:

Stage 1: Python + aiohttp. For each domain, a `GET https://domain`, then a `GET http://domain` if that fails. 50 concurrent requests and a 12s timeout. It reads the first 10KB of the response and checks headers + HTML. This stage gets 750+ matches.

Stage 2 takes those 750 matches and does a deeper scrape per domain (looking for more info), using a longer 30s timeout and much lower concurrency (~10 at a time).

The confusing part: 97% of the 750 URLs that returned HTML within 12s in stage 1 came back as "no html" in stage 2, with a longer timeout and far less concurrent load. I isolated a few of these and ran a single bare aiohttp request against each one by itself, with no concurrency at all. Result: the connection dies at the raw TCP handshake. It never even completes the SYN/ACK, let alone gets to TLS or HTTP. Plain `curl` against the same domain gives the same result: connect timeout.
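
One way to repeat that isolation test across the failing domains is a bare TCP connect check that skips TLS and HTTP entirely. The sketch below is a minimal illustration using only the standard library; the function name `tcp_connect_check` and its defaults are assumptions, not part of the original scripts.

```python
import socket
import time


def tcp_connect_check(host: str, port: int = 443, timeout: float = 5.0) -> dict:
    """Attempt a bare TCP connection and report how it ended."""
    started = time.monotonic()
    try:
        # create_connection resolves the name and completes the TCP handshake only;
        # no TLS or HTTP traffic is sent.
        with socket.create_connection((host, port), timeout=timeout):
            outcome = "connected"
    except socket.gaierror:
        outcome = "dns_failure"
    except socket.timeout:
        outcome = "connect_timeout"
    except ConnectionRefusedError:
        outcome = "refused"
    except OSError as exc:
        outcome = f"os_error: {exc}"
    return {
        "host": host,
        "port": port,
        "outcome": outcome,
        "seconds": round(time.monotonic() - started, 2),
    }
```

Running it against the same domains from a second network or IP would show whether the timeouts follow the machine or the target; that comparison is a suggestion, not something the post has tested.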

u/MeringueUnusual9775 — 1 day ago

▲ 40 r/webscraping

What kind of captcha is this?

Every day, captchas are getting harder than they were before.

u/Acceptable_Reach_312 — 1 day ago

▲ 1 r/webscraping

How do you avoid scraping the same page twice?

We mostly scrape job websites and crawl job postings every 3 hours, at a volume of about 1M jobs per month.

My issue is that I need to fetch individual job pages only when they change. For example, once a job has been published, ATSs usually don't republish a new page when the job is closed. Instead, they simply update the expiration date, status, or other details. So instead of searching for new pages, I need to re-fetch the same ones from time to time, but it's very expensive, and I have no idea how to choose which ones are worth a second check.

The approach I'm using now is that we check, in order:

i) jobs with the closest expiration date,

ii) jobs that haven't been crawled for the longest time, and

iii) companies that historically update their job postings more frequently.

But still, we're not getting the right data at the right time. We usually detect those updates after 4–5 days, which is not sustainable. We need fresh jobs as soon as they change so applicants always see up-to-date listings.

Is there a way to monitor whether a webpage has changed without re-scraping it every time? Or is there a better approach I am ignoring?
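
One common option is HTTP revalidation: store the `ETag` and `Last-Modified` values from the previous crawl and send them back as `If-None-Match` / `If-Modified-Since`, so a server that supports them can answer `304 Not Modified` without a body. A content hash is a fallback for servers that ignore those headers. The sketch below assumes the ATS pages expose such headers, which is not guaranteed; all function names are hypothetical.

```python
import hashlib
import re
from typing import Optional


def conditional_headers(etag: Optional[str], last_modified: Optional[str]) -> dict:
    """Build revalidation headers from values saved on the previous crawl."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def content_fingerprint(html: str) -> str:
    """Hash the page after collapsing whitespace, so trivial reflows do not count as changes."""
    normalized = re.sub(r"\s+", " ", html).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def has_changed(status_code: int, html: Optional[str], previous_fingerprint: Optional[str]) -> bool:
    """Decide whether a job page needs to be re-parsed."""
    if status_code == 304:
        # The server confirmed that the stored copy is still current.
        return False
    if html is None:
        # No body to compare (for example, a failed fetch); keep the stored record.
        return False
    return content_fingerprint(html) != previous_fingerprint
```

This still costs one request per page, but a `304` response carries no body, and the fingerprint avoids re-parsing pages whose content is unchanged.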

u/Necessary_Pop_9247 — 3 days ago

▲ 20 r/webscraping

FBI, Google Take Down NetNut Proxy Network

infosecurity-magazine.com

u/madredditscientist — 3 days ago

▲ 0 r/webscraping

Noob here... How to scrape YouTube channels with emails?

I want to offer my services to different YouTube channels that fit certain characteristics, themes, subscriber count, etc. I've created a Codex application that allows me to search for channels using the official YouTube API and save them to a database. So far, so good. Now come the problems:

1: The problem arises when revealing the email address requires solving a CAPTCHA. I've seen services that can do this with a Chrome extension; I don't know if there are other options.

2: The big problem is that I later realized YouTube only allows revealing 5 email addresses per day.

How do people scrape contact emails from YouTube channels? Thanks in advance.

u/pepitogrillo221 — 3 days ago

▲ 16 r/webscraping

Monthly Self-Promotion - July 2026

Hello and howdy, digital miners of r/webscraping!

The moment you've all been waiting for has arrived - it's our once-a-month, no-holds-barred, show-and-tell thread!

Are you bursting with pride over that supercharged, brand-new scraper SaaS or shiny proxy service you've just unleashed on the world?
Maybe you've got a ground-breaking product in need of some intrepid testers?
Got a secret discount code burning a hole in your pocket that you're just itching to share with our talented tribe of data extractors?
Looking to make sure your post doesn't fall foul of the community rules and get ousted by the spam filter?

Well, this is your time to shine and shout from the digital rooftops - Welcome to your haven!

Just a friendly reminder, we like to keep all our self-promotion in one handy place, so any promotional posts will be kindly redirected here. Now, let's get this party started! Enjoy the thread, everyone.

u/AutoModerator — 5 days ago

▲ 1 r/webscraping

How to scrape an ATS?

Hear me out, I don't know if this is even possible.

BUT...

Is there a way to get a list of companies that use Ashby?

I'm a complete newb so this question is likely super stupid for many of you. I'll take the heat.

u/Majestic-Tax8301 — 4 days ago

▲ 112 r/webscraping+6 crossposts

TRAWL: Self-hosted scraping engine — bypasses any JS challenge & captcha: Cloudflare, Turnstile, reCAPTCHA, hCaptcha, GeeTest. FlareSolverr & Byparr alternative and drop-in replacement for your *arr stack.

u/Germond_ — 6 days ago

▲ 8 r/webscraping

SSL-unpinning the Google Maps app

Does anyone here know if it is possible to MITM Google Maps (the Android app) so that I can look at the HTTPS traffic the app makes? I have a rooted phone and I installed a system certificate, but the app refuses to accept it (I get mitmproxy errors). Using the web version is unfortunately not an option for me right now. I tried giving Codex an adb shell, but that was just a waste of tokens.

u/Filip769 — 5 days ago

▲ 1 r/webscraping+1 crossposts

Looking for people who scrape data from websites

Hello everyone, I'm looking for people who scrape data from websites. If anyone is interested, let me know.

Thank you

u/Varunkumar_10 — 7 days ago

▲ 0 r/webscraping

Please help me out in advanced crawling and scraping. This is urgent.

I want to be able to crawl and scrape smartly.

So here is the thing: I currently work at a company, and I have to crawl and scrape a bunch of websites daily.
These are company websites, and what I need to scrape is information about the people at each company from the team/leadership/about pages.

The thing is, some websites don't put the "about" details on the team page itself. I have handled that: I find each person's separate link in the HTML of the team page and scrape it as well.

But for dynamic cards or components/modals, such as images of people that have to be clicked to reveal the details, my crawler fails.

I fixed it for a few websites, but there are just so many outliers and so many different types of websites. Sometimes there are even dropdowns for the type of team page you want: Leadership/Board-Of-Directors/Healthcare/etc.

I tried an agentic approach some time ago, using the tools of an open-source browser and an LLM API key. One agent only.

But the agent fails even worse. It has no idea what to do, even though I give it all the tools and an in-depth prompt.

Please help. I am very close to completing my project, and this is really ruining it for me.

u/error-dgn — 7 days ago

▲ 3 r/webscraping

Intermittent 403s when scraping with Selenium

I have a Python script running a headless Selenium WebDriver, looking up many individual records from a site. Most of the time it runs fine, but every 10 minutes or so it starts hitting 403 errors. From trying different delayed retries, I've found that the 403s consistently last for about 45-60 seconds. So the best I can do as a workaround is sleep for 60 seconds once I hit a 403, then resume normal requests. I've tried setting a non-headless user agent, namely Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.50 Safari/537.36. This didn't help. The behavior is also the same whether I'm running from my local machine or from an EC2 instance.

What else can I try?
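
The sleep-and-resume workaround described above can be sketched as a small retry wrapper. This is only an illustration: `fetch` is a hypothetical callable that loads one record and returns the HTTP status code (Selenium does not hand this back directly, so how it is obtained is left open), the 60-second cooldown comes from the post, and the attempt limit is an assumption.

```python
import time
from typing import Callable


def fetch_with_cooldown(
    fetch: Callable[[], int],
    cooldown_seconds: float = 60.0,
    max_attempts: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call fetch(); on a 403, wait out the block window and retry.

    `fetch` is a hypothetical callable returning an HTTP status code.
    Returns the last status seen, which is still 403 if every attempt was blocked.
    """
    status = fetch()
    attempts = 1
    while status == 403 and attempts < max_attempts:
        sleep(cooldown_seconds)  # pause for the observed block window
        status = fetch()
        attempts += 1
    return status
```

Passing `sleep` as a parameter keeps the wrapper testable without real waiting, and logging the time between 403 windows would show whether they line up with a fixed request count.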

u/Inca_Digital — 8 days ago

▲ 1 r/webscraping

Please help me out in advanced web crawling and scraping.

I want to be able to crawl and scrape smartly.

So here is the thing: I currently work at a company, and I have to crawl and scrape a bunch of websites daily.
These are company websites, and what I need to scrape is information about the people at each company from the team/leadership/about pages.

The thing is, some websites don't put the "about" details on the team page itself. I have handled that: I find each person's separate link in the HTML of the team page and scrape it as well.

But for dynamic cards or components/modals, such as images of people that have to be clicked to reveal the details, my crawler fails.

I fixed it for a few websites, but there are just so many outliers and so many different types of websites. Sometimes there are even dropdowns for the type of team page you want: Leadership/Board-Of-Directors/Healthcare/etc.

I tried an agentic approach some time ago, using Browser Use's tools and an OpenAI API key. One agent only.

But the agent fails even worse. It has no idea what to do, even though I give it all the tools and an in-depth prompt.

Please help. I am very close to completing my project, and this is really ruining it for me.

u/error-dgn — 8 days ago

▲ 0 r/webscraping

How can you post across all your social media for free?

Is there a tool that lets you post across all your social media for free? And does this fall under web scraping as well or not?

Any help is appreciated.

And how hard is it to make my own cross-posting tool for social media? Can I make it for free?

u/PomegranateDue4853 — 9 days ago

▲ 3 r/webscraping

Trying to scrape ANA of Japan

Hi guys

I am trying to scrape ANA Airlines of Japan...domestic POS mainly.

https://www.ana.co.jp/en/jp/search/domestic/flight/

I have automated the international markets, and they are working fine on a Playwright setup. But the domestic markets are not working, and I am getting blocked on the list page of the site.

It is Akamai-protected, and the API for international and domestic POS is different.

I have tried Playwright and Camoufox, and also a hybrid setup: getting cookies from a browser and then making the requests from Python.

It works fine for a few requests, but then it gets blocked.

And I am trying to scrape it at scale.

Can someone give it a try, or help me if you have faced similar issues?

Sorry for any sentence or grammar issues... I'm writing this while traveling, anticipating some response by the time I reach home.

Thanks:)

u/Natural_Rock_3536 — 10 days ago

▲ 9 r/webscraping

Advice on what to do?

I have a new business. I have worked really hard to try and pull myself out of the trenches. Now I have found that I need data on sold items on eBay to make anything meaningful of this business.

I have no coding experience. I thought about learning how to code; however, it would take me about a year or more to accomplish. Meanwhile my business will starve.

I have been collecting data on sold eBay listings using AI. I pick particular listings to enter, so I originally thought a scraper wouldn't work well. There is no way to pick through the listings automatically without, I imagine, some serious code. I can't have repeats of items in my list, and many of the same items have variable names. I suspect this would be very hard for a computer to parse. I currently take a screenshot of the listing, and the AI extracts the info I need and puts it into a spreadsheet. It won't let me enter a direct eBay URL. It is horribly slow, though still much faster than manual entry.

I am wondering: are there scrapers where I can just enter an eBay URL and get the data back fast? I don't need automation. I understand eBay is hard to scrape, so I suspect it won't be that easy. I saw there were some APIs for it, but if we're being honest, I don't even know how to use them.

I need to collect between 200-500 listings a day.

At the rate I'm currently going it will take me about a year to collect all the data I need. Any advice on the direction I should go?

u/LobeLifeCo — 11 days ago

▲ 7 r/webscraping

Is it impossible to scrape IMDb?

Hello. I’m a programming beginner, and I’m trying web scraping for the first time.

I’m trying to scrape the IMDb page /chart/top/?ref_=nv_mv_250 using BeautifulSoup, but the data is not being loaded. Other websites load the data properly.

Does IMDb not allow web scraping?
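
A quick way to tell whether the data is in the HTML that BeautifulSoup receives, or is filled in later by JavaScript, is to look for structured data embedded in the raw response. The sketch below pulls `application/ld+json` blocks out of an HTML string using only the standard library. Whether IMDb's chart page exposes its list this way is an assumption to verify, and `SAMPLE_HTML` is made-up markup for illustration.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self._buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            try:
                self.documents.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip blocks that are not valid JSON


def extract_json_ld(html: str) -> list:
    """Return every JSON-LD document found in the given HTML string."""
    parser = JsonLdExtractor()
    parser.feed(html)
    return parser.documents


# Hypothetical markup, not copied from any real site.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@type": "ItemList", "itemListElement": [{"item": {"name": "Example Movie"}}]}
</script>
</head><body><div id="root"></div></body></html>
"""

documents = extract_json_ld(SAMPLE_HTML)
```

If the list comes back empty on the real response, it is also worth checking the HTTP status code first, since some sites answer default library User-Agents with an error page instead of the content.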

u/beomstead — 11 days ago

▲ 5 r/webscraping

How to scrape Alibaba without getting caught?

I'm planning to create an AI agent for personal use. As one of its functions, I want it to scrape product data without getting caught/blocked.

I'm new to web scraping, and I know that Alibaba has some of the best protection out there, but I also know there are libraries like Playwright that are specifically designed for issues like these, and AI is a game changer too.

I would appreciate anyone guiding me on the topic.

u/Bubibobo_working — 11 days ago

▲ 5 r/webscraping

How to price this job?

Both client and I live in the US. There's a site with continuously updated records of entities, which have to be looked up one at a time. I have a list of 320k of these entities. What would be a fair price to run this list once per week, delivering a spreadsheet of any updates?

u/Inca_Digital — 13 days ago