u/GameLearner7


▲ 2 r/u_GameLearner7+1 crossposts

Need Web Scraping or Data Automation? (Freelance Engineer Available)

Hey!

I'm a freelance Backend and Data Automation Engineer with hands-on experience on several data collection projects. I can scrape all kinds of data from the web, including text, images, and videos, and deliver it to you.

Let me know if you need any help.

u/GameLearner7 — 17 hours ago

Building an offline NLP-to-SQL tool. What features are actually useful?

Hey Guys,

I've been looking to learn new skills and build new products, and I came across the problem of people struggling with their data. It might live in different kinds of databases, Excel, CSV files, Google Sheets, etc. Waiting for a data analyst to work through that data and deliver reports and insights can get expensive and hectic for many users. Existing BI software also charges loads of money for subscriptions, and data privacy is a concern for many.

So I'm in the middle of developing an enterprise-grade Text-to-SQL (or NLP-to-SQL) tool, or you can call it an AI Data Analyst. It's as easy to use as a WhatsApp chat.

It works like this: users connect their database through a URL, select a database file from local storage, or upload other formats like Excel or CSV directly.
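To make the upload path concrete, here is a minimal sketch of turning an uploaded CSV into a queryable table. It assumes SQLite as the local engine, which is my assumption and not something the tool is confirmed to use; `load_csv_into_sqlite` and the `sales` sample are hypothetical names for illustration.

```python
import csv
import io
import sqlite3


def load_csv_into_sqlite(conn, table, csv_text):
    """Create `table` from CSV text (first row = headers) and return the row count.

    `table` is assumed to be a trusted name chosen by the app, not user input.
    """
    reader = csv.reader(io.StringIO(csv_text))
    headers = next(reader)
    # Quote column names so headers with spaces still work. NUMERIC affinity
    # lets SQLite store number-like text as numbers and keep the rest as text.
    cols = ", ".join('"' + h.replace('"', '""') + '" NUMERIC' for h in headers)
    conn.execute(f'CREATE TABLE "{table}" ({cols})')
    placeholders = ", ".join("?" for _ in headers)
    # Skip malformed rows whose column count does not match the header.
    rows = [row for row in reader if len(row) == len(headers)]
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
sample = "region,amount\nnorth,120\nsouth,80\nnorth,30\n"
loaded = load_csv_into_sqlite(conn, "sales", sample)
totals = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
```

Once the file is a table, the chat layer only ever has to deal with SQL, whatever format the data arrived in.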

It also follows the BYOK (bring your own key) model: users can connect their own API keys, or pick a local LLM if they need to go fully offline and private.

Once the connection is established, users can chat with their data to get outputs and insights, and export them as Excel, CSV, or PDF (the export also includes business insights).

There's also an AI Auto mode that runs on autopilot and generates key insights from your data without you asking any questions.

It also shows the SQL query it used to generate the results, and users can get data by running their own SQL queries or editing the generated one.
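Letting users run or edit the SQL raises the question of what happens if a query tries to write. One possible guard, sketched here under the assumption of a SQLite backend, is to switch the connection to read-only mode with `PRAGMA query_only` around each user query. `run_user_query` is a hypothetical helper, not the tool's actual API.

```python
import sqlite3


def run_user_query(conn, sql):
    """Run one user-supplied statement with writes disabled; return (columns, rows)."""
    conn.execute("PRAGMA query_only = ON")  # SQLite rejects writes while this is on
    try:
        # sqlite3 executes a single statement per call, so stacked
        # statements like "SELECT 1; DROP TABLE x" are rejected as well.
        cursor = conn.execute(sql)
        columns = [d[0] for d in cursor.description or []]
        return columns, cursor.fetchall()
    finally:
        conn.execute("PRAGMA query_only = OFF")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [("north", 120), ("south", 80)])
conn.commit()

columns, rows = run_user_query(conn, "SELECT region, amount FROM sales ORDER BY amount")

try:
    run_user_query(conn, "DELETE FROM sales")
    write_blocked = False
except sqlite3.Error:
    write_blocked = True
remaining = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
```

Other engines would need their own mechanism, for example a database user with read-only permissions.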

Since I'm right in the middle of building this, I wanted to get a reality check from people who actually work with data.

For the data engineers: what usually breaks when you try to use text-to-SQL pipelines? (Messy schemas, hallucinations?)

For everyone else: what's the most annoying part of your current BI setup? Any feedback would be huge. I just want to make sure I'm not building features nobody actually needs.

u/GameLearner7 — 17 hours ago

▲ 1 r/SideProject

Architecture Breakdown: How I built a concurrent pipeline to scrape and migrate 5TB of geo-restricted video data on a low-end laptop.

Hey everyone,

I recently wrapped up a massive data extraction and automation project and wanted to share the architecture. The goal was to scrape, process, and migrate over 2,000 episodes (about 5TB of data) of geo-restricted media, converting dynamic XHR network payloads into a resumable, fault-tolerant local-to-cloud pipeline.

The best part? I achieved all of this on a humble i3 7th Gen laptop with just 8GB of RAM and a 256GB SSD. Because of my severe hardware constraints, aggressive state management and optimized caching were absolutely critical.

Here is how I broke down the system to handle it without bottlenecking my machine:

The Tech Stack: Node.js, Puppeteer, Python (Flask), rclone

Phases 1 & 2: Bypassing Restrictions & Interception (Node.js + Puppeteer)

Initial access was geo-restricted. Instead of fighting it with standard requests, I attached Puppeteer to a remote Chrome instance. I set up network response listeners (page.on('response')) to intercept the raw XHR/Fetch traffic. This allowed me to parse the dynamic JSON and extract the secured HLS .m3u8 stream URLs directly from the payload.
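The interception itself ran in Node.js with Puppeteer, so this is not that code. It is only a rough Python sketch of the parsing step: given one intercepted JSON body, walk it and collect anything that looks like an .m3u8 URL. The payload shape and the `find_m3u8_urls` name are made up for illustration.

```python
import json


def find_m3u8_urls(payload):
    """Walk a decoded JSON value and collect every string that points at an .m3u8 playlist."""
    found = []
    stack = [payload]
    while stack:
        node = stack.pop()
        if isinstance(node, dict):
            stack.extend(node.values())
        elif isinstance(node, list):
            stack.extend(node)
        # Check only the part before "?" so query-string tokens don't cause false matches.
        elif isinstance(node, str) and ".m3u8" in node.split("?", 1)[0]:
            found.append(node)
    return found


# Hypothetical response body; the real payload structure is not known.
body = """
{"episode": {"id": 12, "title": "Pilot",
             "sources": [{"type": "hls",
                          "url": "https://cdn.example.com/ep1/master.m3u8?token=abc"}],
             "thumbnail": "https://cdn.example.com/ep1/thumb.jpg"}}
"""
urls = find_m3u8_urls(json.loads(body))
```

Searching the whole tree instead of hard-coding a key path means the extractor keeps working if the site nests the URL somewhere else.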

Phase 3: The API Bridge

To keep the scraper lightweight, Node.js doesn't do the heavy lifting. It dispatches the extracted URL and localized metadata (parsed from an Excel sheet) via a POST request to a local Python Flask server, then polls the output directory waiting for a .done state marker.
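The polling side lived in Node.js in my setup, but the idea is language-independent. Here is a minimal Python sketch of waiting for a .done marker with a timeout. The `wait_for_done` name, the marker filename, and the timeout values are illustrative assumptions.

```python
import tempfile
import time
from pathlib import Path


def wait_for_done(marker, timeout=60.0, interval=1.0):
    """Poll until the marker file exists; return False if the timeout passes first."""
    marker = Path(marker)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if marker.exists():
            return True
        time.sleep(interval)
    return marker.exists()  # one last check right at the deadline


with tempfile.TemporaryDirectory() as out_dir:
    marker = Path(out_dir) / "episode_0001.done"  # hypothetical marker name
    before = wait_for_done(marker, timeout=0.2, interval=0.05)
    marker.touch()  # stands in for the worker finishing the job
    after = wait_for_done(marker, timeout=0.2, interval=0.05)
```

A file marker is crude compared to a callback, but it survives crashes on either side: after a restart, whatever is on disk is still the truth.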

Phase 4: High-Throughput Processing (Python)

Python takes over, resolves the master .m3u8 for the highest bandwidth stream, and extracts the individual .ts chunks. I used ThreadPoolExecutor (capped at 12 workers) to download the 4MB chunks concurrently. This maxed out my 150 Mbps connection continuously without dropping packets or overloading my 8GB RAM.
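Roughly, that phase looks like the sketch below: pick the variant with the highest BANDWIDTH from the master playlist, list the segment URLs, and fetch them through a pool capped at 12 workers. This is a simplified reconstruction and not my production code. The function names are illustrative, and `fetch` is injected so the example runs without a network; a real one would do an HTTP GET and write the bytes to disk.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urljoin


def pick_best_variant(master_text, base_url):
    """Return the absolute URL of the variant stream with the highest BANDWIDTH."""
    lines = [ln.strip() for ln in master_text.splitlines() if ln.strip()]
    best_bw, best_uri = -1, None
    for i, line in enumerate(lines):
        if not line.startswith("#EXT-X-STREAM-INF:"):
            continue
        # The lookbehind keeps AVERAGE-BANDWIDTH from matching.
        match = re.search(r"(?<![A-Z-])BANDWIDTH=(\d+)", line)
        has_uri = i + 1 < len(lines) and not lines[i + 1].startswith("#")
        if match and has_uri and int(match.group(1)) > best_bw:
            best_bw, best_uri = int(match.group(1)), lines[i + 1]
    if best_uri is None:
        raise ValueError("no variant streams found")
    return urljoin(base_url, best_uri)


def segment_urls(media_text, base_url):
    """Return absolute URLs for every segment line in a media playlist."""
    lines = (ln.strip() for ln in media_text.splitlines())
    return [urljoin(base_url, ln) for ln in lines if ln and not ln.startswith("#")]


def download_all(urls, fetch, max_workers=12):
    """Fetch every URL concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))


MASTER = """#EXTM3U
#EXT-X-STREAM-INF:BANDWIDTH=800000,RESOLUTION=640x360
low/index.m3u8
#EXT-X-STREAM-INF:AVERAGE-BANDWIDTH=4000000,BANDWIDTH=5000000,RESOLUTION=1920x1080
high/index.m3u8
"""
MEDIA = "#EXTM3U\n#EXTINF:4.0,\nseg0.ts\n#EXTINF:4.0,\nseg1.ts\n#EXT-X-ENDLIST\n"

best = pick_best_variant(MASTER, "https://cdn.example.com/ep1/master.m3u8")
segments = segment_urls(MEDIA, best)
chunks = download_all(segments, fetch=lambda url: url.encode())  # fake fetch
```

Threads are enough here because the work is I/O-bound. The cap is what keeps memory flat: at most 12 chunks are in flight at once.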

Phase 5: Resumable Storage Architecture

Because this ran for days and my storage was highly limited, fault tolerance was critical.

* SSD-to-HDD Caching: Chunks were initially written to my small, fast 256GB SSD temp folder to prevent I/O blocking.

* Validation: Once a full episode was stitched and validated, it was moved to external bulk HDD storage, and the .done marker was written to signal Node.js to fire the next job, clearing up my SSD space immediately.
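The two steps above can be sketched like this. It is a simplified illustration with several assumptions: validation is reduced to a byte-count check as a stand-in for whatever check you prefer, the marker is written next to the final file, and `finalize_episode` plus the directory names are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def finalize_episode(name, chunk_paths, ssd_dir, hdd_dir):
    """Stitch chunks on the SSD, move the result to the HDD, then write the .done marker.

    Returns False if the episode was already completed in an earlier run.
    """
    ssd_dir, hdd_dir = Path(ssd_dir), Path(hdd_dir)
    marker = hdd_dir / f"{name}.done"
    if marker.exists():
        return False  # resumability: finished work is never redone

    expected = sum(Path(c).stat().st_size for c in chunk_paths)
    stitched = ssd_dir / f"{name}.ts"
    with stitched.open("wb") as out:
        for chunk in chunk_paths:
            out.write(Path(chunk).read_bytes())

    # Stand-in validation (assumption): the stitched size must equal the sum of the chunks.
    if stitched.stat().st_size != expected:
        raise IOError(f"{name}: stitched size does not match chunk total")

    shutil.move(str(stitched), str(hdd_dir / stitched.name))
    for chunk in chunk_paths:
        Path(chunk).unlink()  # free the SSD right away
    marker.touch()  # written last, so a crash never leaves a false "done"
    return True


with tempfile.TemporaryDirectory() as root:
    ssd, hdd = Path(root) / "ssd", Path(root) / "hdd"
    ssd.mkdir()
    hdd.mkdir()
    chunks = []
    for i, data in enumerate([b"aa", b"bbb"]):
        path = ssd / f"chunk{i}.ts"
        path.write_bytes(data)
        chunks.append(path)

    first = finalize_episode("ep0001", chunks, ssd, hdd)
    second = finalize_episode("ep0001", chunks, ssd, hdd)  # simulated restart
    final_bytes = (hdd / "ep0001.ts").read_bytes()
    ssd_left = sorted(p.name for p in ssd.iterdir())
```

The ordering is the important part: move first, delete the chunks, write the marker last. If the process dies at any point, the missing marker means the episode simply gets redone.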

Phase 6: The Cloud Migration (rclone)

Finally, I used rclone for bulk uploading the finished multi-terabyte library from the HDD straight to Google Drive, optimizing concurrent network transfers to get the data off the local machine as fast as possible.
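For reference, a small sketch of how the upload command could be assembled from Python. `rclone copy` and `--transfers` are real rclone options, but the remote name `gdrive:`, the paths, and the transfer count here are placeholders, not my exact settings.

```python
import shlex


def rclone_copy_command(source, remote, transfers=8):
    """Build an `rclone copy` command that uploads with several parallel transfers."""
    return [
        "rclone", "copy", source, remote,
        "--transfers", str(transfers),  # number of files uploaded in parallel
        "--progress",                   # live transfer stats in the terminal
    ]


# Placeholder source path and remote; `gdrive:` must already exist in your rclone config.
cmd = rclone_copy_command("/mnt/hdd/library", "gdrive:library")
printable = shlex.join(cmd)
```

To actually run it, pass `cmd` to `subprocess.run(cmd, check=True)`. Since `rclone copy` skips files that are already identical on the destination, rerunning the same command after an interruption picks up where it left off.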

Takeaways:

If you are scraping heavy media or dynamic single-page apps, bridging Puppeteer's network interception with Python's multithreading is a lifesaver. Don't try to make Node do all the heavy file processing, especially if you are working with hardware constraints!

u/GameLearner7 — 7 days ago
