u/vroemboem

Where do social media links show?

You can add your social media links to your Google Business Profile. Where do these actually show up?

I see them at the bottom of the search knowledge graph. Is there any other place?

reddit.com
u/vroemboem — 7 days ago

How to set dogpile.com search location?

The location that dogpile.com uses for a search seems to be based solely on the IP address. Is it possible to change this location? E.g. I'm in the US, but I want to perform a search as if I were in France. Is this possible without changing my IP address?

reddit.com
u/vroemboem — 8 days ago

Data pipeline and storage after scraping

The web scraping part I have covered. I'm scraping multiple sources using Crawlee. Total data size is 200 GB. Every day I'm fetching new records.

I fetch raw HTML, which I store in S3 object storage. I then turn this into Parquet and clean up the data using DuckDB. Previously I also used PostgreSQL, but it used a lot of RAM, mostly due to entity resolution. For example, if I find two addresses that are the same I want to link them, but sometimes there are typos. The same goes for people.
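To make the entity resolution problem concrete, here is a minimal sketch of linking near-duplicate addresses using only Python's standard library. It is not what my pipeline runs: the function names, the 0.9 threshold, and the sample addresses are all made up for illustration, and `difflib` is just a stand-in for whatever matcher you prefer.

```python
from difflib import SequenceMatcher


def normalize(address: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    kept = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in address.lower())
    return " ".join(kept.split())


def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two normalized addresses."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def link_records(addresses: list[str], threshold: float = 0.9) -> list[list[str]]:
    """Group addresses whose pairwise similarity reaches the threshold.

    Uses a small union-find so that links are transitive:
    if A matches B and B matches C, all three end up in one cluster.
    The threshold of 0.9 is an arbitrary illustrative value.
    """
    parent = list(range(len(addresses)))

    def find(i: int) -> int:
        # Follow parent pointers to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Compare every pair once and merge clusters that are similar enough.
    for i in range(len(addresses)):
        for j in range(i + 1, len(addresses)):
            if similarity(addresses[i], addresses[j]) >= threshold:
                parent[find(j)] = find(i)

    clusters: dict[int, list[str]] = {}
    for i, address in enumerate(addresses):
        clusters.setdefault(find(i), []).append(address)
    return list(clusters.values())


# Hypothetical sample data: the second entry is a typo of the first.
sample = [
    "12 Main Street, Springfield",
    "12 Main Stret, Springfield",
    "98 Ocean Avenue, Miami",
]
clusters = link_records(sample)
for cluster in clusters:
    print(cluster)
```

The pairwise loop is quadratic in the number of records, so at my scale it would only be usable after blocking the data into small groups first (for example by postal code), which is part of why this step gets expensive.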

The goal is to combine these sources into a data warehouse where I can build data marts on top to serve APIs.

What does your process look like after the data is scraped? How do you store it? Where do you store it? How do you combine sources? How do you monitor your scrapers and pipelines?

reddit.com
u/vroemboem — 9 days ago

Has anyone succeeded in scraping Google search results without using browser automation?

I believe you first need to get cookie consent and then send a request to /search?filter=0&gbv=1 with a working user agent. However, none of the user agents I've tried work.
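For reference, a rough sketch of the request I mean, built with Python's standard library and not actually sent. The host `www.google.com`, the `q` parameter, the user agent string, and the cookie value are assumptions or placeholders; only the `/search` path and `filter=0&gbv=1` come from what I described above, and I have no evidence that any particular values avoid the captcha.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed host; the path and the filter/gbv parameters are the ones from the post.
GOOGLE_SEARCH_URL = "https://www.google.com/search"


def build_search_request(query: str, user_agent: str, consent_cookie: str) -> Request:
    """Build (but do not send) a GET request for the basic HTML results page."""
    params = {"q": query, "filter": "0", "gbv": "1"}
    url = f"{GOOGLE_SEARCH_URL}?{urlencode(params)}"
    headers = {
        "User-Agent": user_agent,
        # Whatever cookie the consent step returned; the format is not specified here.
        "Cookie": consent_cookie,
    }
    return Request(url, headers=headers)


request = build_search_request(
    query="web scraping",
    user_agent="ExampleAgent/1.0",  # placeholder, not a known working user agent
    consent_cookie="NAME=VALUE",    # placeholder, not a real consent cookie
)
print(request.full_url)
```

Passing the result to `urllib.request.urlopen` would send it, which is the step where I get the captcha.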

I always get a captcha page.

Any pointers?

For anyone just looking for search results instead of Google search results, Yahoo, DuckDuckGo and Startpage are quite easy to scrape.

reddit.com
u/vroemboem — 14 days ago

Render Workflows is very nice, you just pay for compute, which is like $0.10 per GB-hour. It's not really focused on data engineering. For my project that would be like $10/month.

I would really like to use Dagster+ serverless, but I'll quickly be above $100/month (mostly due to materialization credits, compute seems reasonably priced). That's a big cost for a bootstrapped startup just to gain some pipeline observability.

Bruin Cloud also seems really nice and would be a perfect use case fit, but they charge $10.00 per GB-hour. That's 100 times more than Render.
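A quick back-of-the-envelope check of those numbers, as a sketch. It assumes both vendors meter GB-hours the same way and that my workload would use the same number of GB-hours on either platform, which may not hold in practice. The variable names are just for illustration.

```python
# Prices quoted above, in USD per GB-hour.
RENDER_RATE = 0.10
BRUIN_RATE = 10.00

# My estimated monthly bill on Render, in USD.
RENDER_MONTHLY_ESTIMATE = 10.00


def implied_gb_hours(monthly_cost: float, rate: float) -> float:
    """GB-hours per month implied by a monthly bill at a given rate."""
    return monthly_cost / rate


def monthly_cost(gb_hours: float, rate: float) -> float:
    """Monthly bill for a given usage at a given rate."""
    return gb_hours * rate


usage = implied_gb_hours(RENDER_MONTHLY_ESTIMATE, RENDER_RATE)
bruin_monthly = monthly_cost(usage, BRUIN_RATE)
price_ratio = BRUIN_RATE / RENDER_RATE

print(f"Implied usage: {usage:.0f} GB-hours/month")
print(f"Same usage on Bruin Cloud: ${bruin_monthly:.0f}/month")
print(f"Price ratio: {price_ratio:.0f}x")
```

So under those assumptions, the same usage that costs about $10/month on Render would cost about $1,000/month on Bruin Cloud.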

Other tools also seem to start at $100/month or more. I want fully managed devops. Am I overlooking anything?

Scale: 100 tables, 250 GB total, largest table 250 million rows, daily ingestion of 10k rows of JSON API, HTML scraping, XML download from SFTP, downloading PDFs and parsing with OCR, enriching with AI, ...

reddit.com
u/vroemboem — 21 days ago

I have very limited devops experience, but due to the recent price increase I would like to self-host Dagster (e.g. on Railway or a VPS). What's the best way to go about this? Is there anything I need to be aware of?

reddit.com
u/vroemboem — 25 days ago