u/vroemboem

r/dataengineering r/webdev r/GoogleMyBusiness r/databricks r/DuckDB r/webscraping r/searchengines r/Agent_AI r/WebScrapingInsider

▲ 2 r/databricks

How to choose compute

I'm new to Databricks and am building a data pipeline. So far I've always used serverless compute. Is that the best approach? One step in my pipeline parses a 30 GB XML file, which takes 1.5 hours on serverless compute. Should I consider different compute? If so, what should I pick?
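For reference, a minimal sketch of what a streaming parse of such a file could look like in plain Python, so that memory use does not scale with file size. The element name `item` and its fields are hypothetical; the post does not describe the real file's structure, and this says nothing about how the parse is actually done in the pipeline.

```python
import io
import xml.etree.ElementTree as ET


def iter_records(source, record_tag):
    """Yield one dict per <record_tag> element without building the full tree."""
    # iterparse streams the document; "end" fires once an element is fully read.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == record_tag:
            yield {child.tag: (child.text or "").strip() for child in elem}
            # Drop the element's children and text so finished records
            # do not pile up in memory (the emptied element itself remains).
            elem.clear()


# Tiny in-memory stand-in for the real 30 GB file (hypothetical structure).
sample = io.BytesIO(
    b"<items><item><id>1</id><name>a</name></item>"
    b"<item><id>2</id><name>b</name></item></items>"
)
records = list(iter_records(sample, "item"))
print(records)
```

With a real file, `source` would be a path or an open binary file handle instead of the `BytesIO` sample.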

u/vroemboem — 1 day ago

▲ 12 r/databricks

Databricks Genie vs Claude Code vs OpenAI Codex

In terms of value, which of these coding plans offers the most bang for the buck?

Genie is currently free; I'm asking about after July 6, when it becomes paid.

u/vroemboem — 3 days ago

I need my non-tech cofounder to manage the marketing site of our SaaS

I made an Astro marketing site for our SaaS. For me as a technical person, Astro works very well. However, every change to the website has to go through me, which is very inefficient. I would like marketing, not IT, to own the marketing site. How can I do this?

We need to support 10+ languages.

I looked at Storyblok, but that seems crazy expensive.

What other options can you recommend: Webflow, Framer, Wix, Sanity, ...?

u/vroemboem — 7 days ago

▲ 15 r/databricks

Medallion architecture in practice?

This is how I understand the theory. Bronze = raw data from source system with some additional metadata, Silver = Cleaned up source data, Gold = Aggregated data across sources ready for consumption.

I have various sources: an API, XML on FTP, and scraped HTML. I store this as is in object storage. I guess we can just call this the raw layer and not part of the medallion architecture? I then ingest this as is, as Parquet, into the data lakehouse as the bronze layer. Do I just have one schema called bronze, or do I have multiple bronze schemas, one for each source?

I then clean this up in the silver layer. For my data I need to perform entity resolution. That means I want to detect whether the same person appears across my sources, but I don't have a nice ID. Instead, I have something like a name and address with potential typos. As I understand it, silver should be a representation of the source. Should entity resolution then be a separate step or part of gold? Do I also have a separate silver schema for each source?
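To make the entity resolution problem concrete, here is a minimal sketch of fuzzy matching on name and address using only the standard library. The record layout, the sample records, and the 0.85 threshold are all made up for illustration; this is not the matching logic actually used, and a real pipeline would likely need blocking to avoid comparing every pair of records.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(text.lower().split())


def similarity(a, b):
    """Return a ratio in [0, 1]; 1.0 means the normalized strings are identical."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def is_same_person(rec_a, rec_b, threshold=0.85):
    """Treat two records as one person only if name AND address are both close."""
    name_score = similarity(rec_a["name"], rec_b["name"])
    address_score = similarity(rec_a["address"], rec_b["address"])
    return min(name_score, address_score) >= threshold


# Made-up sample records: b is a with a typo and extra whitespace.
a = {"name": "Jan Janssens", "address": "Kerkstraat 12, Gent"}
b = {"name": "Jan Jansens", "address": "Kerkstraat 12,  Gent"}
c = {"name": "Piet Peeters", "address": "Stationsplein 3, Leuven"}

print(is_same_person(a, b), is_same_person(a, c))
```

Keeping the individual scores, rather than only the final yes/no, is what would later make it possible to explain why two records were linked.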

In gold I integrate the source data into a source of truth. However, in my case the data consumption happens through an API, and this source of truth is not ready for consumption by that API. So I need to build separate marts that make it easy to search on certain fields or return a single entity for a detail view. Is this then also part of the gold layer, or do I have an additional mart layer?

Here are my goals: Be able to trace each field back to the origin data. Be able to understand why entities were resolved, with the option of a manual override. Have a clean dataset for each source. Have an integrated source of truth everyone builds on. Make it really easy to query the data for search and detail views through an API.

How would you structure this in a medallion architecture? Or is that not the right fit for my requirements?
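The traceability and override goals above could be captured in the data model itself. A minimal sketch, assuming every stored value carries a pointer to its raw file and every match decision is stored as its own row; all class names, field names, and the 0.85 threshold are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FieldValue:
    """A single value plus where it came from."""
    value: str
    source: str    # e.g. "api", "ftp_xml", "scraped_html"
    raw_path: str  # object-storage key of the raw file (hypothetical layout)


@dataclass(frozen=True)
class MatchDecision:
    """Why two source records were (or were not) resolved to one entity."""
    left_id: str
    right_id: str
    score: float                     # similarity produced by the matcher
    rule: str                        # which rule or model produced the score
    override: Optional[bool] = None  # manual decision; None means not reviewed

    def is_match(self, threshold: float = 0.85) -> bool:
        # A manual override always wins over the automatic score.
        if self.override is not None:
            return self.override
        return self.score >= threshold


auto = MatchDecision("api:1", "xml:7", score=0.93, rule="name+address")
rejected = MatchDecision("api:1", "xml:7", score=0.93, rule="name+address", override=False)
forced = MatchDecision("api:2", "html:4", score=0.40, rule="name+address", override=True)
print(auto.is_match(), rejected.is_match(), forced.is_match())
```

Storing decisions separately from the resolved entities means an override can be applied by re-running the integration step, without editing the source-aligned tables.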

u/vroemboem — 9 days ago

▲ 1 r/databricks

Web scraping with Databricks

I need to process a lot of web scraped data into a data lakehouse.

In the past I've used tools such as Apify, Crawlee, and Scrapy for this scraping.

I like Databricks for the unified platform it gives me to orchestrate ETL pipelines.

Is it a good idea to perform web scraping within Databricks? If so, what's the best approach? Or would it be better to do this outside the Databricks platform? In that case, how would I best orchestrate things?

u/vroemboem — 13 days ago

▲ 6 r/WebScrapingInsider

How to scale to 100s of parallel scrapers?

I'm pretty good at scraping, but now I need to scale up. I need to scrape 10 million pages. How can I scale this so that it completes in a couple of hours? How have you tackled this, on both the compute side and the storage side?
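As a back-of-the-envelope check on what that target implies, here is a small calculation. It assumes "a couple of hours" means 2 hours and an average of 1.5 seconds per page fetch; both numbers are assumptions for illustration, not measurements.

```python
import math


def required_rate(total_pages, hours):
    """Pages per second needed to finish within the time budget."""
    return total_pages / (hours * 3600)


def concurrent_requests_needed(total_pages, hours, seconds_per_page):
    """Little's law: in-flight requests = throughput * latency per request."""
    return math.ceil(required_rate(total_pages, hours) * seconds_per_page)


rate = required_rate(10_000_000, 2)
in_flight = concurrent_requests_needed(10_000_000, 2, 1.5)
print(f"{rate:.1f} pages/s, {in_flight} requests in flight")
```

Under these assumptions the job needs roughly 1,400 pages per second sustained, so the number of simultaneous requests ends up in the thousands rather than the hundreds.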

u/vroemboem — 17 days ago

▲ 7 r/webscraping

How to scale up to 100s of parallel scrapers?

I'm pretty good at scraping, but now I need to scale up. I need to scrape 10 million pages. How can I scale this so that it completes in a couple of hours? How have you tackled this, on both the compute side and the storage side?

u/vroemboem — 17 days ago

▲ 10 r/DuckDB

How to speed up DuckLake?

I have a DuckLake on Backblaze B2 and a PostgreSQL data catalog on a Contabo VPS. A cold query takes up to 90 seconds. A table consists of at most 200 files. Is this normal, or are there any tips for speeding things up?

u/vroemboem — 26 days ago

▲ 2 r/Agent_AI

Cheap Chinese API providers for GPT and Claude

I found this website, https://lmspeed.net/, which lists API pricing for popular LLMs such as GPT 5.4 and Sonnet 4.6. It mostly has Chinese providers, such as https://api.pie-xian.com/, that are a lot cheaper. Does anyone know what's going on? Is this legit?

u/vroemboem — 1 month ago

▲ 0 r/dataengineering

Fly.io for ETL pipeline compute

Has anyone used fly.io machines for ETL pipeline compute? What was your experience? Seems like a cheap serverless solution for bursty workloads.

u/vroemboem — 1 month ago

▲ 1 r/WebScrapingInsider

How to resolve LinkedIn company ID to slug?

linkedin.com/company/1035 redirects to linkedin.com/company/microsoft

This redirect happens behind a login. Is there any way of figuring out a company's slug from its ID without needing a login?

u/vroemboem — 2 months ago

▲ 1 r/GoogleMyBusiness

Where do social media links show?

You can add your social media links to your Google Business Profile. Where do these actually show up?

I see them at the bottom of the search knowledge graph. Is there any other place?

u/vroemboem — 2 months ago

▲ 2 r/searchengines

How to set dogpile.com search location?

The location where the dogpile.com search happens seems to be based solely on the IP address. Is it possible to change this location? E.g., I'm in the US but I want to perform a search as if I were in France. Is this possible without changing my IP address?

u/vroemboem — 2 months ago

▲ 8 r/webscraping

Data pipeline and storage after scraping

The web scraping part I have covered. I'm scraping multiple sources using Crawlee. Total data size is 200 GB. Every day I'm fetching new records.

I fetch raw HTML, which I store in S3 object storage. I then turn this into Parquet and clean up the data using DuckDB. Previously I also used PostgreSQL, but it used a lot of RAM, mostly due to entity resolution. For example, if I find two addresses that are the same I want to link them, but sometimes there are typos. The same goes for people.

The goal is to combine these sources into a data warehouse where I can build data marts on top to serve APIs.

What does your process look like after the data is scraped? How do you store it? Where do you store it? How do you combine sources? How do you monitor your scrapers and pipelines?

u/vroemboem — 2 months ago

▲ 3 r/webscraping

Has anyone succeeded in scraping Google search results without using browser automation?

I believe you first need to get cookie consent and then send a request to /search?filter=0&gbv=1 with a working user agent. However, none of the user agents I've tried work.

I always get a captcha page.

Any pointers?

For anyone just looking for search results instead of Google search results, Yahoo, DuckDuckGo and Startpage are quite easy to scrape.

u/vroemboem — 2 months ago

▲ 3 r/dataengineering

With MotherDuck you pay for compute instances, but nowhere is it listed what RAM and CPU those provide. Does anyone know?

u/vroemboem — 2 months ago

▲ 11 r/dataengineering

Render Workflows is very nice, you just pay for compute, which is like $0.10 per GB-hour. It's not really focused on data engineering. For my project that would be like $10/month.

I would really like to use Dagster+ serverless, but I'll quickly be above $100/month (mostly due to materialization credits, compute seems reasonably priced). That's a big cost for a bootstrapped startup just to gain some pipeline observability.

Bruin Cloud also seems really nice and would be a perfect use case fit, but they charge $10.00 per GB-hour. That's 100 times more than Render.
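To spell out that comparison, here is the arithmetic using only the figures quoted above. It assumes the roughly $10/month Render estimate is exact, that usage would be the same on both platforms, and that a GB-hour means the same thing in both price lists.

```python
import math

RENDER_USD_PER_GB_HOUR = 0.10  # quoted above
BRUIN_USD_PER_GB_HOUR = 10.00  # quoted above
RENDER_MONTHLY_USD = 10.0      # the rough estimate above, taken as exact

# Usage implied by the Render estimate.
gb_hours_per_month = RENDER_MONTHLY_USD / RENDER_USD_PER_GB_HOUR

# The same usage priced at the Bruin Cloud rate.
bruin_monthly_usd = gb_hours_per_month * BRUIN_USD_PER_GB_HOUR

print(f"{gb_hours_per_month:.0f} GB-hours/month -> ${bruin_monthly_usd:,.0f}/month on Bruin Cloud")
```

Under those assumptions the same workload would come to about $1,000/month, which is the 100x gap described above.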

Other tools also seem to start at $100/month or more. I want fully managed devops. Am I overlooking anything?

Scale: 100 tables, 250 GB total, largest table 250 million rows, daily ingestion of 10k rows of JSON API, HTML scraping, XML download from SFTP, downloading PDFs and parsing with OCR, enriching with AI, ...

u/vroemboem — 2 months ago

▲ 13 r/dataengineering

I have very limited devops experience, but due to the recent price increase I would like to self host Dagster (like on Railway or VPS). What's the best way to go about this? Anything I need to be aware of?

u/vroemboem — 2 months ago