How to resolve LinkedIn company ID to slug?
linkedin.com/company/1035 redirects to linkedin.com/company/microsoft
This redirect happens behind login. Is there any way of figuring out a company's slug from its ID without needing a login?
You can add your social media links to your Google Business Profile. Where do these actually show up?
I see them at the bottom of the search knowledge graph. Is there any other place?
The location that dogpile.com uses for a search seems to be based solely on the IP address. Is it possible to change this location? E.g. I'm in the US but I want to perform a search as if I were in France. Is this possible without changing my IP address?
The web scraping part I have covered. I'm scraping multiple sources using Crawlee. Total data size is 200 GB. Every day I'm fetching new records.
I fetch raw HTML, which I store in S3 object storage. I then turn this into Parquet and clean up the data using DuckDB. Previously I also used PostgreSQL, but RAM usage was high, mostly due to entity resolution. For example, if I find two addresses that are the same I want to link them, but sometimes there are typos. The same goes for people.
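To make the entity resolution problem concrete, here is a minimal sketch of typo-tolerant address linking. It is not what I run in production: the function names, the 0.9 threshold and the sample addresses are made up for illustration, and it uses Python's standard `difflib` rather than DuckDB or PostgreSQL. It compares every pair of records, so at real scale it would need some blocking step (e.g. only comparing records that share a postcode).

```python
from difflib import SequenceMatcher


def normalize(address: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    kept = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in address.lower())
    return " ".join(kept.split())


def similar(a: str, b: str, threshold: float = 0.9) -> bool:
    """True when the normalized strings are at least `threshold` similar."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold


def link_records(addresses: list[str], threshold: float = 0.9) -> list[int]:
    """Return one cluster label per address; similar addresses share a label."""
    parent = list(range(len(addresses)))  # union-find forest

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    # Naive pairwise comparison: O(n^2), fine for a demo only.
    for i in range(len(addresses)):
        for j in range(i + 1, len(addresses)):
            if similar(addresses[i], addresses[j], threshold):
                parent[find(j)] = find(i)

    return [find(i) for i in range(len(addresses))]


# Hypothetical sample data: the second address has a typo in "Street".
sample = [
    "12 Main Street, Springfield",
    "12 Main Stret, Springfield",
    "99 Ocean Avenue, Miami",
]
labels = link_records(sample)
```

With this sample, the first two addresses end up with the same label and the third gets its own. The threshold is the main knob: lower values link more aggressively and risk merging distinct entities.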
The goal is to combine these sources into a data warehouse where I can build data marts on top to serve APIs.
What does your process look like after the data is scraped? How do you store it? Where do you store it? How do you combine sources? How do you monitor your scrapers and pipelines?
Has anyone succeeded in scraping Google search results without using browser automation?
I believe you first need to get cookie consent and then send a request to /search?filter=0&gbv=1 with a working user agent. However, none of the user agents I've tried work.
I always get a captcha page.
Any pointers?
For anyone who just needs search results and not Google results specifically: Yahoo, DuckDuckGo and Startpage are quite easy to scrape.
With MotherDuck you pay for compute instances, but nowhere is it listed what RAM and CPU those provide. Does anyone know?
Render Workflows is very nice: you just pay for compute, which is about $0.10 per GB-hour. For my project that would be about $10/month. It's not really focused on data engineering, though.
I would really like to use Dagster+ serverless, but I'll quickly be above $100/month (mostly due to materialization credits, compute seems reasonably priced). That's a big cost for a bootstrapped startup just to gain some pipeline observability.
Bruin Cloud also seems really nice and would be a perfect use case fit, but they charge $10.00 per GB-hour. That's 100 times more than Render.
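A quick back-of-the-envelope check of that comparison, using only the prices quoted above and assuming my workload would consume the same number of GB-hours on both platforms (that is an assumption, the variable names are just for illustration):

```python
# Prices as quoted in this post, in USD per GB-hour.
RENDER_USD_PER_GB_HOUR = 0.10
BRUIN_USD_PER_GB_HOUR = 10.00

# My estimated monthly bill on Render, from which usage is derived.
RENDER_MONTHLY_USD = 10.00


def implied_gb_hours(monthly_usd: float, usd_per_gb_hour: float) -> float:
    """GB-hours per month implied by a monthly bill at a given rate."""
    return monthly_usd / usd_per_gb_hour


gb_hours = implied_gb_hours(RENDER_MONTHLY_USD, RENDER_USD_PER_GB_HOUR)

# Assumption: identical GB-hour usage on Bruin Cloud.
bruin_monthly_usd = gb_hours * BRUIN_USD_PER_GB_HOUR
price_ratio = BRUIN_USD_PER_GB_HOUR / RENDER_USD_PER_GB_HOUR

print(f"Implied usage: {gb_hours:.0f} GB-hours/month")
print(f"Same usage on Bruin Cloud: ${bruin_monthly_usd:.0f}/month")
print(f"Price ratio: {price_ratio:.0f}x")
```

So roughly 100 GB-hours per month, which would come to about $1,000/month at the Bruin rate.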
Other tools also seem to start at $100/month or more. I want fully managed devops. Am I overlooking anything?
Scale: 100 tables, 250 GB total, largest table 250 million rows, daily ingestion of 10k rows of JSON API, HTML scraping, XML download from SFTP, downloading PDFs and parsing with OCR, enriching with AI, ...
I have very limited devops experience, but due to the recent price increase I would like to self-host Dagster (for example on Railway or a VPS). What's the best way to go about this? Is there anything I need to be aware of?