r/dataengineering

Orchestration platform that doesn't force everyone to learn Python?

Our data team runs Airflow, but the infra and backend engineers refuse to touch it: they don't want to learn a Python SDK just to schedule a shell script or trigger a Terraform plan.

I'm looking for something where the whole team can contribute without a language barrier. Ideally declarative (YAML or similar), self-hosted, with built-in scheduling and a decent plugin ecosystem. Has anyone found something that works across data + infra teams?

reddit.com
u/PSGCampus — 12 hours ago
▲ 14 r/dataengineering+1 crossposts

Cheapest possible full analytics stack?

Hello! I am a relatively experienced analytics engineer and I have a rough idea of the price range of the architecture I am suggesting, but I want to know your take!

The exercise here is to take a business setting and come up with the cheapest possible production-ready set of tools to run it.

Imagine a traditional wholesale company in the fashion goods industry. 2 warehouses (physical, not data warehouses), around 3000 incoming orders per month, 30000 outgoing. Data sources are mainly ERP, provider offers, ticketing system for client complaints, CRM, some supply chain data like delivery times, wayslips...

So the goal here is to have a star schema with all the data needed to understand the business. Nothing fancy, no ML, no AI. Just a good data warehouse, reporting built on top.

The condition is to centralise all data, have full analytics visibility, and use only cloud resources (all company systems are in the cloud).

So my question is: with the existing data tools (ETL, visualisation...) and without ever running stuff locally (so a notebook with hardcoded API keys does not count), what is the cheapest you could run the analytics stack for this company (excluding headcount)?

PS: I now see this question could seem like I am looking to buy tooling. I am not; this is purely hypothetical.

reddit.com
u/tomtombow — 16 hours ago

Having trouble with Airflow.

Hey guys. Most of our stuff ran in cron before, and I decided to make things more reliable, so I set up self-hosted Airflow in Docker. But it's been quite a pain. It keeps silently getting stuck every few days, for a different random reason every time.

I was using the external Python operator before, inside the same Docker container as the scheduler. But then it got stuck in hangups, and I thought that was the issue, so I went for a fancier setup with 4-5 separate containers (Celery, Redis, scheduler, etc.). And even today it got stuck on one job randomly. I was on Airflow 3.0.0 before, though we upgraded to 3.2.x or something today to see if that helps. But it's been a bit of a fight, and I am starting to get a bit tired.

I had hoped that, it being the industry standard and all, it would be super smooth and perfect, but it's been a bit of a pain in the ass. I am not sure if Airflow itself is at fault or if I am doing something wrong. I am not an Airflow expert and I am working with AI on it, so I might be missing something. But it has not been a smooth experience, and I am considering just going back to cron, or potentially Dagster. Let me know what you guys think. Maybe a managed solution is better, but I would like something we can stay on the free tier of, as it's a pretty shit, dumb job with low reliability requirements that cron can almost take over with 0 reliability issues.

Let me know what you guys suggest and if I am doing something wrong. Thanks 🙏🏻

reddit.com
u/Consistent_Tutor_597 — 13 hours ago

OpenMetadata and Airflow

Hi guys,

I’m trying to integrate Airflow with OpenMetadata. Is there an easy or recommended way to do this?

I already tried using the OpenMetadata backend lineage integration, but I ran into dependency hell and it doesn’t really suit my setup.

Now I’m trying to integrate through OpenLineage, but OpenMetadata still doesn’t seem to properly accept or parse messages from Kafka. The events appear in the OM UI, but it looks like OpenMetadata doesn’t actually process them correctly.

Ideally, the Airflow version should be 2.10.5 or newer, and upgrading is not a problem if needed.

Has anyone successfully configured this setup or faced similar issues?

reddit.com
u/Successful-Gap8537 — 12 hours ago

Dimensional Modeling: Handling mixed granularity and broken hierarchies between Ad Platforms and Web Analytics (GA4)

Hi everyone,

I’m currently building a Data Warehouse (PostgreSQL) to consolidate marketing data, and I'm facing an architectural dilemma regarding dimensional hierarchies.

The Setup:

I’m extracting performance data from Google Ads and Meta Ads. I built a snowflake-style schema with strict 1:N relationships to enforce data integrity:

dim_ad_group (N:1) -> dim_campaign (N:1) -> dim_channel

For the ad platforms, this strict hierarchy works perfectly. A specific Ad Group belongs to exactly one Campaign, and a Campaign belongs to exactly one Channel (e.g., "Paid Social" or "Paid Search").

The Problem:

I am now integrating Google Analytics (GA4) traffic data into a new fact table (fact_web_traffic). GA4 data introduces mixed granularity and missing attributes. A lot of traffic comes in as (not set) for Ad Groups or Campaigns (e.g., Organic Search, Direct, Email, or Performance Max campaigns).

My dilemma with the solutions:

Using NULLs in the Fact Table: I could leave the campaign_id and ad_group_id as NULL in the fact table for non-paid traffic. However, this feels unprofessional.

Using a Default "Dummy" Member (e.g., ID = -1): If I create a single (not set) dummy record in dim_campaign, I break the 1:N hierarchy because that single dummy campaign would need to map to multiple channels (Organic, Direct, Email) simultaneously, which my schema doesn't allow.

What is the industry standard / best practice to resolve this?

Should I generate multiple dummy records (one for each non-paid channel)? Or is there a completely different design pattern for merging strict Ad hierarchies with fluid Web Analytics data?
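For illustration, the "one dummy record per non-paid channel" option could look roughly like the sketch below. It uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, and the column names, negative surrogate keys, and channel list are assumptions made for the example, not details of the actual warehouse.

```python
import sqlite3

# In-memory SQLite database standing in for the PostgreSQL warehouse (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE dim_channel (
    channel_id   INTEGER PRIMARY KEY,
    channel_name TEXT NOT NULL UNIQUE
);
CREATE TABLE dim_campaign (
    campaign_id   INTEGER PRIMARY KEY,
    campaign_name TEXT NOT NULL,
    channel_id    INTEGER NOT NULL REFERENCES dim_channel(channel_id)
);
CREATE TABLE dim_ad_group (
    ad_group_id   INTEGER PRIMARY KEY,
    ad_group_name TEXT NOT NULL,
    campaign_id   INTEGER NOT NULL REFERENCES dim_campaign(campaign_id)
);
""")

# Hypothetical channel list; the non-paid ones have no real campaigns.
channels = ["Paid Search", "Paid Social", "Organic Search", "Direct", "Email"]
non_paid = ["Organic Search", "Direct", "Email"]

for channel_id, name in enumerate(channels, start=1):
    conn.execute("INSERT INTO dim_channel VALUES (?, ?)", (channel_id, name))

# One "(not set)" campaign and ad group per non-paid channel.
# Negative keys keep placeholder members apart from real platform IDs.
for n, name in enumerate(non_paid, start=1):
    channel_id = channels.index(name) + 1
    conn.execute("INSERT INTO dim_campaign VALUES (?, ?, ?)",
                 (-n, "(not set)", channel_id))
    conn.execute("INSERT INTO dim_ad_group VALUES (?, ?, ?)",
                 (-n, "(not set)", -n))

# Every placeholder ad group still rolls up to exactly one campaign and channel.
rows = conn.execute("""
SELECT ch.channel_name, c.campaign_id, g.ad_group_id
FROM dim_ad_group g
JOIN dim_campaign c ON c.campaign_id = g.campaign_id
JOIN dim_channel ch ON ch.channel_id = c.channel_id
ORDER BY c.campaign_id DESC
""").fetchall()
print(rows)
```

With this layout fact_web_traffic would never hold a NULL key: an Organic Search session would point at the Organic Search placeholder ad group, and the N:1 chain up to dim_channel stays intact.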

Thanks in advance!

reddit.com
u/Think-Strain-6274 — 16 hours ago

Cost Effective Data Platforms

Hey all,

We've got a greenfield project and are on the hunt for a cost-effective data platform.

I am interested in your insights on the cost of modern data platforms.

The capability to easily handle and deploy streaming ingress and egress use cases is non-negotiable. So is the ease of building an architecture that meets ultra-low-latency requirements.

What are your thoughts on this?

reddit.com
u/Zealousideal_Bed7898 — 14 hours ago

Does anyone actually enjoy web database IDEs?

If you do, tell me why. And is it because you’ve never been accustomed to using a desktop IDE in the first place?

If you hate these web IDEs like I do, and you stopped using the web IDE altogether, tell me what type of db you’re working in and what desktop app you use instead.

reddit.com
u/StarWars_and_SNL — 21 hours ago

Where do you draw the line between Analytics Engineer and Analyst responsibilities?

I’m the solo Analytics Engineer on my team, and we have a few Data Analysts. We don’t have a DE, so I also do pipelines and ingestion. Right now, the lines between our jobs sometimes feel really blurry.
For example, the analysts build a lot of our dbt models and make changes to them. I know our roles naturally overlap, but I feel like we are missing clear ownership of who does what. Since they are not so technical and lack the engineering mindset, the models can quickly turn into spaghetti and drift away from best practices. I want to empower them, but I also want to make sure our architecture stays clean and that I'm actually doing AE work, not just acting as a code reviewer.

For those of you on similar teams, how do you split the work? Do you have a clear division of who does what?

I would love to hear what works for your team. Thanks!

reddit.com
u/Rajsuomi — 20 hours ago

Feeling like I can't get a job as a data engineer

So, for about 3 months now I have been learning Azure Data Engineering. I can do some ETL with ADF, write basic ETL code in PySpark, and I understand SQL, data warehousing, schemas, Medallion architecture and some cool stuff within the Azure data stack.

But lately I have been having this fear that I won't be able to land a job as an Azure Data Engineer, because each time I turn to LinkedIn I see someone with 3 or more years of experience with the open to work flag on their profile (even with several certificates). This makes me feel like there isn't any place for me.

Due to this feeling, I am considering taking a course in Health and Safety and just leaving the whole tech stuff.

Please, I need help. What do I do? I'm based in the UK.

reddit.com
u/ezeamaka2 — 24 hours ago

DE feels like a dead end beyond 4 years at the same company

Been working at the same company for over 4 years and I can see there is no more new work coming in. There are the usual small requirements that come in every now and then but beyond that the project is pretty stale.

The pipelines are fully automated, optimized and pretty much in a self-healing mode that requires minimal human intervention. I like what I do, but having worked with the same tech stack for so long, I'm now feeling stuck. We use multiple services that are stitched together to make the whole pipeline work.

I have tried applying outside and I realize the market is bad, but I'm getting rejected only because I haven’t worked on Databricks/Snowflake, even though these tools are far easier to learn and implement compared to what I'm doing now. I have tried explaining to recruiters how my experience relates to these tools, but all they seem to care about is seeing these words/tools on my profile.

Anyone in the same boat, or have any advice on how to handle these situations? As a last resort, I'm considering adding these tools as part of my projects even though we don't use them.

reddit.com
u/Ok_Illustrator_816 — 1 day ago
▲ 18 r/dataengineering+1 crossposts

Open source data governance compiler for PostgreSQL

I never thought of data governance as a sexy topic. My main focus has always been on performance, insights, cost reduction.

That is, until I joined a startup as the sole data engineer. Dealing with tons of PII/PHI, I realized just how much effort it was to write all these custom tools to handle everything: infinite GRANTs, trigger functions for versioning, cron jobs for retention - and it all needed so much attention and maintenance. Or I could go with an off-the-shelf product that's a complete black-box with a learning curve.

Always one to prefer spending 10x longer automating the task than just doing it, I built a CLI tool that lets users write their DB/governance specs as declarative YAML files, and writes all the SQL code for you. And it's open source, fully transparent, as secure as I could conceive of making it, and hopefully super user-friendly too.
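To give a feel for the general idea of compiling a declarative spec into SQL, here is a minimal sketch. It is not the tool's actual spec format or output: the nested dict (standing in for a parsed YAML file), the table and role names, and the `render_grants` helper are all hypothetical.

```python
# Privileges this sketch is willing to emit; anything else is rejected.
ALLOWED_PRIVILEGES = {"SELECT", "INSERT", "UPDATE", "DELETE"}


def quote_ident(name):
    """Double-quote an identifier, escaping embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def render_grants(spec):
    """Turn {table: {role: [privileges]}} into a list of GRANT statements."""
    statements = []
    for table, roles in sorted(spec.items()):
        for role, privileges in sorted(roles.items()):
            unknown = set(privileges) - ALLOWED_PRIVILEGES
            if unknown:
                raise ValueError(f"unsupported privileges: {sorted(unknown)}")
            privs = ", ".join(sorted(privileges))
            statements.append(
                f"GRANT {privs} ON TABLE {quote_ident(table)} "
                f"TO {quote_ident(role)};"
            )
    return statements


# Hypothetical spec, as it might look after loading a YAML file.
spec = {
    "patients": {
        "analyst": ["SELECT"],
        "app_writer": ["SELECT", "INSERT"],
    }
}

statements = render_grants(spec)
for statement in statements:
    print(statement)
```

Sorting tables, roles, and privileges keeps the generated SQL deterministic, so diffs between two runs only show real changes to the spec.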

I've linked the first release in my repo. Anyone want to try it out?

In the interest of transparency: I did code this with assistance from Claude, but I've been in data engineering for almost 20 years and manually debugged every line. I also got it to build me a suite of over 300 tests that run through GitHub Actions automatically on each commit.

github.com
u/River_Bass — 1 day ago

VP told me to 'just use Cowork' to fix years of data chaos in a month. I am losing my mind.

Hi everyone, not sure if this is the right place, but I just need to vent and get some outside perspective.

I work at a large conglomerate that spans multiple domains. I'm a data engineer and de facto team lead of a small team of one data analyst, one software engineer, and me. We usually handle POC projects, performance analysis, and process improvement for a consumer-facing product division and the company's manufacturing operations.

Following an org restructure earlier this year, our team was reassigned to support the R&D department of a specialized industrial materials division. At the same time, a company-wide mandate came down requiring each sector to generate a defined amount of AI-driven revenue per year through cost savings, new products, or time savings from AI usage. This landed on our team as "find ways to use AI to help researchers do R&D faster and more efficiently."

I started by doing some preliminary interviews about the current R&D workflow. Each researcher or small team owns a single research domain. They design an experiment, create a work order in Excel (containing a work ID, associated sample IDs, and tests needed per sample), then send the work order to multiple labs for testing. The problem is there is almost no data or knowledge management system in place.

The work IDs and sample IDs are created by each researcher with no naming standard. Sample IDs are often duplicated across experiments. Two of the labs generate their own internal IDs when they receive the work order, fill out their test forms, and send results back. A third lab requires the researcher to manually create test tasks in a web application with no linkage back to the original work order. There is no standardization of data schema, naming conventions, or terminology across any of it. Most records are Excel files, but some exist only as emails or chat thread replies. If you want to trace an experiment from the original work paper (named '22032026_work_paper_exp1'; yeah, the file name is the work_id for this researcher....) to lab 1 results (named '26M0321') to lab 2 results (named '26C0926') to lab 3 results (named '26AS0265436'), you need to open each file, extract the sample IDs, and match them together, and it is even possible that a sample does not have tests from all 3 labs. In that case you need to match on both sample ID and closest date, as the same sample ID can appear in different experiments (and thus different work papers).
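The matching step described above (same sample ID, closest date) can be sketched roughly as follows. The record layout, the field names, the sample IDs, the dates, the second work ID and lab ID, and the 30-day cutoff are all made up for the example; only '22032026_work_paper_exp1' and '26M0321' are names from the actual files.

```python
from datetime import date


def match_lab_results(work_samples, lab_results, max_days=30):
    """Map each (work_id, sample_id) to the lab record that has the same
    sample_id and the closest date, or None if nothing is close enough."""
    matches = {}
    for sample in work_samples:
        candidates = [
            r for r in lab_results if r["sample_id"] == sample["sample_id"]
        ]
        # Closest date wins, because sample IDs repeat across experiments.
        best = min(
            candidates,
            key=lambda r: abs((r["date"] - sample["date"]).days),
            default=None,
        )
        if best is not None and abs((best["date"] - sample["date"]).days) > max_days:
            best = None  # too far apart to trust the link
        key = (sample["work_id"], sample["sample_id"])
        matches[key] = best["lab_id"] if best else None
    return matches


# Hypothetical records as they might look after parsing the Excel files.
work_samples = [
    {"work_id": "22032026_work_paper_exp1", "sample_id": "S-01", "date": date(2026, 3, 22)},
    {"work_id": "hypothetical_other_exp", "sample_id": "S-01", "date": date(2026, 1, 10)},
    {"work_id": "22032026_work_paper_exp1", "sample_id": "S-02", "date": date(2026, 3, 22)},
]
lab_results = [
    {"lab_id": "26M0321", "sample_id": "S-01", "date": date(2026, 3, 25)},
    {"lab_id": "26M0007", "sample_id": "S-01", "date": date(2026, 1, 12)},
]

matches = match_lab_results(work_samples, lab_results)
for key, lab_id in matches.items():
    print(key, "->", lab_id)
```

A sample with no lab record (S-02 here) maps to None rather than being forced onto the nearest unrelated result, which at least makes the gaps visible.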

It is an absolute mess.

To make things worse, about two months before my team got involved, the department had already engaged an external AI company to build prediction and optimization models for their core research workflows. The AI company's first ask was "send us the past year of research data so we can start training the models". That's when everything unravelled. The department couldn't produce a single clean dataset. They scrambled to manually piece something together and ended up with 48 rows of experiment data for one research domain and 147 rows for another, even though our company has been in this domain for a really, really long time. For anyone who doesn't know, you typically need thousands of clean, structured records minimum to train a model that's worth anything (at least try to get them hundreds of data points, damnit). What they handed over was essentially unusable. The external engagement is now stalled.

That context explains a lot about what happened next. After my preliminary investigation I met with the VP of the R&D department, presented the findings, and proposed a ground-up digital transformation (minimum 3 to 4 months). He stopped me at "3 to 4 months," told me to just find AI tools to ingest the legacy data and build a database from it, and said we could "talk about transformation later." He wanted something done within a month. Then he asked: "Have you ever heard of Claude Cowork? Just use Cowork, it should be really easy." I walked out completely drained.

My direct manager told me to try to accommodate the VP's request. We've just come under his department and the political reality is that the AI mandate created pressure to show something quickly even though this R&D function has been a core domain of the company for a long time with no data infrastructure to show for it. The external AI engagement presumably isn't cheap either, and right now it's going nowhere.

So here I am two weeks later, sifting through a complete mess of reports, Excel files, and PDFs. I can probably build file parser heuristics for one researcher's output, maybe a team's, but doing it for every researcher, knowing it's just a band-aid that solves nothing structurally, feels like an enormous waste of everyone's time, including mine. And even if I somehow pull it off, the data coming out the other end still won't be clean or consistent enough to unblock the external AI company.

Has anyone been in a similar situation? How did you handle the gap between what leadership wants to hear and what actually needs to happen?

PS. Sorry for the long post....I really need to vent a bit.

PS2. I really did try to persuade them to pursue ground-up transformation first, and explained why trying to piece the legacy data together is not a sustainable solution and a waste of everyone's resources (you can imagine how inefficient this is if the researchers themselves could only scrape together ~200 rows of experiment data over 2 months).

reddit.com
u/Acinac — 1 day ago

Feedback DE

I am a DE with 4 years of experience, working at a top MNC in India.

People (other engineers and leadership) don't respect DE work in my company. Backend engineers and MLEs are generally considered superior.

We are often treated as analysts or non-engineering folks.

Is this the same at other companies as well? Which companies in India offer challenging DE work (and give DEs respect and acknowledgement)?

reddit.com
u/Stock_Wallaby9748 — 1 day ago

How to run a 5-minute script online every day

I want to run a scraper and save some data. I don't want to set up a Raspberry Pi.

Are there any free servers that can be used for this, or are there servers that offer a limited number of free tokens?

reddit.com
u/Lord_Home — 1 day ago

Laid off a week ago, am I screwed?

I've been in the business for a couple of years now, and my latest job was a big upgrade. I learned a ton, and I was doing pretty damn well for myself, but I ended up getting laid off through unfortunate circumstances.

I was only able to work there for 9-ish months, and it's just now hitting me how fucked I feel. I've been applying like crazy, but I'm terrified I won't get hired. I'm just constantly applying to everything I see on LinkedIn. I feel relatively experienced now, but I feel like I just lucked out and won't get another job. I absolutely loved my job, and now it's gone.

I guess I'm just posting here because I'm sad and afraid, hoping someone was in my position. It's not like I'm an elite engineer with 5+ years of experience under my belt, so I just don't feel super secure right now...

Edit: I should clarify that I feel relatively confident in my skills. I'm very skilled in Python (plus data libraries e.g. polars, duckdb, pandas) and SQL, I spent the last 9 months thrown into an Azure environment and familiarized myself a ton with cloud stuff and Synapse Analytics before migrating to Fabric. Got very comfortable with terraform, spark, and general SDLC/team stuff. I come from a more traditional developer background, so I'm familiar with version control + CI/CD. I spent most of my time optimizing queries/pipelines, debugging pipelines, and building internal tooling to help debug/prevent pipeline problems in a relatively big-data environment. In general I feel like a jack of all trades with a shallow mastery in Python/SQL. Every job posting I'm seeing feels like it's out of my league (as in I feel underqualified). Idk what the hell I should be applying for, and I can basically only do local or remote, and local options are few and far between where I live. I have no references - just work history.

reddit.com
u/ThrowRA0429100 — 2 days ago

Portfolio approach and projects?

Hi, I have almost 2 years of experience in SAP BW.

I want to switch from SAP BW to Data Engineering. I want to put some projects into my portfolio and then apply to companies.

I have considered looking for Data Engineering projects within my own company, but they don't allow this kind of cross-platform change.

So I reckon my best move is to change companies.

I have a little bit of experience with Fabric, as my current client is using it and I helped them with data ingestion from BW.

I believe I should put that too in my portfolio.

I am really not sure how to approach this.

It would be really helpful if someone has insights on this.

Thank you.

reddit.com

The BI team was gutted overnight, and I’m one of the few left. How do I deal with the "survivor’s guilt" and the feeling that my company is just winging it?

Yesterday, my company went through a major round of layoffs without warning. The entire BI team in our analytics department, colleagues I’ve worked with for the past six months since I joined as a junior DE, was let go, leaving only one person in that entire department. Management is framing this as an "AI-first" pivot, replacing those Power BI focused roles with tools like Claude Code, but the reality on the ground feels chaotic and completely unproven.

My team (Data Engineering) survived, which puts us in the strange position of being the "pillars" who now have to build the pipes for an AI that hasn't proven it can handle the workload of the team we just lost.

I’m struggling with a few things and could use some perspective from others who have been through this:

The Guilt: It’s hard to sit at my desk knowing my teammates were shown the door, especially as someone relatively early in their career. How do you process this without letting it eat you alive?

The "Skeleton Crew" Reality: Has anyone else had to watch their company bet the farm on AI tools to replace real people? It feels like we’re being asked to build something that isn't ready to replace the institutional knowledge we just threw away.

The Professional Uncertainty: I feel "safe" on paper, but the culture feels fundamentally broken. How do you stay grounded when the company you were hired into feels like a completely different place than it was 48 hours ago?

I’m just looking for some advice on how to handle the emotional toll of this. It’s been a rough 24 hours, and I’m finding it hard to just "go back to work" like nothing happened. We have monthly meetings with our entire Analytics department, and in January the SVP said, "Don’t worry, hang tight, we don’t plan on any layoffs happening any time soon." What a joke, on top of the monthly vacation/trip pictures, which feel like a middle finger to them.

reddit.com
u/typodewww — 2 days ago

Are open table formats dead?

Last year everyone was suddenly talking about open table formats (Apache Iceberg, Delta Lake, etc.), and now, just as suddenly, no one seems to be talking about them. Are you guys still using Iceberg or Delta Lake, or have you found an alternative approach to open table formats?

reddit.com
u/ClassroomFar8509 — 2 days ago