r/dataengineering

You just started in a new company. Huge and messy repository. What do you do first?

Especially nowadays with AI, what's your protocol for making sense of everything before you start changing stuff?

Technical article on LTAP

I've been cautiously watching Databricks' announcement of OLTP/OLAP unification from their summit. They dub it Lake-TAP as opposed to HTAP. Now their cofounder has published this technical piece, which popped up on my HN:
https://www.databricks.com/blog/lakebase-ltap-rethinking-database-storage

This made me look at it very differently from their original announcement at their conference. Basically, I think this is a great capability of their Lakebase Postgres offering (assuming it works, I still haven't had my hands on it): You store your data in Postgres but the source of truth is stored in Iceberg/Parquet on S3. That means you can now do warehousing on all the Postgres data directly because it's just our usual big data stack of Iceberg etc.

This seems like a bigger/different deal than how the company presents it. That is, it's really Postgres on Iceberg more than some grand unification theory. The fact that both are open source standards (Postgres and Iceberg) means it could have legs. Personally, I think CDC is a necessary evil for data engineering. But maybe no longer necessary?

u/Dry_Chocolate_9396 — 18 hours ago

▲ 14 r/dataengineering+1 crossposts

Thoughts on new LTAP/Lakebase arch

Read a Databricks piece on Lakebase/LTAP and wrote a short note on the idea that clicked for me: maybe OLTP and OLAP should meet at storage, not inside one engine.

https://ameeer.in/posts/ltap-storage-layer/

reddit.com

u/ExmachinaCoffee — 16 hours ago

▲ 3 r/dataengineering

SQL design for a subscription service

I'm trying to develop SQL tables for a subscription service as part of my uni coursework. It's for a subscription microservice, so it only handles subscription-related stuff. A subscription then grants certain 'privileges' such as ad-free and so on, which will affect how the other microservices work. There's only one paid tier, so the structure is very simple. My question is:

Should I:
a) make a SQL table which details the exact tiers and attributes (adfree, send notifications etc.)
b) leave the attributes which aren't strictly payment/billing related OUT of the table because the microservices can handle that on their own (e.g. this is a plus member, so this microservice can figure out on its own what extra privileges relevant to itself it should grant)

B seems like the cleaner option, as from a development perspective it makes no sense to pass the user's exact privileges to every single microservice they access when each service can easily determine what to do on its own. But what worries me about this implementation is that there isn't exactly a 'single source of truth' for what tier does what. I also don't want to be seen as lazy, as if I found a way to avoid writing out all the tier attributes myself.
Also, since this is a coursework piece, the other microservices do not actually exist, so it isn't possible to just check whether they handle it on their own.
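
For comparison, here is a minimal sketch of what option (a) could look like, using Python's built-in `sqlite3` so it runs anywhere. The table names (`tier`, `tier_privilege`, `subscription`), the privilege strings, and the `privileges_for` helper are all assumptions made up for illustration, not a recommended final schema. The point is only that a small mapping table gives one place to look up what a tier grants.

```python
import sqlite3

# Hypothetical schema for option (a): tiers and their privileges live in one place.
SCHEMA = """
CREATE TABLE tier (
    tier_id INTEGER PRIMARY KEY,
    name    TEXT NOT NULL UNIQUE
);
CREATE TABLE tier_privilege (
    tier_id   INTEGER NOT NULL REFERENCES tier(tier_id),
    privilege TEXT NOT NULL,
    PRIMARY KEY (tier_id, privilege)
);
CREATE TABLE subscription (
    user_id INTEGER PRIMARY KEY,
    tier_id INTEGER NOT NULL REFERENCES tier(tier_id),
    status  TEXT NOT NULL CHECK (status IN ('active', 'cancelled'))
);
"""


def build_demo_db():
    """Create an in-memory database with one free and one paid tier."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO tier VALUES (?, ?)", [(1, "free"), (2, "plus")])
    conn.executemany(
        "INSERT INTO tier_privilege VALUES (?, ?)",
        [(2, "ad_free"), (2, "send_notifications")],
    )
    conn.executemany(
        "INSERT INTO subscription VALUES (?, ?, ?)",
        [(100, 2, "active"), (101, 1, "active")],
    )
    conn.commit()
    return conn


def privileges_for(conn, user_id):
    """Return the sorted privileges granted by a user's active subscription."""
    rows = conn.execute(
        """
        SELECT p.privilege
        FROM subscription s
        JOIN tier_privilege p ON p.tier_id = s.tier_id
        WHERE s.user_id = ? AND s.status = 'active'
        ORDER BY p.privilege
        """,
        (user_id,),
    ).fetchall()
    return [row[0] for row in rows]
```

With this layout, option (b) is still available: other services could ask only for the tier name and ignore `tier_privilege` entirely, so the mapping table costs little if it turns out to be unnecessary.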

reddit.com

u/Primary-Change5225 — 14 hours ago

▲ 19 r/dataengineering+11 crossposts

PROJECT REVIEW

Hello everyone! I just completed a big project that I have been working on for a month, and I want your opinion on it.

It's a SpaceX Launch Predictor & Cost Optimizer (A full end-to-end ML system that predicts the probability of a SpaceX Falcon 9 booster landing successfully, enriches launch data with real weather conditions, and exposes the results through an interactive Streamlit web application with a business ROI calculator.)

It includes a data pipeline, advanced machine learning algorithms (with hyperparameter tuning), explainable AI (SHAP), MLOps (AWS S3, Docker) and business value (ROI calculator = financial results).

FUN FACT: For this project I used my own evaluation metric library (it standardizes supervised and unsupervised model diagnostics into a single, consistent API), which is also verified and published on PyPI.

Project Info: https://github.com/Alkiviadisss/SpaceX

github.com

u/Senior-Neck499 — 2 days ago

▲ 50 r/dataengineering

Conformed Dimensions vs Dependency Explosion

Alright everyone, lower your voices. Bring it in. Let’s talk about the thing no one in data engineering is talking about right now. JUST KIDDING! Not another slop post.

For years I built independent data marts and the Kimball strategy was clear: associate each fact with as many dimensions as possible, and reuse dimensions across facts as much as possible. Fill in the squares in the bus matrix. But with a modern 3-layer data lakehouse in the cloud in a big enterprise, it can't all end up as one big star because the dependency explosion will slow down changes. So you separate the stars by business case or closely related business cases. But then every star needs the employee dimension, for example. We don't want separate employee dimensions across our org, but if every report uses the employee dimension, that's a tough model to change when required.

Would love to hear how others have handled this and what were the benefits/tradeoffs.

Ideas

If a fact shares most dimensions with a given star, it goes in that star. Benefit - efficiency of development. Drawback - some dependency explosion leading to just a few very large stars.

Only a select few "key" dimensions are truly shared across stars. The rest are created distinct within a given star, even if we steal code from existing models. It's ok because we use natural keys or hashes of them, so ultimately cross-domain analysis across stars is technically possible. Benefit - individual use cases can select from a small number of "shared" dimensions and build the rest according to requirements. Drawback - similar models are loaded twice, resulting in extra compute, and multiple versions of the truth become possible.
<your idea here>
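
To make the "natural keys or hashes of them" point concrete, here is a minimal sketch, assuming each star derives its dimension key deterministically from the same natural key. The function name `dimension_key`, the normalization rules, and the sample employee IDs are assumptions for illustration only.

```python
import hashlib


def dimension_key(*natural_key_parts):
    """Derive a deterministic surrogate key from natural key parts.

    Parts are trimmed and lower-cased so that independently built stars
    agree on the key. Assumption: no part contains the '|' separator.
    """
    normalized = "|".join(str(part).strip().lower() for part in natural_key_parts)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Two stars build their own employee dimension from the same natural key.
sales_employees = {dimension_key("E-1001"): {"name": "A. Example", "region": "EMEA"}}
support_employees = {dimension_key(" e-1001 "): {"name": "A. Example", "queue": "Tier 2"}}

# Cross-star analysis is still possible because the keys line up.
shared_keys = sales_employees.keys() & support_employees.keys()
```

This only keeps the join possible. It does nothing to stop the two dimensions from drifting apart in their attributes, which is the "multiple versions of the truth" drawback described in the second idea.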

Thank you!

Edit:

Seems to be a recurring question for an example so I'll imagine one here for illustration. Employee dimension is just the most obvious example, because everyone needs it.

Let's say someone wants to change the org structure (parent/child departments). And the org structure doesn't come from a single source because of affiliates, contractors, etc. One department wants contractors split out separately, another wants them merged. Or one dashboard wants to group some departments that are only marginally related to a given business process, because it would be weird on the report to show each department normally when 2 would have all the volume and the other 9 would have 1%. So just group the other 9 into a bucket.

So if it's HR (data owner) deciding this, do they just say "I am the law" and make the change, or do they need approval from all other departments? Seems tedious. What if the "grouping" scenario does not come from HR, and so another team needs to add a department_group_domainC field to the employee dimension: do they need approval from HR (shivers down my spine just saying that)?

reddit.com

u/Truth-and-Power — 3 days ago

▲ 48 r/dataengineering

How do you ensure that the data is 100% clean apart from manual review?

Hi!

So I am working on cleaning up our customer data quality to arrive at a customer master data set. I tried to check for duplicates, nulls, invalid email formats and phone numbers, etc. I also tried to review some logic with the business, like an inactive customer cannot have an active subscription, etc.

However, my problem is that when just skimming the data, I still see some weird data quality issues, like a full name and last name combined (i.e., the last name is made redundant and entered in both full name and last name), some company names that are zzzz or are named customer, some first names that contain Mr and Mrs, etc. Is this the part where AI will be useful? Or is there a more deterministic and appropriate approach for this?
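
As one illustration of the deterministic route, the three issues named above can each be written as a small rule. This is a minimal sketch: the field names (`first_name`, `last_name`, `full_name`, `company_name`), the title list, and the placeholder patterns are assumptions and would need tuning against the real data.

```python
import re

# Honorific at the start of a first name, e.g. "Mr John" or just "Mrs".
TITLE_RE = re.compile(r"^(mr|mrs|ms|miss|dr)\.?(\s+|$)", re.IGNORECASE)
# Obvious placeholder values for a company name (assumed list, extend as needed).
PLACEHOLDER_RE = re.compile(r"^(z{3,}|x{3,}|customer|test|n/?a)$", re.IGNORECASE)


def find_issues(record):
    """Return a list of rule names that the customer record violates."""
    first = (record.get("first_name") or "").strip()
    last = (record.get("last_name") or "").strip()
    full = (record.get("full_name") or "").strip()
    company = (record.get("company_name") or "").strip()

    issues = []
    if TITLE_RE.match(first):
        issues.append("title_in_first_name")
    # The last-name field holds the whole name instead of just the surname.
    if last and full and last.lower() == full.lower():
        issues.append("last_name_equals_full_name")
    if company and PLACEHOLDER_RE.match(company):
        issues.append("placeholder_company_name")
    return issues


sample = {
    "first_name": "Jane",
    "last_name": "Jane Doe",
    "full_name": "Jane Doe",
    "company_name": "zzzz",
}
flags = find_issues(sample)
```

Rules like these are cheap to run over every row and are easy to review with the business. Anything they cannot express is a smaller, better-defined set to hand to manual review or a fuzzier method.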

What are your thoughts?

reddit.com

u/Arethereason26 — 3 days ago

▲ 53 r/dataengineering

How bad is the lock in for Azure Fabric?

I am working with a client and they got sold hard on the Azure Fabric platform. I am there to assist them and I am trying to point them in a direction where they are not as locked in with one vendor.

So for those who have made the move to Fabric, how difficult was it to move in and then out?

reddit.com

u/sysacc — 3 days ago

▲ 0 r/dataengineering

Reality of job hunting for DE roles in 2026

I can’t believe I am the only one here who is hit with disappointment when this happens to them… I really don’t see the point of wanting dbt in your project unless you want to have Analysts building pipelines instead of Data Engineers.

u/BaseballLimp3423 — 2 days ago

▲ 59 r/dataengineering+1 crossposts

Hardwood 1.0: A Fast, Lightweight Apache Parquet Reader for the JVM

morling.dev

u/gunnarmorling — 3 days ago

▲ 30 r/dataengineering+3 crossposts

'No hope of protecting it': inside the data oversight crisis facing the public service

One in three public-sector data professionals do not trust the data held within their own departments, a recent survey showed.

The survey of 133 public-sector data professionals between February and April 2026 suggested tools for tracking data assets, and more than half said departments did not document the reasons for collecting data.

---

As the person who ran this research (and an ex-public servant), do you agree with our findings? Do you think trust in data is higher, lower or were we about right?

archive.is

u/sam-at-aristotle_mdr — 3 days ago

▲ 63 r/dataengineering

is anyone else using scala?

hello devs,

Just a question: are companies still using Scala in their businesses, or are they using a mix of scala and other codes/AI? Is it hard to hire in this area?

Just looking for opinions on this,

Thanks

reddit.com

u/AlexaG_2026 — 4 days ago

▲ 995 r/dataengineering+5 crossposts

101 concepts every data engineer should know (or some of them :)

This is me updating the concept page with the latest addition, including backlinks and a pop-up preview for each term. I hope it's useful.

u/Turbulent_Board_9291 — 4 days ago

▲ 10 r/dataengineering

1 Month Job Search | 5-7 YOE

Hello!

I just wanted to share my own stats following being impacted by recent layoffs. Interviewing can be intense so I'm happy to answer any questions about the experience and what did/didn't work for me.

Some notes:

- Most applications were done via cold apps on company portals or LinkedIn
- ~15% conversion rate to interviews since I started applying a month ago
  - Many of these are recent applications and may yet yield a response, TBD
- 2 interviews seen through to the end; 1 offer
- Withdrew from ongoing conversations after the offer, as I was happy with the company I landed (hence the high withdraw %)

https://preview.redd.it/ko3kj09onuah1.png?width=2358&format=png&auto=webp&s=750c1ca9b32ff8d3f14c44db9f413695d2884c03

https://preview.redd.it/42lc649onuah1.png?width=2400&format=png&auto=webp&s=df507695e8baf806202a7717db492a2bc67500b0

https://preview.redd.it/d9d8v29onuah1.png?width=2400&format=png&auto=webp&s=225cf772c901e3fa9035100cdb77114f786c55dd

https://preview.redd.it/v977119onuah1.png?width=3024&format=png&auto=webp&s=e42754bd377a909a9ee9c87357e2f489dc548b8f

reddit.com

u/WorriedMeat — 3 days ago

▲ 74 r/dataengineering+1 crossposts

Data Quality pattern I landed on using dbt + DQX

dbt tests are a great CI gate, but they run after the model builds and only detect problems: by the time one fails, the bad data is already in your table. For a "keep the good rows, isolate the bad ones, keep going" pattern, you need row-level DQ that runs in-transit, and I adopted DQX as I use it with other Spark workloads too.

The thing I had to unlearn is that you do not rewrite your dbt SQL gold model as Python. The transformation logic stays in a normal `.sql` model, you just add a thin Python model beside it that dbt.ref()s it and applies DQX. The Python model (orders_gold_dq) becomes the published Gold table; the SQL model (orders_gold) becomes an internal intermediate. Your downstream consumers point to orders_gold_dq, not orders_gold.

-- models/orders_gold.sql
select order_id, customer_id, amount, status from {{ ref('orders_silver') }}

Thin DQ Layer:

# models/orders_gold_dq.py  
from databricks.labs.dqx.engine import DQEngine
from databricks.sdk import WorkspaceClient

def model(dbt, session):
    # Read the SQL gold model (orders_gold) as a DataFrame
    df = dbt.ref("orders_gold")
    # Row-level checks: "error" failures are quarantined, "warn" failures are only annotated
    checks = [
        {"criticality": "error", "check": {"function": "is_not_null",
            "arguments": {"column": "order_id"}}},
        {"criticality": "warn", "check": {"function": "is_in_list",
            "arguments": {"column": "status", "allowed": ["new", "paid", "shipped"]}}},
        {"criticality": "error", "check": {"function": "is_unique",
            "arguments": {"columns": ["order_id"]}}},
    ]
    dq = DQEngine(WorkspaceClient())
    valid_df, quarantine_df = dq.apply_checks_by_metadata_and_split(df, checks)
    # quarantine_df is not persisted in this snippet
    return valid_df  # clean rows become the published table

If your transformation is already a Python model (e.g. complex PySpark logic), you don't need the extra _dq.py layer at all; just embed DQX directly inside that model before the return.

error rows → quarantine only, never written to the clean table; warn rows → stay in clean output with _warnings metadata, not quarantined. Each quarantined row carries _errors/_warnings with the rule + a readable message.
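
The routing rule described above can be sketched in plain Python, without Spark, to make the semantics explicit. This is not DQX's implementation: `split_by_criticality`, the `(name, criticality, predicate)` check format, and the rule names are hypothetical, and the dataset-level uniqueness check is left out because it cannot be evaluated one row at a time.

```python
def split_by_criticality(rows, checks):
    """Split rows into (valid, quarantine) following the rule above.

    checks is a list of (name, criticality, predicate) tuples, where
    predicate(row) returns True when the row passes the check.
    """
    valid, quarantine = [], []
    for row in rows:
        errors = [name for name, crit, ok in checks if crit == "error" and not ok(row)]
        warnings = [name for name, crit, ok in checks if crit == "warn" and not ok(row)]
        if errors:
            # Any failed "error" check sends the row to quarantine only.
            quarantine.append({**row, "_errors": errors, "_warnings": warnings})
        else:
            # Rows with only "warn" failures stay in the clean output, annotated.
            valid.append({**row, "_warnings": warnings})
    return valid, quarantine


CHECKS = [
    ("order_id_not_null", "error", lambda r: r["order_id"] is not None),
    ("status_in_allowed_list", "warn", lambda r: r["status"] in {"new", "paid", "shipped"}),
]

rows = [
    {"order_id": 1, "status": "paid"},
    {"order_id": None, "status": "new"},
    {"order_id": 2, "status": "refunded"},
]
valid, quarantine = split_by_criticality(rows, CHECKS)
```

Here the second row is quarantined for the null `order_id`, while the third row stays in the clean output with a warning attached for its unexpected status.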

Wire DQX in as a serverless dep in dbt_project.yml (+submission_method: serverless_cluster,
+environment_dependencies: [databricks-labs-dqx]), then just dbt run or use your preferred scheduling pattern. And I scheduled this on Databricks Lakeflow with a dbt task, as seen in the picture above.

u/zr-brickster — 4 days ago

▲ 9 r/dataengineering

ADF - How to manage trigger parameters?

I built an ADO CI/CD pipeline that validates and publishes from our develop branch and exports a build artefact that can be deployed to preprod and prod with override parameters.

However, the ADF has now hit the 256 ARM template parameter limit. I have streamlined the linked services so none are redundant, and I have also optimised variables to be dynamic.

The issue is the number of trigger workflow parameters, since they are all hardcoded string values.

Can anyone advise on methods to reduce trigger parameters? (Outside of the obvious: having fewer triggers or creating a second ADF.)

I'm currently looking into Airflow to trigger pipelines and dataflows.

reddit.com

u/Otherwise_Western431 — 4 days ago

▲ 8 r/dataengineering

FREE review copy of my new book "Data Engineering for Beginners"

rajamanickam.com

u/qptbook — 3 days ago

▲ 62 r/dataengineering+1 crossposts

Is it rare for someone to have PowerBI and SQL experience?

Edit: I never said I was a unicorn—I specifically said I wasn't— that's what the staffing agency said. I also don't know what they're paying, I'm just helping with interviewing and thought what the staffing agency said didn't make sense. People are coming at me like I said that and like I'm doing the hiring.

Original post: We're hiring a developer and the company I'm with is using an offshore staffing agency.

We needed someone who has worked in PowerBI and can understand SQL. The staffing agency said that's a unicorn. They said you'll typically have someone who either specializes in data or specializes in PowerBI.

I don't consider myself a unicorn, and I have a lot to learn, but I can work in PowerBI and write SQL. I don't feel I'm that special, so I wanted to ask, is that actually rare?

I'm based in the US and the offshore team is based in India. Is that rare to find that skillset or is it more likely that it's rare at the company's price point? That could be it too. You do get what you pay for.

I'm curious what you all may have seen.

About me. I started as a data analyst and worked my way into BI Analytics, but I do end-to-end pipelines and visualizations. I thought more people work with PowerBI and SQL. I could be mistaken, maybe it is rare.

reddit.com

u/hijkblck93 — 6 days ago

▲ 10 r/dataengineering

Portable vs Fivetran, anyone make the swap?

Recently the sales team at Portable reached out on LinkedIn to pitch their product. I had never heard of them but have a strong disdain for Fivetran's pricing model. So has anyone else used them for data ingestion? Recently we moved a bunch of postgres<>bigquery connectors out of Fivetran into Datastream, so now we mostly use Fivetran for all of our Ads connectors and some other odd one-off things. Specifically, has anyone used Portable to get data into BigQuery? I don't see much documentation about BQ on their site.

reddit.com

u/limeslice2020 — 5 days ago

▲ 15 r/dataengineering

I've been working on a self hosted dagster/dbt/evidence setup. Looking for feedback and suggestions for improvements.

github.com

u/dashiznit101 — 4 days ago