How should I update tables in Databricks?
I’m new to data engineering (I’m a junior data analyst), so any feedback would be appreciated. I have a setup in Databricks where I use Python scripts to ingest data from multiple SAP tables and land it in the bronze layer. Rows in the source tables can be changed, added, or deleted, and we always want the latest version of each table.
We’ve gone through a few iterations of how we update our silver tables from bronze. At first we just called CREATE OR REPLACE TABLE, so every run overwrote the whole table with fresh data.
Then, we used MERGE INTO to make it more efficient for incremental changes.
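For anyone unfamiliar with that step, here is a minimal sketch of what such a MERGE could look like, not our exact code. The helper `build_merge_sql` and the table and key names (`silver.customers`, `bronze.customers`, `customer_id`) are made up for illustration. It assumes Delta tables with matching column names (so `UPDATE SET *` and `INSERT *` apply) and that bronze holds a full snapshot, so rows missing from bronze can be deleted from silver. The `WHEN NOT MATCHED BY SOURCE` clause needs a reasonably recent Databricks runtime.

```python
from typing import Sequence


def build_merge_sql(target: str, source: str, key_columns: Sequence[str]) -> str:
    """Build a Delta Lake MERGE statement that syncs `target` with `source`.

    Hypothetical helper: the names are inserted as-is, so only pass
    trusted table and column identifiers.
    """
    if not key_columns:
        raise ValueError("at least one key column is required")
    # Match rows on the business key(s), e.g. t.customer_id = s.customer_id
    on_clause = " AND ".join(f"t.{c} = s.{c}" for c in key_columns)
    return (
        f"MERGE INTO {target} AS t\n"
        f"USING {source} AS s\n"
        f"ON {on_clause}\n"
        "WHEN MATCHED THEN UPDATE SET *\n"          # changed rows
        "WHEN NOT MATCHED THEN INSERT *\n"          # new rows
        "WHEN NOT MATCHED BY SOURCE THEN DELETE"    # rows removed upstream
    )


sql = build_merge_sql("silver.customers", "bronze.customers", ["customer_id"])
print(sql)
```

In a Databricks notebook the resulting string could be run with `spark.sql(sql)`. As written, the statement updates every matched row, whether or not it actually changed.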
Then, we used row-hash comparison in Python to update all the tables.
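To show what I mean by row-hash comparison, here is a simplified pure-Python sketch of the idea, not our production code. The function names, the `id` key, and the sample rows are invented for illustration, and it assumes each table has a single unique key column and is small enough to hold in memory.

```python
import hashlib


def row_hash(row, columns):
    """Hash the chosen columns of one row (a dict) into a hex digest."""
    # Unit-separator between fields so ("ab", "c") and ("a", "bc") differ;
    # None gets its own marker so it is not confused with the string "None".
    payload = "\x1f".join(
        "\x00" if row[c] is None else str(row[c]) for c in columns
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def diff_by_hash(source_rows, target_rows, key, columns):
    """Return (inserts, updates, deletes) as sorted lists of key values."""
    src = {r[key]: row_hash(r, columns) for r in source_rows}
    tgt = {r[key]: row_hash(r, columns) for r in target_rows}
    inserts = sorted(k for k in src if k not in tgt)   # only in source
    updates = sorted(k for k in src if k in tgt and src[k] != tgt[k])
    deletes = sorted(k for k in tgt if k not in src)   # only in target
    return inserts, updates, deletes


bronze = [
    {"id": 1, "name": "Ann", "city": "Oslo"},
    {"id": 2, "name": "Bob", "city": "Rome"},   # city changed
    {"id": 4, "name": "Dee", "city": "Bern"},   # new row
]
silver = [
    {"id": 1, "name": "Ann", "city": "Oslo"},
    {"id": 2, "name": "Bob", "city": "Paris"},
    {"id": 3, "name": "Cy", "city": "Lima"},    # deleted upstream
]

inserts, updates, deletes = diff_by_hash(bronze, silver, "id", ["name", "city"])
print(inserts, updates, deletes)  # [4] [2] [3]
```

Only the keys in those three lists need to be written to silver. The trade-off is that every row still has to be read and hashed on each run.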
The tables don’t have many rows yet, with the largest having almost 1M rows. But we are constantly ingesting more tables as the project size grows.
Looking back, maybe all those iterations were a waste of time, since the tables aren’t big enough for it to matter. Our goal was to minimize SQL run time and therefore cost.
Those who are seasoned experts, what do you think?