Hi everyone,
I’ve lately been digging into fundamentals around ingestion and I was wondering if anybody can share some experience or resources to check.
When doing batch ingestion pipelines from relational databases, how do you approach schema enforcement and schema evolution?
For context, I am playing around in some kind of a hybrid-DBT style architecture where I have landing layer for snapshots/deltas/CDC. These serve as my DBT sources, which I transform through staging/intermediate/mart layers. I am using BQ for DWH. I use simple python on Airflow for orchestration and ingestion?
So some questions I was wondering about. Would you write to object storage or immediately BigQuery landing_ schema for the landing layer? Some of these tables are very small and rarely updated, so it seemed like unnecessary overhead? If I go with object storage would you go plain parquet first or do you go with something like Iceberg? In general, lets say I want to think about this from first principles, what would you say these principles are for ingestion and what should I be careful? Also what would you say is SOTA for the relatively small scale pipelines/problems if I don’t want to over complicate the thing, but still be robust and as fault tolerant as possible?
Which leads me to my main concerns: In either of these two setups, how do you evolve schema? Lets say I load parquet, do you go SELECT * or pick only necessary? Did you ever regret not picking something? If I go table immediately, do you apply additive changes to landing tables or do you do strict schema evolution? Did you create the landing table beforehand with strict schema and git versioned DDL or do you kind of let it evolve freely in this first layer? Do you do some kind of manifest, dynamically generated DDL or code by hand? How does it scale to dozens of tables?
Looking forward to hearing your opinions and experience.
Thank you!