So let’s start by saying I am not a data engineer, I work mostly on infra and DevOps as well as software engineering.
I built my company’s entire data stack, migrated some existing elements and kept some (BI tool and data warehouse).
We basically CDC a bunch of Aurora MySQL tables into Redshift via DMS. Those go into a staging schema, and dbt CI/CD pipelines on GitHub help engineering and analysts model stuff into separate schemas. I even went an extra mile and set up CI that creates temp schemas while using state stored in S3 to speed things up
I am not sure which improvements I can introduce at this point. Things work well, but I always bump into these blockers:
Analysts not good enough in SQL find it hard to push changes. Yes, they can instantly see their staged changes into Metabase, but without visual feedback it’s hard for them to write dbt models
I feel the whole stack uses outdated tech. I never used Spark, Delta, Hive… Not sure if those are relevant here
CDC in DMS is buggy and often requires manual intervention when errors do not auto-recover, mainly when certain DDL changes are made in the transactional DB
I slightly used Snowflake in the past for some POC stuff, how does cost compare to Redshift, which i feel is super expensive. Its most likely a superior product given how bad Redshift is around documentation and weird behavior overall, my worry is compliance and security since we run everything in a private network and exporting data outside aws would usually require a BAA