Repository organization in Databricks (Lakeflow/DLT)
Hey everyone,
Looking for some architectural advice on directory and file organization for a large-scale project. We are migrating to Databricks’ new Lakeflow Pipelines (pyspark.pipelines / dp) using a fully config-driven Medallion architecture, and we're trying to prevent our repository from becoming unmaintainable.
The Scale & Setup
- Data Size: ~300 raw tables across 3 distinct financial data providers.
- Architecture: Medallion (Bronze/Silver/Gold) deployed via Databricks Asset Bundles (DABs) into Unity Catalog.
- The Pattern: We are using a config-driven approach (YAML files defining schemas/DQ rules) passed into a Python for loop that dynamically generates the dp.table and dp.view structures. We are splitting the ingestion into separate pipelines by provider to avoid driver bottlenecks.
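For context, here is a minimal sketch of the loop pattern described above. It is not our actual code: the `table` decorator and `REGISTRY` are stand-ins for dp.table so the snippet runs anywhere, and the config shape, table names, and paths are made up for illustration.

```python
from typing import Callable, Dict

# Stand-in for the pipeline decorator so this sketch runs without Databricks.
# In the real pipeline this role is played by dp.table; REGISTRY is hypothetical.
REGISTRY: Dict[str, Callable[[], dict]] = {}


def table(name: str):
    def decorator(fn):
        REGISTRY[name] = fn
        return fn
    return decorator


# Hypothetical shape of one provider's parsed YAML config.
CONFIG = {
    "provider": "provider_a",
    "tables": [
        {"name": "trades", "source_path": "/raw/provider_a/trades"},
        {"name": "positions", "source_path": "/raw/provider_a/positions"},
    ],
}


def register_bronze(provider: str, spec: dict) -> None:
    """Define one table. The factory function pins `spec` for each iteration,
    which avoids the late-binding closure problem of defining it in the loop."""

    @table(name=f"bronze_{provider}_{spec['name']}")
    def _load() -> dict:
        # A real body would return a DataFrame read from spec["source_path"].
        return {"reads_from": spec["source_path"]}


for spec in CONFIG["tables"]:
    register_bronze(CONFIG["provider"], spec)
```

The factory function is the part that matters here: if the decorated function is defined directly inside the for loop, every generated table can end up reading the last loop value.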
The Complexity (Where it gets messy)
Bronze is a clean 1:1 loop from raw. However, when we hit Silver and Gold:
- Many-to-Many Mappings: A single raw/bronze table often feeds into multiple business entities (e.g., one raw table supplies parts of several business objects that are built in Gold).
- Cross-Provider Joins: Gold entities require joining across the different providers to build the final target application objects.
Our Current Proposed Structure
├── config/
│ ├── provider_a.yaml # Metadata for 80 tables
│ └── provider_b.yaml
├── src/
│ ├── pipelines/
│ │ ├── ingest_provider_a.py # Generic loop for Bronze -> Silver
│ │ ├── ingest_provider_b.py
│ │ └── build_gold_ledger.py # Bespoke cross-provider joins
The Questions for the Community
- File Granularity: For the Silver layer where a single config-driven table needs to fork into multiple complex business entities, do you isolate those transformations into bespoke Python files per entity, or handle the routing directly inside the config loop logic?
- Repo Organization: If you've managed 300+ tables in a declarative framework, what does your actual src/ folder look like? How do you organize the custom SQL/PySpark transformation snippets so they don't get buried in a monolithic script?
- Pipeline Boundaries: Databricks recommends splitting pipelines by domain to avoid high initialization times. How do you split your Python files to align cleanly with separate pipeline definitions in your DAB bundle config?
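To make the File Granularity question concrete, this is roughly what the "bespoke file per entity" option could look like. It is a sketch under assumptions, not something we have built: the `TRANSFORMS` registry, the `entity` decorator, the `ROUTING` mapping, and the column names are all hypothetical, and plain lists of dicts stand in for DataFrames so it runs anywhere.

```python
from typing import Callable, Dict, List

Row = dict
Transform = Callable[[List[Row]], List[Row]]

# Hypothetical registry: entity name -> transform function.
TRANSFORMS: Dict[str, Transform] = {}


def entity(name: str):
    """Register a transform under an entity name so config can refer to it."""
    def decorator(fn: Transform) -> Transform:
        TRANSFORMS[name] = fn
        return fn
    return decorator


# Each of these would live in its own module, one file per business entity.
@entity("account")
def build_account(rows: List[Row]) -> List[Row]:
    return [{"account_id": r["acct"]} for r in rows]


@entity("ledger_entry")
def build_ledger_entry(rows: List[Row]) -> List[Row]:
    return [{"account_id": r["acct"], "amount": r["amt"]} for r in rows]


# Hypothetical config fragment: one bronze table fans out to two entities.
ROUTING = {"bronze_provider_a_trades": ["account", "ledger_entry"]}


def fan_out(source: str, rows: List[Row]) -> Dict[str, List[Row]]:
    """Apply every transform the config routes this source table to."""
    return {name: TRANSFORMS[name](rows) for name in ROUTING[source]}
```

The idea is that the config only says which entities a table feeds, while the transformation logic stays in named functions that can be tested individually. The alternative is to push the routing rules themselves into the loop, which is what we are unsure about.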
Would love to see examples or hear lessons learned from anyone who has tackled this scale without losing their sanity. Thanks!