u/Money-Meaning1561

Repository organization in Databricks (Lakeflow/DLT)

Hey everyone,

Looking for some architectural advice on directory and file organization for a large-scale project. We are migrating to Databricks’ new Lakeflow Pipelines (pyspark.pipelines / dp) using a fully config-driven Medallion architecture, and we're trying to prevent our repository from becoming unmaintainable.

The Scale & Setup

  • Data Size: ~300 raw tables across 3 distinct financial data providers.
  • Architecture: Medallion (Bronze/Silver/Gold) deployed via Databricks Asset Bundles (DABs) into Unity Catalog.
  • The Pattern: We are using a config-driven approach (YAML files defining schemas/DQ rules) passed into a Python for loop that dynamically generates the dp.table and dp.view structures. We are splitting the ingestion into separate pipelines by provider to avoid driver bottlenecks.
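For illustration, a minimal sketch of this kind of config-driven loop might look like the following. It is plain Python with no Databricks dependency, so several things are assumptions: `register_table` is a hypothetical stand-in for the `dp.table` decorator, `CONFIG` is an assumed shape for one parsed YAML file, and the table and column names are made up. The point it shows is the factory function: each table definition is created inside its own function call, so every generated function keeps its own config instead of all of them seeing the last loop value.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for the pipeline's table registry. In a real
# pipeline the decorator from pyspark.pipelines would play this role.
REGISTRY: Dict[str, Callable[[], dict]] = {}


def register_table(name: str):
    """Register a table-building function under an explicit name."""
    def decorator(fn: Callable[[], dict]) -> Callable[[], dict]:
        REGISTRY[name] = fn
        return fn
    return decorator


# Assumed shape of one provider's YAML config after parsing (illustrative names).
CONFIG = {
    "provider": "provider_a",
    "tables": [
        {"name": "trades", "required": ["trade_id"]},
        {"name": "positions", "required": ["account_id", "as_of_date"]},
    ],
}


def make_bronze_table(provider: str, table_cfg: dict) -> str:
    """Define one bronze table. Doing this in a function (not inline in the
    loop) binds provider/table_cfg per call and avoids late-binding bugs."""
    name = f"bronze_{provider}_{table_cfg['name']}"

    @register_table(name)
    def _build() -> dict:
        # Placeholder for the real read + data quality logic.
        return {
            "source": f"raw.{provider}.{table_cfg['name']}",
            "required": list(table_cfg["required"]),
        }

    return name


def build_pipeline(config: dict) -> List[str]:
    """Generate every bronze table declared in the config."""
    return [make_bronze_table(config["provider"], t) for t in config["tables"]]


TABLE_NAMES = build_pipeline(CONFIG)
```

Calling `REGISTRY["bronze_provider_a_trades"]()` returns the definition built from the `trades` entry only, which is the behaviour the factory function is there to guarantee.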

The Complexity (Where it gets messy)

Bronze is a clean 1:1 loop from raw. However, when we hit Silver and Gold:

  • Many-to-Many Mappings: A single raw/bronze table often feeds into multiple business entities (e.g., one raw table is split so that different subsets of it end up in several of the business objects we build in Gold).
  • Cross-Provider Joins: Gold entities require joining across the different providers to build the final target application objects.
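To make the two cases above concrete, here is a small sketch of one way the mapping could be written down, assuming an entity-centric config where each Gold entity lists the bronze tables it reads. All names (`ENTITY_SOURCES`, the entities, the `provider.table` naming) are hypothetical. Inverting the mapping shows which entities a single bronze table feeds, and grouping by provider prefix shows which entities need cross-provider joins.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical mapping: Gold entity -> bronze tables it depends on,
# written as "<provider>.<table>".
ENTITY_SOURCES: Dict[str, List[str]] = {
    "ledger_entry": ["provider_a.trades", "provider_b.settlements"],
    "counterparty": ["provider_a.trades", "provider_a.accounts"],
}


def entities_by_source(entity_sources: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Invert the mapping: bronze table -> entities that read from it."""
    by_source = defaultdict(list)
    for entity, sources in entity_sources.items():
        for source in sources:
            by_source[source].append(entity)
    return {source: sorted(entities) for source, entities in by_source.items()}


def cross_provider_entities(entity_sources: Dict[str, List[str]]) -> List[str]:
    """Entities whose sources span more than one provider."""
    result = []
    for entity, sources in entity_sources.items():
        providers = {source.split(".", 1)[0] for source in sources}
        if len(providers) > 1:
            result.append(entity)
    return sorted(result)


FAN_OUT = entities_by_source(ENTITY_SOURCES)
CROSS_PROVIDER = cross_provider_entities(ENTITY_SOURCES)
```

In this made-up example `provider_a.trades` fans out to two entities, and only `ledger_entry` spans providers, so only it would need to live outside the per-provider pipelines.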

Our Current Proposed Structure

├── config/
│   ├── provider_a.yaml      # Metadata for 80 tables
│   └── provider_b.yaml
├── src/
│   ├── pipelines/
│   │   ├── ingest_provider_a.py  # Generic loop for Bronze -> Silver
│   │   ├── ingest_provider_b.py
│   │   └── build_gold_ledger.py  # Bespoke cross-provider joins

The Questions for the Community

  1. File Granularity: For the Silver layer where a single config-driven table needs to fork into multiple complex business entities, do you isolate those transformations into bespoke Python files per entity, or handle the routing directly inside the config loop logic?
  2. Repo Organization: If you've managed 300+ tables in a declarative framework, what does your actual src/ folder look like? How do you organize the custom SQL/PySpark transformation snippets so they don't get buried in a monolithic script?
  3. Pipeline Boundaries: Databricks recommends splitting pipelines by domain to avoid high initialization times. How do you split your Python files to align cleanly with separate pipeline definitions in your DAB bundle config?

Would love to see examples or hear lessons learned from anyone who has tackled this scale without losing their sanity. Thanks!

reddit.com
u/Money-Meaning1561 — 18 hours ago

Rack on road bike

Hey everyone, I have a road bike (Cube Attain SL 2022) that I've mainly used for competition. This summer I want to join some friends on a bike trip for a few days, and I'd like to know whether it's possible to install some sort of easily removable rack on the back of the bike. I don't see any eyelets on the rear of the frame, so I think a traditional rack is not an option, but I would like to hear your thoughts and experiences on this.
Thank you in advance for your help.

u/Money-Meaning1561 — 1 day ago