
I recently built this as an experimental project to go deep on Microsoft Fabric end to end: from raw data to a Gold star schema, with automated orchestration and CI/CD.
The project is based on a fictional omnichannel apparel retailer called CSNP & Co.
The synthetic dataset includes:
3 years of transactional history
142 stores
320K customers
3.2K SKUs
Store + online-style retail flows
The goal was not just to create a demo dataset, but to make it realistic enough to surface actual engineering problems you would face in a real Fabric project.
Architecture
The pipeline follows a medallion pattern:
Bronze: raw Parquet files
Silver: Delta tables with SCD1 / SCD2 merge logic
Gold: reporting-ready star schema
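For anyone less familiar with the SCD1 / SCD2 distinction in the Silver layer, here is a minimal plain-Python sketch of what an SCD2 merge does to a dimension. It is not the code from the repo (there the merge runs against Delta tables). The function name apply_scd2 and the column names valid_from, valid_to and is_current are assumptions chosen for illustration.

```python
from datetime import date


def apply_scd2(dim_rows, incoming, key, tracked, as_of):
    """Return a new list of dimension rows after applying SCD2 changes.

    dim_rows: existing rows, each carrying valid_from, valid_to, is_current.
    incoming: latest source rows, one per business key.
    tracked:  attribute names whose changes should create a new version.
    """
    result = [dict(r) for r in dim_rows]  # copy so the input is not mutated
    current = {r[key]: r for r in result if r["is_current"]}
    for src in incoming:
        existing = current.get(src[key])
        if existing is not None and all(existing[c] == src[c] for c in tracked):
            continue  # tracked attributes unchanged: keep the current version
        if existing is not None:
            # SCD2 closes the old version; SCD1 would overwrite it in place.
            existing["valid_to"] = as_of
            existing["is_current"] = False
        new_row = {key: src[key], **{c: src[c] for c in tracked}}
        new_row.update(valid_from=as_of, valid_to=None, is_current=True)
        result.append(new_row)
    return result


dim = [{"customer_id": 1, "city": "Leeds", "valid_from": date(2023, 1, 1),
        "valid_to": None, "is_current": True}]
updates = [{"customer_id": 1, "city": "York"}, {"customer_id": 2, "city": "Bath"}]
merged = apply_scd2(dim, updates, "customer_id", ["city"], date(2024, 6, 1))
```

After this call, customer 1 has two rows (the closed Leeds version and the current York version) and customer 2 has one new current row. With SCD1 there would be a single row per customer and no history.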
The implementation includes:
20 PySpark notebooks
One Fabric Data Pipeline for orchestration
Dependency-aware scheduling
Incremental processing
A custom Python wheel called csnp_helpers bundled into a Fabric Environment
Shared merge logic across notebooks
fabric-cicd for repeatable deployment across DEV / TEST / PROD workspaces
Daily scheduled run at 06:00 UTC
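To make "dependency-aware scheduling" concrete, here is a small sketch of deriving a safe run order from declared dependencies using Python's standard graphlib module. The notebook names are hypothetical, and in the actual project the ordering is expressed in the Fabric Data Pipeline rather than in Python.

```python
from graphlib import TopologicalSorter

# Hypothetical notebook names; each key maps to the notebooks it depends on.
dependencies = {
    "silver_customers": {"bronze_customers"},
    "silver_sales": {"bronze_sales"},
    "gold_dim_customer": {"silver_customers"},
    "gold_fact_sales": {"silver_sales", "gold_dim_customer"},
}


def run_order(deps):
    """Return one valid sequential order; raises CycleError on circular dependencies."""
    return list(TopologicalSorter(deps).static_order())


order = run_order(dependencies)
print(order)
```

Any order it returns places each notebook after everything it depends on, which is the same guarantee the pipeline's activity dependencies need to provide.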
A few lessons I learned the hard way:
Notebook metadata matters more than I expected.
If lakehouse and environment dependencies are not declared properly, pipeline runs can fail in confusing ways.
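One way to catch this before a pipeline run is a small pre-deployment check over the notebook source files. The sketch below assumes the Git-synced notebook format, where metadata is stored as JSON inside "# META" comment lines. The exact key names ("dependencies", "lakehouse", "environment") and the sample content are assumptions, so verify them against your own exported notebooks.

```python
import json

REQUIRED = ("lakehouse", "environment")
PREFIX = "# META "  # trailing space so "# METADATA ***" header lines are skipped


def read_meta(notebook_source):
    """Parse the first contiguous '# META' JSON block from a notebook source file."""
    lines = []
    for line in notebook_source.splitlines():
        if line.startswith(PREFIX):
            lines.append(line[len(PREFIX):])
        elif lines:
            break  # the first META block has ended
    return json.loads("\n".join(lines)) if lines else {}


def missing_dependencies(notebook_source):
    """List the required dependency sections that are absent or empty."""
    deps = read_meta(notebook_source).get("dependencies", {})
    return [name for name in REQUIRED if not deps.get(name)]


# Hypothetical notebook source with a lakehouse but no environment declared.
sample = """# Fabric notebook source

# METADATA ********************

# META {
# META   "kernel_info": {"name": "synapse_pyspark"},
# META   "dependencies": {
# META     "lakehouse": {"default_lakehouse_name": "lh_silver"}
# META   }
# META }

# CELL ********************
"""
print(missing_dependencies(sample))  # ['environment']
```

Running a check like this over all notebooks in the repo turns a confusing runtime failure into an explicit message at commit or deployment time.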
Parallelism needs to be handled carefully on F-capacity.
Spinning up 14 parallel Spark sessions quickly ran into HTTP 430 errors, the response Fabric returns when Spark capacity limits are hit. Chaining the notebooks sequentially was more reliable.
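As a rough illustration of the sequential approach, here is a sketch that runs notebooks one at a time and backs off when throttled. CapacityThrottled and the run_notebook callable are hypothetical stand-ins for whatever signals an HTTP 430 and triggers a run in your setup; this is not a Fabric API.

```python
import time


class CapacityThrottled(Exception):
    """Hypothetical error standing in for an HTTP 430 response."""


def run_sequentially(notebooks, run_notebook, max_retries=3, base_delay=1.0,
                     sleep=time.sleep):
    """Run notebooks one at a time, retrying with exponential backoff when throttled."""
    completed = []
    for name in notebooks:
        for attempt in range(max_retries + 1):
            try:
                run_notebook(name)
                completed.append(name)
                break
            except CapacityThrottled:
                if attempt == max_retries:
                    raise  # give up after the final retry
                sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    return completed


# Demo with a fake runner that is throttled once on one notebook.
calls = []


def fake_run(name):
    calls.append(name)
    if name == "silver_sales" and calls.count(name) == 1:
        raise CapacityThrottled(name)


delays = []
done = run_sequentially(["bronze_sales", "silver_sales"], fake_run, sleep=delays.append)
```

Passing sleep as a parameter keeps the demo instant. In a real run the default time.sleep applies, and the base delay would likely need to be much longer than one second.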
Notebook IDs are workspace-scoped.
A pipeline that references a notebook by its DEV ID will not find that ID in TEST or PROD, so ID references have to be remapped when deploying across DEV / TEST / PROD.
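The remapping idea can be sketched in a few lines: keep a per-environment lookup of notebook IDs by logical name, then swap them in the pipeline definition at deployment time. The GUIDs, the "notebookId" field and the retarget function are illustrative assumptions. This shows the concept only and does not use the fabric-cicd API.

```python
import json

# Hypothetical placeholder GUIDs; real values come from each workspace.
NOTEBOOK_IDS = {
    "DEV": {"nb_silver_sales": "11111111-0000-0000-0000-000000000001"},
    "TEST": {"nb_silver_sales": "22222222-0000-0000-0000-000000000001"},
}


def retarget(pipeline_json, source_env, target_env, ids=NOTEBOOK_IDS):
    """Swap every source-workspace notebook ID for its target-workspace equivalent."""
    text = pipeline_json
    for name, source_id in ids[source_env].items():
        text = text.replace(source_id, ids[target_env][name])
    return text


dev_pipeline = json.dumps({
    "activities": [
        {"name": "Run silver sales",
         "notebookId": NOTEBOOK_IDS["DEV"]["nb_silver_sales"]},
    ]
})
test_pipeline = retarget(dev_pipeline, "DEV", "TEST")
```

Keying the lookup by logical name rather than by ID means the mapping only needs one new entry per notebook per workspace.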
Reusable helper logic is worth it.
Moving common merge logic into a Python wheel made the notebooks much cleaner and easier to maintain.
Code is here:
https://github.com/bcsnpc/csnp-retail-platform
I’m sharing this mainly to get feedback from people working with Fabric / Spark / lakehouse architectures.
Would you structure anything differently, especially around notebook orchestration, CI/CD, or environment promotion?