u/Constant_Plane4937 — reddlx

I recently built this as an experimental project to go deep on Microsoft Fabric end-to-end: from raw data to a Gold star schema, with automated orchestration and CI/CD.

The project is based on a fictional omnichannel apparel retailer called CSNP & Co.

The synthetic dataset includes:

3 years of transactional history

142 stores

320K customers

3.2K SKUs

Store + online-style retail flows

The goal was not just to create a demo dataset, but to make it realistic enough to surface actual engineering problems you would face in a real Fabric project.

Architecture

The pipeline follows a medallion pattern:

Bronze: raw Parquet files

Silver: Delta tables with SCD1 / SCD2 merge logic

Gold: reporting-ready star schema
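To make the SCD2 part of the Silver layer concrete, here is a minimal plain-Python sketch of the row-level logic: a changed record closes the current version and appends a new one, while an unchanged record is left alone. This is not the repo's merge code. In the notebooks this would be a Delta merge, and the function name and the `valid_from` / `valid_to` / `is_current` columns are assumptions made for illustration.

```python
from datetime import date


def scd2_merge(target, source, key, tracked, as_of):
    """Apply SCD2 changes from `source` to `target` (both lists of dicts).

    Hypothetical helper: the valid_from / valid_to / is_current column
    names are assumptions, not taken from the project.
    """
    result = [dict(row) for row in target]  # copy so the input is not mutated
    current = {row[key]: row for row in result if row["is_current"]}

    for incoming in source:
        existing = current.get(incoming[key])
        if existing is not None:
            if all(existing[col] == incoming[col] for col in tracked):
                continue  # nothing changed, keep the current version
            # Close the old version instead of overwriting it.
            existing["valid_to"] = as_of
            existing["is_current"] = False
        new_row = {key: incoming[key]}
        new_row.update({col: incoming[col] for col in tracked})
        new_row.update({"valid_from": as_of, "valid_to": None, "is_current": True})
        result.append(new_row)
        current[incoming[key]] = new_row
    return result


dim_customer = [
    {"customer_id": 1, "city": "Leeds", "valid_from": date(2023, 1, 1),
     "valid_to": None, "is_current": True},
]
updates = [{"customer_id": 1, "city": "York"}, {"customer_id": 2, "city": "Bath"}]
merged = scd2_merge(dim_customer, updates, key="customer_id",
                    tracked=["city"], as_of=date(2024, 6, 1))
```

An SCD1 merge is the simpler case: the matching row is updated in place and no history is kept.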

The implementation includes:

20 PySpark notebooks

One Fabric Data Pipeline for orchestration

Dependency-aware scheduling

Incremental processing

A custom Python wheel called csnp_helpers bundled into a Fabric Environment

Shared merge logic across notebooks

fabric-cicd for repeatable deployment across DEV / TEST / PROD workspaces

Daily scheduled run at 06:00 UTC
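The dependency-aware part of the orchestration can be pictured as a topological sort over the notebooks. The sketch below uses Python's standard `graphlib` to group notebooks into waves where every upstream notebook has already finished. The notebook names are hypothetical, and in the project the ordering lives in the Fabric Data Pipeline rather than in code like this.

```python
from graphlib import TopologicalSorter

# Hypothetical notebook names; each key maps to the notebooks it depends on.
DEPENDENCIES = {
    "silver_customers": {"bronze_customers"},
    "silver_sales": {"bronze_sales"},
    "gold_dim_customer": {"silver_customers"},
    "gold_fact_sales": {"silver_sales", "gold_dim_customer"},
}


def execution_waves(dependencies):
    """Group notebooks into waves whose upstream notebooks are all done."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


for number, wave in enumerate(execution_waves(DEPENDENCIES), start=1):
    print(number, wave)
```

Notebooks within a wave are independent of each other, so they are the only candidates for running in parallel. Running each wave sequentially is always safe.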

A few lessons I learned the hard way:

Notebook metadata matters more than I expected. If lakehouse and environment dependencies are not declared properly, pipeline runs can fail in confusing ways.
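A cheap guard is to check the metadata before deploying. The sketch below assumes the notebook is stored in the Git-synced `notebook-content.py` form, where the notebook-level metadata is a JSON object spread over `# META` comment lines with a `dependencies` section. The exact key names used here (`default_lakehouse`, `environmentId`) are assumptions to verify against your own files, and both function names are hypothetical.

```python
import json

META_PREFIX = "# META "  # trailing space so "# METADATA ****" is not matched


def read_notebook_dependencies(source_text):
    """Return the 'dependencies' dict from the first '# META' block, or {}."""
    meta_lines = []
    in_block = False
    for line in source_text.splitlines():
        if line.startswith(META_PREFIX):
            in_block = True
            meta_lines.append(line[len(META_PREFIX):])
        elif in_block:
            break  # later '# META' blocks are assumed to be per-cell metadata
    if not meta_lines:
        return {}
    return json.loads("\n".join(meta_lines)).get("dependencies", {})


def missing_dependencies(source_text):
    """List which of the two declarations (lakehouse, environment) is absent."""
    deps = read_notebook_dependencies(source_text)
    missing = []
    if not deps.get("lakehouse", {}).get("default_lakehouse"):
        missing.append("default lakehouse")
    if not deps.get("environment", {}).get("environmentId"):
        missing.append("environment")
    return missing


example = "\n".join([
    "# Fabric notebook source",
    "",
    "# METADATA ********************",
    "",
    "# META {",
    '# META   "dependencies": {',
    '# META     "lakehouse": {"default_lakehouse": "placeholder-guid"}',
    "# META   }",
    "# META }",
])
print(missing_dependencies(example))
```

Running a check like this over every notebook in the repo turns a confusing runtime failure into an explicit pre-deployment error.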

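Parallelism needs to be handled carefully on F-capacity. Starting 14 Spark sessions in parallel quickly ran into HTTP 430 errors, which is what Fabric returns when the capacity's Spark compute or API rate limit is hit. Chaining the notebooks sequentially was more reliable.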
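A middle ground between 14-way parallelism and a fully sequential chain is a small concurrency cap plus retry with backoff when the capacity pushes back. The sketch below shows the pattern in plain Python. `CapacityThrottled` is a hypothetical exception standing in for an HTTP 430 response, and `run_notebook` is whatever callable actually starts a notebook; neither is a Fabric API.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class CapacityThrottled(Exception):
    """Hypothetical error standing in for an HTTP 430 from the capacity."""


def run_with_backoff(run_notebook, name, max_attempts=4, base_delay=1.0,
                     sleep=time.sleep):
    """Call run_notebook(name), retrying with exponential backoff when throttled."""
    for attempt in range(1, max_attempts + 1):
        try:
            return run_notebook(name)
        except CapacityThrottled:
            if attempt == max_attempts:
                raise  # give up and surface the throttling error
            sleep(base_delay * 2 ** (attempt - 1))  # 1x, 2x, 4x, ... base_delay


def run_all(run_notebook, names, max_parallel=2, **retry_options):
    """Run notebooks with at most max_parallel in flight; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        futures = [
            pool.submit(run_with_backoff, run_notebook, name, **retry_options)
            for name in names
        ]
        return [future.result() for future in futures]
```

Setting `max_parallel=1` reproduces the sequential chain, so the cap can be raised gradually until throttling starts to appear.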

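Notebook IDs are workspace-scoped. The same notebook gets a different ID in each workspace, so anything that references a notebook by ID (for example a pipeline activity) has to be re-pointed when promoting across DEV / TEST / PROD.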
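One way to think about the promotion problem is a name-keyed lookup: keep the notebook's logical name stable and swap the ID per workspace at deploy time. The sketch below shows the idea on a pipeline definition held as JSON text. It is not fabric-cicd's API; the GUIDs, the `notebookId` field name, and the function name are all made up for illustration.

```python
import json

# Hypothetical GUIDs: the same logical notebook has a different ID per workspace.
NOTEBOOK_IDS = {
    "DEV": {"silver_customers": "11111111-1111-1111-1111-111111111111"},
    "TEST": {"silver_customers": "22222222-2222-2222-2222-222222222222"},
}


def repoint_notebook_ids(pipeline_json, source_env, target_env, ids=NOTEBOOK_IDS):
    """Replace each source-workspace notebook ID with the target workspace's ID."""
    text = pipeline_json
    for name, source_id in ids[source_env].items():
        if name not in ids[target_env]:
            raise KeyError(f"{name} has no ID registered for {target_env}")
        text = text.replace(source_id, ids[target_env][name])
    return text


dev_pipeline = json.dumps({
    "activities": [
        # "notebookId" is an assumed field name for this sketch.
        {"name": "Run silver_customers",
         "notebookId": NOTEBOOK_IDS["DEV"]["silver_customers"]},
    ]
})
test_pipeline = repoint_notebook_ids(dev_pipeline, "DEV", "TEST")
```

Failing loudly on a missing mapping matters here: a pipeline that silently keeps a DEV notebook ID after promotion is harder to debug than a deployment that stops.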

Reusable helper logic is worth it. Moving common merge logic into a Python wheel made the notebooks much cleaner and easier to maintain.
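As an illustration of the kind of function that pays off in a shared wheel, here is a small sketch that builds an SCD1 upsert statement from a list of key and data columns, so each notebook only supplies table and column names. The function name is hypothetical and this is not the actual `csnp_helpers` API.

```python
def build_scd1_merge_sql(target, source, keys, columns):
    """Build a MERGE statement that overwrites matching rows and inserts new ones.

    Hypothetical helper for illustration; `keys` are the business-key columns
    and `columns` are the remaining columns to keep in sync.
    """
    data_columns = [col for col in columns if col not in keys]
    if not keys or not data_columns:
        raise ValueError("need at least one key column and one non-key column")

    on_clause = " AND ".join(f"t.{k} = s.{k}" for k in keys)
    update_clause = ", ".join(f"t.{c} = s.{c}" for c in data_columns)
    all_columns = list(keys) + data_columns
    insert_columns = ", ".join(all_columns)
    insert_values = ", ".join(f"s.{c}" for c in all_columns)

    return (
        f"MERGE INTO {target} AS t USING {source} AS s ON {on_clause} "
        f"WHEN MATCHED THEN UPDATE SET {update_clause} "
        f"WHEN NOT MATCHED THEN INSERT ({insert_columns}) VALUES ({insert_values})"
    )


sql = build_scd1_merge_sql(
    target="silver.dim_store",
    source="staged_stores",
    keys=["store_id"],
    columns=["store_name", "region"],
)
print(sql)
```

In a notebook the resulting string would be passed to `spark.sql(...)`. Keeping the builder free of Spark imports also means it can be unit-tested outside Fabric.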

Code is here:

https://github.com/bcsnpc/csnp-retail-platform

I’m sharing this mainly to get feedback from people working with Fabric / Spark / lakehouse architectures.

Would you structure anything differently, especially around notebook orchestration, CI/CD, or environment promotion?