r/databricks

Databricks Professional Dumps
▲ 33 r/databricks+2 crossposts

For anyone preparing for Databricks Professional (latest syllabus/post Nov 2025 update) — I’ve compiled topic-wise prep material with practice Q&A and structured notes.

Happy to share if it helps. DM me.

u/Lonely-Vacation-6447 — 17 hours ago

Repository organization in Databricks (Lakeflow/DLT)

Hey everyone,

Looking for some architectural advice on directory and file organization for a large-scale project. We are migrating to Databricks’ new Lakeflow Pipelines (pyspark.pipelines / dp) using a fully config-driven Medallion architecture, and we're trying to prevent our repository from becoming unmaintainable.

The Scale & Setup

  • Data Size: ~300 raw tables across 3 distinct financial data providers.
  • Architecture: Medallion (Bronze/Silver/Gold) deployed via Databricks Asset Bundles (DABs) into Unity Catalog.
  • The Pattern: We are using a config-driven approach (YAML files defining schemas/DQ rules) passed into a Python for loop that dynamically generates the dp.table and dp.view structures. We are splitting the ingestion into separate pipelines by provider to avoid driver bottlenecks.
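
One detail worth watching in a loop like this is Python's late binding of closures: functions defined directly in a `for` body all see the last value of the loop variable. Below is a minimal sketch of the usual workaround, a factory function. `register_table` and `REGISTRY` are hypothetical stand-ins for the pipeline decorator so the snippet runs without Spark, and the config shape is an assumption, not the actual YAML schema described above.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for the pipeline decorator (e.g. dp.table), so the
# sketch runs without Spark. It only records the function under a name.
REGISTRY: Dict[str, Callable[[], str]] = {}


def register_table(name: str) -> Callable[[Callable[[], str]], Callable[[], str]]:
    def decorator(fn: Callable[[], str]) -> Callable[[], str]:
        REGISTRY[name] = fn
        return fn
    return decorator


def make_bronze_table(provider: str, table: str) -> None:
    """Factory: binds provider/table per call, avoiding late-binding bugs."""
    @register_table(name=f"bronze_{provider}_{table}")
    def _bronze() -> str:
        # A real pipeline would return a DataFrame read from the raw source;
        # here we return a made-up source path to stay self-contained.
        return f"/raw/{provider}/{table}"


def generate_tables(config: Dict[str, List[str]]) -> None:
    for provider, tables in config.items():
        for table in tables:
            make_bronze_table(provider, table)


# Assumed config shape: provider -> list of raw table names.
config = {"provider_a": ["trades", "quotes"], "provider_b": ["positions"]}
generate_tables(config)
```

Each registered function keeps its own `provider` and `table`, which is the property the generic loop needs regardless of which decorator is actually used.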

The Complexity (Where it gets messy)

Bronze is a clean 1:1 loop from raw. However, when we hit Silver and Gold:

  • Many-to-Many Mappings: A single raw/bronze table often feeds into multiple business entities (e.g., one raw table splits into parts of our business objects that are created in gold).
  • Cross-Provider Joins: Gold entities require joining across the different providers to build the final target application objects.

Our Current Proposed Structure

├── config/
│   ├── provider_a.yaml      # Metadata for 80 tables
│   └── provider_b.yaml
├── src/
│   ├── pipelines/
│   │   ├── ingest_provider_a.py  # Generic loop for Bronze -> Silver
│   │   ├── ingest_provider_b.py
│   │   └── build_gold_ledger.py  # Bespoke cross-provider joins

The Questions for the Community

  1. File Granularity: For the Silver layer where a single config-driven table needs to fork into multiple complex business entities, do you isolate those transformations into bespoke Python files per entity, or handle the routing directly inside the config loop logic?
  2. Repo Organization: If you've managed 300+ tables in a declarative framework, what does your actual src/ folder look like? How do you organize the custom SQL/PySpark transformation snippets so they don't get buried in a monolithic script?
  3. Pipeline Boundaries: Databricks recommends splitting pipelines by domain to avoid high initialization times. How do you split your Python files to align cleanly with separate pipeline definitions in your DAB bundle config?
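
To make the "bespoke file per entity" option from question 1 concrete, here is a minimal sketch of a registry pattern in plain Python. Everything in it is hypothetical: the `entity` decorator, the `EntitySpec` fields, the file paths in the comments, and the table and pipeline names. It only illustrates how per-entity modules could declare their sources and owning pipeline so that the many-to-many mapping and the pipeline grouping can be derived rather than hand-maintained.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class EntitySpec:
    name: str
    sources: Tuple[str, ...]  # bronze tables the entity reads
    pipeline: str             # pipeline definition that owns the entity
    transform: Callable[[Dict[str, list]], list]


ENTITIES: Dict[str, EntitySpec] = {}


def entity(name: str, sources: List[str], pipeline: str):
    """Register a bespoke transformation; each could live in its own module."""
    def decorator(fn):
        ENTITIES[name] = EntitySpec(name, tuple(sources), pipeline, fn)
        return fn
    return decorator


# e.g. src/entities/account.py (hypothetical file)
@entity("account", sources=["bronze_a_customers"], pipeline="silver_provider_a")
def build_account(inputs):
    return [row["id"] for row in inputs["bronze_a_customers"]]


# e.g. src/entities/ledger.py (hypothetical file)
@entity("ledger", sources=["bronze_a_customers", "bronze_b_trades"], pipeline="gold_ledger")
def build_ledger(inputs):
    ids = {row["id"] for row in inputs["bronze_a_customers"]}
    return [t for t in inputs["bronze_b_trades"] if t["customer_id"] in ids]


def entities_by_pipeline() -> Dict[str, List[str]]:
    """Group entity names by the pipeline that should define them."""
    grouped = defaultdict(list)
    for spec in ENTITIES.values():
        grouped[spec.pipeline].append(spec.name)
    return dict(grouped)


def fan_out() -> Dict[str, List[str]]:
    """Invert the registry: which entities does each bronze table feed?"""
    feeds = defaultdict(list)
    for spec in ENTITIES.values():
        for src in spec.sources:
            feeds[src].append(spec.name)
    return dict(feeds)
```

In this sketch, rows are plain dicts rather than DataFrames; the point is the routing metadata, not the transformation logic.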

Would love to see examples or hear lessons learned from anyone who has tackled this scale without losing their sanity. Thanks!

u/Money-Meaning1561 — 16 hours ago

claude.md and skills

Hey, since there is something like https://github.com/databricks-solutions/ai-dev-kit, how much do you work on writing skills and on claude.md to better standardize how your team works? I am looking for some tips on this topic. How much has this helped you?

We have many ML projects where we try to standardize how we code, how we move from dev to prod, how we iterate, and how we deploy. Since everyone is using agents, it seems logical to standardize them too, so they work the same way we do.

u/ptab0211 — 16 hours ago

Alrighty data pookies, what Databricks issue keeps violating your peace?

AI Agents hallucinating? Unity Catalog acting like Unity Catalogue of Errors? Genie Spaces granting wishes to absolutely nobody?

Drop the most cursed recurring problem you face with building AI agents or ML or BI/Analytics, no matter how difficult, unhinged, or borderline impossible the solution may be. Hit me with all you've got. I am sitting this Databricks hackathon this Friday as a self-reward and I want to try something different this time.

Nothing but the pursuit of overly engineered solutions for the most trivial problems >!because I can and I like abstractions!< - but hey, it's good to be alive

u/Tiddyfucklasagna27 — 1 day ago
▲ 11 r/databricks+5 crossposts

Anyone using telemetry data in tandem with AI coding agents?

Hey folks 👋

I'm building an open-source dev tool that turns telemetry data into knowledge graphs that can be used as context in AI coding agents for debugging purposes or improving performance & costs.

Why? My intuition is threefold:

(1) coding agents are much more useful when they understand how a system actually behaves in production, not just what the repo looks like

(2) using raw telemetry data (for example traces) doesn't really work with coding agents at scale

(3) telemetry context graphs might be even cheaper and more efficient to query compared to using raw telemetry data

Before spending too much time on this & going down the rabbit hole, I'm trying to sanity-check my assumptions and assess if this is actually useful for people building/running AI systems in production. Curious to hear from software engineers that have tried something like this: what worked & what didn't, etc.

Happy to hear thoughts directly in the comments and if anyone's interested in helping out with feedback on the actual tool as I build it, please let me know and I can send more details in private - not my intention to spam anyone.

Appreciate it 🙇

u/n4r735 — 1 day ago
▲ 16 r/databricks+1 crossposts

Struggling to learn Spark UI on Databricks, all tutorials are outdated. Any good resources?

Hey everyone, I'm fairly new to Spark and trying to understand how it actually executes jobs, specifically the DAG visualization, stages, task metrics, and executor stats in the Spark UI.

The problem I'm running into: almost every video tutorial I find was recorded on an older version of Databricks, and the UI looks completely different from what I see today. The gap is big enough that I can't follow along at all.

A few specific issues I've hit:

- `spark.databricks.io.cache.enabled` throws a CONFIG_NOT_AVAILABLE error on newer runtimes

- `spark.catalog.clearCache()` throws a NOT_SUPPORTED error because I'm on Serverless compute (Community Edition)

- The Spark UI itself looks different from what tutorials show

I'm using Databricks Community Edition (free tier), which I've now learned only gives Serverless compute, so some things just aren't available.

My questions:

  1. Is there a good up-to-date resource (video, blog, or docs) for understanding the Spark UI on the current Databricks version?

  2. For learning Spark internals (DAG, stages, task metrics), is it better to just use local Spark or Google Colab instead of Databricks free tier?

  3. Any tips for following older Spark UI tutorials and mentally mapping them to the current UI?

Thanks in advance!

u/FlatTackle918 — 1 day ago

GraphRag on top of databricks

Hey there,

I am interested in real use cases, prototypes or future product enhancements related to GraphRag on top of databricks.

Looking forward to hearing from you all!

u/ubiquae — 1 day ago

How do you handle API ingestion when historical data volume varies a lot and causes OOM?

Hi everyone,

I’m currently working on ingesting historical data from an API into Databricks, and I’d like to get some opinions on the best approach.

The API data volume is quite inconsistent by date. Some days have no records at all, some days only have around 100 records, some have 50k records, and the highest I’ve seen so far is more than 2 million records in a single day.

My current approach is:

  • 1 day = 1 ingestion window
  • Run ingestion for 1 month of historical data at a time

This works fine for most dates, but the issue happens when one particular day has more than 1 million records. The job fails with an OOM error.

One idea I’m considering is to first check the record count for each day. Then, if a day has more than 1 million records, I split that particular day into smaller hourly windows instead of ingesting the whole day at once.
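
The windowing idea above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a tested ingestion pattern: it assumes the API offers a cheap way to get a per-day record count and accepts timestamp-range filters, and the `plan_windows` function and its `threshold` parameter are hypothetical names.

```python
from datetime import date, datetime, timedelta
from typing import Dict, List, Tuple

Window = Tuple[datetime, datetime]


def plan_windows(counts: Dict[date, int], threshold: int = 1_000_000) -> List[Window]:
    """Return half-open [start, end) windows: daily by default, hourly for heavy days."""
    windows: List[Window] = []
    for day in sorted(counts):
        n = counts[day]
        if n == 0:
            continue  # nothing to ingest for empty days
        start = datetime(day.year, day.month, day.day)
        end_of_day = start + timedelta(days=1)
        # Heavy days are split into 24 hourly windows; others stay whole.
        step = timedelta(hours=1) if n > threshold else timedelta(days=1)
        while start < end_of_day:
            windows.append((start, start + step))
            start += step
    return windows


# Example with made-up counts: one empty day, one light day, one heavy day.
counts = {
    date(2025, 1, 1): 0,
    date(2025, 1, 2): 50_000,
    date(2025, 1, 3): 2_000_000,
}
windows = plan_windows(counts)
```

Note that a fixed hourly split does not help if most of a heavy day's records fall into a single hour; that case would need a further split or paging within the window.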

For those who have handled similar API ingestion scenarios in Databricks, how do you usually deal with this kind of volume spike?

Would you recommend dynamic windowing like this, or is there a better pattern for handling unstable historical data volumes from APIs?

Also curious if there are any best practices around avoiding OOM in this kind of API-to-Delta ingestion pipeline.

u/xahyms10 — 1 day ago

Finally, AI Spend Controls is now available with Unity AI Gateway

Databricks launched AI Spend Controls in Unity AI Gateway, adding proactive budget alerts that let organizations set AI cost limits at the per-user, per-use-case, per-workspace, and per-account levels. The goal is to prevent runaway AI costs - like agents stuck in retry loops or accidental overnight experiments - before they show up on the bill. All spend is logged to Unity Catalog system tables for detailed analysis by user, model, provider, and team.

databricks.com
u/sai-nageshwaran — 1 day ago

Databricks DAIS 2026 Community Virtual Contest is Live! (Build on Free Edition & Win Exclusive Swag 🚀)

Hey everyone!

Databricks just launched the DAIS 2026 Community Virtual Challenge! With DAIS just around the corner, we wanted to bring that high-energy summit excitement straight to you.

Whether you're attending the summit in person or cheering from home, this is the ultimate pre-summit warm-up to show off your data skills, join the summit buzz, and score some exclusive Databricks swag!

Here is a quick breakdown of how it works:

🛠️ What You Need to Do:

  1. Create: Build any project of your choice using Databricks Free Edition. It can be a business problem, a personal passion project, or a unique data pipeline/ML model.
  2. Record: Make a short 2–5 minute demo video walking through what you built and why it matters.
  3. Post: Write a quick summary in the Community Articles section sharing your learnings (no strict code snippet requirements here, just share the experience!).
  4. Submit: Fill out the official Open Submission Form with your video and article links.

📊 How It's Judged (50 Points Total):

An elite panel of Databricks SMEs will judge submissions across 5 categories.

📅 Important Dates:

  • Opens: May 15, 2026
  • Closes: May 31, 2026

The top 5 winners get official Databricks Community Swag shipped directly to their door.

For full details, rules, and to RSVP, check out the official Databricks Community Event Page.

Good luck if you're entering! What kind of projects are you all thinking of building?

u/Subject_Ant1789 — 1 day ago

Databricks let me into their docs repo 😈 tell me what you want changed

....subject to the review process, so I can't make it all comic sans. But the smaller the change, the likelier it is to be accepted.

LMK your doc gripes

u/datasmithing_holly — 1 day ago

Databricks Genie appreciation

I've been highly critical of Genie or Databricks Assistant for quite some time now. I even have a post here criticizing it, but kudos to the DTB team! It is sooooo much better now that I don't even bother connecting Claude Code via the AI Dev Kit anymore.

u/Miraclefanboy2 — 2 days ago

Asset Bundle deployment fails but still deploys the artifacts

I've observed that the Asset Bundle deployment we run through GitHub Actions sometimes fails with "resources not found" or "tables not found" errors. Even though the CI/CD pipeline fails, the artifacts that had no issues are still deployed. How can we stop this, so that if a failure happens, none of the artifacts (declarative pipelines, jobs, notebooks, etc.) are deployed to the workspace?
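
One partial mitigation is to gate the deploy step on `databricks bundle validate`. The sketch below is only that, a gate: it does not make `deploy` atomic, and it is an assumption that validation would catch the specific "not found" errors described here. The `validate_then_deploy` helper and the injectable `runner` parameter are hypothetical names introduced for illustration.

```python
import subprocess
from typing import Callable, List, Sequence

Runner = Callable[[Sequence[str]], int]


def run_cli(cmd: Sequence[str]) -> int:
    """Run a command and return its exit code."""
    return subprocess.run(list(cmd)).returncode


def validate_then_deploy(target: str, runner: Runner = run_cli) -> bool:
    """Deploy only if validation succeeds; returns True when both steps pass."""
    steps: List[List[str]] = [
        ["databricks", "bundle", "validate", "-t", target],
        ["databricks", "bundle", "deploy", "-t", target],
    ]
    for cmd in steps:
        if runner(cmd) != 0:
            return False  # stop at the first failing step
    return True
```

In a CI job this would be called with the bundle target name, and a `False` result would fail the job before anything is written to the workspace, provided the failure surfaces at validation time.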

u/Ok-Tomorrow1482 — 1 day ago

Possible bug with MV and cluster by auto in pipeline?

Ran into a new one today and I'm curious to know if anyone else has hit this.

I've had a pipeline running for a bit. Works like a champ. Uses materialized views with cluster by auto set to true.

Today, I started getting a pipeline validation error.

> Cannot resolve the clustering column enzyme__row__id.__enzyme__row__id__table__2 in root

Genie is telling me this is probably a bug with cluster by auto.

> A Lakeflow SDP pipeline using cluster_by_auto=True and pipeline_internal.enzymeMode: Advanced fails on update because Enzyme selected its own internal struct fields as clustering columns for output materialized views.

> These fields exist in the internal _materialization_mat_487f530d..._qm_gen_info_1 table as STRUCT<__enzyme__row__id__table__1:bigint, __enzyme__row__id__table__2:bigint> at ordinal_position 0, but do NOT exist in the user-facing output MV schema.

I thought changing the pipeline and redeploying as FULL reload would potentially resolve it, but no. The cluster by auto is still on the object, so it fails to validate, so I can't publish the update. I cannot seem to manually alter the clustering because the object is managed (it's refusing to let me update it), which leaves me with the last option - dropping the impacted objects.

Anyone else run into this? It's not consistent on all of my pipelines and it's only really just started happening. My guess is the optimization processes finally rolled around to selecting these internal columns.

u/lofat — 2 days ago

What’s the most frustrating part of the table experience today?

👋 I’m a PM for tables and storage at DB. Interested in whatever you all have to share on the table maintenance / optimization experience.

u/Fun-Reference7942 — 2 days ago
▲ 1 r/databricks+1 crossposts

[BLOG + video] Snowflake and Databricks benchmarks

We put Snowflake and Databricks head-to-head across 5 scenarios.

𝗦𝗻𝗼𝘄𝗳𝗹𝗮𝗸𝗲 𝘄𝗼𝗻 𝟰 𝗼𝘂𝘁 𝗼𝗳 𝟱 𝘀𝗰𝗲𝗻𝗮𝗿𝗶𝗼𝘀:

- Sequential queries: 34% faster, 17% cheaper (at $2/credit)

- Concurrent queries: 38% faster, 39% cheaper

- Cold start: 54% faster (Databricks startup time: ~7 sec. Snowflake: sub-second. Every. Single. Time.)

- DML (delete + insert): 59% faster, 32% cheaper thanks to elite query pruning that treated 6B rows like 6M

𝗗𝗮𝘁𝗮𝗯𝗿𝗶𝗰𝗸𝘀 𝗰𝗹𝗮𝗶𝗺𝗲𝗱 𝘁𝗵𝗲 𝗼𝗻𝗲 𝘁𝗵𝗮𝘁 𝗺𝗮𝘁𝘁𝗲𝗿𝘀 𝗳𝗼𝗿 𝗱𝗮𝘁𝗮 𝗲𝗻𝗴𝗶𝗻𝗲𝗲𝗿𝘀:

CTAS (Create Table As Select): 58% faster, 71% cheaper when writing billions of rows across multiple table shapes

If your workload is heavy on dbt materializations, large table builds, or data pipeline writes, Databricks has a real edge here.

If your workload is analysts running queries, dashboards, and incremental refreshes, Snowflake Standard looks compelling, even vs. Databricks Enterprise pricing.

- Read the full methodology and results: https://select.dev/posts/snowflake-vs-databricks-showdown

- Take a look at the repo: https://github.com/get-select/snowflake-databricks-benchmark

u/SELECT_dev — 3 days ago

Databricks Apps vs Power Apps: which to use when you need a UI for business rules? (cost-sensitive)

Hi everyone,
I’m deciding between Databricks Apps and Microsoft Power Apps for a simple but important need and I’d love your experience/advice.

Context:
• I have models and data processing in Databricks (already built a demo + deployment using Databricks).
• Need a small UI so business users can edit/add rules (stored in a table) that our models use.
• I care a lot about cost. I don’t have an easy way right now to start/stop Databricks compute to save money.
• I already run regression tests and understand Databricks workflows, but I don’t fully know the real benefits of Databricks Apps vs Power Apps for this use case.

Questions:
1. For a rules-editing UI (CRUD for a rules table) what worked better for you: Databricks Apps or Power Apps?
2. Cost-wise, which is cheaper to build/run for low-traffic business users?
3. Any recommended patterns to connect a low-code UI (Power Apps) to Databricks compute securely and cheaply?
4. If you chose Databricks Apps, how do you reduce runtime cost — do you use job clusters, serverless endpoints, or something else?

I have this decision tree (WIP).

u/Gullible_Head_9464 — 2 days ago

Lakeflow Designer Updates (Default on, Custom Operators, AI Operator Search, etc.)

Hey all, sharing some of our recent updates to Lakeflow Designer:

  • Default enablement: Lakeflow Designer is now enabled by default for all free edition, premium, and enterprise tier workspaces.
  • AI semantic operator search: The operator panel now suggests operators based on intent. For example, typing “average by month” surfaces the Aggregate operator.
  • User-defined operators: You can now create user-defined operators that appear alongside built-in operators. These are custom operators that can do anything you can write in Python.
  • Configurable sample size: You can now explicitly set the number of sample rows run with each operator.
  • Git, import, clone, and export: You can now export, clone, and import visual data prep files using File > Export or File > Clone. Files can also be stored and managed in Git folders.

Link to release note -- https://docs.databricks.com/aws/en/release-notes/product/2026/may#lakeflow-designer-updates

The feature I’m personally most excited about is user-defined operators. The basic idea is that you define the fields you want to expose in the operator UI with a YAML file, then write the Python code that runs behind it. That makes it possible to extend Designer with your own custom logic, while still giving users a reusable visual operator they can drag onto the canvas.

A few examples could be:

  • An operator that writes an Excel file directly to SharePoint
  • An operator that sends an email based on the output of a workflow
  • An operator that runs your team’s custom forecasting logic
  • An operator that generates a formatted Excel or PDF report from a table
  • An operator that calls an internal API or service

The hope is that this opens up a lot of use cases that were previously either hard to do in Designer or required dropping into a bunch of custom Python. Full docs available here -- https://docs.databricks.com/aws/en/designer/user-operators

Would love to hear any thoughts or feedback if you try these out!

u/curiousbrickster — 2 days ago