
Databricks Professional Dumps
For anyone preparing for Databricks Professional (latest syllabus/post Nov 2025 update) — I’ve compiled topic-wise prep material with practice Q&A and structured notes.
Happy to share if it helps. DM me.

Hey everyone,
Looking for some architectural advice on directory and file organization for a large-scale project. We are migrating to Databricks’ new Lakeflow Pipelines (pyspark.pipelines / dp) using a fully config-driven Medallion architecture, and we're trying to prevent our repository from becoming unmaintainable.
for loop that dynamically generates the dp.table and dp.view structures. We are splitting the ingestion into separate pipelines by provider to avoid driver bottlenecks.
Bronze is a clean 1:1 loop from raw. However, when we hit Silver and Gold:
├── config/
│ ├── provider_a.yaml # Metadata for 80 tables
│ └── provider_b.yaml
├── src/
│ ├── pipelines/
│ │ ├── ingest_provider_a.py # Generic loop for Bronze -> Silver
│ │ ├── ingest_provider_b.py
│ │ └── build_gold_ledger.py # Bespoke cross-provider joins
src/ folder look like? How do you organize the custom SQL/PySpark transformation snippets so they don't get buried in a monolithic script?
Would love to see examples or hear lessons learned from anyone who has tackled this scale without losing their sanity. Thanks!
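One possible way to keep bespoke Silver/Gold logic out of the generic loop is a small registry: each custom transformation lives in its own function (and, in a real repo, its own module), and the loop looks it up by layer and table name, falling back to a passthrough. This is only a sketch under assumptions: `TRANSFORMS`, `transform`, `passthrough`, `clean_orders`, and `build_layer` are hypothetical names, and plain lists of dicts stand in for DataFrames so the example runs without Spark. In a pipeline, the loop body would instead wrap each function in whatever table decorator the pipeline library provides.

```python
# Hypothetical registry: maps (layer, table) to a custom transformation function.
TRANSFORMS = {}


def transform(layer, table):
    """Decorator that registers a custom transformation for one table in one layer."""
    def register(fn):
        TRANSFORMS[(layer, table)] = fn
        return fn
    return register


def passthrough(rows):
    """Default for tables that need no bespoke logic (the clean 1:1 case)."""
    return list(rows)


# In a real repo this could live in its own module, e.g. one file per table.
@transform("silver", "orders")
def clean_orders(rows):
    # Illustrative rule only: drop rows without an order_id.
    return [r for r in rows if r.get("order_id") is not None]


def build_layer(layer, tables, read):
    """Generic loop: apply the registered transform, or the passthrough, per table."""
    out = {}
    for table in tables:
        fn = TRANSFORMS.get((layer, table), passthrough)
        out[table] = fn(read(table))
    return out


raw = {
    "orders": [{"order_id": 1}, {"order_id": None}],
    "customers": [{"id": 7}],
}
result = build_layer("silver", ["orders", "customers"], raw.__getitem__)
```

With this layout the YAML config only has to list table names; anything without a registered function stays on the generic path, and each exception to the rule is a small, separately testable function.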
Hey, since there is something like https://github.com/databricks-solutions/ai-dev-kit, how much do you work on writing skills and on claude.md to better standardize how your team works? I am looking for some tips on this topic. How much has this helped you?
We have many ML projects where we try to standardize how we code, how we move from dev to prod, how we iterate, and how we deploy. Since everyone is using agents, it seems logical to standardize them too, so they work the same way we do.
AI Agents hallucinating? Unity Catalog acting like Unity Catalogue of Errors? Genie Spaces granting wishes to absolutely nobody?
Drop the most cursed recurring problem you face with building AI agents or ML or BI/Analytics - no matter how difficult, unhinged or borderline impossible the solution may be. Hit me with all you've got. I am sitting this Databricks hackathon this Friday as a self-reward and I want to try something different this time.
Nothing but the pursuit of overly engineered solutions for the most trivial problems >!because I can and I like abstractions!< - but hey, it's good to be alive
Hey folks 👋
I'm building an open-source dev tool that turns telemetry data into knowledge graphs that can be used as context in AI coding agents for debugging purposes or improving performance & costs.
Why? My intuition is threefold:
(1) coding agents are much more useful when they understand how a system actually behaves in production, not just what the repo looks like
(2) using raw telemetry data (for example traces) doesn't really work with coding agents at scale
(3) telemetry context graphs might be even cheaper and more efficient to query compared to using raw telemetry data
Before spending too much time on this & going down the rabbit hole, I'm trying to sanity-check my assumptions and assess if this is actually useful for people building/running AI systems in production. Curious to hear from software engineers that have tried something like this: what worked & what didn't, etc.
Happy to hear thoughts directly in the comments and if anyone's interested in helping out with feedback on the actual tool as I build it, please let me know and I can send more details in private - not my intention to spam anyone.
Appreciate it 🙇
Hey everyone, I'm fairly new to Spark and trying to understand how it actually executes jobs, specifically the DAG visualization, stages, task metrics, and executor stats in the Spark UI.
The problem I'm running into: almost every video tutorial I find was recorded on an older version of Databricks, and the UI looks completely different from what I see today. The gap is big enough that I can't follow along at all.
A few specific issues I've hit:
- `spark.databricks.io.cache.enabled` throws a CONFIG_NOT_AVAILABLE error on newer runtimes
- `spark.catalog.clearCache()` throws a NOT_SUPPORTED error because I'm on Serverless compute (Community Edition)
- The Spark UI itself looks different from what tutorials show
I'm using Databricks Community Edition (free tier), which I've now learned only gives Serverless compute so some things just aren't available.
My questions:
Is there a good up-to-date resource (video, blog, or docs) for understanding the Spark UI on the current Databricks version?
For learning Spark internals (DAG, stages, task metrics), is it better to just use local Spark or Google Colab instead of Databricks free tier?
Any tips for following older Spark UI tutorials and mentally mapping them to the current UI?
Thanks in advance!
Hey there,
I am interested in real use cases, prototypes or future product enhancements related to GraphRAG on top of Databricks.
Looking forward to hearing from you all!
Hi everyone,
I’m currently working on ingesting historical data from an API into Databricks, and I’d like to get some opinions on the best approach.
The API data volume is quite inconsistent by date. Some days have no records at all, some days only have around 100 records, some have 50k records, and the highest I’ve seen so far is more than 2 million records in a single day.
My current approach is:
1 day = 1 ingestion window
Run ingestion for 1 month of historical data at a time
This works fine for most dates, but the issue happens when one particular day has more than 1 million records. The job fails with an OOM error.
One idea I’m considering is to first check the record count for each day. Then, if a day has more than 1 million records, I split that particular day into smaller hourly windows instead of ingesting the whole day at once.
For those who have handled similar API ingestion scenarios in Databricks, how do you usually deal with this kind of volume spike?
Would you recommend dynamic windowing like this, or is there a better pattern for handling unstable historical data volumes from APIs?
Also curious if there are any best practices around avoiding OOM in this kind of API-to-Delta ingestion pipeline.
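The dynamic windowing idea described above can be sketched as a small planning step that runs before any data is pulled. This is a minimal sketch, not a definitive implementation: `plan_windows` and `max_records_per_window` are hypothetical names, and it assumes the per-day record counts have already been fetched from the API (how to get them depends on the API and is not shown).

```python
from datetime import date, datetime, timedelta


def plan_windows(day_counts, max_records_per_window=1_000_000):
    """Turn per-day record counts into a list of (start, end) ingestion windows.

    Days with no records are skipped, days at or below the threshold stay as a
    single daily window, and days above it are split into 24 hourly windows.
    """
    windows = []
    for day in sorted(day_counts):
        count = day_counts[day]
        if count == 0:
            continue  # nothing to ingest for this day
        start = datetime(day.year, day.month, day.day)
        if count <= max_records_per_window:
            windows.append((start, start + timedelta(days=1)))
        else:
            for hour in range(24):
                w_start = start + timedelta(hours=hour)
                windows.append((w_start, w_start + timedelta(hours=1)))
    return windows


counts = {
    date(2024, 1, 1): 0,          # empty day
    date(2024, 1, 2): 100,        # small day
    date(2024, 1, 3): 2_000_000,  # spike day
}
windows = plan_windows(counts)
```

Each window can then be fetched and appended to the Delta table on its own, so the peak memory use is bounded by the largest single window rather than the largest day. If one hour can itself exceed the threshold, the same split could be applied again at a finer grain.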
Databricks launched AI Spend Controls in Unity AI Gateway, adding proactive budget alerts that let organizations set AI cost limits at the per-user, per-use-case, per-workspace, and per-account levels. The goal is to prevent runaway AI costs - like agents stuck in retry loops or accidental overnight experiments - before they show up on the bill. All spend is logged to Unity Catalog system tables for detailed analysis by user, model, provider, and team.
Hey everyone!
Databricks just launched the DAIS 2026 Community Virtual Challenge! With DAIS just around the corner, we wanted to bring that high-energy summit excitement straight to you.
Whether you're attending the summit in person or cheering from home, this is the ultimate pre-summit warm-up to show off your data skills, join the summit buzz, and score some exclusive Databricks swag!
Here is a quick breakdown of how it works:
An elite panel of Databricks SMEs will judge submissions across 5 categories.
The top 5 winners get official Databricks Community Swag shipped directly to their door.
For full details, rules, and to RSVP, check out the official Databricks Community Event Page.
Good luck if you're entering! What kind of projects are you all thinking of building?
....subject to the review process, so I can't make it all comic sans. But the smaller the change, the likelier it is to be accepted.
LMK your doc gripes
I've been highly critical of Genie or Databricks Assistant for quite some time now. I even have a post here criticizing it, but kudos to the DTB team! It is sooooo much better now that I don't even bother connecting Claude Code via dev ai kit anymore.
I've noticed that our asset bundle deployment, which runs through GitHub Actions, sometimes fails with "resources not found" or "tables not found" errors. Even though the CI/CD pipeline fails, the artifacts that had no issues still get deployed. How can we make the deployment all-or-nothing, so that if a failure happens, none of the artifacts (declarative pipelines, jobs, notebooks, etc.) are deployed to the workspace?
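One partial mitigation is to gate the deploy behind a validation step, so the deploy command never runs if an earlier check fails. The sketch below assumes that `databricks bundle validate` catches the error in question, which may not hold for tables that are only resolved at deploy or run time. `run_gated` and `bundle_steps` are hypothetical helper names, and `dev` is a placeholder target; only the two `databricks bundle` commands come from the Databricks CLI.

```python
import subprocess


def run_gated(steps, runner=None):
    """Run each command in order and stop at the first non-zero exit code.

    Returns (ok, completed), where completed lists the commands that succeeded.
    """
    if runner is None:
        # Default runner: execute the command and report its exit code.
        runner = lambda cmd: subprocess.run(cmd).returncode
    completed = []
    for cmd in steps:
        if runner(cmd) != 0:
            return False, completed
        completed.append(cmd)
    return True, completed


def bundle_steps(target):
    """Validate first, deploy only afterwards."""
    return [
        ["databricks", "bundle", "validate", "-t", target],
        ["databricks", "bundle", "deploy", "-t", target],
    ]


# Dry run with a stand-in runner that simulates a failing validation.
calls = []


def fake_runner(cmd):
    calls.append(cmd)
    return 1 if "validate" in cmd else 0


ok, completed = run_gated(bundle_steps("dev"), runner=fake_runner)
```

In the dry run, the simulated validation failure stops the sequence before the deploy command is ever invoked. This does not make the deploy itself transactional; it only moves as many failures as possible ahead of it.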
Ran into a new one today and I'm curious to know if anyone else has hit this.
I've had a pipeline running for a bit. Works like a champ. Uses materialized views with cluster by auto set to true.
Today, I started getting a pipeline validation error.
> Cannot resolve the clustering column enzyme__row__id.__enzyme__row__id__table__2 in root
Genie is telling me this is probably a bug with cluster by auto.
> A Lakeflow SDP pipeline using cluster_by_auto=True and pipeline_internal.enzymeMode: Advanced fails on update because Enzyme selected its own internal struct fields as clustering columns for output materialized views.
> These fields exist in the internal _materialization_mat_487f530d..._qm_gen_info_1 table as STRUCT<__enzyme__row__id__table__1:bigint, __enzyme__row__id__table__2:bigint> at ordinal_position 0, but do NOT exist in the user-facing output MV schema.
I thought changing the pipeline and redeploying as FULL reload would potentially resolve it, but no. The cluster by auto is still on the object, so it fails to validate, so I can't publish the update. I cannot seem to manually alter the clustering because the object is managed (it's refusing to let me update it), which leaves me with the last option - dropping the impacted objects.
Anyone else run into this? It's not consistent on all of my pipelines and it's only really just started happening. My guess is the optimization processes finally rolled around to selecting these internal columns.
👋 I’m a PM for tables and storage at DB. Interested in whatever you all have to share on the table maintenance / optimization experience.
We put Snowflake and Databricks head-to-head across 5 scenarios.
Snowflake won 4 out of 5 scenarios:
- Sequential queries: 34% faster, 17% cheaper (at $2/credit)
- Concurrent queries: 38% faster, 39% cheaper
- Cold start: 54% faster (Databricks startup time: ~7 sec. Snowflake: sub-second. Every. Single. Time.)
- DML (delete + insert): 59% faster, 32% cheaper thanks to elite query pruning that treated 6B rows like 6M
Databricks claimed the one that matters for data engineers:
- CTAS (Create Table As Select): 58% faster, 71% cheaper when writing billions of rows across multiple table shapes
If your workload is heavy on dbt materializations, large table builds, or data pipeline writes, Databricks has a real edge here.
If your workload is analysts running queries, dashboards, and incremental refreshes, Snowflake Standard looks compelling, even vs. Databricks Enterprise pricing.
- Read the full methodology and results: https://select.dev/posts/snowflake-vs-databricks-showdown
- Take a look at the repo: https://github.com/get-select/snowflake-databricks-benchmark
Hi everyone,
I’m deciding between Databricks Apps and Microsoft Power Apps for a simple but important need and I’d love your experience/advice.
Context:
• I have models and data processing in Databricks (already built a demo + deployment using Databricks).
• Need a small UI so business users can edit/add rules (stored in a table) that our models use.
• I care a lot about cost. I don’t have an easy way right now to start/stop Databricks compute to save money.
• I already run regression tests and understand Databricks workflows, but I don’t fully know the real benefits of Databricks Apps vs Power Apps for this use case.
Questions:
1. For a rules-editing UI (CRUD for a rules table) what worked better for you: Databricks Apps or Power Apps?
2. Cost-wise, which is cheaper to build/run for low-traffic business users?
3. Any recommended patterns to connect a low-code UI (Power Apps) to Databricks compute securely and cheaply?
4. If you chose Databricks Apps, how do you reduce runtime cost — do you use job clusters, serverless endpoints, or something else?
I have this decision tree (WIP).
Hey all, sharing some of our recent updates to Lakeflow Designer:
Link to release note -- https://docs.databricks.com/aws/en/release-notes/product/2026/may#lakeflow-designer-updates
The feature I’m personally most excited about is user-defined operators. The basic idea is that you define the fields you want to expose in the operator UI with a YAML file, then write the Python code that runs behind it. That makes it possible to extend Designer with your own custom logic, while still giving users a reusable visual operator they can drag onto the canvas.
A few examples could be:
The hope is that this opens up a lot of use cases that were previously either hard to do in Designer or required dropping into a bunch of custom Python. Full docs available here -- https://docs.databricks.com/aws/en/designer/user-operators
Would love to hear any thoughts or feedback if you try these out!