r/databricks

▲ 35 r/databricks+1 crossposts

I watched 4 hours of Databricks Data + AI Summit 2026 so you don't have to.

My first major project as a Senior Data Engineer was migrating a decade-old time-series database for a semiconductor company to the cloud. The constraint: sub-second latency on customer queries. Equipment monitoring and predictive maintenance don't work with slow data.

We had Delta Lake for storage, but it couldn't guarantee the query performance we needed.
At the time, Databricks serverless warehouse did not exist.
So we built an additional layer on top: Azure Data Explorer (ADX). The data pipeline became: ingest source data, move to Delta Lake, replicate to ADX, serve queries from ADX.

It worked. Customers got their sub-second latency. But we'd introduced yet another system to maintain, another cost line, another place for things to fail. It was the price of solving the problem at that time.

This past month at Data + AI Summit 2026, Databricks announced Reyden.

A new query engine. Millisecond performance. Massive concurrency. Running directly on your lakehouse. No separate system. No copy. If production matches the demo, a lot of horizontal architectures will collapse into one component. One lake. One source of truth.

That's why I'm watching this closely. They looked at a niche problem I lived through and built a real solution.

Here are the 3 things from the summit that actually matter for data engineers:

Reyden: Millisecond queries on your lakehouse (no more separate real-time database)
Genie Zero Ops: Automated pipeline repair that tests fixes before you see them
Genie Ontology: AI that understands your business through a permission-aware knowledge graph

Did you watch the recent event? What do you think is the next big Databricks feature to look out for?

reddit.com

u/RevolutionShoddy6522 — 15 hours ago

▲ 2 r/databricks

Getting 'Unauthorized Access' when serverless compute is trying to read S3 bucket data

I'm getting 'Unauthorized Access' when serverless compute tries to read S3 bucket data. The same read works fine with normal compute. The storage credential, external location, IAM policy, and serverless egress rules all seem to be correct, because normal compute is working. I am on a Premium plan and have free credits for now. Do we need the Enterprise plan for this, or is it something else?

This is the error I get when trying to list the S3 bucket items on serverless compute:

ExecutionError: [UNAUTHORIZED_ACCESS] Unauthorized access: s3://<raw-data-bucket>/<partition-path>: getFileStatus on s3://<raw-data-bucket>/<partition-path>:
shaded.databricks.awssdk.com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden;
request: HEAD https://<raw-data-bucket>.s3-us-west-2.amazonaws.com/<partition-path> {}
Hadoop 3.4.2, aws-sdk-java/1.12.681 Linux/6.1.174-217.345.amzn2023.x86_64 OpenJDK_64-Bit_Server_VM/17.0.18+8-LTS java/17.0.18 scala/2.13.16 kotlin/1.9.10 vendor/Azul_Systems,_Inc. cfg/retry-mode/legacy
shaded.databricks.awssdk.com.amazonaws.services.s3.model.GetObjectMetadataRequest;
Request ID: null, Extended Request ID: null, Cloud Provider: AWS, Instance ID: unknown
credentials-provider: shaded.databricks.awssdk.com.amazonaws.auth.BasicSessionCredentials
credential-header: AWS4-HMAC-SHA256 Credential=<redacted-access-key-id>/<date>/us-west-2/s3/aws4_request
signature-present: true
(Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden; Request ID: null; S3 Extended Request ID: null; Proxy: <redacted>),
S3 Extended Request ID: null: 403 Forbidden

SQLSTATE: 42501

Caused by: com.databricks.sql.io.CloudAccessDeniedException: getFileStatus / listStatus 403 on same path as above.

[Trace ID: <redacted>]

u/Leather-Gene4160 — 14 hours ago

▲ 3 r/databricks

Big Data project ideas and tutorials?

Hello, I am trying to learn more about Big Data. So far I have only worked with dataframes of a few thousand rows. I know a person in the industry, and he told me he works with terabytes of data. I am aware there are big datasets online, on Kaggle I believe. But since I am still learning, I was hoping for a more hand-holding project to learn from, and then apply that separately to something of my own.

I searched for this question but am still lost: post 1, post 2, post 3, etc.

u/Agitated_Daikon001 — 11 hours ago

▲ 11 r/databricks

Comprehensive video on key announcements at the Databricks Summit 2026

Hi all, this is my final deliverable from the recently concluded Databricks Data + AI Summit 2026. In it, I have recorded all the key highlights and my learnings from the event. It is based on the blog post I published a few days ago. I hope you find the video informative: Key Highlights from DAIS '26 podcast

u/Safe-Dirt-8209 — 19 hours ago

▲ 5 r/databricks+2 crossposts

Graphical Version: Rethinking Database Storage: From Monolith to Lakebase and LTAP by Reynold Xin

Using NotebookLM, I turned Reynold Xin's blog into a nice deck. I hope this graphical version is more consumable to some of the folks who prefer to read infographics like myself. Enjoy!

https://medium.com/@jasonyip_77999/rethinking-database-storage-from-monolith-to-lakebase-and-ltap-by-reynold-xin-graphical-version-0d362e382142

Original blog:

https://www.databricks.com/blog/lakebase-ltap-rethinking-database-storage

u/CelebrationSea9296 — 21 hours ago

▲ 2 r/databricks

How to choose compute

I'm new to Databricks and am building a data pipeline. So far I've always used serverless compute. Is that the best approach? One step in my pipeline involves parsing a 30 GB XML file, which takes 1.5 hours on serverless compute. Should I consider using different compute? If so, what should I pick?

u/vroemboem — 1 day ago

▲ 46 r/databricks

I have just passed the Databricks Certified Data Engineer Associate exam

https://preview.redd.it/ydjoa9uke4bh1.png?width=1107&format=png&auto=webp&s=7e8a9e3f9f667efabcf249a78daf54ebeb05a92c

I didn't follow any course or watch any videos. I did the following:

I had no prior experience with Databricks or in the data engineering field. Zero experience.

- Take 5 official example questions from Databricks
- Ask an AI to read the most up-to-date documentation and put all the information into several .md files
- Then ask it again, this time to generate questions based on those .md files

I made a total of 1,000 questions divided across the 7 sections.

I studied for just a week, investing around 7 hours per day.

Whoever wants this dataset of questions, dm me.

u/DemonValac — 2 days ago

▲ 6 r/databricks+1 crossposts

Dataset versioning

Hey folks,

How are y'all managing dataset versions? Does Unity Catalog have this feature, or are you using a third-party tool? I am looking for something that keeps track of data changes: when a dataset was last updated, what was updated, etc.

u/Severe-Committee87 — 1 day ago

▲ 12 r/databricks

Databricks Genie vs Claude Code vs OpenAI Codex

In terms of value, which of these coding plans offers the most bang for the buck?

Genie is currently free; I'm talking about after July 6, when it becomes paid.

u/vroemboem — 3 days ago

▲ 5 r/databricks

Where should I set up guardrails for a Genie Space?

I constantly run into an issue where a Genie Space analyzes a question, decides that it cannot be answered with the current data model, but still eventually gives a made-up answer (like the closest SQL query it can think of). I've written an instruction that tells it to refuse to answer in such cases, but half the time it just ignores the instruction and continues with the wrong data. Has anyone encountered the same problem? I would love to hear any solution to this.

u/imnessal — 3 days ago

▲ 10 r/databricks

Unity AI Gateway - expected public preview timeline?

Hi All,

Unity AI Gateway has many features I want to introduce to my company. The challenge is that it is still in "Beta" which means that even if I test it and it looks promising, introducing it even in dev will be an issue bc of pushback from Security teams.

Is there an expected timeline for Unity AI Gateway to move out of Beta? Not sure if I missed this announcement at DAIS 2026. Hoping that this offering is available in public preview soon.

EDIT: I'd also be curious to hear from those who have Unity AI Gateway enabled whether it is costlier than Microsoft AI Foundry / other Agent Registry Frameworks / AI Governance tools.

u/RazzmatazzLiving1323 — 3 days ago

▲ 74 r/databricks+1 crossposts

Data Quality pattern I landed on using dbt + DQX

dbt tests are a great CI gate, but they run after the model builds and can only detect problems: by the time one fails, the bad data is already in your table. For the "keep the good rows, isolate the bad ones, keep going" pattern, you need row-level DQ that runs in transit. I adopted DQX because I already use it with other Spark workloads.

The thing I had to unlearn is that you do not rewrite your dbt SQL gold model as Python. The transformation logic stays in a normal `.sql` model, you just add a thin Python model beside it that dbt.ref()s it and applies DQX. The Python model (orders_gold_dq) becomes the published Gold table; the SQL model (orders_gold) becomes an internal intermediate. Your downstream consumers point to orders_gold_dq, not orders_gold.

-- models/orders_gold.sql
select order_id, customer_id, amount, status from {{ ref('orders_silver') }}

Thin DQ Layer:

# models/orders_gold_dq.py
from databricks.labs.dqx.engine import DQEngine
from databricks.sdk import WorkspaceClient


def model(dbt, session):
    # Read the SQL model; the transformation logic stays in orders_gold.sql.
    df = dbt.ref("orders_gold")

    # Row-level checks, each tagged with a criticality of "error" or "warn".
    checks = [
        {"criticality": "error", "check": {"function": "is_not_null",
            "arguments": {"column": "order_id"}}},
        {"criticality": "warn", "check": {"function": "is_in_list",
            "arguments": {"column": "status", "allowed": ["new", "paid", "shipped"]}}},
        {"criticality": "error", "check": {"function": "is_unique",
            "arguments": {"columns": ["order_id"]}}},
    ]

    dq = DQEngine(WorkspaceClient())
    valid_df, quarantine_df = dq.apply_checks_by_metadata_and_split(df, checks)
    # quarantine_df is not persisted in this snippet.
    return valid_df  # clean rows become the published table

If your transformation is already a Python model (e.g. complex PySpark logic), you don't need the extra _dq.py layer at all; just embed DQX directly inside that model before the return.

error rows → quarantine only, never written to the clean table; warn rows → stay in clean output with _warnings metadata, not quarantined. Each quarantined row carries _errors/_warnings with the rule + a readable message.
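The routing rule above can be illustrated without Spark or DQX. This is a minimal plain-Python sketch of the error/warn behaviour as described in this post, not DQX's implementation; `split_rows`, `not_null_order_id`, and `status_in_list` are hypothetical helpers made up for the illustration.

```python
from typing import Callable, Optional

Row = dict
# A check function returns None when the row passes, else a readable message.
CheckFn = Callable[[Row], Optional[str]]


def not_null_order_id(row: Row) -> Optional[str]:
    return None if row.get("order_id") is not None else "order_id is null"


def status_in_list(row: Row) -> Optional[str]:
    allowed = {"new", "paid", "shipped"}
    status = row.get("status")
    return None if status in allowed else f"status {status!r} not allowed"


def split_rows(rows, checks):
    """Route rows: any failed 'error' check quarantines the row; 'warn' only annotates it."""
    valid, quarantine = [], []
    for row in rows:
        errors, warnings = [], []
        for criticality, check in checks:
            message = check(row)
            if message is not None:
                (errors if criticality == "error" else warnings).append(message)
        annotated = {**row, "_errors": errors, "_warnings": warnings}
        (quarantine if errors else valid).append(annotated)
    return valid, quarantine


# Hypothetical sample data for the illustration.
CHECKS = [("error", not_null_order_id), ("warn", status_in_list)]
ROWS = [
    {"order_id": 1, "status": "paid"},
    {"order_id": None, "status": "new"},
    {"order_id": 3, "status": "refunded"},
]
valid_rows, quarantined_rows = split_rows(ROWS, CHECKS)
```

Here the row with a null order_id lands only in the quarantine list, while the row with the unexpected status stays in the clean output and carries a warning message.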

Wire DQX in as a serverless dependency in dbt_project.yml (+submission_method: serverless_cluster, +environment_dependencies: [databricks-labs-dqx]), then just run dbt run or use your preferred scheduling pattern. I scheduled this on Databricks Lakeflow with a dbt task, as seen in the picture above.

u/zr-brickster — 4 days ago

▲ 51 r/databricks

Timeseries on Databricks Lakebase

Introducing LakeTS: time-series capabilities, now native to Lakebase.

Timeseries has usually meant standing up a separate database. LakeTS changes that - a pure-SQL toolkit that brings full time-series power to the Databricks Data Intelligence Platform: hot on Lakebase, governed by Unity Catalog on Cold Layer, zero extensions.

What's inside:

-> ChronoTables - time-partitioned tables with pre-created chunks, BRIN indexes, and millisecond chunk drops

-> Native SQL functions - time_bucket, locf, gapfill, rate and more in pure PL/pgSQL

-> Incremental RollUps - watermarked, cascading hierarchical aggregates

-> Last Value Cache - sub-10ms reads on the latest value per key

-> Hot/cold tiering - Lakebase CDF streams older data into a Unity Catalog Managed Table for cheap, long-horizon retention

-> SQL-native alerts + bulk ingest from edge devices
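
For readers who haven't used functions with these names before, here is a rough plain-Python sketch of what time bucketing and last-observation-carried-forward (LOCF) gap filling usually mean. It is not LakeTS code: the names echo the list above, but the signatures, the epoch alignment, and the averaging per bucket are assumptions made for the illustration.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def time_bucket(width: timedelta, ts: datetime) -> datetime:
    """Floor ts to the start of its width-sized bucket (aligned to the Unix epoch)."""
    return ts - ((ts - EPOCH) % width)


def gapfill_locf(points, width, start, end):
    """Average readings per bucket over [start, end); empty buckets reuse the last value."""
    sums = {}
    for ts, value in points:
        bucket = time_bucket(width, ts)
        total, count = sums.get(bucket, (0.0, 0))
        sums[bucket] = (total + value, count + 1)

    filled, last = [], None
    bucket = time_bucket(width, start)
    while bucket < end:
        if bucket in sums:
            total, count = sums[bucket]
            last = total / count
        filled.append((bucket, last))  # stays None until the first reading is seen
        bucket += width
    return filled


# Hypothetical sensor readings with a two-minute gap.
readings = [
    (datetime(2026, 1, 1, 0, 0, 10), 20.0),
    (datetime(2026, 1, 1, 0, 0, 50), 22.0),
    (datetime(2026, 1, 1, 0, 3, 5), 30.0),
]
series = gapfill_locf(
    readings,
    timedelta(minutes=1),
    datetime(2026, 1, 1, 0, 0),
    datetime(2026, 1, 1, 0, 4),
)
```

The two empty one-minute buckets between the readings are filled with the last computed average instead of being dropped, which is what makes the series safe to plot or join on a fixed time grid.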

u/Feisty-Angle-4210 — 4 days ago

▲ 75 r/databricks+10 crossposts

We open-sourced a graph-free multi-hop RAG framework — matches Graph-RAG accuracy without the rebuild cost (Apache-2.0)

We just open-sourced MOTHRAG - a multi-hop RAG framework that skips the knowledge graph entirely.

The problem we kept running into: the accurate multi-hop systems (GraphRAG, HippoRAG, RAPTOR) all build a graph offline, and every time the data changes you rebuild it. For a corpus that updates often, that's a constant re-indexing bill.

MOTHRAG uses a graph-free dense index with query-time orchestration instead: no graph, no GPU, and every component behind a commodity API. On multi-hop benchmarks it matches the graph-based systems, and updates are just embed-and-append instead of a full rebuild.

Benchmark	MOTHRAG (ours)	GraphRAG	HippoRAG	RAPTOR
HotpotQA	78.1	68.6	75.5	69.5
2WikiMultiHop	76.3	58.6	71.0	52.1
MuSiQue	50.5	38.5	48.6	28.9

Apache-2.0, pip install + API keys to run. Honest weak spot that we have right now: recall bottlenecks on MuSiQue, still working on that one tho. Repo in the comments.
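
To make the "embed-and-append" idea concrete, here is a toy plain-Python sketch. It is not MOTHRAG's code or API: `toy_embed` is a bag-of-words stand-in for a real embedding call, `AppendOnlyIndex` is a hypothetical class, and the naive `two_hop` loop only gestures at query-time orchestration, which the post does not describe in detail.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in for an embedding API call: a bag-of-words vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class AppendOnlyIndex:
    def __init__(self):
        self.docs = []
        self.vectors = []

    def add(self, doc: str) -> None:
        # An update embeds only the new document and appends it; nothing is rebuilt.
        self.docs.append(doc)
        self.vectors.append(toy_embed(doc))

    def search(self, query: str, exclude=()) -> str:
        q = toy_embed(query)
        scored = [(cosine(q, v), d) for d, v in zip(self.docs, self.vectors) if d not in exclude]
        return max(scored)[1]


def two_hop(index: AppendOnlyIndex, question: str) -> list:
    """Retrieve once, then retrieve again with the first hit appended to the query."""
    first = index.search(question)
    second = index.search(question + " " + first, exclude={first})
    return [first, second]


# Hypothetical three-document corpus.
index = AppendOnlyIndex()
for doc in ["alice works at acme", "acme is headquartered in berlin", "bob likes tennis"]:
    index.add(doc)

hops = two_hop(index, "where is the company alice works at headquartered")
index.add("carol works at initech")  # a later corpus update is a single append
```

The second hop only finds the Berlin document because the first hit contributed the word "acme" to the query, which is the multi-hop behaviour a graph would otherwise encode offline.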

Would love feedback from anyone running RAG on changing data in production!

u/Annual-Commercial563 — 4 days ago

▲ 76 r/databricks

Lakehouse//RT is faster than the FLASH ⚡

🛑 What's Lakehouse//RT?
Lakehouse Real-Time is a serverless compute option built for low-latency, high-concurrency use cases. It offers sub-second latency on SQL read queries against your Unity Catalog tables that use Delta Lake or Apache Iceberg formats in cloud storage.

🛑 How can I spin up Lakehouse//RT compute?
You create and manage Lakehouse//RT much like you do other SQL warehouses.

🛑 What's Reyden?
It's the name of the engine powering Lakehouse//RT.

Learn more: https://docs.databricks.com/aws/en/compute/sql-warehouse/real-time

u/Youssef_Mrini — 5 days ago

▲ 50 r/databricks

BI platforms ranking

Not off the charts like in AI platforms, but in BI, Databricks is included for the first time and is already second among the Visionaries. #databricks

u/hubert-dudek — 4 days ago

▲ 14 r/databricks

Data Architect - thoughts from my first Data + AI Summit

Finally had some time to gather my thoughts after Data + AI Summit.

Overall, really well produced event. Databricks knows how to put on a conference. But if I’m being honest, the AI stuff started to blur together pretty fast.

Also had very practical conversations about the less sexy problems: cost, reliability, orchestration, observability, governance, etc.

A few things I kept hearing:

Costs are becoming impossible to ignore. Not just for finance teams. Data teams are being forced to care because “we’ll optimize it later” apparently does not count as a strategy.

Favorite booths and companies I’ll be taking demos with:

Astronomer

Good booth. Clear message around orchestration, and Airflow is obviously still not dead, despite how many times people have tried to kill it. They did seem kind of annoyed when I asked if I could recreate the kiss cam photo with my CEO though lol.

Monte Carlo

Data reliability is one of those things everyone forgets about until the dashboard is wrong and you have a major meeting in an hour.

Zipher

Favorite booth for me, and one of the busier ones from what I saw. They’re focused on optimizing Databricks workloads around cost, performance, and reliability. The platform learns workload behavior and adjusts infrastructure automatically. They claim up to 60% savings on Databricks + cloud costs.

Anyone have any other takeaways from this conference or opinions on the vendors?

u/This-You-2737 — 4 days ago

▲ 7 r/databricks

Customer Lake and Zero Ops

Be honest please... Are these actually just vibe-coded projects that were created a few weeks before the keynote because you were afraid cool stuff like Reyden was too technical and you needed some simpler things to present?

Customer Lake looks pretty cool for our salespeople, but my account team isn't signing us up, and usually private previews aren't a problem to push some paperwork through for.

u/Odd-Government8896 — 4 days ago

▲ 6 r/databricks

Differences: Databricks as part of SAP BDC vs Databricks proper

Hi everyone!

My company is planning to move to SAP S/4HANA. We're currently using MS Fabric but plan to move to Databricks.

Does it make a difference in terms of functionality if we get Databricks through SAP Business Data Cloud vs Databricks proper?

I am wondering if the version we get through SAP is full-blown Databricks or if there are limitations?

Thanks

u/Fun-Highlight1735 — 4 days ago

▲ 3 r/databricks

Do Databricks partners need to pay for a Databricks account?

Hi guys, our company is new to Databricks and we want to become a Marketplace provider, so we have become a Databricks partner.
Now that we want to develop the app/accelerator that we will put on Databricks Marketplace, do we need to get a paid Databricks account, or does Databricks provide one for free to its partner companies?
We already have a free tier account, but I don't think it will be possible to develop apps on it and use the free account to deploy an app to Marketplace.

Sorry if this is a stupid question, but we are still trying to figure out how things work here.

u/CaptainHawk786 — 4 days ago