
The LakeStream as the Convergence of Open Table Formats & Kafka (featuring Ursa)
In the span of two weeks, I had two different podcast guests call Kafka the TCP/IP of messaging and Iceberg the TCP/IP of tables. The idea is that, for all their imperfections, these systems have gathered a large enough network effect and ecosystem build-out that they are simply the easiest and most straightforward thing to adopt when it comes to sharing data (i.e. sharing messages or tables). It’s a coincidence, but I think there is truth there.
In this context, I’m excited to see deeper integration between Kafka and open table formats. I think it makes sense. I was excited when Bufstream (now defunct) came out, mainly because of the first-class schema integration/enforcement and the zero-copy Iceberg sink that it easily enables.
The most recent entry in this area has been Ursa-for-Kafka by StreamNative (the Pulsar guys who have pivoted to Kafka too). Ursa-for-Kafka (UFK) is a new proprietary Kafka fork (to be open sourced soon) that makes a few interesting architectural choices:
- adds an additional storage layer for “Ursa topics” (their name for diskless topics, backed by their Ursa storage engine). The Ursa storage layer persists topics in a columnar open table format
- supports different topic types inside the same cluster (fast, classic topics & diskless)
- is a minimally-invasive fork, which means the regular Kafka classic topic path + tiered storage remain the same. It also means there’s full API support since it’s literally the real Kafka
It’s conceptually similar to Aiven’s Inkless, but seemingly with better open table format support and subtle differences in the diskless architecture: Inkless uses Postgres while Ursa uses Oxia (a project I found interesting in and of itself), and Ursa has separate compaction workers. The great thing these two projects have (alongside Redpanda nowadays) is their different topic profiles - the ability to have a classic, low-latency topic and a cheap diskless topic inside the same cluster serving different workloads.
All else equal, Ursa ought to be a tad more mature because the engine had a year or two head start on Inkless.
The write path works like any other diskless Kafka. As a reminder, in diskless/leaderless Kafka implementations, brokers batch data from many partitions and periodically (e.g. every 250 ms) persist a single file with multi-partition data to S3, alongside each partition’s record coordinates in a metadata store (Oxia here). After a while, these files get “compacted” into read-optimized single-partition files (very similar to Kafka’s regular segment files).
In the case of Ursa (and Bufstream, previously), the data is instead compacted into per-partition Parquet files and committed to an Iceberg table.
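To make the write path and the compaction step concrete, here is a toy, in-memory sketch. It is not Ursa's (or any vendor's) actual code: `DisklessBroker`, `Coordinate`, the `wal/` key prefix and the method names are all made up for illustration, a plain dict stands in for S3, another dict stands in for the metadata store, and a dict of column lists stands in for a Parquet file (a real system would also commit the result to an Iceberg table).

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Coordinate:
    """Where one partition's batch lives inside a shared object."""
    object_key: str
    start: int        # index of the batch's first record inside the object
    count: int        # number of records in the batch
    base_offset: int  # partition offset of the batch's first record


@dataclass
class DisklessBroker:
    # Assumption: plain dicts stand in for S3 and for the metadata store.
    object_store: dict = field(default_factory=dict)
    metadata: dict = field(default_factory=lambda: defaultdict(list))
    buffer: dict = field(default_factory=lambda: defaultdict(list))
    next_offset: dict = field(default_factory=lambda: defaultdict(int))
    flushes: int = 0

    def produce(self, partition: str, value: dict) -> None:
        # Records from every partition accumulate in one shared buffer.
        self.buffer[partition].append(value)

    def flush(self) -> None:
        """Persist ONE mixed object for all buffered partitions."""
        if not self.buffer:
            return
        key = f"wal/{self.flushes:08d}"
        records, coords = [], {}
        for partition, values in self.buffer.items():
            base = self.next_offset[partition]
            coords[partition] = Coordinate(key, len(records), len(values), base)
            records.extend(values)
        self.object_store[key] = records       # 1) write the shared object
        for partition, c in coords.items():    # 2) then record the coordinates
            self.metadata[partition].append(c)
            self.next_offset[partition] = c.base_offset + c.count
        self.buffer.clear()
        self.flushes += 1

    def compact(self, partition: str) -> dict:
        """Gather one partition's records and pivot them into columns."""
        rows = []
        for c in self.metadata[partition]:
            blob = self.object_store[c.object_key]
            for i, value in enumerate(blob[c.start:c.start + c.count]):
                rows.append({"offset": c.base_offset + i, **value})
        names = sorted({name for row in rows for name in row})
        return {name: [row.get(name) for row in rows] for name in names}


broker = DisklessBroker()
broker.produce("orders-0", {"id": 1, "amount": 10})
broker.produce("clicks-0", {"id": 7, "url": "/home"})
broker.produce("orders-0", {"id": 2, "amount": 25})
broker.flush()  # in a real broker this would fire on a timer, e.g. every 250 ms
broker.produce("orders-0", {"id": 3, "amount": 5})
broker.flush()
columns = broker.compact("orders-0")
print(columns)
```

The two flushes produce two shared objects, the first of which mixes `orders-0` and `clicks-0` records. Compaction then uses the stored coordinates to pull only the `orders-0` records back out, in offset order, and lays them out column by column.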
The main question with this type of columnar storage/diskless workflow is: how do reads work?
Tail reads are served from cache, just like every other diskless Kafka. The cache is built from the in-memory writes, or from the row-based mixed S3 files before they get a chance to be compacted into an open table format.
Historical data is read from the columnar per-partition Parquet files, which incurs a CPU conversion tax and higher latency. I am inclined to think this isn’t that important, because non-tail reads are rare. They are also unlikely to be very latency-sensitive, given this is a slow diskless topic anyway and the data is old.
Side note - I also wonder if systems that need the historical data could be made to read more efficiently from the Parquet files directly?
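The two read paths above can be sketched as a simple router. This is an illustrative toy under stated assumptions, not how Ursa implements fetches: `PartitionReader` and its fields are hypothetical names, the cache is a dict keyed by offset, and the "Parquet file" is a dict of column lists with an `offset` column in ascending order. For brevity it ignores a fetch that starts in history and runs into the tail.

```python
from dataclasses import dataclass, field


@dataclass
class PartitionReader:
    """Routes a fetch to the cache (tail) or to columnar storage (history)."""
    # Assumption: compacted data as column lists, offsets ascending.
    columnar: dict
    # Recent, not-yet-compacted records keyed by offset (the "cache").
    tail_cache: dict = field(default_factory=dict)

    def fetch(self, offset: int, max_records: int) -> tuple:
        if offset in self.tail_cache:
            # Tail read: records are already row-shaped, no conversion needed.
            offsets = sorted(o for o in self.tail_cache if o >= offset)
            return "cache", [self.tail_cache[o] for o in offsets[:max_records]]
        return "columnar", self._rows_from_columns(offset, max_records)

    def _rows_from_columns(self, offset: int, max_records: int) -> list:
        # The "conversion tax": rebuild row-shaped records from column arrays.
        offsets = self.columnar["offset"]
        picked = [i for i, o in enumerate(offsets) if o >= offset][:max_records]
        names = [n for n in self.columnar if n != "offset"]
        return [{n: self.columnar[n][i] for n in names} for i in picked]


reader = PartitionReader(
    columnar={"offset": [0, 1, 2], "id": [1, 2, 3], "amount": [10, 25, 5]},
    tail_cache={3: {"id": 4, "amount": 8}, 4: {"id": 5, "amount": 12}},
)
tail = reader.fetch(offset=3, max_records=10)
history = reader.fetch(offset=1, max_records=2)
print(tail)
print(history)
```

The tail fetch returns cached records as-is, while the historical fetch has to pivot columns back into records before a Kafka consumer could use them. That pivot is the extra CPU work in question, and skipping it is what a consumer reading the columnar files directly would gain.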
The LakeStream
This is the topic of the post. I take the buzzword to mean “an architecture that treats event streams as a first-class lakehouse primitive”.
Besides StreamNative’s LakeStream buzzword, Ververica calls their platform (based on Flink + Paimon) a StreamHouse. There is a big difference in technical implementation between the two, but I believe the core idea is the same - integrate open table formats with real-time data.
Of course, in 2026 most Kafka vendors offer open table integration too:
- IBM Confluent Cloud has Tableflow - the first one to do it
- Aiven has Iceberg Topics (OSS inside the KIP-405 Tiered Storage plugin, so OSS Kafka can use this too)
- IBM Confluent WarpStream also has Tableflow, but theirs is allegedly a stand-alone product compatible with any Kafka (good idea)
- Streambased ISK offers an Iceberg API translation layer on top of your Kafka data
- AutoMQ has table topics
- Apache/Iceberg has an OSS Iceberg connector
- Tansu has a lake sink
The devil is in the details with regard to each implementation. My preference, all else equal, is one that’s natively built into the product. The only ones that have this are Ursa and Buf (which doesn’t have it anymore).
I really believe the convergence of open table formats & Kafka data is going to be the defining trend in the next few years. You get:
- a) cost-efficient storage (S3)
- b) cost-efficient format (Parquet compresses very well)
- c) very easy ecosystem integration via Iceberg without duplicating the data, without necessarily transforming it, and without the organizational/operational issues of going through Kafka (e.g. not placing load on the brokers)
One thing I find cool is how Databricks’ Zerobus allows users to create “table-first topics”, meaning a regular schematized SQL CREATE TABLE is what creates the stream. It’s thinking query-engine first. I wonder if the future holds something similar for Kafka?
What’s your take? Am I falling for the hype train, or does this look like the new exciting thing in data engineering? After two years of Iceberg, I have begun to lean toward the latter.