r/apachekafka

Kafka Streams at-least-once delivery - How to prevent duplicate calls to non-idempotent services?

Building a Kafka pipeline in K8s. Concerned about duplicate deliveries to non-idempotent downstream services.

My flow:

Kafka Streams → produces to topics → Kafka Connect → destinations

The problem (at-least-once delivery):

1. Kafka Streams processes message
2. Produces to output topic 
3. Kafka Connect writes to MongoDB 
4. Kafka Connect calls backend service API 
5. Pod dies BEFORE offset commit
6. On restart: Kafka redelivers (at-least-once)
7. MongoDB: idempotent upsert (fine)
8. Backend service: Gets called AGAIN (duplicate!)

My question:

With Kafka's at-least-once delivery guarantee, messages can be redelivered on failures.

  • MongoDB/Elasticsearch have idempotent upserts (fine)
  • But I also call other backend services (REST APIs, payment processing, notifications) that are NOT idempotent

How do I prevent duplicate calls to non-idempotent services when Kafka redelivers?

Options I'm considering:

  • A) Outbox pattern with deduplication table?
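For reference, here is a minimal sketch of what the deduplication-table half of option A could look like. Everything in it is an assumption for illustration: the `processed` table, the `record_key` and `call_once` helpers, and SQLite standing in for whatever store the pipeline really uses. The key is built from the output-topic record's coordinates, so it only catches redelivery of the same record. If Kafka Streams itself writes the record twice, the second copy has a new offset and a business-level key would be needed instead.

```python
import sqlite3


def open_dedup_store(path=":memory:"):
    """Create the (hypothetical) dedup table: one row per completed call."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS processed ("
        " idempotency_key TEXT PRIMARY KEY,"
        " status TEXT NOT NULL)"
    )
    return conn


def record_key(topic, partition, offset):
    # The coordinates of a record do not change when it is redelivered.
    return f"{topic}:{partition}:{offset}"


def call_once(conn, key, call_backend):
    """Call the backend unless this key is already marked as done.

    Returns True if the backend was called, False if the call was skipped.
    """
    row = conn.execute(
        "SELECT status FROM processed WHERE idempotency_key = ?", (key,)
    ).fetchone()
    if row is not None and row[0] == "done":
        return False  # redelivered record, already handled
    call_backend(key)  # pass the key along so the backend can dedupe too
    conn.execute(
        "INSERT OR REPLACE INTO processed VALUES (?, 'done')", (key,)
    )
    conn.commit()
    return True
```

Note the remaining gap: a crash after `call_backend` returns but before the commit still produces one duplicate call on restart. Closing that window needs the backend to accept and check the idempotency key itself, which is why the sketch passes the key into the call.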

Requirements:

  • Zero data loss
  • No duplicate API calls to backend services

What's the standard production approach? How do you handle at-least-once delivery with non-idempotent downstream systems?

Is trusting Kafka Streams' built-in reliability enough, or should I add additional safeguards like an outbox pattern?

Looking for real-world experience from folks running Kafka Streams in production Kubernetes environments.

reddit.com
u/Careless_Treacle2713 — 3 days ago

How do you test your integration when the external system isn't ready yet?

I work in a telco company and we regularly work on integrations with external platforms — payment providers, orchestration engines, provisioning systems, partner APIs. Almost always there's parallel development and the other side isn't ready when we are.

For example, our system calls their REST endpoint and then waits for them to publish a Kafka event back to us. To simulate this during development, we use Postman to mock their REST response and then manually produce a Kafka event to our platform to simulate their async callback. Two separate manual steps — no correlation between them and every developer does this locally while QA does the same thing on a shared test environment.

Curious how others handle this. Do you use WireMock, Microcks, something else? Do you write custom stubs? How do you simulate a platform that receives your REST call and automatically fires a Kafka event back — so the whole flow is defined in one place and works the same way for devs locally and for QA on a shared environment?

Does something exist that lets you configure this kind of complex mock once — REST response plus async Kafka callback as a single flow — and share it across the whole team, so nobody is blocked waiting for the external system to be implemented, tested and deployed?
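To make the "single flow" idea concrete, here is a rough sketch using only the Python standard library. It is not a description of how WireMock or Microcks work. The `make_mock_server` function, the `correlationId` field, and the `publish_event` hook are all made up for illustration; in a real setup `publish_event` would wrap a Kafka producer's send call.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


def make_mock_server(publish_event, delay_seconds=0.1, port=0):
    """Build a stub that answers a POST and later fires a callback event.

    publish_event is a hypothetical hook: any callable taking a dict.
    port=0 lets the OS pick a free port.
    """

    class Handler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            request = json.loads(self.rfile.read(length) or b"{}")
            correlation_id = request.get("correlationId", "unknown")

            # Step 1: the synchronous REST response.
            body = json.dumps(
                {"status": "ACCEPTED", "correlationId": correlation_id}
            ).encode()
            self.send_response(202)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

            # Step 2: the async callback, tied to the same correlation id.
            event = {"type": "COMPLETED", "correlationId": correlation_id}
            threading.Timer(delay_seconds, publish_event, args=(event,)).start()

        def log_message(self, *args):
            pass  # keep console output quiet

    return HTTPServer(("127.0.0.1", port), Handler)
```

Because the REST response and the callback event live in one handler, the correlation between them is defined once, and the same stub could run on a laptop or on a shared test environment with only the `publish_event` target changing.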

u/Few_Image7384 — 4 days ago

Community for open source contributors?

Hi everyone, I have been using Kafka at work for quite some time now and was wondering if there's any Slack or Discord community to connect with contributors and maintainers of the project. I have seen that other Apache projects like Airflow have a pretty active Slack community for discussion and guidance for beginners interested in open source contributions. So do we have something similar for Kafka where I can connect and potentially ask questions if I want to contribute to the code?

u/StrikingStand4346 — 6 days ago

If you're consuming from Kafka and writing into ClickHouse, sync inserts at high message rates will hurt you. Async insert mode helps a lot but the buffering and dedupe behavior isn't always obvious.

Wrote this up from my experience building a stream processing pipeline.

Curious how others are handling the Kafka → ClickHouse write path.

u/Marksfik — 8 days ago

The Lakestream as the Convergence of Open Table Formats & Kafka (featuring Ursa)

In the span of two weeks, I had two different podcast guests call Kafka the TCP/IP of messaging and Iceberg the TCP/IP of tables. The idea being that, for all their imperfections, these systems have gathered a large enough network effect and ecosystem build-out that they simply are the easiest and most straightforward thing to adopt when it comes to sharing data (i.e. sharing messages or tables). It’s a coincidence, but I think there is truth there.

In this context, I’m excited to see deeper integration between Kafka and open table formats. I think it makes sense. I was excited when Bufstream (now defunct) came out, namely because of its first-class schema integration/enforcement and the zero-copy Iceberg sink that this easily enables.

The most recent entry in this area has been Ursa-for-Kafka by StreamNative (the Pulsar guys who have pivoted to Kafka too). Ursa-for-Kafka (UFK) is a new proprietary Kafka fork (to be open sourced soon) that makes a few interesting architectural choices:

  • adds an additional storage layer for “Ursa topics” (their name for diskless topics, backed by their Ursa storage engine); this layer persists topics in a columnar open-table format
  • supports different topic types inside the same cluster (fast, classic topics & diskless)
  • is a minimally-invasive fork, which means the regular Kafka classic topic path + tiered storage remain the same. It also means there’s full API support since it’s literally the real Kafka

It’s conceptually similar to Aiven’s Inkless, but seemingly with better open table format support and subtle differences in the diskless architecture: Inkless uses Postgres, while Ursa uses Oxia (a project I found interesting in and of itself) and has separate compaction workers. The great thing these two projects have (alongside Redpanda nowadays) is their different topic profiles: the ability to have a classic, low-latency topic and a cheap diskless topic inside the same cluster serving different workloads.

All else equal, Ursa ought to be a tad more mature because the engine had a year or two head start over Inkless.

The write path works like any other diskless Kafka. As a reminder, in diskless/leaderless Kafka implementations, brokers batch data from many partitions and periodically (e.g. every 250 ms) persist a single file with multi-partition data to S3, alongside each partition’s record coordinates in a metadata store (Oxia here). After a while, these files get “compacted” into read-optimized single-partition files (very similar to Kafka’s regular segment files).
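To illustrate the idea of mixed files plus per-partition coordinates, here is a toy in-memory model. It is only a sketch and not how Ursa, Inkless, or any other implementation actually lays out data: the `flush_batch` and `read_partition` names, the length-prefixed framing, and the plain dicts standing in for S3 and the metadata store are all assumptions.

```python
from collections import defaultdict


def flush_batch(buffered, object_key, metadata_index):
    """Build one mixed object holding records from many partitions.

    buffered: {partition: [record_bytes, ...]} gathered in one flush interval.
    metadata_index: {partition: [(object_key, byte_offset, length, count)]},
    standing in for the metadata store.
    Returns the bytes of the single object that would be uploaded.
    """
    blob = bytearray()
    for partition, records in sorted(buffered.items()):
        start = len(blob)
        for record in records:
            # Length-prefixed framing is an assumption of this sketch.
            blob += len(record).to_bytes(4, "big") + record
        coords = (object_key, start, len(blob) - start, len(records))
        metadata_index[partition].append(coords)
    return bytes(blob)


def read_partition(partition, objects, metadata_index):
    """Reassemble one partition's records by following its coordinates."""
    out = []
    for object_key, start, length, _count in metadata_index[partition]:
        chunk = objects[object_key][start:start + length]
        pos = 0
        while pos < len(chunk):
            size = int.from_bytes(chunk[pos:pos + 4], "big")
            out.append(chunk[pos + 4:pos + 4 + size])
            pos += 4 + size
    return out


# Usage: two flushes, then read one partition back in order.
index = defaultdict(list)
objects = {}
objects["obj-1"] = flush_batch(
    {"orders-0": [b"a", b"bb"], "clicks-3": [b"x"]}, "obj-1", index
)
objects["obj-2"] = flush_batch({"orders-0": [b"ccc"]}, "obj-2", index)
orders = read_partition("orders-0", objects, index)
```

A compaction job in this model would call something like `read_partition` for each partition and rewrite the result as one file per partition, which is the step the mixed files are waiting for.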

In the case of Ursa (and Bufstream, previously), the data is instead compacted into per-partition Parquet files and committed to an Iceberg table.

The main question with this type of columnar storage/diskless workflow is - how do reads work?

Tail reads are served from cache, just like every other diskless Kafka. The cache builds off the in-memory write, or the row-based mixed S3 files before they get a chance to get compacted into an open table format.

Historical data is read from the columnar per-partition Parquet files, which incurs a CPU conversion tax and higher latency. I am inclined to think this isn’t that important, because non-tail reads are rare. They are also unlikely to be very latency sensitive, given this is a slow diskless topic anyway and the data is old.

sidenote - I also wonder if systems that need the historical data may be made to read more optimally from the Parquet itself?

The LakeStream

The topic of this post. I take the buzzword to mean “an architecture that treats event streams as a first-class lakehouse primitive”.

Besides StreamNative’s LakeStream buzzword, Ververica calls their platform (based on Flink + Paimon) a StreamHouse. There is a big technical implementation difference between both, but the core idea I believe is the same - integrate open table formats with real-time data.

Of course, in 2026 most Kafka vendors offer open table integration too:

  • IBM Confluent Cloud has Tableflow - the first one to do it
  • Aiven has Iceberg Topics (OSS inside the KIP-405 Tiered Storage plugin, so OSS Kafka can use this too)
  • IBM Confluent WarpStream also has Tableflow, but theirs is allegedly a stand-alone product compatible with any Kafka (good idea)
  • Streambased ISK offers an Iceberg API translation layer on top of your Kafka data
  • AutoMQ has table topics
  • Apache/Iceberg has an OSS Iceberg connector
  • Tansu has lake sink.

The devil is in the details with regard to each implementation. My preference, all else equal, is one that’s natively built into the product. The only ones that have this are Ursa and Buf (which doesn’t have it anymore).

I really believe the convergence of open table formats & Kafka data is going to be the defining trend in the next few years. You get

  • a) cost-efficient storage (S3)
  • b) cost-efficient format (Parquet compresses very well)
  • c) very easy ecosystem integration via Iceberg without duplicating the data, without necessarily transforming it and without the organizational/operational issues of going through Kafka (e.g. not placing load on the brokers)

One thing I find cool is how Databricks’ Zerobus allows users to create “table-first topics”, meaning a regular schematized SQL CREATE TABLE is what creates the stream. It’s thinking query-engine first. I wonder if the future holds something similar for Kafka?

What’s your take? Am I falling for the hype train, or does this look like the new exciting thing in data engineering? After two years of Iceberg, I have begun to think more the latter.

u/2minutestreaming — 8 days ago

RFC: What do we want to do about AI-generated content on r/apachekafka?

There's been a sharp rise in the quantity of AI-generated content being shared on this sub. This includes blogs, videos, and tools. My concern is that we are frogs in a pan, and the water temperature is approaching boiling without us realising. Low-quality posts dilute visibility of useful content; people stop bothering to even downvote; the sub slowly dies. (Related: AI Slop is Killing Online Communities.)

For some context, visits to this sub are down over the last four months straight, and nearly 50% since October's peak. Perhaps this is unrelated. Perhaps not.

People sharing content built with AI are often not ill-intentioned. They are really excited about the thing they just created. But often it's not actually that novel or useful for the Apache Kafka community, and more of a "hey look what happened when I prompted my AI tool!". Which is cool, but doesn't belong on r/apachekafka.

I'm a member of other subs who have similar challenges, and see various approaches — all with their advantages and disadvantages.

What are the options?

  1. Ban anything AI-created.
  2. Mandate self-disclosure and labelling — poster must include Built with AI flair.
  3. No drive-by link dumping. Contributors must be already active in the sub, and engage with comments on their posts. A first-time post in r/apachekafka may not be sharing one's own content.
  4. Do nothing. Let downvotes do their thing.
  5. Other?

My opinion

Doing nothing is not an option. Downvotes are but a paper cocktail umbrella against a deluge; ineffective at scale. The community will disengage and over time disintegrate.

An outright ban is not the positive engagement environment that we seek to foster on r/apachekafka. StackOverflow tried the absolutist route, and I along with many others simply walked away because it sucks.

My proposal is that we adopt rules 2 and 3 above.

Your opinion?

This post is literally a Request for Comments 😄

Reply with your thoughts and votes for the above options (or other suggestions). I'll summarise next week and review with the rest of the mod team before any further action is taken.

u/rmoff — 10 days ago

Apache Kafka Community Events at Current London

Calling all Apache Kafka users (devs, architects, operators, etc) who are attending Current London

Besides all the amazing talks lined up, we wanted to share two Apache-focused sessions that will give you an opportunity to engage with AK committers, PMC members, and the community at large.

  • The Apache Kafka AMA (Tuesday | 12:30 PM | Expo Hall - Meetup Hub)
    • Come ask tough questions to all the PMC members
  • Office Hours : The Apache Kafka Guildhall (Tuesday | 3:00 PM | Expo Hall - Sponsor Theater)
    • Come and share your Kafka stories with other practitioners and PMC members and learn along the way. This is an open session, not a presentation, located at the Sponsor Theater.
u/LoudCat5895 — 8 days ago

How do you handle SOC 2 / PCI-DSS evidence collection for Kafka?

Genuinely curious how teams here approach this.

For context, I've been spending a lot of time on the audit side of Kafka — SOC 2, PCI-DSS, ISO 27001 — and the recurring pain seems to be:

  1. Inventory: nobody's quite sure how many topics, clusters, or principals exist

  2. ACL audit: someone granted User:* during an incident a year ago and nobody undid it

  3. Inter-broker TLS: enabled on the dev cluster, mysteriously not on prod

  4. Audit logs: enabled, but no retention policy, so the auditor's "who consumed from this topic last quarter" question can't be answered

Some questions I'd love to hear answers to from this community:

- Do you run a pre-audit checklist? If yes, manual or automated?

- How do you prove inter-broker is encrypted, in writing, to an auditor?

- What's your strategy for ACL drift? (Periodic review? Diff against IaC?)

- Has anyone tied control evidence to CI/CD — i.e., the build won't merge if compliance breaks?
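As a starting point for the ACL drift and CI/CD questions, here is a small sketch of the "diff against IaC" idea. It assumes you have already turned both the declared ACLs (from your IaC repo) and the actual ACLs (for example, parsed from `kafka-acls --list` output) into simple tuples. The `Acl` type and the `acl_drift` and `compliance_gate` helpers are hypothetical names, not part of any Kafka tooling.

```python
from typing import NamedTuple


class Acl(NamedTuple):
    principal: str
    resource: str
    operation: str


def acl_drift(declared, actual):
    """Compare ACLs declared in IaC with ACLs found on the cluster."""
    declared, actual = set(declared), set(actual)
    return {
        # On the cluster but not in IaC (e.g. an incident-time grant).
        "unmanaged": sorted(actual - declared),
        # In IaC but not on the cluster.
        "missing": sorted(declared - actual),
        # Wildcard principals deserve a flag even if someone declared them.
        "wildcards": sorted(a for a in actual if a.principal == "User:*"),
    }


def compliance_gate(report):
    """Return True when a CI step should pass."""
    return not report["unmanaged"] and not report["wildcards"]


# Usage: one declared ACL, plus a leftover wildcard grant on the cluster.
declared = [Acl("User:billing", "Topic:payments", "Read")]
actual = declared + [Acl("User:*", "Topic:payments", "Read")]
report = acl_drift(declared, actual)
```

Saving the report as a dated build artifact on each run would also give an auditor written evidence of periodic review, though whether that satisfies a given control is something to confirm with your auditor.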

I work on an open-source project in this space (KafkaGuard) but I'm asking because the *questions* keep coming up identically across teams and I'd like to know what's working in the wild — tools, scripts, processes, anything.

Will share aggregated patterns from the replies if there's enough discussion.

u/jacksparrowon — 13 days ago