r/apachekafka

Building a lightweight Kafka monitoring tool for small teams: would you pay for it?

Been running Kafka in production for a while now and honestly the monitoring situation for small teams sucks. Confluent Control Center is way overkill/expensive, Datadog's Kafka integration is priced like you're a 200-person company, and the open source stuff (AKHQ, Kafdrop, Burrow) works but needs someone to babysit the setup, patch it, and actually understand consumer lag internals to make sense of it. I'm thinking about building a simple hosted tool — just point it at your cluster, get consumer lag alerts, topic health, broker metrics, no Prometheus/Grafana stack to maintain. If you're running Kafka on a small team (like 2-10 devs) — what do you currently use for this? Would you actually pay for something dead simple over self-hosting the OSS stack, or is that a dealbreaker for you? Trying to figure out if this is a real problem or just something that annoys me specifically.

Note: I used AI for corrections.

reddit.com

u/Gloomy-Long-8045 — 3 days ago

▲ 2 r/apachekafka

Does MM2 actually support exactly once semantics?

I have been trying to get a clear answer on whether MM2 supports EOS for cross-cluster replication.

I found KIP-618 (Exactly-once support for source connectors), which was introduced in Kafka 3.3. Since MM2 is a source connector, it should theoretically inherit EOS from it by setting exactly.once.source.support=enabled at the worker level.

However, the official Kafka documentation does not mention anything about MM2 EOS.

So, has anyone successfully used exactly-once with MM2? Has anyone tried this with Strimzi as well?

reddit.com

u/Weekly_Diet2715 — 4 days ago

▲ 57 r/apachekafka+1 crossposts

Interesting Kafka Links - June 2026

rmoff.net

u/rmoff — 7 days ago

▲ 6 r/apachekafka

Why does Kafka allow writes when ISR < min.insync.replicas (with acks=all)?

I’m currently learning Kafka, and while learning about ISR (In-Sync Replicas), acks, and min.insync.replicas, I tried to demonstrate the behavior in a local multi-broker setup.

I observed something that doesn’t match my understanding, so I wanted to ask here.

Setup

3 Kafka brokers running in Docker
Topic config:
- partitions = 3
- replication.factor = 3
- min.insync.replicas = 100

Topic description:

./kafka-topics.sh --describe --topic isr-error --bootstrap-server kafka-broker-one:9092

Output:

Topic: isr-error PartitionCount: 3 ReplicationFactor: 3 Configs: min.insync.replicas=100

Partition: 0 Leader: 3 Replicas: 3,1,2 Isr: 3,1,2
Partition: 1 Leader: 1 Replicas: 1,2,3 Isr: 1,2,3
Partition: 2 Leader: 2 Replicas: 2,3,1 Isr: 2,3,1

Producer command:

./kafka-console-producer.sh \
  --topic isr-error \
  --bootstrap-server localhost:9092 \
  --command-property acks=all \
  --command-property request.timeout.ms=2000 \
  --command-property delivery.timeout.ms=5000 \
  --command-property retries=0

My understanding

From Kafka documentation and this explanation by Jun Rao (Kafka co-founder / Confluent):

Jun Rao explanation of min.insync.replicas

For writes with acks=all, produce requests should succeed only if:

ISR count >= min.insync.replicas

In my case:

ISR = 3
min.insync.replicas = 100

So:

3 >= 100 → false

Based on this, I expected produce requests to fail immediately with NotEnoughReplicasException.

Actual behavior

Producing succeeded while all 3 brokers were alive.
Consumer successfully received the messages.

Only after stopping one broker did produce requests fail with:

org.apache.kafka.common.errors.NotEnoughReplicasException:
Messages are rejected since there are fewer in-sync replicas than required.

Question

Why did Kafka accept produce requests earlier even though ISR (3) was already less than min.insync.replicas (100)?

Why was enforcement triggered only after a broker failure / ISR shrink event?

Am I misunderstanding how min.insync.replicas is enforced, or could this be specific to certain Kafka versions / KRaft / Docker setups?
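The expectation and the observed behaviour described above can be put side by side in a small sketch. `accepts_stated` is the rule exactly as quoted in the post. `accepts_capped` is a hypothetical variant that caps the requirement at the replication factor; it is an assumption that happens to reproduce the observed results, not something the post or the quoted documentation confirms.

```python
def accepts_stated(isr_size: int, min_isr: int) -> bool:
    """Rule as quoted in the post: acks=all succeeds only if ISR >= min.insync.replicas."""
    return isr_size >= min_isr


def accepts_capped(isr_size: int, min_isr: int, replication_factor: int) -> bool:
    """HYPOTHETICAL variant: the requirement is capped at the replication factor."""
    effective_min = min(min_isr, replication_factor)
    return isr_size >= effective_min


if __name__ == "__main__":
    # ISR = 3 (all brokers alive), then ISR = 2 (one broker stopped).
    for isr in (3, 2):
        print(isr, accepts_stated(isr, 100), accepts_capped(isr, 100, 3))
```

Under the stated rule both cases are rejected, which is what the post expected. Under the capped assumption the write is accepted with ISR = 3 and rejected with ISR = 2, which is what the post observed.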

For context:

Kafka version: 4.2.0
Mode: KRaft
Docker image: apache/kafka:latest

u/MrDV6 — 6 days ago

▲ 4 r/apachekafka

Do you debug local Kafka consumer issues by grepping logs manually?

I'm an SWE working remotely, and part of my daily routine is watching Kafka jobs to check that data is flowing well. I've been doing that by going through logs, and it's messy: I wasn't able to tell which consumer is handling which messages. Is it the same for you, or do you have better alternatives?

reddit.com

u/Weak_Wing9818 — 5 days ago

▲ 1 r/apachekafka

[Design Help] Efficient key-based lookup on a large Kafka topic for a background verification workflow

We are building a background workflow where for a given input, we need to find the corresponding message in Kafka and verify some fields on it.

Our Kafka setup:

- compacted topic, 24 partitions, ~200M messages per partition (~2.5B unique keys total)

- ~700 bytes per message, so roughly 1.75TB of data

The lookup pattern is key-based, ~10k/sec, background process so some latency is fine.
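As a quick sanity check on the figures above, the arithmetic can be written out. The constants come from the post; the per-partition lookup rate assumes keys are spread evenly across partitions, which the post does not state.

```python
UNIQUE_KEYS = 2_500_000_000   # ~2.5B unique keys (from the post)
BYTES_PER_MESSAGE = 700       # ~700 bytes per message (from the post)
PARTITIONS = 24
LOOKUPS_PER_SEC = 10_000

total_bytes = UNIQUE_KEYS * BYTES_PER_MESSAGE
total_tb = total_bytes / 1e12  # decimal terabytes

# ASSUMPTION: lookups are uniformly distributed over partitions.
lookups_per_partition = LOOKUPS_PER_SEC / PARTITIONS

print(f"{total_tb:.2f} TB of latest-value data")
print(f"{lookups_per_partition:.0f} lookups/sec per partition")
```

The 1.75TB figure corresponds to one 700-byte value per unique key; a local store that keeps only the fields needed for verification would be smaller by whatever fraction of the message is dropped.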

We do have a way to derive the partition from the key and an API to get the offset, so seek+fetch is technically possible — but our Kafka brokers are a shared resource across teams and we don't want to hammer them with random-access reads at this scale.

How would you build the lookup layer here? What would you use, how would you keep it in sync with the topic, and anything to watch out for at this scale?

For context, we're leaning towards RocksDB — consuming the topic, storing only the fields we need for verification, and using Protobuf to keep it compact. But curious if there are better approaches or gotchas we are not thinking about.
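A minimal sketch of the sync logic such a local store would need, assuming records are applied in offset order per partition. The `LocalLookupStore` class and its in-memory dicts are hypothetical stand-ins for an embedded store such as RocksDB; no real RocksDB or Kafka client API is used here.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class LocalLookupStore:
    """In-memory stand-in (hypothetical) for an embedded key-value store."""
    values: Dict[bytes, bytes] = field(default_factory=dict)
    next_offset: Dict[int, int] = field(default_factory=dict)  # per partition

    def apply(self, partition: int, offset: int, key: bytes,
              value: Optional[bytes]) -> None:
        # Skip records already applied, so a replay after restart is idempotent.
        if offset < self.next_offset.get(partition, 0):
            return
        if value is None:
            # A null value is a tombstone on a compacted topic: delete the key.
            self.values.pop(key, None)
        else:
            # Latest value wins, matching compaction semantics.
            self.values[key] = value
        self.next_offset[partition] = offset + 1

    def get(self, key: bytes) -> Optional[bytes]:
        return self.values.get(key)


store = LocalLookupStore()
store.apply(0, 0, b"k1", b"v1")
store.apply(0, 1, b"k1", b"v2")     # newer value replaces the older one
store.apply(0, 1, b"k1", b"stale")  # replayed offset, ignored
store.apply(0, 2, b"k2", b"x")
store.apply(0, 3, b"k2", None)      # tombstone removes k2
```

In a persistent store, the value and the per-partition offset would likely need to be written atomically, so that a crash cannot leave the offset ahead of the data or the other way around.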

reddit.com

u/Initial-Wishbone8884 — 7 days ago

▲ 0 r/apachekafka

How do you handle robust ingestion in your orgs?

Our product needs to scan cloud assets (e.g. from an AWS account) and produce insights after all assets have been saved to our storage.

Currently we scan the account and send every result to Kafka, which is in turn consumed by an S3 sink that writes the messages to S3.

The reason we do this is to allow a "fire and forget" ingestion architecture: once the message reaches Kafka, we don't need to worry about it anymore.

The problem is that it's not really working for us. Pods can hit OOM issues and retry messages forever (auto commit = false), so we had to set it to true. Now we need an external state store that counts how many times a message was acked, so we know when to send it to the DLQ.
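The retry-counting idea above could look roughly like this. It is a sketch under assumptions: `RetryBudget`, `MAX_ATTEMPTS`, and the in-memory `attempts` dict are hypothetical, and in a real deployment the counter would have to live in the external store so that it survives a pod being OOM-killed.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

MAX_ATTEMPTS = 3  # hypothetical retry budget


class RetryBudget:
    def __init__(self, max_attempts: int = MAX_ATTEMPTS) -> None:
        self.max_attempts = max_attempts
        # Stand-in for the external state store, keyed by (partition, offset).
        self.attempts: Dict[Tuple[int, int], int] = defaultdict(int)
        self.dead_letters: List[Tuple[int, int, bytes]] = []

    def handle(self, partition: int, offset: int, payload: bytes,
               process: Callable[[bytes], None]) -> bool:
        """Return True when the offset may be committed (processed or dead-lettered)."""
        key = (partition, offset)
        # Count the attempt BEFORE processing, so a crash mid-processing still counts.
        self.attempts[key] += 1
        if self.attempts[key] > self.max_attempts:
            self.dead_letters.append((partition, offset, payload))
            del self.attempts[key]
            return True
        try:
            process(payload)
        except Exception:
            return False  # do not commit; the message will be redelivered
        del self.attempts[key]
        return True
```

With a budget of 3, a message that always fails is attempted three times and moved to the dead-letter list on the fourth delivery, at which point its offset can be committed.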

We're also autoscaling our pods in response to Kafka messages, which has also caused all sorts of issues in the past.

To me this seems like massive overkill for an ingestion pipeline, hence the title: how do you design a robust ingestion pipeline?

Happy to answer more questions

reddit.com

u/Classic_Ad5341 — 8 days ago

▲ 10 r/apachekafka+1 crossposts

Monedula Kafka Simulator

What happens in Apache Kafka during a split brain? What if you run an IBM Confluent stretched 2.5-DC architecture?

We created a Kafka Simulator in which you can simulate failures and check how different settings affect the cluster. The first release focuses on a single-DC setup and includes 13 built-in, step-by-step learning scenarios.

Blogpost describing current release: https://monedula.dev/blog/kafka-simulator-learn-kafka-by-breaking-it
Simulator: https://monedula.dev/kafka-simulator/

u/mmatloka — 10 days ago

▲ 37 r/apachekafka+10 crossposts

Do you actually need Kafka between your OTel collector and ClickHouse?

Kafka → ClickHouse is the default pattern for OTel pipelines, and for org-wide streaming with replay and many consumers it's a great fit. But for a lot of single-sink observability setups, it's a cluster you're babysitting for no reason.

This post compares where the Kafka layer does real work vs. where you can drop it. It also checks what processing the Collector can or can't do alone (stateful dedup, enrichment-conditional filtering, dynamic sampling, etc.)
https://www.glassflow.dev/blog/opentelemetry-to-clickhouse-do-you-need-kafka?utm_source=reddit&utm_medium=socialmedia&utm_campaign=reddit_organic

Curious what others run:

Kafka buffer,
straight from the collector, or
a lighter processor in between

Leave your comments below, I'd like to discuss the options and understand what folks are using these days!

glassflow.dev

u/Marksfik — 14 days ago

▲ 3 r/apachekafka

Distributed transaction mishaps

Hey Everyone,

@Transactional doesn't cover Kafka. Most code assumes it does.
The DB write rolls back fine. The Kafka publish doesn't know the transaction exists — and a successful commit is no guarantee it ever gets sent.

Wrote an article explaining this common misconception and giving food for thought on how to deal with it

medium.com

u/PickleIndividual1073 — 11 days ago