u/Marksfik

Deep dive into Denormalization in ClickHouse: When to use it vs. Joins

Deep dive into Denormalization in ClickHouse: When to use it vs. Joins

Hey everyone,

ClickHouse is incredibly fast, but how you structure your data still makes a massive difference as scale grows. While ClickHouse has made huge strides in handling JOIN operations recently, denormalization is still one of the go-to strategies for squeezing out maximum query performance.

Here's a breakdown analyzing the exact tradeoffs of denormalizing data in ClickHouse, specifically looking at query speeds, storage overhead, and how to handle updates when your flat tables need to change.

How you folks handle this in production: do you lean heavily into flattening your schemas upfront, or are you relying more on Dictionary lookups and standard Joins these days?

Link to the full breakdown: https://www.glassflow.dev/blog/denormalization-clickhouse?utm_source=reddit&utm_medium=socialmedia&utm_campaign=reddit_organic

glassflow.dev
u/Marksfik — 1 day ago
▲ 10 r/OpenTelemetry+1 crossposts

We've been running ClickHouse as our telemetry backend and kept hitting the same wall: getting data in reliably is harder than it looks.

The naive path (OTel Collector → ClickHouse directly) breaks down fast in practice :

  • Collector retries cause duplicate spans
  • PII ends up in trace attributes before storage
  • ClickHouse gets hammered by small, frequent inserts instead of efficient batches
  • No good to apply tail sampling + schema normalization in the same pass

So I put together a working setup using a streaming layer between the collector and ClickHouse. The architecture ended up being:

TelemetryGen → OTel Collector (tail sampling) → GlassFlow → ClickHouse → HyperDX

The collector handles sampling. The streaming layer handles span dedup on a 1-hour window, PII masking before storage, schema mapping to ClickHouse-native types, and batched inserts.

The full guide covers the Kubernetes setup, Helm values, OTel Collector config, pipeline definition, and ClickHouse DDL.

Happy to answer questions about the collector config or the dedup approach... that part took a few iterations to get right.

u/Marksfik — 11 days ago
▲ 13 r/apachekafka+2 crossposts

If you're consuming from Kafka and writing into ClickHouse, sync inserts at high message rates will hurt you. Async insert mode helps a lot but the buffering and dedupe behavior isn't always obvious.

Wrote this up from our my experience building a stream processing pipeline.

Curious how others are handling the Kafka → ClickHouse write path.

u/Marksfik — 8 days ago
▲ 9 r/Clickhouse+2 crossposts

ClickHouse is a beast for observability, but dumping raw, un-enriched OTel data into it can lead to massive storage costs and messy queries. We just launched native OTLP connection for GlassFlow that moves that processing upstream when it comes to enriching and filtering OTel spans before they hit the table.

The goal is to keep the dashboards fast without the overhead of massive background merges or complex SQL views. Check out the setup we’re using for enriched OTel pipelines. What’s your biggest bottleneck when querying raw OTel data in ClickHouse? 🤔

u/Marksfik — 18 days ago